Efficient Robot Skills Learning with Weighted Near-Optimal Experiences Policy Optimization

Bibliographic Details
Main Authors: Liwei Hou, Hengsheng Wang, Haoran Zou, Qun Wang
Format: Article
Language: English
Published: MDPI AG, 2021-01-01
Series: Applied Sciences
Subjects: robot skills learning; policy learning; policy gradient; experience; data efficiency
Online Access: https://www.mdpi.com/2076-3417/11/3/1131
author Liwei Hou
Hengsheng Wang
Haoran Zou
Qun Wang
collection DOAJ
description Autonomous learning of robotic skills seems more natural and more practical than hand-engineering skills, analogous to the learning process of a human individual. Policy gradient methods are a class of reinforcement learning techniques with great potential for solving robot skill learning problems. However, policy gradient methods require many episodes of online interaction between the robot and the environment to learn a good policy, which means a less efficient learning process and a higher likelihood of damage to both the robot and the environment. In this paper, we propose a two-phase (imitation phase and practice phase) framework for efficient learning of robot walking skills that attends to the quality of the learned skill and to sample efficiency at the same time. Training starts with the first stage, the imitation phase, in which the parameters of the policy network are updated in a supervised learning manner. The training set for this phase consists of trajectories produced by an iterative linear Gaussian controller; this paper refers to these trajectories as near-optimal experiences. In the second stage, the practice phase, the experiences for policy network learning are collected directly from online interaction, and the policy network parameters are updated with model-free reinforcement learning. Experiences from both stages are stored in a weighted replay buffer and ordered according to the experience scoring algorithm proposed in this paper. The proposed framework is tested on a biped robot walking task in a MATLAB simulation environment. The results show that the sample efficiency of the proposed framework is much higher than that of ordinary policy gradient algorithms: the proposed algorithm achieved the highest cumulative reward, and the robot autonomously learned better walking skills. In addition, the weighted replay buffer can serve as a general module for other model-free reinforcement learning algorithms. Our framework provides a new way to combine model-based and model-free reinforcement learning to efficiently update the policy network parameters during robot skill learning.
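The abstract presents the weighted replay buffer as a module reusable by other model-free reinforcement learning algorithms. Below is a minimal Python sketch of such a buffer under stated assumptions, not the paper's implementation: the scoring rule (cumulative reward plus a bonus for near-optimal, imitation-phase trajectories), the score-proportional sampling, and all names and parameters are hypothetical stand-ins, since the abstract does not give the experience scoring algorithm.

```python
import heapq
import itertools
import random

# Sketch of a weighted replay buffer with score-ordered eviction.
# The scoring rule is a hypothetical stand-in for the paper's
# experience scoring algorithm, which the abstract does not specify.
class WeightedReplayBuffer:
    def __init__(self, capacity=10_000, near_optimal_bonus=1.0):
        self.capacity = capacity
        self.near_optimal_bonus = near_optimal_bonus
        self._heap = []                     # min-heap of (score, tiebreak, trajectory)
        self._tiebreak = itertools.count()  # keeps heapq from comparing trajectories

    def _score(self, trajectory, near_optimal):
        # trajectory: list of (state, action, reward, next_state) tuples
        total_reward = sum(step[2] for step in trajectory)
        return total_reward + (self.near_optimal_bonus if near_optimal else 0.0)

    def add(self, trajectory, near_optimal=False):
        heapq.heappush(self._heap, (self._score(trajectory, near_optimal),
                                    next(self._tiebreak), trajectory))
        if len(self._heap) > self.capacity:
            heapq.heappop(self._heap)       # evict the lowest-scored trajectory

    def sample(self, batch_size):
        # Sample with probability proportional to positive-shifted score,
        # so higher-quality experiences are replayed more often.
        lowest = min(s for s, _, _ in self._heap)
        weights = [s - lowest + 1e-6 for s, _, _ in self._heap]
        picked = random.choices(self._heap, weights=weights, k=batch_size)
        return [traj for _, _, traj in picked]

# Two-phase usage: imitation-phase trajectories from the iterative linear
# Gaussian controller enter with near_optimal=True; practice-phase rollouts
# collected online do not.
buffer = WeightedReplayBuffer()
buffer.add([((0.0,), (0.1,), 1.0, (0.1,))], near_optimal=True)  # toy one-step trajectory
buffer.add([((0.1,), (0.2,), 0.5, (0.2,))])
batch = buffer.sample(batch_size=2)
```

Keeping the buffer as a min-heap makes eviction of the lowest-scored trajectory O(log n), which matches the abstract's description of experiences being kept in score order; any real implementation would substitute the paper's own scoring and sampling rules.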
first_indexed 2024-03-09T03:39:20Z
format Article
id doaj.art-547d8d8b777a46168cdcdc407c17611b
institution Directory Open Access Journal
issn 2076-3417
language English
last_indexed 2024-03-09T03:39:20Z
publishDate 2021-01-01
publisher MDPI AG
record_format Article
series Applied Sciences
spelling Applied Sciences 11(3):1131, 2021-01-01, MDPI AG, ISSN 2076-3417, DOI 10.3390/app11031131. Authors Liwei Hou, Hengsheng Wang, Haoran Zou, and Qun Wang are all affiliated with the College of Mechanical and Electrical Engineering, Central South University, Changsha 410083, China.
title Efficient Robot Skills Learning with Weighted Near-Optimal Experiences Policy Optimization
topic robot skills learning
policy learning
policy gradient
experience
data efficiency
url https://www.mdpi.com/2076-3417/11/3/1131