Efficient Robot Skills Learning with Weighted Near-Optimal Experiences Policy Optimization

by Liwei Hou 1, Hengsheng Wang 1,2,*, Haoran Zou 1 and Qun Wang 1,3

1 College of Mechanical and Electrical Engineering, Central South University, Changsha 410083, China
2 State Key Laboratory for High Performance Complex Manufacturing, Central South University, Changsha 410083, China
3 Modern Engineering Training Center, Hunan University, Changsha 410082, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2021, 11(3), 1131; https://doi.org/10.3390/app11031131
Received: 26 November 2020 / Revised: 8 January 2021 / Accepted: 18 January 2021 / Published: 26 January 2021
(This article belongs to the Section Robotics and Automation)
Autonomous learning of robot skills is more natural and more practical than hand-engineered skills, analogous to the learning process of human individuals. Policy gradient methods are a class of reinforcement learning techniques with great potential for solving robot skills learning problems. However, they require many online interactions between the robot and the environment to learn a good policy, which lowers the efficiency of the learning process and raises the likelihood of damage to both the robot and the environment. In this paper, we propose a two-phase framework (an imitation phase followed by a practice phase) for the efficient learning of robot walking skills, attending to both the quality of the learned skill and sample efficiency. Training begins with the first stage, the imitation phase, in which the parameters of the policy network are updated in a supervised learning manner. The training set for this phase consists of experienced trajectories output by an iterative linear Gaussian controller; this paper refers to these trajectories as near-optimal experiences. In the second stage, the practice phase, the experiences for policy network learning are collected directly from online interactions, and the policy network parameters are updated with model-free reinforcement learning. The experiences from both stages are stored in a weighted replay buffer, where they are ranked according to the experience scoring algorithm proposed in this paper. The proposed framework is tested on a biped robot walking task in a MATLAB simulation environment. The results show that the sample efficiency of the proposed framework is much higher than that of ordinary policy gradient algorithms: the proposed algorithm achieved the highest cumulative reward, and the robot autonomously learned better walking skills.
In addition, the weighted replay buffer can serve as a general-purpose module for other model-free reinforcement learning algorithms. Our framework provides a new way to combine model-based and model-free reinforcement learning to efficiently update the policy network parameters during robot skills learning.
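To illustrate the general idea of a weighted replay buffer, the sketch below keeps experiences ranked by a score and samples them with probability proportional to that score, so higher-scoring (near-optimal) experiences are replayed more often. This is a minimal illustrative sketch only: the class name, capacity policy, and sampling scheme are assumptions for exposition, and the paper's actual experience scoring algorithm is not reproduced here.

```python
import random

class WeightedReplayBuffer:
    """Illustrative replay buffer that ranks experiences by a score.

    The eviction and sampling policies below are assumptions for the
    sketch, not the paper's exact algorithm.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = []  # list of (score, experience) pairs

    def add(self, experience, score):
        self.buffer.append((score, experience))
        # Keep the buffer ordered by score; evict the lowest-scoring
        # experiences once the capacity is exceeded.
        self.buffer.sort(key=lambda pair: pair[0], reverse=True)
        if len(self.buffer) > self.capacity:
            self.buffer = self.buffer[: self.capacity]

    def sample(self, batch_size):
        # Sample with probability proportional to score, so
        # near-optimal experiences are replayed more often.
        scores = [max(s, 1e-8) for s, _ in self.buffer]
        total = sum(scores)
        weights = [s / total for s in scores]
        batch = random.choices(self.buffer, weights=weights, k=batch_size)
        return [exp for _, exp in batch]
```

Because the buffer only assumes experiences carry a scalar score, it can be dropped into other model-free reinforcement learning loops in place of a uniform replay buffer, as the abstract suggests.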
Keywords: robot skills learning; policy learning; policy gradient; experience; data efficiency
MDPI and ACS Style

Hou, L.; Wang, H.; Zou, H.; Wang, Q. Efficient Robot Skills Learning with Weighted Near-Optimal Experiences Policy Optimization. Appl. Sci. 2021, 11, 1131. https://doi.org/10.3390/app11031131
