Efficient Robot Skills Learning with Weighted Near-Optimal Experiences Policy Optimization

Abstract: Autonomous learning of robotic skills seems more natural and more practical than engineered skills, analogous to the learning process of human individuals. Policy gradient methods are a type of reinforcement learning technique with great potential for solving robot skills learning problems. However, policy gradient methods require a large number of online interactions with the environment to learn a good policy, which lowers the efficiency of the learning process and raises the likelihood of damage to both the robot and the environment. In this paper, we propose a two-phase (imitation phase and practice phase) framework for efficient learning of robot walking skills, in which we pay attention to both the quality of skill learning and sample efficiency. Training starts with what we call the first stage, or the imitation phase, of learning, updating the parameters of the policy network in a supervised learning manner. The training set used in the policy network learning is composed of the experienced trajectories output by the iterative linear Gaussian controller. This paper also refers to these trajectories as near-optimal experiences. In the second stage, or the practice phase, the experiences for policy network learning are collected directly from online interactions, and the policy network parameters are updated with model-free reinforcement learning. The experiences from both stages are stored in the weighted replay buffer, and they are ordered according to the experience scoring algorithm proposed in this paper. The proposed framework is tested on a biped robot walking task in a MATLAB simulation environment. The results show that the sample efficiency of the proposed framework is much higher than that of ordinary policy gradient algorithms. The algorithm proposed in this paper achieved the highest cumulative reward, and the robot learned better walking skills autonomously.
In addition, the weighted replay buffer can serve as a general-purpose module for other model-free reinforcement learning algorithms. Our framework provides a new way to combine model-based reinforcement learning with model-free reinforcement learning to efficiently update the policy network parameters in the process of robot skills learning.


Introduction
Dynamic modeling and control techniques have been widely used to develop robot skills such as walking [1,2] and grasping [3,4], but robots' ability to adapt to the uncertainty of real-life environments is still insufficient. Making a robot learn a new skill depends largely on expert knowledge, and the resulting skill tends to deteriorate under environmental disturbances. In recent years, many studies have reported robots autonomously understanding tasks and learning skills without expert knowledge. To this end, reinforcement learning, which provides a mathematical formulation for learning-based control, has received much attention, making autonomous robot skills learning possible [5].

Figure 1. Algorithm framework of weighted near-optimal experiences policy optimization, built on guided policy search (GPS). WNEPO replaces the replay buffer in GPS with the weighted replay buffer to select experiences of high quality. When the performance of the policy network is close to that of the linear Gaussian controller, the policy network interacts with the environment directly. In WNEPO, the blue dotted line in the figure is not executed.
To solve the problems that GPS encounters, weighted near-optimal experiences policy optimization (WNEPO) is proposed in this paper. The algorithm framework of WNEPO is shown in Figure 1. In WNEPO, the update of the policy network is divided into two stages. In the first stage, the optimization objective adopted by GPS is used to update the policy network, and the near-optimal experiences generated by the linear Gaussian controller are used as training samples; in the second stage, the policy network interacts with the environment directly and is trained by the policy gradient method. The first stage of learning is supervised (more precisely, imitation learning [20]), and the second stage is model-free RL. Compared with GPS, WNEPO has two improvements: (1) it replaces the traditional replay buffer with a weighted replay buffer; (2) after approaching the performance of the linear Gaussian controller, the policy network not only learns from the near-optimal experiences but also uses data from direct interaction with the environment to improve its performance. The pros and cons of the RL algorithms mentioned above are summarized in Table 1.

Table 1. Advantages and disadvantages of the RL methods discussed above.

RL Methods     | Advantages                                                                     | Disadvantages
Model-free RL  | No need for prior knowledge; strong asymptotic performance                     | Slow convergence speed; high risk of damage to robot and environment
Model-based RL | Less online interaction, safer for robot and environment; fast convergence     | Depends on explicit models; poor asymptotic performance
GPS            | All the above advantages                                                       | May never explore the optimal space; cannot be updated after being deployed
WNEPO          | Same as GPS                                                                    | Not all of the above, but an additional component is required

The goal of this paper is to introduce our WNEPO method by training a biped robot to walk; the whole work is performed in a MATLAB simulation environment. The paper is organized as follows: Section 1 discusses the shortcomings of existing algorithms in solving the robot skill learning problem and describes the overall idea; Section 2 provides a brief introduction to the underlying theory; Section 3 introduces the proposed algorithm framework in detail; Section 4 describes the experimental verification and analysis; Section 5 summarizes the whole paper.

Preliminary
In this paper, WNEPO is introduced in the context of a biped robot walking task because the state and action dimensions of this task are high and the system dynamics are nonlinear [21], which places higher demands on the learning algorithm. Scholars have proposed gait planning methods based on different principles, such as mathematical models [22], imitation of human walking characteristics [23] and central pattern generators [24]. All of the above methods need accurate modeling and generalize poorly. The RL-based method proposed in this paper can provide some ideas for solving these problems.
Our goal is to have the robot learn to walk in a straight line using minimal control effort without any prior information. As a sequential decision-making problem, this walking task can be modeled as a Markov decision process, where at any moment t, the agent in state s_t selects action a_t with probability π_θ(a_t|s_t), causing the environment to enter a new state s_{t+1} with state transition probability p(s_{t+1}|s_t, a_t), and then receives an instantaneous reward r_t [25]. In this paper, we use u and x instead of a and s as the control variable and system state, respectively, to be consistent with the conventions in optimal control.
A trajectory τ = {x_0, u_0, . . . , x_t, u_t, x_{t+1}, . . .} is a sequence of state-action pairs (x_t, u_t) along the timeline from x_0 until the biped robot enters a terminal state. The quality of robot walking can be evaluated by calculating the reward R(τ) of trajectory τ. The reward function used in this paper is inspired by reference [25]. The compound probability of a possible trajectory τ can be expressed as

p(τ) = p(x_0) ∏_t π_θ(u_t|x_t) p(x_{t+1}|x_t, u_t),

where π_θ(u_t|x_t) is the policy (or probability) of choosing action u_t in state x_t under the parameter θ. A real trajectory of robot interaction with the environment τ is an experience, or sampling trajectory, which is stored in the replay buffer for policy learning.
The goal of reinforcement learning is to optimize the parameter θ of the policy π_θ(u_t|x_t) so as to maximize the expectation of the cumulative reward, which is the optimization problem of Equation (1) [26]:

J_π(θ) = E_{τ∼π_θ} [ Σ_t γ^t r_t ],   (1)

where r_t is the reward at time t, and the discount factor γ is a real value in [0, 1].
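As a quick illustration of Equation (1), the cumulative discounted reward of one sampled trajectory can be computed directly from its reward sequence. The sketch below is a minimal Python helper with illustrative reward values, not part of the paper's implementation.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Cumulative discounted reward R(tau) = sum_t gamma^t * r_t."""
    discounts = gamma ** np.arange(len(rewards))  # 1, gamma, gamma^2, ...
    return float(np.sum(discounts * np.asarray(rewards)))

# Toy reward sequence from one walking episode (hypothetical values).
R = discounted_return([1.0, 1.0, 1.0], gamma=0.5)  # 1 + 0.5 + 0.25 = 1.75
```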
The remainder of this section focuses on the policy gradient approach, which is widely used in model-free RL and is well suited to high-dimensional robotic problems.
The policy gradient method estimates the gradient of J_π(θ) and then updates the parameters θ with mini-batch gradient ascent. Because the policy and the environment dynamics are independent of each other, the policy gradient is:

∇_θ J_π(θ) = E_{τ∼π_θ} [ Σ_t ∇_θ log π_θ(u_t|x_t) R(τ) ] ≈ (1/N) Σ_{n=1}^{N} Σ_t ∇_θ log π_θ(u_t^n|x_t^n) R(τ^n).
Then, we can update θ iteratively by calculating ∇_θ J_π from (u_t^n, x_t^n, R(τ^n)) tuples, each of which is called a minimum decision unit (MDU) in this paper. Gradient ascent is used to update θ:

θ ← θ + α ∇_θ J_π(θ).

In practice, small changes in θ can lead to dramatic changes in π_θ(u_t|x_t). To achieve a stable improvement in the performance of the policy, it is necessary to limit the Kullback-Leibler (KL) divergence between the policies before and after the update to a certain threshold. The corresponding optimization goal is:

max_θ E_{x_t, u_t} [ (π_θ(u_t|x_t) / π_θ̄(u_t|x_t)) A(x_t, u_t) ]  s.t.  E_{x_t} [ D_KL( π_θ̄(·|x_t) ‖ π_θ(·|x_t) ) ] ≤ δ,

where π_θ̄(·|x_t) is the pre-update policy and A(x_t, u_t) is the advantage function. Based on this idea, methods such as Natural Policy Gradient (NPG), Trust Region Policy Optimization (TRPO) [27] and Proximal Policy Optimization (PPO) [28] have emerged. Among them, the PPO algorithm simplifies the optimization problem containing the KL divergence constraint, making the learning process of the policy more stable.
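The MDU-based update can be sketched in a few lines. The example below performs one REINFORCE-style gradient-ascent step for a deliberately simplified linear Gaussian policy u ~ N(θᵀx, σ²), not the paper's neural network; the function name, learning rate and trajectory values are all illustrative.

```python
import numpy as np

def reinforce_step(theta, trajectory, lr=0.01, sigma=1.0):
    """One gradient-ascent step of the policy gradient for a linear Gaussian
    policy u ~ N(theta @ x, sigma^2). `trajectory` is a list of MDU tuples
    (u, x, R) where R is the trajectory's cumulative reward R(tau)."""
    grad = np.zeros_like(theta)
    for u, x, R in trajectory:
        # gradient of log N(u; theta@x, sigma^2) w.r.t. theta, scaled by R(tau)
        grad += ((u - theta @ x) / sigma**2) * x * R
    return theta + lr * grad / len(trajectory)
```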

Methods
Our goal was to train a policy network and use it to control the robot to realize the walking skill. The WNEPO proposed in this paper can improve the sample efficiency for robot skill learning, as shown in Figure 1. In this section, we will introduce WNEPO in detail.
To achieve end-to-end control of the robot, the policy π_θ(u_t|x_t) is expressed as:

π_θ(u_t|x_t) = N( μ^π(x_t), Σ^π(x_t) ).

We use a neural network to model π_θ(u_t|x_t); μ^π(x_t) and Σ^π(x_t) are the outputs of the neural network, representing the mean and covariance of a normal distribution. That is, when the robot is in state x_t, the action u_t output by the policy network obeys this normal distribution. The state x_t of the robot can be measured directly by sensors, and this paper considers the sensor observation o_t = x_t; the action u_t ∈ R^E is the torque applied to the joints.
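A minimal sketch of such a Gaussian policy is given below, assuming an affine mean and a state-independent diagonal covariance (a simplification of the paper's deeper network); all names and shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def policy(x, W_mu, b_mu, log_std):
    """Gaussian policy pi_theta(u|x) = N(mu(x), Sigma). Here the mean is a
    single affine layer and the covariance is diagonal and state-independent;
    the paper uses a multi-layer network for the mean."""
    mu = W_mu @ x + b_mu                           # action mean mu_pi(x)
    std = np.exp(log_std)                          # Sigma = diag(std^2)
    u = mu + std * rng.standard_normal(mu.shape)   # sampled joint torques
    return u, mu, std
```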
Since the policy π_θ(u_t|x_t) is modeled with a neural network, it is usually necessary to construct a large-scale training set first to update the parameters θ. However, when the environment dynamics are unknown, the collected sample size is small and insufficient to train a neural network to good performance. To address this issue, we divided the optimization of the policy network into two stages: the imitation phase and the practice phase. In the imitation phase, the policy network learns in a supervised manner from the near-optimal trajectories generated by optimizing the linear Gaussian controller. In this way, a policy network with good performance can be trained from a small sample. When the performance of the policy network is close to that of the linear Gaussian controller, the training process switches to the next phase. In the practice phase, the network interacts with the environment in a self-exploratory manner. With the help of the weighted replay buffer, the update of the policy network is more stable.
In both phases above, a mini-batch of MDUs is sampled randomly from the weighted replay buffer to train the network. The weighted replay buffer plays a key role in the fast and stable learning of the policy network. The following first discusses the idea of the weighted replay buffer; then, the details of policy network learning in the imitation phase and the practice phase are analyzed respectively.

Experience Scoring Algorithm
The original GPS and PPO use random experiences in the replay buffer and discard the old memory according to the first-in-first-out order. This strategy of storing and updating historical data has two major disadvantages: (1) it does not distinguish the quality of the memory experience and uses it indiscriminately, which results in the low efficiency of the replay buffer; (2) discarding memory experiences according to the time sequence and losing the early high-value memory experiences may cause instability of the policy network learning process.
In order to resolve these problems, the experience scoring algorithm is proposed. The experience scoring algorithm evaluates the quality of the experience so as to make more effective use of the trajectories in the replay buffer. For a trajectory τ, we evaluate its quality from the following three aspects:

1. Cumulative discounted reward R(τ). The ultimate goal of reinforcement learning is to obtain the maximum expected cumulative reward, so it is intuitive to use R(τ) as an indicator of the quality of the experience data. For one trajectory, the greater the final cumulative reward, the better the overall performance of the episode and the more valuable it is to learn from.
2. The variance var_τ(r) of all single-step rewards. If a single-step reward is much larger than the average, it guides the network update more effectively from the positive direction; if it is much smaller than the average, it guides the update more effectively from the opposite direction. Reward information close to the average is less informative for network updates. By analogy, people accumulate more life experience from great success or great frustration. However, experiences with too large a var_τ(r) may lead to more radical network updates and increase the instability of training.
3. Episode length T_τ. There is a correlation between T_τ and R(τ), but not a strictly positive one. For example, an episode in which the single-step reward is always low but which lasts for a long time has a large T_τ and a small R(τ); such a trajectory is not considered valuable despite its large T_τ.
The quality W of a trajectory τ can be calculated as the weighted sum of the three evaluation indicators above:

W(τ) = α_1 R(τ) + α_2 var_τ(r) + α_3 T_τ,

where α_1, α_2 and α_3 are parameters that need to be tuned for the task. Figure 2 shows how the weighted replay buffer works based on the experience scoring algorithm. The left side of Figure 2 shows the update process of the traditional experience replay buffer, which stores and discards data like a queue, following the first-in-first-out principle. The right side of Figure 2 describes the updating process of the weighted replay buffer. In the weighted replay buffer, the trajectory is the smallest unit to be discarded, rather than the MDU as in the traditional replay buffer. More importantly, the data with the lowest score are discarded in the weighted replay buffer rather than the data stored first. Besides, the length of the traditional replay buffer is fixed, while the length of the weighted replay buffer is variable. When a new experience of length l is stored in a full traditional replay buffer, the first l MDUs are discarded; if the new trajectory is instead put into a full weighted replay buffer, the complete trajectory with the lowest score is discarded.
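The scoring rule and the eviction behavior of the weighted replay buffer can be sketched as follows. `score` and `store` are hypothetical helper names, and the α values are the ones reported later in the experiments.

```python
import numpy as np

def score(trajectory_rewards, gamma=0.99, alphas=(0.4, 0.5, 0.1)):
    """Quality W(tau) = a1*R(tau) + a2*var(r) + a3*T of one trajectory,
    given its per-step reward sequence."""
    r = np.asarray(trajectory_rewards, dtype=float)
    R = float(np.sum(gamma ** np.arange(len(r)) * r))     # R(tau)
    return alphas[0] * R + alphas[1] * r.var() + alphas[2] * len(r)

def store(buffer, new_traj, capacity, **kw):
    """Insert a trajectory; while over capacity (counted in MDUs), discard
    the COMPLETE trajectory with the lowest score, not the oldest MDUs."""
    buffer.append(new_traj)
    while sum(len(t) for t in buffer) > capacity:
        buffer.remove(min(buffer, key=lambda t: score(t, **kw)))
    return buffer
```

Note the contrast with a queue: insertion order plays no role in eviction, only the score does.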

Figure 2.
Comparison of updating processes between traditional replay buffer (left) and weighted replay buffer (right). The yellow line represents the new experiences that will be stored. The green line and the red line represent the experiences that need to be discarded in the traditional replay buffer and the weighted replay buffer, respectively.

Weighted Near-Optimal Experiences Policy Optimization
After initialization, the policy network learns from the linear Gaussian controller in a supervised learning manner. The linear Gaussian controller is constantly updated through interaction with the environment, and the near-optimal experiences are stored in the weighted replay buffer. The goal of policy network learning in this phase is to reach the performance of the linear Gaussian controller. We transformed the original optimization problem as follows:

min_{p, π_θ} E_{p(τ)} [ Σ_t ℓ(x_t, u_t) ]  s.t.  p(u_t|x_t) = π_θ(u_t|x_t) for all x_t, u_t, t,

where p(u_t|x_t) = N(K_t x_t + k_t, C_t). The optimization problem is decomposed into two sub-optimization problems using the Bregman Alternating Direction Method of Multipliers (BADMM) [29]:

p ← arg min_p E_{p(τ)} [ Σ_t ℓ(x_t, u_t) − λ_t^T u_t ] + ν_t D_KL( p(u_t|x_t) ‖ π_θ(u_t|x_t) ),
θ ← arg min_θ Σ_t E_{p(x_t) π_θ(u_t|x_t)} [ λ_t^T u_t ] + ν_t D_KL( π_θ(u_t|x_t) ‖ p(u_t|x_t) ).   (8)

To facilitate the updating of the Lagrange multiplier λ_t, the equality constraint p(u_t|x_t) = π_θ(u_t|x_t) in Equation (8) can be relaxed so that only the first-order moments of the two distributions are required to be equal. Thus, the final optimization problem for the policy is:

min_θ Σ_t E [ λ_t^T ( μ^π(x_t) − μ^p(x_t) ) + ν_t D_KL( π_θ(u_t|x_t) ‖ p(u_t|x_t) ) ].   (9)

Before policy network optimization, two steps must be carried out iteratively: dynamics model fitting and linear Gaussian controller optimization.

Dynamics Model Fitting
When the robot interacts with the environment, the stochastic, uncertain nature of the environment makes the environment dynamics non-stationary, and how the dynamics are modeled determines the quality of the trajectory. The universal approximation theorem [30] makes it possible to fit any environment dynamics with a neural network, but training a neural network requires a large number of samples, which somewhat defeats this paper's purpose of solving the sample inefficiency of robot skill learning. Linear models can be fitted from far fewer samples than neural networks but generalize less well, and a Gaussian distribution can be used to represent the uncertainty of the dynamics model. Therefore, this paper uses a linear time-varying Gaussian system to model the environment.
Assuming that the observed state is Markovian, the environment dynamics model satisfies:

p(x_{t+1}|x_t, u_t) = N( f_{xt} x_t + f_{ut} u_t + f_t, F_t ).

To estimate the parameters { f_{xt}, f_{ut}, F_t } of the environment dynamics model, the linear Gaussian controller is used to interact with the environment to obtain trajectories τ = {x_1, u_1, x_2, u_2, . . . , x_T, u_T}, from which we construct the input set { y_t = [x_t; u_t] ∈ R^{D+E} } and the output set { x_{t+1} ∈ R^D }, where D is the dimension of the system state space and E is the dimension of the system action space. Based on linear mean square estimation theory [31], it is known that:

[ f_{xt}  f_{ut} ] = cov(x_{t+1}, y_t) cov(y_t, y_t)^{−1}.

In the above equation, the covariance cov(y_t, y_t) and the cross-covariance cov(x_{t+1}, y_t) are estimated from the sampled transitions. The other parameters are calculated as follows:

f_t = E[x_{t+1}] − [ f_{xt}  f_{ut} ] E[y_t],
F_t = cov(x_{t+1}, x_{t+1}) − [ f_{xt}  f_{ut} ] cov(y_t, x_{t+1}).
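In practice, the linear mean square estimate above coincides with an ordinary least-squares fit over the sampled transitions at one time step. A sketch, assuming the M transitions are stacked row-wise (function name illustrative):

```python
import numpy as np

def fit_dynamics(X, U, X_next):
    """Least-squares fit of x_{t+1} ~ f_x x_t + f_u u_t + f_t at one time
    step from M sampled transitions (rows of X, U, X_next). The residual
    covariance estimates the noise term F_t of the linear Gaussian model."""
    Y = np.hstack([X, U, np.ones((X.shape[0], 1))])   # regressors [x_t, u_t, 1]
    W, *_ = np.linalg.lstsq(Y, X_next, rcond=None)    # solves Y @ W ~ X_next
    D, E = X.shape[1], U.shape[1]
    f_x, f_u, f_t = W[:D].T, W[D:D + E].T, W[-1]
    resid = X_next - Y @ W
    F = resid.T @ resid / max(X.shape[0] - 1, 1)      # noise covariance F_t
    return f_x, f_u, f_t, F
```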

Linear Gaussian Controller Optimization
If the linear Gaussian controller parameters change too much after an update, convergence of the controller parameters suffers, and unpredictable damage may be caused to the robot. Therefore, the KL divergence between the controllers before and after the update should be less than a threshold ε:

min_p E_p [ Σ_t ℓ(x_t, u_t) ]  s.t.  Σ_t D_KL( p(u_t|x_t) ‖ p̄(u_t|x_t) ) ≤ ε,

where p̄(u_t|x_t) is the controller before the parameter update. Then, we convert the above problem into its Lagrangian dual form [29]:

L(p, η_t) = E_p [ Σ_t ℓ(x_t, u_t) ] + Σ_t η_t ( D_KL( p(u_t|x_t) ‖ p̄(u_t|x_t) ) − ε ),   (14)

where η_t is the Lagrange multiplier. Set the cost function c(x_t, u_t) equal to:

c(x_t, u_t) = (1/η_t) ℓ(x_t, u_t) − log p̄(u_t|x_t).

Therefore, Equation (14) can be abbreviated as:

min_p E_p [ Σ_t c(x_t, u_t) ] − Σ_t H( p(u_t|x_t) ).   (16)

Since we have fitted the system dynamics model p(x_{t+1}|x_t, u_t), the parameters of the linear Gaussian controller p(u_t|x_t) = N(K_t x_t + k_t, C_t) can be calculated using differential dynamic programming. Define the state value function V(x_t) at any moment t as [32]:

V(x_t) = min_{u_t} [ c(x_t, u_t) + V(x_{t+1}) ] = min_{u_t} Q(x_t, u_t),   (17)

where x_{t+1} = f_{xt} x_t + f_{ut} u_t + f_t. To find the local minimum of V(x_t), the action value function Q(x_t, u_t) is expanded to second order by Taylor expansion at x_t = 0, u_t = 0:

Q(x_t, u_t) ≈ (1/2) [x_t; u_t]^T Q_{xu,xut} [x_t; u_t] + [x_t; u_t]^T Q_{xut} + const,   (19)

where Q_{xut} is the partial derivative of Q(x_t, u_t) with respect to [x_t; u_t] and Q_{xu,xut} is the corresponding Hessian matrix.
To obtain the minimum of the state value function V(x_t), set the partial derivative of Equation (19) with respect to u_t to zero:

Q_{ut} + Q_{u,xt} x_t + Q_{u,ut} u_t = 0,

which gives the minimizing action:

u_t = −Q_{u,ut}^{−1} ( Q_{ut} + Q_{u,xt} x_t ).   (20)

The parameters of the linear Gaussian controller at moment t are therefore:

K_t = −Q_{u,ut}^{−1} Q_{u,xt},   k_t = −Q_{u,ut}^{−1} Q_{ut},   C_t = Q_{u,ut}^{−1}.

Then, substituting Equation (20) into Equation (17) yields a quadratic V(x_t), and all parameters of the linear Gaussian controller can be calculated iteratively backward in time.
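The backward recursion above can be sketched for the plain LQR case (quadratic cost x'Q_c x + u'R_c u, no KL/entropy terms), which is enough to show how the time-varying gains K_t arise; the paper's controller additionally carries the surrogate cost of Equation (16) and the offsets k_t. All names here are illustrative.

```python
import numpy as np

def backward_pass(A, B, Qc, Rc, T):
    """Minimal LQR-style backward recursion for dynamics x_{t+1} = A x + B u
    and per-step cost x'Qc x + u'Rc u, returning the feedback gains K_t."""
    n = A.shape[0]
    V = np.zeros((n, n))                   # V_{xx} at the terminal step
    Ks = []
    for _ in range(T):
        Quu = Rc + B.T @ V @ B             # Q_{u,ut}
        Qux = B.T @ V @ A                  # Q_{u,xt}
        K = -np.linalg.solve(Quu, Qux)     # feedback gain K_t = -Quu^{-1} Qux
        V = Qc + A.T @ V @ A + Qux.T @ K   # value recursion under u = K x
        Ks.append(K)
    return Ks[::-1]                        # gains ordered t = 0 .. T-1
```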

Policy Network Optimization
For the imitation phase, since the policy π_θ(u_t|x_t) obeys the Gaussian distribution N(μ^π(x_t), Σ^π(x_t)) and the linear Gaussian controller p(u_t|x_t) obeys the distribution N(μ^p(x_t), C_t), the optimization problem in Equation (9) can be further written as follows:

min_θ (1/2M) Σ_{i=1}^{M} Σ_t [ μ^T C_{ti}^{−1} μ + tr( C_{ti}^{−1} Σ^π(x_t) ) − log |Σ^π(x_t)| ],

where μ = μ^π(x_t) − μ^p_{ti}(x_t), and M is the number of trajectories sampled by the system. To simplify the solution of the objective function, the variance Σ^π(x_t) of the policy is assumed to be independent of the state x_t. Setting the partial derivative of the above equation with respect to Σ^π to zero yields a closed-form covariance, simplifying the optimization problem to:

min_θ Σ_{i,t} μ^T C_{ti}^{−1} μ.   (25)

It can be seen that when the output μ^π(o_t) of the policy is consistent with the output μ^p_{ti}(x_t) of the controller, the corresponding policy parameters θ are the optimal solution to this optimization problem.
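The simplified imitation objective is thus a precision-weighted least-squares match between policy means and controller means over the sampled MDUs; a sketch (function name hypothetical):

```python
import numpy as np

def imitation_loss(mu_pi, mu_p, C_inv):
    """Precision-weighted squared distance between policy means mu_pi(x_t)
    and controller means mu_p(x_t), averaged over the N stacked MDUs.
    C_inv holds the per-MDU inverse covariances C_ti^{-1}."""
    diff = mu_pi - mu_p                        # the mu of the derivation
    # per-sample quadratic form diff' C^{-1} diff, then average
    return float(np.mean(np.einsum('ni,nij,nj->n', diff, C_inv, diff)))
```

The loss is zero exactly when every policy mean matches the controller mean, mirroring the optimality condition stated above.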
For the practice phase, we adopt the PPO-clip algorithm [28]. The idea of this approach is to find a functional relationship between the policy change and its value, take the new policy that improves upon the old policy the most and prevent the policy from being updated too far by clipping the probability ratio. During training in this phase, the experiences in the weighted replay buffer contain the near-optimal trajectories output by the linear Gaussian controller as well as the data generated when the policy network interacts with the environment.
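The clipping at the heart of PPO-clip can be written in a few lines. The sketch below evaluates the clipped surrogate objective for given probability ratios and advantage estimates; ε = 0.2 is the common default, not necessarily the setting used in the paper.

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO-clip surrogate: truncating the probability ratio
    pi_theta(u|x) / pi_theta_old(u|x) to [1-eps, 1+eps] removes the
    incentive to move the policy far from the pre-update policy."""
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    clipped = np.clip(ratio, 1 - eps, 1 + eps)
    return float(np.mean(np.minimum(ratio * advantage, clipped * advantage)))
```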
We describe the implementation steps of WNEPO as follows: (1) Initialize the linear Gaussian controller and policy network; (2) Control the robot to walk with the linear Gaussian controller and record the experiences in the weighted replay buffer; (3) Use experiences stored in the weighted replay buffer to update the linear Gaussian controller and policy network in a supervised learning manner; (4) Check whether the cumulative reward obtained by the linear Gaussian controller converges; if it converges, skip to step (5), otherwise return to step (2); (5) Control the robot to walk with the policy network and record the experiences in the weighted replay buffer; (6) Use the PPO algorithm to update the policy network until the cumulative reward converges.
The output of the WNEPO algorithm is a policy network, which can directly control the robot to complete the desired task. The implementation steps described above form an experience-based cyclic process. The pseudo code of the WNEPO algorithm is shown in Algorithm 1.

Algorithm 1. WNEPO: A two-phase framework for efficient robot skills learning
1: Initialize p(u_t|x_t), π_θ(u_t|x_t), π_θ̄(u_t|x_t), weighted replay buffer D
2: Initialize λ_t, K, J, α
3: For k ∈ {1, . . . , K}:
4:   Initialize D_p
5:   Interact with the environment M times with p(u_t|x_t)
6:   Collect the resulting trajectories, score them and store them in D and D_p
7:   Fit the dynamics model p(x_{t+1}|x_t, u_t) using D_p
8:   For j ∈ {1, . . . , J}:
9:     Update the linear Gaussian controller p(u_t|x_t) with Equation (16)
10:  Optimize π_θ(u_t|x_t) with Adam according to Equation (25)
11:  If the cumulative reward of p(u_t|x_t) has converged: break
12:  Else: continue
13: Interact with the environment with π_θ(u_t|x_t), store the experiences in D and update π_θ(u_t|x_t) with PPO until the cumulative reward converges

Experiments
In this section, we will answer the following three questions by carrying out simulation experiments on a MATLAB platform: (1) Can a biped robot learn walking skills only from the imitation phase without any prior knowledge? (2) Can the WNEPO algorithm make the robot learn better walking skills in a shorter time? From the perspective of RL, are asymptotic performance and sample efficiency better? (3) How do weighted near-optimal experiences affect the performance of different algorithms?

Description of the Environment
The WNEPO proposed in this paper is validated with a simulation experiment on a biped robot built using MATLAB's Simscape toolbox. The biped robot is shown in Figure 3. The robot was required to stay upright while walking and to walk as far as possible in a straight line within a limited time, during which the motors of each joint execute the torque output by the policy network. The motors at the three joints (ankle, knee and hip) of each leg can be regarded as the agent. Our task is to make the motors output appropriate torques to control the robot to walk along a straight line.
The robot had two legs and a torso, and each leg contained three joints (ankle, knee and hip). A torque u_i ∈ [−3, 3] N·m was applied to each joint of the legs, i = 1, 2, . . . , 6. The key physical parameters of the robot are shown in Table 2. The contact between the robot's feet and the ground adopted a point-to-surface contact mode, with contact stiffness 500 N/m and damping 50 N·s/m. In order to make the walking task closer to a real situation, Gaussian white noise was added to the contact stiffness and damping.

Parameter Specification
PPO has strong robustness, so the neural network structure and other hyperparameters had little influence on the training process. Various network structures, including a three-layer network and a four-layer network, were tested with no significant difference in training effect, so the default network structure commonly used with PPO was adopted: the policy network and the critic network both use two hidden layers, with 300 and 400 nodes, respectively. The output layer of the policy network has six nodes, representing the torques applied to the six joints of the two legs; the output layer of the critic network has only one node, a score of the current state.
The capacity of the weighted replay buffer D was set to 600 and the mini-batch size to 500. The maximum number of episodes was 5000. The values of α_1, α_2 and α_3 were 0.4, 0.5 and 0.1, respectively.
The reward function r_t = −ℓ(x_t, u_t) has the following form:

r_t = v_x − w_1 y^2 − w_2 ẑ^2 + w_3 T_s/T_f − w_4 Σ_i u_i^2,

where v_x is the displacement in the x-direction, y is the distance of the trajectory from the preset line, ẑ is the normalized vertical displacement of the robot's center of mass, T_s is the sample time of the environment, T_f is the final simulation time of the environment and w_1 to w_4 are positive weights tuned for the task. T_f was set to 10 s in this simulation. Both the policy and value networks were trained using Adam, with a learning rate of 0.0001 for the policy network and 0.002 for the value network. The order of the trajectory data in the experience pool was not scrambled, so that complete sequence fragments could be taken for the estimation of the generalized advantage function.
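The structure of this reward can be sketched as follows; the weight values `w` are illustrative placeholders, not the paper's tuned coefficients.

```python
import numpy as np

def step_reward(v_x, y, z_hat, u, Ts, Tf, w=(1.0, 3.0, 50.0, 25.0, 0.02)):
    """Shape of the walking reward: forward progress v_x is rewarded, lateral
    deviation y, vertical CoM motion z_hat and control effort are penalized,
    and a small bonus Ts/Tf is paid for each survived sample time.
    The weights w are illustrative placeholders."""
    return (w[0] * v_x - w[1] * y**2 - w[2] * z_hat**2
            + w[3] * Ts / Tf - w[4] * float(np.sum(np.square(u))))
```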

Comparison Methods
WNEPO can be regarded as a combination of model-free and model-based RL, which exploits both the sample efficiency of model-based RL and the asymptotic performance of model-free RL. In order to evaluate the performance of WNEPO, we compared it with the following algorithms:
• iLQG [19]. A typical model-based RL algorithm; when the environment dynamics are known, the optimal analytical solution can be obtained.
• GPS [6]. A state-of-the-art algorithm combining model-based RL with model-free RL.
• WE-GPS. Replaces the experience pool in GPS with a weighted replay buffer.
• PPO [28]. A state-of-the-art model-free on-policy RL algorithm. This paper does not consider off-policy RL methods because off-policy RL usually incurs higher risk, which is not suitable for robot skills learning [33].

• WE-PPO. Replaces the replay buffer in PPO with a weighted replay buffer.
• GPS-PPO. GPS is used to update the policy network offline, and then the PPO algorithm is used to train the policy network online. The only difference between GPS-PPO and WNEPO is that GPS-PPO directly uses the online interactive data between the policy network and the environment, instead of the experiences in the weighted replay buffer, to update the policy.
In practice, the risk consideration mentioned above can make a big difference when mistakes are costly, e.g., when a robot is trained not in simulation but in the real world; a more conservative learning algorithm that avoids high risk is preferable when real time and money are at stake should the robot be damaged.

Evaluation of Walking Skills Learned from the Imitation Phase
The optimization process of the linear Gaussian controller p(u_t|x_t) is shown in Figure 4.
At the beginning of the iterations, an accurate dynamics model was unavailable and the output torque was small; the agent only explored near the initial configuration to avoid possible damage to the robotic system. After a period of interaction, the controller tried to make the robot walk, but it could not complete the walking task because the dynamics model was inaccurate, so the trajectories generated by the controller were far from optimal. As the samples in the weighted replay buffer increased, the dynamics model was fitted more and more accurately. After eight iterations (k ≥ 8), the robot could walk some distance, although the behavior seemed unnatural. After 12 iterations (k ≥ 12), the average cumulative reward of the linear Gaussian controller had nearly converged. With the increase in interaction data between the iLQG controller and the environment, more experiences were used to fit the dynamics, and the average cumulative reward of the iLQG agent grew rapidly; once the interaction experiences accumulated to a certain amount, the performance improvement of iLQG became marginal.
While the iLQG controller interacts with the environment, the policy network is simultaneously trained using samples obtained from the iLQG's experiences. Since the average cumulative reward of the iLQG controller almost converges at k = 12, we evaluated the walking skills learned by the policy network at this point. The training process of the policy network can be regarded as a typical regression problem, so we used the root-mean-square error (RMSE) loss to evaluate how well the policy network imitates the iLQG controller. The RMSE curve of policy network training at k = 12 is shown in Figure 5. It can be seen that the policy network converges at epoch = 9, which means that its performance is close to that of the iLQG controller.

Next, we used the policy network to control the robot directly, and we take one trajectory as an example to analyze the movement of the biped robot. Walking gait refers to the posture and behavior characteristics of walking, including the continuous activities of the hip, knee and ankle. Figure 6 shows the torques exerted on the six joints of the biped robot during walking. It can be seen that the torque exerted on the ankle joints is periodic, because the two feet alternately support the robot as it moves forward. The torque applied to the knee joints does not change periodically, because the robot needs to adjust it in real time to keep the body balanced. The torque exerted on the hip joints is the smallest, and the torques applied to the left and right hip joints are always opposite in direction, which is consistent with the characteristics of human walking. Figure 7 shows the motion of the robot. Figure 7a shows the lateral offset: the robot drifts in the positive y-direction while moving forward, with a maximum offset of 0.36 m, which is negligible compared with the total distance.
Figure 7b shows how the center of gravity changes over time. At the beginning of the movement, the center of gravity changes greatly because the robot needs to adapt to the transition from the initial standing state to the walking state. After 5 s, however, the variation of the center of gravity stays within 0.005 m, which indicates that the robot can deal with the uncertainty of the environment and effectively adjust its own posture to complete the walking task. Figure 7c shows that the robot's speed is uneven during the forward motion, which may be caused by the lateral offset: the robot must trade off between moving forward and maintaining body balance. Even so, the robot can already walk a long distance and move smoothly. It can therefore be considered that the trajectories generated by the linear Gaussian controller at this stage are near-optimal. In the next subsection, we will use more indicators to evaluate whether the robot's walking skills can be further improved in the practice phase.

Asymptotic Performance and Sample Efficiency
When the performance of the linear Gaussian controller tends to converge, we switch from the imitation phase to the practice phase. The role of the iLQG controller is now to expand the weighted replay buffer D, providing high-value trajectories for policy network training. Unlike standard PPO, the experiences used for policy optimization come not only from the robot's own interaction with the environment but also from the near-optimal trajectories generated by the iLQG controller. Figure 8 shows the training process of the different algorithms. The iLQG controller has the fastest convergence rate and reaches an expected reward close to 170 with fewer training samples and less computation. During the imitation phase, the method proposed in this paper can hardly surpass the performance of the linear Gaussian controller. This result is intuitive, because the learning samples determine the upper limit of the model's performance; GPS and WE-GPS encounter the same limitation during training. After switching to the practice phase, the policy network can achieve a higher reward than iLQG. In the practice phase, the performance of GPS and WE-GPS remains limited by the linear Gaussian controller, whereas WNEPO can use the exploration mechanism of model-free RL to expose the robot to situations it has not encountered before. This helps the neural network search for the globally optimal policy, making it more robust than the linear Gaussian controller in unseen situations. WNEPO also obtains a higher reward than the current best-performing on-policy model-free RL method, PPO, because the parameter update of the policy network uses both near-optimal experiences and online interactive experiences.
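The key mechanism here is that each practice-phase training batch mixes near-optimal iLQG trajectories with freshly collected online experiences. A minimal sketch of such batch composition, assuming simple list-based buffers and a hypothetical `ilqg_ratio` knob (the paper does not specify the exact mixing rule), could read:

```python
import random

def sample_mixed_batch(ilqg_buffer, online_buffer, batch_size, ilqg_ratio=0.5):
    """Draw a training batch mixing near-optimal (iLQG) and online samples.

    `ilqg_ratio` is a hypothetical parameter controlling the fraction
    drawn from the near-optimal buffer; the remainder comes from the
    robot's own online interactions.
    """
    n_ilqg = min(int(batch_size * ilqg_ratio), len(ilqg_buffer))
    n_online = min(batch_size - n_ilqg, len(online_buffer))
    batch = random.sample(ilqg_buffer, n_ilqg) + random.sample(online_buffer, n_online)
    random.shuffle(batch)  # avoid ordering bias in the gradient update
    return batch
```

With `ilqg_ratio=0` this degenerates to ordinary on-policy PPO batches; raising it biases updates toward the controller's near-optimal behavior.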
The comparison of these algorithms on the walking task is shown in Table 3. In this experiment, we required the robot to stay upright and walk as far as possible in 10 s. After all the policies converged, 30 trials were run under different environment parameters (contact stiffness and damping) and the results averaged. WNEPO achieved the best asymptotic performance, walking the longest distance with the smallest control output. At the same time, Table 2 shows that the sample efficiency of WNEPO is significantly better than that of PPO. Although the sample efficiency of WNEPO is worse than that of GPS, our primary concern is learning motor skills well, so the additional online interaction is a worthwhile trade-off.
Finally, we discuss the role of the weighted replay buffer in WNEPO. Its role in the imitation phase is reflected in the comparison between WNEPO and GPS: the weighted replay buffer brings the performance of the policy network closer to that of the iLQG controller. This can be explained by high-scoring experiences helping the policy network avoid becoming trapped in low-reward states. The role of the weighted replay buffer in the practice phase is illustrated by the performance of GPS-PPO in Figure 8. The online interaction of GPS-PPO makes the training process fluctuate strongly, and it fails to converge before the maximum number of episodes is reached. We can therefore infer that the weighted replay buffer improves training stability in the practice phase. The results in Figure 8 also show that the weighted replay buffer improves the performance of both GPS and PPO. The weighted replay buffer is thus a plug-and-play module that can be widely used in other RL algorithms.
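The buffer's behavior can be sketched as a bounded container that evicts the lowest-scoring trajectory when full. This is a minimal Python illustration only: the paper's experience-scoring algorithm has three tunable parameters that we do not reproduce here, so plain cumulative reward stands in as a placeholder score.

```python
import heapq
import itertools

class WeightedReplayBuffer:
    """Bounded buffer that retains the highest-scoring trajectories.

    A trajectory is a list of (state, action, reward) tuples. The
    score() below is a placeholder for the paper's experience-scoring
    algorithm; cumulative reward is used instead.
    """
    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []                # min-heap of (score, seq, trajectory)
        self._seq = itertools.count()  # tiebreaker; avoids comparing trajectories

    def score(self, trajectory):
        return sum(r for (_, _, r) in trajectory)  # placeholder score

    def add(self, trajectory):
        item = (self.score(trajectory), next(self._seq), trajectory)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, item)
        else:
            heapq.heappushpop(self._heap, item)    # evict lowest-scoring entry

    def best(self):
        return max(self._heap)[2]
```

Because the buffer only ever keeps the top-scoring trajectories, low-reward experiences are gradually crowded out, which matches the stabilizing effect observed in the practice phase.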

Conclusions
In this paper, a two-phase framework for efficient learning of robot skills, which we call WNEPO, is proposed based on reinforcement learning. WNEPO can be regarded as a combination of model-free and model-based RL. In the imitation phase, the policy network updates its parameters using the near-optimal experiences of the linear Gaussian controller, which is more sample-efficient than PPO. By continuing to train the policy network with a model-free RL algorithm in the practice phase, the robot learns better walking skills than with the other algorithms tested. The weighted replay buffer proposed in this paper plays a key role in training the policy network: it stores high-scoring historical experience data so as to improve training stability and strengthen exploration of high-reward regions.
The advantage of our method is that the environment dynamics need not be known in advance, and highly robust skills can be learned with fewer interactions with the environment. The proposed weighted replay buffer was also shown to be a plug-and-play module usable in other RL algorithms.
To further tap the potential of the weighted replay buffer, future work will provide a theoretical analysis of how to tune the three parameters of the experience-scoring algorithm. We will also test WNEPO in other robot skill-learning settings, such as high-dimensional manipulation tasks.