Article

Compound Heuristic Information Guided Policy Improvement for Robot Motor Skill Acquisition

1 School of Automation, Wuhan University of Technology, Wuhan 430070, China
2 School of Electronic and Information Engineering, University of Science and Technology Liaoning, Anshan 114051, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2020, 10(15), 5346; https://doi.org/10.3390/app10155346
Submission received: 30 June 2020 / Revised: 28 July 2020 / Accepted: 30 July 2020 / Published: 3 August 2020
(This article belongs to the Special Issue Biorobotics: Challenges, Technologies, and Trends)

Abstract

Discovering implicit patterns and using them as heuristic information to guide the policy search is one of the core factors in speeding up robot motor skill acquisition. This paper proposes a compound heuristic information guided reinforcement learning algorithm, PI2-CMA-KCCA, for policy improvement. Its structure and workflow resemble a double closed-loop control system. The outer loop, realized by Kernel Canonical Correlation Analysis (KCCA), infers the implicit nonlinear heuristic information between the joints of the robot. The inner loop, operated by Covariance Matrix Adaptation (CMA), discovers the hidden linear correlations between the basis functions within a joint of the robot. These patterns, which benefit learning of the new task, automatically determine the mean and variance of the exploring perturbation for Path Integral Policy Improvement (PI2). Compared with classical PI2, PI2-CMA, and PI2-KCCA, PI2-CMA-KCCA not only endows the robot with the ability to transfer trajectory planning from the demonstration to a new task, but also completes the task more efficiently. Classical via-point experiments on SCARA and Swayer robots validate that the proposed method converges quickly and finds a solution for the new task.

1. Introduction

Imitation learning (IL) and reinforcement learning (RL) [1] have long been hot topics in the field of robot skill acquisition. Imitation learning can be divided into two categories: behavioral cloning (BC) and inverse reinforcement learning (IRL). BC learns the expected policy directly from expert demonstration data, while IRL learns the policy indirectly through a reward function. Hidden Markov Models (HMMs) [2], Dynamic Movement Primitives (DMPs), Probabilistic Movement Primitives (ProMPs) [3], Dynamic Systems (DS) [4], and Cross Entropy Regression (CER) are popular BC methods. The most frequently used IRL methods are Maximum Margin Planning (MMP) [5] and Markov Process (MP) approaches.
Imitation learning is limited because the robot learns only from demonstrated trajectories. When the reproduction environment differs substantially from the demonstration environment, such as when an obstacle is placed on the robot's path, imitation learning may fail. RL, by contrast, allows the robot to find a new control policy by freely exploring the state-action space. Combining IL and RL exploits the advantages of both methods to overcome their respective shortcomings, so that the robot can adapt to deviations from the demonstrated behavior and thus improve its performance.
Classical RL methods include SARSA [6], Natural Actor-Critic (NAC) [7], Policy Learning by Weighting Exploration with the Returns (PoWER) [8], Relative Entropy Policy Search (REPS) [9], and Path Integral Policy Improvement (PI2). We believe that PI2 [10] is one of the most effective, numerically robust, and easy-to-implement reinforcement learning algorithms. However, classical PI2 searches the whole parameter space, so it completes tasks less efficiently. Kernel Canonical Correlation Analysis (KCCA) can infer the implicit nonlinear heuristic information between the joints of the robot when facing a new task, leading PI2 to a solution. Building on Covariance Matrix Adaptation (CMA) [11] and our previous research on KCCA [12], we propose a new algorithm, PI2-CMA-KCCA, in which KCCA and CMA are integrated as compound heuristic information to speed up the learning procedure from the demonstration to a new task.
The remainder of this paper is organized as follows. Section 2 reviews the DMP used in the imitation learning phase; we use Dynamic Movement Primitives as the underlying policy representation. Section 3 briefly introduces PI2-CMA, which aims at discovering the hidden relationships between the components of the weights. Section 4 introduces KCCA and derives the PI2-CMA-KCCA algorithm. In Section 5, we validate our algorithm on SCARA and Swayer robots through classical via-point tasks and analyze the experimental results. Conclusions are given in Section 6.

2. Dynamic Movement Primitives

Dynamic Movement Primitives (DMPs) have been used in many disciplines to model complex behaviors. In this paper, we use DMPs as the underlying policy representation. A joint of the robot can be regarded as a DMP. The DMP [13] consists of a damped spring system and a learnable nonlinear forcing term, by which the desired behavior of the joint can be obtained. Its expression is given by:
$$\zeta^{2}\ddot{y}_t=\underbrace{\alpha_y\left(\beta_y\left(g-y_t\right)-\zeta\dot{y}_t\right)}_{\alpha_z}+\underbrace{h(x_t)\,x_t\,(g-y_0)}_{\alpha_f},\qquad \zeta\dot{x}_t=-\alpha_x x_t,\qquad h(x_t)=\frac{\sum_{i=1}^{M}\Psi_i(x)\,\omega_i}{\sum_{i=1}^{M}\Psi_i(x)}=\Psi^{T}\omega, \quad (1)$$
where, if the forcing term labeled $\alpha_f$ is removed, the remainder labeled $\alpha_z$ is a globally stable second-order damped spring system. $\zeta$ is a time constant proportional to the duration of the motion. $g$ is the target position, $y_0$ is the initial position, and $y_t$ is interpreted as the desired position of the joint. $h(x_t)$ is a fitting function, and $x_t$ can be conceived of as a phase variable. $\Psi_i(x)$ denotes the $i$th basis function, $\Psi_i(x)=\exp\left(-(x_t-c_i)^2/2\sigma_i^2\right)$, where $c_i$ and $\sigma_i$ are constants that determine the center and width of the basis function, respectively. $\omega_i$ is the weight corresponding to the $i$th basis function. $M$ is the number of exponential basis functions, and $\omega\in\mathbb{R}^{M\times 1}$ is the weight vector. $\alpha_y$ and $\beta_y$ are chosen so that the damped spring system is critically damped.
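To make the dynamics above concrete, the DMP can be integrated numerically. The following sketch uses Euler integration with illustrative gain values (a critically damped spring with alpha_y = 25, beta_y = alpha_y/4, and alpha_x = 8); the basis-function centers and widths are heuristic choices, not the paper's exact settings.

```python
import numpy as np

def dmp_rollout(w, g, y0, zeta=1.0, alpha_y=25.0, beta_y=6.25,
                alpha_x=8.0, dt=0.002, T=1.0):
    """Euler-integrate one DMP (spring-damper plus forcing term).

    w       : (M,) basis-function weights
    g, y0   : goal and start positions
    zeta    : time constant of the movement
    alpha_y, beta_y : spring gains (critically damped when beta_y = alpha_y/4)
    alpha_x : decay rate of the phase variable x
    """
    M = len(w)
    c = np.exp(-alpha_x * np.linspace(0.0, 1.0, M))     # centers along the phase
    sigma2 = 4.0 * np.diff(c, append=c[-1])**2 + 1e-6   # heuristic widths
    n = int(T / dt)
    y, yd, x = y0, 0.0, 1.0
    traj = np.empty(n)
    for t in range(n):
        psi = np.exp(-(x - c)**2 / (2.0 * sigma2))
        h = psi @ w / (psi.sum() + 1e-10)               # normalized basis mixture
        ydd = (alpha_y * (beta_y * (g - y) - zeta * yd)
               + h * x * (g - y0)) / zeta**2            # spring + forcing term
        x += -alpha_x * x / zeta * dt                   # phase dynamics
        yd += ydd * dt
        y += yd * dt
        traj[t] = y
    return traj
```

With zero weights, the forcing term vanishes and the spring alone should drive the joint to the goal.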
The parameters $\omega_i$ of the DMP are learned by the Locally Weighted Regression (LWR) algorithm, which finds the corresponding $\omega_i$ for each basis function $\Psi_i$ by minimizing the cost function $S$, defined by:
$$S(\omega)=\sum_{j=1}^{N}\Bigl(\sum_{i=1}^{M}\Psi_i(x_j)\,\omega_i-h(x_j)\Bigr)^{2}, \quad (2)$$
where $(x_j, h(x_j))$ is the $j$th sample. Obviously, the $\omega_i$ cooperate with one another in a linear combination because of the introduction of the basis functions; thus, we can optimize them by means of linear correlation techniques. A summary of the notation frequently used in this article is given in Table 1.
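The minimization of S(ω) can be sketched as a batch least-squares problem over the normalized basis matrix. This is a plain global least-squares stand-in for LWR (which fits each weight locally); the function name and the use of `numpy.linalg.lstsq` are illustrative choices.

```python
import numpy as np

def fit_weights(x_samples, h_targets, centers, widths):
    """Least-squares fit of the DMP weights.

    Builds the normalized basis matrix Psi (N x M), with
    Psi[j, i] = Psi_i(x_j) / sum_i Psi_i(x_j), and solves
    min_w || Psi w - h ||^2 in closed form.
    """
    diff = x_samples[:, None] - centers[None, :]
    psi = np.exp(-diff**2 / (2.0 * widths[None, :]**2))
    psi /= psi.sum(axis=1, keepdims=True)          # normalize over basis functions
    w, *_ = np.linalg.lstsq(psi, h_targets, rcond=None)
    return w
```

On synthetic targets generated from known weights, the fit recovers them exactly (the basis matrix is well conditioned).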

3. Path Integral Policy Improvement with Covariance Matrix Adaption

PI2 is derived from the first principles of stochastic optimal control. What sets PI2 apart from other direct policy improvement algorithms is its use of probability-weighted averaging to perform the parameter update, rather than an estimate of the gradient. In Section 2, we obtained the weight vector $\omega$ by parameterizing the demonstration trajectory. The idea of PI2 is to add stochastic exploration noise $\varepsilon_d \sim N(0, \sigma^2)$ to $\omega_d$ and to generate $K$ roll-outs with different costs by executing the parameterized policy. For a robot with $D$ DOFs, the cost function of the $k$th roll-out at time $i$ is
$$J(\tau_{\cdot,i,k})=\sum_{d=1}^{D}\Bigl[\varphi_{d,N,k}+\sum_{j=i}^{N-1}q_{d,j,k}+\frac{1}{2}\sum_{j=i}^{N-1}\left(\omega_d+M_{d,j,k}\,\varepsilon_{d,j,k}\right)^{T}R\left(\omega_d+M_{d,j,k}\,\varepsilon_{d,j,k}\right)\Bigr], \quad (3)$$
where $\tau_{\cdot,i,k}\in\tau$ is a sample path (or trajectory piece) and $\cdot$ indicates all DOFs. $\varphi_{d,N,k}$ represents the terminal reward of the $k$th trajectory of the $d$th DOF. $q_{d,j,k}$ is the immediate reward of the $k$th trajectory of the $d$th DOF at time $j$; specifically, it is the square of the joint's acceleration, $\ddot{y}_t^2$. $M_{d,j,k}$ is the mapping matrix of the $k$th trajectory of the $d$th DOF at time $j$. $R$ is the positive semi-definite weight matrix of the quadratic control cost. $N$ is the maximum value of the time index.
Next, the exploration is evaluated. First, the costs of the obtained trajectories are sorted; then $K_e$ elite samples are selected; finally, probability-weighted averaging is performed to obtain $\delta\omega_{d,i}$ for the $d$th DOF at time $i$:
$$\delta\omega_{d,i}=\sum_{k=1}^{K_e}P(\tau_{\cdot,i,k})\,M_{d,i,k}\,\varepsilon_{d,i,k}, \quad (4)$$
where $P(\tau_{\cdot,i,k})$ is the probability-weighted value of the $k$th trajectory of the $d$th DOF at time $i$, obtained by a softmax transformation of the cost function:
$$P_{\cdot,k,i}=P(\tau_{\cdot,i,k})=\frac{e^{-\frac{1}{\lambda}J(\tau_{\cdot,i,k})}}{\sum_{k'=1}^{K_e}e^{-\frac{1}{\lambda}J(\tau_{\cdot,i,k'})}}, \quad (5)$$
where λ is an appropriate constant.
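The elite selection, softmax weighting, and probability-weighted averaging described above can be sketched as follows. The function signature and the max-shift inside the softmax (for numerical stability) are implementation choices, not from the paper.

```python
import numpy as np

def pi2_update(costs, perturbations, lam=1.0, n_elite=None):
    """PI2 probability-weighted averaging over roll-outs.

    costs         : (K,) cost J of each roll-out
    perturbations : (K, M) exploration noise added to w in each roll-out
    lam           : temperature of the softmax transformation
    n_elite       : if given, keep only the K_e cheapest roll-outs
    Returns the weight correction delta_w of shape (M,).
    """
    if n_elite is not None:
        elite = np.argsort(costs)[:n_elite]          # K_e elite samples
        costs, perturbations = costs[elite], perturbations[elite]
    s = -(costs - costs.min()) / lam                 # shift for numerical stability
    p = np.exp(s) / np.exp(s).sum()                  # softmax over negative cost
    return p @ perturbations                         # probability-weighted average
```

A roll-out with much lower cost dominates the update, so delta_w approaches that roll-out's perturbation.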
Later, $\delta\omega_{d,i}$ is averaged over time to obtain $\delta\omega_d$ and hence the new weight (i.e., $\omega_d^{new}=\omega_d^{old}+\delta\omega_d$). By searching the parameter space iteratively, PI2 eventually finds a solution for the new task. Classical PI2 only updates the mean $\omega$; the covariance $\sigma^2$ is a constant ($\sigma^2=\lambda_{init}I_M$), where $M$ is the number of basis functions and $\lambda_{init}$ determines the magnitude of the initial exploration noise. PI2-CMA aims to determine the magnitude of the exploration noise automatically and to infer the implicit linear correlations between the basis functions within a joint of the robot. In other words, the covariance matrix adaptation is expressed as:
$$\Sigma_{d,i}^{new}=\sum_{k=1}^{K_e}P_{\cdot,k,i}\left(M_{d,i,k}\,\varepsilon_{d,i,k}\right)\left(M_{d,i,k}\,\varepsilon_{d,i,k}\right)^{T}. \quad (6)$$
In addition, the covariance update equation is
$$\Sigma_{d}^{new}=\frac{\sum_{i=1}^{N}(N-i)\,\Sigma_{d,i}^{new}}{\sum_{l=1}^{N}(N-l)}. \quad (7)$$
Note that vanilla PI2-CMA only infers the linear correlations of the weights within each DMP independently.
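The per-time-step covariance adaptation and its temporal averaging can be sketched as below. The indexing of the temporal weights (N, N−1, …, 1) is a judgment call on the extracted formula; the function names are illustrative.

```python
import numpy as np

def cma_covariance(probs, perturbations):
    """Probability-weighted covariance of the elite perturbations (CMA step).

    probs         : (K_e,) probability weights P of the elite roll-outs
    perturbations : (K_e, M) elite exploration noise
    """
    return sum(p * np.outer(e, e) for p, e in zip(probs, perturbations))

def temporal_average(sigmas):
    """Temporal averaging of the per-step covariances: earlier time steps
    receive larger weights (N - i), normalized so the weights sum to one."""
    N = len(sigmas)
    w = np.arange(N, 0, -1, dtype=float)       # N, N-1, ..., 1
    return np.tensordot(w, np.asarray(sigmas), axes=1) / w.sum()
```

With a single elite sample of probability 1, the covariance is just the outer product of its perturbation; averaging identical matrices returns the same matrix.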

4. PI2-CMA with Kernel Canonical Correlation Analysis

From Equation (3), it can be seen that the perturbation $\varepsilon$ of PI2 is generated with equal probability for each joint's weight vector $\omega_d$. PI2-CMA takes into account the implicit linear correlations between the basis functions of a joint and automatically updates the covariance. However, when a task is assigned to a multi-joint robot, there exists an unknown, hidden, task-oriented pattern between the perturbed weight vectors $\check{\omega}_{d_i}$ and $\check{\omega}_{d_j}$ ($\check{\omega}_{d_k}=\omega_{d_k}+\delta\omega_{d_k}$, $k\in\{i,j\}$). This pattern can be inferred from experience and expressed as a nonlinear correlation. Specifically, we can apply Kernel Canonical Correlation Analysis (KCCA) to infer the implicit nonlinear heuristic information between the joints of the robot to guide the PI2 policy search for the new trajectory-planning task. In this paper, we not only consider the linear correlations between the perturbation vector's components within a DMP but also take the nonlinear correlations of the perturbation vectors between DMPs into account to expedite the robot's learning procedure.

4.1. Nonlinear Correlation Heuristic Information

According to Equations (3) and (4), the perturbation is generated with equal probability for each joint's weight vector; in other words, there is no correlation between the joints. Therefore, $[\check{\omega}_1^T,\check{\omega}_2^T,\ldots,\check{\omega}_D^T]^T\in\mathbb{R}^{DM\times 1}$ is denoted as $\tilde{\omega}$, and the covariance matrix of $\tilde{\omega}$ is given by:
$$\Sigma_{\tilde{\omega}}=\mathrm{diag}\left[\sigma_1^{2},\sigma_2^{2},\ldots,\sigma_D^{2}\right]\in\mathbb{R}^{DM\times DM}, \quad (8)$$
where $\mathrm{diag}[\cdot]$ denotes a diagonal matrix and $\sigma_1^2=\cdots=\sigma_D^2=\sigma^2$.
When a multi-joint robot completes a task, we believe that there are hidden patterns between the joints. The implicit patterns between the perturbation vectors of the multi-joint robot are expressed as nonlinear correlations. With the help of the kernel method, the $d$th joint's perturbation vector $\check{\omega}_d\in\mathbb{R}^{M\times 1}$ is mapped to the high-dimensional feature space $\phi(\check{\omega}_d)\in\mathbb{R}^{V\times 1}$. $\tilde{\omega}$ is likewise mapped to $\phi(\tilde{\omega})\in\mathbb{R}^{DV\times 1}$, and the covariance matrix of $\tilde{\omega}$ can then be described as:
$$\Sigma_{\phi(\tilde{\omega})}=\begin{bmatrix}\Gamma(\check{\omega}_1,\check{\omega}_1)&\Gamma(\check{\omega}_1,\check{\omega}_2)&\cdots&\Gamma(\check{\omega}_1,\check{\omega}_D)\\\Gamma(\check{\omega}_2,\check{\omega}_1)&\Gamma(\check{\omega}_2,\check{\omega}_2)&\cdots&\Gamma(\check{\omega}_2,\check{\omega}_D)\\\vdots&\vdots&\ddots&\vdots\\\Gamma(\check{\omega}_D,\check{\omega}_1)&\Gamma(\check{\omega}_D,\check{\omega}_2)&\cdots&\Gamma(\check{\omega}_D,\check{\omega}_D)\end{bmatrix}, \quad (9)$$
where $\Gamma(\check{\omega}_{d_i},\check{\omega}_{d_j})=\mathrm{cov}(\phi(\check{\omega}_{d_i}),\phi(\check{\omega}_{d_j}))$ is the covariance of $\check{\omega}_{d_i}$ and $\check{\omega}_{d_j}$ projected into the high-dimensional space, expressing the implicit nonlinear pattern between the perturbation vectors of the $d_i$th joint and the $d_j$th joint. The intuitive policy is to estimate $\Sigma_{\phi(\tilde{\omega})}$ from empirical samples and then use $\Gamma(\check{\omega}_{d_i},\check{\omega}_{d_j})$ as the heuristic information: given the perturbation vector $\check{\omega}_{d_i}$, the exploration of $\check{\omega}_{d_j}$ is guided by the covariance $\Gamma(\check{\omega}_{d_i},\check{\omega}_{d_j})$.
Note that $\Gamma(\check{\omega}_{d_i},\check{\omega}_{d_j})$ lives in a single shared coordinate system (i.e., that of $\phi(\check{\omega})$), but analyzing the correlation between the perturbation vectors $\check{\omega}_{d_i}$ and $\check{\omega}_{d_j}$ in a uniform coordinate system is not the best choice: after the perturbation vectors of different joints are mapped to a high-dimensional space, the correlation coefficients between the joints are maximized only after a proper projection transformation, and such projection matrices are usually not identical. Here, the generalized Rayleigh quotient is used to find the maximum correlation coefficient of $\phi(\check{\omega}_{d_i})$ and $\phi(\check{\omega}_{d_j})$, and Maximum Likelihood Estimation is used to infer the nonlinear correlation between $\check{\omega}_{d_i}$ and $\check{\omega}_{d_j}$. If $d_i\neq d_j$, $\Theta(\check{\omega}_{d_i},\check{\omega}_{d_j})=\mathrm{cov}(Pr\{\Phi(\check{\omega}_{d_i})\},Pr\{\Phi(\check{\omega}_{d_j})\})$ can serve as the heuristic information, where $Pr$ is the projection operator. That is, given the perturbation vector $\check{\omega}_{d_i}$, $\check{\omega}_{d_j}$ can be obtained from the covariance $\Theta(\check{\omega}_{d_i},\check{\omega}_{d_j})$. Equation (9) can now be expressed as follows:
$$\Sigma_{\phi(\tilde{\omega})}^{+}=\begin{bmatrix}\Gamma(\check{\omega}_1,\check{\omega}_1)&\Theta(\check{\omega}_1,\check{\omega}_2)&\cdots&\Theta(\check{\omega}_1,\check{\omega}_D)\\\Theta(\check{\omega}_2,\check{\omega}_1)&\Gamma(\check{\omega}_2,\check{\omega}_2)&\cdots&\Theta(\check{\omega}_2,\check{\omega}_D)\\\vdots&\vdots&\ddots&\vdots\\\Theta(\check{\omega}_D,\check{\omega}_1)&\Theta(\check{\omega}_D,\check{\omega}_2)&\cdots&\Gamma(\check{\omega}_D,\check{\omega}_D)\end{bmatrix}. \quad (10)$$

4.2. Robot Intelligent Trajectory Inference with KCCA

KCCA [14] is a nonlinear correlation analysis method. In this paper, we apply KCCA to the elite samples, from the robot's first joint to the other joints, to make heuristic inferences. After the $p$th iteration of PI2-CMA, the program records the total reward $J^p(\tau)$ (one episodic sample) based on the current $\check{\omega}_{1,\ldots,D}^{p}$ (the perturbation vectors of the $D$ joints). Then, $J^p(\tau)$ is compared with the total reward $J^{p-1}(\tau)$ based on $\check{\omega}_{1,\ldots,D}^{p-1}$ at the $(p-1)$th iteration to obtain a decline rate $T_r$.
If $T_r$ is greater than its upper threshold $T_{max}$, $T_r$ is updated to $T_{max}$, and the program executes KCCA learning. At this point, the weight perturbation sample $\check{\omega}_1^s=\{\check{\omega}_{1,1},\check{\omega}_{1,2},\ldots,\check{\omega}_{1,N}\}$ on the first joint and the weight perturbation sample $\check{\omega}_2^s=\{\check{\omega}_{2,1},\check{\omega}_{2,2},\ldots,\check{\omega}_{2,N}\}$ on the second joint in the $p$th iteration are respectively mapped to the high-dimensional space to obtain $[\phi(\check{\omega}_{d,1})\ \phi(\check{\omega}_{d,2})\ \cdots\ \phi(\check{\omega}_{d,N})]$, denoted $\Phi(\check{\omega}_d^s)\in\mathbb{R}^{V\times N}$ ($d\in\{1,2\}$). Next, we seek two vectors $w_1,w_2\in\mathbb{R}^{V\times 1}$ such that the correlation coefficient between the data $u$ and $v$ after the projection of $\phi(\check{\omega}_1)$ and $\phi(\check{\omega}_2)$ is maximized. We have:
$$u=w_1^{T}\phi(\check{\omega}_1),\qquad v=w_2^{T}\phi(\check{\omega}_2). \quad (11)$$
The covariance of u and v is given by:
$$\mathrm{cov}(u,v)=\frac{1}{N-1}\,w_1^{T}\,\Phi(\check{\omega}_1^{s})\,\Phi(\check{\omega}_2^{s})^{T}\,w_2. \quad (12)$$
Obviously, the vectors $w_1$ and $w_2$ lie in the spaces spanned by the data $\Phi(\check{\omega}_1^s)$ and $\Phi(\check{\omega}_2^s)$:
$$w_1=\Phi(\check{\omega}_1^{s})\,\alpha_1,\qquad w_2=\Phi(\check{\omega}_2^{s})\,\alpha_2, \quad (13)$$
where $\alpha_1,\alpha_2\in\mathbb{R}^{N\times 1}$. From Equations (12) and (13), the correlation coefficient $\rho$ can be obtained:
$$\rho=\frac{\alpha_1^{T}\,\Phi^{T}(\check{\omega}_1^{s})\Phi(\check{\omega}_1^{s})\,\Phi^{T}(\check{\omega}_2^{s})\Phi(\check{\omega}_2^{s})\,\alpha_2}{\sqrt{\alpha_1^{T}\left(\Phi^{T}(\check{\omega}_1^{s})\Phi(\check{\omega}_1^{s})\right)^{2}\alpha_1}\,\sqrt{\alpha_2^{T}\left(\Phi^{T}(\check{\omega}_2^{s})\Phi(\check{\omega}_2^{s})\right)^{2}\alpha_2}}=\frac{\alpha_1^{T}K_{\omega_1}K_{\omega_2}\alpha_2}{\sqrt{\alpha_1^{T}K_{\omega_1}K_{\omega_1}\alpha_1}\,\sqrt{\alpha_2^{T}K_{\omega_2}K_{\omega_2}\alpha_2}}. \quad (14)$$
The kernel trick is introduced in Equation (14), where $K_{(\cdot)}$ is a kernel (Gram) matrix. Without loss of generality, we fix the denominator of Equation (14) by requiring $\alpha_1^T K_{\omega_1}K_{\omega_1}\alpha_1=1$ and $\alpha_2^T K_{\omega_2}K_{\omega_2}\alpha_2=1$, and seek $\alpha_1$ and $\alpha_2$ that maximize $\alpha_1^T K_{\omega_1}K_{\omega_2}\alpha_2$. We construct the Lagrange function:
$$L=\alpha_1^{T}K_{\omega_1}K_{\omega_2}\alpha_2-\frac{\lambda_1}{2}\left(\alpha_1^{T}K_{\omega_1}K_{\omega_1}\alpha_1-1\right)-\frac{\lambda_2}{2}\left(\alpha_2^{T}K_{\omega_2}K_{\omega_2}\alpha_2-1\right). \quad (15)$$
Taking the partial derivatives of Equation (15) with respect to $\alpha_1$ and $\alpha_2$ and setting them to zero, we obtain:
$$K_{\omega_2}\alpha_2=\lambda\,K_{\omega_1}\alpha_1,\qquad \lambda=\lambda_1=\lambda_2=\alpha_1^{T}K_{\omega_1}K_{\omega_2}\alpha_2. \quad (16)$$
According to Equation (16), the implicit correlation $\Theta(\check{\omega}_1,\check{\omega}_2)$ between the perturbation vectors of the first and second joints is obtained. When the current $T_r$ is greater than the upper threshold $T_{max}$, $\alpha_1$ and $\alpha_2$ are recorded. Equation (15) can then be solved repeatedly to obtain the implicit correlation between the first joint and the $d$th joint ($d\in\{2,\ldots,D\}$). In this way, we obtain the implicit patterns $\{(\alpha_1,\alpha_2),\ldots,(\alpha_1,\alpha_D)\}$ between the joints.
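One standard way to solve the stationarity conditions above is to eliminate one dual vector and solve a regularized eigenproblem. The RBF kernel, the ridge term `reg`, and the function names below are assumptions (the paper does not specify its kernel or regularization); this is a sketch, not the authors' implementation.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2) of the sample rows."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * d2)

def kcca(K1, K2, reg=1e-3):
    """Solve the KCCA stationarity conditions for (alpha_1, alpha_2).

    Eliminating alpha_2 from the coupled conditions gives the eigenproblem
        (K1 K1 + reg I)^-1 K1 K2 (K2 K2 + reg I)^-1 K2 K1 a1 = rho^2 a1,
    where the small ridge `reg` keeps the Gram matrices invertible.
    Returns alpha_1, alpha_2 (up to scale) and the top correlation rho.
    """
    N = K1.shape[0]
    A = np.linalg.solve(K1 @ K1 + reg * np.eye(N), K1 @ K2)
    B = np.linalg.solve(K2 @ K2 + reg * np.eye(N), K2 @ K1)
    vals, vecs = np.linalg.eig(A @ B)
    top = int(np.argmax(vals.real))
    a1 = vecs[:, top].real
    a2 = B @ a1                       # second stationarity condition, up to scale
    rho = float(np.sqrt(max(vals[top].real, 0.0)))
    return a1, a2, rho
```

As a sanity check, running KCCA on two identical Gram matrices should yield a correlation near 1 (slightly below it because of the ridge).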
If $T_r$ is less than the lower threshold $T_{min}$, the independent exploration of the joints is abandoned, and KCCA prediction is turned on. That is, the first joint explores randomly to obtain a perturbation sample $\check{\omega}_1^s$ and $K_{\omega_1}$ in the $p$th iteration; then $(\alpha_1,\alpha_2)$ is used to calculate $K_{\omega_2}$ of the second joint. Obviously, $\check{\omega}_2$ can be computed from $K_{\omega_2}$, and this guides the exploration of the second joint. The same steps are repeated for the subsequent joints up to the $D$th joint.

4.3. The Combination of KCCA and CMA

We draw the schematic of PI2-CMA-KCCA's control flow in Figure 1; it works like a double closed-loop control system. Without loss of generality, there are $D$ joints and each joint corresponds to $M$ basis functions, so $D\times M$ parameters need to be adjusted.
The outer loop aims at discovering the hidden patterns between the joints that help fulfill the new task. Specifically, we use KCCA to infer the nonlinear correlations between the first joint and the other joints whenever the decline rate is good, recorded as $\{(\alpha_1,\alpha_2),\ldots,(\alpha_1,\alpha_D)\}$. After that, we can compute $\check{\omega}_d$ from $K_{\omega_d}$ based on the perturbation sample $\check{\omega}_1^s$ and $K_{\omega_1}$, and it guides the perturbation of the $d$th joint as exploring noise.
The inner loop extracts heuristic information from the corresponding single joint to guide the search. Specifically, in the previous iteration it evaluates the cost $J_d$ of $K$ roll-outs (sorting and probability averaging), then selects the first $K_e$ elites of the perturbation sample to obtain the covariance $\Sigma_{\omega_d}$ by the CMA algorithm. In the current iteration, $\omega_d\sim N(\check{\omega}_d,\Sigma_{\omega_d})$ is used to explore the policy parameter $\omega_d$. In other words, it applies compound heuristic information, integrating the inferred hidden nonlinear patterns between the joints with the linear patterns within the joint, to automatically determine the exploring proportion for each component of $\omega_d$.
In short, the outer loop discovers and predicts the exploring mean of $\omega_d$ for the $d$th ($d\neq 1$) DOF, and the inner loop discloses and infers the exploring variance of $\omega_d$ for the $d$th ($d\neq 1$) DOF. For the first DOF, vanilla PI2 is employed.
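To make the double-loop idea concrete, here is a self-contained toy sketch: joint 2's optimum depends nonlinearly on joint 1's weights, and the "outer loop" is replaced by an oracle prediction of that dependency (standing in for KCCA inference), while the update itself is a PI2-style probability-weighted average over elite roll-outs. Everything here, from the cost function to the parameter values, is illustrative and not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_cost(w1, w2):
    """Toy task with a hidden inter-joint pattern: joint 1's optimum is the
    all-ones vector, and joint 2's optimum is a nonlinear function of w1."""
    return float(np.sum((w1 - 1.0)**2) + np.sum((w2 - np.tanh(w1))**2))

def iteration(w1, w2, sigma=0.3, K=20, K_e=10, lam=0.1):
    """One guided PI2-style update.  Joint 1 explores with zero-mean noise
    (vanilla PI2); joint 2 explores around a predicted offset derived from
    joint 1 -- the role the KCCA outer loop plays in the paper."""
    eps1 = sigma * rng.standard_normal((K, w1.size))
    mean2 = np.tanh(w1) - w2                 # oracle heuristic exploring mean
    eps2 = mean2 + sigma * rng.standard_normal((K, w2.size))
    costs = np.array([toy_cost(w1 + e1, w2 + e2) for e1, e2 in zip(eps1, eps2)])
    elite = np.argsort(costs)[:K_e]          # keep the K_e cheapest roll-outs
    p = np.exp(-(costs[elite] - costs[elite].min()) / lam)
    p /= p.sum()                             # softmax probability weights
    return w1 + p @ eps1[elite], w2 + p @ eps2[elite]

w1, w2 = np.zeros(3), np.zeros(3)
for _ in range(30):
    w1, w2 = iteration(w1, w2)
```

Starting from zero weights (cost 3.0), the guided search drives the cost down substantially within a few dozen iterations.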

5. Evaluations

The cost function in our experiments is given by:
$$J=0.5\sum_{d=1}^{D}\sum_{i=1}^{N-1}\Bigl[10^{7}\bigl(\ddot{y}^{(d)}(i)\bigr)^{2}+\bigl(a_f^{(d)}(i)\bigr)^{2}\Bigr]+\sum_{d=1}^{D}10^{11}\bigl(y^{(d)}(m)-y_v^{(d)}\bigr)^{2}+\sum_{d=1}^{D}10^{3}\Bigl[\bigl(\dot{y}^{(d)}(N)\bigr)^{2}+\bigl(y^{(d)}(N)-y_g^{(d)}\bigr)^{2}\Bigr]. \quad (17)$$
In Equation (17), $D$ is the number of joints of SCARA and Swayer. $y^{(d)}(i)$, $\dot{y}^{(d)}(i)$, and $\ddot{y}^{(d)}(i)$ denote the position, velocity, and acceleration of the $d$th joint of the robot at time $i$. $y_v^{(d)}$ is the preset point (i.e., via-point) to be passed through by the $d$th joint, and $y^{(d)}(m)$ is the point actually passed through by the $d$th joint at time $m$. $y_g^{(d)}$ is the trajectory's end point of the $d$th joint. $a_f^{(d)}(i)$ is the acceleration of the forcing term of the $d$th joint at time $i$. Equation (17) is a typical quadratic expression of the total reward over the finite horizon of reinforcement learning.
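The cost above can be transcribed directly. The array shapes are illustrative, and the weight exponents (10^7, 10^11, 10^3) follow the extracted text of Equation (17), so they should be read as assumptions of this sketch.

```python
import numpy as np

def via_point_cost(y, yd, ydd, a_f, y_via, m, y_goal):
    """Total cost of the via-point task for one roll-out.

    y, yd, ydd : (D, N) joint positions, velocities, accelerations
    a_f        : (D, N) forcing-term accelerations
    y_via      : (D,) via-point, to be reached at time index m
    y_goal     : (D,) goal position at the final time index
    """
    # running cost: penalize accelerations over time steps 1..N-1
    running = 0.5 * np.sum(1e7 * ydd[:, :-1]**2 + a_f[:, :-1]**2)
    # via-point cost: heavily penalize missing the intermediate point at time m
    via = np.sum(1e11 * (y[:, m] - y_via)**2)
    # terminal cost: penalize residual velocity and distance to the goal
    terminal = np.sum(1e3 * (yd[:, -1]**2 + (y[:, -1] - y_goal)**2))
    return running + via + terminal
```

A trajectory that sits motionless on both the via-point and the goal incurs zero cost, while any via-point miss is penalized strongly.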

5.1. Passing through One Via-Point with SCARA

SCARA has three revolute joints $q_1$, $q_2$, and $q_3$ and one prismatic joint $q_4$. In this experiment, the orientation of the end-effector is ignored; thus, the SCARA arm can be regarded as a planar two-link mechanism.
The experiment consists of four steps: (1) Set the Cartesian coordinates of the starting position of the end-effector to $(20, 0, 0)^T$ cm, with corresponding joint angles $(0, 0, 0)^T$ rad, and set the endpoint to $(4.5, 16, 0)^T$ cm, with corresponding joint angles $(0.7068, 1.1796, 0)^T$ rad. (2) Drive the manipulator based on the minimum-jerk principle to obtain the demonstration. (3) Given a new task, use PI2, PI2-CMA, PI2-KCCA, and PI2-CMA-KCCA, respectively, to make the movement pass through a via-point at $m = 1.8$ s. (4) Iterate each of the four algorithms 80 times to acquire the new motor skill.
In this experiment, each of the four different algorithms is repeated 20 times. The best experimental results of PI2, PI2-CMA, PI2-KCCA, and the five experimental results of PI2-CMA-KCCA are selected for comparison. The optimization effect of PI2-CMA-KCCA is analyzed by comparing the final cost of the system and the trajectories of joint space. It can be clearly seen from Figure 2 that PI2-CMA-KCCA converges faster, and its final cost is the lowest (as shown in Table 2).
Figure 3 shows the movement trajectories of the three joints of the SCARA in the joint space when the number of iterations is 35. Because the joint q 3 has no effect on the position of the end-effector, the value of q 3 in the joint space remains constant.
The blue line in Figure 3 is the trajectory of each joint of the robot arm under the PI2-CMA-KCCA. Figure 3 demonstrates that only the blue lines pass through an intermediate via-point.

5.2. Passing through Two Via-Point with SCARA

This experimental procedure is similar to that of Section 5.1, requiring the end-effector to pass through the point $(18.2, 8.1, 0)^T$ cm at $m = 1.8$ s and the point $(12.0, 13.5, 0)^T$ cm at $m = 3.6$ s. The four algorithms are again each repeated 20 times, and the best results of PI2, PI2-CMA, and PI2-KCCA and the five results of PI2-CMA-KCCA are selected for comparison. Figure 4 demonstrates that PI2-CMA-KCCA has better learning performance than the other algorithms. After 80 iterations, the final costs of the four algorithms are given in Table 3.
Figure 5 shows the trajectories of the three joints of the SCARA in the joint space when the number of iterations is 50, where the blue lines represent the joints' trajectories under PI2-CMA-KCCA. It demonstrates that, after 50 iterations, the blue trajectories pass through the two intermediate via-points at t = 1.8 s and t = 3.6 s accurately.

5.3. Passing Through One Via-Point with Swayer

Swayer is a lightweight collaborative robot created by Rethink Robotics with seven joints. In this subsection, $q_i$ denotes the $i$th joint of Swayer. Because $q_7$ has no effect on the Cartesian position of the end-effector, only six joints are considered. This experiment uses the ROS platform on Ubuntu 16 to control the movement trajectory of the Swayer manipulator and validates the effectiveness of PI2-CMA-KCCA.
First, we set an arbitrary starting point $(697, 159, 514)^T$ mm in the workspace of Swayer and an arbitrary endpoint $(300, 569, 65)^T$ mm. We then drag Swayer to record a demonstration from the starting point to the endpoint and choose a reachable point $(840, 248, 595)^T$ mm far from the demonstration as the new task at $m = 1.2$ s. Finally, PI2, PI2-CMA, PI2-KCCA, and PI2-CMA-KCCA are each repeated 20 times on Swayer. The five results of PI2-CMA-KCCA and the best results of PI2, PI2-CMA, and PI2-KCCA are selected for analysis. Figure 6 shows the downward trend of the cost of the four algorithms, and the final costs after 80 iterations are given in Table 4.
Figure 6 illustrates that, during reinforcement learning, the learning efficiency of PI2-CMA-KCCA is higher for the same number of iterations. At the same time, Figure 7 shows that Swayer passes through the intermediate via-point accurately at $m = 1.2$ s when the number of iterations is 60.
Figure 7 shows the trajectories of joint space in Swayer. The blue lines represent the joints’ trajectories after PI2-CMA-KCCA learning. The result shows that only the blue lines can pass through the via-point accurately after about 60 iterations.

5.4. Performance Comparison of Four Algorithms

As shown in Table 5, the cost decline rate of PI2-CMA-KCCA is always higher than that of PI2, PI2-CMA, and PI2-KCCA. For the same experimental object, the more complicated the new task, the more the learning performance of PI2-CMA-KCCA exceeds that of the other three algorithms. Moreover, for the same task, the greater the number of degrees of freedom of the experimental object, the larger PI2-CMA-KCCA's advantage, because it not only adjusts the magnitude of the exploration noise automatically but also exploits the nonlinear heuristic information between the joints.

6. Conclusions

Compound heuristic information is applied in this paper to guide PI2's variational exploration and expedite the reinforcement learning procedure. This information is derived by KCCA and CMA together: KCCA infers the nonlinear heuristic information between the joints of the robot, and CMA infers the linear heuristic information within a single DOF. With maximum likelihood estimation on the roll-out data, this information can make the cost function drop rapidly. The proposed algorithm, PI2-CMA-KCCA, works like a double closed-loop control system, in which the outer loop discovers and predicts the means of the exploring vectors for each DOF and the inner loop discloses and infers their variances. In this way, the new algorithm can quickly search for the optimal parameters of new tasks. The experimental results on SCARA and Swayer demonstrate that the algorithm speeds up the parameter-update process while maintaining the accuracy of completing new tasks, making it suitable for objects with many degrees of freedom and for more complex tasks.

Author Contributions

Methodology, J.F.; software, C.L.; validation, C.L., X.T.; formal analysis, J.F.; investigation, F.L.; resources, X.T.; data curation, B.L.; writing—original draft preparation, C.L.; writing—review and editing, J.F.; visualization, C.L.; supervision, J.F.; project administration, J.F.; funding acquisition, J.F. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China under Grants 61773299, 51575412.


Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Siciliano, B.; Khatib, O. Springer Handbook of Robotics; Springer: Berlin/Heidelberg, Germany, 2016; pp. 987–1008. [Google Scholar]
  2. Takano, W.; Nakamura, Y. Statistical mutual conversion between whole body motion primitives and linguistic sentences for human motions. Int. J. Robot. Res. 2015, 34, 1314–1328. [Google Scholar]
  3. Paraschos, A.; Daniel, C.; Peters, J.; Neumann, G. Probabilistic movement primitives. In Proceedings of the 27th Annual Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 5–10 December 2013; pp. 2616–2624. [Google Scholar]
  4. Khansari-Zadeh, S.M.; Billard, A. BM: An iterative algorithm to learn stable nonlinear dynamical systems with Gaussian mixture models. In Proceedings of the 2010 IEEE International Conference on Robotics and Automation, Anchorage, AK, USA, 3–7 May 2010; pp. 2381–2388. [Google Scholar]
  5. Ratliff, N.D.; Bagnell, J.A.; Zinkevich, M.A. Maximum margin planning. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA, 25–29 June 2006; pp. 729–736. [Google Scholar]
  6. Sutton, R.S.; Barto, A.G. Reinforcement Learning; A Bradford Book; The MIT Press: Cambridge, MA, USA; London, UK, 1998; pp. 665–685. [Google Scholar]
  7. Peters, J.; Schaal, S. Natural Actor-Critic. Neurocomputing 2008, 71, 1180–1190. [Google Scholar]
  8. Kober, J.; Peters, J. Learning motor primitives for robotics. In Proceedings of the 2009 IEEE International Conference on Robotics and Automation, Kobe, Japan, 12–17 May 2009; pp. 2112–2118. [Google Scholar]
  9. Peters, J.; Altun, Y. Relative entropy policy search. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, Atlanta, GA, USA, 11–15 July 2010; pp. 1607–1612. [Google Scholar]
  10. Theodorou, E.; Buchli, J.; Schaal, S. A Generalized Path Integral Control Approach to Reinforcement Learning. J. Mach. Learn. Res. 2010, 11, 3137–3181. [Google Scholar]
  11. Stulp, F.; Sigaud, O. Path integral policy improvement with covariance matrix adaptation. In Proceedings of the 29th International Conference on Machine Learning, Edinburgh, UK, 26 June–1 July 2012; pp. 281–288. [Google Scholar]
  12. Fu, J.; Teng, X.; Cao, C.; Lou, P. Intelligent trajectory planning based on reinforcement learning with KCCA inference for robot. J. Huazhong Univ. Sci. Technol. (Nat. Sci. Ed.) 2019, 47, 96–102. [Google Scholar]
  13. Ijspeert, A.J.; Nakanishi, J.; Hoffmann, H.; Pastor, P.; Schaal, S. Dynamical Movement Primitives: Learning Attractor Models for Motor Behaviors. Neural Comput. 2013, 25, 328–373. [Google Scholar]
  14. Melzer, T.; Reiter, M.; Bischof, H. Appearance models based on kernel canonical correlation analysis. Pattern Recognit. 2003, 36, 1961–1971. [Google Scholar]
Figure 1. A typical system architecture for PI2-CMA-KCCA.
Figure 2. Cost of one via-point task.
Figure 3. One via-point task when the number of iterations is 30.
Figure 4. Cost of two via-point task.
Figure 5. Two via-point task when the number of iterations is 50.
Figure 6. Cost of one via-point task.
Figure 7. One via-point task when the number of iterations is 60.
Table 1. Summary of the notation frequently used in the article.
Symbol | Definition
$\ddot{y}_t, \dot{y}_t, y_t$ | The desired acceleration, velocity, and position of the joint
$h(x_t)$ | Fitting function; $x_t$ is a phase variable
$P_{\cdot,k,i}$ | The probability-weighted value of the $k$th trajectory of the $d$th DOF at time $i$
$\delta\omega_{d,i}$ | The $d$th joint's correction weight vector at time $i$
$\delta\omega_d$ | Average value of $\delta\omega_{d,i}$ over time
$\omega_d$ | The $d$th joint's weight vector
$\check{\omega}_d$ | The $d$th joint's weight vector with perturbation
$\tilde{\omega}$ | $[\check{\omega}_1^T, \check{\omega}_2^T, \ldots, \check{\omega}_D^T]^T$
$\phi(\cdot)$ | Mapping function that maps a variable to $V$-dimensional feature space
$\check{\omega}_d^s$ | The weight perturbation sample of the $d$th joint, i.e., $\check{\omega}_d^s=\{\check{\omega}_{d,1},\check{\omega}_{d,2},\ldots,\check{\omega}_{d,N}\}$
$J_d$ | The $d$th joint's cost for the given task
$J^p$ | The total cost of the $D$ joints at the $p$th iteration
$K(\cdot)$ | Kernel matrix
$T_r$ | Decline rate of the cost between $J^{p-1}$ and $J^p$
$\{J_d,\delta\omega_{d,i}\}_K$ | $J_d$ and $\delta\omega_{d,i}$ calculated from $K$ roll-outs
$\{J_d,\delta\omega_{d,i}\}_{K_e}$ | $J_d$ and $\delta\omega_{d,i}$ calculated from the $K_e$ elite roll-outs, obtained by sorting the $K$ roll-outs by cost
Table 2. The final cost of one via-point task with SCARA.
Algorithm | Final Cost
PI2 | 2.2341 × 10^8
PI2-CMA | 1.7569 × 10^8
PI2-KCCA | 1.4089 × 10^8
PI2-CMA-KCCA (test 1) | 1.0249 × 10^8
PI2-CMA-KCCA (test 2) | 1.1095 × 10^8
PI2-CMA-KCCA (test 3) | 1.0677 × 10^8
PI2-CMA-KCCA (test 4) | 1.2106 × 10^8
PI2-CMA-KCCA (test 5) | 1.0155 × 10^8
Table 3. The final cost of two via-point task with SCARA.
Algorithm | Final Cost
PI2 | 7.1203 × 10^9
PI2-CMA | 5.5471 × 10^9
PI2-KCCA | 4.4999 × 10^9
PI2-CMA-KCCA (test 1) | 2.7410 × 10^9
PI2-CMA-KCCA (test 2) | 3.3489 × 10^9
PI2-CMA-KCCA (test 3) | 3.4429 × 10^9
PI2-CMA-KCCA (test 4) | 3.1620 × 10^9
PI2-CMA-KCCA (test 5) | 3.0731 × 10^9
Table 4. The final cost of one via-point task with Swayer.
Algorithm | Final Cost
PI2 | 7.1889 × 10^10
PI2-CMA | 5.0168 × 10^10
PI2-KCCA | 4.5494 × 10^10
PI2-CMA-KCCA (test 1) | 2.2861 × 10^10
PI2-CMA-KCCA (test 2) | 2.2581 × 10^10
PI2-CMA-KCCA (test 3) | 2.2641 × 10^10
PI2-CMA-KCCA (test 4) | 2.2554 × 10^10
PI2-CMA-KCCA (test 5) | 2.2513 × 10^10
Table 5. Average decline rate of four algorithms.
Task | PI2 | PI2-CMA | PI2-KCCA | PI2-CMA-KCCA
One Via-Point Task with SCARA | 96.5% | 97.2% | 97.8% | 98.5%
Two Via-Point Task with SCARA | 84.0% | 87.4% | 89.7% | 92.8%
One Via-Point Task with Swayer | 73.0% | 81.3% | 83.0% | 91.5%

Citation: Fu, J.; Li, C.; Teng, X.; Luo, F.; Li, B. Compound Heuristic Information Guided Policy Improvement for Robot Motor Skill Acquisition. Appl. Sci. 2020, 10, 5346. https://doi.org/10.3390/app10155346