Compound Heuristic Information Guided Policy Improvement for Robot Motor Skill Acquisition

Abstract: Discovering implicit patterns and using them as heuristic information to guide the policy search is one of the core factors in speeding up robot motor skill acquisition. This paper proposes a compound heuristic information guided reinforcement learning algorithm, PI2-CMA-KCCA, for policy improvement. Its structure and workflow resemble a double closed-loop control system. The outer loop, realized by Kernel Canonical Correlation Analysis (KCCA), infers the implicit nonlinear heuristic information between the joints of the robot, while the inner loop, operated by Covariance Matrix Adaptation (CMA), discovers the hidden linear correlations between the basis functions within each joint. These patterns, which benefit learning the new task, automatically determine the mean and variance of the exploration perturbation for Path Integral Policy Improvement (PI2). Compared with classical PI2, PI2-CMA, and PI2-KCCA, PI2-CMA-KCCA not only endows the robot with the ability to transfer trajectory planning from the demonstration to a new task, but also completes the transfer more efficiently. Classical via-point experiments on SCARA and Swayer robots validate that the proposed method converges quickly and finds a solution for the new task.


Introduction
Imitation learning (IL) and reinforcement learning (RL) [1] have long been hot topics in the field of robot skill acquisition. Imitation learning can be divided into two categories: behavioral cloning (BC) and inverse reinforcement learning (IRL). BC learns the expected policy directly from expert demonstration data, while IRL learns the policy indirectly through a reward function. Hidden Markov Models (HMM) [2], Dynamic Movement Primitives (DMPs), Probabilistic Movement Primitives (ProMPs) [3], Dynamic Systems (DS) [4], and Cross Entropy Regression (CER) are popular behavioral cloning methods. The most frequently used IRL methods are Maximum Margin Planning (MMP) [5] and Markov Processes (MP).
Imitation learning is limited because the robot learns only from demonstrated trajectories. When the reproduction environment differs from the demonstration environment, or deviates from it substantially, such as when an obstacle is placed on the robot's path, imitation learning may fail. RL, by contrast, allows the robot to find a new control policy by exploring the state-action space freely. Combining IL and RL aims to use the advantages of both methods to overcome their respective shortcomings, so that the robot can adapt to deviations from the demonstrated behavior and thereby improve its performance.

Dynamic Movement Primitives
Dynamic Movement Primitives (DMPs) have been used in many disciplines to model complex behaviors. In this paper, we use DMPs as the underlying policy representation; each joint of the robot is modeled by one DMP. The DMP [13] consists of a damped spring system and a learnable nonlinear forcing term, by which the desired behavior of the joint can be obtained. Its expression is given by

ζ ÿ_t = α_y (β_y (g − y_t) − ẏ_t) + h(x_t),   h(x_t) = ( Σ_{i=1}^{M} Ψ_i(x_t) ω_i / Σ_{i=1}^{M} Ψ_i(x_t) ) x_t (g − y_0),

where, if the forcing term h(x_t) = 0, the remaining system is a globally stable second-order damped spring system. ζ is a time constant and acts as a proportional coefficient of the duration of the motion. g is the target position, y_0 is the initial position, and the variable y_t is interpreted as the desired position of the joint. h(x_t) is a fitting function, and x_t can be conceived of as a phase variable that decays monotonically from 1 to 0. Ψ_i(x) represents the ith basis function Ψ_i(x) = exp(−(x_t − c_i)²/2σ_i²), where c_i and σ_i are constants that determine the center and width of the basis function, respectively. ω_i is the weight corresponding to the ith basis function, M is the number of exponential basis functions, and ω ∈ R^{M×1} is the weight vector. α_y and β_y are chosen so that the damped spring system is second-order critically damped.
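To make the policy representation concrete, a DMP roll-out can be sketched as follows. This is a hedged illustration, not the authors' implementation: the gain α_y = 25 (with β_y = α_y/4 for critical damping), the basis placement along the phase, the canonical decay constant, and the Euler integration are all our assumptions.

```python
import numpy as np

def dmp_rollout(w, y0, g, zeta=1.0, alpha_y=25.0, alpha_x=4.0, dt=0.002, T=1.0):
    """Integrate one DMP: zeta*y_ddot = alpha_y*(beta_y*(g - y) - y_dot) + h(x)."""
    M = len(w)
    beta_y = alpha_y / 4.0                         # second-order critical damping
    c = np.exp(-alpha_x * np.linspace(0, 1, M))    # basis centers along the phase
    sigma = np.full(M, 0.3)                        # basis widths (assumed)
    y, z, x = y0, 0.0, 1.0                         # position, scaled velocity, phase
    traj = []
    for _ in range(int(T / dt)):
        psi = np.exp(-(x - c) ** 2 / (2 * sigma ** 2))
        # normalized weighted sum of basis functions, gated by phase and amplitude
        h = (psi @ w) / (psi.sum() + 1e-10) * x * (g - y0)
        zdot = alpha_y * (beta_y * (g - y) - z) + h
        z += zdot * dt / zeta
        y += z * dt / zeta
        x += -alpha_x * x * dt / zeta              # canonical system decays 1 -> 0
        traj.append(y)
    return np.array(traj)
```

With all weights zero, only the spring-damper term acts, so the trajectory converges smoothly to the goal g; the learned forcing term h(x_t) then shapes the transient without affecting this guarantee.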
The parameters ω_i of the DMP are learned by the Locally Weighted Regression (LWR) algorithm. The algorithm finds the corresponding ω_i for each basis function Ψ_i by minimizing the cost function S, defined by

S_i = Σ_j Ψ_i(x_j) (h(x_j) − ω_i x_j)²,

where (x_j, h(x_j)) is the jth sample. Obviously, the ω_i cooperate with each other in linear combination patterns because of the introduction of the basis functions; thus, we can optimize them by means of linear correlation techniques. A summary of the notation frequently used in this article is given in Table 1.
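Because each basis function defines an independent weighted least-squares problem, every ω_i has a closed form. The sketch below is hedged: the choice of x_j as the local regressor follows the standard DMP formulation and is our assumption, not necessarily the authors' exact setup.

```python
import numpy as np

def lwr_fit(x, h_target, c, sigma):
    """Fit one weight per basis: minimize S_i = sum_j Psi_i(x_j)*(h(x_j) - w_i*x_j)^2."""
    w = np.zeros(len(c))
    for i in range(len(c)):
        psi = np.exp(-(x - c[i]) ** 2 / (2 * sigma[i] ** 2))
        # closed-form minimizer of the locally weighted least-squares cost S_i
        w[i] = (psi * x * h_target).sum() / ((psi * x * x).sum() + 1e-10)
    return w
```

As a sanity check, if the target is exactly proportional to the regressor (h = 2x), every local fit recovers the same slope.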
Table 1. Frequently used notation (recovered entries).
φ(·)  Mapping function that maps a variable to V-dimensional space
ω̃_d^s  The weight perturbation sample of the dth joint, i.e., {ω̃_{d,1}, · · · , ω̃_{d,N}}
J_d  The dth joint's cost for the given task

Path Integral Policy Improvement with Covariance Matrix Adaptation
PI2 is derived from the first principles of stochastic optimal control. What sets PI2 apart from other direct policy improvement algorithms is its use of probability-weighted averaging to perform the parameter update, rather than an estimate of the gradient. In Section 2, we obtain the weight vector ω by parameterizing the demonstration trajectory. The idea of PI2 is to add stochastic exploration noise ε_d (ε_d ∼ N(0, σ²)) to ω_d and to generate K roll-outs with different costs by executing the parameterized policy. For a robot with D DOF, the cost function of the kth roll-out at time i is

S(τ_{·,i,k}) = Σ_{d=1}^{D} ( φ_{d,N,k} + Σ_{j=i}^{N−1} q_{d,j,k} + (1/2) Σ_{j=i}^{N−1} (ω_d + M_{d,j,k} ε_{d,j,k})^T R (ω_d + M_{d,j,k} ε_{d,j,k}) ),

where τ_{·,i,k} is a sample path (or trajectory piece) and the dot subscript indicates all DOF. φ_{d,N,k} represents the terminal reward of the kth trajectory of the dth DOF. q_{d,j,k} is the immediate reward of the kth trajectory of the dth DOF at time j; specifically, it is expressed as ÿ_t² (the square of the joint's acceleration). M_{d,j,k} is the mapping matrix of the kth trajectory of the dth DOF at time j. R is the positive semi-definite weight matrix of the quadratic control cost. N is the maximum value of the time index. Next, the exploration is evaluated: the costs of the obtained trajectories are sorted, K_e elite samples are selected, and probability-weighted averaging is performed to obtain δω_{d,i} for the dth DOF at time i:

δω_{d,i} = Σ_{k=1}^{K_e} P(τ_{·,i,k}) ε_{d,i,k},

where P(τ_{·,i,k}) is the probability weight of the kth trajectory of the dth DOF at time i, obtained by a softmax transformation of the cost function:

P(τ_{·,i,k}) = exp(−S(τ_{·,i,k})/λ) / Σ_{k'=1}^{K_e} exp(−S(τ_{·,i,k'})/λ),

where λ is an appropriate constant. Later, δω_{d,i} is averaged over time to obtain δω_d and hence the new weight (ω_d^new = ω_d^old + δω_d). By searching the parameter space iteratively, PI2 eventually finds a solution for the new task. Classical PI2 only updates the mean ω; the covariance σ² is a constant (σ² = λ_init I_M), where M is the number of basis functions and λ_init determines the magnitude of the initial exploration noise.
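The elite selection and probability-weighted averaging above can be sketched for a single joint. This is a hedged sketch: the heuristic of scaling λ by the elite cost range is a common practical choice, not necessarily the one used in the paper.

```python
import numpy as np

def pi2_update(S, eps, K_e):
    """PI2-style update for one joint.
    S: (K,) roll-out costs; eps: (K, M) weight perturbations; returns delta_omega."""
    elite = np.argsort(S)[:K_e]                 # keep the K_e lowest-cost roll-outs
    S_e, eps_e = S[elite], eps[elite]
    lam = (S_e.max() - S_e.min()) / 10.0 + 1e-10  # assumed lambda heuristic
    P = np.exp(-(S_e - S_e.min()) / lam)        # softmax transformation of the costs
    P /= P.sum()
    return P @ eps_e                            # delta_omega = sum_k P_k * eps_k
```

Low-cost roll-outs dominate the weighted average, so the mean moves toward perturbations that performed well, without ever estimating a gradient.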
PI2-CMA aims to determine the magnitude of the exploration noise automatically and to infer the implicit linear correlations between the basis functions within a joint of the robot. In other words, the exploration noise is adapted as ε_d ∼ N(0, Σ_ω̃d), and the covariance update equation is

Σ_ω̃d^new = Σ_{k=1}^{K_e} P(τ_{·,k}) ε_{d,k} ε_{d,k}^T.

As we can see, vanilla PI2-CMA only infers the linear correlations of the weights within each DMP independently.
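The covariance adaptation can be sketched directly from the update equation: the new exploration covariance is the probability-weighted outer product of the perturbations, so directions that kept appearing in low-cost roll-outs are explored more. The λ heuristic and the small regularizer are our assumptions.

```python
import numpy as np

def cma_covariance(S, eps, lam_frac=10.0):
    """CMA-style covariance from roll-out costs S (K,) and perturbations eps (K, M)."""
    lam = (S.max() - S.min()) / lam_frac + 1e-10
    P = np.exp(-(S - S.min()) / lam)            # same softmax weights as the mean update
    P /= P.sum()
    # Sigma_new = sum_k P_k * eps_k eps_k^T  (a probability-weighted outer product)
    Sigma = np.einsum('k,ki,kj->ij', P, eps, eps)
    return Sigma + 1e-8 * np.eye(eps.shape[1])  # keep the matrix positive definite
```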

PI2-CMA with Kernel Canonical Correlation Analysis
From Equation (3), it can be seen that the perturbation ε of PI2 is generated with equal probability for each joint's weight vector ω_d. PI2-CMA takes into account the implicit linear correlations between the basis functions of a joint and automatically updates the covariance. However, when a task is assigned to a multi-joint robot, there exists an unknown hidden task-oriented pattern between the perturbed weight vectors ω̃_{d_i} and ω̃_{d_j} (ω̃_{d_k} = ω_{d_k} + δω_{d_k}, k ∈ {i, j}). The pattern can be inferred from experience and expressed as a nonlinear correlation. Specifically, we can apply Kernel Canonical Correlation Analysis (KCCA) to infer the implicit nonlinear heuristic information between the joints of the robot, which guides PI2's policy search for the new trajectory planning task. In this paper, we not only consider the linear correlations between the components of the perturbation vector within a DMP but also take the nonlinear correlations of the perturbation vectors between DMPs into account to expedite the robot's learning procedure.

Nonlinear Correlation Heuristic Information
According to Equations (3) and (4), the perturbation is generated with equal probability for each joint's weight vector; in other words, there is no correlation between the joints. Therefore, the stacked perturbed weight vector is denoted as ω̃, and the covariance matrix of ω̃ is given by

Σ_ω̃ = diag[σ_1² I_M, · · · , σ_D² I_M],

where diag[·] represents a (block) diagonal matrix and σ_1² = · · · = σ_D² = σ². When a multi-joint robot completes a task, we believe that there are hidden patterns between the joints. The implicit patterns between the perturbation vectors of the multi-joint robot are expressed as nonlinear correlations. With the help of the kernel method, the dth joint's perturbation vector ω̃_d ∈ R^{M×1} is mapped to the high-dimensional feature space φ(ω̃_d) ∈ R^{V×1}; ω̃ is likewise mapped to φ(ω̃) ∈ R^{DV×1}, and the covariance matrix of φ(ω̃) can be described as

Σ_φ(ω̃) = [Γ(ω̃_{d_i}, ω̃_{d_j})]_{d_i, d_j = 1, ··· , D},   (9)

where Γ(ω̃_{d_i}, ω̃_{d_j}) is the covariance of ω̃_{d_i} and ω̃_{d_j} projected onto the high-dimensional space, expressing the implicit nonlinear pattern between the perturbation vectors of the d_i th joint and the d_j th joint. An intuitive policy is to estimate Σ_φ(ω̃) from empirical samples and then use Γ(ω̃_{d_i}, ω̃_{d_j}) as heuristic information: given the perturbation vector ω̃_{d_i}, the exploration of ω̃_{d_j} is guided by this covariance. However, this covariance lives in a single unified coordinate system (that of φ(ω̃)), and analyzing the correlation between ω̃_{d_i} and ω̃_{d_j} in a uniform coordinate system is not the best choice: after the perturbation vectors of different joints are mapped to a high-dimensional space, the correlation coefficients between the joints can be maximized only after proper projection transformations, and such projection matrices are usually not identical.
Here, the generalized Rayleigh quotient is used to find the maximum correlation coefficient of φ(ω̃_{d_i}) and φ(ω̃_{d_j}), and Maximum Likelihood Estimation is used to infer the nonlinear correlation between them; the projected covariance Θ(ω̃_{d_i}, ω̃_{d_j}) serves as the heuristic information, where P_r is the projection operator. That is, given the perturbation vector ω̃_{d_i}, ω̃_{d_j} can be obtained from the covariance Θ(ω̃_{d_i}, ω̃_{d_j}). Equation (9) can now be expressed as follows:

Robot Intelligent Trajectory Inference with KCCA
KCCA [14] is a nonlinear correlation analysis method. In this paper, we apply KCCA to the elite samples, relating the robot's first joint to each of the other joints, and use it for heuristic inference. After the pth iteration of PI2-CMA, the program records the total reward J_p(τ) (one episodic sample) based on the current ω̃^p_{1,··· ,D} (the perturbation vectors of the D joints). Then, J_p(τ) is compared with the total reward J_{p−1}(τ) based on ω̃^{p−1}_{1,··· ,D} at the (p−1)th iteration to obtain a decline rate T_r. If T_r is greater than its upper threshold T_max, T_r is updated to T_max, and the program executes KCCA learning. At this time, the weight perturbation sample ω̃^s_1 = {ω̃_{1,1}, ω̃_{1,2}, · · · , ω̃_{1,N}} of the first joint and the weight perturbation sample ω̃^s_2 = {ω̃_{2,1}, ω̃_{2,2}, · · · , ω̃_{2,N}} of the second joint in the pth iteration are respectively mapped to high-dimensional space to obtain Φ(ω̃^s_1) and Φ(ω̃^s_2). Next, we find two vectors w_1, w_2 ∈ R^{V×1} such that the correlation coefficient between the data u and v obtained by projecting φ(ω̃_1) and φ(ω̃_2) is maximized:

u = w_1^T φ(ω̃_1),   v = w_2^T φ(ω̃_2).

The covariance of u and v is given by Cov(u, v) = E[uv].   (12)

Obviously, the vectors w_1 and w_2 lie in the spaces spanned by the data Φ(ω̃^s_1) and Φ(ω̃^s_2):

w_1 = Φ(ω̃^s_1) α_1,   w_2 = Φ(ω̃^s_2) α_2,   (13)

where α_1, α_2 ∈ R^{N×1}. From Equations (12) and (13), the correlation coefficient ρ can be obtained:

ρ = α_1^T K_ω̃1 K_ω̃2 α_2 / ( sqrt(α_1^T K_ω̃1² α_1) · sqrt(α_2^T K_ω̃2² α_2) ).   (14)

The kernel method is introduced in Equation (14), where K(·) is a kernel matrix. Without loss of generality, we fix the denominator of Equation (14) to 1 and seek α_1 and α_2 that maximize α_1^T K_ω̃1 K_ω̃2 α_2. We construct the Lagrange function

L(α_1, α_2) = α_1^T K_ω̃1 K_ω̃2 α_2 − (λ_1/2)(α_1^T K_ω̃1² α_1 − 1) − (λ_2/2)(α_2^T K_ω̃2² α_2 − 1).   (15)

Taking partial derivatives of Equation (15) with respect to α_1 and α_2 and setting them to zero, we obtain

K_ω̃1 K_ω̃2 α_2 = λ_1 K_ω̃1² α_1,   K_ω̃2 K_ω̃1 α_1 = λ_2 K_ω̃2² α_2.   (16)

According to Equation (16), the implicit correlation P_ca(Θ(ω̃_1, ω̃_2)) between the perturbation vectors of the first and second joints is obtained. When the current T_r is greater than the upper threshold T_max, α_1 and α_2 are recorded.
Furthermore, the procedure of Equations (15) and (16) can be repeated to obtain the implicit correlation between the first joint and the dth joint (d ∈ {2, · · · , D}). In this way, we obtain the implicit patterns {(α_1, α_2), · · · , (α_1, α_D)} between the joints.
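The KCCA learning step can be sketched numerically by solving the regularized eigenproblem that follows from Equation (16). This is a hedged sketch: the RBF kernel, its width, the centering, and the regularization constant are our assumptions, and we solve for α_1 via the standard reduced eigenproblem rather than the coupled system directly.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Gram matrix of an RBF kernel over the rows of X."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kcca(W1, W2, gamma=1.0, reg=1e-3):
    """W1, W2: (N, M) perturbation samples of two joints.
    Returns alpha_1, alpha_2 and the top canonical correlation rho."""
    N = W1.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N            # centering matrix
    K1 = H @ rbf_kernel(W1, gamma) @ H
    K2 = H @ rbf_kernel(W2, gamma) @ H
    I = np.eye(N)
    # alpha_1 solves (K1 + reg I)^-1 K2 (K2 + reg I)^-1 K1 alpha_1 = rho^2 alpha_1
    A = np.linalg.solve(K1 + reg * I, K2) @ np.linalg.solve(K2 + reg * I, K1)
    vals, vecs = np.linalg.eig(A)
    top = np.argmax(vals.real)
    alpha1 = vecs[:, top].real
    alpha2 = np.linalg.solve(K2 + reg * I, K1 @ alpha1)  # back-substitute for alpha_2
    return alpha1, alpha2, float(np.sqrt(max(vals[top].real, 0.0)))
```

When the two joints' samples carry the same pattern, the recovered canonical correlation approaches 1, which is exactly the regime in which the (α_1, α_d) pairs are worth recording.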
If T_r is less than the lower threshold T_min, independent exploration of the joints is abandoned and KCCA prediction is turned on. That is, the first joint explores randomly to obtain a perturbation sample ω̃^s_1 and K_ω̃1 in the pth iteration; then (α_1, α_2) is used to calculate K_ω̃2 for the second joint. Obviously, ω̃_2 can be computed from K_ω̃2, and this guides the exploration of the second joint. The same steps are repeated for the subsequent joints up to the Dth joint.

The Combination of KCCA and CMA
We draw the schematic of PI2-CMA-KCCA's control flow in Figure 1; it works like a double closed-loop control system. Without loss of generality, there are D joints, and each joint corresponds to M basis functions, so D × M parameters need to be adjusted.
The outer loop aims at discovering the hidden patterns between the joints that help fulfill the new task. Specifically, we use KCCA to infer the nonlinear correlations between the first joint and the other joints whenever the decline rate is good, recorded as {(α_1, α_2), · · · , (α_1, α_D)}. After that, we can compute ω̃_d from K_ω̃d based on the perturbation sample ω̃^s_1 and K_ω̃1, and this guides the perturbation of the dth joint as exploration noise.
The inner loop extracts heuristic information from the corresponding single joint to guide the search. Specifically, in the previous iteration, it evaluates the cost J_d of K roll-outs (sorting and probability-weighted averaging), then selects the first K_e elites of the perturbation sample to obtain the covariance Σ_ω̃d with the CMA algorithm. In the current iteration, ω_d ∼ N(ω_d, Σ_ω̃d) is used to explore the policy parameter ω_d. In other words, the algorithm applies compound heuristic information, integrating the inferred hidden nonlinear patterns between the joints with the linear patterns within each joint, to automatically determine the exploration proportion for each component of ω_d. In short, the outer loop discovers and predicts the exploration mean of ω_d for the dth (d ≠ 1) DOF, and the inner loop discloses and infers the exploration variance of ω_d for the dth (d ≠ 1) DOF. For the 1st DOF, vanilla PI2 is employed.
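The double closed-loop structure can be sketched end to end on a toy two-joint problem. Everything here is a hedged stand-in for the paper's setup: the quadratic cost, the fixed sign-flip map that replaces the learned KCCA prediction of joint 2's exploration mean from joint 1's perturbation, the blending factor in the covariance update, and all constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, iters = 5, 10, 60
w = [np.zeros(M), np.zeros(M)]                 # two joints' weight vectors
Sigma = [0.5 * np.eye(M), 0.5 * np.eye(M)]     # per-joint exploration covariances
target = [np.ones(M), -np.ones(M)]             # toy "new task" in weight space

def cost(w1, w2):
    return ((w1 - target[0]) ** 2).sum() + ((w2 - target[1]) ** 2).sum()

initial = cost(w[0], w[1])
for _ in range(iters):
    # joint 1 explores on its own (vanilla PI2-style noise)
    eps1 = rng.multivariate_normal(np.zeros(M), Sigma[0], K)
    # outer loop stand-in: joint 2's exploration mean is predicted from joint 1
    # (a fixed sign-flip map replaces the learned KCCA map here)
    eps2 = -eps1 + 0.1 * rng.multivariate_normal(np.zeros(M), Sigma[1], K)
    S = np.array([cost(w[0] + e1, w[1] + e2) for e1, e2 in zip(eps1, eps2)])
    lam = (S.max() - S.min()) / 10.0 + 1e-10
    P = np.exp(-(S - S.min()) / lam)            # softmax over roll-out costs
    P /= P.sum()
    w[0] = w[0] + P @ eps1                      # probability-weighted mean updates
    w[1] = w[1] + P @ eps2
    # inner loop: CMA-style covariance adaptation, blended for stability
    for d, eps in ((0, eps1), (1, eps2)):
        Sigma[d] = (0.7 * Sigma[d]
                    + 0.3 * np.einsum('k,ki,kj->ij', P, eps, eps)
                    + 1e-6 * np.eye(M))

final = cost(w[0], w[1])
```

Because the assumed inter-joint map matches the task structure (the two targets are sign-flipped copies), guiding joint 2's exploration from joint 1's perturbation lets both joints improve together, which mirrors the role the learned KCCA patterns play in the full algorithm.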

Evaluations
The cost function in our experiments is given by Equation (17), where D is the number of joints of the SCARA and Swayer robots, and y^(d)(i), ẏ^(d)(i), ÿ^(d)(i) denote the position, velocity, and acceleration of the dth joint of the robot at time i.

Passing through One Via-Point with SCARA
SCARA has three revolute joints q 1 , q 2 , q 3 and one prismatic joint q 4 . In this experiment, the orientation of the end-effector of the robot arm is ignored, thus the SCARA robot arm can be regarded as a planar two-link mechanism.
The experiment is divided into the following four steps: (1) Set the Cartesian coordinates of the starting position of the end-effector to (20, 0, 0)^T cm, with corresponding joint angles (0, 0, 0)^T rad. Set the endpoint to (4.5, 16, 0)^T cm, with corresponding joint angles (0.7068, 1.1796, 0)^T rad.
(2) Drive the manipulator based on the minimum-jerk principle to obtain the demonstration.
(3) Given a new task, use PI2, PI2-CMA, PI2-KCCA, and PI2-CMA-KCCA, respectively, to make the movement pass through a via-point at m = 1.8 s. (4) Each of the four algorithms is iterated 80 times to acquire the new motor skill.
In this experiment, each of the four algorithms is repeated 20 times. The best experimental results of PI2, PI2-CMA, and PI2-KCCA and the five experimental results of PI2-CMA-KCCA are selected for comparison. The optimization effect of PI2-CMA-KCCA is analyzed by comparing the final cost of the system and the trajectories in joint space. It can be clearly seen from Figure 2 that PI2-CMA-KCCA converges faster, and its final cost is the lowest (as shown in Table 2). Figure 3 shows the movement trajectories of the three joints of the SCARA in joint space when the number of iterations is 35. Because joint q_3 has no effect on the position of the end-effector, the value of q_3 in joint space remains constant.
The blue lines in Figure 3 are the trajectories of the joints of the robot arm under PI2-CMA-KCCA. Figure 3 demonstrates that only the blue lines pass through the intermediate via-point.

Passing through Two Via-Points with SCARA
This experimental procedure is similar to Section 5.1, requiring the end-effector to pass the point (18.2, 8.1, 0)^T cm at m = 1.8 s and the point (12.0, 13.5, 0)^T cm at m = 3.6 s. The four algorithms are again each repeated 20 times, and the best experimental results of PI2, PI2-CMA, and PI2-KCCA and the five experimental results of PI2-CMA-KCCA are selected for comparison. Figure 4 demonstrates that PI2-CMA-KCCA has better learning performance than the other algorithms. After 80 iterations, the final costs of the four algorithms are given in Table 3. Figure 5 shows the trajectories of the three joints of the SCARA in joint space when the number of iterations is 50, where the blue lines represent the joints' trajectories under PI2-CMA-KCCA. It demonstrates that, at 50 iterations, the blue trajectories pass accurately through the two intermediate via-points at t = 1.8 s and t = 3.6 s.

Passing Through One Via-Point with Swayer
Swayer is a lightweight collaborative robot created by Rethink Robotics. It has seven joints; in this subsection, q_i denotes the ith joint of Swayer. Because q_7 has no effect on the Cartesian position of the end-effector, only six joints are considered. This experiment uses the ROS platform under Ubuntu 16 to control the movement trajectory of the Swayer manipulator and validates the effectiveness of PI2-CMA-KCCA.
First, we set an arbitrary starting point (697, 159, 514)^T mm in the workspace of Swayer and an arbitrary endpoint (300, −569, −65)^T mm, then drag Swayer to record a demonstration from the starting point to the endpoint, and then choose a reachable point (840, −248, 595)^T mm far from the demonstration as the new task at m = 1.2 s. Finally, PI2, PI2-CMA, PI2-KCCA, and PI2-CMA-KCCA are each repeated 20 times on Swayer. The five experimental results of PI2-CMA-KCCA and the best experimental results of PI2, PI2-CMA, and PI2-KCCA are selected for analysis. Figure 6 shows the downward trend of the costs of the four algorithms, and the final costs after 80 iterations are given in Table 4.

Performance Comparison of Four Algorithms
As shown in Table 5, the cost decline rate of PI2-CMA-KCCA is always higher than those of PI2, PI2-CMA, and PI2-KCCA. For the same experimental object, the more complicated the new task, the better the learning performance of PI2-CMA-KCCA relative to the other three algorithms. In addition, for the same task, the greater the number of degrees of freedom of the experimental object, the larger the advantage of PI2-CMA-KCCA over the other algorithms, because PI2-CMA-KCCA not only adjusts the magnitude of the exploration noise automatically but also exploits the nonlinear heuristic information between the joints.

Conclusions
In this paper, compound heuristic information is applied to guide PI2's exploration and expedite the reinforcement learning procedure. This information is derived by KCCA and CMA together: KCCA infers the nonlinear heuristic information between the joints of the robot, and CMA infers the linear heuristic information within a single DOF. With Maximum Likelihood Estimation on the roll-out data, this information can cause the cost function to drop rapidly. In addition, the proposed algorithm PI2-CMA-KCCA works like a double closed-loop control system, in which the outer loop discovers and predicts the means of the exploration vectors for each DOF and the inner loop discloses and infers the variances of the exploration vectors for each DOF. In this way, the new algorithm can quickly search for the optimal parameters of new tasks. The experimental results on SCARA and Swayer also demonstrate that the algorithm speeds up parameter updating while maintaining the accuracy of completing new tasks, and that it is suitable for objects with many degrees of freedom and for more complex tasks.