Article

Compound Heuristic Information Guided Policy Improvement for Robot Motor Skill Acquisition

1 School of Automation, Wuhan University of Technology, Wuhan 430070, China
2 School of Electronic and Information Engineering, University of Science and Technology Liaoning, Anshan 114051, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2020, 10(15), 5346; https://doi.org/10.3390/app10155346
Submission received: 30 June 2020 / Revised: 28 July 2020 / Accepted: 30 July 2020 / Published: 3 August 2020
(This article belongs to the Special Issue Biorobotics: Challenges, Technologies, and Trends)

Abstract

Discovering implicit patterns and using them as heuristic information to guide the policy search is one of the core factors in speeding up robot motor skill acquisition. This paper proposes a compound heuristic information guided reinforcement learning algorithm, PI2-CMA-KCCA, for policy improvement. Its structure and workflow resemble a double closed-loop control system. The outer loop, realized by Kernel Canonical Correlation Analysis (KCCA), infers the implicit nonlinear heuristic information between the joints of the robot. The inner loop, operated by Covariance Matrix Adaptation (CMA), discovers the hidden linear correlations between the basis functions within a joint of the robot. These patterns, which benefit learning of the new task, automatically determine the mean and variance of the exploring perturbation for Path Integral Policy Improvement (PI2). Compared with classical PI2, PI2-CMA, and PI2-KCCA, PI2-CMA-KCCA not only endows the robot with the ability to transfer trajectory planning from the demonstration to a new task, but also completes the task more efficiently. Classical via-point experiments on SCARA and Swayer robots validate that the proposed method converges quickly and finds a solution for the new task.

1. Introduction

Imitation learning (IL) and reinforcement learning (RL) [1] have long been hot topics in the field of robot skill acquisition. Imitation learning can be divided into two categories: behavioral cloning (BC) and inverse reinforcement learning (IRL). BC learns the expected policy directly from expert demonstration data, while IRL learns the policy indirectly through a reward function. Hidden Markov Models (HMMs) [2], Dynamic Movement Primitives (DMPs), Probabilistic Movement Primitives (ProMPs) [3], Dynamic Systems (DS) [4], and Cross Entropy Regression (CER) are popular BC methods. The most frequently used IRL methods are Maximum Margin Planning (MMP) [5] and Markov Process (MP) approaches.
Imitation learning is limited because the robot learns only from demonstrated trajectories. When the reproduction environment differs substantially from the demonstration environment, such as when an obstacle is placed on the robot's path, imitation learning may fail. RL, by contrast, allows the robot to find a new control policy by freely exploring the state-action space. Combining IL and RL exploits the advantages of both methods to overcome their respective shortcomings, so that the robot can adapt to deviations from the demonstrated behavior and thus improve its performance.
Classical RL methods include SARSA [6], Natural Actor-Critic (NAC) [7], Policy Learning by Weighting Exploration with the Returns (PoWER) [8], Relative Entropy Policy Search (REPS) [9], and Path Integral Policy Improvement (PI2). We believe that PI2 [10] is one of the most effective, numerically robust, and easy-to-implement reinforcement learning algorithms. However, classical PI2 searches the whole parameter space, so it completes tasks less efficiently. Kernel Canonical Correlation Analysis (KCCA) can infer the implicit nonlinear heuristic information between the joints of the robot when facing a new task, leading PI2 to a solution. Building on Covariance Matrix Adaptation (CMA) [11] and our previous research on KCCA [12], we propose a new algorithm, PI2-CMA-KCCA, in which KCCA and CMA are integrated as compound heuristic information to speed up the learning procedure from the demonstration to a new task.
The remainder of this paper is organized as follows. Section 2 reviews the DMP used in the imitation learning phase; we use Dynamic Movement Primitives as the underlying policy representation. Section 3 briefly introduces PI2-CMA, which aims at discovering the hidden relationships between the components of the weights. Section 4 introduces KCCA and derives the PI2-CMA-KCCA algorithm. In Section 5, we validate our algorithm on SCARA and Swayer robots through classical via-point tasks and analyze the experimental results. Conclusions are given in Section 6.

2. Dynamic Movement Primitives

Dynamic Movement Primitives (DMPs) have been used in many disciplines to model complex behaviors. In this paper, we use DMPs as the underlying policy representation. A joint of the robot can be regarded as a DMP. The DMP [13] consists of a damped spring system and a learnable nonlinear forcing term, by which the desired behavior of the joint can be obtained. Its expression is given by:
$$\zeta^{2}\ddot{y}_t=\underbrace{\alpha_y\left(\beta_y\left(g-y_t\right)-\zeta\dot{y}_t\right)}_{\alpha_z}+\underbrace{h(x_t)\,x_t\,(g-y_0)}_{\alpha_f},\qquad \zeta\dot{x}_t=-\alpha_x x_t,\qquad h(x_t)=\frac{\sum_{i=1}^{M}\Psi_i(x)\,\omega_i}{\sum_{i=1}^{M}\Psi_i(x)}=\Psi^{T}\omega, \quad (1)$$
where, if the forcing term labeled $\alpha_f$ is removed, the remainder labeled $\alpha_z$ is a globally stable second-order damped spring system. $\zeta$ is a time constant proportional to the duration of the motion. $g$ is the target position, $y_0$ is the initial position, and $y_t$ is interpreted as the desired position of the joint. $h(x_t)$ is a fitting function, and $x_t$ can be conceived of as a phase variable. $\Psi_i(x)$ denotes the $i$th basis function, $\Psi_i(x)=\exp\left(-(x_t-c_i)^2/2\sigma_i^2\right)$, where $c_i$ and $\sigma_i$ are constants that determine the center and width of the basis function, respectively. $\omega_i$ is the weight corresponding to the $i$th basis function. $M$ is the number of exponential basis functions, and $\omega\in\mathbb{R}^{M\times 1}$ is the weight vector. $\alpha_y$ and $\beta_y$ are chosen so that the damped spring system is critically damped.
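To make the dynamics above concrete, the DMP can be integrated numerically. The following sketch uses Euler integration with illustrative gain values (a critically damped spring with alpha_y = 25, beta_y = alpha_y/4, and alpha_x = 8); the basis-function centers and widths are heuristic choices, not the paper's exact settings.

```python
import numpy as np

def dmp_rollout(w, g, y0, zeta=1.0, alpha_y=25.0, beta_y=6.25,
                alpha_x=8.0, dt=0.002, T=1.0):
    """Euler-integrate one DMP (spring-damper plus forcing term).

    w       : (M,) basis-function weights
    g, y0   : goal and start positions
    zeta    : time constant of the movement
    alpha_y, beta_y : spring gains (critically damped when beta_y = alpha_y/4)
    alpha_x : decay rate of the phase variable x
    """
    M = len(w)
    c = np.exp(-alpha_x * np.linspace(0.0, 1.0, M))     # centers along the phase
    sigma2 = 4.0 * np.diff(c, append=c[-1])**2 + 1e-6   # heuristic widths
    n = int(T / dt)
    y, yd, x = y0, 0.0, 1.0
    traj = np.empty(n)
    for t in range(n):
        psi = np.exp(-(x - c)**2 / (2.0 * sigma2))
        h = psi @ w / (psi.sum() + 1e-10)               # normalized basis mixture
        ydd = (alpha_y * (beta_y * (g - y) - zeta * yd)
               + h * x * (g - y0)) / zeta**2            # spring + forcing term
        x += -alpha_x * x / zeta * dt                   # phase dynamics
        yd += ydd * dt
        y += yd * dt
        traj[t] = y
    return traj
```

With zero weights, the forcing term vanishes and the spring alone should drive the joint to the goal.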
The parameters $\omega_i$ of the DMP are learned by the Locally Weighted Regression (LWR) algorithm, which finds the corresponding $\omega_i$ for each basis function $\Psi_i$ by minimizing the cost function $S$, defined by:
$$S(\omega)=\sum_{j=1}^{N}\Bigl(\sum_{i=1}^{M}\Psi_i(x_j)\,\omega_i-h(x_j)\Bigr)^{2}, \quad (2)$$
where $(x_j, h(x_j))$ is the $j$th sample. Obviously, the $\omega_i$ cooperate with one another in a linear combination because of the introduction of the basis functions; thus, we can optimize them by means of linear correlation techniques. A summary of the notation frequently used in this article is given in Table 1.
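The minimization of S(ω) can be sketched as a batch least-squares problem over the normalized basis matrix. This is a plain global least-squares stand-in for LWR (which fits each weight locally); the function name and the use of `numpy.linalg.lstsq` are illustrative choices.

```python
import numpy as np

def fit_weights(x_samples, h_targets, centers, widths):
    """Least-squares fit of the DMP weights.

    Builds the normalized basis matrix Psi (N x M), with
    Psi[j, i] = Psi_i(x_j) / sum_i Psi_i(x_j), and solves
    min_w || Psi w - h ||^2 in closed form.
    """
    diff = x_samples[:, None] - centers[None, :]
    psi = np.exp(-diff**2 / (2.0 * widths[None, :]**2))
    psi /= psi.sum(axis=1, keepdims=True)          # normalize over basis functions
    w, *_ = np.linalg.lstsq(psi, h_targets, rcond=None)
    return w
```

On synthetic targets generated from known weights, the fit recovers them exactly (the basis matrix is well conditioned).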

3. Path Integral Policy Improvement with Covariance Matrix Adaption

PI2 is derived from the first principles of stochastic optimal control. What sets PI2 apart from other direct policy improvement algorithms is its use of probability-weighted averaging to perform the parameter update, rather than an estimate of the gradient. In Section 2, we obtained the weight vector $\omega$ by parameterizing the demonstration trajectory. The idea of PI2 is to add stochastic exploration noise $\varepsilon_d \sim N(0, \sigma^2)$ to $\omega_d$ and to generate $K$ roll-outs with different costs by executing the parameterized policy. For a robot with $D$ DOFs, the cost function of the $k$th roll-out at time $i$ is
$$J(\tau_{\cdot,i,k})=\sum_{d=1}^{D}\Bigl[\varphi_{d,N,k}+\sum_{j=i}^{N-1}q_{d,j,k}+\frac{1}{2}\sum_{j=i}^{N-1}\left(\omega_d+M_{d,j,k}\,\varepsilon_{d,j,k}\right)^{T}R\left(\omega_d+M_{d,j,k}\,\varepsilon_{d,j,k}\right)\Bigr], \quad (3)$$
where $\tau_{\cdot,i,k}\in\tau$ is a sample path (or trajectory piece) and $\cdot$ indicates all DOFs. $\varphi_{d,N,k}$ represents the terminal reward of the $k$th trajectory of the $d$th DOF. $q_{d,j,k}$ is the immediate reward of the $k$th trajectory of the $d$th DOF at time $j$; specifically, it is the square of the joint's acceleration, $\ddot{y}_t^2$. $M_{d,j,k}$ is the mapping matrix of the $k$th trajectory of the $d$th DOF at time $j$. $R$ is the positive semi-definite weight matrix of the quadratic control cost. $N$ is the maximum value of the time index.
Next, the exploration is evaluated. First, the costs of the obtained trajectories are sorted; then $K_e$ elite samples are selected; finally, probability-weighted averaging is performed to obtain $\delta\omega_{d,i}$ for the $d$th DOF at time $i$:
$$\delta\omega_{d,i}=\sum_{k=1}^{K_e}P(\tau_{\cdot,i,k})\,M_{d,i,k}\,\varepsilon_{d,i,k}, \quad (4)$$
where $P(\tau_{\cdot,i,k})$ is the probability-weighted value of the $k$th trajectory of the $d$th DOF at time $i$, obtained by a softmax transformation of the cost function:
$$P_{\cdot,k,i}=P(\tau_{\cdot,i,k})=\frac{e^{-\frac{1}{\lambda}J(\tau_{\cdot,i,k})}}{\sum_{k'=1}^{K_e}e^{-\frac{1}{\lambda}J(\tau_{\cdot,i,k'})}}, \quad (5)$$
where λ is an appropriate constant.
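The elite selection, softmax weighting, and probability-weighted averaging described above can be sketched as follows. The function signature and the max-shift inside the softmax (for numerical stability) are implementation choices, not from the paper.

```python
import numpy as np

def pi2_update(costs, perturbations, lam=1.0, n_elite=None):
    """PI2 probability-weighted averaging over roll-outs.

    costs         : (K,) cost J of each roll-out
    perturbations : (K, M) exploration noise added to w in each roll-out
    lam           : temperature of the softmax transformation
    n_elite       : if given, keep only the K_e cheapest roll-outs
    Returns the weight correction delta_w of shape (M,).
    """
    if n_elite is not None:
        elite = np.argsort(costs)[:n_elite]          # K_e elite samples
        costs, perturbations = costs[elite], perturbations[elite]
    s = -(costs - costs.min()) / lam                 # shift for numerical stability
    p = np.exp(s) / np.exp(s).sum()                  # softmax over negative cost
    return p @ perturbations                         # probability-weighted average
```

A roll-out with much lower cost dominates the update, so delta_w approaches that roll-out's perturbation.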
Later, $\delta\omega_{d,i}$ is averaged over time to obtain $\delta\omega_d$ and hence the new weight (i.e., $\omega_d^{new}=\omega_d^{old}+\delta\omega_d$). By searching the parameter space iteratively, PI2 eventually finds a solution for the new task. Classical PI2 only updates the mean $\omega$; the covariance $\sigma^2$ is a constant ($\sigma^2=\lambda_{init}I_M$), where $M$ is the number of basis functions and $\lambda_{init}$ determines the magnitude of the initial exploration noise. PI2-CMA aims to determine the magnitude of the exploration noise automatically and to infer the implicit linear correlations between the basis functions within a joint of the robot. In other words, the covariance matrix adaptation is expressed as:
$$\Sigma_{d,i}^{new}=\sum_{k=1}^{K_e}P_{\cdot,k,i}\left(M_{d,i,k}\,\varepsilon_{d,i,k}\right)\left(M_{d,i,k}\,\varepsilon_{d,i,k}\right)^{T}. \quad (6)$$
In addition, the covariance update equation is
$$\Sigma_{d}^{new}=\frac{\sum_{i=1}^{N}(N-i)\,\Sigma_{d,i}^{new}}{\sum_{l=1}^{N}(N-l)}. \quad (7)$$
Note that vanilla PI2-CMA only infers the linear correlations of the weights within each DMP independently.
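The per-time-step covariance adaptation and its temporal averaging can be sketched as below. The indexing of the temporal weights (N, N−1, …, 1) is a judgment call on the extracted formula; the function names are illustrative.

```python
import numpy as np

def cma_covariance(probs, perturbations):
    """Probability-weighted covariance of the elite perturbations (CMA step).

    probs         : (K_e,) probability weights P of the elite roll-outs
    perturbations : (K_e, M) elite exploration noise
    """
    return sum(p * np.outer(e, e) for p, e in zip(probs, perturbations))

def temporal_average(sigmas):
    """Temporal averaging of the per-step covariances: earlier time steps
    receive larger weights (N - i), normalized so the weights sum to one."""
    N = len(sigmas)
    w = np.arange(N, 0, -1, dtype=float)       # N, N-1, ..., 1
    return np.tensordot(w, np.asarray(sigmas), axes=1) / w.sum()
```

With a single elite sample of probability 1, the covariance is just the outer product of its perturbation; averaging identical matrices returns the same matrix.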

4. PI2-CMA with Kernel Canonical Correlation Analysis

From Equation (3), it can be seen that the perturbation $\varepsilon$ of PI2 is generated with equal probability for each joint's weight vector $\omega_d$. PI2-CMA takes into account the implicit linear correlations between the basis functions of a joint and automatically updates the covariance. However, when a task is assigned to a multi-joint robot, there exists an unknown, hidden, task-oriented pattern between the perturbed weight vectors $\check{\omega}_{d_i}$ and $\check{\omega}_{d_j}$ ($\check{\omega}_{d_k}=\omega_{d_k}+\delta\omega_{d_k}$, $k\in\{i,j\}$). This pattern can be inferred from experience and expressed as a nonlinear correlation. Specifically, we can apply Kernel Canonical Correlation Analysis (KCCA) to infer the implicit nonlinear heuristic information between the joints of the robot to guide the PI2 policy search for the new trajectory-planning task. In this paper, we not only consider the linear correlations between the perturbation vector's components within a DMP but also take the nonlinear correlations of the perturbation vectors between DMPs into account to expedite the robot's learning procedure.

4.1. Nonlinear Correlation Heuristic Information

According to Equations (3) and (4), the perturbation is generated with equal probability for each joint's weight vector; in other words, there is no correlation between the joints. Therefore, $[\check{\omega}_1^T,\check{\omega}_2^T,\ldots,\check{\omega}_D^T]^T\in\mathbb{R}^{DM\times 1}$ is denoted as $\tilde{\omega}$, and the covariance matrix of $\tilde{\omega}$ is given by:
$$\Sigma_{\tilde{\omega}}=\mathrm{diag}\left[\sigma_1^{2},\sigma_2^{2},\ldots,\sigma_D^{2}\right]\in\mathbb{R}^{DM\times DM}, \quad (8)$$
where $\mathrm{diag}[\cdot]$ denotes a diagonal matrix and $\sigma_1^2=\cdots=\sigma_D^2=\sigma^2$.
When a multi-joint robot completes a task, we believe that there are hidden patterns between the joints. The implicit patterns between the perturbation vectors of the multi-joint robot are expressed as nonlinear correlations. With the help of the kernel method, the $d$th joint's perturbation vector $\check{\omega}_d\in\mathbb{R}^{M\times 1}$ is mapped to the high-dimensional feature space $\phi(\check{\omega}_d)\in\mathbb{R}^{V\times 1}$. $\tilde{\omega}$ is likewise mapped to $\phi(\tilde{\omega})\in\mathbb{R}^{DV\times 1}$, and the covariance matrix of $\tilde{\omega}$ can then be described as:
$$\Sigma_{\phi(\tilde{\omega})}=\begin{bmatrix}\Gamma(\check{\omega}_1,\check{\omega}_1)&\Gamma(\check{\omega}_1,\check{\omega}_2)&\cdots&\Gamma(\check{\omega}_1,\check{\omega}_D)\\\Gamma(\check{\omega}_2,\check{\omega}_1)&\Gamma(\check{\omega}_2,\check{\omega}_2)&\cdots&\Gamma(\check{\omega}_2,\check{\omega}_D)\\\vdots&\vdots&\ddots&\vdots\\\Gamma(\check{\omega}_D,\check{\omega}_1)&\Gamma(\check{\omega}_D,\check{\omega}_2)&\cdots&\Gamma(\check{\omega}_D,\check{\omega}_D)\end{bmatrix}, \quad (9)$$
where $\Gamma(\check{\omega}_{d_i},\check{\omega}_{d_j})=\mathrm{cov}(\phi(\check{\omega}_{d_i}),\phi(\check{\omega}_{d_j}))$ is the covariance of $\check{\omega}_{d_i}$ and $\check{\omega}_{d_j}$ projected into the high-dimensional space, expressing the implicit nonlinear pattern between the perturbation vectors of the $d_i$th joint and the $d_j$th joint. The intuitive policy is to estimate $\Sigma_{\phi(\tilde{\omega})}$ from empirical samples and then use $\Gamma(\check{\omega}_{d_i},\check{\omega}_{d_j})$ as the heuristic information: given the perturbation vector $\check{\omega}_{d_i}$, the exploration of $\check{\omega}_{d_j}$ is guided by the covariance $\Gamma(\check{\omega}_{d_i},\check{\omega}_{d_j})$.
Note that $\Gamma(\check{\omega}_{d_i},\check{\omega}_{d_j})$ lives in a single shared coordinate system (i.e., that of $\phi(\check{\omega})$), but analyzing the correlation between the perturbation vectors $\check{\omega}_{d_i}$ and $\check{\omega}_{d_j}$ in a uniform coordinate system is not the best choice: after the perturbation vectors of different joints are mapped to a high-dimensional space, the correlation coefficients between the joints are maximized only after a proper projection transformation, and such projection matrices are usually not identical. Here, the generalized Rayleigh quotient is used to find the maximum correlation coefficient of $\phi(\check{\omega}_{d_i})$ and $\phi(\check{\omega}_{d_j})$, and Maximum Likelihood Estimation is used to infer the nonlinear correlation between $\check{\omega}_{d_i}$ and $\check{\omega}_{d_j}$. If $d_i\neq d_j$, $\Theta(\check{\omega}_{d_i},\check{\omega}_{d_j})=\mathrm{cov}(Pr\{\Phi(\check{\omega}_{d_i})\},Pr\{\Phi(\check{\omega}_{d_j})\})$ can serve as the heuristic information, where $Pr$ is the projection operator. That is, given the perturbation vector $\check{\omega}_{d_i}$, $\check{\omega}_{d_j}$ can be obtained from the covariance $\Theta(\check{\omega}_{d_i},\check{\omega}_{d_j})$. Equation (9) can now be expressed as follows:
$$\Sigma_{\phi(\tilde{\omega})}^{+}=\begin{bmatrix}\Gamma(\check{\omega}_1,\check{\omega}_1)&\Theta(\check{\omega}_1,\check{\omega}_2)&\cdots&\Theta(\check{\omega}_1,\check{\omega}_D)\\\Theta(\check{\omega}_2,\check{\omega}_1)&\Gamma(\check{\omega}_2,\check{\omega}_2)&\cdots&\Theta(\check{\omega}_2,\check{\omega}_D)\\\vdots&\vdots&\ddots&\vdots\\\Theta(\check{\omega}_D,\check{\omega}_1)&\Theta(\check{\omega}_D,\check{\omega}_2)&\cdots&\Gamma(\check{\omega}_D,\check{\omega}_D)\end{bmatrix}. \quad (10)$$

4.2. Robot Intelligent Trajectory Inference with KCCA

KCCA [14] is a nonlinear correlation analysis method. In this paper, we apply KCCA to the elite samples, from the robot's first joint to the other joints, to make heuristic inferences. After the $p$th iteration of PI2-CMA, the program records the total reward $J^p(\tau)$ (one episodic sample) based on the current $\check{\omega}_{1,\ldots,D}^{p}$ (the perturbation vectors of the $D$ joints). Then, $J^p(\tau)$ is compared with the total reward $J^{p-1}(\tau)$ based on $\check{\omega}_{1,\ldots,D}^{p-1}$ at the $(p-1)$th iteration to obtain a decline rate $T_r$.
If $T_r$ is greater than its upper threshold $T_{max}$, $T_r$ is updated to $T_{max}$, and the program executes KCCA learning. At this point, the weight perturbation sample $\check{\omega}_1^s=\{\check{\omega}_{1,1},\check{\omega}_{1,2},\ldots,\check{\omega}_{1,N}\}$ on the first joint and the weight perturbation sample $\check{\omega}_2^s=\{\check{\omega}_{2,1},\check{\omega}_{2,2},\ldots,\check{\omega}_{2,N}\}$ on the second joint in the $p$th iteration are respectively mapped to the high-dimensional space to obtain $[\phi(\check{\omega}_{d,1})\ \phi(\check{\omega}_{d,2})\ \cdots\ \phi(\check{\omega}_{d,N})]$, denoted $\Phi(\check{\omega}_d^s)\in\mathbb{R}^{V\times N}$ ($d\in\{1,2\}$). Next, we seek two vectors $w_1,w_2\in\mathbb{R}^{V\times 1}$ such that the correlation coefficient between the data $u$ and $v$ after the projection of $\phi(\check{\omega}_1)$ and $\phi(\check{\omega}_2)$ is maximized. We have:
$$u=w_1^{T}\phi(\check{\omega}_1),\qquad v=w_2^{T}\phi(\check{\omega}_2). \quad (11)$$
The covariance of u and v is given by:
$$\mathrm{cov}(u,v)=\frac{1}{N-1}\,w_1^{T}\,\Phi(\check{\omega}_1^{s})\,\Phi(\check{\omega}_2^{s})^{T}\,w_2. \quad (12)$$
Obviously, the vectors $w_1$ and $w_2$ lie in the spaces spanned by the data $\Phi(\check{\omega}_1^s)$ and $\Phi(\check{\omega}_2^s)$:
$$w_1=\Phi(\check{\omega}_1^{s})\,\alpha_1,\qquad w_2=\Phi(\check{\omega}_2^{s})\,\alpha_2, \quad (13)$$
where $\alpha_1,\alpha_2\in\mathbb{R}^{N\times 1}$. From Equations (12) and (13), the correlation coefficient $\rho$ can be obtained:
$$\rho=\frac{\alpha_1^{T}\,\Phi^{T}(\check{\omega}_1^{s})\Phi(\check{\omega}_1^{s})\,\Phi^{T}(\check{\omega}_2^{s})\Phi(\check{\omega}_2^{s})\,\alpha_2}{\sqrt{\alpha_1^{T}\left(\Phi^{T}(\check{\omega}_1^{s})\Phi(\check{\omega}_1^{s})\right)^{2}\alpha_1}\,\sqrt{\alpha_2^{T}\left(\Phi^{T}(\check{\omega}_2^{s})\Phi(\check{\omega}_2^{s})\right)^{2}\alpha_2}}=\frac{\alpha_1^{T}K_{\omega_1}K_{\omega_2}\alpha_2}{\sqrt{\alpha_1^{T}K_{\omega_1}K_{\omega_1}\alpha_1}\,\sqrt{\alpha_2^{T}K_{\omega_2}K_{\omega_2}\alpha_2}}. \quad (14)$$
The kernel trick is introduced in Equation (14), where $K_{(\cdot)}$ is a kernel (Gram) matrix. Without loss of generality, we fix the denominator of Equation (14) by requiring $\alpha_1^T K_{\omega_1}K_{\omega_1}\alpha_1=1$ and $\alpha_2^T K_{\omega_2}K_{\omega_2}\alpha_2=1$, and seek $\alpha_1$ and $\alpha_2$ that maximize $\alpha_1^T K_{\omega_1}K_{\omega_2}\alpha_2$. We construct the Lagrange function:
$$L=\alpha_1^{T}K_{\omega_1}K_{\omega_2}\alpha_2-\frac{\lambda_1}{2}\left(\alpha_1^{T}K_{\omega_1}K_{\omega_1}\alpha_1-1\right)-\frac{\lambda_2}{2}\left(\alpha_2^{T}K_{\omega_2}K_{\omega_2}\alpha_2-1\right). \quad (15)$$
Taking the partial derivatives of Equation (15) with respect to $\alpha_1$ and $\alpha_2$ and setting them to zero, we obtain:
$$K_{\omega_2}\alpha_2=\lambda\,K_{\omega_1}\alpha_1,\qquad \lambda=\lambda_1=\lambda_2=\alpha_1^{T}K_{\omega_1}K_{\omega_2}\alpha_2. \quad (16)$$
According to Equation (16), the implicit correlation $\Theta(\check{\omega}_1,\check{\omega}_2)$ between the perturbation vectors of the first and second joints is obtained. When the current $T_r$ is greater than the upper threshold $T_{max}$, $\alpha_1$ and $\alpha_2$ are recorded. Equation (15) can then be solved repeatedly to obtain the implicit correlation between the first joint and the $d$th joint ($d\in\{2,\ldots,D\}$). In this way, we obtain the implicit patterns $\{(\alpha_1,\alpha_2),\ldots,(\alpha_1,\alpha_D)\}$ between the joints.
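One standard way to solve the stationarity conditions above is to eliminate one dual vector and solve a regularized eigenproblem. The RBF kernel, the ridge term `reg`, and the function names below are assumptions (the paper does not specify its kernel or regularization); this is a sketch, not the authors' implementation.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2) of the sample rows."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * d2)

def kcca(K1, K2, reg=1e-3):
    """Solve the KCCA stationarity conditions for (alpha_1, alpha_2).

    Eliminating alpha_2 from the coupled conditions gives the eigenproblem
        (K1 K1 + reg I)^-1 K1 K2 (K2 K2 + reg I)^-1 K2 K1 a1 = rho^2 a1,
    where the small ridge `reg` keeps the Gram matrices invertible.
    Returns alpha_1, alpha_2 (up to scale) and the top correlation rho.
    """
    N = K1.shape[0]
    A = np.linalg.solve(K1 @ K1 + reg * np.eye(N), K1 @ K2)
    B = np.linalg.solve(K2 @ K2 + reg * np.eye(N), K2 @ K1)
    vals, vecs = np.linalg.eig(A @ B)
    top = int(np.argmax(vals.real))
    a1 = vecs[:, top].real
    a2 = B @ a1                       # second stationarity condition, up to scale
    rho = float(np.sqrt(max(vals[top].real, 0.0)))
    return a1, a2, rho
```

As a sanity check, running KCCA on two identical Gram matrices should yield a correlation near 1 (slightly below it because of the ridge).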
If $T_r$ is less than the lower threshold $T_{min}$, the independent exploration of the joints is abandoned, and KCCA prediction is turned on. That is, the first joint explores randomly to obtain a perturbation sample $\check{\omega}_1^s$ and $K_{\omega_1}$ in the $p$th iteration; then $(\alpha_1,\alpha_2)$ is used to calculate $K_{\omega_2}$ of the second joint. Obviously, $\check{\omega}_2$ can be computed from $K_{\omega_2}$, and this guides the exploration of the second joint. The same steps are repeated for the subsequent joints up to the $D$th joint.

4.3. The Combination of KCCA and CMA

We draw the schematic of PI2-CMA-KCCA's control flow in Figure 1; it works like a double closed-loop control system. Without loss of generality, there are $D$ joints and each joint corresponds to $M$ basis functions, so $D\times M$ parameters need to be adjusted.
The outer loop aims at discovering the hidden patterns between the joints that help fulfill the new task. Specifically, we use KCCA to infer the nonlinear correlations between the first joint and the other joints whenever the decline rate is good, recorded as $\{(\alpha_1,\alpha_2),\ldots,(\alpha_1,\alpha_D)\}$. After that, we can compute $\check{\omega}_d$ from $K_{\omega_d}$ based on the perturbation sample $\check{\omega}_1^s$ and $K_{\omega_1}$, and it guides the perturbation of the $d$th joint as exploring noise.
The inner loop extracts heuristic information from the corresponding single joint to guide the search. Specifically, in the previous iteration it evaluates the cost $J_d$ of $K$ roll-outs (sorting and probability averaging), then selects the first $K_e$ elites of the perturbation sample to obtain the covariance $\Sigma_{\omega_d}$ by the CMA algorithm. In the current iteration, $\omega_d\sim N(\check{\omega}_d,\Sigma_{\omega_d})$ is used to explore the policy parameter $\omega_d$. In other words, it applies compound heuristic information, integrating the inferred hidden nonlinear patterns between the joints with the linear patterns within the joint, to automatically determine the exploring proportion for each component of $\omega_d$.
In short, the outer loop discovers and predicts the exploring mean of $\omega_d$ for the $d$th ($d\neq 1$) DOF, and the inner loop discloses and infers the exploring variance of $\omega_d$ for the $d$th ($d\neq 1$) DOF. For the first DOF, vanilla PI2 is employed.
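To make the double-loop idea concrete, here is a self-contained toy sketch: joint 2's optimum depends nonlinearly on joint 1's weights, and the "outer loop" is replaced by an oracle prediction of that dependency (standing in for KCCA inference), while the update itself is a PI2-style probability-weighted average over elite roll-outs. Everything here, from the cost function to the parameter values, is illustrative and not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_cost(w1, w2):
    """Toy task with a hidden inter-joint pattern: joint 1's optimum is the
    all-ones vector, and joint 2's optimum is a nonlinear function of w1."""
    return float(np.sum((w1 - 1.0)**2) + np.sum((w2 - np.tanh(w1))**2))

def iteration(w1, w2, sigma=0.3, K=20, K_e=10, lam=0.1):
    """One guided PI2-style update.  Joint 1 explores with zero-mean noise
    (vanilla PI2); joint 2 explores around a predicted offset derived from
    joint 1 -- the role the KCCA outer loop plays in the paper."""
    eps1 = sigma * rng.standard_normal((K, w1.size))
    mean2 = np.tanh(w1) - w2                 # oracle heuristic exploring mean
    eps2 = mean2 + sigma * rng.standard_normal((K, w2.size))
    costs = np.array([toy_cost(w1 + e1, w2 + e2) for e1, e2 in zip(eps1, eps2)])
    elite = np.argsort(costs)[:K_e]          # keep the K_e cheapest roll-outs
    p = np.exp(-(costs[elite] - costs[elite].min()) / lam)
    p /= p.sum()                             # softmax probability weights
    return w1 + p @ eps1[elite], w2 + p @ eps2[elite]

w1, w2 = np.zeros(3), np.zeros(3)
for _ in range(30):
    w1, w2 = iteration(w1, w2)
```

Starting from zero weights (cost 3.0), the guided search drives the cost down substantially within a few dozen iterations.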

5. Evaluations

The cost function in our experiments is given by:
$$J=0.5\sum_{d=1}^{D}\sum_{i=1}^{N-1}\Bigl[10^{7}\bigl(\ddot{y}^{(d)}(i)\bigr)^{2}+\bigl(a_f^{(d)}(i)\bigr)^{2}\Bigr]+\sum_{d=1}^{D}10^{11}\bigl(y^{(d)}(m)-y_v^{(d)}\bigr)^{2}+\sum_{d=1}^{D}10^{3}\Bigl[\bigl(\dot{y}^{(d)}(N)\bigr)^{2}+\bigl(y^{(d)}(N)-y_g^{(d)}\bigr)^{2}\Bigr]. \quad (17)$$
In Equation (17), $D$ is the number of joints of SCARA and Swayer. $y^{(d)}(i)$, $\dot{y}^{(d)}(i)$, and $\ddot{y}^{(d)}(i)$ denote the position, velocity, and acceleration of the $d$th joint of the robot at time $i$. $y_v^{(d)}$ is the preset point (i.e., via-point) to be passed through by the $d$th joint, and $y^{(d)}(m)$ is the point actually passed through by the $d$th joint at time $m$. $y_g^{(d)}$ is the trajectory's end point of the $d$th joint. $a_f^{(d)}(i)$ is the acceleration of the forcing term of the $d$th joint at time $i$. Equation (17) is a typical quadratic expression of the total reward over the finite horizon of reinforcement learning.
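The cost above can be transcribed directly. The array shapes are illustrative, and the weight exponents (10^7, 10^11, 10^3) follow the extracted text of Equation (17), so they should be read as assumptions of this sketch.

```python
import numpy as np

def via_point_cost(y, yd, ydd, a_f, y_via, m, y_goal):
    """Total cost of the via-point task for one roll-out.

    y, yd, ydd : (D, N) joint positions, velocities, accelerations
    a_f        : (D, N) forcing-term accelerations
    y_via      : (D,) via-point, to be reached at time index m
    y_goal     : (D,) goal position at the final time index
    """
    # running cost: penalize accelerations over time steps 1..N-1
    running = 0.5 * np.sum(1e7 * ydd[:, :-1]**2 + a_f[:, :-1]**2)
    # via-point cost: heavily penalize missing the intermediate point at time m
    via = np.sum(1e11 * (y[:, m] - y_via)**2)
    # terminal cost: penalize residual velocity and distance to the goal
    terminal = np.sum(1e3 * (yd[:, -1]**2 + (y[:, -1] - y_goal)**2))
    return running + via + terminal
```

A trajectory that sits motionless on both the via-point and the goal incurs zero cost, while any via-point miss is penalized strongly.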

5.1. Passing through One Via-Point with SCARA

SCARA has three revolute joints $q_1$, $q_2$, and $q_3$ and one prismatic joint $q_4$. In this experiment, the orientation of the end-effector is ignored; thus, the SCARA arm can be regarded as a planar two-link mechanism.
The experiment consists of four steps: (1) Set the Cartesian coordinates of the starting position of the end-effector to $(20, 0, 0)^T$ cm, with corresponding joint angles $(0, 0, 0)^T$ rad, and set the endpoint to $(4.5, 16, 0)^T$ cm, with corresponding joint angles $(0.7068, 1.1796, 0)^T$ rad. (2) Drive the manipulator based on the minimum-jerk principle to obtain the demonstration. (3) Given a new task, use PI2, PI2-CMA, PI2-KCCA, and PI2-CMA-KCCA, respectively, to make the movement pass through a via-point at $m = 1.8$ s. (4) Iterate each of the four algorithms 80 times to acquire the new motor skill.
In this experiment, each of the four different algorithms is repeated 20 times. The best experimental results of PI2, PI2-CMA, PI2-KCCA, and the five experimental results of PI2-CMA-KCCA are selected for comparison. The optimization effect of PI2-CMA-KCCA is analyzed by comparing the final cost of the system and the trajectories of joint space. It can be clearly seen from Figure 2 that PI2-CMA-KCCA converges faster, and its final cost is the lowest (as shown in Table 2).
Figure 3 shows the movement trajectories of the three joints of the SCARA in the joint space when the number of iterations is 35. Because the joint q 3 has no effect on the position of the end-effector, the value of q 3 in the joint space remains constant.
The blue line in Figure 3 is the trajectory of each joint of the robot arm under the PI2-CMA-KCCA. Figure 3 demonstrates that only the blue lines pass through an intermediate via-point.

5.2. Passing through Two Via-Point with SCARA

This experimental procedure is similar to that of Section 5.1, requiring the end-effector to pass through the point $(18.2, 8.1, 0)^T$ cm at $m = 1.8$ s and the point $(12.0, 13.5, 0)^T$ cm at $m = 3.6$ s. The four algorithms are again each repeated 20 times, and the best results of PI2, PI2-CMA, and PI2-KCCA and the five results of PI2-CMA-KCCA are selected for comparison. Figure 4 demonstrates that PI2-CMA-KCCA has better learning performance than the other algorithms. After 80 iterations, the final costs of the four algorithms are given in Table 3.
Figure 5 shows the trajectories of the three joints of the SCARA in the joint space when the number of iterations is 50, where the blue lines represent the joints' trajectories under PI2-CMA-KCCA. It demonstrates that, after 50 iterations, the blue trajectories pass through the two intermediate via-points at t = 1.8 s and t = 3.6 s accurately.

5.3. Passing Through One Via-Point with Swayer

Swayer is a lightweight collaborative robot created by Rethink Robotics with seven joints. In this subsection, $q_i$ denotes the $i$th joint of Swayer. Because $q_7$ has no effect on the Cartesian position of the end-effector, only six joints are considered. This experiment uses the ROS platform on Ubuntu 16 to control the movement trajectory of the Swayer manipulator and validates the effectiveness of PI2-CMA-KCCA.
First, we set an arbitrary starting point $(697, 159, 514)^T$ mm in the workspace of Swayer and an arbitrary endpoint $(300, 569, 65)^T$ mm. We then drag Swayer to record a demonstration from the starting point to the endpoint and choose a reachable point $(840, 248, 595)^T$ mm far from the demonstration as the new task at $m = 1.2$ s. Finally, PI2, PI2-CMA, PI2-KCCA, and PI2-CMA-KCCA are each repeated 20 times on Swayer. The five results of PI2-CMA-KCCA and the best results of PI2, PI2-CMA, and PI2-KCCA are selected for analysis. Figure 6 shows the downward trend of the cost of the four algorithms, and the final costs after 80 iterations are given in Table 4.
Figure 6 illustrates that, during reinforcement learning, the learning efficiency of PI2-CMA-KCCA is higher for the same number of iterations. At the same time, Figure 7 shows that Swayer passes through the intermediate via-point accurately at $m = 1.2$ s when the number of iterations is 60.
Figure 7 shows the trajectories of joint space in Swayer. The blue lines represent the joints’ trajectories after PI2-CMA-KCCA learning. The result shows that only the blue lines can pass through the via-point accurately after about 60 iterations.

5.4. Performance Comparison of Four Algorithms

As shown in Table 5, the cost decline rate of PI2-CMA-KCCA is always higher than that of PI2, PI2-CMA, and PI2-KCCA. For the same experimental object, the more complicated the new task, the more the learning performance of PI2-CMA-KCCA exceeds that of the other three algorithms. Moreover, for the same task, the greater the number of degrees of freedom of the experimental object, the larger PI2-CMA-KCCA's advantage, because it not only adjusts the magnitude of the exploration noise automatically but also exploits the nonlinear heuristic information between the joints.

6. Conclusions

Compound heuristic information is applied in this paper to guide PI2's variational exploration and expedite the reinforcement learning procedure. This information is derived by KCCA and CMA together: KCCA infers the nonlinear heuristic information between the joints of the robot, and CMA infers the linear heuristic information within a single DOF. With maximum likelihood estimation on the roll-out data, this information can make the cost function drop rapidly. The proposed algorithm, PI2-CMA-KCCA, works like a double closed-loop control system, in which the outer loop discovers and predicts the means of the exploring vectors for each DOF and the inner loop discloses and infers their variances. In this way, the new algorithm can quickly search for the optimal parameters of new tasks. The experimental results on SCARA and Swayer demonstrate that the algorithm speeds up the parameter-update process while maintaining the accuracy of completing new tasks, making it suitable for objects with many degrees of freedom and for more complex tasks.

Author Contributions

Methodology, J.F.; software, C.L.; validation, C.L., X.T.; formal analysis, J.F.; investigation, F.L.; resources, X.T.; data curation, B.L.; writing—original draft preparation, C.L.; writing—review and editing, J.F.; visualization, C.L.; supervision, J.F.; project administration, J.F.; funding acquisition, J.F. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China under Grants 61773299, 51575412.


Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Siciliano, B.; Khatib, O. Springer Handbook of Robotics; Springer: Berlin/Heidelberg, Germany, 2016; pp. 987–1008. [Google Scholar]
  2. Takano, W.; Nakamura, Y. Statistical mutual conversion between whole body motion primitives and linguistic sentences for human motions. Int. J. Robot. Res. 2015, 34, 1314–1328. [Google Scholar]
  3. Paraschos, A.; Daniel, C.; Peters, J.; Neumann, G. Probabilistic movement primitives. In Proceedings of the 27th Annual Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 5–10 December 2013; pp. 2616–2624. [Google Scholar]
  4. Khansari-Zadeh, S.M.; Billard, A. BM: An iterative algorithm to learn stable nonlinear dynamical systems with Gaussian mixture models. In Proceedings of the 2010 IEEE International Conference on Robotics and Automation, Anchorage, AK, USA, 3–7 May 2010; pp. 2381–2388. [Google Scholar]
  5. Ratliff, N.D.; Bagnell, J.A.; Zinkevich, M.A. Maximum margin planning. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA, 25–29 June 2006; pp. 729–736. [Google Scholar]
  6. Sutton, R.S.; Barto, A.G. Reinforcement Learning; A Bradford Book; The MIT Press: Cambridge, MA, USA; London, UK, 1998; pp. 665–685. [Google Scholar]
  7. Peters, J.; Schaal, S. Natural Actor-Critic. Neurocomputing 2008, 71, 1180–1190. [Google Scholar]
  8. Kober, J.; Peters, J. Learning motor primitives for robotics. In Proceedings of the 2009 IEEE International Conference on Robotics and Automation, Kobe, Japan, 12–17 May 2009; pp. 2112–2118. [Google Scholar]
  9. Peters, J.; Altun, Y. Relative entropy policy search. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, Atlanta, GA, USA, 11–15 July 2010; pp. 1607–1612. [Google Scholar]
  10. Theodorou, E.; Buchli, J.; Schaal, S. A Generalized Path Integral Control Approach to Reinforcement Learning. J. Mach. Learn. Res. 2010, 11, 3137–3181. [Google Scholar]
  11. Stulp, F.; Sigaud, O. Path integral policy improvement with covariance matrix adaptation. In Proceedings of the 29th International Conference on Machine Learning, Edinburgh, UK, 26 June–1 July 2012; pp. 281–288. [Google Scholar]
  12. Fu, J.; Teng, X.; Cao, C.; Lou, P. Intelligent trajectory planning based on reinforcement learning with KCCA inference for robot. J. Huazhong Univ. Sci. Technol. (Nat. Sci. Ed.) 2019, 47, 96–102. [Google Scholar]
  13. Ijspeert, A.J.; Nakanishi, J.; Hoffmann, H.; Pastor, P.; Schaal, S. Dynamical Movement Primitives: Learning Attractor Models for Motor Behaviors. Neural Comput. 2013, 25, 328–373. [Google Scholar]
  14. Melzer, T.; Reiter, M.; Bischof, H. Appearance models based on kernel canonical correlation analysis. Pattern Recognit. 2003, 36, 1961–1971. [Google Scholar]
Figure 1. A typical system architecture for PI2-CMA-KCCA.
Figure 2. Cost of one via-point task.
Figure 3. One via-point task when the number of iterations is 30.
Figure 4. Cost of two via-point task.
Figure 5. Two via-point task when the number of iterations is 50.
Figure 6. Cost of one via-point task.
Figure 7. One via-point task when the number of iterations is 60.
Table 1. Summary of the notation frequently used in the article.
Symbol | Definition
$\ddot{y}_t, \dot{y}_t, y_t$ | The desired acceleration, velocity, and position of the joint
$h(x_t)$ | Fitting function; $x_t$ is a phase variable
$P_{\cdot,k,i}$ | The probability-weighted value of the $k$th trajectory of the $d$th DOF at time $i$
$\delta\omega_{d,i}$ | The $d$th joint's correction weight vector at time $i$
$\delta\omega_d$ | Average value of $\delta\omega_{d,i}$ over time
$\omega_d$ | The $d$th joint's weight vector
$\check{\omega}_d$ | The $d$th joint's weight vector with perturbation
$\tilde{\omega}$ | $[\check{\omega}_1^T, \check{\omega}_2^T, \ldots, \check{\omega}_D^T]^T$
$\phi(\cdot)$ | Mapping function that maps a variable to $V$-dimensional feature space
$\check{\omega}_d^s$ | The weight perturbation sample of the $d$th joint, i.e., $\check{\omega}_d^s=\{\check{\omega}_{d,1},\check{\omega}_{d,2},\ldots,\check{\omega}_{d,N}\}$
$J_d$ | The $d$th joint's cost for the given task
$J^p$ | The total cost of the $D$ joints at the $p$th iteration
$K(\cdot)$ | Kernel matrix
$T_r$ | Decline rate of the cost between $J^{p-1}$ and $J^p$
$\{J_d,\delta\omega_{d,i}\}_K$ | $J_d$ and $\delta\omega_{d,i}$ calculated from $K$ roll-outs
$\{J_d,\delta\omega_{d,i}\}_{K_e}$ | $J_d$ and $\delta\omega_{d,i}$ calculated from the $K_e$ elite roll-outs, obtained by sorting the $K$ roll-outs by cost
Table 2. The final cost of one via-point task with SCARA.
Algorithm | Final Cost
PI2 | 2.2341 × 10^8
PI2-CMA | 1.7569 × 10^8
PI2-KCCA | 1.4089 × 10^8
PI2-CMA-KCCA (test 1) | 1.0249 × 10^8
PI2-CMA-KCCA (test 2) | 1.1095 × 10^8
PI2-CMA-KCCA (test 3) | 1.0677 × 10^8
PI2-CMA-KCCA (test 4) | 1.2106 × 10^8
PI2-CMA-KCCA (test 5) | 1.0155 × 10^8
Table 3. The final cost of two via-point task with SCARA.
Algorithm | Final Cost
PI2 | 7.1203 × 10^9
PI2-CMA | 5.5471 × 10^9
PI2-KCCA | 4.4999 × 10^9
PI2-CMA-KCCA (test 1) | 2.7410 × 10^9
PI2-CMA-KCCA (test 2) | 3.3489 × 10^9
PI2-CMA-KCCA (test 3) | 3.4429 × 10^9
PI2-CMA-KCCA (test 4) | 3.1620 × 10^9
PI2-CMA-KCCA (test 5) | 3.0731 × 10^9
Table 4. The final cost of one via-point task with Swayer.
Algorithm | Final Cost
PI2 | 7.1889 × 10^10
PI2-CMA | 5.0168 × 10^10
PI2-KCCA | 4.5494 × 10^10
PI2-CMA-KCCA (test 1) | 2.2861 × 10^10
PI2-CMA-KCCA (test 2) | 2.2581 × 10^10
PI2-CMA-KCCA (test 3) | 2.2641 × 10^10
PI2-CMA-KCCA (test 4) | 2.2554 × 10^10
PI2-CMA-KCCA (test 5) | 2.2513 × 10^10
Table 5. Average decline rate of four algorithms.
Task | PI2 | PI2-CMA | PI2-KCCA | PI2-CMA-KCCA
One Via-Point Task with SCARA | 96.5% | 97.2% | 97.8% | 98.5%
Two Via-Point Task with SCARA | 84.0% | 87.4% | 89.7% | 92.8%
One Via-Point Task with Swayer | 73.0% | 81.3% | 83.0% | 91.5%

Citation: Fu, J.; Li, C.; Teng, X.; Luo, F.; Li, B. Compound Heuristic Information Guided Policy Improvement for Robot Motor Skill Acquisition. Appl. Sci. 2020, 10, 5346. https://doi.org/10.3390/app10155346