Adaptive Proportional Integral Robust Control of an Uncertain Robotic Manipulator Based on Deep Deterministic Policy Gradient
Abstract
1. Introduction
- Considering the uncertainty, time-varying disturbances, and friction in the dynamic model of the n-link robot manipulator system, an adaptive robust term is used to compensate for the system uncertainty. An adaptive PIR control method based on DDPG is proposed, which provides good adaptability and high-precision trajectory tracking for the uncertain n-link robot manipulator system.
- A reward function combining a Gaussian function and the Euclidean distance is proposed; it ensures that the reinforcement learning agent learns efficiently and stably and helps prevent the deep neural network from converging to a local optimum (a rough sketch of such a reward appears after this list).
- Taking a two-link robotic manipulator as an example, simulation results show that the proposed method is effective compared with adaptive control based on radial basis function neural network (RBFNN) approximation and with PIR control using fixed parameters.
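The exact reward is defined in Section 3.6; as a rough, hypothetical illustration of how a Gaussian term and a Euclidean-distance term could be combined, consider the sketch below. The weights w_gauss and w_dist, the width sigma, and the function name are assumptions for illustration, not the paper's definitions.

```python
import numpy as np

def tracking_reward(q, q_des, w_gauss=1.0, w_dist=0.1, sigma=0.05):
    """Illustrative reward combining a Gaussian term and the Euclidean distance.

    q, q_des: arrays of actual and desired joint positions.
    The Gaussian term gives a dense, bounded reward near the desired
    trajectory, while the negative Euclidean-distance term keeps a useful
    gradient when the error is large, which helps the agent avoid the flat
    regions that can trap a network in a local optimum.
    """
    err = np.linalg.norm(q - q_des)              # Euclidean tracking error
    gauss = np.exp(-err**2 / (2.0 * sigma**2))   # peaks at 1 when the error is zero
    return w_gauss * gauss - w_dist * err
```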
2. Dynamic Model of the n-Link Robot Manipulator
3. DDPGPIR Control Design
3.1. PIR Control Design
3.2. Reinforcement Learning and Policy Gradient Method
3.3. DDPG Adaptive PIR Control
3.4. Network Design of DDPGPIR
3.5. Learning Process of DDPGPIR
Algorithm 1. DDPGPIR Algorithm.
Initialize the critic network Q(s, a | θ^Q) and the actor network μ(s | θ^μ)
Initialize the target networks Q′ and μ′ with the same weights: θ^Q′ ← θ^Q, θ^μ′ ← θ^μ
Initialize the replay memory D
Initialize the Gaussian exploration noise N
for episode = 1, …, M do
    Receive the initial observation state s_1
    for t = 1, …, T do
        select the action a_t = μ(s_t | θ^μ) + N_t according to the current policy and exploration noise
        select the execution action from a_t
        if the execution action is not admissible then
            reject it and add a negative number to the reward r_t
        else:
            execute the action, get the observed reward r_t, and observe the new state s_{t+1}
        store the transition (s_t, a_t, r_t, s_{t+1}) in D
        sample a mini-batch of N transitions (s_i, a_i, r_i, s_{i+1}) from D
        set y_i = r_i + γ Q′(s_{i+1}, μ′(s_{i+1} | θ^μ′) | θ^Q′)
        update the critic by minimizing the loss L = (1/N) Σ_i (y_i − Q(s_i, a_i | θ^Q))²
        update the actor using the sampled policy gradient
        softly update the target networks: θ′ ← τθ + (1 − τ)θ′
    end for
end for
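The replay memory and the target-network soft update used above are standard DDPG building blocks (Lillicrap et al.). The following Python sketch shows one common way to implement them; the capacity, batch size, and soft-update rate tau are illustrative assumptions, not the settings used in the paper.

```python
import random
from collections import deque

import numpy as np

class ReplayBuffer:
    """Fixed-size replay memory D from Algorithm 1."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        # "store the transition (s_t, a_t, r_t, s_{t+1}) in D"
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=64):
        # "sample a mini-batch of transitions from D"
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states = map(np.asarray, zip(*batch))
        return states, actions, rewards, next_states

def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging for "softly update the target networks":
    θ′ ← τθ + (1 − τ)θ′, applied element-wise to each weight array."""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(online_params, target_params)]
```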
3.6. Reward Function
4. Experiment and Results
4.1. Learning Results for DDPGPIR
4.2. Control Effect Comparison of the Controller
4.3. Control Performance Index Comparison
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Feng, Z.; Hu, G.; Sun, Y.; Soon, J. An overview of collaborative robotic manipulation in multi-robot systems. Annu. Rev. Control. 2020, 49, 113–127.
- Wang, Z.; Cui, W. For safe and compliant interaction: An outlook of soft underwater manipulators. Proc. Inst. Mech. Eng. Part M J. Eng. Marit. Environ. 2020, 235, 3–14.
- Kuo, C.-H.; Dai, J.S.; Dasgupta, P. Kinematic design considerations for minimally invasive surgical robots: An overview. Int. J. Med Robot. Comput. Assist. Surg. 2012, 8, 127–145.
- Albu-Schaffer, A.; Bertleff, W.; Rebele, B.; Schafer, B.; Landzettel, K.; Hirzinger, G. ROKVISS—Robotics component verification on ISS—Current experimental results on parameter identification. In Proceedings of the 2006 IEEE International Conference on Robotics and Automation, Orlando, FL, USA, 15–16 May 2006; pp. 3879–3885.
- Sage, H.G.; De Mathelin, M.F.; Ostertag, E. Robust control of robot manipulators: A survey. Int. J. Control. 1999, 72, 1498–1522.
- Pan, H.; Xin, M. Nonlinear robust and optimal control of robot manipulators. Nonlinear Dyn. 2013, 76, 237–254.
- Loucif, F.; Kechida, S.; Sebbagh, A. Whale optimizer algorithm to tune PID controller for the trajectory tracking control of robot manipulator. J. Braz. Soc. Mech. Sci. Eng. 2019, 42, 1.
- Elkhateeb, N.A.; Badr, R.I. Novel PID Tracking Controller for 2DOF Robotic Manipulator System Based on Artificial Bee Colony Algorithm. Electr. Control. Commun. Eng. 2017, 13, 55–62.
- Ardeshiri, R.R.; Khooban, M.H.; Noshadi, A.; Vafamand, N.; Rakhshan, M. Robotic manipulator control based on an optimal fractional-order fuzzy PID approach: SiL real-time simulation. Soft Comput. 2019, 24, 3849–3860.
- Ardeshiri, R.R.; Kashani, H.N.; Ahrabi, A.R. Design and simulation of self-tuning fractional order fuzzy PID controller for robotic manipulator. Int. J. Autom. Control. 2019, 13, 595.
- Shah, D.; Chatterjee, S.; Bharati, K.; Chatterjee, S. Tuning of Fractional-Order PID Controller—A Review. In Frontiers in Computer, Communication and Electrical Engineering; Taylor & Francis Group: London, UK, 2016; pp. 323–329.
- Kong, L.; Zhang, S.; Yu, X. Approximate optimal control for an uncertain robot based on adaptive dynamic programming. Neurocomputing 2020, 423, 308–317.
- Kong, L.; He, W.; Dong, Y.; Cheng, L.; Yang, C.; Li, Z. Asymmetric Bounded Neural Control for an Uncertain Robot by State Feedback and Output Feedback. IEEE Trans. Syst. Man Cybern. Syst. 2019, 51, 1735–1746.
- Wang, S. Adaptive Fuzzy Sliding Mode and Robust Tracking Control for Manipulators with Uncertain Dynamics. Complexity 2020, 2020, 1492615.
- Yang, C.; Jiang, Y.; Na, J.; Li, Z.; Cheng, L.; Su, C.-Y. Finite-Time Convergence Adaptive Fuzzy Control for Dual-Arm Robot with Unknown Kinematics and Dynamics. IEEE Trans. Fuzzy Syst. 2018, 27, 574–588.
- Rouhani, E.; Erfanian, A. A Finite-time Adaptive Fuzzy Terminal Sliding Mode Control for Uncertain Nonlinear Systems. Int. J. Control. Autom. Syst. 2018, 16, 1938–1950.
- Yang, Z.; Peng, J.; Liu, Y. Adaptive neural network force tracking impedance control for uncertain robotic manipulator based on nonlinear velocity observer. Neurocomputing 2018, 331, 263–280.
- Guo, Q.; Zhang, Y.; Celler, B.G.; Su, S.W. Neural Adaptive Backstepping Control of a Robotic Manipulator with Prescribed Performance Constraint. IEEE Trans. Neural Netw. Learn. Syst. 2018, 30, 3572–3583.
- Gosavi, A. Reinforcement Learning: A Tutorial Survey and Recent Advances. INFORMS J. Comput. 2009, 21, 178–192.
- Khan, S.G.; Herrmann, G.; Lewis, F.L.; Pipe, T.; Melhuish, C. Reinforcement learning and optimal adaptive control: An overview and implementation examples. Annu. Rev. Control. 2012, 36, 42–59.
- Liu, C.; Xu, X.; Hu, D. Multiobjective Reinforcement Learning: A Comprehensive Overview. IEEE Trans. Syst. Man Cybern. Syst. 2014, 45, 385–398.
- Kukker, A.; Sharma, R. Stochastic Genetic Algorithm-Assisted Fuzzy Q-Learning for Robotic Manipulators. Arab. J. Sci. Eng. 2021, 1–13.
- Runa; Sharma, R. A Lyapunov theory based Adaptive Fuzzy Learning Control for Robotic Manipulator. In Proceedings of the International Conference on Recent Developments in Control, Automation and Power Engineering, Noida, India, 12–13 March 2015; pp. 247–252.
- Kim, W.; Kim, T.; Kim, H.J.; Kim, S. Three-link Planar Arm Control Using Reinforcement Learning. In Proceedings of the 2017 14th International Conference on Ubiquitous Robots and Ambient Intelligence, Jeju, Korea, 28 June–1 July 2017; pp. 424–428.
- Du, T.; Cox, M.T.; Perlis, D.; Shamwell, J.; Oates, T. From Robots to Reinforcement Learning. In Proceedings of the 25th International Conference on Tools with Artificial Intelligence, Herndon, VA, USA, 4–6 November 2013; pp. 540–545.
- Agostinelli, F.; Hocquet, G.; Singh, S.; Baldi, P. From Reinforcement Learning to Deep Reinforcement Learning: An Overview. In Braverman Readings in Machine Learning: Key Ideas from Inception to Current State; Rozonoer, L., Mirkin, B., Muchnik, I., Eds.; Springer: Cham, Switzerland, 2018; Volume 11100, pp. 298–328.
- Wang, H.-N.; Liu, N.; Zhang, Y.-Y.; Feng, D.-W.; Huang, F.; Li, D.-S.; Zhang, Y.-M. Deep reinforcement learning: A survey. Front. Inf. Technol. Electron. Eng. 2020, 21, 1726–1744.
- Mousavi, S.S.; Schukat, M.; Howley, E. Deep Reinforcement Learning: An Overview. In Proceedings of the Sai Intelligent Systems Conference, London, UK, 21–22 September 2016; Volume 16, pp. 426–440.
- Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous Control with Deep Reinforcement Learning. Comput. Sci. 2015. Available online: https://arxiv.org/abs/1509.02971 (accessed on 18 July 2021).
- Shi, X.; Li, Y.; Sun, B.; Xu, H.; Yang, C.; Zhu, H. Optimizing zinc electrowinning processes with current switching via Deep Deterministic Policy Gradient learning. Neurocomputing 2019, 380, 190–200.
- Sun, M.; Zhao, W.; Song, G.; Nie, Z.; Han, X.; Liu, Y. DDPG-Based Decision-Making Strategy of Adaptive Cruising for Heavy Vehicles Considering Stability. IEEE Access 2020, 8, 59225–59246.
- Zhao, H.; Zhao, J.; Qiu, J.; Liang, G.; Dong, Z.Y. Cooperative Wind Farm Control with Deep Reinforcement Learning and Knowledge-Assisted Learning. IEEE Trans. Ind. Inform. 2020, 16, 6912–6921.
- Özyer, B. Adaptive fast sliding neural control for robot manipulator. Turk. J. Electr. Eng. Comput. Sci. 2020, 28, 3154–3167.
- Yu, X.; Zhang, S.; Fu, Q.; Xue, C.; Sun, W. Fuzzy Logic Control of an Uncertain Manipulator with Full-State Constraints and Disturbance Observer. IEEE Access 2020, 8, 24284–24295.
- Nohooji, H.R. Constrained neural adaptive PID control for robot manipulators. J. Frankl. Inst. 2020, 357, 3907–3923.
- Ge, S.S.; Lee, T.H.; Harris, C.J. Adaptive Neural Network Control of Robotic Manipulator; World Scientific: London, UK, 1998.
- Zhang, D.; Wei, B. A review on model reference adaptive control of robotic manipulators. Annu. Rev. Control. 2017, 43, 188–198.
- Liu, J.; Dong, X.; Yang, Y.; Chen, H. Trajectory Tracking Control for Uncertain Robot Manipulators with Repetitive Motions in Task Space. Math. Probl. Eng. 2021, 2021, 8838927.
- Xu, W.; Cai, C.; Zou, Y. Neural-network-based robot time-varying force control with uncertain manipulator–environment system. Trans. Inst. Meas. Control 2014, 36, 999–1009.
- Yu, L.; Fei, S.; Huang, J.; Gao, Y. Trajectory Switching Control of Robotic Manipulators Based on RBF Neural Networks. Circuits Syst. Signal Process. 2013, 33, 1119–1133.
- Wang, L.; Chai, T.; Yang, C. Neural-Network-Based Contouring Control for Robotic Manipulators in Operational Space. IEEE Trans. Control. Syst. Technol. 2011, 20, 1073–1080.
- Wang, N.; Wang, D. Adaptive manipulator control based on RBF network approximation. In Proceedings of the 2017 Chinese Automation Congress, Jinan, China, 20–22 October 2017; pp. 2625–2630.
| Controller | Indicator | Joint 1 | Joint 2 |
|---|---|---|---|
| RBFNN | IAE | 1.5978 | 0.5440 |
| RBFNN | ITAE | 13.4532 | 3.9454 |
| PIR | IAE | 0.4217 | 0.3476 |
| PIR | ITAE | 1.6596 | 1.5451 |
| DDPGPIR | IAE | 0.0866 | 0.0410 |
| DDPGPIR | ITAE | 0.0285 | 0.0848 |
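IAE (integral of absolute error) and ITAE (integral of time-weighted absolute error) in the table are standard tracking-performance indices. A minimal sketch of how they can be computed from a sampled tracking-error signal follows; the trapezoidal discretization and the example error signal are illustrative assumptions, not the paper's simulation data.

```python
import numpy as np

def iae(t, e):
    """Integral of Absolute Error: ∫ |e(t)| dt, via the trapezoidal rule."""
    return np.trapz(np.abs(e), t)

def itae(t, e):
    """Integral of Time-weighted Absolute Error: ∫ t·|e(t)| dt."""
    return np.trapz(t * np.abs(e), t)

# Example: a decaying tracking error over a 10 s run sampled at 1 kHz
t = np.linspace(0.0, 10.0, 10_001)
e = 0.5 * np.exp(-2.0 * t)
print(iae(t, e), itae(t, e))
```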
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Lu, P.; Huang, W.; Xiao, J.; Zhou, F.; Hu, W. Adaptive Proportional Integral Robust Control of an Uncertain Robotic Manipulator Based on Deep Deterministic Policy Gradient. Mathematics 2021, 9, 2055. https://doi.org/10.3390/math9172055