Adaptive Proportional Integral Robust Control of an Uncertain Robotic Manipulator Based on Deep Deterministic Policy Gradient

An adaptive proportional integral robust (PIR) control method based on the deep deterministic policy gradient (DDPGPIR) is proposed for n-link robotic manipulator systems with model uncertainty and time-varying external disturbances. In this paper, the uncertainty of the nonlinear dynamic model, the time-varying external disturbance, and the friction resistance of the n-link robotic manipulator are lumped into a single system uncertainty, and an adaptive robust term is used to compensate for it. In addition, dynamic information of the n-link robotic manipulator is used as the input of the DDPG agent to search for the optimal parameters of the proportional integral robust controller in a continuous action space. To ensure that the DDPG agent learns stably and efficiently, a reward function combining a Gaussian function and the Euclidean distance is designed. Finally, taking a two-link robot as an example, simulation experiments compare DDPGPIR with other control methods. The results show that DDPGPIR achieves better adaptability, robustness, and higher trajectory tracking accuracy.


Introduction
A robotic manipulator is similar to the human arm and can replace or assist humans in completing tasks such as picking, placing, painting, welding, and assembling. The manipulator plays an important role in industrial production, underwater exploration, medical applications, aerospace, and other fields [1][2][3][4]. To achieve a better control effect and meet the control requirements of different fields, the manipulator must be able to track a trajectory with high precision. Due to the highly nonlinear dynamic characteristics of a robotic manipulator, and the influence of joint friction and time-varying external interference in practical applications, it is difficult to obtain accurate model parameters. Therefore, when designing a control strategy, good adaptability and high-precision trajectory tracking under the uncertainty of the n-link robotic manipulator system are necessary.
In order to better control the robot manipulator, the robustness of the control strategy has attracted extensive attention. Robustness here refers to the ability to produce good dynamic behavior in the face of modelling errors and unmodelled dynamics of the robot manipulator [5,6]. Loucif and Kechida [7] and Elkhateeb et al. [8] used a whale optimization algorithm and an artificial bee colony algorithm, respectively, to optimize the parameters of the proportional integral derivative (PID) controller, improve the trajectory tracking accuracy of the robot manipulator under unmodeled dynamics, and give the controller a certain robustness. In order to model the control process of the robot manipulator more accurately, Ardeshiri et al. [9,10] proposed a fractional-order fuzzy PID controller.
Lillicrap et al. [29] proposed the deep deterministic policy gradient (DDPG) algorithm based on the actor-critic framework. Shi et al. [30] used the DDPG algorithm to control the zinc electrowinning process, which effectively solved the problems of inaccurate modeling and time delay while also being more energy-efficient than the traditional control method. Sun et al. [31] used this algorithm to solve the heavy-vehicle adaptive cruise decision-making problem, showing good adaptability in unfamiliar and complex environments. Zhao et al. [32] solved the cooperative control problem of wind farms through the DDPG algorithm and reduced the cost of the learning process. The DDPG algorithm therefore appears effective for reinforcement learning tasks over continuous state spaces.
The purpose of this paper is to establish an n-link robotic manipulator control system under model uncertainty and time-varying external disturbances. An adaptive PIR control method based on deep reinforcement learning is proposed. The modeling error of the nonlinear dynamic model of an n-link manipulator is compensated by robust control, and the parameters of the controller are adjusted by a DDPG algorithm to improve the adaptability of the controller to the uncertain nonlinear dynamic model. The main contributions of this paper are as follows:

• Considering the uncertainty and time-varying disturbance of the dynamic model of the n-link robot manipulator system and the influence of friction resistance, an adaptive robust term is used to compensate for the system uncertainty. An adaptive PIR control method based on DDPG is proposed, which offers good adaptability and high-precision trajectory tracking for the uncertain n-link robot manipulator system.
• A reward function combining a Gaussian function and the Euclidean distance is proposed, which ensures that the reinforcement learning agent learns efficiently and stably and effectively prevents the deep neural network from converging to a local optimum.
• Taking a two-link robotic manipulator as an example, simulation results show that the proposed method is effective compared with adaptive control based on radial basis function neural network (RBFNN) approximation and PIR control with fixed parameters.

Dynamic Model of the n-Link Robot Manipulator
The dynamic model of the n-link robotic manipulator system expresses the relationship between the joint torque and the position, velocity, and acceleration of the links:

M(q)q̈ + C(q, q̇)q̇ + G(q) + F_f(q̇) + τ_d = τ, (1)

where q ∈ Rⁿ is the joint position vector of the manipulator, q̇ ∈ Rⁿ is the velocity vector, q̈ ∈ Rⁿ is the acceleration vector, M(q) ∈ Rⁿˣⁿ is the mass inertia matrix, C(q, q̇) ∈ Rⁿˣⁿ is the Coriolis and centrifugal matrix, G(q) ∈ Rⁿ is the gravity vector, F_f(q̇) ∈ Rⁿ is the friction vector, τ_d ∈ Rⁿ is the time-varying external disturbance, and τ ∈ Rⁿ is the torque vector acting on the joints.
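As a minimal sketch of how Equation (1) is evaluated, the following pure-Python function computes the joint torque τ from the dynamic terms. The numeric values are illustrative assumptions, not the paper's parameters.

```python
# Hedged sketch: tau = M(q)*qdd + C(q,qd)*qd + G(q) + Ff(qd) + tau_d (Equation (1)).
# All matrices/vectors below are illustrative placeholders, not the paper's model.

def mat_vec(A, x):
    """Multiply an n x n matrix (list of rows) by a vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def inverse_dynamics(M, C, G, Ff, tau_d, qdd, qd):
    """Torque required at each joint for accelerations qdd at velocities qd."""
    inertial = mat_vec(M, qdd)   # M(q) * qddot
    coriolis = mat_vec(C, qd)    # C(q, qdot) * qdot
    return [i + c + g + f + d
            for i, c, g, f, d in zip(inertial, coriolis, G, Ff, tau_d)]

# Two-link example with illustrative numbers:
M = [[2.0, 0.3], [0.3, 1.0]]
C = [[0.0, -0.1], [0.1, 0.0]]
G = [9.8, 4.9]
tau = inverse_dynamics(M, C, G, Ff=[0.05, 0.05], tau_d=[0.0, 0.0],
                       qdd=[1.0, 0.0], qd=[0.2, 0.2])
```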
The precise values of M(q), C(q, q̇), and G(q) in the dynamic model are difficult to obtain due to a series of influential factors, such as the complexity of the manipulator mechanism, environmental variations, and measurement errors in the actual operation of the manipulator. Therefore, the actual values of M(q), C(q, q̇), and G(q) are divided into a nominal model part and an error part as follows:

M(q) = M₀(q) + ΔM, C(q, q̇) = C₀(q, q̇) + ΔC, G(q) = G₀(q) + ΔG. (2)

The dynamic model Formula (1) of the n-link robot manipulator can then also be expressed as:

M₀(q)q̈ + C₀(q, q̇)q̇ + G₀(q) = τ + d, (3)

where d = −(ΔM q̈ + ΔC q̇ + ΔG + F_f(q̇) + τ_d) collects the model error, friction, and external disturbance into a single lumped uncertainty.
Property 1 ([33]). The mass inertia matrix M₀(q) is symmetric, positive definite, and bounded, and satisfies

µ_m ≤ ‖M₀(q)‖ ≤ µ_n,

where ‖M₀(q)‖ is the norm of the mass inertia matrix M₀(q), and µ_n and µ_m are the upper and lower bounds, respectively, both positive numbers.
Property 2 ([33]). The matrix Ṁ₀(q) − 2C₀(q, q̇) is skew symmetric; that is, for any ξ ∈ Rⁿ,

ξᵀ(Ṁ₀(q) − 2C₀(q, q̇))ξ = 0.

DDPGPIR Control Design
In this paper, a DDPGPIR control strategy is proposed for the n-link robotic manipulator system with model uncertainty and time-varying external disturbances. The control strategy comprises the PIR control design, the reinforcement learning and policy gradient method, DDPG adaptive PIR control, the DDPGPIR network design, the DDPGPIR learning process, and the reward function.

PIR Control Design
In the n-link robotic manipulator system, the position error e(t) is the difference between the expected joint angle q_d(t) and the actual joint angle q(t). The position error and the error function are defined as follows:

e(t) = q_d(t) − q(t), s(t) = ė(t) + Λe(t),

where Λ is a positive definite gain matrix. Then:

q̇ = q̇_d + Λe − s, q̈ = q̈_d + Λė − ṡ.
Therefore, the dynamic model Equation (3) of the n-link robotic manipulator can be expressed in terms of the error function s. In PIR control [36], the control law is designed as follows:

τ = τ_m + K_p s + K_i ∫₀ᵗ s dt + τ_s,

where τ is the torque applied to each joint of the n-link robotic manipulator, τ_m is the model-based torque control term, K_p and K_i are the gains of the proportional and integral terms, respectively, and τ_s is the robust term used to compensate for the nonlinear dynamic model error and the external disturbance. Combining Equations (13) and (16) yields the closed-loop error dynamics. Selecting the Lyapunov function V = ½sᵀM₀(q)s and differentiating both sides leads to V̇ ≤ −sᵀK_p s ≤ 0; therefore, the control system is asymptotically stable.
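A single-joint step of the PIR control law above can be sketched as follows. The robust term is modeled here as K_r·sign(s), a common choice; the paper's exact form of τ_s may differ, and all gains below are illustrative.

```python
# Hedged sketch of tau = tau_m + Kp*s + Ki*integral(s) + tau_s for one joint.
# The robust term tau_s = Kr*sign(s) is an assumed form, not the paper's exact one.

def sign(x):
    return (x > 0) - (x < 0)

def pir_step(s, integ_s, tau_m, Kp, Ki, Kr, dt):
    """One control step: returns (torque, updated error integral)."""
    integ_s = integ_s + s * dt      # accumulate the integral of the filtered error
    tau_s = Kr * sign(s)            # robust compensation term (assumed form)
    tau = tau_m + Kp * s + Ki * integ_s + tau_s
    return tau, integ_s

tau, integ = pir_step(s=0.1, integ_s=0.0, tau_m=1.0, Kp=60, Ki=45, Kr=3, dt=0.1)
```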

Reinforcement Learning and Policy Gradient Method
Reinforcement learning is an important branch of machine learning, mainly composed of an environment, an agent, rewards, states, and actions. When the agent performs action a_t on the environment in state s_t, the environment gives the agent a reward r_{t+1} and the state changes to the next state s_{t+1}. Future rewards are weighted by the discount coefficient γ (0 ≤ γ ≤ 1), so the cumulative reward r_t can be expressed as:

r_t = Σ_{k=0}^{∞} γᵏ r_{t+k+1}.

The policy of reinforcement learning is the functional relationship π between the state space and the action space. The objective of a policy-based reinforcement learning method is to find the optimal policy π* that maximizes the cumulative reward. In the policy gradient method, the policy is updated along the gradient direction of the expected cumulative reward:

θ_{h+1} = θ_h + ϑ∇_θ J(θ_h),

where θ is the parameter vector of the policy, J(θ) is the objective function of reinforcement learning, σ = (s₀, a₀, s₁, a₁, ··· s_l, a_l) is a state-action sequence, P(σ|θ) is the probability of occurrence of the sequence σ, ϑ is the learning rate, and h is the index of the current update.
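The discounted cumulative reward defined above can be computed with a simple backward accumulation:

```python
# Sketch of the discounted return r_t = sum_k gamma^k * r_{t+k+1},
# computed backward so each step is g = r + gamma * g (illustrative rewards).

def discounted_return(rewards, gamma):
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

G0 = discounted_return([1.0, 1.0, 1.0], gamma=0.5)  # 1 + 0.5 + 0.25
```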

DDPG Adaptive PIR Control
The schematic diagram of the DDPGPIR control system of the n-link robot manipulator is shown in Figure 1. The input of the controller is the error vector e = (e₁, e₂, ··· e_n) of the n-link robot manipulator. The output is the torque vector acting on the joints, τ = (τ₁, τ₂, ··· τ_n). The control performance of the DDPGPIR mainly depends on the parameter vector g = (K_p1, K_i1, K_s1, K_r1, ··· K_pn, K_in, K_sn, K_rn). The control problem of the n-link robot manipulator can be expressed as follows: the vector q_d is the expected joint angle, q is the actual joint angle, p is the physical parameter of the n-link robot manipulator, and j is the j-th link.
To improve the adaptability and trajectory tracking accuracy of the n-link robot manipulator, the parameter vector of DDPGPIR needs to be adjusted and optimized in real time. However, setting the parameters is time-consuming and the optimization process is continuous, so adjusting the parameters manually is not advisable. Therefore, finding the optimal strategy function µ*(x) is an effective way to solve this problem. The state vector x_t = (τ₁, e₁, ∫e₁dt, ··· , τ_n, e_n, ∫e_n dt) is input into the optimal strategy function to obtain the optimal parameter vector g_t. The goal of reinforcement learning is to find the optimal strategy for maximizing cumulative rewards.
The objective function can be expressed as follows, where β is the behavior strategy and γ ∈ (0, 1) is the discount factor. To find the optimal strategy for maximizing the objective function, the strategy gradient method is usually used to select and execute actions from the strategy's probability distribution in each time step. However, this method needs to sample continuous actions in each time step, which is computationally expensive. To solve this problem, the deterministic strategy gradient method is used to simplify the calculation. Therefore, the gradient of the objective function is calculated as follows:

Network Design of DDPGPIR
The network structure of DDPGPIR includes an actor network, a critic network, and a corresponding target network for each. The structure of the actor network is shown in Figure 2. The input is the state vector x_t of the n-link robot manipulator, the two middle hidden layers are a fully connected layer and an activation layer, and the output layer gives the parameter vector g_t. The structure of the critic network is shown in Figure 3. The input includes the state vector x_t and the parameter vector g_t. The four middle hidden layers are a fully connected layer, an activation layer, a superposition layer, and an activation layer. The output layer gives the Q value of the action.

To make the training data relatively independent, accelerate convergence, and improve the stability of the network update process, the data used for the current network update are not the most recent transitions obtained by decision-making, but M small-batch samples randomly selected from the experience replay memory. The critic network includes the current critic network Q(x, g|θ^Q) and the target critic network Q′(x, g|θ^Q′). The current critic network is updated by gradient descent, minimizing the loss function:

L(θ^Q) = (1/M) Σ_i (Q_target,i − Q(x_i, g_i|θ^Q))²,

where Q_target,i = r_i + γQ′(x_{i+1}, µ′(x_{i+1}|θ^µ′)|θ^Q′) is the value of the target critic network, Q(x, g|θ^Q) is the value of the current critic network, i denotes the i-th sample, and γ (0 ≤ γ ≤ 1) is the discount rate. The actor network includes the current actor network µ(x|θ^µ) and the target actor network µ′(x|θ^µ′). The current actor network is updated with the deterministic strategy gradient:

∇_θ^µ J_β(µ) ≈ (1/M) Σ_i ∇_g Q(x, g|θ^Q)|_{x=x_i, g=µ(x_i)} ∇_θ^µ µ(x; θ^µ)|_{x=x_i},

where ∇_θ^µ J_β(µ) represents the gradient direction of the Q value caused by the action strategy µ, ∇_g Q(x, g|θ^Q)|_{x=x_i, g=µ(x_i)} represents the change in the Q value caused by action µ(x_i) in the current state, and ∇_θ^µ µ(x; θ^µ)|_{x=x_i} is the gradient direction of the current strategy.
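The critic's minibatch loss can be sketched as below, with the networks stood in for by plain functions. The toy stand-ins and values are illustrative assumptions, not the paper's networks.

```python
# Sketch of the critic minibatch loss: the TD target
# Q_target_i = r_i + gamma * Q'(x_{i+1}, mu'(x_{i+1})) against the current critic.
# q, q_target_net, and mu_target are toy stand-ins for neural networks.

def critic_loss(batch, q, q_target_net, mu_target, gamma):
    """batch: list of (x, g, r, x_next) transitions; returns mean squared TD error."""
    loss = 0.0
    for x, g, r, x_next in batch:
        q_target = r + gamma * q_target_net(x_next, mu_target(x_next))
        loss += (q_target - q(x, g)) ** 2
    return loss / len(batch)

q = lambda x, g: 0.5 * x + g        # current critic (toy)
q_t = lambda x, g: 0.5 * x + g      # target critic (toy)
mu_t = lambda x: 0.1 * x            # target actor (toy)
batch = [(1.0, 0.2, 1.0, 2.0), (2.0, 0.1, 0.5, 0.0)]
loss = critic_loss(batch, q, q_t, mu_t, gamma=0.9)
```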
The target critic network and the target actor network are updated by a soft update with update rate ρ as follows:

θ^Q′ ← ρθ^Q + (1 − ρ)θ^Q′, θ^µ′ ← ρθ^µ + (1 − ρ)θ^µ′.
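Applied elementwise to a parameter list, the soft update is a one-liner:

```python
# Sketch of the soft target update theta' <- rho*theta + (1 - rho)*theta',
# applied elementwise to parameter lists (values illustrative).

def soft_update(theta, theta_target, rho):
    return [rho * a + (1 - rho) * b for a, b in zip(theta, theta_target)]

new_target = soft_update([1.0, 2.0], [0.0, 0.0], rho=0.01)
```

A small ρ makes the target networks track the online networks slowly, which stabilizes the TD targets during training.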

Learning Process of DDPGPIR
The DDPGPIR learning process applied to the manipulator is shown in Figure 4.
µ(x|θ^µ) and µ′(x|θ^µ′) are the current actor network and the target actor network, respectively, and Q(x, g|θ^Q) and Q′(x, g|θ^Q′) are the current critic network and the target critic network, respectively. The learning process is described in Algorithm 1. First, the parameters of the online and target networks (Q, µ, Q′, µ′), the replay memory RM, and the noise G are initialized. After the dynamic information x_t of the manipulator is input into the DDPGPIR agent, the optimal parameter vector g_t of the PIR controller is determined according to the strategy µ and the noise G, and the output torque of the controller acts on the manipulator. In addition, the system monitors the joint angle q(t) in real time. If q(t) is within a reasonable range, the corresponding reward is obtained after the action is executed, and the next state x_{t+1} is observed. Otherwise, the action is immediately stopped, a negative reward is given, and the agent re-selects and executes a new action. The tuple (x_t, g_t, r_t, x_{t+1}) formed in this process is stored in the experience replay memory RM. Small batches of tuples are randomly sampled from RM; the critic network is updated by minimizing the loss function, the actor network is updated by the deterministic strategy gradient method, and the target networks are updated by the soft update method.

Algorithm 1. DDPGPIR Algorithm.
Initialize the critic network Q(x, g|θ^Q) and the actor network µ(x|θ^µ)
Initialize the target networks Q′(x, g|θ^Q′) and µ′(x|θ^µ′) with the same weights
Initialize the replay memory RM
Initialize the Gaussian noise G
for episode = 1 ··· M do
    Receive initial observation state x₁
    for t = 1 ··· T do
        Select action g_t = (K_p1, K_i1, K_s1, K_r1, ··· K_pn, K_in, K_sn, K_rn) = µ(x_t|θ^µ) + G
        if q(t) ∉ [−ε, ε]: reject g_t and add a negative value to r_t
        else: execute g_t, observe reward r_t and new state x_{t+1}
        Store transition (x_t, g_t, r_t, x_{t+1}) in RM
        Sample a random minibatch of transitions from RM
        Update the critic network by minimizing the loss function
        Update the actor network with the deterministic strategy gradient
        Softly update the target networks
    end for
end for
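The replay memory used in Algorithm 1 can be sketched as a bounded buffer with random minibatch sampling; the capacity and transition values below are illustrative.

```python
# Minimal replay-memory sketch for Algorithm 1: store (x_t, g_t, r_t, x_{t+1})
# tuples and draw random minibatches to decorrelate the training data.

import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity):
        self.buf = deque(maxlen=capacity)   # oldest transitions drop off automatically

    def store(self, transition):
        self.buf.append(transition)

    def sample(self, m):
        return random.sample(list(self.buf), m)

rm = ReplayMemory(capacity=100)
for t in range(10):
    rm.store((t, 0.0, 1.0, t + 1))   # illustrative transitions
batch = rm.sample(4)
```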

Reward Function
As stated, most reinforcement learning tasks have a reward function that rewards each behavior of the agent accordingly, so that the agent can choose an appropriate action in each state and obtain a higher cumulative reward. To adapt to different reinforcement learning tasks, the reward function must be general and provide abundant information to the reinforcement learning agent. In the problem discussed in this paper, the trajectory tracking error e(t) and the joint angle q(t) of the manipulator are the variables of most concern. When the tracking error e(t) increases, or the joint angle q(t) exceeds its reasonable range, a negative reward should be given; otherwise, a positive reward should be given. Therefore, a reward function combining the Gaussian function and the Euclidean distance is designed as follows, where α, β, and δ are the coefficients of the reward terms, q_dj(t) and q_j(t) are the expected joint angle and the actual joint angle of the j-th joint, respectively, and ε is the reasonable critical value of the joint angle.
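A hedged sketch of such a reward is given below. The exact formula and coefficients in the paper are not reproduced here; the Gaussian term, the distance penalty, and the out-of-range rule are illustrative assumptions that follow the description above.

```python
# Hedged sketch of a reward combining a Gaussian bonus and the (negative)
# Euclidean distance between desired and actual joint angles. alpha, beta,
# delta, eps, and the penalty rule are illustrative assumptions.

import math

def reward(q_d, q, alpha=1.0, beta=0.5, delta=10.0, eps=math.pi):
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(q_d, q)))  # Euclidean distance
    r = alpha * math.exp(-dist ** 2) - beta * dist               # Gaussian bonus minus distance
    if any(abs(qi) > eps for qi in q):                           # out-of-range penalty
        r -= delta
    return r

r_perfect = reward([0.5, -0.5], [0.5, -0.5])   # zero tracking error
```

With zero error the agent receives the full Gaussian bonus α; the reward decays smoothly as the error grows, which keeps the learning signal informative.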

Experiment and Results
To verify the control performance of DDPGPIR, taking a two-link robotic manipulator as an example, the DDPGPIR, PIR, and RBFNN controllers are simulated and compared in MATLAB/Simulink. The dynamic model of the two-link manipulator can be derived by the Lagrange method [37,38]. The widely studied dynamic model and parameters can be expressed as follows [39]:

G(q) = [p₄g cos q₁ + p₅g cos(q₁ + q₂); p₅g cos(q₁ + q₂)].

In order to achieve a better control effect and facilitate comparisons with other control methods, the simulation sampling step is set to 0.1 s and the simulation period is set to 20 s. The initial state of the system is q₁(0) = −0.5 rad, q₂(0) = −0.5 rad. The expected trajectory is q_d1 = sin(0.5πt), q_d2 = sin(0.5πt). The friction is F_f(q̇), and the external disturbance is τ_d = 10 sin(q̇). After many attempts, a set of appropriate PIR controller parameters was selected: K_p1 = 60, K_i1 = 45, K_s1 = 35, K_r1 = 3, K_p2 = 60, K_i2 = 45, K_s2 = 35, K_r2 = 3.
RBFNN has good function approximation and generalization ability and is widely used in nonlinear function modeling [40,41]. The adaptive control of manipulators based on RBFNN approximation is as follows [42]: where W* is the network weight vector, x is the input signal of the network, ϕ(x) is the column vector of basis functions, K_v is the coefficient of the error function term, and v is the robust term used to overcome the approximation error of the neural network.

Learning Results for DDPGPIR
In Figure 5, the reward value obtained by the DDPGPIR agent in the initial learning process is low, because the agent is in the exploratory stage. However, as the number of learning episodes increases, the reward value gradually increases, stabilizes, and approaches the expected cumulative reward, which verifies that the reward function proposed in this paper can effectively prevent the deep neural network from converging to a local optimum. At the same time, the correctness and stability of the DDPGPIR model are demonstrated. Figure 6 shows the evolution of the controller parameters. Because the desired trajectory is constantly changing, the controller parameters are also adjusted in real time to improve the trajectory tracking accuracy.
Figures 7-9 show the trajectory tracking performance of the RBFNN, PIR, and DDPGPIR controllers. The figures show that the DDPGPIR controller has a shorter response time and higher trajectory tracking accuracy than the PIR and RBFNN controllers in the presence of friction and time-varying external interference. Figures 10 and 11 show the trajectory tracking errors of the DDPGPIR, PIR, and RBFNN controllers, respectively. It can be seen that, compared with DDPGPIR, the PIR and RBFNN controllers have larger overshoot and trajectory tracking errors.

Control Performance Index Comparison
To further highlight the effectiveness of the DDPGPIR controller, the integral absolute error (IAE) and the integral time absolute error (ITAE) were used to evaluate the performance of the controllers. The definitions of IAE and ITAE are as follows:

IAE = ∫₀ᵀ |e(t)| dt, ITAE = ∫₀ᵀ t|e(t)| dt.
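In a sampled simulation, both indices can be approximated by Riemann sums over the recorded error sequence; the error samples below are illustrative.

```python
# Sketch of IAE = integral |e(t)| dt and ITAE = integral t*|e(t)| dt,
# approximated by Riemann sums over sampled errors (illustrative values).

def iae_itae(errors, dt):
    iae = sum(abs(e) for e in errors) * dt
    itae = sum(k * dt * abs(e) for k, e in enumerate(errors)) * dt
    return iae, itae

iae, itae = iae_itae([0.5, -0.25, 0.1, 0.0], dt=0.1)
```

ITAE weights late errors more heavily than IAE, so it penalizes slow convergence and sustained steady-state error.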


Conclusions
An adaptive PIR control method based on deep reinforcement learning is proposed for the n-link robot manipulator system with model uncertainty and time-varying external disturbances. In this method, the parameters of the PIR controller are adjusted and optimized in real time by the DDPG algorithm, and the adaptive robust term compensates for the uncertainty of the robot manipulator system. The model-free reinforcement learning method does not rely on expert knowledge or human intervention, and the deep-neural-network agent avoids both the loss of control accuracy caused by sparse discretization and the curse of dimensionality caused by dense discretization. In addition, a reward function combining the Gaussian function and the Euclidean distance is designed to ensure efficient and stable learning of the reinforcement learning agent.
The proposed method was applied to control a two-link robot manipulator with model uncertainty and external disturbance. The experimental results show that the reward value obtained by the DDPGPIR agent increases gradually as learning progresses and finally stabilizes close to the expected reward value, which demonstrates the correctness and stability of the DDPGPIR model. In addition, compared with PIR and RBFNN, DDPGPIR has better adaptability and robustness and higher-precision trajectory tracking under the uncertainty of the n-link robot manipulator system, and it also outperforms PIR and RBFNN on the IAE and ITAE performance indices.
In future work, since the proposed control method can control the n-link robot arm system, it may be applied to more complex control tasks, such as unmanned aerial vehicles. However, the ability of the control system to handle emergencies remains a thorny issue; our follow-up work will continue in-depth research on this problem.

Conflicts of Interest:
The authors declare no conflict of interest.