Nonlinear Nonsingular Fast Terminal Sliding Mode Control Using Deep Deterministic Policy Gradient

Featured Application: The control strategy proposed in this paper can be applied to joint position and velocity tracking of industrial robots (serial or parallel manipulators). Theoretically, it is also suitable for general second-order nonlinear systems, such as inverted pendulum control, motor coupling control, and dual-manipulator cooperative control.

Abstract: Background: As a control strategy for industrial robots, sliding mode control has the advantages of fast response and simple physical implementation, but it still suffers from chattering and from the low tracking accuracy that chattering causes. This paper proposes a new sliding mode control strategy for industrial robot control that effectively solves these problems. Methods: A deep deterministic policy gradient–nonlinear nonsingular fast terminal sliding mode control (DDPG–NNFTSMC) strategy is proposed for industrial robot control. To improve the tracking control accuracy and anti-interference ability, DDPG is used to approximate the uncertainties of the system in real time, which ensures the robustness of the system in various uncertain environments. A Lyapunov function is used to prove the stability and finite-time convergence of the system. Compared with nonsingular terminal sliding mode control (NTSMC), the time to reach the equilibrium point is shorter. With the help of MATLAB/Simulink, the tracking accuracy and control effects are compared with traditional terminal sliding mode control (TSMC), NTSMC, and radial basis function–sliding mode control (RBF–SMC). The results show that the proposed method has the advantages of nonsingularity, finite-time convergence, and small tracking error. The motion accuracy and anti-interference ability of the uncertain manipulator system are further improved, and the chattering problem during motion is effectively eliminated.


Introduction
In recent years, with the development of industrial robots, problems such as nonlinearity, external interference, and various uncertainties have appeared, and the performance requirements on the control system have become increasingly strict. At present, there are many control methods for general nonlinear systems, e.g., adaptive control, fuzzy control, neural network control, and sliding mode control (SMC) [1][2][3][4]. Among them, SMC continuously changes the structure of the system according to its current state during the dynamic process, forcing the system to move along the state trajectory of a predetermined sliding mode; the corresponding hypersurface in the state space is defined as the sliding surface [5]. SMC is strongly robust to external interference and to the uncertainty of nonlinear systems [6,7]. Therefore, SMC is widely used in the control systems of industrial robots and manipulators [8][9][10][11]. However, traditional sliding mode variable structure control has some disadvantages, such as singularity, uncertain convergence time, and chattering. The main contributions of this paper are as follows:

•	It is proved by mathematical derivation that NNFTSMC is nonsingular and converges in finite time, and the robustness and stability of the system are verified by the Lyapunov theorem;
•	DDPG is used to adaptively approximate the uncertainty of the system model and to eliminate chattering, giving the control system a smooth input and improving its anti-interference ability; this ensures the robustness of the system under various conditions such as mass change, friction, external disturbance, and modeling uncertainty [5,6,12];
•	On the basis of eliminating chattering with DDPG, the method achieves low steady-state error and high-precision position tracking.
Compared with nonlinear control methods such as TSMC, NTSMC, and RBF-SMC, the proposed method has better tracking performance and stronger resistance to interference and uncertainty. The effectiveness and superiority of the proposed control method are verified. Finally, the conclusion is given. Table 1 provides the definitions of the acronyms used.

Manipulator Model
According to [5,22,31], the dynamic differential equation model of an n-DOF manipulator system is as follows:

$$M(q)\ddot{q} + C(q,\dot{q})\dot{q} + G(q) + F(\dot{q}) + \tau_d(t) = \tau(t) \quad (1)$$

where q, q̇, q̈ ∈ R^n are the position, velocity, and acceleration of the manipulator joints; M(q) = M̄(q) + δM(q) ∈ R^{n×n} is the actual n × n inertia matrix; C(q, q̇) = C̄(q, q̇) + δC(q, q̇) is the matrix of centrifugal and Coriolis terms, with C(q, q̇)q̇ ∈ R^{n×1}; M̄(q) and C̄(q, q̇) form the nominal model, while δM(q) and δC(q, q̇) are the errors of the real dynamic model; G(q) ∈ R^{n×1} is the gravity vector; F(q̇) ∈ R^{n×1} represents friction and load disturbance; τ_d(t) ∈ R^{n×1} is the interference and uncertainty term; and τ(t) is the system control input.
The actual dynamic equation of the manipulator can then be written as follows:

$$\ddot{q} = \bar{M}(q)^{-1}\big(\tau(t) - \bar{C}(q,\dot{q})\dot{q} - G(q) - F(\dot{q})\big) + D \quad (2)$$

The desired joint position, angular velocity, and angular acceleration are denoted q_d, q̇_d, q̈_d. The tracking position error of the system is:

$$e_1 = q - q_d \quad (4)$$

The speed error is:

$$e_2 = \dot{e}_1 = \dot{q} - \dot{q}_d \quad (5)$$

The dynamic error corresponding to Equation (2) is:

$$\dot{e}_2 = \ddot{q} - \ddot{q}_d = \bar{M}(q)^{-1}\big(\tau(t) - \bar{C}(q,\dot{q})\dot{q} - G(q) - F(\dot{q})\big) - \ddot{q}_d + D \quad (6)$$

where D = −M̄(q)^{-1}(δM(q)q̈ + δC(q, q̇)q̇ + τ_d(t)) is the vector of uncertainty (including unknown disturbance, model uncertainty, and approximation error). According to the physical characteristics of industrial robots [9,31,32], the following assumptions are put forward:

Assumption 1. M(q) is a positive definite, invertible symmetric matrix and is bounded:

$$\varphi_1 \le \|M(q)\| \le \varphi_2 \quad (7)$$

Assumption 2. The uncertainty D is a bounded function satisfying the constraint:

$$\|D\| \le A_0 + A_1\|q\| + A_2\|\dot{q}\|^2 = \Lambda \quad (8)$$

where φ_1, φ_2, A_0, A_1, A_2 are unknown constants, φ_1, φ_2, Λ > 0, and ||D|| represents the Euclidean norm.
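For illustration, the following is a minimal Python sketch of how the error quantities of Equations (4)–(6) can be evaluated numerically. The nominal model callables (M_bar, C_bar, G, F) and the lumped uncertainty vector D are placeholders, not the paper's actual model:

```python
import numpy as np

def tracking_errors(q, dq, tau, q_d, dq_d, ddq_d, M_bar, C_bar, G, F, D):
    """Tracking errors e1, e2 and error dynamics de2 (Eqs. (4)-(6)).

    M_bar, C_bar, G, F are callables returning the *nominal* model terms;
    D is the lumped uncertainty vector (unknown in practice, given here).
    """
    e1 = q - q_d                      # position error, Eq. (4)
    e2 = dq - dq_d                    # velocity error, Eq. (5)
    # nominal acceleration from Eq. (2), plus the lumped uncertainty D
    ddq = np.linalg.solve(M_bar(q), tau - C_bar(q, dq) @ dq - G(q) - F(dq)) + D
    de2 = ddq - ddq_d                 # error dynamics, Eq. (6)
    return e1, e2, de2
```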
In this paper, the proposed control strategy further improves the tracking control accuracy of a manipulator with an uncertain dynamic model. The nonlinear sliding surface is established using the dynamic error, and then the feedback control loop is developed. The DDPG algorithm is used to adaptively approximate the uncertainties of the system. Therefore, for the control system with uncertainties, the tracking error converges to zero in finite time and remains stable.

DDPG-NNFTSMC Control Design
In this part, a DDPG-NNFTSMC control method is proposed for the general nonlinear second-order system of the manipulator. Then the sliding surface and the DDPG algorithm design are given.

Design of the NNFTSMC
In this part, we design a new NNFTSMC sliding surface for the manipulator with an uncertain dynamic model based on traditional NTSMC, then derive the reaching law and design the controller based on the sliding surface.
According to [33], the function sig(x)^a is introduced; when a > 0, sig(x)^a is monotonically increasing for all x ∈ R and always returns a real number.
$$\mathrm{sig}(x)^a = |x|^a\,\mathrm{sgn}(x) \quad (9)$$

It can be seen from [5] that the traditional NTSMC sliding surface is:

$$s = e_1 + \frac{1}{\beta}\,\mathrm{sig}(e_2)^{\eta} \quad (10)$$

where β > 0 and 2 > η > 1. Because the exponential term of e_2 is greater than 0, the singularity problem is avoided; however, in the region far from the equilibrium point, the state derivative of the system is smaller than that of a linear sliding surface with the same parameters, which slows the convergence of the system state. To accelerate convergence, the tracking position error and its change rate in Equations (4) and (5) are defined as the NNFTSMC variables and, combined with Equations (9) and (10), the sliding surface function is designed as follows:

$$s = e_1 + \frac{1}{\alpha}\,\mathrm{sig}(e_1)^{\gamma} + \frac{1}{\beta}\,\mathrm{sig}(e_2)^{\eta} \quad (11)$$

where α, β > 0 and 2 > η > 1 (γ is the exponent of the additional fast term; see Remark 1).
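As an illustration, here is a minimal Python sketch of the sig(·) function in Equation (9) and the sliding surface in Equation (11); the parameter values are arbitrary examples, not the paper's tuned gains:

```python
import numpy as np

def sig(x, a):
    """sig(x)^a = |x|^a * sgn(x), Eq. (9); monotonically increasing for a > 0."""
    return np.abs(x) ** a * np.sign(x)

def nnftsmc_surface(e1, e2, alpha=2.0, beta=2.0, gamma=1.8, eta=1.5):
    """NNFTSMC sliding surface of Eq. (11): s = e1 + sig(e1)^gamma/alpha + sig(e2)^eta/beta."""
    return e1 + sig(e1, gamma) / alpha + sig(e2, eta) / beta
```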
According to the dynamic error Equation (6), the NNFTSMC control law is designed as Equation (12).

Theorem 1. For the system of Equation (1), using Equation (11) as the sliding surface and Equation (12) as the NNFTSMC control law, the system reaches the sliding surface in finite time t_r, and the tracking error on the sliding surface converges to 0 within a further finite time t_s.
Proof of Theorem 1. The stability analysis of the controller is as follows. Taking the first time derivative of Equation (11) and substituting the control law of Equation (12) yields the exponential reaching law, Equation (13). Equation (13) makes the system move quickly before reaching the switching surface; when the switching surface is reached, the speed decreases and chattering is weakened, so that the system has better adaptability and robustness to parameter perturbation and external disturbance. The Lyapunov function is selected as:

$$V = \frac{1}{2}s^{\mathrm{T}}s \quad (14)$$

Substituting Equations (11) and (13) into the derivative of Equation (14) gives Equation (15). Since 1 < η < 2, 0 < η − 1 < 1, and β > 0, applying the bound of Equation (8) yields $\dot{V} \le 0$, which is Equation (16).
Therefore, for e_2 ≠ 0 the controller satisfies the Lyapunov stability condition [34] and has good stability and robustness, and the system reaches the sliding surface in finite time. Let t_r be the time for the sliding variable to travel from s(0) ≠ 0 to s = 0, that is, s(t_r) = 0; integrating both sides of the reaching law of Equation (13) from 0 to t_r yields the reaching time t_r. In the stage s = 0, suppose e_1(t_r) ≠ 0 and the error passes through a finite time t_s until e_1(t_r + t_s) = 0. From Equation (6), Equation (11), and s = 0, ė_1 can be obtained as:

$$\dot{e}_1 = e_2 = -\,\mathrm{sig}\Big(\beta\Big(e_1 + \frac{1}{\alpha}\,\mathrm{sig}(e_1)^{\gamma}\Big)\Big)^{1/\eta}$$

Integrating this expression from t_r to t_r + t_s gives the settling time t_s.
The total convergence time is:

$$t = t_r + t_s \quad (22)$$

Theorem 1 is thus proved. □

Remark 1.
Due to the existence of the $\frac{1}{\alpha}\mathrm{sig}(e_1)^{\gamma}$ term in Equation (11), the total convergence time of Equation (22) is less than that of the NTSMC proposed in [26].
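A short worked derivation may clarify Remark 1. It is a sketch under the assumption that the baseline is the NTSM surface of Equation (10), not the paper's exact derivation:

```latex
% On the NTSM surface (10): s = e_1 + sig(e_2)^eta / beta = 0
% => \dot{e}_1 = e_2 = -\beta^{1/\eta}\,\mathrm{sig}(e_1)^{1/\eta}
\[
t_s^{\mathrm{NTSM}}
= \int_0^{|e_1(t_r)|} \frac{\mathrm{d}|e_1|}{\beta^{1/\eta}\,|e_1|^{1/\eta}}
= \frac{\eta}{\beta^{1/\eta}\,(\eta-1)}\,|e_1(t_r)|^{(\eta-1)/\eta}.
\]
% On the NNFTSM surface (11), \dot{e}_1 = -sig(beta(e_1 + sig(e_1)^gamma/alpha))^{1/eta},
% so |\dot{e}_1| is strictly larger for the same e_1 (the fast term adds to the
% argument); hence the NNFTSM settling time is shorter for the same parameters
% and the same initial error e_1(t_r).
```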


Design of DDPG Network
According to the derivation in Section 3.1, the condition that the uncertainty D in Assumption 2 satisfies ||D|| ≤ Λ is an important condition for Lyapunov stability.
The modified DDPG algorithm can improve the stability and anti-interference ability of the control system. Because the return and update of the control parameters are delayed, the action calculated from the current state takes effect in the next control phase, and the weights of the neural networks and the contents of the experience pool are updated synchronously. During the execution of the algorithm, the data from the previous step combined with the error information are converted into a reward value and stored together with a data window of N steps. At the beginning of training, the critic network Q(Γ, a|θ^Q) (with parameters θ^Q) and the actor network μ(Γ|θ^μ) (with parameters θ^μ) are initialized, and the experience set ϰ is initialized. For the critic algorithm, the current round's monitoring data and state together with the next action (combined as the vector Γ) are used as the input of the target networks, which output scalar values used to compute the target value:

$$y_i = r_i + \gamma\,Q'\big(\Gamma_{i+1},\,\mu'(\Gamma_{i+1}|\theta^{\mu'})\,\big|\,\theta^{Q'}\big)$$

The current actor network takes the latest state–action history and the current state Γ as input and outputs the new action to be taken. This paper uses a real-time-update feedback training scheme with multiple training rounds. In each execution cycle, each batch of training data (in random order) is used to update the weights of the neural networks by gradient descent. The critic network is updated by minimizing the mean square error between the output of Q and the target values built from the stored reward data:

$$L = \frac{1}{m}\sum_{i=1}^{m}\big(y_i - Q(\Gamma_i, a_i|\theta^{Q})\big)^2$$

At the same time, when taking an action in the current state according to the output of μ, the policy network is updated along the deterministic policy gradient, i.e., by maximizing the Q value of the critic output (minimizing −Q as the actor loss):

$$\nabla_{\theta^{\mu}} J \approx \frac{1}{m}\sum_{i=1}^{m}\nabla_a Q(\Gamma, a|\theta^{Q})\Big|_{\Gamma=\Gamma_i,\,a=\mu(\Gamma_i)}\nabla_{\theta^{\mu}}\mu(\Gamma|\theta^{\mu})\Big|_{\Gamma_i}$$
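The following is a minimal PyTorch-style sketch of the critic and actor updates just described (standard DDPG); the network objects, replay-buffer batch, and the discount value are illustrative assumptions, not the paper's implementation:

```python
import torch

def ddpg_update(critic, actor, critic_target, actor_target,
                critic_opt, actor_opt, batch, gamma=0.99):
    """One DDPG update step on a sampled minibatch (standard algorithm).

    batch: (state, action, reward, next_state) tensors sampled from the
    experience set; critic(s, a) returns Q-values, actor(s) returns actions.
    """
    s, a, r, s_next = batch

    # Target value y_i = r_i + gamma * Q'(s', mu'(s'))  (no gradient needed)
    with torch.no_grad():
        y = r + gamma * critic_target(s_next, actor_target(s_next))

    # Critic: minimize the mean-squared error between Q(s, a) and the target y
    critic_loss = torch.nn.functional.mse_loss(critic(s, a), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: maximize Q(s, mu(s)), i.e. minimize -Q (deterministic policy gradient)
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```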
The key requirement on the reward function is that it gives the network correct feedback on the results of its decision actions; the quality of the reward function directly affects the training effect and the convergence speed. In reinforcement learning there is no strict definition of the reward function: it only needs to correctly evaluate the advantages and disadvantages of the network's output actions. Two kinds of reward functions are common. One is the sparse reward, which is given only in the target state and nowhere else and is often used in games to score after a task is completed; this kind of reward does not respond well to non-discrete actions, so it is difficult to quantify the size of the reward. The other is a shaped (formal) reward, which also provides graded feedback outside the target state; this kind of reward promotes neural network learning more easily, because the reward function can provide informative feedback even when the strategy has not yet found a solution to the problem.
In this paper, the input values of the theoretical signals (including but not limited to position, velocity, and acceleration) and the feedback adjustment parameters (including but not limited to tracking position, tracking velocity, tracking error, and control rate) are combined. According to the actual situation, each relevant signal is multiplied by an adjustment coefficient and the results are superimposed; the sum is provided to the reinforcement learning network as the reward function value.
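As an illustration of this weighted-superposition reward, here is a minimal Python sketch; the particular signals and the coefficient values are hypothetical examples:

```python
import numpy as np

def reward(e_pos, e_vel, u, w_pos=10.0, w_vel=1.0, w_u=0.01):
    """Weighted superposition reward: penalize tracking errors and control effort.

    e_pos, e_vel: joint position/velocity tracking errors; u: control input.
    The weights w_* are illustrative adjustment coefficients.
    """
    return -(w_pos * np.sum(e_pos**2)
             + w_vel * np.sum(e_vel**2)
             + w_u * np.sum(u**2))
```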
In the actor network, the current network is responsible for the iterative update of the policy parameters θ^μ; according to the current state Γ_t it selects the current action a_t and interacts with the environment to generate Γ_{t+1}. The actor target network is responsible for selecting the optimal next action according to the next state Γ_{t+1} sampled from the experience replay pool. At the same time, m samples are drawn from the experience set ϰ to calculate the Q value of the current state, and a soft update is used to update the target network weights periodically.
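A minimal sketch of the periodic soft update mentioned above (the standard DDPG form θ′ ← τθ + (1 − τ)θ′); the smoothing factor τ = 0.005 is an assumed example value:

```python
import torch

def soft_update(target_net, net, tau=0.005):
    """Soft update: theta' <- tau * theta + (1 - tau) * theta'."""
    with torch.no_grad():
        for p_t, p in zip(target_net.parameters(), net.parameters()):
            p_t.mul_(1.0 - tau).add_(tau * p)
```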

When the neural network needs to output an action, the current state Γ_t is input into the actor target network and, after the action with the largest reward is selected, the output a_t corresponds to a variable within a predefined range:

$$a_t = \mu(\Gamma_t|\theta^{\mu}) + N_t$$

To improve exploration in action selection, the random noise N_t is added to ensure that the output action has a certain randomness; the noise can be added or removed dynamically as needed in practical applications.
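A minimal sketch of this noisy action selection, assuming Gaussian exploration noise clipped to a predefined action range (the noise scale and bounds are illustrative):

```python
import torch

def select_action(actor, state, noise_std=0.1, a_min=-1.0, a_max=1.0):
    """a_t = mu(state) + N_t, clipped to the predefined action range."""
    with torch.no_grad():
        a = actor(state)
    if noise_std > 0.0:            # exploration noise, removable at test time
        a = a + noise_std * torch.randn_like(a)
    return a.clamp(a_min, a_max)
```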
Compared with the traditional DDPG algorithm, the DDPG algorithm used in this paper has the following improvements:
1. In the critic network, the output signal was changed from a single value to the combined control rate and error, so as to better meet the performance requirements on the control rate in the early stage of the model;
2. In the last layer of the actor network, the output was changed from the discrete action with the highest probability selected from the experience pool to a continuous numerical output.
The control diagram is shown in Figure 1. For the two-joint control model used in this experiment, two lightweight DDPG networks are used to jointly control the target; they respectively undertake the error-compensation output and the control adjustment of the two joints. In the experiments, the traditional network architecture made the training model too complex, significantly reduced the convergence speed, and caused overfitting. Therefore, the actor network uses two layers with 32 and 64 neurons, and the critic network uses two layers with 32 neurons each to evaluate and correct the actor network. The control flow chart of DDPG-NNFTSMC is shown in Figure 2.
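A minimal PyTorch sketch of the lightweight architecture described above (two hidden layers of 32 and 64 neurons for the actor, two of 32 each for the critic); the input/output dimensions and activations are assumptions for illustration:

```python
import torch
import torch.nn as nn

def make_actor(state_dim, action_dim):
    """Actor: two hidden layers (32 and 64 neurons), tanh for a bounded action."""
    return nn.Sequential(
        nn.Linear(state_dim, 32), nn.ReLU(),
        nn.Linear(32, 64), nn.ReLU(),
        nn.Linear(64, action_dim), nn.Tanh(),
    )

class Critic(nn.Module):
    """Critic: two hidden layers of 32 neurons; maps (state, action) to a Q value."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 32), nn.ReLU(),
            nn.Linear(32, 32), nn.ReLU(),
            nn.Linear(32, 1),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))
```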

Simulation Comparison with Control Algorithms
To verify the effectiveness of the DDPG-NNFTSMC proposed in this paper, the dynamic model of a two-joint manipulator was introduced in this section. The simulation analysis was carried out using MATLAB/Simulink with the sampling period set to 10^-3 s; sensors were used to measure the position accuracy, response speed, and path-tracking control of each joint. Considering the characteristics of the control system, the external disturbance and uncertain friction were modeled. The control effect was compared with TSMC, NTSMC, and RBF-SMC. Figure 3 shows the pseudo-code of the proposed algorithm.
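A minimal Python sketch of such a fixed-step tracking simulation loop with a 10^-3 s sampling period; the plant and controller callables (and the sinusoidal reference) are placeholders, since the paper's simulation actually runs in MATLAB/Simulink:

```python
import numpy as np

def simulate(plant_step, controller, T=10.0, dt=1e-3, n_joints=2):
    """Fixed-step tracking simulation; logs position errors per joint.

    plant_step(q, dq, tau, dt) -> (q, dq) integrates the manipulator dynamics;
    controller(q, dq, q_d, dq_d, ddq_d) -> tau is the control law under test.
    """
    steps = int(T / dt)
    q = np.zeros(n_joints); dq = np.zeros(n_joints)
    err = np.zeros((steps, n_joints))
    for k in range(steps):
        t = k * dt
        q_d = np.sin(t) * np.ones(n_joints)      # placeholder reference
        dq_d = np.cos(t) * np.ones(n_joints)
        ddq_d = -np.sin(t) * np.ones(n_joints)
        tau = controller(q, dq, q_d, dq_d, ddq_d)
        q, dq = plant_step(q, dq, tau, dt)
        err[k] = q - q_d                          # e1 per joint, Eq. (4)
    return err
```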


Results and Discussion
From Tables 4 and 5 and Figures 4-7, it can be seen that the position error and velocity error of the traditional TSMC, NTSMC, and RBF-SMC changed when uncertain interference and friction were applied. Compared with these three strategies, DDPG-NNFTSMC had a much smaller position error, velocity error, average position error, and average velocity error; the corresponding tracking position and speed almost fitted the theoretical values. In addition, the control law of Equation (12), designed based on the proposed sliding surface function s, played an important role in providing fast convergence and robustness to uncertainties and disturbances. Compared with the other control strategies, the control strategy proposed in this paper provided the best path-tracking performance and the fastest convergence speed.
As shown in Figure 8a,b, the traditional TSMC and NTSMC converged in finite time, but for nonlinear and nonsingular systems there were still chattering and errors, so a trade-off had to be made between chattering elimination and path-tracking accuracy; as a result, the robustness of the system was reduced and the tracking error increased. As shown in Figure 8c, although control input 1 of RBF-SMC had a fast convergence speed and eliminated the chattering phenomenon, the tracking error was greatly increased; control input 2 provided a continuous control signal with partial chattering behavior, but the tracking error also decreased, as shown in Figures 4-7. The NNFTSMC designed in this paper provided continuous control signals for the manipulator and introduced the DDPG network into the controller: the tracking accuracy was guaranteed, the chattering phenomenon in Figure 8d was eliminated, and the convergence time of the control system signal was effectively reduced without loss of effectiveness, as shown in Figure 8e. These adaptive feedback terms were estimated according to the change of the system disturbance and the uncertain disturbance terms; once the error variables converged to the sliding surface, they remained close to a constant value. From the simulation results, the proposed controller was superior to the traditional TSMC, NTSMC, and RBF-SMC in tracking accuracy, convergence speed, and chattering elimination.

Conclusions
In this paper, a DDPG-NNFTSMC control strategy was proposed to solve the problems of chattering and the low tracking accuracy caused by chattering in traditional sliding mode control, and it was successfully applied to manipulator systems with uncertain dynamic characteristics. Based on NNFTSMC, a sliding surface was proposed on which the error function converges quickly, and the Lyapunov stability condition was used to prove that NNFTSMC has good stability and finite-time convergence. Compared with the traditional TSMC, NTSMC, and RBF-SMC, the DDPG-NNFTSMC has the following advantages: (1) the DDPG network trains and updates the control parameters in real time to estimate the model uncertainties (including unknown disturbance, dynamic model uncertainty, and approximation error), which effectively eliminates the chattering phenomenon and ensures the robustness of the system under various uncertain disturbances; (2) the elimination of chattering greatly improves the tracking accuracy, reduces the average tracking error, and enhances the system's resistance to disturbance and uncertainty. Therefore, it can be concluded that the DDPG-NNFTSMC proposed in this paper has excellent control performance and, theoretically, good application prospects for industrial manipulators with uncertain dynamic models and for general second-order nonlinear systems. Next, based on the research of scholars [12,31], we will try to apply DDPG-NNFTSMC to other scenarios and combine it with deep reinforcement learning to further improve the control performance [32,36,37].