1. Introduction
As the core executive units of modern industrial automation systems, robotic manipulators have expanded their application scenarios from traditional structured environments such as welding [1] and material handling [2] to highly dynamic unstructured environments including surgical assistance [3], space exploration [4], and service robots [5]. During the execution of complex tasks, the end-effector of the robotic manipulator must accurately complete high-precision spatial movements along a predefined trajectory. However, robotic manipulators are highly complex multi-input multi-output systems characterized by strong nonlinearity, time-varying dynamics, and strong coupling [6]. Additionally, the structural parameters of a robotic manipulator are difficult to obtain accurately and its dynamic model is partially unknown, resulting in significant uncertainties in the mathematical state-space model describing the motion of the manipulator [7]. Functional configuration design and high-precision trajectory tracking control are effective approaches to address these challenges.
To better achieve trajectory tracking control of robotic manipulators, researchers have proposed various control methods, including fuzzy control [8,9], sliding mode control [10,11], neural network control [12,13], reinforcement learning control [14,15], and model predictive control [16,17].
Xian et al. [18] established a dynamic model for coordinated robotic manipulators using fuzzy set theory and developed an approximate constrained-following servo control to ensure the uniform boundedness and uniform ultimate boundedness of the controlled system, selecting optimal parameters by solving fuzzy-based performance metrics. Zhang et al. [19] proposed a three-dimensional fuzzy active disturbance rejection controller for the mechanical arm of an iron roughneck by adding a three-dimensional fuzzy module to the classical ADRC, in which the controller output is adjusted based on the tracking differential error, the error rate of change, and the error acceleration. Obadina et al. [20] developed a hybrid optimization algorithm for gray-box model identification of robotic manipulators and applied the model to real-time fuzzy trajectory tracking control. Jiang et al. [21] combined the Beetle Antennae Search (BAS) algorithm with Particle Swarm Optimization (PSO), introducing fractional calculus to dynamically adjust the inertia weight and fractional order, thereby achieving high-precision trajectory tracking control for robotic manipulators. However, fuzzy strategies rely heavily on human experience, and parameter selection based on intelligent algorithms is prone to becoming trapped in local optima.
Sun et al. [22] employed radial basis function neural networks (RBFNNs) to compensate online for the uncertainty and unknown dynamics of system parameters. Chen et al. [23] optimized RBFNN parameters using an immune algorithm and designed adaptive-law compensation for backlash-like hysteresis errors and RBFNN approximation errors. However, the high computational complexity of neural network algorithms results in poor real-time tracking performance and presents physical implementation challenges. Carron et al. [24] combined inverse dynamics feedback linearization and data-driven error models with model predictive control; they processed offline data using Gaussian filtering and utilized extended Kalman filtering to estimate residual disturbances online, achieving bias-free tracking of the robotic manipulator. Kang [25] proposed an event-triggered model predictive control strategy for robotic manipulators with model uncertainty and input constraints, where the trigger conditions were defined based on the weights of the predictive model and the predicted tracking error. However, MPC is sensitive to system modeling accuracy, and its time-domain optimization mechanism only solves local optima within a finite time interval, making it difficult to ensure global asymptotic stability of the closed-loop system.
Sliding mode control has been widely applied in robotic manipulator control due to its simple physical implementation, strong robustness, and fast transient response [26]. Kali et al. [27] used a linear sliding mode surface for the trajectory tracking control of robotic manipulators, but this method only guarantees asymptotic convergence of the state [28]. Rapid convergence then requires high gains, whereas terminal sliding mode control can ensure system convergence within finite time. Xu et al. [29] designed a discrete integral terminal sliding mode control law that incorporates delayed estimation of unknown disturbances and discretization errors in the robotic manipulator system and introduces an adaptive switching term to suppress sliding mode chattering effects. However, the chattering inherent in sliding mode control can cause actuator damage or even system instability. To reduce or eliminate chattering, Makrini et al. [30] designed a variable boundary layer (BLT) sliding mode control method, which achieves joint safety and high-performance control by adjusting the torque limit parameters and the expansion factor of the variable boundary layer.
Additionally, since certain system states are unmeasurable in practical applications, system uncertainty can affect transient performance and even lead to instability in robotic manipulator systems. To better achieve disturbance compensation, Fan et al. [31] proposed a fuzzy control strategy based on a spatial extended state observer to address the trajectory tracking control problem of a robotic manipulator capturing a floating object in a microgravity environment. Yin et al. [32] proposed an adaptive non-singular terminal sliding mode control method based on an NDO to observe the internal modeling errors of the robotic manipulator and the external unknown time-varying disturbances acting on each joint for feedforward compensation. Zha et al. [33] estimated the lumped uncertainties of the entire system using a hybrid observer composed of an adaptive time delay estimator based on gradient compensation and a second-order adaptive sliding mode observer; they then constructed an adaptive integral non-singular fast terminal sliding mode algorithm based on the backstepping method to stabilize the system and reduce chattering. Hu et al. [34] achieved real-time disturbance estimation and compensation by combining load torque estimation based on an improved harmonic drive flexibility model with friction compensation using a hybrid friction model, effectively improving the trajectory tracking performance of the robotic manipulator under high-speed variable-load conditions.
The aforementioned methods have achieved good results for robotic manipulators, but with the rapid development of artificial intelligence technology, data-driven machine learning approaches offer new insights for the intelligent tracking control of robotic manipulators. Reinforcement learning does not require the establishment of precise dynamic models, which makes it advantageous for solving sequential decision-making problems under highly nonlinear and uncertain conditions [35]. Viswanadhapalli et al. [36] utilized deep reinforcement learning control based on the Deep Deterministic Policy Gradient (DDPG) framework to achieve precise servo tracking of a flexible robotic manipulator and evaluated the controller performance through hardware-in-the-loop (HIL) testing. Lu et al. [37] proposed an adaptive proportional-integral robust control method based on DDPG, which searches for the optimal controller parameters in a continuous action space using the dynamic information of the robotic manipulator; they also designed a reward function that combines a Gaussian function with Euclidean distance to ensure stable and efficient agent learning. Ren et al. [38] proposed an adaptive sliding mode control method based on DDPG reinforcement learning, leveraging the autonomous learning capability of RL to adaptively adjust the key parameters of the controller online. However, the DDPG strategy is susceptible to Q-value overestimation and insufficient exploration efficiency. The TD3 algorithm introduces a double Q-network, target policy noise, and delayed policy updates, achieving significant improvements in mitigating overestimation bias in the action-value function and enhancing policy stability. Zhu et al. [39] designed an adaptive sliding mode controller based on TD3 parameter optimization for variable-speed trajectory tracking of underactuated vessels under model uncertainty and external environmental disturbances. Fan et al. [40] addressed path tracking of unmanned underwater vehicles by integrating an improved experience replay strategy into TD3 while enhancing learning efficiency through refined regularization methods and dynamic reward functions, achieving faster convergence and superior tracking performance compared with mainstream classical DRL approaches.
Owing to strong nonlinearity, multivariable coupling, and the difficulty of accurately obtaining physical parameters, establishing an accurate dynamic model of robotic manipulators is extremely challenging. At the same time, the time-varying and uncertain nature of friction and external disturbances further complicates modeling. Although traditional control methods achieve satisfactory performance in known systems, their adaptability remains insufficient for unknown or uncertain systems. In order to address the trajectory tracking problem of robotic manipulators under unknown time-varying disturbances and modeling uncertainties, this paper proposes an improved NFTSMC method based on the TD3 algorithm and the NDO. The main contributions are summarized as follows:
(1) For tasks such as precision assembly and high-accuracy positioning that impose strict requirements on position and velocity control, a novel 5-DOF robotic manipulator configuration is designed, which reduces the end-effector load by positioning the joint motors at the front and simplifies dynamic model computation by introducing a prismatic joint.
(2) A nonlinear disturbance observer is employed to estimate the internal modeling errors and external unknown time-varying disturbances of the robotic manipulator, followed by feedforward compensation. An improved non-singular fast terminal sliding mode control law is designed based on the boundary layer technique, and global system stability is analyzed using Lyapunov theory.
(3) Adaptive control is achieved through DRL by proposing an adaptive NDONFT control method based on the TD3 algorithm, which employs a dual Q-network structure and selects the minimum Q-value to effectively mitigate value overestimation in DRL. The learning efficiency and stability of the agent are enhanced by modifying the reward function, thereby avoiding convergence to local optima.
(4) Using the three joints of the designed 5-DOF robotic manipulator as examples, this paper verifies that the proposed method achieves excellent trajectory tracking performance and demonstrates stronger robustness against external unknown sudden disturbances.
The remainder of this paper is organized as follows: Section 2 presents the system principles and mathematical models of the robotic manipulator; Section 3 introduces the design process and stability analysis of the TD3NDONFT algorithm-based controller; Section 4 analyzes the simulation results; and Section 5 summarizes the main contributions of this paper.
3. TD3NDONFT Controller Design
In order to achieve trajectory tracking control of robotic manipulators under complex environmental influences, this paper proposes a non-singular fast terminal sliding mode control method based on the NDO and DRL, as shown in Figure 2. The NDO is used to estimate and compensate for the lumped disturbance, while the autonomous learning capability of DRL is employed to learn the controller parameters.
In the n-link robotic manipulator system, let the desired joint position, velocity, and acceleration be $q_d$, $\dot{q}_d$, and $\ddot{q}_d$, respectively, and the actual joint position, velocity, and acceleration be $q$, $\dot{q}$, and $\ddot{q}$, respectively. Then, the tracking error and its derivative are defined as $e = q_d - q$ and $\dot{e} = \dot{q}_d - \dot{q}$. The error state variables $x_1$ and $x_2$ are constructed from the position error $e$, velocity error $\dot{e}$, and acceleration error $\ddot{e}$ of the robotic manipulator joints, that is, $x_1 = e$ and $x_2 = \dot{e}$ (with $\dot{x}_2 = \ddot{e}$). Then, the error dynamics equation of the manipulator can be written as:
3.1. Design of Nonlinear Disturbance Observer
The disturbance is estimated by correcting the estimated value according to the difference between the estimated output and the actual output. Combining this with Equation (5), the following nonlinear disturbance observer is designed:
where $L(q,\dot{q})$ is the gain matrix of the observer.
Equation (10) involves acceleration signals, which cannot be accurately obtained by directly differentiating velocity signals in practical applications: velocity signals inevitably contain sensor noise, quantization errors, and sampling jitter, and differentiation inherently amplifies high-frequency components, which significantly increases the influence of noise and results in severe distortion of the estimated acceleration. Furthermore, owing to the strong coupling of the manipulator system and the presence of external disturbances, the nonlinear disturbance observer must be designed further so that it accurately estimates the external disturbances in the manipulator system and feeds them back to the controller for disturbance compensation.
Define auxiliary parameter variables as:
where $p(q,\dot{q})$ is the designed function vector. According to [42], the gain matrix $L(q,\dot{q})$ and function vector $p(q,\dot{q})$ of the designed disturbance observer are given by:
where $X$ is an invertible matrix. Substituting Equation (12) into Equation (11) and differentiating yields:
Substituting Equation (10) into Equation (13) yields:
In summary, Equations (11)–(14) constitute the designed nonlinear disturbance observer.
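For concreteness, a minimal discrete-time sketch of such an observer is given below. It assumes the common construction with gain $L = X^{-1}M^{-1}(q)$ and auxiliary vector $p = X^{-1}\dot{q}$; these choices, the function names, and the dynamics interface are illustrative assumptions, not the paper's exact implementation.

```matlab
% Hedged sketch of the NDO update (Equations (11)-(14)); robot.M/C/G are
% assumed handles returning the inertia matrix, Coriolis matrix, and gravity
% vector of the manipulator dynamics M(q)*qddot + C(q,qdot)*qdot + G(q) = tau + d.
function [dhat, z] = ndoStep(z, q, qdot, tau, dt, X, robot)
    M = robot.M(q);                               % inertia matrix
    C = robot.C(q, qdot);                         % Coriolis/centrifugal matrix
    G = robot.G(q);                               % gravity vector
    L = inv(X) / M;                               % assumed gain L = X^{-1} M^{-1}(q)
    p = X \ qdot;                                 % assumed auxiliary vector p = X^{-1} qdot
    dhat = z + p;                                 % disturbance estimate
    zdot = -L * dhat + L * (C * qdot + G - tau);  % internal observer dynamics
    z = z + dt * zdot;                            % forward-Euler integration
end
```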
In order to enable the nonlinear disturbance observer to accurately estimate disturbances, the linear matrix inequality (LMI) method is introduced to solve for $X$. For convenience in the derivation, the explicit dependence of the matrices on $q$ and $\dot{q}$ is omitted in the following steps.
To prove that the observation error converges asymptotically, define the Lyapunov function as:
where the observation error is $\tilde{d} = d - \hat{d}$.
The derivative of the above Lyapunov function is:
According to Equations (11) and (14), the derivative of the observation error is:
Generally, without prior knowledge of the disturbance $d$, it is assumed that the disturbance varies slowly relative to the observer dynamics [43], that is, $\dot{d} \approx 0$. The observer error equation is then obtained as:
Substituting Equation (18) into Equation (16) yields:
Construct the following inequality:
where $\Gamma$ is a symmetric positive definite matrix. Then there exists a positive constant such that the following holds:
It can be seen that the disturbance observer converges exponentially.
Since Equation (20) contains nonlinear terms, it must be converted into a linear matrix inequality before it can be solved. Defining a new matrix variable, Equation (20) can be transformed into:
A sufficient condition for the above to hold is that the resulting matrix inequality is satisfied, which is equivalent to:
Based on the value of $X$ obtained from the solution, the observer gain matrix $L(q,\dot{q})$ and the auxiliary variable $p(q,\dot{q})$ can be calculated.
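As an illustration of this solution step, a Lyapunov-type LMI feasibility problem can be solved numerically. The sketch below uses YALMIP (an assumed tool choice), with placeholder problem data A and Q standing in for the paper's exact LMI, which is not reproduced in the text above.

```matlab
% Hedged sketch: solve a generic Lyapunov-type LMI for a symmetric X > 0
% using YALMIP; A and Q are placeholders for the known problem data.
n = 3;                                     % three joints considered here
A = -eye(n); Q = eye(n);                   % placeholder problem data
X = sdpvar(n, n, 'symmetric');             % LMI decision variable
F = [X >= 1e-6 * eye(n), ...               % positive definiteness of X
     A' * X + X * A <= -Q];                % Lyapunov-type constraint
optimize(F);                               % solve the feasibility problem
Xval = value(X);                           % Xval is then used to compute L and p
```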
3.2. NFTSM Controller Design
In traditional sliding mode control, the sliding surface is generally selected as a linear sliding surface. When the system state reaches the sliding surface, the linear sliding surface ensures that the error converges asymptotically to zero but does not guarantee finite-time convergence. In contrast, the terminal sliding surface guarantees that the error variable converges to zero within finite time.
Therefore, this paper adopts a non-singular fast terminal sliding mode surface [44]:
where the sliding surface exponents are positive odd numbers chosen to satisfy the non-singularity condition, and $\mathrm{sign}(\cdot)$ is the sign function. The derivative of Equation (24) with respect to time is as follows:
Combining Equations (9) and (25) yields:
Multiplying both sides by the inertia matrix $M(q)$ gives:
Because the actual joint acceleration vector is difficult to obtain in practice, and to avoid its appearance in the control law, define the variable:
Substituting Equation (28) into Equation (27) yields:
According to the selected non-singular fast terminal sliding mode surface, the controller is formulated as follows:
Here, the equivalent control term is obtained when the system reaches the sliding surface $s = 0$ and the external disturbance is taken as zero. The proportional gain $k_p$ and integral gain $k_i$ are both positive. The sliding mode control term ensures that the system reaches stability and is employed to further eliminate errors and external disturbances. The robust control term, with robust gain $k_r$, is primarily used to suppress the external disturbances not captured by the nonlinear disturbance observer and the internal model reconstruction errors, thereby enhancing the disturbance rejection capability.
At the same time, by introducing a nonlinear disturbance observer to estimate the disturbances, the control law after adding its compensation is:
From Equations (29) and (31), we can derive that:
In order to limit the control signal to the feasible range of the actuator and avoid output overshoot, thereby preventing input saturation, a saturation function $\mathrm{sat}(\cdot)$ is introduced to replace the sign function in the robust controller. The saturation function remains linear or varies smoothly when the error is small, thereby reducing chattering, and gradually saturates when the error is large, thus preventing shocks caused by abrupt changes. The saturation function is defined as follows:
$$\mathrm{sat}(s) = \begin{cases} \mathrm{sign}(s), & |s| > \Delta \\ s/\Delta, & |s| \le \Delta \end{cases}$$
where $\Delta$ is a positive constant representing the thickness of the boundary layer.
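A minimal vectorized implementation of this saturation function might look as follows; the function and variable names are illustrative.

```matlab
% Boundary-layer saturation: linear inside |s| <= Delta, sign(s) outside.
function y = sat(s, Delta)
    y = s ./ Delta;                 % linear region inside the boundary layer
    outside = abs(s) > Delta;       % elements beyond the boundary layer
    y(outside) = sign(s(outside));  % behaves as the sign function there
end
```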
3.3. Stability Analysis of the Control System and Proof of Finite-Time Convergence
In the stability analysis of the control system, we consider a Lyapunov function in integral form:
Substituting Equation (15) into Equation (34), the derivative of $V$ with respect to time is obtained as:
Considering the skew-symmetric property of the robotic manipulator dynamics, i.e., that $\dot{M}(q) - 2C(q,\dot{q})$ is skew-symmetric, and substituting Equation (32) into Equation (35), we derive:
Substituting Equation (21) into Equation (36) and simplifying yields $\dot{V} \le 0$. According to Lyapunov stability theory, the system state converges asymptotically to the sliding surface $s = 0$. According to LaSalle's invariance principle, when $\dot{V} \equiv 0$, $s = 0$; that is, the tracking error $e$ and its derivative $\dot{e}$ converge to zero.
To prove that the system state converges to the sliding surface within a finite time, consider the Lyapunov function $V_2 = \frac{1}{2} s^{T} s$ and differentiate it to obtain:
Let $t_f$ denote the system convergence time and $s(0)$ represent the initial state, with $s(t_f) = 0$. Integrating the above yields:
The convergence time is thus bounded, and the trajectory tracking error of the robotic manipulator system tends to zero within finite time.
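Although the exact reaching constant is not reproduced above, the finite-time bound follows from the standard argument sketched below, under the assumption that the derivative satisfies $\dot{V}_2 \le -\eta V_2^{1/2}$ for some reaching gain $\eta > 0$:

```latex
% Standard finite-time integration (eta > 0 is an assumed reaching gain):
\dot{V}_2 \le -\eta V_2^{1/2}
\;\Longrightarrow\;
\int_{V_2(0)}^{0} V_2^{-1/2}\,\mathrm{d}V_2 \le -\eta \int_{0}^{t_f} \mathrm{d}t
\;\Longrightarrow\;
t_f \le \frac{2\sqrt{V_2(0)}}{\eta}.
```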
When $|s| > \Delta$, the saturation function behaves as the sign function $\mathrm{sign}(s)$, which does not affect the convergence of the control law. When $|s| \le \Delta$, the convergence speed decreases to some extent, but $s$ has already been confined within the boundary layer. The robotic manipulator system achieves the desired convergence by adjusting $\Delta$, so the saturation function does not affect the final convergence properties of the system [34].
3.4. Deep Reinforcement Learning TD3 Adaptive Control
Reinforcement learning is a key branch of machine learning that obtains control strategies through interaction with the environment. RL problems are usually formulated as a Markov decision process (MDP), which mainly comprises the environment, agent, reward, state, and action [40]. Specifically, at each time step $t$, the robotic manipulator agent selects an action $a_t$ based on the current state $s_t$, and the environment provides a reward $r_t$ quantifying the performance of the robotic manipulator. The ultimate goal of RL is to find an effective control strategy $\pi$ that maximizes the cumulative long-term return $R = \sum_{t=t_0}^{\infty} \gamma^{t-t_0} r_t$, where $\gamma \in [0, 1)$ is the discount factor indicating the importance of future rewards and $t_0$ represents the initial time step.
TD3 is an advanced DRL method designed for control problems with continuous action spaces. It employs an actor-critic architecture, in which the critic evaluates the value function $Q(s_t, a_t)$ of states and actions. Specifically, at time step $t$, $Q^{\pi}(s_t, a_t)$ represents the long-term reward obtained by taking action $a_t$ in state $s_t$ under policy $\pi$. Under this formulation, the Q-function satisfies:
The ultimate goal of DRL is to learn an optimal control strategy $\pi^{*}$ through Equation (40) so that the value function is maximized.
Generally, neural networks (NNs) are employed to solve Equations (39) and (40) within the actor-critic architecture, as shown in Figure 3. TD3 mainly consists of six NNs. Specifically, two evaluate critic networks are used to estimate the Q-functions: $Q_{\theta_1}$ with parameters $\theta_1$ and $Q_{\theta_2}$ with parameters $\theta_2$. Additionally, an evaluate actor $\pi_{\phi}$ with parameters $\phi$ is used for policy updates. Each actor and critic network is paired with a target network to ensure training stability, denoted by $Q_{\theta_1'}$, $Q_{\theta_2'}$, and $\pi_{\phi'}$. The update of the evaluate critics is expressed by the following loss function:
where $N$ is the batch size of transitions selected at each step, the bracketed difference is known as the temporal-difference error (TD error), and $w_i$ is the importance sampling weight (ISW) of the $i$-th sample. The target value estimate is calculated by selecting the minimum of the two estimated Q-values to reduce the overestimation problem of traditional Q-learning. Additionally, to ensure smoother updates of the critics and to prevent overfitting, a noise signal is added as a regularization term to the target action, $\tilde{a} = \pi_{\phi'}(s') + \epsilon$ with $\epsilon \sim \mathrm{clip}(\mathcal{N}(0, \sigma), -c, c)$, where $\sigma$ is the noise standard deviation, $\mathcal{N}$ denotes a normal distribution, and $c$ represents the noise clipping limit.
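The target computation just described can be summarized in a few lines. The sketch below shows the generic clipped double-Q target of TD3; the function and variable names are assumptions for illustration.

```matlab
% Clipped double-Q target with target-policy smoothing (generic TD3 form).
a_next = actorTarget(s_next);                           % target actor action
eps_n  = max(min(sigma * randn(size(a_next)), c), -c);  % clipped Gaussian noise
a_next = max(min(a_next + eps_n, a_max), a_min);        % keep action within bounds
q_min  = min(criticTarget1(s_next, a_next), ...
             criticTarget2(s_next, a_next));            % pessimistic Q estimate
y = r + gamma .* (1 - done) .* q_min;                   % TD target for both critics
```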
The evaluate actor is updated by maximizing the learned Q-function, adopting the deterministic policy gradient for policy updates:
where $J(\phi)$ is the expected return to be maximized, $\nabla_{a} Q$ is the gradient of the Q-value with respect to the action, and $\nabla_{\phi} \pi_{\phi}$ is the gradient with respect to the evaluate actor parameters $\phi$. Additionally, to ensure a stable training process, the target network parameters are updated by soft updating to track the evaluated networks:
where the soft factor determines the update rate of the target networks. It is worth noting that the update frequencies of the actor and critic networks in TD3 differ: the critic networks are updated more frequently than the actor network, so that the Q-value error is reduced before each policy update.
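A sketch of the delayed actor update and the soft target tracking is given below; rho denotes the soft factor (a symbol chosen here to avoid clashing with the joint torque), and all other names are illustrative.

```matlab
% Delayed policy update and Polyak soft updates (d is the policy delay).
if mod(step, d) == 0
    phi = phi + lrActor * actorGradient(batch, phi);  % policy gradient ascent
    theta1T = rho * theta1 + (1 - rho) * theta1T;     % soft-update critic 1 target
    theta2T = rho * theta2 + (1 - rho) * theta2T;     % soft-update critic 2 target
    phiT    = rho * phi    + (1 - rho) * phiT;        % soft-update actor target
end
```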
3.4.1. State Space Design
The state $s_t$ represents the environmental information perceived by the robotic manipulator agent. The objective is to design a controller that integrates sliding mode control based on disturbance-observation compensation with TD3 to reduce the errors between the actual and target positions, as well as between the actual and desired velocities of the robotic manipulator. Based on the design of the NDONFT controller and considering the structural characteristics of the designed 5-DOF robotic manipulator, the first three joints are sufficient to cover most of the target workspace, while the two end joints are mainly used to adjust the posture of the actuator. Therefore, simulation is conducted on the first three joints of the designed robotic manipulator, and the current state is defined as follows:
The first group of observations consists of the joint information of the robotic manipulator, specifically the sine and cosine of the joint position and the joint velocity. By replacing the joint position $q$ with $\sin(q)$ and $\cos(q)$, the discontinuous position measurements can be expressed through a continuous two-dimensional parameterization while implicitly constraining the represented joint position to $[-\pi, \pi]$. This approach greatly reduces the complexity of the DRL training process and facilitates training convergence. The second group of observations includes the joint position error and the error integral; the error integral reflects the long-term deviation of the system, and its inclusion helps to reduce the steady-state error. The third group of observations comprises the action at the previous time step and the disturbance observed at each joint. Observing the action helps prevent training from converging to extreme values and enhances the stability of the strategy, whereas observing the disturbances enables rapid assessment of the tracking performance and timely adjustment of the strategy. It should be noted that the observations are normalized to prevent gradient explosion.
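One plausible assembly of this observation vector is sketched below; the ordering and the normalization bounds are assumptions consistent with the description above.

```matlab
% Illustrative observation for one control step (all *_max bounds assumed).
s_t = [ sin(q); cos(q);          % continuous parameterization of joint positions
        qdot  ./ qdot_max;       % normalized joint velocities
        e     ./ e_max;          % normalized position tracking errors
        e_int ./ e_int_max;      % normalized error integrals (steady-state term)
        a_prev;                  % action applied at the previous time step
        dhat  ./ d_max ];        % normalized NDO disturbance estimates per joint
```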
3.4.2. Action Space Design
Based on the designed nonlinear disturbance observer-based terminal sliding mode control method, the action space of the DRL agent is defined as the proportional gain, integral gain, and robust gain of NDONFT, that is, $a_t = [k_p, k_i, k_r]$.
The network structures of the critic and actor in TD3NDONFT are shown in Figure 4. The two critic networks share the same structure, with an input layer consisting of the state vector $s_t$ and the action vector $a_t$. The four intermediate hidden layers are fully connected with 256, 128, 128, and 64 nodes, respectively, and the output layer produces a one-dimensional Q-value evaluating the action. The input layer of the actor network is the state vector $s_t$; its intermediate hidden layers are fully connected with 256, 128, 64, and 32 nodes, respectively, and its output layer is a parameter vector representing the action vector $a_t$. According to the action parameter constraints, the outputs are passed through a sigmoid layer to restrict actions within (0, 1); the actions are then scaled up by applying gains during simulation.
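For illustration, the actor network described above could be declared in MATLAB's Deep Learning Toolbox as follows. The ReLU activations are an assumption, since the text specifies only the layer widths and the final sigmoid; numObs and numAct are placeholder dimensions.

```matlab
% Actor network: state input, four hidden layers (256/128/64/32), sigmoid output.
actorLayers = [
    featureInputLayer(numObs)       % state vector input
    fullyConnectedLayer(256)
    reluLayer                       % activation choice assumed
    fullyConnectedLayer(128)
    reluLayer
    fullyConnectedLayer(64)
    reluLayer
    fullyConnectedLayer(32)
    reluLayer
    fullyConnectedLayer(numAct)     % one output per controller gain
    sigmoidLayer                    % restricts raw actions to (0, 1)
    ];
```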
3.4.3. Reward Function Design
During DRL training, the reward function plays a crucial role in effectively guiding the agent to meet task requirements. To address the problem of sparse rewards, a composite reward function integrating trajectory tracking control is designed, consisting of the following components.
(1) Tracking error: To reduce the tracking error of the robotic manipulator, a distance-based reward is established according to the distance between the current position and the desired position. The tracking error includes the position tracking error and the velocity tracking error, and the reward is expressed as follows:
where $w_1$, $w_2$, $w_3$, and $w_4$ are the weights assigned to the penalty components of position and velocity. As shown in Equation (46), when the position and velocity tracking errors approach zero, a higher reward is returned; when these errors are large, a lower reward is received.
(2) Influence of the previous time step: To reduce the influence of the previous control input on the current step and to prevent actions from approaching extreme values that would cause the agent to converge to a local optimum, the following term is incorporated into the reward function:
where the associated weight penalizes the magnitude of the action.
(3) Boundary reward: when the joint position exceeds a predefined limit, the following reward function applies:
Through all of the above designs, the total reward for robotic manipulator trajectory tracking is summarized as the sum of the three components above.
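One plausible realization of this composite reward is sketched below; the quadratic penalty forms and the weight names are assumptions consistent with the description, not the paper's exact expressions.

```matlab
% Composite reward: tracking-error term, action term, and boundary term.
r1 = -(wPos * norm(e_pos)^2 + wVel * norm(e_vel)^2);  % higher reward near zero error
r2 = -wAct * norm(a_prev)^2;                          % discourage extreme actions
r3 = -wBound * double(any(q < q_min | q > q_max));    % joint-limit violation penalty
r_total = r1 + r2 + r3;                               % reward returned to the agent
```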
Based on the above design, an NFTSMC method for robotic manipulators combining the NDO and the TD3 algorithm is proposed. The NDO estimates and compensates for lumped disturbances online, while the TD3 algorithm achieves parameter adaptation in terminal sliding mode control. Before training, the evaluate critic networks, the evaluate actor network, the target critic networks, and the target actor network are initialized with random parameters, and the experience replay buffer is initialized. Then, the robotic manipulator dynamics model is loaded as the environment. To improve the adaptive performance of the controller, the desired trajectory of the robotic manipulator is randomly initialized at the beginning of each episode. At each time step, the current state of the robotic manipulator is input into the actor network to obtain the action vector. The action vector is proportionally scaled and input into the sliding mode controller together with the disturbance estimate observed by the NDO. The reward at the current time step is calculated from data including the position error and velocity error, and the next state of the robotic manipulator is then obtained. The tuple generated during this process is stored in the experience replay buffer. When the buffer reaches the predefined size, a mini-batch of tuples is randomly sampled to update the six NNs mentioned above.
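The procedure can be condensed into the following training skeleton; every identifier is illustrative, and the environment is assumed to wrap the Simulink/Simscape manipulator model.

```matlab
% Skeleton of the training procedure (all identifiers are illustrative).
for ep = 1:maxEpisodes
    ref = randomDesiredTrajectory();                % randomized reference each episode
    s = envReset(ref);
    for t = 1:maxSteps
        a = evaluateActor(s) + explorationNoise();  % raw action in (0,1), scaled later
        [sNext, r, done] = envStep(a);              % NDONFT applies gains + NDO estimate
        replayBuffer = storeTransition(replayBuffer, s, a, r, sNext, done);
        if replayBuffer.count >= warmupSize
            batch = sampleMiniBatch(replayBuffer, miniBatchSize);
            updateCritics(batch);                   % minimize TD error on both critics
            if mod(t, policyDelay) == 0
                updateActor(batch);                 % delayed deterministic policy gradient
                softUpdateTargets();                % Polyak averaging of target networks
            end
        end
        s = sNext;
        if done, break; end
    end
end
```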
4. Simulation Studies
The workstation configuration and simulation environment used in this study are as follows: operating system, Windows 11 (64-bit); processor, Intel(R) Core(TM) i7-10700 CPU @ 2.90 GHz; RAM, 32.0 GB; simulation software, MATLAB R2024a, together with the Simulink environment and the Simscape toolbox. All development code is written in MATLAB.
In this section, in order to verify the effectiveness of the proposed algorithm, the two end joints of the designed robotic manipulator are fixed, and simulation is performed using the first three joints as an example. The Simulink and Simscape toolboxes are used to build a simulation model of the controlled robotic manipulator. The simulation model of the designed robotic manipulator trajectory tracking control algorithm is shown in Figure 5.
During the training process, the desired trajectory of the robotic manipulator is set as shown in Equation (49). Since the initial position of the prismatic joint is defined at the base during dynamic modeling, an initial position offset exists for this joint.
The initial positions of the three joints of the robotic manipulator are specified accordingly, and the initial desired positions can be calculated using Equation (49). The physical parameters of the designed manipulator are listed in Table 2.
To demonstrate the robustness of the proposed control scheme, modeling errors, friction, and external disturbances are introduced into the robotic manipulator system. In this study, 80% of the dynamic parameter values constitute the nominal model, while the remaining 20% represent modeling errors in the system dynamics.
The disturbance acting on each joint of the robotic manipulator is assumed to be time-varying and to consist of three components:
where the first component is a small time-varying disturbance, the second is joint friction, and the third is an unknown large time-varying disturbance introduced at 5 s to verify the adaptability and robustness of the proposed control method.
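For reference, a disturbance profile consistent with this description could be generated as follows; all amplitudes, frequencies, and the friction model are assumptions, since the paper's exact expressions are not reproduced here.

```matlab
% Illustrative three-component joint disturbance (all values assumed).
d1 = 0.1 * sin(2 * t);                 % small time-varying disturbance
d2 = 0.2 * sign(qdot) + 0.05 * qdot;   % Coulomb plus viscous joint friction
d3 = 2.0 * (t >= 5);                   % large unknown disturbance applied at 5 s
d  = d1 + d2 + d3;                     % total lumped disturbance per joint
```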
The proposed algorithm is compared with non-singular fast terminal sliding mode control based on disturbance observer (NDONFT), non-singular fast terminal sliding mode control (NFTSM), NDOPID, and PID controllers to verify its effectiveness.
(1) PID: Proportional-integral-derivative controller; the control law is as follows:
For the PID controller, $K_p$, $K_i$, and $K_d$ represent the proportional, integral, and derivative coefficients, respectively. The selection of these parameters mainly relies on experience and tuning, with the principle of minimizing oscillations during the initial position tracking phase while ensuring trajectory tracking accuracy.
(2) NDOPID: PID controller incorporating the nonlinear disturbance observer; the control law is as follows:
For the NDOPID controller, in order to ensure a fair comparison, its basic parameters are selected to match those of the PID controller; the disturbance observer then estimates the disturbances to compensate the PID control law.
(3) NFTSM: Non-singular fast terminal sliding mode control without the disturbance compensation provided by the NDO and the parameter adaptation provided by the TD3 algorithm. The control law is as follows:
To ensure fairness in the controller comparison, the sliding mode surface coefficients are set equal to those in the NDONFT and TD3NDONFT controllers, and the remaining control coefficients are set equal to those in the NDONFT controller.
(4) NDONFT: Non-singular fast terminal sliding mode controller based on the NDO, without the adaptive strategy of TD3. The control law is as follows:
For the NDONFT controller, $X$ is an invertible matrix used to calculate the observer gain matrix and auxiliary variables. The same value of $X$ is used across all controllers involving disturbance observers.
(5) TD3NDONFT: Non-singular fast terminal sliding mode control based on TD3 and the NDO, with adaptive adjustment of the control parameters achieved via the TD3 algorithm. The control law is as follows:
For the TD3NDONFT controller, the adaptive parameters $k_p$, $k_i$, and $k_r$ are obtained via the TD3 algorithm. A sensitivity analysis of the controller parameters is conducted to set the range of parameter variation during training. The training parameters of the TD3 algorithm are listed in Table 3.
The parameter settings of the five control laws mentioned above are listed in Table 4.
Figure 6 shows the reward curve of the TD3 algorithm presented in this paper, which serves as a key indicator for evaluating the training effectiveness of DRL algorithms. During the policy exploration phase, the curve shows considerable volatility with a low and unstable average reward level, indicating that the strategy of the agent has not yet effectively explored action sequences with higher rewards and remains in an exploratory learning phase. Subsequently, the performance enters an optimization phase, with the average reward showing an overall upward trend and volatility gradually decreasing, indicating the strategy begins to learn more effective action selection mechanisms, leading to significant performance improvement. In the strategy convergence phase, the average reward tends to stabilize and the standard deviation converges to a low level, indicating the strategy has reached a stable convergence state, demonstrating strong robustness and generalization capability.
To further compare the dynamic characteristics of the five controllers, this paper evaluates controller performance using the mean absolute error (MAE) of joint position and joint velocity, the integral absolute error (IAE), and the integral time absolute error (ITAE):
$$\mathrm{MAE} = \frac{1}{T}\int_{0}^{T} |e(t)|\,dt, \qquad \mathrm{IAE} = \int_{0}^{T} |e(t)|\,dt, \qquad \mathrm{ITAE} = \int_{0}^{T} t\,|e(t)|\,dt$$
where $T$ represents the total simulation time.
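These indices can be computed directly from the logged error signal; the short sketch below assumes an error time series e sampled at times t over the horizon [0, T].

```matlab
% Performance indices from an error time series e(t) sampled at times t.
MAE  = mean(abs(e));            % mean absolute error
IAE  = trapz(t, abs(e));        % integral of absolute error over [0, T]
ITAE = trapz(t, t .* abs(e));   % time-weighted integral of absolute error
```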
After the training is completed, the desired trajectories of joint 1 and joint 3 of the robotic manipulator, together with the desired trajectory of joint 2, are set for the simulation.
Figure 7 shows the simulation results of joint position tracking and the tracking errors. All five algorithms ensure trajectory tracking within a certain error range before and after the sudden disturbances. Under the different control methods, the tracking errors of the three joints fall within the ranges of [−0.03, 0.03] rad, [−0.01, 0.015] m, and [−0.04, 0.04] rad, respectively, demonstrating a certain degree of disturbance rejection under sudden disturbances. In the initial stage, the PID and NDOPID control methods exhibit significant overshoot, while the NFTSM, NDONFT, and TD3NDONFT controllers track the desired trajectory relatively quickly. At 5 s, the joints experience a large unknown sudden disturbance. The maximum position tracking errors of the three joints after the disturbance are 0.0048 rad, 0.0014 m, and 0.0041 rad for the proposed TD3NDONFT algorithm; 0.0054 rad, 0.0016 m, and 0.0051 rad for the NDONFT method; and 0.0188 rad, 0.0092 m, and 0.0393 rad for the NFTSM method. The results indicate that the proposed control strategy better suppresses external unknown time-varying disturbances.
The MAE indicators for position tracking under the different control strategies are presented in Table 5. Both before and after the sudden disturbances are applied, the proposed control method demonstrates superior tracking performance. During 0–10 s, the MAE of the joint positions using the TD3NDONFT algorithm is reduced by 7.14%, 19.94%, and 6.14% compared with NDONFT, by 64.58%, 88.06%, and 84.53% compared with NFTSM, by 53.35%, 85.40%, and 63.43% compared with NDOPID, and by 70.60%, 88.48%, and 77.70% compared with PID, respectively. The NDONFT algorithm reduces the MAE by 49.76%, 81.77%, and 61.04% compared with NDOPID.
Figure 8 shows the joint velocity tracking of the robotic manipulator and the corresponding errors. In the initial stage, the velocity tracking exhibits a distinct reverse jump. This occurs because the initial state of the robotic manipulator does not match the desired state, which causes the controller to overcompensate at this stage and subsequently enables it to quickly track the desired velocity. Analysis of the velocity tracking error in Figure 8 indicates that although the system velocity experiences significant deviation after the sudden disturbance, all five control methods converge to the desired velocity. The maximum velocity tracking error of joint 1 after the sudden disturbance is 0.0409 rad/s using the proposed TD3NDONFT algorithm, 0.0436 rad/s using the NDONFT method, 0.0506 rad/s using the NFTSM method, 0.0602 rad/s using the NDOPID method, and 0.0691 rad/s using the PID method.
Table 6 shows the MAE performance indicators of velocity tracking under the different control methods. Both before and after the sudden disturbances are applied, the proposed algorithm maintains a smaller velocity MAE after reaching the steady state. Within 0–10 s, the velocity MAE of each joint using the TD3NDONFT algorithm is reduced by 1.78%, 9.10%, and 2.11% compared with NDONFT, by 19.26%, 58.01%, and 44.64% compared with NFTSM, by 14.18%, 37.37%, and 17.03% compared with NDOPID, and by 27.18%, 48.67%, and 32.45% compared with PID, respectively. This indicates that the proposed algorithm achieves high velocity tracking accuracy and robustness.
Figure 9 and Figure 10 show the IAE and ITAE indicators for position and velocity tracking using the five algorithms. The IAE indicators for the three joints of the proposed algorithm are the lowest in the 0–5 s, 5–10 s, and 0–10 s intervals, indicating that the TD3NDONFT algorithm achieves high position and velocity tracking accuracy both before and after experiencing unknown time-varying disturbances. Next, the effectiveness of the TD3 adaptive algorithm, the nonlinear disturbance observer, and the improved non-singular fast terminal sliding mode within the proposed TD3NDONFT algorithm is analyzed.
Within 0–10 s, the TD3NDONFT algorithm reduces the IAE of joint position tracking by 7.16%, 19.97%, and 6.16% compared to NDONFT, and reduces the IAE of velocity tracking by 1.78%, 9.11%, and 2.11%, indicating that introducing the TD3 adaptive algorithm improves the tracking performance of the robotic manipulator. Compared to NFTSM, NDONFT reduces the IAE of position tracking for each joint by 61.91%, 85.11%, and 83.56%, and reduces the IAE of velocity tracking by 17.81%, 53.82%, and 43.47%, respectively. Compared to PID, NDOPID reduces the position tracking IAE by 37.01%, 21.09%, and 39.05%, and the velocity tracking IAE by 15.17%, 18.05%, and 18.60%, indicating the effectiveness of introducing the nonlinear disturbance observer. Compared to NDOPID, NDONFT reduces the position tracking IAE by 49.82%, 81.79%, and 61.11%, and the velocity tracking IAE by 12.63%, 31.11%, and 15.25%, demonstrating the effectiveness of the improved non-singular fast terminal sliding mode. At the same time, the ITAE of position and velocity tracking for the three joints using the TD3NDONFT algorithm is the lowest among the five algorithms.
The variations of the proposed adaptive parameters are shown in Figure 11. The proportional gain $k_p$ is initially small so as to generate low joint torque and avoid excessive initial torque, and then gradually increases to improve joint tracking capability. However, the prismatic joint must overcome its own weight at startup, so a larger initial proportional gain is required there. The integral gain $k_i$ is larger for the prismatic joint than for the revolute joints in order to reduce the steady-state error of the system. Since the nonlinear disturbance observer has already estimated most of the disturbances, only a small robust gain $k_r$ is required to compensate for the remaining disturbance. At 5 s, when the sudden disturbance occurs, all parameters adjust accordingly to maintain tracking accuracy under unknown time-varying disturbances.
Figure 12 shows the control input torque for each joint of the robotic manipulator. In the initial stage, the PID and NDOPID algorithms exhibit large torque oscillations, which may adversely affect the joint actuators and may even cause motor driver overload or failure. At 5 s, when the sudden disturbance occurs, the torques of all five control methods fluctuate significantly. The torque response under the proposed TD3NDONFT algorithm is smoother, indicating that the method maintains good dynamic performance while ensuring steady-state accuracy.
Figure 13 shows the actual disturbances during trajectory tracking and the disturbances estimated by the nonlinear disturbance observer. The nonlinear disturbance observer within the proposed TD3NDONFT algorithm effectively estimates the disturbances affecting each joint of the robotic manipulator. The joint disturbance estimation errors are shown in Figure 14.
To further validate the generalization capability of deep reinforcement learning, desired trajectories with more complex dynamic characteristics are introduced for testing. The desired trajectories for joint 1 and joint 3 of the robotic manipulator, as well as the desired trajectory for joint 2, are set accordingly.
Figure 15 presents the position tracking curves and corresponding tracking errors of each joint under the different control strategies, while Figure 16 illustrates the velocity tracking performance and velocity tracking errors of the joints. The simulation results demonstrate that the proposed method maintains excellent accuracy and robustness even in complex trajectory tracking tasks, fully reflecting its strong generalization performance.