DDPG-Based Adaptive Sliding Mode Control with Extended State Observer for Multibody Robot Systems

This research introduces a robust control design for multibody robot systems, incorporating sliding mode control (SMC) for robustness against uncertainties and disturbances. SMC achieves this by driving the system states toward a predefined sliding surface for finite-time stability. However, selecting the controller parameters, specifically the switching gain, is challenging because it depends on the upper bounds of the perturbations, including nonlinearities, uncertainties, and disturbances, impacting the system. Consequently, gain selection becomes difficult when the system dynamics are unknown. To address this issue, an extended state observer (ESO) is integrated with SMC, resulting in SMCESO, which treats the system dynamics and disturbances as a perturbation and estimates them to compensate for their effects on the system response, ensuring robust performance. To further enhance system performance, the deep deterministic policy gradient (DDPG) algorithm is employed to fine-tune SMCESO, utilizing both actual and estimated states as inputs to the DDPG agent and for reward selection. This training process enhances both tracking and estimation performance. Furthermore, the proposed method is compared with optimal PID, SMC, and H∞ control in the presence of external disturbances and parameter variation. MATLAB/Simulink simulations confirm that, overall, SMCESO provides robust performance, especially under parameter variations, where the other controllers struggle to converge the tracking error to zero.


Introduction
The expanding capabilities of multibody robot systems in autonomous operation and their versatility in performing a wide range of tasks have attracted significant attention from both researchers and industries, emphasizing the persistent need for precision and reliability in their operations. As a result, multibody robot systems require robust control algorithms. However, controlling multibody robot dynamics can be a challenging task, especially when the robot dynamics are unknown. To this end, different robust control algorithms have been proposed, among which sliding mode control (SMC) has been of great interest due to its outstanding robustness against parametric uncertainties and external disturbances [1,2]. Subsequent developments resulted in different types of SMC, including integral SMC (ISMC) [3], super twisting SMC (STSMC) [4], terminal SMC (TSMC) [5], SMC with a nonlinear disturbance observer known as the sliding perturbation observer (SMC-SPO) [6], and SMC with an extended state observer (SMCESO) [7]. This research is conducted for the robust control of multibody industrial robot systems with the aim of enhancing trajectory tracking results. Therefore, we consider the nonlinear control SMC with ESO (SMCESO) for the robot control. The ESO considers the system dynamics and external disturbances as perturbations to the system. Therefore, with the ESO, the system is only affected by the perturbation estimation error because of the compensation provided by the ESO. Another advantage of the ESO is that it requires no system dynamics information and uses only partial state feedback (position) for estimating the states and the perturbation. Subsequently, the robustness of SMCESO depends on the quality of the ESO's estimation, which in turn depends on the selection of the control parameters. However, tuning the parameters manually is a challenging task. Therefore, optimal parameter selection can be achieved by adapting the parameters to different sliding conditions.
Various methods for adaptive SMC have been explored, including model-free adaptation, intelligent adaptation, and observer-based adaptation. A. J. Humaidi et al. introduced particle swarm optimization-based adaptive STSMC [8]. The adaptation is carried out based on Lyapunov theory to guarantee global stability. Y. Wang and H. Wang introduced model-free adaptive SMC, initially estimating the unknown dynamics through the time delay estimation method [9,10]. Nevertheless, this approach exhibited undesirable chattering in the control input during experiments, which is deemed unacceptable in the present research. On the other hand, R.-D. Xi et al. presented adaptive SMC with a disturbance observer for robust robot manipulator control [11]. Observer-based adaptive SMC stands out for its ability to ensure robustness by minimizing the impact of lumped disturbances, a feature similarly emphasized by C. Jing et al. [12]. Conclusively, that study states that implementing a disturbance observer can lead to finite-time stability and a specific tracking performance quality. Furthermore, H. Zao et al. introduced fuzzy SMC for robot manipulator trajectory tracking [13]. H. Khan et al. proposed extremum seeking (ES)-based adaptive SMCSPO for industrial robots [14]. A unique cost function is used, consisting of the estimation error and the error dynamics, to guarantee accurate state and perturbation estimation. H. Razmi et al. proposed neural network-based adaptive SMC [15], and Z. Chen et al. presented radial basis function neural network (NN)-based adaptive SMC [16], both demonstrating commendable performance. However, it is worth noting that the systems under consideration in these studies were relatively smaller than the industrial robot in our current research. Furthermore, a model-free reinforcement learning algorithm known as deep deterministic policy gradient (DDPG) has been observed to provide optimal SMC parameters, enhancing performance through learning and adapting to different sliding patterns [17-19].
Considering the diverse literature, the model-free extremum seeking algorithm was initially a consideration. However, in the current study, the need to tune multiple (four different) parameters simultaneously led to the exploration of learning-based algorithms such as NNs and DDPG for adapting the controller parameters. Notably, NNs are well suited to simpler systems, while DDPG is preferred for complex, high-dimensional systems with unknown dynamics. DDPG is a model-free, online, off-policy reinforcement learning algorithm. It employs an actor-critic architecture, where the actor focuses on learning the optimal policy, while the critic approximates the Q-function [20]. The Q-function is responsible for estimating the expected cumulative long-term reward for state-action pairs. The critic achieves this by minimizing the temporal difference error, which represents the disparity between the predicted Q-value and the actual Q-value derived from environmental feedback. This process equips the critic to evaluate action quality in various states, guiding the actor in selecting actions that maximize expected rewards. Ensuring the convergence of the temporal difference error is a pivotal aspect of effective DDPG agent training.
The primary contribution of this study is the optimal tuning of SMCESO using the DDPG algorithm for a heavy-duty industrial robot manipulator with six degrees of freedom (DOF). Robust performance can be achieved by minimizing the estimation errors, ensuring accurate perturbation estimation and compensation. To accomplish this, the DDPG input states incorporate the tracking error, estimation error, current joint angle, and estimated joint angle. A reward has been designed, integrating an overall error tolerance of 0.01 rad for both tracking and estimation errors, yielding positive rewards if the error is below the threshold. Conversely, if the errors exceed this threshold, negative rewards are assigned. Through this approach, the DDPG agent learns a control pattern based on the actual and estimated results, ultimately achieving optimal estimation and robust control performance. The proposed algorithm was implemented and compared with optimal proportional-integral-derivative (PID) control, SMC, and H∞ control in an extensive MATLAB/Simulink simulation environment. The results demonstrate that SMCESO outperforms all three controllers, particularly in the presence of variable system parameters, as it effectively reduces the effect of the actual perturbations on system performance.
The remainder of the manuscript is organized as follows: Section 2 describes the general multibody dynamics and formulates the SMC. Section 3 presents the ESO and the DDPG algorithm. Section 4 then presents the simulation environment and the results of the proposed algorithm, whereas Section 5 provides the conclusions.
Consider the general dynamics of an n-DOF system:

$$\ddot{x}_j = f_j(x) + \Delta f_j(x) + \sum_{i=1}^{n}\big[(b_{ji}(x) + \Delta b_{ji}(x))u_i\big] + d_j(t) \quad (1)$$

where x = [x₁ … x_n]ᵀ is the state vector representing the positions, and f_j(x) and Δf_j(x) are the nominal dynamics and the dynamic uncertainties, respectively. Similarly, the control gain matrix and its uncertainties are represented by b_ji(x) and Δb_ji(x), respectively. u_i and d_j are the control input and external disturbance, respectively. Combining the system nonlinearities, dynamic uncertainties, and disturbances as the perturbation ψ gives

$$\psi_j(x,t) = \Delta f_j(x) + \sum_{i=1}^{n}\Delta b_{ji}(x)u_i + d_j(t) \quad (2)$$

where it is assumed that the perturbation is bounded by an unknown continuous function, i.e., |ψ_j(x,t)| ≤ Γ > 0, and, in addition, that it is smooth with a bounded derivative, |ψ̇_j(x,t)| ≤ Γ.
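To make the lumped-perturbation definition concrete, the following Python sketch evaluates ψ for a hypothetical scalar system; the uncertainty shapes Δf, Δb and all numeric values are illustrative assumptions, not taken from this paper.

```python
import numpy as np

def perturbation(x, u, d, df, db):
    """Lumped perturbation psi = Delta_f(x) + Delta_b(x)*u + d (scalar sketch)."""
    return df(x) + db(x) * u + d

# Assumed toy uncertainty shapes (illustrative only):
df = lambda x: 0.1 * np.sin(x)   # unmodelled dynamics Delta_f
db = lambda x: 0.05              # input-gain uncertainty Delta_b

psi = perturbation(x=1.0, u=2.0, d=0.3, df=df, db=db)
print(round(psi, 4))             # -> 0.4841
```

The same lumping applies per joint j in the multibody case, with the sum running over all control inputs.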

Sliding Mode Control
The main concept of SMC is to design a sliding surface σ in the state space (position x₁ and velocity x₂) [21], given as

$$\sigma = \dot{e} + ce \quad (3)$$

where e = x_d − x is the tracking error and c > 0 is a positive constant. Now, in order to drive the system dynamics, the state variables should converge to zero, i.e., lim_{t→∞} e, ė = 0 asymptotically in the presence of the perturbation. Therefore, SMC tends to bring the system states onto the sliding surface by means of a control force u. Subsequently, SMC has two phases: the first is the reaching phase, during which the system states are not on the sliding surface and require a switching control u_sw to reach the sliding surface. The second phase is the sliding phase, in which the system states have reached the sliding surface and now require a continuous control, generally known as the equivalent control u_eq, to remain on the sliding surface, where the overall control input becomes u = u_eq + u_sw. To compute the control input, the derivative of the sliding surface is defined as

$$\dot{\sigma} = \ddot{e} + c\dot{e} \quad (4)$$
Setting σ̇ = −K_smc·sat(σ), where K_smc represents the switching control gain and "sat" is the saturation function with a boundary layer thickness ε_c, given as

$$\mathrm{sat}(\sigma) = \begin{cases} \sigma/\varepsilon_c, & |\sigma| \le \varepsilon_c \\ \mathrm{sgn}(\sigma), & |\sigma| > \varepsilon_c \end{cases} \quad (5)$$

Assuming unknown system dynamics, ẍ = u is presumed. Substituting this condition with the dynamics error ë = ẍ_d − ẍ into (4) results in the following control input:

$$u = \ddot{x}_d + c\dot{e} + K_{smc}\,\mathrm{sat}(\sigma) \quad (6)$$
Here, K_smc·sat(σ) denotes the switching control u_sw, with the negative sign in the reaching law σ̇ = −K_smc·sat(σ) embodying the error convention. The remaining terms are considered the equivalent control u_eq. Subsequently, taking the derivative of the sliding surface, with the system disturbed by the perturbation (such as (10) in the subsequent section), yields (7):

$$\dot{\sigma} = \ddot{x}_d - (u + \psi) + c\dot{e} \quad (7)$$

Substituting the control law from (6) and solving results in (8):

$$\dot{\sigma} = -K_{smc}\,\mathrm{sat}(\sigma) - \psi \quad (8)$$
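As an illustration of the SMC law described above, the following Python sketch applies u = ẍ_d + c·ė + K_smc·sat(σ) to a toy double integrator ẍ = u + ψ with an assumed bounded perturbation; the gains and disturbance are illustrative choices, not the paper's tuned values.

```python
import numpy as np

def sat(s, eps):
    """Saturation with boundary layer thickness eps."""
    return np.clip(s / eps, -1.0, 1.0)

c, K, eps, dt = 5.0, 2.0, 0.05, 1e-3   # illustrative gains
x = v = 0.0
for i in range(int(5.0 / dt)):
    t = i * dt
    xd, xd_dot, xd_ddot = np.sin(t), np.cos(t), -np.sin(t)
    e, edot = xd - x, xd_dot - v
    sigma = edot + c * e                  # sliding surface
    u = xd_ddot + c * edot + K * sat(sigma, eps)
    psi = 0.5 * np.sin(2 * t)             # assumed bounded perturbation
    v += (u + psi) * dt                   # plant: xdd = u + psi
    x += v * dt

print(abs(np.sin(5.0) - x))               # final tracking error stays small
```

With K_smc larger than the perturbation bound, the states reach the boundary layer and the tracking error remains of the order of the layer thickness.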
Equation (8) shows that in SMC, the sliding surface is affected by the perturbation. Once the system states have reached the sliding phase (|σ| ≤ ε_c), the relationship between the sliding surface and the perturbation is given as the following transfer function [6]:

$$\frac{\sigma(p)}{\psi(p)} = \frac{-\varepsilon_c}{\varepsilon_c p + K_{smc}} \quad (9)$$

where p is the s-domain variable. Increasing the boundary layer will decrease the break frequency K_smc/ε_c, making the system less sensitive to higher-frequency perturbations. However, at σ ≈ 0, increasing the boundary layer thickness reduces controller performance, leading to a higher tracking error. If the sliding surface is tightly bounded, with a very small boundary layer, chattering occurs.

Problem Formulation
Calculating the dynamics of a multibody robot system is a challenging task, further compounded by the presence of inaccurate dynamics, which introduce uncertainties. Therefore, for the later study, considering the complete dynamic model as the perturbation and b = 1, the resulting dynamics are as follows:

$$\ddot{x} = u + \psi(x,t) \quad (10)$$

Subsequently, to ensure the sliding condition outside the boundary layer, a Lyapunov function is defined over the sliding dynamics as

$$V = \frac{1}{2}\sigma^2 \quad (11)$$

For the asymptotic stability of (11) about the equilibrium point, V̇ < 0 for σ ≠ 0 must be satisfied [22]. The derivative of V is computed as

$$\dot{V} = \sigma\dot{\sigma} = \sigma(\ddot{x}_d - u - \psi + c\dot{e}) \quad (12)$$

Taking u = ẍ_d + c·ė − η gives

$$\dot{V} = \sigma(\eta - \psi) \quad (13)$$

Selecting η = −K_smc·sat(σ), and with K_smc > 0, (13) becomes

$$\dot{V} = \sigma(-K_{smc}\,\mathrm{sat}(\sigma) - \psi) \quad (14)$$

$$\dot{V} \le -|\sigma|(K_{smc} - \Gamma) \quad (15)$$

Consequently, the overall control input becomes

$$u = \ddot{x}_d + c\dot{e} + K_{smc}\,\mathrm{sat}(\sigma) \quad (16)$$

Equation (15) further emphasizes that, for stability, K_smc > Γ must hold to satisfy the Lyapunov condition. However, obtaining information about Γ can be a complex and tedious task.

Proposed Algorithm
There are two concerns. First, based on (9), as the perturbation affects the sliding dynamics and the correct dynamics are unknown, a perturbation observer has been used to estimate and compensate for the actual perturbation effects. For this purpose, an extended state observer (ESO) has been implemented, which offers the advantage of not requiring system dynamics information. Second, we optimally tune the control gains of SMC and the ESO to stabilize the system in finite time, ensuring that both the tracking and estimation errors converge to zero. Subsequently, deep deterministic policy gradient (DDPG) has been employed for control gain tuning.

Extended State Observer
ESO provides real-time estimation of unmeasured system states and the perturbation, which is the combination of modelled and unmodelled dynamics and external disturbances, enhancing control system performance and robustness. This means the ESO considers the system's linear and nonlinear dynamics as the perturbation and estimates them [23]. Consequently, only the control input u in (1) is known. Furthermore, the ESO does not require system dynamics information and uses only partial state feedback (position) for estimation. In addition to the system states (position x₁ and velocity x₂), an extended state x₃ is introduced, such that

$$x_3 = \psi(x,t) \quad (17)$$

Subsequently, the system dynamics in (1) can be simplified as

$$\dot{x}_1 = x_2, \qquad \dot{x}_2 = u + x_3, \qquad \dot{x}_3 = \dot{\psi}(x,t) \quad (18)$$

With the new system information, the mathematical model of the nonlinear ESO [7] is then written as

$$\dot{\hat{x}}_1 = \hat{x}_2 + l_1\,\mathrm{sat}(\tilde{x}_1/\varepsilon_o), \qquad \dot{\hat{x}}_2 = u + \hat{x}_3 + l_2\,\mathrm{sat}(\tilde{x}_1/\varepsilon_o), \qquad \dot{\hat{x}}_3 = l_3\,\mathrm{sat}(\tilde{x}_1/\varepsilon_o) \quad (19)$$

where the components with "∧" and "∼" represent the estimated states and the error between the actual and the estimated value (e.g., x̃₁ = x₁ − x̂₁), and ε_o is the boundary layer of the ESO such that the estimation error should satisfy |x̃₁| ≤ ε_o. As the estimation error should be bounded by the boundary layer, sat(x̃₁/ε_o) = x̃₁/ε_o, and the error dynamics can be rewritten as follows:

$$\dot{\tilde{x}}_1 = \tilde{x}_2 - \frac{l_1}{\varepsilon_o}\tilde{x}_1, \qquad \dot{\tilde{x}}_2 = \tilde{x}_3 - \frac{l_2}{\varepsilon_o}\tilde{x}_1, \qquad \dot{\tilde{x}}_3 = \dot{\psi} - \frac{l_3}{\varepsilon_o}\tilde{x}_1 \quad (22)$$

Subsequently, the state space of the error dynamics can be written as

$$\dot{\tilde{x}} = A\tilde{x} + B\dot{\psi}, \qquad A = \begin{bmatrix} -l_1/\varepsilon_o & 1 & 0 \\ -l_2/\varepsilon_o & 0 & 1 \\ -l_3/\varepsilon_o & 0 & 0 \end{bmatrix}, \qquad B = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} \quad (24)$$

The characteristic equation of A can be calculated as follows:

$$p^3 + \frac{l_1}{\varepsilon_o}p^2 + \frac{l_2}{\varepsilon_o}p + \frac{l_3}{\varepsilon_o} = 0 \quad (25)$$

The error dynamics are stable if the gains l₁, l₂, and l₃ are positive. Therefore, these gains are selected using the pole placement method, placing all three poles at −λ:

$$(p + \lambda)^3 = p^3 + 3\lambda p^2 + 3\lambda^2 p + \lambda^3 = 0 \quad (26)$$

Comparing the coefficients of (25) and (26) results in the following selection of gains:

$$l_1 = 3\lambda\varepsilon_o, \qquad l_2 = 3\lambda^2\varepsilon_o, \qquad l_3 = \lambda^3\varepsilon_o \quad (27)$$
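The pole-placement selection can be checked numerically. The following Python sketch uses the linear-range form of the observer (inside the boundary layer, sat(x̃₁/ε_o) ≈ x̃₁/ε_o, so the effective gains reduce to 3λ, 3λ², λ³, a triple pole at −λ) to estimate an assumed perturbation from position feedback only; the plant, the disturbance, and λ are illustrative choices, not the paper's values.

```python
import numpy as np

lam, dt, T = 40.0, 1e-3, 5.0
g1, g2, g3 = 3*lam, 3*lam**2, lam**3   # effective observer gains (triple pole at -lam)

x1 = x2 = 0.0                          # plant: xdd = u + psi(t)
x1h = x2h = x3h = 0.0                  # observer states (pos, vel, perturbation)
u = 0.0
for i in range(int(T / dt)):
    t = i * dt
    psi = 1.0 + 0.5 * np.sin(t)        # unknown perturbation to be estimated
    e1 = x1 - x1h                      # observer uses only measured position x1
    x1h += (x2h + g1 * e1) * dt
    x2h += (u + x3h + g2 * e1) * dt
    x3h += (g3 * e1) * dt
    x2 += (u + psi) * dt               # plant integration
    x1 += x2 * dt

print(round(x3h, 3), round(1.0 + 0.5 * np.sin(T), 3))  # estimate vs. true psi
```

The extended state x̂₃ tracks the slowly varying perturbation with an error that shrinks as λ grows, at the cost of higher noise sensitivity in practice.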

Extended State Observer-Based Sliding Mode Control (SMCESO)
For enhanced system performance, the final control input u_o for the system, with the estimated perturbation ψ̂(x, t) from the ESO and the switching control from SMC, can be written as

$$u_o = u - \hat{\psi}(x,t) \quad (28)$$

where u is from (16). Consequently, the system dynamics from (10) can be rewritten as follows:

$$\ddot{x} = u_o + \psi(x,t) = u + \tilde{\psi}(x,t) \quad (29)$$

where ψ̃(x, t) = ψ(x, t) − ψ̂(x, t) is the perturbation estimation error. Now, it is evident that with the ESO, the system is only affected by the perturbation estimation error rather than the actual perturbation. Since |ψ̃(x, t)| ≤ |ψ(x, t)|, the ESO-based SMC is more stable than SMC alone. Subsequently, the Lyapunov function in (15) will become

$$\dot{V} \le -|\sigma|(\overline{K}_{smc} - |\tilde{\psi}|) \quad (31)$$

The stability of SMCESO with the Lyapunov condition σσ̇ ≤ 0 can be calculated as follows. With the system dynamics and the combined control input from (6) and (28), according to (7), this results in the following condition:

$$\sigma\dot{\sigma} = \sigma(-\overline{K}_{smc}\,\mathrm{sat}(\sigma) - \tilde{\psi}) \le 0 \quad (32)$$

Simplifying (32) yields (33):

$$\overline{K}_{smc}|\sigma| \ge -\sigma\tilde{\psi} \quad (33)$$

Subsequently, to keep the system stable, the control gain should satisfy the following condition:

$$\overline{K}_{smc} > |\tilde{\psi}(x,t)| \quad (34)$$

Now, the new control gain K̄_smc is small in comparison to the conventional gain K_smc, with K̄_smc < K_smc. The reduced gain will result in smoother switching control, eliminating chattering for improved performance. Furthermore, the control parameters K̄_smc, c, ε_c, and λ are then optimally tuned using DDPG to reduce manual tuning effort.

DDPG-Based SMCESO
Deep deterministic policy gradient (DDPG) is a reinforcement learning algorithm designed for solving continuous action space problems.It combines elements of deep neural networks and the deterministic policy gradient theorem to achieve remarkable performance in control tasks.DDPG employs an actor-critic architecture, with the actor network modeling the policy and the critic network estimating the state-action value function.A key innovation in DDPG is the use of target networks to stabilize training, with periodic updates to slowly track the learned networks.This approach, coupled with experience replay, enables stable and efficient learning, making DDPG a prominent choice for complex, high-dimensional control problems.
Similar to other reinforcement learning algorithms, the DDPG algorithm operates within the framework of a Markov decision process (MDP) [24], denoted by (S, A, P, R), where S and A represent the environment's state space and the agent's action space, respectively, and P signifies the probability of state transitions. During agent training, the reward function R serves as the training target. At its core, while training the agent, the system's state s ∈ S is observed, and the associated reward r ∈ R is acquired. Subsequently, the optimal policy π(a|s) is determined by maximizing the state-action value function

$$Q^{\mu}(s,a) = \mathbb{E}\left[R_c \mid s, a\right] \quad (35)$$
where R_c represents the cumulative reward, R_c = ∑_{k=0}^{∞} γᵏ r_{k+1}, and 0 ≤ γ ≤ 1 is the discount factor that reflects the importance of reward values at future moments. To enhance controller performance, DDPG has to learn the regulation strategy μ (the actor network) and evaluate the value of each action. Consequently, the controller parameters are updated in real time to maximize the total reward [25,26].
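A quick numeric example of the discounted return R_c = ∑ₖ γᵏ r_{k+1}, using a made-up, truncated reward sequence (illustrative values only):

```python
gamma = 0.9
rewards = [1.0, 1.0, -1.0, 1.0]   # r_1 ... r_4, truncated horizon
R_c = sum(gamma**k * r for k, r in enumerate(rewards))
print(R_c)   # 1 + 0.9 - 0.81 + 0.729 = 1.819
```

A reward received three steps in the future is weighted by γ³ ≈ 0.73, so near-term tracking and estimation accuracy dominate the agent's objective.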
θ(k) is the set of action parameters, with minimum limit θ_min and maximum limit θ_max. The structure of DDPG is presented in Figure 1. The selection of a suitable state space is crucial for ensuring the convergence of reinforcement learning. In the context of the present challenges, the chosen state space should inherently pertain to the robot's position and its estimated dynamics. As a result, for the sake of computational efficiency and enhanced learning, the state space is straightforwardly defined as S = [x(k), x̂(k)], and the state vector is defined as s_k = [x₁, x̂₁, e, x̃₁]. The actor-critic value network for the robot system is established, which is a double-layer structure including the target network and the main network. The replay buffer stores data in the form of [s_k, a_k, r_k, s_{k+1}], which is used for network training. Both the main networks and target networks share the same structure but differ in their parameters. The actor network is denoted by a_k = μ(s_k|θ^μ), with θ^μ as the network parameter. The critic network is denoted as Q(s_k, a_k|θ^Q), with θ^Q as the network parameter. When training, small batches of sample information [s_i, a_i, r_i, s_{i+1}] are randomly selected from the replay buffer for learning. In brief, the training process involves the four networks to ensure that actions generated by the actor network can be used as input for the critic network to maximize the state-action value function in (35). The training process is provided in Algorithm 1.

Algorithm 1: Training DDPG Agent
Initialize the main networks μ(s_k|θ^μ) and Q(s_k, a_k|θ^Q) randomly.
Initialize the target networks μ′(s_k|θ^μ′) and Q′(s_k, a_k|θ^Q′) with weights θ^μ′ ← θ^μ and θ^Q′ ← θ^Q. Initialize the replay buffer R.

While ep ≤ ep_max
  Randomly initialize the process N for action exploration. Receive the state s_k.
  While k < k_max
    Select the action a_k = μ(s_k|θ^μ) + N.
    Execute the environment to obtain the reward r_k and the next state s_{k+1}.
    Store [s_k, a_k, r_k, s_{k+1}] in the replay buffer R.
    Sample a random minibatch of m transitions [s_i, a_i, r_i, s_{i+1}] from R.
    Set the target y_i = r_i + γ Q′(s_{i+1}, μ′(s_{i+1}|θ^μ′)|θ^Q′).
    Update the critic by minimizing the loss L = (1/m) Σ_i (y_i − Q(s_i, a_i|θ^Q))².
    Update the actor using the sampled policy gradient ∇_{θ^μ} J ≈ (1/m) Σ_i ∇_a Q(s, a|θ^Q)|_{s=s_i, a=μ(s_i)} ∇_{θ^μ} μ(s|θ^μ)|_{s_i}.
    Update the target networks with the soft update θ′ ← τθ + (1 − τ)θ′.
  End
End
Stop training.
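One iteration of the update rules in Algorithm 1 can be sketched with tiny linear stand-ins for the actor and critic; the "networks" below are hypothetical placeholders for the MLPs used in the paper, and all numbers are illustrative.

```python
import numpy as np

def critic(s, a, w):       # stand-in Q(s, a | w): linear in (s, a, bias)
    return w[0]*s + w[1]*a + w[2]

def actor(s, th):          # stand-in mu(s | th): linear policy
    return th[0]*s + th[1]

gamma, tau = 0.99, 0.005
w  = np.array([0.5, 0.2, 0.0]);  w_t  = w.copy()    # critic / target critic
th = np.array([1.0, 0.0]);       th_t = th.copy()   # actor  / target actor

# One stored transition (s, a, r, s'):
s, a, r, s2 = 0.4, -0.1, 1.0, 0.3

# TD target uses the *target* networks, as in Algorithm 1:
y = r + gamma * critic(s2, actor(s2, th_t), w_t)
td_error = y - critic(s, a, w)          # the critic loss minimizes td_error**2

# Soft target update: theta' <- tau*theta + (1 - tau)*theta'
w_t  = tau * w  + (1 - tau) * w_t
th_t = tau * th + (1 - tau) * th_t
print(y, td_error)
```

Gradient descent on td_error² (critic) and ascent along ∇_a Q (actor) would follow; they are omitted here since they depend on the network parameterization.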
The DDPG-based SMCESO block diagram is presented in Figure 2. For robust performance, the tracking error should be eliminated. Subsequently, the estimation should be accurate, i.e., x̃ → 0. Consequently, the true perturbation will be estimated and well compensated. Therefore, the reward function for the current study is designed as follows.
where e_tol is the error tolerance for accepting good tracking control performance. Similarly, R₂ is for good ESO performance, with the estimation error tolerance x̃_{1,tol}. R₃ is for the stopping condition (isdone in Algorithm 1), triggered when the robot is not stable and exceeds the movement limit x_{1,stop}.
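A Python sketch of this reward structure follows; the ±1 magnitudes and the −100 stop penalty are assumed values chosen for illustration, since the text only specifies positive/negative rewards, the 0.01 rad tolerances, and the stop condition.

```python
def reward(e, x1_err, x1, e_tol=0.01, x1_tol=0.01, x1_stop=3.14):
    """R = R1 + R2 + R3; the +/-1 magnitudes and -100 penalty are assumptions."""
    r1 = 1.0 if abs(e) < e_tol else -1.0        # R1: tracking performance
    r2 = 1.0 if abs(x1_err) < x1_tol else -1.0  # R2: estimation performance
    isdone = abs(x1) > x1_stop                  # R3: movement-limit stop condition
    r3 = -100.0 if isdone else 0.0
    return r1 + r2 + r3, isdone

print(reward(0.005, 0.02, 0.1))  # good tracking, poor estimation -> (0.0, False)
```

The balanced R₁/R₂ terms push the agent to improve tracking and estimation simultaneously, matching the dual objective stated above.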


Simulations and Discussion
This section provides details about the simulation system and the environment.It also includes the presentation of results and their subsequent discussion.

System and Environment Description

For the DDPG implementation, a simulation environment is created in MATLAB/Simulink, featuring an object pick-and-place task using the Simscape Multibody model of the KUKA KR 16 S industrial robot arm, as presented in Figure 2. The KR 16 is a six-degrees-of-freedom (DOF) high-speed, heavy-duty industrial robot arm with a substantial payload capacity. Demonstrating robust performance with such a robot will validate the efficiency of the proposed method. Consequently, the robot must exhibit robust performance and a minimal tracking error in the presence of nonlinear dynamics. The sampling time for the DDPG algorithm is set to 0.5 s, while the control algorithm operates with a sampling time of 5 ms. The computations are carried out on a computer equipped with an Intel i7 processor and an RTX 3090 ti GPU.

Simulations
Simulations are conducted in two phases. The first is the implementation of the proposed algorithm on a simple linear system to explain the basic workings of the ESO. The second is the implementation on the multibody dynamics of the robot arm, with a sine wave as the desired position. For the simulations, the DDPG hyperparameters are presented in Table 1.

Simple System Implementation

For a simple linear system, consider the following second-order dynamics:

$$\ddot{x} + b\dot{x} + kx = u + a\,d(t)$$
where a is the magnitude of the disturbance d(t), b = 10 is the damping coefficient, and k = 50 is the stiffness. The performance of DDPG-based SMCESO has been compared with SMC, proportional-integral-derivative (PID) control optimally tuned using the Control System Tuner toolbox in Simulink, and H∞ control [27]. The control gains are provided in Table 2, and the trajectory tracking error is shown in Figure 3. The error results of the step response in Figure 3a reveal that when a disturbance (a = 10) is present, all three controllers other than SMCESO demonstrate good performance with high control gains but fail to fully converge the error to zero. In contrast, SMCESO effectively estimates and compensates for the perturbation, as depicted in Figure 4, leading to error convergence toward zero. Moreover, as anticipated in Section 3.2, the new control gain K̄_smc is notably smaller than the conventional gain K_smc (in Table 1), as tuned using the DDPG algorithm. Additionally, the algorithms underwent testing with parameter changes, where the stiffness was chosen as k = 50 ± 8. These variations were introduced using the Simulink random number block with a variance of 20. The tracking errors for variable stiffness are presented in Figure 3b, illustrating that PID exhibits the maximum deviation, while H∞ outperforms PID. However, SMC now surpasses H∞ due to a model mismatch between the actual system and the dynamics used for controller synthesis. Finally, SMCESO outperforms all three controllers by maintaining the error very close to zero. This validates that SMCESO effectively estimates system uncertainties and compensates for their effects on the system response, resulting in robust performance.
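The qualitative comparison on this plant can be reproduced with a Python sketch; the controller and observer gains below are illustrative choices, not the tuned values from Table 2, but they show the same effect: a constant disturbance leaves plain SMC with a residual error, while the ESO compensation drives the error essentially to zero with a far smaller switching gain.

```python
import numpy as np

def sat(s, eps):
    return np.clip(s / eps, -1.0, 1.0)

def simulate(use_eso, T=3.0, dt=1e-4):
    """Step response of the toy plant xdd = u - b*xd - k*x + a
    (b = 10, k = 50, a = 10 as in the text); gains are illustrative."""
    b, k, a = 10.0, 50.0, 10.0
    c, eps = 2.0, 0.05
    K = 5.0 if use_eso else 80.0            # SMCESO works with a far smaller gain
    lam = 60.0                              # ESO bandwidth (triple pole at -lam)
    g1, g2, g3 = 3*lam, 3*lam**2, lam**3    # effective observer gains
    x = v = x1h = x2h = x3h = 0.0
    for _ in range(int(T / dt)):
        e, edot = 1.0 - x, -v               # step reference x_d = 1
        sigma = edot + c * e
        u = c * edot + K * sat(sigma, eps)  # SMC law, xdd_d = 0 for a step
        if use_eso:
            u -= x3h                        # compensate estimated perturbation
            e1 = x - x1h                    # ESO update from position only
            x1h += (x2h + g1 * e1) * dt
            x2h += (u + x3h + g2 * e1) * dt
            x3h += (g3 * e1) * dt
        v += (u - b * v - k * x + a) * dt   # plant integration
        x += v * dt
    return 1.0 - x                          # final tracking error

e_smc, e_eso = simulate(False), simulate(True)
print(abs(e_smc), abs(e_eso))               # SMCESO error is clearly smaller
```

With the boundary layer active, plain SMC settles at a nonzero error of roughly ψε_c/(cK_smc), whereas the ESO estimates the lumped spring/damper/disturbance term and removes it from the sliding dynamics.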

Control Algorithm Gains


Adaptive SMCESO with Multibody Robot
With the multibody robot system, the DDPG agent has been trained to fine-tune the controller parameters. For controller evaluation, Joint 2 (q₂) of the robot manipulator has been considered, as it holds the maximum weight of the robot against gravity. Therefore, the robot arm is fully extended, and only q₂ is moving. The desired trajectory is defined as q_{2,d} = sin(w·t), with initial frequency w_o = 1, which resets after every episode as w = 1 + rand[−0.5, 0.5]. Furthermore, the total simulation time is 10 s, with an ideal reward r_max = 210. The training stops when the average reward reaches r_c ≥ 199, considering the average reward window length. The DDPG agent took 343 episodes for training. The episode reward and the cumulative reward are presented in Figure 5; subsequently, the tuned parameters are shown in Figure 6, and the trajectory tracking error and joint torques are in Figure 7. The joints were equipped with electromechanical motor dynamics, with the motor parameters given in Table 3. Consequently, both control algorithms (SMC and SMCESO) can achieve joint tracking errors within the range of ±1 degree. However, it is evident from the control input that SMC has sudden spikes throughout the simulation. Reducing the gains can eliminate these spikes but will reduce the control performance, resulting in larger errors. Similarly, to reduce the error of SMC, higher gains (more than double those of SMCESO) are required. This, in turn, increases the spikes and occasionally introduces chattering in the response. In contrast, SMCESO shows very smooth performance and keeps the error within the range of ±0.1 degree. This validates the robustness of SMC integrated with ESO, which overcomes the perturbation effects of the system with a total mass m > 55 kg on Joint 2.
Overall, the initial jump in the control input is primarily attributed to motor dynamics such as friction, which stabilizes once the robot starts moving. Moreover, for a deeper understanding of how the robust performance is achieved, the estimated states can be observed in Figure 8. The position and velocity results show that the state observer is performing very well, with estimations showing nearly zero error. This suggests that the system has highly effective perturbation estimation and compensation capabilities that enhance the tracking performance. Moreover, the Simscape Multibody toolbox allows obtaining the dynamics components of the robot system, including the mass matrix M(q), the velocity product torque C(q, q̇)·q̇ with C(q, q̇) the Coriolis terms, and the gravitational torque G(q). This can be achieved by first creating the rigid body tree and then utilizing the Manipulator Algorithm library from the Robotics System Toolbox. Subsequently, similar to (10), the expected perturbation is presumed as

ψ(x, t) = C(q, q̇)·q̇ + G(q) (40)
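To make Equation (40) concrete, the sketch below evaluates ψ(x, t) = C(q, q̇)·q̇ + G(q) along the sine trajectory for a hypothetical two-link planar arm with textbook dynamics. The link masses and lengths are illustrative placeholders, not the actual robot's parameters, and Python with NumPy stands in for the Simscape/Robotics System Toolbox workflow.

```python
import numpy as np

# Hypothetical 2-link planar arm (illustrative parameters only;
# total mass > 55 kg, matching the scale mentioned in the text).
m1, m2 = 25.0, 30.0            # link masses [kg]
l1, lc1, lc2 = 0.4, 0.2, 0.25  # link 1 length, centre-of-mass offsets [m]
g = 9.81

def coriolis_matrix(q, qd):
    """C(q, q_dot) for the standard 2-link planar arm."""
    h = -m2 * l1 * lc2 * np.sin(q[1])
    return np.array([[h * qd[1], h * (qd[0] + qd[1])],
                     [-h * qd[0], 0.0]])

def gravity_vector(q):
    """G(q) for the standard 2-link planar arm."""
    g1 = ((m1 * lc1 + m2 * l1) * g * np.cos(q[0])
          + m2 * lc2 * g * np.cos(q[0] + q[1]))
    g2 = m2 * lc2 * g * np.cos(q[0] + q[1])
    return np.array([g1, g2])

def assumed_perturbation(q, qd):
    """psi(x, t) = C(q, q_dot) . q_dot + G(q), as in Equation (40)."""
    return coriolis_matrix(q, qd) @ qd + gravity_vector(q)

# Evaluate along the desired trajectory q2,d = sin(w*t), joint 1 held fixed.
w, t = 1.0, np.linspace(0.0, 10.0, 1001)
q2, q2dot = np.sin(w * t), w * np.cos(w * t)
psi = np.array([assumed_perturbation(np.array([0.0, p]), np.array([0.0, v]))
                for p, v in zip(q2, q2dot)])
print(psi.shape)   # one perturbation torque vector per time sample
```

With the real robot, C(q, q̇)·q̇ and G(q) would instead come from `velocityProduct` and `gravityTorque` on a `rigidBodyTree` imported from the Simscape model.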

The assumed and estimated perturbations are presented in Figure 9, below. The estimated perturbation closely aligns with the assumed perturbation. With the desired trajectory being a sine wave, the velocity is continuously changing, leading to some perturbation estimation error, which is expected because the motor dynamics are not factored into the perturbation calculation. However, this error can be compensated by the SMC in Equation (16), further validating the theory in Equation (29) that, with ESO, the system dynamics are primarily influenced by the perturbation estimation error. From a magnitude perspective, it is evident that the perturbation estimation error is considerably smaller than the actual perturbation, enabling the system to achieve robust performance. Furthermore, when the robot comes to a stop, the estimated perturbation converges to match the assumed perturbation, confirming the accurate working of the ESO.

Conclusions
In this study, an approach to control and stabilize multibody robotic systems with inherent dynamics and uncertainties is presented. The approach leverages an extended state observer integrated with sliding mode control to estimate and compensate for the perturbations acting on the system, with a DDPG agent fine-tuning the controller parameters.

Robotics 2023, 12

Figure 8. Actual and estimated states of the system.