Deep Deterministic Policy Gradient-Based Parameter Adaptation for Synchronous Sliding-Mode Control with Time-Delay Estimation in Dual-Arm Robot Manipulators Under System Uncertainties

Tran, Duc Thien; Nguyen, Thanh Nha; Huynh, Thi Kim Tram; Ahn, Kyoung Kwan

doi:10.3390/app16042042


¹ Faculty of Electrical and Electronics Engineering, Ho Chi Minh City University of Technology and Education, Ho Chi Minh City 71307, Vietnam

² School of Mechanical Engineering, University of Ulsan, Ulsan 44610, Republic of Korea

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(4), 2042; https://doi.org/10.3390/app16042042

Submission received: 21 December 2025 / Revised: 15 February 2026 / Accepted: 17 February 2026 / Published: 19 February 2026

(This article belongs to the Special Issue Advanced Robotics, Mechatronics, and Automation)


Abstract

This paper presents a synchronous sliding-mode control with time-delay estimation (SSMC-TDE)-based adaptive control framework for coordinated motion control of dual-arm robotic manipulators operating under system uncertainties. The baseline SSMC-TDE scheme is constructed using synchronization and cross-coupling errors to ensure precise coordinated motion among robot joints, while sliding-mode control effectively handles strong nonlinearities, and the time-delay estimation technique approximates lumped uncertainties arising from external disturbances, modeling errors, and payload variations. The stability of the closed-loop system is rigorously analyzed and guaranteed using the Lyapunov theory. To overcome performance degradation caused by manually tuned control gains, a deep reinforcement learning-assisted parameter adaptation mechanism is integrated into the SSMC-TDE structure. Specifically, a Deep Deterministic Policy Gradient (DDPG) algorithm is employed to adapt selected control gains online through a reward function designed to simultaneously enhance motion synchronization and reduce trajectory-tracking errors, while preserving the stability properties of the underlying controller. Simulation studies are conducted within a co-simulation framework integrating MATLAB/Simulink and ROS/Gazebo for a dual-arm robotic platform. Quantitative evaluations based on the root mean square error (RMSE) of trajectory-tracking and synchronization errors across all six joints demonstrate that, averaged over both scenarios, the proposed DDPG-assisted SSMC-TDE achieves an overall RMSE reduction of 35.52% and 99.3% compared with conventional SSMC and SSMC-TDE controllers, respectively, confirming its superior performance and robustness under system uncertainties.

Keywords: synchronous sliding-mode control; sliding-mode control; time-delay estimation; deep reinforcement learning; Deep Deterministic Policy Gradient; dual-arm robot; lumped uncertainty

1. Introduction

Dual-arm robot manipulators have become an indispensable technology in advanced robotic systems, enabling collaborative operations such as cooperative transportation, precision assembly, and the dexterous manipulation of heavy, deformable, or fragile objects in industrial, service-oriented, and safety-critical environments [1,2]. Their ability to emulate human bimanual coordination allows two manipulators to jointly regulate motion, force, and interaction dynamics, thereby achieving superior dexterity and robustness compared with single-arm systems [3,4]. Despite these advantages, achieving high-performance coordinated control remains challenging due to the nonlinear and strongly coupled nature of dual-arm dynamics, where Coriolis and centrifugal effects, gravitational forces, and unmodeled interaction phenomena introduce significant uncertainties [1,5]. Moreover, practical disturbances such as payload variations, joint friction, sensor noise, and external perturbations further degrade synchronization accuracy and tracking precision [6].

To mitigate these difficulties, a broad spectrum of synchronous coordination strategies has been investigated. Impedance and admittance controllers offer compliant interaction but often suffer from reduced accuracy under fast or highly nonlinear motions [7,8,9]. Adaptive and passivity-based approaches enhance robustness to parametric uncertainties but typically depend on restrictive excitation conditions or conservative energy constraints that limit operational flexibility [9,10]. In contrast, sliding-mode control has proven to be an effective nonlinear control strategy and has been successfully implemented in various practical engineering systems, such as grid-connected power electronic converters and motor drive applications, demonstrating strong robustness against uncertainties and external disturbances [11]. Nevertheless, sliding-mode-based synchronous control provides strong disturbance rejection but remains sensitive to fixed design parameters, often resulting in pronounced chattering or degraded transient behavior when operating across varying environmental and interaction conditions [12]. Consequently, ensuring fast convergence, high synchronization fidelity, and strong robustness in dual-arm systems remains an open challenge.

Synchronous sliding-mode control (SSMC) has been widely adopted because it directly incorporates cross-coupling error structures that enforce coordinated behavior between the two manipulators, leading to asymptotic convergence of both trajectory-tracking and synchronization errors [13]. When combined with time-delay estimation (TDE), the SSMC-TDE framework reconstructs lumped uncertainties from delayed measurements, reducing reliance on accurate dynamic modeling and increasing resilience to unknown disturbances and unmodeled dynamics [14,15]. Lyapunov-based analysis has rigorously established that the incorporation of TDE preserves closed-loop stability under bounded approximation errors, ensuring robust synchronization even in the presence of substantial uncertainties [16,17]. Nevertheless, the overall performance of SSMC-TDE depends strongly on the designer-selected sliding structure, which governs convergence characteristics, error coupling behavior, and chattering attenuation. Inadequate selection of this structure may impair tracking quality, weaken synchronization, or cause excessive actuator effort, issues that are further exacerbated under time-varying disturbances, friction changes, and unpredictable interaction dynamics [18,19]. Manual parameterization, while often sufficient in simulation studies, lacks adaptability and cannot guarantee consistent performance across diverse real-world scenarios [20,21]. These limitations highlight the need for an autonomous online adaptation mechanism capable of continuously refining controller behavior in response to evolving system conditions.

In recent years, deep reinforcement learning (DRL) has demonstrated strong potential for enhancing adaptability in robotic control. Unlike manually tuned controllers, DRL algorithms such as DDPG can autonomously adjust control parameters through continuous interaction with the environment. DRL has been successfully applied to sliding-mode-based quadrotor control [22], robot manipulation and adaptive interaction learning [23], optimization of PID-type controllers under nonlinear disturbances [24], and adaptive sliding-mode gain tuning for multibody robotic systems [25]. These works consistently show that DRL improves robustness, tracking accuracy, and adaptability beyond what conventional manual parameterization can achieve.

Recently, fixed-time and predefined-time control strategies, including adaptive neural network-based fixed-time control [26,27] and prescribed performance-based reliable control [28,29], have also been proposed to guarantee convergence within a fixed or predefined time bound that is independent of initial conditions. These methods ensure that tracking errors evolve within prescribed performance limits, even in the presence of uncertainties, disturbances, and actuator constraints. However, such approaches typically rely on predefined convergence-time specifications, prescribed performance functions, and relatively complex controller architectures involving neural networks, terminal sliding modes, or barrier Lyapunov functions. As a result, their flexibility may be limited when system dynamics, interaction conditions, or task requirements vary significantly.

In parallel, advanced disturbance rejection and learning-based control strategies have also been investigated for uncertain nonlinear robotic systems [30,31,32,33]. State-filtered disturbance rejection control methods achieve effective disturbance attenuation by explicitly estimating and compensating disturbances through filter-based observers; however, their performance is closely tied to filter design and measurement quality, and the resulting controller complexity increases with system dimensionality. Multilayer neuroadaptive reinforcement learning approaches based on actor–critic architectures embed neural networks directly into the control loop to approximate unknown dynamics or control policies [34,35,36], which may complicate stability analysis and require additional excitation or boundedness assumptions.

In contrast to the above methods, this study adopts a stability-preserving, learning-assisted control framework in which SSMC-TDE serves as a robust baseline controller, while DRL is employed solely as an auxiliary mechanism for online parameter adaptation [37,38]. This separation allows Lyapunov-based stability guarantees to be retained while leveraging the learning capability of DRL to enhance synchronization performance, transient response, and control smoothness under complex uncertainties, without introducing additional disturbance observers or neural approximators into the control law.

According to the above discussion, this paper proposes a DRL-enhanced control framework that integrates the DDPG algorithm into a stability-guaranteed SSMC-TDE architecture. Leveraging an actor–critic structure with target networks, the DDPG agent autonomously adjusts critical controller parameters in real time, enabling the system to maintain high synchronization accuracy, improve tracking precision, and reduce chattering, all without requiring prior knowledge of the robot dynamics. A task-specific reward function is formulated to jointly promote robust coordination, transient performance, and smooth interaction. The main contributions of this work are threefold:

(1): A DRL-assisted parameter adaptation framework integrated with a baseline SSMC–TDE controller, enabling online performance-oriented gain adjustment for synchronous dual-arm robotic manipulation under significant uncertainties;
(2): Lyapunov-based stability analysis demonstrating that the integration of the DRL adaptation layer preserves the stability properties of the underlying SSMC-TDE control structure;
(3): Comprehensive co-simulation studies validating improved robustness, convergence performance, and synchronization accuracy compared with fixed-parameter baseline controllers.

The remainder of the paper is organized as follows: Section 2 formulates the problem and describes the dual-arm system dynamics; Section 3 details the SSMC-TDE controller and its stability; Section 4 presents the DRL-based adaptation using DDPG; Section 5 reports simulation results; and Section 6 concludes the paper.

2. Dynamic Modeling and Problem Formulation

2.1. Dual-Arm Robot Dynamics

The dynamic equations of the $i$-th arm $(i = 1, 2)$ of the dual-arm robot (see Figure 1) in the joint space are expressed as follows [1,5,39]:

M_{i} (q_{i}) {\ddot{q}}_{i} + C_{i} (q_{i}, {\dot{q}}_{i}) {\dot{q}}_{i} + g_{i} (q_{i}) + d_{i} = τ_{i}

(1)

where $M_{i}(q_{i}) \in ℝ^{3 \times 3}$ is the positive definite symmetric inertia matrix of the $i$-th arm; $C_{i}(q_{i}, {\dot{q}}_{i}) \in ℝ^{3 \times 3}$ is the Coriolis and centrifugal force matrix; $g_{i}(q_{i}) \in ℝ^{3 \times 1}$ denotes the gravitational vector acting on the $i$-th arm; $d_{i} \in ℝ^{3 \times 1}$ is the vector representing uncertainties, including parameter variations and friction; and $τ_{i} = {[τ_{1 i}, τ_{2 i}, τ_{3 i}]}^{T}$ is the vector of joint torques generated at the joints of the $i$-th arm. The vectors $q_{i}$, ${\dot{q}}_{i}$, and ${\ddot{q}}_{i}$ denote, respectively, the joint positions, velocities, and accelerations of the joints of the $i$-th arm, with ${(\cdot)}_{i} = {[{(\cdot)}_{1 i}, {(\cdot)}_{2 i}, {(\cdot)}_{3 i}]}^{T}$. From Equation (1), the dynamic model of the dual-arm system can be rewritten as:

M (q) \ddot{q} + C (q, \dot{q}) \dot{q} + g (q) + d = τ

(2)

where $M(q) = \mathrm{blkdiag}(M_{1}(q_{1}), M_{2}(q_{2})) \in ℝ^{6 \times 6}$ is the combined inertia matrix of the two arms; similarly, $C(q, \dot{q}) = \mathrm{blkdiag}(C_{1}(q_{1}, {\dot{q}}_{1}), C_{2}(q_{2}, {\dot{q}}_{2})) \in ℝ^{6 \times 6}$; $g(q) = {[g_{1}{(q_{1})}^{T}, g_{2}{(q_{2})}^{T}]}^{T} \in ℝ^{6 \times 1}$; $d = {[d_{1}^{T}, d_{2}^{T}]}^{T} \in ℝ^{6 \times 1}$; $τ = {[τ_{1}^{T}, τ_{2}^{T}]}^{T} \in ℝ^{6 \times 1}$; $q = {[q_{1}^{T}, q_{2}^{T}]}^{T} = {[q_{11}, q_{21}, q_{31}, q_{12}, q_{22}, q_{32}]}^{T}$; $\dot{q} = {[{\dot{q}}_{1}^{T}, {\dot{q}}_{2}^{T}]}^{T} = {[{\dot{q}}_{11}, {\dot{q}}_{21}, {\dot{q}}_{31}, {\dot{q}}_{12}, {\dot{q}}_{22}, {\dot{q}}_{32}]}^{T}$; and $\ddot{q} = {[{\ddot{q}}_{1}^{T}, {\ddot{q}}_{2}^{T}]}^{T} = {[{\ddot{q}}_{11}, {\ddot{q}}_{21}, {\ddot{q}}_{31}, {\ddot{q}}_{12}, {\ddot{q}}_{22}, {\ddot{q}}_{32}]}^{T}$.

Property 1: $M(q)$ is a positive definite matrix and is invertible for all $q$.

Property 2: $\dot{M}(q) - 2 C(q, \dot{q})$ is a skew-symmetric matrix that satisfies $x^{T} (\dot{M}(q) - 2 C(q, \dot{q})) x = 0$ for any $x \in ℝ^{6 \times 1}$.

The dynamic Equation (2) is derived based on the Euler–Lagrange method. However, owing to discrepancies between the mathematical model and the actual system, it can be expressed as follows:

M_{0} (q) \ddot{q} + C_{0} (q, \dot{q}) \dot{q} + g_{0} (q) + Δ M (q) \ddot{q} + Δ C (q, \dot{q}) \dot{q} + Δ g (q) + d = τ

(3)

where $M_{0}(q)$, $C_{0}(q, \dot{q})$, and $g_{0}(q)$ represent the nominal dynamic components, which are derived from the Euler–Lagrange formulation based on known robot parameters, and $Δ M(q)$, $Δ C(q, \dot{q})$, and $Δ g(q)$ represent the unmodeled dynamic components.

2.2. Synchronous Coordination Error

Consider a dual-arm robotic system performing a cooperative manipulation task involving a heavy object. Let the desired joint trajectories of the two arms be $x_{d} = {[q_{1 d}^{T}, q_{2 d}^{T}]}^{T} \in ℝ^{6 \times 1}$, and let $x = {[q_{1}^{T}, q_{2}^{T}]}^{T} \in ℝ^{6 \times 1}$ be the vector of actual joint trajectories measured from the robot’s joints. The joint tracking error vector $e = {[e_{1}^{T}, e_{2}^{T}]}^{T} \in ℝ^{6 \times 1}$ is defined as:

e = x - x_{d}

(4)

To determine the capability for synchronous motion between the two arms, the synchronization error vector $e_{s} = {[e_{s_{1}}^{T}, e_{s_{2}}^{T}]}^{T} \in ℝ^{6 \times 1}$ is defined as:

e_{s} = T e

(5)

where $T = \begin{bmatrix} I_{3 \times 3} & - I_{3 \times 3} \\ - I_{3 \times 3} & I_{3 \times 3} \end{bmatrix} \in ℝ^{6 \times 6}$ is the synchronization matrix, and $I_{3 \times 3} \in ℝ^{3 \times 3}$ denotes the identity matrix. The cross-coupling error vector $e_{c} \in ℝ^{6 \times 1}$ is defined from the tracking error $e$ and the synchronization error $e_{s}$:

e_{c} = e + α e_{s} = (I + α T) e

(6)

where $α = \mathrm{diag}(α_{1}, \dots, α_{6}) \in ℝ^{6 \times 6}$ is a positive definite diagonal matrix. When $α$ is chosen appropriately $(α_{i} > 0)$, the matrix $(I + α T)$ becomes positive definite and full rank, i.e., $\det(I + α T) \neq 0$. In this case, if the cross-coupling error $e_{c} \to 0$, then the tracking error $e \to 0$ according to (6). By (5), the synchronization error $e_{s} \to 0$ as well, meaning that $e_{1} = e_{2}$, thereby achieving synchronous motion between the two arms. In summary, the control objective of synchronous motion between the two arms is achieved when the cross-coupling error $e_{c}$ asymptotically converges to 0.
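The error definitions in (4) to (6) can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `sync_errors` and the choice to pass the diagonal of $α$ as a plain list are assumptions made here, not part of the original work.

```python
def sync_errors(x, x_d, alpha):
    """Return (e, e_s, e_c) for a dual-arm robot with three joints per arm.

    x, x_d : length-6 lists ordered as [arm-1 joints, arm-2 joints]
    alpha  : length-6 list holding the diagonal of the matrix alpha (assumed layout)
    """
    n = 3  # joints per arm
    # Eq. (4): tracking error
    e = [xi - xdi for xi, xdi in zip(x, x_d)]
    e1, e2 = e[:n], e[n:]
    # Eq. (5) with T = [I, -I; -I, I], so e_s = [e1 - e2, e2 - e1]
    e_s = [a - b for a, b in zip(e1, e2)] + [b - a for a, b in zip(e1, e2)]
    # Eq. (6): cross-coupling error e_c = e + alpha * e_s (alpha is diagonal)
    e_c = [ei + ai * esi for ei, ai, esi in zip(e, alpha, e_s)]
    return e, e_s, e_c
```

When both arms have identical tracking errors, `e_s` is zero and `e_c` reduces to `e`, which matches the synchronization interpretation given above.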

2.3. Root Mean Square Error

To evaluate the quality of the tracking errors as well as the effectiveness of the trained agent, the RMSE [40] criterion is employed, which is defined as follows:

RMSE = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} {(S_{i} - O_{i})}^{2}}

(7)

where $n$ denotes the total number of samples, $S_{i}$ represents the $i$-th setpoint value, and $O_{i}$ corresponds to the system output (feedback) at the $i$-th sample.
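A minimal sketch of the RMSE criterion in (7) is given below. The function name `rmse` and the list-based interface are illustrative assumptions.

```python
import math


def rmse(setpoints, outputs):
    """Root mean square error between setpoints S_i and outputs O_i, Eq. (7)."""
    if len(setpoints) != len(outputs) or not setpoints:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(setpoints)
    return math.sqrt(sum((s - o) ** 2 for s, o in zip(setpoints, outputs)) / n)
```

For example, a constant unit error over all samples gives an RMSE of exactly 1, and a perfect match gives 0.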

3. Synchronous Sliding-Mode Control with Time-Delay Estimation

3.1. Control Design

Figure 2 depicts the structure of the SSMC-TDE for coordinated motion control of a dual-arm robot. The controller integrates synchronous (cross-coupling) control, which enforces coordination between the two manipulators, sliding-mode control to address strong nonlinearities, and TDE to compensate for lumped uncertainties caused by modeling errors, external disturbances, and payload variations. This integrated control framework ensures robust and precise coordinated performance for dual-arm robotic systems. The detailed design procedure of the SSMC-TDE controller is presented below.

Define a diagonal positive definite matrix $\bar{M} \in ℝ^{6 \times 6}$. The dynamic Equation (3) of the dual-arm robot can then be written as:

M_{0} (q) \ddot{q} + C_{0} (q, \dot{q}) \dot{q} + g_{0} (q) + \bar{M} \ddot{q} + [Δ M (q) - \bar{M}] \ddot{q} + Δ C (q, \dot{q}) \dot{q} + Δ g (q) + d = τ

(8)

Let $f(q, \dot{q}, \ddot{q})$ be the nominal dynamic components of the robot and $Δ(q, \dot{q}, \ddot{q})$ be the lumped uncertainty components of the model. Equation (8) is written concisely as follows:

\bar{M} \ddot{q} + f (q, \dot{q}, \ddot{q}) + Δ (q, \dot{q}, \ddot{q}) = τ

(9)

Assume that the signals $Δ(q, \dot{q}, \ddot{q})$ in (9) are continuously measured over a sampling interval of sufficiently small duration $L$ (typically the sensor sampling period). Thus, the actual value of $Δ(q, \dot{q}, \ddot{q})$ at time $t$ can be approximated by its value at time $t - L$:

Δ {(q, \dot{q}, \ddot{q})}_{(t)} ≅ Δ {(q, \dot{q}, \ddot{q})}_{(t - L)}

(10)

Equations (9) and (10), with $\hat{Δ}(q, \dot{q}, \ddot{q})$ denoting the time-delay estimation (TDE) value, yield:

\hat{Δ} {(q, \dot{q}, \ddot{q})}_{(t)} ≅ Δ {(q, \dot{q}, \ddot{q})}_{(t - L)} = τ_{(t - L)} - \bar{M} {\ddot{q}}_{(t - L)} - f {(q, \dot{q}, \ddot{q})}_{(t - L)}

(11)

where ${\ddot{q}}_{(t - L)}$ is computed using the following finite-difference approximation:

{\ddot{q}}_{(t - L)} = \frac{q_{(t)} - 2 q_{(t - L)} + q_{(t - 2 L)}}{L^{2}}

(12)
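The estimate in (11), together with the finite-difference acceleration in (12), can be sketched per joint as follows. The function name `tde_estimate`, the argument names, and the representation of $\bar{M}$ by its diagonal are assumptions for illustration only; the nominal term $f$ at time $t - L$ is assumed to be supplied by the caller.

```python
def tde_estimate(tau_prev, q_now, q_prev, q_prev2, m_bar, f_prev, L):
    """Time-delay estimate of the lumped uncertainty for each joint.

    tau_prev : torques applied at time t - L
    q_now, q_prev, q_prev2 : joint positions at t, t - L, and t - 2L
    m_bar    : diagonal entries of the constant matrix M_bar (assumed diagonal)
    f_prev   : nominal dynamics term f evaluated at t - L
    L        : sampling interval
    """
    estimate = []
    for tau, q0, q1, q2, m, f in zip(tau_prev, q_now, q_prev, q_prev2, m_bar, f_prev):
        # Eq. (12): finite-difference acceleration at t - L
        q_ddot = (q0 - 2.0 * q1 + q2) / (L * L)
        # Eq. (11): delayed reconstruction of the lumped uncertainty
        estimate.append(tau - m * q_ddot - f)
    return estimate
```

As a quick check, for a joint following $q(t) = t^{2}$ the finite difference returns the exact acceleration of 2, so a constant uncertainty is recovered exactly.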

Assumption 1.

When the value of $L$ is sufficiently small, the TDE approximation error is bounded by a constant $ε$:

{‖Δ (q, \dot{q}, \ddot{q}) - \hat{Δ} (q, \dot{q}, \ddot{q})‖}_{\infty} \leq ε

(13)

Based on (8) and (9), letting $x_{1} = q$, $x_{2} = \dot{q}$, and $x_{d} = q_{d}$, the dynamic model of the robot manipulator is expressed in state-space form as follows:

\begin{cases} {\dot{x}}_{1} = x_{2} \\ {\dot{x}}_{2} = {\tilde{M}}^{- 1} [u - C_{0} (x_{1}, x_{2}) x_{2} - g_{0} (x_{1}) - Δ (x_{1}, x_{2}, {\dot{x}}_{2})] \\ y = x_{1} \end{cases}

(14)

where $\tilde{M} = M_{0}(x_{1}) + \bar{M}$, $x_{1} = {[q_{i 1}^{T}, q_{i 2}^{T}]}^{T} \in ℝ^{6 \times 1}$ with $i = 1, 2, 3$, and $u = τ$ is the control input.

The sliding variable $s \in ℝ^{6 \times 1}$ is chosen based on the cross-coupling error $e_{c}$ as:

\begin{aligned} s &= {\dot{e}}_{c} + λ e_{c} \\ &= (I + α T) (\dot{e} + λ e) \end{aligned}

(15)

where $λ = \mathrm{diag}([λ_{1}, \dots, λ_{6}]) \in ℝ^{6 \times 6}$ is a positive definite diagonal matrix. The derivative of the sliding variable (15) is computed as:

\begin{aligned} \dot{s} &= {\ddot{e}}_{c} + λ {\dot{e}}_{c} \\ &= (I + α T) ({\dot{x}}_{2} - {\ddot{x}}_{d} + λ \dot{e}) \end{aligned}

(16)

Substituting ${\dot{x}}_{2}$ from (14) into (16) results in:

\dot{s} = (I + α T) [{\tilde{M}}^{- 1} [u - C_{0} (x_{1}, x_{2}) x_{2} - g_{0} (x_{1}) - Δ (x_{1}, x_{2}, {\dot{x}}_{2})] - {\ddot{x}}_{d} + λ \dot{e}]

(17)

The control law of the synchronous sliding-mode controller is designed as:

u_{SSMC} = u_{eq} + u_{r}

(18)

By choosing $\dot{s} = - (I + α T) {\tilde{M}}^{- 1} K s$ when the system uncertainties are neglected ($Δ \in ℝ^{6 \times 1}$ is assumed to be zero), the equivalent control input $u_{eq}$ is determined as:

u_{eq} = \tilde{M} ({\ddot{x}}_{d} - λ \dot{e}) + C_{0} (x_{1}, x_{2}) x_{2} + g_{0} (x_{1}) - K s

(19)

The robust control component $u_{r}$ is defined as [31,41,42]:

u_{r} = - η \, \mathrm{sgn}(s)

(20)

where $η \in ℝ^{6 \times 6}$ is a positive diagonal matrix satisfying $η \geq {‖d‖}_{\infty}$, which ensures the robustness of the system in the presence of disturbances.

Remark 1.

Since the sign function is inherently discontinuous, chattering naturally occurs when the sliding surface $s$ crosses zero, which can potentially damage the mechanical actuators in practice. To mitigate this issue in simulations, the sign function in Equation (20) has been replaced with a continuous saturation function [43,44], effectively smoothing the switching behavior and reducing high-frequency oscillations in the control input.

Combining (19) and (20), the control law in (18) becomes:

u_{SSMC} = \tilde{M} ({\ddot{x}}_{d} - λ \dot{e}) + C_{0} (x_{1}, x_{2}) x_{2} + g_{0} (x_{1}) - K s - η \, \mathrm{sgn}(s)

(21)

To estimate the uncertain components $Δ(x_{1}, x_{2}, {\dot{x}}_{2})$ acting on the system, the TDE method is employed as in (11). Therefore, the synchronous sliding-mode control law combined with TDE is expressed as:

u_{T} = u_{SSMC} + u_{TDE}

(22)

Based on (21) and (11), the SSMC-TDE control law in (22) is obtained as:

u_{T} = \tilde{M} ({\ddot{x}}_{d} - λ \dot{e}) + C_{0} (x_{1}, x_{2}) x_{2} + g_{0} (x_{1}) - K s - η \, \mathrm{sgn}(s) + \hat{Δ} (x_{1}, x_{2}, {\dot{x}}_{2})

(23)

Substituting (23) into (17) results in:

\dot{s} = {\tilde{M}}^{- 1} (I + α T) (\hat{Δ} - Δ - K s - η sgn (s))

(24)
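The control law in (23), with the sliding variable from (15), can be sketched as below. This is a hedged illustration rather than the authors' implementation: the function names, the list-of-lists matrix layout, the representation of $α$, $λ$, $K$, and $η$ by their diagonals, and the boundary-layer width `phi` of the saturation function (used in place of the sign function, as described in Remark 1) are all assumptions.

```python
def sat(x, phi):
    """Continuous saturation used instead of sgn(x); phi is an assumed boundary-layer width."""
    return max(-1.0, min(1.0, x / phi))


def matvec(A, v):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def ssmc_tde_control(M_tilde, C0, g0, x2, xdd_d, e, e_dot, T,
                     alpha, lam, K, eta, delta_hat, phi=0.05):
    """Return (u_T, s) following Eqs. (15) and (23).

    alpha, lam, K, eta are the diagonals of the corresponding gain matrices.
    """
    n = len(e)
    # Eq. (15): s = (I + alpha T)(e_dot + lam e)
    v = [ed + l * ei for ed, l, ei in zip(e_dot, lam, e)]
    Tv = matvec(T, v)
    s = [vi + a * tvi for vi, a, tvi in zip(v, alpha, Tv)]
    # Model-based terms: M_tilde (xdd_d - lam e_dot) + C0 x2 + g0
    ref = [xd - l * ed for xd, l, ed in zip(xdd_d, lam, e_dot)]
    inertial = matvec(M_tilde, ref)
    coriolis = matvec(C0, x2)
    # Eq. (23): feedback, smoothed switching, and TDE compensation terms
    u = [inertial[i] + coriolis[i] + g0[i]
         - K[i] * s[i] - eta[i] * sat(s[i], phi) + delta_hat[i]
         for i in range(n)]
    return u, s
```

The function is written for a general dimension so that it can be exercised on a small two-joint example; for the dual-arm robot all vectors have length 6 and `T` is the 6 × 6 synchronization matrix from (5).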

3.2. Stability Analysis

A Lyapunov function is selected as:

V = \frac{1}{2} s^{T} s

(25)

The time derivative of (25) is:

\dot{V} = s^{T} \dot{s}

(26)

Substituting (24) into (26) yields:

\dot{V} = s^{T} [{\tilde{M}}^{- 1} (I + α T) (\hat{Δ} - Δ - K s - η \, \mathrm{sgn}(s))]

(27)

From Assumption 1, it follows that:

\dot{V} \leq - s^{T} [{\tilde{M}}^{- 1} (I + α T) (K s + η \, \mathrm{sgn}(s) - ε)]

(28)

Remark 2.

Since the matrix $\tilde{M} = M_{0}(x_{1}) + \bar{M} > 0$ and the synchronization coefficient matrix $α$ is chosen to be strictly positive, the matrix $(I + α T) > 0$. In addition, the gain matrix $K$ is positive definite, implying that $- s^{T} [{\tilde{M}}^{- 1} (I + α T) K s] \leq 0, \forall s$. To ensure that (28) is strictly negative, the following condition must be satisfied:

η > |ε|

(29)

If the time-delay estimation error $ε$ is bounded within a finite region, the stability condition (29) guarantees that the time derivative of the Lyapunov function remains negative semi-definite. Consequently, the closed-loop system preserves Lyapunov stability, and the tracking errors are ensured to remain uniformly bounded within a neighborhood of the origin.
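To make the role of condition (29) more concrete, the bound can be worked out in a simplified special case. The sketch below assumes, purely for illustration, that $P = {\tilde{M}}^{-1}(I + α T)$ is diagonal with positive entries $p_{i}$ and that $K$ and $η$ are diagonal with entries $k_{i}$ and $η_{i}$; this decoupling is an assumption made here and is not claimed in the original analysis. It uses $s_{i}\,\mathrm{sgn}(s_{i}) = |s_{i}|$ and the bound $|\hat{Δ}_{i} - Δ_{i}| \leq ε$ from (13).

```latex
% Illustrative special case only: P = diag(p_1, ..., p_6) with p_i > 0 (assumption).
\begin{aligned}
\dot{V} &= \sum_{i=1}^{6} p_i \left[ -k_i s_i^{2} - \eta_i |s_i| + s_i (\hat{\Delta}_i - \Delta_i) \right] \\
        &\leq \sum_{i=1}^{6} p_i \left[ -k_i s_i^{2} - \eta_i |s_i| + \varepsilon |s_i| \right] \\
        &= -\sum_{i=1}^{6} p_i \left[ k_i s_i^{2} + (\eta_i - \varepsilon) |s_i| \right] \leq 0
        \quad \text{if } \eta_i > \varepsilon .
\end{aligned}
```

In this simplified setting, each switching gain only needs to dominate the TDE approximation bound rather than the full lumped uncertainty.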

4. DRL-Based Online Parameter Adaptation (DDPG)

4.1. Online Parameter Adaptation with SSMC-TDE

4.1.1. State, Action, and Reward Design

The DDPG algorithm [45] is employed to achieve online adaptation of the control gains in the SSMC–TDE controller for the dual-arm robot. As an actor–critic, off-policy algorithm, DDPG interacts with the system during training, using the robot states and tracking errors as observations, and control gains as continuous actions. The training process is guided by a reward function designed to penalize tracking errors, synchronization errors, and excessive control effort. The overall framework is illustrated in Figure 3. Due to its off-policy nature, DDPG enables policy updates using previously collected samples, allowing the training process to be conducted offline without requiring an explicit model of robot dynamics.

To formulate the reinforcement learning problem, the interaction between the DDPG agent and the SSMC–TDE-controlled dual-arm robot is defined in terms of the state, action, and reward. The state vector is constructed to capture both tracking and synchronization performance and is defined as:

s_{o} = {[e, e_{s}]}^{T} \in R^{18 \times 1}

(30)

The action generated by the DDPG agent corresponds to the continuous control gain parameters of the SSMC–TDE controller and is expressed as:

a = {[λ_{n}, η_{n}, K_{n}, {\bar{M}}_{n}, α_{n}]}^{T} \in R^{15 \times 1}

(31)

where $λ_{n}$, $η_{n}$, and $K_{n}$ are updated to adjust the corresponding sliding-mode control gains, while $α_{n}$ is used to update the synchronization gain $α$ and ${\bar{M}}_{n}$ is employed to update the gain $\bar{M}$ of the TDE module.

To achieve accurate trajectory tracking, improved synchronization, and controlled input magnitude in the dual-arm system, the reward function is designed to penalize tracking errors, synchronization errors, and control effort:

r = \sum_{i = 1}^{6} (- w_{e} |e (i)| - w_{s} |e_{s} (i)| - w_{u} |u (i)|)

(32)

where $e(i)$ and $e_{s}(i)$ denote the trajectory-tracking error and synchronization error, respectively, and $u(i)$ represents the control input magnitude. The weighting coefficients $w_{e}$, $w_{s}$, and $w_{u} \in ℝ^{6 \times 6}$ are selected to balance tracking accuracy, synchronization performance, control effort, and smoothness.
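A minimal sketch of the reward in (32) follows. For simplicity it treats the weights as scalars shared by all joints; that simplification, the function name `reward`, and the default weight values are assumptions for illustration and are not taken from the original work.

```python
def reward(e, e_s, u, w_e=1.0, w_s=1.0, w_u=0.01):
    """Negative weighted sum of absolute tracking error, sync error, and control effort.

    e, e_s, u : length-6 lists (one entry per joint)
    w_e, w_s, w_u : scalar weights (hypothetical values)
    """
    return -sum(w_e * abs(ei) + w_s * abs(si) + w_u * abs(ui)
                for ei, si, ui in zip(e, e_s, u))
```

The reward is never positive and reaches zero only when all errors and control inputs vanish, so maximizing it drives the agent toward small errors with moderate control effort.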

The continuous actions generated by the DDPG agent are normalized and constrained within a compact interval to ensure stability-preserving adaptation (see the update control gains block in Figure 3). Specifically, each action component is first saturated within $[- 1, 1]$, and the corresponding control gains $k_{i}(t)$ are updated using a bounded multiplicative adaptation law based on predefined nominal values:

k_{i} (t) = k_{i 0} (1 + β a_{i} (t))

(33)

where $k_{i 0}$ denotes the initial stabilizing gain obtained from the baseline SSMC–TDE controller, $β \in (0, 1)$ is a scaling factor, and $a_{i}(t)$ is the normalized output of the actor network. As a result, the adapted gains satisfy the following bounds:

(1 - β) k_{i 0} \leq k_{i} (t) \leq (1 + β) k_{i 0}

(34)

which guarantees that all control parameters remain strictly positive and bounded at all times.
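The bounded adaptation law in (33) and the resulting bounds in (34) can be sketched as follows. The function name `adapt_gain` and the numerical values in the usage note are hypothetical.

```python
def adapt_gain(k0, a, beta):
    """Bounded multiplicative gain update, Eq. (33).

    k0   : nominal (baseline) gain, assumed positive
    a    : raw actor output for this gain
    beta : scaling factor in (0, 1)
    """
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie in (0, 1)")
    a_sat = max(-1.0, min(1.0, a))  # saturate the action to [-1, 1]
    return k0 * (1.0 + beta * a_sat)
```

For instance, with a nominal gain of 10 and `beta = 0.5`, any actor output yields a gain between 5 and 15, so the gain stays positive regardless of what the policy produces.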

After the training phase, only the trained actor network is retained and deployed online as an RL-based gain tuning module, while the critic network is discarded. As shown in Figure 4, the trained actor network adaptively adjusts the SSMC-TDE control gains in real time based on the current system states, thereby enhancing robustness and adaptability during execution.

4.1.2. DDPG Training Setup in MATLAB/Simulink

The DDPG algorithm is an off-policy actor–critic method. Accordingly, the tuning of the SSMC–TDE controller parameters using DDPG is conducted offline in the MATLAB/Simulink environment, as illustrated in Figure 5. The RL Agent block is implemented using the Reinforcement Learning Toolbox, where the training environment is configured with the hyperparameters summarized in Table 1 and the actor and critic networks are initialized with the architectures shown in Figure 6 and Figure 7, respectively. Based on the observed system states, joint tracking errors of the dual-arm robot, and the accumulated reward following each action, the agent generates continuous actions to update the control gain parameters of the SSMC-TDE controller.

The training process is performed on a workstation equipped with an Intel Core i7-10750H CPU (2.6 GHz), 16 GB RAM, and an NVIDIA GeForce GTX 1650 Ti GPU. Each training episode is simulated for 20 s with a sampling period of 0.01 s, using a fixed-step solver configuration and the ode4 (Runge–Kutta) integration method. To enable the agent to learn appropriate responses to external disturbances, a circular reference trajectory is defined within the workspace of the dual-arm robot as follows:

X_{m}^{d} (t) = \begin{bmatrix} x_{m}^{d} (t) \\ y_{m}^{d} (t) \\ γ_{m}^{d} (t) \end{bmatrix} = \begin{bmatrix} 0.08 \cos (0.2 π t + \frac{π}{2}) \\ 0.08 \sin (0.2 π t + \frac{π}{2}) + 0.2 \\ \frac{π}{9} \sin (0.2 π t) \end{bmatrix}

(35)
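The reference trajectory in (35) can be evaluated with a short helper. The function name `reference` is an assumption; the numerical constants are those of (35).

```python
import math


def reference(t):
    """Desired pose (x, y, gamma) at time t according to Eq. (35)."""
    x = 0.08 * math.cos(0.2 * math.pi * t + math.pi / 2)
    y = 0.08 * math.sin(0.2 * math.pi * t + math.pi / 2) + 0.2
    gamma = math.pi / 9 * math.sin(0.2 * math.pi * t)
    return x, y, gamma
```

The position traces a circle of radius 0.08 centered at (0, 0.2) with a period of 10 s, so each 20 s training episode covers two full revolutions.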

To enhance the robustness of the learned policy, different disturbance scenarios are randomly applied during the training process. Specifically, an external force of 20 N is applied along the y-direction to joint 2 of Arm 1 from 10 s to 10.5 s, while a force of −20 N is simultaneously applied to joint 2 of Arm 2. Alternatively, a varying payload disturbance is introduced, where a payload of 1 kg is applied to joint 2 of both robot arms during the first 10 s and increased to 3 kg from 10 s to 20 s. In addition, model uncertainties are emulated through friction effects, including viscous friction and Coulomb friction, which are modeled as follows:

τ_{fric} (t) = F_{v} \dot{q} + F_{c} sign (\dot{q})

(36)

Different friction parameters are employed depending on the disturbance type. When external forces are applied, the viscous and Coulomb friction vectors are set to $F_{v} = \mathrm{diag}([0.05, 0.05, 0.2, 0.05, 0.05, 0.2])$ and $F_{c} = \mathrm{diag}([0.3, 0.2, 0.1, 0.3, 0.2, 0.1])$, respectively. When payload variations are introduced, the corresponding friction coefficients are assigned as $F_{v} = \mathrm{diag}([0.05, 0.05, 0.0 . 5, 0.05, 0.05, 0.05])$ and $F_{c} = \mathrm{diag}([0.25, 0.15, 0.05, 0.25, 0.15, 0.05])$.
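The friction model in (36) can be sketched per joint as follows. The function name `friction_torque` and the list-based representation of the diagonal matrices are assumptions; the usage note relies only on the coefficients given for the external-force scenario.

```python
def friction_torque(q_dot, f_v, f_c):
    """Viscous plus Coulomb friction torque per joint, Eq. (36).

    q_dot : joint velocities
    f_v, f_c : diagonals of the viscous and Coulomb friction matrices
    """
    def sign(v):
        # Returns 1, -1, or 0 for positive, negative, or zero velocity
        return (v > 0) - (v < 0)

    return [fv * qd + fc * sign(qd) for qd, fv, fc in zip(q_dot, f_v, f_c)]
```

With the external-force coefficients, a unit velocity on the first joint gives a friction torque of 0.05 + 0.3 = 0.35, while a joint at rest contributes no friction torque in this model.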

4.2. DDPG Framework

In this subsection, the DDPG algorithm adopted for training the proposed RL-based gain tuning module is described in detail (see Figure 8). The formulation of the actor–critic architecture, the learning process, and the overall training procedure are presented to clarify how the control gain parameters of the SSMC–TDE controller are optimized.

DDPG is an off-policy actor–critic reinforcement learning algorithm designed for continuous action spaces. It employs two neural networks: an actor network $μ(δ | ϕ^{μ})$, which represents a deterministic policy mapping the system states $δ$ to a continuous action, and a critic network $Q(δ, a | ϕ^{Q})$, which approximates the action–value function. In this study, the system state is constructed to reflect the tracking and coordination performance of the dual-arm robot, while the action corresponds to the continuous control gain parameters of the SSMC–TDE controller. During training, the critic network is updated by minimizing the temporal-difference (TD) error between the predicted Q-value and the target value. For a sampled transition $(δ_{i}, a_{i}, r_{i}, δ_{i + 1})$ from the replay buffer, the target value is computed as:

y_{i} = r_{i} + γ Q^{'} (δ_{i + 1}, μ^{'} (δ_{i + 1} | ϕ^{μ'}) | ϕ^{Q'})

(37)

where $Q^{'}$ and $μ^{'}$ denote the target critic and actor networks, respectively, and $γ \in (0, 1)$ is a discount factor. The critic parameters $ϕ^{Q}$ are updated by minimizing the loss function:

L_{f} = \frac{1}{N} \sum_{i = 1}^{N} {(y_{i} - Q (δ_{i}, a_{i} | ϕ^{Q}))}^{2}

(38)
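The target computation in (37) and the mean-squared loss in (38) can be illustrated on plain lists of numbers. In this sketch the target-network outputs are assumed to be precomputed and passed in as `next_q`; the function names are hypothetical.

```python
def td_targets(rewards, next_q, gamma):
    """Eq. (37): y_i = r_i + gamma * Q'(next state, mu'(next state)).

    next_q holds the target-critic values for the next states (assumed precomputed).
    """
    return [r + gamma * q for r, q in zip(rewards, next_q)]


def critic_loss(targets, q_values):
    """Eq. (38): mean squared error between targets and critic predictions."""
    n = len(targets)
    return sum((y - q) ** 2 for y, q in zip(targets, q_values)) / n
```

In an actual training setup these quantities would be computed on mini-batches by a deep learning framework, which then backpropagates the loss through the critic network.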

The actor network is trained to maximize the expected return by following the deterministic policy gradient. The gradient of the performance objective with respect to the actor parameters is approximated as:

\nabla_{ϕ^{μ}} J \approx \frac{1}{N} \sum_{i = 1}^{N} \nabla_{a} Q (δ, a | ϕ^{Q}) |_{δ = δ_{i}, a = μ (δ_{i})} \nabla_{ϕ^{μ}} μ (δ | ϕ^{μ}) |_{δ = δ_{i}}

(39)

Through this update mechanism, the actor learns to generate control gain parameters that improve the overall tracking and synchronization performance of the dual-arm robot. To enhance training stability and data efficiency, an experience replay buffer is employed to store previously observed transitions. Mini-batches are randomly sampled from the replay buffer during training, which helps break temporal correlations between consecutive samples. In addition, target networks are introduced for both the actor and the critic to stabilize learning. The target network parameters are updated using a soft update scheme given by:

\begin{array}{l} ϕ^{Q'} \leftarrow ν ϕ^{Q} + (1 - ν) ϕ^{Q'} \\ ϕ^{μ'} \leftarrow ν ϕ^{μ} + (1 - ν) ϕ^{μ'} \end{array}

(40)

with

ν \in (0, 1)

as a smoothing factor.
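A minimal sketch of the soft update in Equation (40) is given below, assuming that each network is represented as a list of NumPy arrays. The function name and the toy parameter values are illustrative only; the default smoothing factor of 0.001 is the value listed in Table 1.

```python
import numpy as np

def soft_update(target_params, online_params, nu=0.001):
    """Eq. (40): target <- nu * online + (1 - nu) * target, applied per array."""
    return [nu * p + (1.0 - nu) * t for t, p in zip(target_params, online_params)]

# Toy parameters (assumed): the online networks are held fixed to show the lag.
online = [np.ones((2, 2)), np.full(3, 2.0)]
target = [np.zeros((2, 2)), np.zeros(3)]

for _ in range(1000):
    target = soft_update(target, online, nu=0.001)
```

With a smoothing factor of 0.001, the target parameters close roughly 63% of the gap to fixed online parameters after 1000 updates, which shows how slowly the targets move and why they stabilize the TD target.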

Exploration during training is achieved by adding a stochastic noise process

N

to the actor output, enabling sufficient exploration of the continuous action space. The complete training procedure of the DDPG algorithm is summarized in Algorithm 1, including network initialization, experience collection, parameter updates, and target network soft updates. After convergence, the trained actor network is retained and deployed online to adapt the SSMC-TDE control gains in real time.

Algorithm 1: Deep Deterministic Policy Gradient algorithm

Initialize the actor $μ$ and critic $Q$ networks along with their corresponding target networks $μ^{'}$ and $Q^{'}$, and create an experience replay buffer $R$.
for i = 1, episode_max do
  Initialize a random noise process $N$ for action exploration.
  Receive initial state $δ_{1}$.
  for j = 1, J do
    Select action $a_{j} = μ (δ_{j} | ϕ^{μ}) + N$ based on the current policy with added exploration noise.
    Execute action $a_{j}$ and observe reward $r_{j}$ and new state $δ_{j + 1}$.
    Store transition $(δ_{j}, a_{j}, r_{j}, δ_{j + 1})$ in $R$.
    Sample a random minibatch of $N_{b}$ transitions from $R$.
    Set $y_{j}$ according to Equation (37).
    Update the critic by minimizing the loss function $L_{f}$ given in (38).
    Update the actor policy using the sampled policy gradient defined in (39).
    Update the target networks as expressed in (40).
  end
end
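The data flow of Algorithm 1 can be sketched as follows. This is only a skeleton under stated assumptions: the plant is replaced by a hypothetical linear toy system instead of the Simulink/Gazebo co-simulation, the exploration noise is taken as Gaussian (the noise process is not specified in the text), and the network updates of Equations (37) to (40) are delegated to a user-supplied `update_fn`, which is a no-op here.

```python
import random
from collections import deque

import numpy as np

class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state) transitions."""

    def __init__(self, capacity=10**6):
        self.data = deque(maxlen=capacity)

    def store(self, transition):
        self.data.append(transition)

    def sample(self, batch_size):
        batch = random.sample(self.data, batch_size)
        states, actions, rewards, next_states = map(np.array, zip(*batch))
        return states, actions, rewards, next_states

    def __len__(self):
        return len(self.data)

def toy_env_step(state, action):
    """Hypothetical stand-in for the plant: a stable linear system."""
    next_state = 0.9 * state + 0.1 * action
    reward = -float(np.sum(next_state ** 2))
    return next_state, reward

def run_training(actor, update_fn, episodes=3, steps=50, batch_size=16,
                 noise_std=0.1, seed=0):
    rng = np.random.default_rng(seed)
    random.seed(seed)
    buffer = ReplayBuffer(capacity=10_000)
    n_updates = 0
    for _ in range(episodes):
        state = rng.normal(size=2)                    # receive initial state
        for _ in range(steps):
            noise = noise_std * rng.normal(size=2)    # assumed Gaussian exploration noise
            action = actor(state) + noise             # a_j = mu(delta_j) + N
            next_state, reward = toy_env_step(state, action)
            buffer.store((state, action, reward, next_state))
            if len(buffer) >= batch_size:
                update_fn(buffer.sample(batch_size))  # Eqs. (37)-(40) would go here
                n_updates += 1
            state = next_state
    return buffer, n_updates

buffer, n_updates = run_training(actor=lambda s: -0.5 * s,
                                 update_fn=lambda batch: None)
```

Updates are skipped until the buffer holds at least one full mini-batch; this warm-up rule is a common implementation choice and is an assumption here, since Algorithm 1 does not state one.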

5. Simulation and Co-Simulation Results

5.1. Simulation Environment

A co-simulation framework between MATLAB/Simulink and Gazebo is established, as illustrated in Figure 9, to support agent training and to validate the agent’s adaptability under various system uncertainties. The co-simulation environment consists of a Windows-based physical computer running MATLAB/Simulink with the Robotics System Toolbox and the Gazebo Co-Simulation package, together with a VMware Workstation virtual machine running Ubuntu with ROS and Gazebo installed. The overall procedure for constructing the co-simulation environment for the dual-arm robot is shown in Figure 10. The virtual machine performs the 3D simulation of the dual-arm robot in Gazebo, as depicted in Figure 11, where physical effects such as external forces, payload variations, and friction are emulated. Meanwhile, the physical machine is responsible for generating reference trajectories, executing the control algorithms, and collecting feedback data from Gazebo (see Figure 12).

Figure 13 illustrates the training performance of the agent using the DDPG algorithm within a co-simulation framework integrating MATLAB/Simulink and Gazebo. The results indicate that after approximately 300 training episodes, the average reward converges to a steady value of about −250. Based on this outcome, the trained actor network is extracted and employed as an RL-based gain tuner to adaptively adjust the parameters of the SSMC-TDE controller in response to environmental disturbances during coordinated motion tasks of the dual-arm robotic system. To validate the effectiveness of the proposed approach, two simulation scenarios are conducted using the trained agent to track predefined trajectories (see Table 2).
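The two disturbance profiles listed in Table 2 can be written as simple functions of time. The sketch below is only an illustration: the function names, the half-open time intervals at the switching instants, and the sign convention for the two arms are assumptions, since Table 2 gives only the magnitudes and time windows.

```python
def external_force(t, sign=1.0, magnitude=10.0, t_on=8.0, t_off=9.0):
    """Scenario 1: force in newtons, nonzero only while t_on <= t < t_off.

    sign = +1.0 or -1.0 selects the direction; which arm receives which sign
    is an assumption of this sketch.
    """
    return sign * magnitude if t_on <= t < t_off else 0.0

def payload_mass(t):
    """Scenario 2: payload in kilograms as a piecewise-constant function of time."""
    if t < 8.0:
        return 0.5
    if t < 15.0:
        return 1.0
    return 1.5
```

For example, `external_force(8.5)` and `external_force(8.5, sign=-1.0)` give the opposite 10 N forces acting on joints 21 and 22, and `payload_mass(16.0)` returns the final payload of 1.5 kg.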

During the simulation phase, the friction effects are modeled using the same parameters as those used during training. In addition, the reference trajectory, simulation duration, and sampling time are kept identical to the training setup to ensure a fair and consistent evaluation environment. The control gain parameters of the SSMC and SSMC-TDE controllers employed in the simulations are summarized in Table 3. Finally, all simulation results, including comparative evaluations between the proposed controller and the conventional SSMC and SSMC-TDE controllers with fixed control gains, are recorded and presented in the figures and tables that follow.

Remark 3.

In order to improve tracking accuracy, enhance synchronization among the joints, and counteract the effects of nonlinearities such as model uncertainties, external disturbances, and friction, the parameters of the SSMC controller are selected through a trial-and-error procedure subject to the Lyapunov stability conditions presented in Appendix A. The coefficient matrices

λ

and

K

are chosen in a manner analogous to the proportional and derivative gains of a conventional PD controller. The matrix

η

is gradually tuned from small to large values based on the resulting tracking performance and the oscillation amplitude of the control signal. Meanwhile, the synchronization gain matrix

α

is selected such that the synchronization error vector

e_{s}

converges to a neighborhood of zero. To ensure a fair comparison, the parameters of the SSMC-TDE controller are chosen to be identical to those of the SSMC controller. In addition, the diagonal elements of the matrix

\bar{M}

are selected to be sufficiently small compared to unity. This choice is motivated by the fact that

\bar{M}

directly affects the estimation capability of the TDE scheme for compensating nonlinear system dynamics in the control input. If

\bar{M}

is chosen too large, the system may exhibit oscillatory behavior and potentially become unstable. Finally, for the proposed RLSSMC-TDE approach, the controller parameters are adaptively tuned via reinforcement learning using the DDPG algorithm, with the initial coefficient matrices selected to be identical to those of the SSMC-TDE controller.

5.2. Simulation Results and Discussions

5.2.1. Scenario 1

Figure 14 illustrates the joint tracking errors of both arms of the dual-arm robot. The black, blue, and red dashed lines represent the responses of the conventional SSMC, the SSMC-TDE controller, and the proposed RLSSMC-TDE controller, respectively. As observed in the figure, compared with the conventional SSMC, both SSMC-TDE and the proposed RLSSMC-TDE achieve significantly smaller tracking errors, which can be attributed to the time-delay estimation mechanism that compensates for lumped uncertainties arising from unmodeled dynamics and external disturbances. This confirms the effectiveness of TDE in enhancing baseline tracking performance. More importantly, when external forces are applied from 8 s to 9 s to joint 2 of Arm 1 (see Figure 14b) and Arm 2 (see Figure 14e), the benefit of reinforcement learning-based gain adaptation becomes evident. Unlike the fixed-gain SSMC-TDE, the RLSSMC-TDE dynamically adjusts its control gains in response to sudden disturbance-induced error variations, resulting in faster error attenuation and reduced peak tracking deviations. This adaptive behavior explains why the proposed controller consistently achieves lower RMSE values, as reported in Table 4, despite the already strong baseline performance of SSMC-TDE. In addition, the synchronization performance is further illustrated in Figure 15, where opposite external forces are applied to the two arms. Although all controllers restore synchronization after the disturbance, the RLSSMC-TDE maintains smaller transient synchronization errors, indicating enhanced coordination robustness under asymmetric disturbances. The corresponding control inputs in Figure 16 reflect the inherent switching behavior of sliding-mode control, while the smoother error responses achieved by the proposed method result from adaptive gain regulation rather than aggressive suppression of the switching action.

The compensation signals in Figure 17 further demonstrate that TDE effectively reconstructs the lumped uncertainties, allowing the learning-based adaptation to focus on performance refinement rather than disturbance rejection.

Table 4 presents the root mean square tracking errors at each joint of the dual-arm robot. The numerical results indicate that the proposed controller achieves the lowest RMSE at all six joints, confirming improved trajectory-tracking accuracy compared with the conventional SSMC and SSMC-TDE methods.

5.2.2. Scenario 2

The joint trajectory-tracking errors of the dual-arm robot in Scenario 2 are shown in Figure 18. Compared with the conventional SSMC and fixed-gain SSMC-TDE controllers, the proposed RLSSMC-TDE controller achieves improved tracking accuracy, particularly in instances when payload changes occur, as illustrated in Figure 18b,c. This behavior indicates that the reinforcement learning-based gain adaptation enables the controller to respond effectively to abrupt variations in payload-induced dynamics, which cannot be adequately compensated by manually tuned control gains. The influence of time-varying payloads on synchronization performance is depicted in Figure 19. Although all three controllers preserve synchronization in steady states, the proposed RLSSMC-TDE consistently exhibits smaller transient synchronization errors during payload transitions. This improvement demonstrates enhanced robustness of the proposed approach against asymmetric and time-varying uncertainties affecting the dual-arm system. Similar to Scenario 1, the corresponding control inputs and time-delay estimation compensation signals are presented in Figure 20 and Figure 21, respectively. The control inputs retain the characteristic switching behavior of sliding-mode control, while the compensation signals confirm the effectiveness of the TDE mechanism in reconstructing lumped uncertainties. The quantitative RMSE results summarized in Table 5 further support these observations, showing consistent performance improvements across all joints. Overall, these results validate the effectiveness and robustness of the proposed RLSSMC-TDE controller for dual-arm robotic systems operating under model uncertainties, friction effects, external disturbances, and time-varying payloads.
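The relative improvements implied by Tables 4 and 5 can be recomputed with a few lines of NumPy. The averaging rule used below (percentage reduction of the six-joint mean RMSE in each scenario, then averaged over the two scenarios) is an assumption of this sketch, as the exact procedure behind the reported percentages is not stated; the RMSE values themselves are copied from Tables 4 and 5.

```python
import numpy as np

rmse = {
    "scenario_1": {  # Table 4
        "SSMC":     [0.0234, 0.0586, 0.1071, 0.0518, 0.098, 0.2268],
        "SSMC-TDE": [0.001, 0.0016, 0.0028, 0.0011, 0.0018, 0.0032],
        "Proposed": [0.0006, 0.001, 0.0018, 0.0007, 0.0014, 0.0018],
    },
    "scenario_2": {  # Table 5
        "SSMC":     [0.0699, 0.5788, 0.1078, 0.0864, 0.5874, 0.2277],
        "SSMC-TDE": [0.0007, 0.0018, 0.0027, 0.0012, 0.002, 0.0029],
        "Proposed": [0.0004, 0.0012, 0.0018, 0.0007, 0.0014, 0.0019],
    },
}

def mean_reduction(baseline, proposed):
    """Percentage reduction of the six-joint mean RMSE relative to a baseline."""
    b, p = np.mean(baseline), np.mean(proposed)
    return 100.0 * (b - p) / b

def overall_reduction(baseline_name):
    """Average the per-scenario reductions over both scenarios."""
    return float(np.mean([
        mean_reduction(tables[baseline_name], tables["Proposed"])
        for tables in rmse.values()
    ]))

vs_tde = overall_reduction("SSMC-TDE")
vs_ssmc = overall_reduction("SSMC")
```

Under this assumed averaging rule, the sketch gives approximately 35.5% relative to SSMC-TDE, in line with the reported 35.52%, and approximately 99.1% relative to SSMC, close to the reported 99.3%.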

6. Conclusions

This paper presents a DRL-enhanced SSMC-TDE framework for dual-arm robot manipulators operating under significant uncertainties. By integrating the DDPG algorithm, the controller autonomously adapts the sliding surface and robust gains online, thereby improving synchronization accuracy, trajectory-tracking performance, and chattering suppression across a wide range of operating conditions. Quantitative simulation results demonstrate that, compared with conventional SSMC-TDE and SSMC schemes, the proposed method reduces the average RMSE of trajectory-tracking and synchronization errors by 35.52% and 99.3%, respectively, across all six joints in two representative scenarios. Lyapunov-based analysis guarantees the stability of the closed-loop system under nonlinear dynamics, payload variations, and external disturbances. Furthermore, MATLAB/Simulink-ROS/Gazebo co-simulations confirm smoother control actions and reduced control effort.

Future work will focus on hardware validation, multi-modal sensing integration, extensions to multi-arm systems, and exploration of alternative DRL algorithms and digital-twin-based training to enhance real-world performance. In addition, inspired by recent studies on imperfect dynamical systems [46,47], future research will investigate whether certain nonlinear imperfections can be systematically exploited to improve adaptability, robustness, or cooperative behavior in multi-robot manipulation tasks.

Author Contributions

Conceptualization, D.T.T. and T.N.N.; methodology, D.T.T., T.N.N., and T.K.T.H.; software, T.N.N. and T.K.T.H.; validation, D.T.T., T.N.N., and T.K.T.H.; formal analysis, T.N.N. and T.K.T.H.; investigation, T.N.N. and T.K.T.H.; resources, T.N.N. and T.K.T.H.; data curation, T.N.N. and T.K.T.H.; writing—original draft preparation, D.T.T. and T.N.N.; writing—review and editing, D.T.T. and K.K.A.; visualization, T.N.N. and T.K.T.H.; supervision, D.T.T. and K.K.A.; project administration, K.K.A.; funding acquisition, K.K.A. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the 2025 Research Fund of the University of Ulsan.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Proof of stability of the SSMC controller

The Lyapunov function is selected as follows:

V_{1} = \frac{1}{2} s^{T} s

(A1)

Taking the time derivative of (A1) yields:

{\dot{V}}_{1} = s^{T} \dot{s}

(A2)

Substituting (21) into (17), we obtain:

\dot{s} = (I + α T) {\tilde{M}}^{- 1} (- K s - η sgn (s) - d)

(A3)

From (A3), Equation (A2) can be rewritten as:

{\dot{V}}_{1} = - [s^{T} (I + α T) {\tilde{M}}^{- 1} K s] - [s^{T} (I + α T) {\tilde{M}}^{- 1}] (η sgn (s) + d)

(A4)

Based on Equation (A4), by appropriately choosing the design matrices

α

and

K \geq 0

, and in conjunction with Property 1 and the definition of

\bar{M}

in Equation (8), it follows that

s^{T} (I + α T) {\tilde{M}}^{- 1} K s \geq 0

. Furthermore, the matrix

η

is selected to be greater than or equal to the disturbance bound

d

as defined in Equation (20), which ensures that

(η sgn (s) + d) \geq 0

. Consequently,

{\dot{V}}_{1} \leq 0

for all

s

, indicating that the closed-loop system is Lyapunov stable.
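The sign argument can be illustrated numerically in a scalar setting. In the sketch below, the matrix factor multiplying the bracket in (A3) is replaced by a positive scalar `gain`, and the disturbance is an assumed sinusoid bounded by the switching gain; both simplifications are assumptions made for illustration and are not part of the proof. In this scalar case, the product of s and its derivative equals gain · (−K s² − η|s| − d s), which is non-positive whenever |d| ≤ η. The values K = 0.3 and η = 0.15 are borrowed from the first diagonal entries in Table 3.

```python
import numpy as np

def simulate_sliding_variable(s0=1.0, K=0.3, eta=0.15, gain=1.0,
                              dt=1e-3, steps=5000):
    """Euler simulation of s_dot = gain * (-K s - eta sgn(s) - d(t))."""
    s = s0
    s_hist, v_dot_hist = [], []
    for k in range(steps):
        d = 0.1 * np.sin(2.0 * np.pi * k * dt)   # assumed disturbance, |d| <= 0.1 < eta
        s_dot = gain * (-K * s - eta * np.sign(s) - d)
        s_hist.append(s)
        v_dot_hist.append(s * s_dot)             # V1_dot = s * s_dot for V1 = s^2 / 2
        s += dt * s_dot
    return np.array(s_hist), np.array(v_dot_hist)

s_hist, v_dot_hist = simulate_sliding_variable()
```

The recorded values of the Lyapunov derivative stay non-positive throughout the run, and the sliding variable settles into a small band around zero whose width is set by the integration step, which is the discrete-time counterpart of chattering.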

References

1. Tran, D.T.; Dao, H.V.; Ahn, K.K. Adaptive Synchronization Sliding Mode Control for an Uncertain Dual-Arm Robot with Unknown Control Direction. Appl. Sci. 2023, 13, 7423.
2. Smith, C.; Karayiannidis, Y.; Nalpantidis, L.; Gratal, X.; Qi, P.; Dimarogonas, D.V.; Kragic, D. Dual arm manipulation—A survey. Robot. Auton. Syst. 2012, 60, 1340–1353.
3. Rigatos, G.; Abbaszadeh, M.; Busawon, K.; Pomares, J. A Nonlinear Optimal Control Approach for Dual-Arm Robotic Manipulators. Int. J. Humanoid Robot. 2025, 22, 2450009.
4. Karim, M.F.; Bollimuntha, S.; Hashmi, M.S.; Das, A.; Singh, G.; Sridhar, S.; Singh, A.K.; Govindan, N.; Krishna, K.M. Da-Vil: Adaptive Dual-Arm Manipulation with Reinforcement Learning and Variable Impedance Control. In Proceedings of the 2025 IEEE International Conference on Robotics and Automation (ICRA), Atlanta, GA, USA, 19–23 May 2025; pp. 11896–11903.
5. Yang, C.; Jiang, Y.; Na, J.; Li, Z.; Cheng, L.; Su, C.Y. Finite-Time Convergence Adaptive Fuzzy Control for Dual-Arm Robot with Unknown Kinematics and Dynamics. IEEE Trans. Fuzzy Syst. 2019, 27, 574–588.
6. Hacioglu, Y.; Arslan, Y.Z.; Yagiz, N. MIMO fuzzy sliding mode controlled dual arm robot in load transportation. J. Frankl. Inst. 2011, 348, 1886–1902.
7. Jinjun, D.; Yahui, G.; Ming, C.; Xianzhong, D. Symmetrical adaptive variable admittance control for position/force tracking of dual-arm cooperative manipulators with unknown trajectory deviations. Robot. Comput.-Integr. Manuf. 2019, 57, 357–369.
8. Zhang, Y. Adaptive coordinated impedance control for dual-arm robot symmetric bimanual tasks. Robot. Auton. Syst. 2025, 193, 105110.
9. Abbas, M.; Narayan, J.; Dwivedy, S.K. A systematic review on cooperative dual-arm manipulators: Modeling, planning, control, and vision strategies. Int. J. Intell. Robot. Appl. 2023, 7, 683–707.
10. Al-Shuka, H.; Li, Y.; Song, R. Adaptive Approximation Control of Robotic Manipulators: Centralized and Decentralized Control Algorithms; School of Control Science and Engineering, Shandong University: Jinan, China, 2020; p. 10.
11. Zhang, W.; Sun, C.; Alharbi, M.; Hasanien, H.M.; Song, K. A voltage-power self-coordinated control system on the load-side of storage and distributed generation inverters in distribution grid. Ain Shams Eng. J. 2025, 16, 103480.
12. Liu, X.; Xu, X.; Zhu, Z.; Jiang, Y. Dual-Arm Coordinated Control Strategy Based on Modified Sliding Mode Impedance Controller. Sensors 2021, 21, 4653.
13. Tran, D.T.; Nguyen, T.N.; Nguyen, M.T.; Ngo, V.T.; Le, H.L. Synchronous Sliding Mode Control for a 4-DOF Parallel Manipulator in Practice. J. Tech. Educ. Sci. 2023, 18, 1–13.
14. Tran, D.T.; Nguyen, X.T.; Nguyen, T.N.; Truong, Q.T. Practical Synchronous Sliding Mode Control With Time Delay Estimation for a 4-DOF Parallel Manipulator With Unknown Dynamics and Variable Payload. IEEE Access 2025, 13, 102758–102770.
15. Cam, T.D.T.; Tran, D.T.; Tri, N.T.; Nghi, D.V. Synchronization Sliding Mode Control with Time-Delay Estimation for a 2-DOF Closed-Kinematic Chain Robot Manipulator. In Proceedings of the 2021 International Conference on System Science and Engineering (ICSSE); IEEE: New York, NY, USA, 2021; pp. 38–43.
16. Tran, D.T.; Nguyen, T.N.; Nguyen, X.T.; Nguyen, D.M. Synchronous PD Control Using a Time Delay Estimator for a Four-Degree-of-Freedom Parallel Robot in Practice. Machines 2023, 11, 831.
17. Harandi, M. On the Controllers Based on Time Delay Estimation for Robotic Manipulators. arXiv 2021.
18. Truong, T.N.; Vo, A.T.; Kang, H.-J. A Novel Time Delay Nonsingular Fast Terminal Sliding Mode Control for Robot Manipulators with Input Saturation. Mathematics 2025, 13, 119.
19. Kali, Y.; Saad, M.; Benjelloun, K. Optimal super-twisting algorithm with time delay estimation for robot manipulators based on feedback linearization. Robot. Auton. Syst. 2018, 108, 87–99.
20. Lee, J.W.; Rho, J.M.; Park, S.G.; An, H.M.; Kim, M.; Lee, S.Y. Improved Adaptive Sliding Mode Control Using Quasi-Convex Functions and Neural Network-Assisted Time-Delay Estimation for Robotic Manipulators. Sensors 2025, 25, 4252.
21. Vo, A.T.; Truong, T.N.; Kang, H.-J.; Nguyen, N.H.A. Prescribed performance model-free sliding mode control using time-delay estimation and adaptive technique applied to industrial robot arms. Inf. Sci. 2025, 702, 121911.
22. Hu, W.; Yang, Y.; Liu, Z. Deep Deterministic Policy Gradient (DDPG) Agent-Based Sliding Mode Control for Quadrotor Attitudes. Drones 2024, 8, 95.
23. Simon, J.; Gogolák, L.; Sárosi, J. Deep Reinforcement Learning-Assisted Teaching Strategy for Industrial Robot Manipulator. Appl. Sci. 2024, 14, 10929.
24. Hao, X.; Xin, Z.; Huang, W.; Wan, S.; Qiu, G.; Wang, T.; Wang, Z. Deep reinforcement learning enhanced PID control for hydraulic servo systems in injection molding machines. Sci. Rep. 2025, 15, 23005.
25. Khan, H.; Khan, S.A.; Lee, M.C.; Ghafoor, U.; Gillani, F.; Shah, U.H. DDPG-Based Adaptive Sliding Mode Control with Extended State Observer for Multibody Robot Systems. Robotics 2023, 12, 161.
26. Liu, Z.; Zhang, O.; Zhao, Y.; Zhu, Q.; Liu, J. Adaptive neural network-based fixed-time control for robots with input saturation and prescribed performance. Nonlinear Dyn. 2025, 113, 18229–18241.
27. Liu, Z.; Zhang, O.; Gao, Y.; Zhao, Y.; Sun, Y.; Liu, J. Adaptive neural network-based fixed-time control for trajectory tracking of robotic systems. IEEE Trans. Circuits Syst. II Express Briefs 2022, 70, 241–245.
28. Liu, J.; Sun, Y.; Liu, Z.; Gao, Y.; Wu, L.; Leon, J.I.; Franquelo, L.G. Predefined-Time Reliable Control for Robotic Systems With Prescribed Performance. IEEE Trans. Ind. Electron. 2025, 72, 11695–11703.
29. Kim, S.; Suh, J.-H.J.I.A. A Study on Robust Control Scheme Using Prescribed Performance-based Time Delay Control and RBF Neural Network. IEEE Access 2025, 13, 180513–180522.
30. Yao, J.; Deng, W. Active disturbance rejection adaptive control of uncertain nonlinear systems: Theory and application. Nonlinear Dyn. 2017, 89, 1611–1624.
31. Adane, A.G.; Abdissa, C.M. Adaptive Fuzzy Sliding Mode Controller of Three Link Robot Arm Manipulator. IEEE Access 2025, 13, 158222–158236.
32. Han, M.; Wong, K.; Euler-Rolle, J.; Zhang, L.; Katzschmann, R.K. Robust learning-based control for uncertain nonlinear systems with validation on a soft robot. IEEE Trans. Neural Netw. Learn. Syst. 2023, 36, 510–524.
33. Zhang, X.; Liu, J.; Xu, X.; Yu, S.; Chen, H. Robust learning-based predictive control for discrete-time nonlinear systems with unknown dynamics and state constraints. IEEE Trans. Syst. Man Cybern. Syst. 2022, 52, 7314–7327.
34. Wang, Y.; Fang, S.; Hu, J. Active disturbance rejection control based on deep reinforcement learning of PMSM for more electric aircraft. IEEE Trans. Power Electron. 2022, 38, 406–416.
35. Luo, G.; Zhang, D.; Feng, W.; Jiang, Z.; Liu, X. Deep Reinforcement Learning Based Active Disturbance Rejection Control for ROV Position and Attitude Control. Appl. Sci. 2025, 15, 4443.
36. Ran, M.; Li, J.; Xie, L. Reinforcement-learning-based disturbance rejection control for uncertain nonlinear systems. IEEE Trans. Cybern. 2021, 52, 9621–9633.
37. Maleki, M.; Razavi, F.S.; Taghavipour, A. Reinforcement Learning-Based Adaptive Gain Tuning of Terminal Super-Twisting SMC for Lane-Change Control in Autonomous Vehicles. IEEE Access 2025, 13, 197206–197218.
38. Nguyen, T.N.; Nguyen, X.T.; Truong, Q.T.; Tu, D.C.T.; Ahn, J.H.; Ahn, K.K.; Tran, D.T. Reinforcement learning-based improvement of a PD-TDE controller for an upper limb rehabilitation robotic system. In Proceedings of the 2025 28th International Conference on Mechatronics Technology (ICMT), Ho Chi Minh City, Vietnam, 12–15 November 2025; pp. 96–101.
39. Lee, J.; Chang, P.H.; Jamisola, R.S. Relative Impedance Control for Dual-Arm Robots Performing Asymmetric Bimanual Tasks. IEEE Trans. Ind. Electron. 2014, 61, 3786–3796.
40. Hodson, T.O. Root mean square error (RMSE) or mean absolute error (MAE): When to use them or not. Geosci. Model Dev. 2022, 2022, 5481–5487.
41. Pham, D.-A.; Han, S.-H. Enhancing Underwater Robot Manipulators with a Hybrid Sliding Mode Controller and Neural-Fuzzy Algorithm. J. Mar. Sci. Eng. 2023, 11, 2312.
42. Tran, D.T.; Nha, N.T.; Van Thuyen, N.; Lam, L.H.; Ahn, K.K. A Fault-tolerant Synchronous Sliding Mode Control for a 4-DOF Parallel Manipulator With Uncertainties and Actuator Faults. Int. J. Control Autom. Syst. 2024, 22, 1313–1323.
43. Kachroo, P.; Tomizuka, M. Chattering reduction and error convergence in the sliding-mode control of a class of nonlinear systems. IEEE Trans. Autom. Control 2002, 41, 1063–1068.
44. Lee, H.; Utkin, V.I. Chattering suppression methods in sliding mode control systems. Annu. Rev. Control 2007, 31, 179–188.
45. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D.J. Continuous control with deep reinforcement learning. arXiv 2015.
46. Bucolo, M.; Buscarino, A.; Famoso, C.; Fortuna, L.; Frasca, M. Control of imperfect dynamical systems. Nonlinear Dyn. 2019, 98, 2989–2999.
47. Fortuna, L.; Buscarino, A.; Frasca, M. Imperfect dynamical systems. Chaos Solitons Fractals 2018, 117, 200.

Figure 1. Dual-arm robot model.

Figure 2. Overview structure of the SSMC-TDE controller.

Figure 3. Agent training diagram using the DDPG algorithm.

Figure 4. Application diagram of the trained RL-based gain tuner for the SSMC-TDE controller.

Figure 5. Simulation diagram of the reinforcement learning agent block in MATLAB/Simulink.

Figure 6. Actor network architecture.

Figure 7. Critic network architecture.

Figure 8. Detailed architecture of the DDPG reinforcement learning agent.

Figure 9. Overview of the MATLAB/Simulink and Gazebo simulation environment.

Figure 10. The process of building a co-simulation environment.

Figure 11. Model of the dual-arm robot in Gazebo.

Figure 12. The proposed controller simulation diagram in MATLAB/Simulink.

Figure 13. Learning curve of the reinforcement learning agent during training.

Figure 14. Tracking error signal at each joint of the dual-arm robot: (a) Joint 11; (b) Joint 21; (c) Joint 31; (d) Joint 12; (e) Joint 22; and (f) Joint 32.

Figure 15. The synchronous error between the two joints of the 1st and 2nd robot arms: (a) Joint 11 and 12; (b) Joint 21 and 22; and (c) Joint 31 and 32.

Figure 16. Control input signal at each joint of the robot: (a) Joint 11; (b) Joint 21; (c) Joint 31; (d) Joint 12; (e) Joint 22; and (f) Joint 32.

Figure 17. TDE compensation signal at each joint of the robot: (a) Joint 11; (b) Joint 21; (c) Joint 31; (d) Joint 12; (e) Joint 22; and (f) Joint 32.

Figure 18. The tracking error at each joint of the robot: (a) Joint 11; (b) Joint 21; (c) Joint 31; (d) Joint 12; (e) Joint 22; and (f) Joint 32.

Figure 19. The synchronous error between the two joints of the 1st and 2nd robot arms: (a) Joint 11 and 12; (b) Joint 21 and 22; and (c) Joint 31 and 32.

Figure 20. Control input signal at each joint of the robot: (a) Joint 11; (b) Joint 21; (c) Joint 31; (d) Joint 12; (e) Joint 22; and (f) Joint 32.

Figure 21. TDE compensation signal at each joint of the robot: (a) Joint 11; (b) Joint 21; (c) Joint 31; (d) Joint 12; (e) Joint 22; and (f) Joint 32.

Table 1. Hyperparameters of the DDPG algorithm for agent training.

Parameter	Symbol	Value
Sample time	$T_{S}$	0.05
Smoothing factor	$ν$	0.001
Reward discount factor	$γ$	0.99
Learning rate for actor	$α_{a}$	0.0001
Learning rate for critic	$α_{c}$	0.001
Minibatch size	$N_{b}$	128
Experience buffer length	R	$10^{6}$

Table 2. Summary of simulation scenarios and verification objectives.

Scenario	Uncertainty/Disturbance	Description	Verification Purpose
Scenario 1	External force disturbance	±10 N applied to joint 21 and 22 (8–9 s)	Robustness to external disturbances
Scenario 2	Payload variation	Payload increases from 0.5 kg (0–8 s) → 1 kg (8–15 s) → 1.5 kg (15–20 s)	Adaptability to time-varying payloads

Table 3. Control coefficients of the SSMC and SSMC-TDE controllers.

Controller	Control Coefficient
SSMC	$\begin{array}{l} λ = \mathrm{diag} ([10, 18, 40, 10, 18, 40]); \\ K = \mathrm{diag} ([0.3, 0.12, 0.12, 0.3, 0.12, 0.12]); \\ η = \mathrm{diag} ([0.15, 0.15, 0.05, 0.15, 0.15, 0.05]); \\ α = \mathrm{diag} ([0.4, 0.4, 0.4, 0.4, 0.4, 0.4]) . \end{array}$
SSMC-TDE	$\begin{array}{l} λ = \mathrm{diag} ([10, 18, 40, 10, 18, 40]); \\ K = \mathrm{diag} ([0.3, 0.12, 0.12, 0.3, 0.12, 0.12]); \\ η = \mathrm{diag} ([0.15, 0.15, 0.05, 0.15, 0.15, 0.05]); \\ α = \mathrm{diag} ([0.4, 0.4, 0.4, 0.4, 0.4, 0.4]); \\ \bar{M} = \mathrm{diag} ([10^{- 4}, 10^{- 4}, 10^{- 4}, 10^{- 4}, 10^{- 4}, 10^{- 4}]) . \end{array}$

Table 4. Root mean square tracking error at each joint of the dual-arm robot in Scenario 1.

Controllers	Joint 11	Joint 21	Joint 31	Joint 12	Joint 22	Joint 32
SSMC	0.0234	0.0586	0.1071	0.0518	0.098	0.2268
SSMC-TDE	0.001	0.0016	0.0028	0.0011	0.0018	0.0032
Proposed method	0.0006	0.001	0.0018	0.0007	0.0014	0.0018

Table 5. Root mean square tracking error at each joint of the dual-arm robot in Scenario 2.

Controllers	Joint 11	Joint 21	Joint 31	Joint 12	Joint 22	Joint 32
SSMC	0.0699	0.5788	0.1078	0.0864	0.5874	0.2277
SSMC-TDE	0.0007	0.0018	0.0027	0.0012	0.002	0.0029
Proposed method	0.0004	0.0012	0.0018	0.0007	0.0014	0.0019

