Physical Reinforcement Learning with Integral Temporal Difference Error for Constrained Robots

Pantoja-Garcia, Luis; Parra-Vega, Vicente; Garcia-Rodriguez, Rodolfo

doi:10.3390/robotics14080111

Open AccessArticle

Physical Reinforcement Learning with Integral Temporal Difference Error for Constrained Robots

by

Luis Pantoja-Garcia

¹

,

Vicente Parra-Vega

^2,*

and

Rodolfo Garcia-Rodriguez

³

¹

Tecnologico de Monterrey, Institute of Advanced Materials for Sustainable Manufacturing, Av. General Ramon Corona 2514, Zapopan 45210, Jalisco, Mexico

²

Robotics and Advanced Manufacturing Department, Research Center for Advanced Studies (Cinvestav-Ipn), Mexico City 07360, Mexico

³

Facultad de Ciencias de la Administración, Universidad Autónoma de Coahuila, Saltillo 25225, Mexico

^*

Author to whom correspondence should be addressed.

Robotics 2025, 14(8), 111; https://doi.org/10.3390/robotics14080111

Submission received: 5 June 2025 / Revised: 29 July 2025 / Accepted: 12 August 2025 / Published: 14 August 2025

(This article belongs to the Section AI in Robotics)

Download

Browse Figures

Versions Notes

Abstract

The paradigm of reinforcement learning (RL) refers to agents that learn iteratively through continuous interactions with their environment. However, when the value function is unknown, a neural network is used, which is typically encoded into an unknown temporal difference equation. When RL is implemented in physical systems, explicit convergence and stability analyses are required to guarantee the worst-case operations for any trial, even when the initial conditions are set to zero. In this paper, physical RL (p-RL) refers to the application of RL in dynamical systems that interact with their environments, such as robot manipulators in contact tasks and humanoid robots in cooperation or interaction tasks. Unfortunately, most p-RL schemes lack stability properties, which can even be dangerous for specific robot applications, such as those involving contact (constrained) tasks or interaction tasks. Considering an unknown and disturbed DAE2 robot, in this paper a p-RL approach is developed to guaranteeing robust stability throughout a continuous-time-adaptive actor–critic, with local exponential convergence of force–position tracking error. The novel adaptive mechanisms lead to robustness, while an integral sliding mode enforces tracking. Simulations are presented and discussed to show our proposal’s effectiveness, and some final remarks are addressed concerning the structural aspects.

Keywords:

continuous actor–critic; reinforcement learning; force–position control; robot manipulators; stability

1. Introduction

Reinforcement learning (RL) has emerged as an alternative decision-making learning technique that issues a reinforcement to compute the action, or control, based on recollection of the discount reward encoded throughout the performance evaluation; a distinct feature of RL versus control is precisely that it evaluates the performance each time and iteratively corrects the action to improve it, leading to a task being learned. Unfortunately, for us roboticists, it was developed and substantiated for the tuple of state–action, with the state s being the output and the action

π

the input of a system

Σ

—however, without an explicit structure of

Σ

. Thus, to deploy RL in the physical layer, e.g., a robot, a lot of effort is required to guarantee that the physical structural properties of all of the components are met, with concerns arising about the safety of its deployment in a real-world application [1]. In addition, due to the iterative realm of these schemes, the safety requirements are exacerbated, leading to sim-to-real schemes to transfer RL schemes to the physical realm. In addition, when a rigid-body robot engages in contact with a rigid environment, high-frequency contact is of concern due to the high energy transfer index. Fortunately, the robotics community has proposed RL schemes based on a Lagrangian structure for

Σ

and derived RL motor learning, ready to meet such structures based on passivity-based and model-free adaptive schemes [2]. In the latter type of scheme, ordinary differential equations (ODEs) model the Lagrangian robot for free-motion tasks, and safety issues are handled since the passivity is preserved; uncertainties and unmodeled dynamics are considered; smooth action is synthesized; and no trial-and-error is involved, but explicit energy-based stability conditions are established. In this context, the energy-based stability and adaptive mechanism that substitutes pre-training sets apart this approach of RL for robots as a machine learning scheme since the learning paradigm of a black-box interactive process of exploration and exploitation is circumvented using a structure energy-based adaptation that guides reward recollection without pre-training, quite convenient for physical world applications. In this paper, we extended such an approach to constrained motion modeled using a differential algebraic system of equations (DAE2) to study position–force learning tasks.

Most robotic contact tasks require precise control of the normal force exerted on a surface while following a desired end-effector trajectory on the tangent plane. Active force–position control considers the DAE2’s constrained dynamics, wherein the derivative of the momentum represents the Lagrangian multiplier (contact force) to be controlled. Most schemes circumvent the DAE2 formulation, using ODE (unconstrained motion) dynamical models, where no state is related to the contact force [3]. Popular schemes that use the latter have proliferated, among them impedance control [4,5] and stiffness control, which cannot guarantee position–force convergence. On the other hand, the DAE2 formulation leverages orthogonalized subspaces to model orthogonal operators from the holonomic constraint, thereby projecting force and velocity errors into orthogonal complements. Then, control is designed to enforce force and position convergence simultaneously and independently [6], followed by model-based schemes [7]. Additionally, to compensate for unknown robot dynamics, even Lipschitz disturbances, a method called neurocontrol is proposed where neural networks, NNs, are used as approximator functions or to compensate for unknown dynamics. Some approaches to contact tasks based on NNs are presented in [8,9,10]. In recent years, the application of reinforcement learning (RL) to dynamical systems has attracted attention due to its ability to learn through interaction by encoding different behaviors [11]. Crucial elements of RL are the instrumentation of a “reward” and its “effort” (the value function), which qualify a robot’s task performance accordingly. Depending on such an evaluation, it issues a reinforcement to the “action”, which leads to an improved performance while minimizing the value function.

Actor–critic is a technique used to instrument RL based on a critic NN used to approximate the value function of the temporal difference error equation. In contrast, the actor NN approximates inverse dynamics to yield the action [12]. In control schemes, the actor NN is coupled with a damping term to stabilize the robot. This scheme has gained popularity due to its ability to handle uncertainties while gradually improving the performance to facilitate the successful completion of robotic tasks for free motion, such as in [13,14], to name a few. For contact/interaction tasks, see [15,16,17,18]; notably, in [19], an admittance controller computes the desired force based on RL, leading to a stable force error. Similarly, in [15], adaptive dynamic programming based on an LQR is explored, with bounded errors. In [16], an impedance-based advantage actor–critic (A2C) deals with constraint uncertainties, while [17] uses an impedance-based natural gradient actor–critic (NAC), with experiments. Additionally, [18] presents a model-based AC for optimal impedance tuning, and it is used for cooperation with multi-robots in [20,21], even for humanoid robots [22] without a stability analysis. Although variants of the AC approach have shown effectiveness for contact/interaction tasks, the literature has overlooked the force–position convergence, which is critical due to interactions that occur at a high frequency (due to rigid-body contact), in particular when the robot is constrained by a rigid environment, i.e., a holonomic constraint that yields a DAE2 model.

An early review [23] and recent work [24] on deep RL for manipulation have pointed out the need to explicitly address certain challenges in constrained robot dynamics resulting from the physical interaction, which we refer to as physical RL (p-RL). This acronym aims to distinguish it from RL schemes for unconstrained robots; the underlying foundation of p-RL is based on sound physical interaction principles or contact tasks rather than simply software apps (s-RL). That is, in p-RL, the physical layer of a robotic system cannot withstand hundreds, or thousands, or millions, of training or pre-training tests, as are typically required nowadays in RL, due to the risk of compromising the robot’s structural integrity. Thus, p-RL schemes need to be substantiated through a model-based stability analysis to produce an actionable (physically speaking) action command. Actionable p-RL implies control that is smooth, causal, model-free, and bounded, without pre-training requirements.

In this manuscript, we propose actionable p-RL for a DAE2 robot, showing position–force convergence with a stability analysis. A continuous-time AC scheme deploys RL, extending the methodology [25] from unconstrained to constrained robot dynamics. Hereby, fast and robust adaptation for the critic is obtained by handling robust convergence of the so-called integral temporal difference (ITD) error. At the same time, a passivity-based actor is proposed over orthogonalized error manifolds, where robust convergence is guaranteed based on chatterless integral sliding modes.

2. Preliminaries

2.1. Robot Dynamics and the Problem Statement

Let us take a robot manipulator with n degrees of freedom whose end-effector is constrained to evolve onto a rigid surface, modeled by the following differential algebraic equation (DAE index 2, or DAE-2)

H (q) \ddot{q} + C (q, \dot{q}) \dot{q} + τ_{f r} + g (q) = τ + J_{φ}^{T} (q) λ + τ_{d},

(1)

s . t . φ (q) = 0,

(2)

where

q \in R^{n}

stands for the generalized position and

\dot{q} \in R^{n}

the generalized velocity,

H (q) \in R^{n \times n}

is the inertia matrix,

C (q, \dot{q}) \in R^{n \times n}

is the Coriolis matrix,

g (q) \in R^{n}

is the gravity vector,

J_{φ} \in R^{l \times n}

is the constrained Jacobian,

λ \in R^{1}

stands for the contact force in operational space, and

τ_{f r}

represents affine friction and

τ_{d}

a Lipschitz disturbance. Notice that the rigid surface is modeled by the implicit equation

φ (q) = 0 \in R^{l}

, for

l = 1

independent contact point. Additionally, the DAE system is of index 2 since the input appears in the second derivative of the constraint, and its state is

ξ = (q, \dot{q}, p = \int λ)

, where

q \in R^{n - r}, \dot{q} \in R^{n - r}

, and the momentum

p \in R^{1}

.

2.2. The Actor–Critic Learning Control Problem

Leveraging the functional

Y_{r} = H (q) {\ddot{q}}_{r} + C (q, \dot{q}) {\dot{q}}_{r} + g (q) + τ_{f r}

, where

{\dot{q}}_{r}

is a nominal velocity reference to the shape orthogonalized error coordinate system

S_{r} = \dot{q} - {\dot{q}}_{r}

, [26], then the open-loop error equation is defined as

H (q) {\dot{S}}_{r} + C (q, \dot{q}) S_{r} + B S_{r} = τ + τ_{d} + J_{φ}^{T} λ - Y_{r} .

(3)

Thus, the problem statement is to design a model-free p-RL control mechanism for a DAE2 robot (1) and (2) in order to generate a smooth action policy

τ

such that the state

ξ = (q, \dot{q}, p) \to ξ_{d} = (q_{d}, {\dot{q}}_{d}, p_{d} = \int λ_{d})

without any previous training or learning and any knowledge of the robot dynamics,

Y_{r}

. The use of the term “model-free” in this work refers to a lack of knowledge of the robot’s dynamic model, but the kinematics are assumed to be known.

3. Learning Actor–Critic Design

In the AC algorithm proposed here, the critic learning process is based on a stability-guaranteed integral temporal difference error. On the other hand, the actor synthesizes a partial action, playing the role as an approximator of the inverse dynamics. In this way, the proposed controller is composed of the action plus a stabilizer term, typically a PID-like term.

In the following sections, it is shown how stability is ensured in the NN-critic using Lyapunov stability theory. After that, the adaptation laws for the NN-actor are designed to guarantee closed-loop stability.

3.1. The Value Function and the Temporal Difference Error

Let the value function in continuous time, representing the cumulative rewards, given as [27],

R = \int_{t_{0}}^{\infty} e^{- \frac{m - t}{ψ}} r (m) d m,

(4)

where r represents the instant reward, and

ψ

corresponds to a discount factor with

t_{0} \leq m \leq \infty

. In continuous time, the value function is driven by the temporal difference (TD) error [11], such that differentiating (4), we have

\dot{R} = \frac{1}{ψ} R - r (t) .

(5)

In cases where the value function R is not exactly known, the learning goal is to guarantee that

δ = \dot{R} - \frac{1}{ψ} R + r = 0 .

(6)

Expressing (6) as a lower-order equation

γ = \int δ = \int_{t_{0}}^{t_{f}} δ (τ) d τ

, the integral temporal difference (ITD) error arises as

γ = R - \frac{1}{ψ} \int R + \int r .

(7)

Notice that if

γ \to 0

, so does (6), and henceforth (5) applies. The foundations of this chain of implications lead to the learning rationale encoded by the unknown value function (4).

Remark 1.

The motivation behind the introduction of the ITD error is to drive the learning and stability properties of the critic by γ, specifically the quasi-sliding mode condition. Additionally, the inclusion of the ITD eliminates the need for

\ddot{R}

in the stability analysis. In the next section, the use of the TD error and the ITD as tools to design critic adaptive weights is explained.

3.2. The Critic Neural Network

Assuming that the smooth value function R of (4) is unknown, yet it is Lipschitz, then it can be approximated by a neural network. To this end, a large vector of the ideal weights

W_{c}^{*}

and a basis

Z_{c}^{*} (X_{c}^{*})

exist for a set of ideal inputs

X_{c}^{*}

such that

R = W_{c}^{* T} Z_{c}^{*}

. Now, considering a smaller set of weights

W_{c} \subset W_{c}^{*}

and basis

Z_{c} \subset Z_{c}^{*} (*)

, a small reconstruction error

ϵ_{R}

appears that approximates the value function, i.e.,

R = W_{c}^{T} Z_{c} + ϵ_{R} .

(8)

The linear parametrization (8) suggests that a subset of adaptive weights

{\hat{W}}_{c} \in W_{c}

and a basis exist such that the value function R can be approximated as

\hat{R} = {\hat{W}}_{c}^{T} Z_{c},

(9)

where

{\hat{W}}_{c} \in R^{p \times 1}

are adaptive weights;

Z_{c} = σ (χ_{c})

is the bipolar sigmoid function, with the scaled input

χ_{c} = V_{c}^{T} X_{c}

, with

V_{c}

a constant matrix; and

X_{c}

is the input vector to the neural network. Using (9) and its derivative, the neural approximation of

δ

becomes

\begin{matrix} \hat{δ} & = \dot{\hat{R}} - \frac{1}{ψ} \hat{R} + r \\ = {\dot{\hat{W}}}_{c}^{T} Z_{c} + {\hat{W}}_{c}^{T} {\dot{Z}}_{c} - \frac{1}{ψ} {\hat{W}}_{c}^{T} Z_{c} + r, \end{matrix}

(10)

where

\hat{δ} (t_{0}) \neq 0

in general. Note that using the NN parametrization (9) in Equation (7), we obtain

\dot{\hat{γ}} = \hat{δ}

. The challenge now becomes proposing an adaptive law

{\dot{\hat{W}}}_{c}

such that

\hat{δ}

converges, meaning that the learning process is achieved within a small approximation error.

Proposition 1

(Adaptive Critic Weight Design). Let the adaptation law for the critic’s weights be defined as follows:

{\dot{\hat{W}}}_{c} = - K_{ω} sgn ({\hat{W}}_{c}) - K sgn (\hat{γ}) \frac{Z_{c}}{Z_{c}^{T} Z_{c}},

(11)

where

K_{w}

, K are positive gains. Then,

\hat{γ} \to ϵ_{γ}

for

ϵ_{γ}

a constant, which leads to

\dot{\hat{γ}} = 0 \to

\hat{δ} = 0

. Finally, due to the reconstruction error, we obtain

δ \to ϵ_{δ}

.

Proof.

See Appendix A.1. □

3.3. The Actor Neural Network

The actor network has the role of approximating the inverse dynamics parametrized by the functional

Y_{r}

. Then, analogously constant weights

W_{a}

and the basis

Z_{a}

exist such that

Y_{r} = W_{a}^{T} Z_{a} + ϵ_{a},

(12)

where

ϵ_{a}

is the neural reconstruction error. Henceforth, an actor neural network exists that approximates (12) as follows

{\hat{Y}}_{r} = {\hat{W}}_{a}^{T} Z_{a},

(13)

where

{\hat{W}}_{a} \in R^{h \times n}

are adaptive weights, and

Z_{a} = σ (χ_{a})

is the bipolar sigmoid function, with the input

χ_{a} = V_{a}^{T} X_{a}

, where

V_{a}

are constant weights, and

X_{a}

its input vector. The adaptation law

{\dot{\hat{W}}}_{a}

still needs to be designed.

Proposition 2

(Adaptive Actor Weight Design). Let the following adaptive law update the actor weights:

{\dot{\hat{W}}}_{a} = - Γ_{a} Z_{a} S_{r}^{T} - Γ_{a} {\hat{W}}_{a} {(\hat{γ} r)}^{2},

(14)

where

Γ_{a} \in R_{+}^{h \times h}

is the feedback gain, and

{(\hat{γ} r)}^{2}

constitutes the reinforcement. Then, weights

{\hat{W}}_{a}

are bounded.

Proof.

See Appendix A.2. □

3.4. Model-Free Actor–Critic Control Design

Assuming that

Y_{r}

is unknown and approximated by the NN-actor, let the controller be given as

τ = - K_{d} S_{r} + {\hat{W}}_{a}^{T} Z_{a} + J_{φ}^{T} (- λ_{d} - {\dot{S}}_{d F} + K_{F} \tanh (μ S_{q F}) + η S_{v F}),

(15)

where

S_{r} = \dot{q} - {\dot{q}}_{r}

is an orthogonalized extended velocity error arising when the nominal reference

{\dot{q}}_{r}

is

{\dot{q}}_{r} = Q (q) (\dot{q} - α Δ q + S_{d q} - K_{q} \int sgn (S_{q q})) + J_{φ}^{+} (q) β S_{v F},

(16)

where

Q (q) = I - J_{φ}^{+} (q) J_{φ} (q)

is the orthogonal projection of

J_{φ}

,

J_{φ}^{+}

is the right pseudoinverse,

Δ q = q - q_{d}

is the position error,

q_{d} \in C^{2}

is the desired position, and

β

is a positive definite gain. Thus, the extended velocity error

S_{r}

can be rewritten as follows:

S_{r} = Q (q) S_{v q} - J_{φ}^{+} (q) β S_{v F},

(17)

where

S_{v q} = S_{q q} + K_{q} \int sgn (S_{q q}) position error manifold,

(18)

S_{v F} = S_{q F} + K_{F} \int sgn (S_{q F}) force error manifold,

(19)

for

S_{q q} = S_{q} - S_{d q}

,

S_{q} = Δ \dot{q} + α Δ q

,

S_{d q} = S_{q} (t_{0}) e^{- k_{1} t}

,

S_{q F} = S_{F} - S_{d F}

,

S_{F} = \int Δ λ

,

S_{d F} = S_{F} (t_{0}) e^{- k_{2} t}

, and

Δ λ = λ - λ_{d}

, where

λ_{d}

is the desired contact force, and

α

,

k_{1}

,

k_{2}

,

K_{q}

,

K_{F}

,

β

are positive definite feedback gains. Using (3) and (15), we obtain the following closed-loop error equation:

\begin{matrix} H (q) {\dot{S}}_{r} = & - C (q, \dot{q}) S_{r} - B S_{r} + τ_{d} - K_{d} S_{r} + {\hat{W}}_{a}^{T} Z_{a} - Y_{r} \\ + J_{φ}^{T} ({\dot{S}}_{v F} + η S_{v F}) + K_{F} J_{φ}^{T} ϵ_{z}, \end{matrix}

(20)

where

ϵ_{z} = (\tanh (μ S_{q F}) - sgn (S_{q F}))

. Notice that the state of the closed-loop error equation is complemented with (10), (11) and (14). We are now ready to announce the main result.

Theorem 1.

Consider the constrained robot dynamics (1) in the closed loop with the proposed actor–critic learning controller (15), which leads to the system of Equations (10), (11), (14) and (20). Then, from Propositions 1 and 2 and selecting

K_{d}

,

K_{q}

,

K_{F}

large enough, the integral sliding mode at

S_{q q} = 0

and

S_{q F} = 0

is enforced. Thus, the local exponential convergence of the tracking errors

Δ q

,

Δ \dot{q}

, and

Δ λ

is guaranteed, and

\hat{δ} \to 0

without any knowledge of the DAE2 dynamics.

Proof.

See Appendix A.2. □

4. Simulations

Simulations are carried out in Matlab Simulink using the integrator ode4, with a fixed step of

1 \times 10^{- 4}

. The simulation consists of a 3-dof robot in contact with a rigid plane parallel to the

x - y

plane, see Figure 1, at a height of

z_{d} = 0.15

m. The desired trajectory onto the plane

x - y

is a circle with the radius

a = 0.15

centered at

(x_{c}, y_{c}) = (0.5, 0)

; then, the desired Cartesian path is

x_{d} = x_{c} + a \cos (w t)

and

y_{d} = y_{c} + a \sin (w t)

, with the frequency

w = 1

; the desired force is

λ_{d} = 25 + 2 \sin (t)

N.

The robot parameters are the following: link lengths

l_{1} = 0.5

,

l_{2} = 0.5

, and

l_{3} = 0.35

m; link masses

m_{1} = 7.2

,

m_{2} = 5

, and

m_{3} = 1.9

kg; distances to the center of mass

l_{c 1} = 0.25

,

l_{c 2} = 0.19

, and

l_{c 3} = 0.12

; and inertias

I_{1} = 0.15

,

I_{2} = 0.02

, and

I_{3} = 0.016

. The initial conditions are

q (t_{0}) = {(0, 0, - π / 2)}^{T} r a d

,

\dot{q} (t_{0}) = {(0, 0, 0)}^{T} r a d / s

, corresponding to being in contact with the plane.

The reward is designed to continuously supervise the tracking performance. Thus, a weighted summation of the position and velocity errors is considered, without weighting the integral of the tracking errors, regardless of its orthogonal projection. Then,

r = Δ q^{T} P Δ q + Δ {\dot{q}}^{T} A Δ \dot{q} + {(\int Δ λ)}^{2}

, with

P = d i a g (0.9) \in R^{n \times n}

,

A = d i a g (0.1) \in R^{n \times n}

so that the position is greatly weighted given that the robot is constrained in its position. The adaptation gains are selected as

K_{w} = 7

,

K = 50

,

Γ_{a} = I \times 1500

and the feedback gains are selected as

K_{d} = d i a g (50 100 50)

,

β = 1

,

α = d i a g (3 3 3)

,

K_{q} = d i a g (1 1 1)

,

K_{F} = 1.15

,

k_{1} = 50

,

k_{2} = 10

,

η = 2000

.

Results

In Figure 2, we can see that the integral temporal difference error converges close to zero and remains constant; consequently, the temporal difference error, a derivative of the ITD error, converges to zero. This means that the adaptive laws for the NN-critic guarantee the convergence of the TD error. The integral temporal difference error dynamics facilitate motor learning through reward recollection; the larger the ITD error or the reward, the greater the reinforcement to improve the task performance. This is independent of how the reward policy is designed, including Pavlov’s negative rewards [28].

The joint position and velocity errors are shown in Figure 3, which converge approximately after 1 s, while Figure 4a shows the convergence of the force error in approximately 0.02 s, considerably faster than the position and velocity errors, shown in Figure 4b. Extended velocity errors are shown in Figure 5a, which remain bounded. Additionally, Figure 5b and Figure 6a show how the invariant manifolds for position-velocity and force, respectively, remain near zero. In Figure 6b, the performance of the smooth control signals is shown. Finally, Figure 7 shows how the actor–critic weights remain bounded, as is expected according to the theory. The convergence of the tracking errors stems from the underlying integral sliding mode, notably without any chattering in the controller, which contrasts with the existing literature, which lacks a similar result. Chattering refers to a very high-frequency control component that puts the structural integrity of the robot, let alone its constrained motion, in danger. It is unclear how to achieve smooth and robust control and convergence using another technique.

To compare the performance of the proposed approach, a comparative study is presented under the same assumptions as those for the model-free DAE robot, including the parameters and simulation conditions. Specifically, it is used as a force–position neurocontroller for a DAE-2 robot. For the comparative study, the following performance metrics are used:

I T A E_{p o s} = \int_{t_{0}}^{t} t {∥Δ q∥}^{2}

represents the accumulation of the position error over time;

I T A E_{v e l} = \int_{t_{0}}^{t} t {∥Δ \dot{q}∥}^{2}

represents the accumulation of the velocity error over time;

I A C = \int_{t_{0}}^{t} ∥τ∥

and

I A C V = \int_{t_{0}}^{t} ∥\frac{d}{d t} τ∥

represent the control effort index and the smoothness of the control signal index, respectively.

Table 1 presents the numerical values obtained for each case, indicating that our proposal yields smaller values, representing a better performance than that of the neurocontroller.

5. Conclusions

A novel model-free AC controller for contact tasks, which yields motor learning in DAE2 robots, is presented. This scheme is based on a method called p-RL, which refers to the application of RL to dynamical systems that interact with their environments. The proposed p-RL is instrumented through an AC architecture for robotic contact tasks, instrumenting novel adaptive laws. It is worth mentioning that the stability involves the nominal DAE2 dynamics; however, the AC-based controller scheme is free of model dynamics.

Update laws are designed to satisfy Lyapunov stability rather than being concerned with parametric variations, as in gradient descent. In this way, local minima are avoided, resulting in stronger stability properties in the time domain of the complete state of the closed-loop error equation.

A simulation for a 3D robot constrained by a rigid plane is presented, demonstrating tracking with smooth control actions, without requiring knowledge of the disturbed dynamics. An experimental evaluation is underway, including a comparison with various reward policies, as well as asymptotic model-free controllers that guarantee tracking, such as neurocontrol and integral sliding modes. Finally, we have employed a general formulation of DAE2 robots, which enables a straightforward extension to multirobot tasks, such as cooperative manipulators and robotic hand manipulation, particularly in critical contact tasks where continuous oversight of the performance is required.

Safety concerns arise when attempting to deploy an RL scheme (obtained under the premise of an abstract system

Σ

) in a physical system. Lyapunov-based schemes are a straightforward methodology to deal with, as devised more than two decades ago in the sound paper [29], or, for instance, in synthesizing control Lyapunov functions from barrier Lyapunov functions or Lyapunov stability, as in our case.

Unlike the traditional RL schemes reported in the literature, the proposed p-RL does not require any training or learning in the actor or the critic. Although sim-to-real transfer may be recommended to reduce high real-world trial costs, it is not a fundamental step for the proposed approach. That is, the process of transferring knowledge or policies learned in a simulated environment to the real world is not necessary. Only the process of tuning the gains is required to satisfy the stability conditions, where physics-based simulators can be used to improve the transferability.

Last but not least, model-free force–position control for DAE2 robots was solved over three decades ago, without chattering [7]. Therefore, the question arises as to what contribution is obtained by introducing motor learning. When motor learning is deployed, tracking is commensurate with the performance evaluation, not only asymptotic tracking. This latter notion enforces tracking as

t \to \infty

, whereas motor learning supervises the performance at each instant, not in the limit, at infinity.

Author Contributions

Conceptualization and investigation, L.P.-G., V.P.-V. and R.G.-R.; methodology and formal analysis, L.P.-G., V.P.-V. and R.G.-R.; software and resources, L.P.-G.; writing—original draft preparation, L.P.-G., V.P.-V. and R.G.-R.; writing—review, editing and supervision, V.P.-V. and R.G.-R. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were created, or used from any publicly archived datasets.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Proofs of the Propositions and Theorem

Appendix A.1. Proof of Proposition 1

Consider the following Lyapunov candidate function (LCF):

V_{γ} = \frac{1}{2} {\hat{γ}}^{2} + \frac{1}{2} {\hat{W}}_{c}^{T} {\hat{W}}_{c}

(A1)

which qualifies as the LCF for system (10) and (11) for the origin

(\hat{γ}, {\hat{W}}_{c}) = (0, 0)

. Taking the time derivative to (A1) and substituting Equation (10) leads to

\begin{matrix} {\dot{V}}_{γ} & = \hat{γ} \dot{\hat{γ}} + {\hat{W}}_{c}^{T} {\dot{\hat{W}}}_{c} \\ = \hat{γ} (\dot{\hat{R}} - \frac{1}{ψ} \hat{R} + r) + {\hat{W}}_{c}^{T} {\dot{\hat{W}}}_{c} \\ = \hat{γ} ({\dot{\hat{W}}}_{c}^{T} Z_{c} + {\hat{W}}_{c}^{T} {\dot{Z}}_{c} - \frac{1}{ψ} {\hat{W}}_{c}^{T} Z_{c} + r) + {\hat{W}}_{c}^{T} {\dot{\hat{W}}}_{c} \end{matrix}

(A2)

Using the weight adaptation law defined in (11), (A2) can be written as follows:

\begin{matrix} {\dot{V}}_{γ} = & \hat{γ} ({\{- K sgn (\hat{γ}) \frac{Z_{c}}{Z_{c}^{T} Z_{c}} - K_{ω} sgn ({\hat{W}}_{c})\}}^{T} Z_{c} + {\hat{W}}_{c}^{T} {\dot{Z}}_{c} - \frac{1}{ψ} {\hat{W}}_{c}^{T} Z_{c} + r) \\ + {\hat{W}}_{c}^{T} \{- K sgn (\hat{γ}) \frac{Z_{c}}{Z_{c}^{T} Z_{c}} - K_{ω} sgn ({\hat{W}}_{c})\} \end{matrix}

(A3)

Considering

ψ ≫ 1

and using

∥ \frac{Z_{c}}{Z_{c}^{T} Z_{c}} ∥ \leq 1

, Equation (A3) becomes

\begin{matrix} {\dot{V}}_{γ} = & - K \hat{γ} sgn (\hat{γ}) - K_{ω} \hat{γ} \underset{ϵ_{1}}{\underset{︸}{sgn {({\hat{W}}_{c})}^{T} Z_{c}}} + \hat{γ} {\hat{W}}_{c}^{T} {\dot{Z}}_{c} - {\hat{W}}_{c}^{T} \underset{ϵ_{2}}{\underset{︸}{Z_{c} \frac{\hat{γ}}{ψ}}} + \hat{γ} r \\ - K {\hat{W}}_{c}^{T} \underset{ϵ_{3}}{\underset{︸}{sgn (\hat{γ}) \frac{Z_{c}}{Z_{c}^{T} Z_{c}}}} - K_{ω} {\hat{W}}_{c}^{T} sgn ({\hat{W}}_{c}) \end{matrix}

(A4)

where

ϵ_{1}, ϵ_{2}, ϵ_{3}

are bounded terms. Then, (A4) becomes

\begin{matrix} {\dot{V}}_{γ} & \leq - K |\hat{γ}| - K_{ω} |{\hat{W}}_{c}| + |\hat{γ}| |{\dot{Z}}_{c}| |{\hat{W}}_{c}| + K_{ω} |\hat{γ}| |ϵ_{1}| - ϵ_{2} |{\hat{W}}_{c}| + \hat{γ} r - K ϵ_{3} |{\hat{W}}_{c}| \\ \leq - (K - K_{ω} ϵ_{1}) |\hat{γ}| - (K_{ω} - |\hat{γ}| |{\dot{Z}}_{c}| - ϵ_{2} - K ϵ_{3}) |{\hat{W}}_{c}| + \hat{γ} r \\ \leq - η_{1} |\hat{γ}| - η_{2} |{\hat{W}}_{c}| + ϵ_{γ} \end{matrix}

(A5)

If

K > K_{ω} ϵ_{1}

and

K_{ω} > (| {\dot{Z}}_{c} | |\hat{γ}| + ϵ_{2}) / (1 + ϵ_{3})

, then

η_{1} > 0, η_{2} > 0

. In this condition, (A5) guarantees that

(\hat{γ}, {\hat{W}}_{c}) \to ϵ = ϵ_{γ} / \inf (η_{1}, η_{2})

, a very small vicinity. This implies that the

L_{1}

norm induces a quasi-sliding mode at

(\hat{γ}, {\hat{W}}_{c}) = (ϵ_{5}, ϵ_{6})

in finite time but not at

(\hat{γ}, {\hat{W}}_{c}) = (0, 0)

(QED).

Appendix A.2. Proof of the Theorem

Consider the derivative of the Lyapunov candidate function

V_{T} = \frac{1}{2} S_{r}^{T} H (q) S_{r} + \frac{1}{2} S_{v F} β S_{v F} + \frac{1}{2} t r ({\tilde{W}}_{a}^{T} Γ_{a}^{- 1} {\tilde{W}}_{a}) + V_{γ}

, along with its solutions (14) and (20); for

\tilde{γ} = Δ γ = γ - \hat{γ}

, we obtain

\begin{matrix} {\dot{V}}_{T} = & S_{r}^{T} H (q) {\dot{S}}_{r} + S_{r}^{T} \frac{\dot{H}}{2} S_{r} + S_{v F}^{T} β {\dot{S}}_{v F} + t r (- {\tilde{W}}_{a}^{T} Γ_{a}^{- 1} {\dot{\hat{W}}}_{a}) + {\dot{V}}_{γ} \\ = & S_{r}^{T} (- (K_{d} + B) S_{r} + {\hat{W}}_{a}^{T} Z_{a} - Y_{r} + τ_{d} + K_{F} J_{φ}^{T} ϵ_{z}) - S_{v F}^{T} β η S_{v F} \\ - t r ({\tilde{W}}_{a}^{T} (- Z_{a} S_{r}^{T} - {\hat{W}}_{a} {(\hat{γ} r)}^{2})) + {\dot{V}}_{γ} \\ = & S_{r}^{T} (- (K_{d} + B) S_{r} - {\tilde{W}}_{a}^{T} Z_{a} - ϵ_{a} + τ_{d} + K_{F} J_{φ}^{T} ϵ_{z}) - S_{v F}^{T} β η S_{v F} \\ + S_{r}^{T} {\tilde{W}}_{a}^{T} Z_{a} - t r (- {\tilde{W}}_{a} {\hat{W}}_{a} {(\hat{γ} r)}^{2}) + {\dot{V}}_{γ} \\ \leq & - S_{r}^{T} (K_{d} + B) S_{r} + S_{r}^{T} (τ_{d} - ϵ_{a}) - S_{v F}^{T} β η S_{v F} + S_{r}^{T} K_{F} J_{φ}^{T} ϵ_{z} \\ - {(\hat{γ} r)}^{2} t r ({\tilde{W}}_{a}^{T} ({\tilde{W}}_{a} - W_{a})) - η_{1} |\hat{γ}| - η_{2} |{\hat{W}}_{c}| + \hat{γ} r \end{matrix}

(A6)

where we have used the trace property, and

{\dot{V}}_{γ} \leq - η_{1} |\hat{γ}| - η_{2} |{\hat{W}}_{c}| + \hat{γ} r

. Thus, (A6) becomes

\begin{matrix} {\dot{V}}_{T} \leq & - S_{r}^{T} (K_{d} + B) S_{r} - S_{v F}^{T} β η S_{v F} + ∥S_{r}∥ ∥ϵ_{a} + ϵ_{τ_{d}} + E_{z}∥ - η_{1} ∥ \hat{γ} ∥ - η_{2} ∥ {\hat{W}}_{c} ∥ \\ - {(\hat{γ} r)}^{2} {(∥{\tilde{W}}_{a}∥ - \frac{W_{a m a x}}{2})}^{2} + {(\hat{γ} r)}^{2} \frac{W_{a m a x}}{4} + \hat{γ} r \end{matrix}

(A7)

where

ϵ_{τ_{d}} \geq | | τ_{d} | |

and

E_{z} \leq K_{F} J_{φ}^{T} ϵ_{z}

. Then, for large enough gains

K_{d}

,

β

,

η

,

η_{1}

and

η_{2}

, the closed-loop signals

S_{r}

,

{\tilde{W}}_{c}

,

{\tilde{W}}_{a}

, and

\hat{γ}

remain bounded. Now, we continue to prove the existence of the sliding mode at

S_{q q}

and

S_{q F}

. From (18) and (19), we obtain

\begin{matrix} S_{q q}^{T} {\dot{S}}_{q q} & = - K_{q} ∥S_{q q}∥ + {\dot{S}}_{v q} S_{q q} \\ \leq - μ_{q} ∥S_{q q}∥ \end{matrix}

(A8)

\begin{matrix} S_{q F}^{T} {\dot{S}}_{q F} & = - K_{F} ∥S_{q F}∥ + {\dot{S}}_{v F} S_{q F} \\ \leq - μ_{F} ∥S_{q F}∥ \end{matrix}

(A9)

where

μ_{q} = K_{q} - ϵ_{q}

and

μ_{F} = K_{F} - ϵ_{F}

, with

ϵ_{q} > {\dot{S}}_{v q}

and

ϵ_{q} > {\dot{S}}_{v F}

; then, tuning

K_{q} > ϵ_{q}

and

K_{F} > ϵ_{F}

yields

μ_{q}, μ_{F} > 0

, guaranteeing the sliding mode condition at

S_{q q} = 0

and

S_{q F} = 0

, with

(Δ q, Δ \dot{q}), Δ λ

as the unique solution to these manifolds, respectively; thus, local exponential convergence for the position, velocity, and force tracking errors is proven (QED).

References

Gu, S.; Yang, L.; Du, Y.; Chen, G.; Walter, F.; Wang, J.; Knoll, A. A Review of Safe Reinforcement Learning: Methods, Theories, and Applications. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 11216–11235. [Google Scholar] [CrossRef]
Pantoja-Garcia, L.; Parra-Vega, V.; Garcia-Rodriguez, R. Automatic reinforcement for robust model-free neurocontrol of robots without persistent excitation. Int. J. Adapt. Control Signal Process. 2024, 38, 221–236. [Google Scholar] [CrossRef]
Chiaverini, S.; Siciliano, B.; Villani, L. A survey of robot interaction control schemes with experimental comparison. IEEE/ASME Trans. Mechatronics 1999, 4, 273–285. [Google Scholar] [CrossRef]
Hogan, N. Impedance Control: An Approach to Manipulation. In Proceedings of the 1984 American Control Conference, San Diego, CA, USA, 6–8 June 1984; pp. 304–313. [Google Scholar] [CrossRef]
Chien, M.C.; Huang, A.C. Adaptive Impedance Control of Robot Manipulators based on Function Approximation Technique. Robotica 2004, 22, 395–403. [Google Scholar] [CrossRef]
McClamroch, N.; Wang, D. Feedback stabilization and tracking of constrained robots. IEEE Trans. Autom. Control 1988, 33, 419–426. [Google Scholar] [CrossRef]
Parra-Vega, V.; Arimoto, S.; Liu, Y.; Naniwa, T. Model-based adaptive hybrid control for robot manipulators under holonomic constraints. IFAC Proc. Vol. 1994, 27, 475–480. [Google Scholar] [CrossRef]
Jung, S.; Hsia, T. Neural network impedance force control of robot manipulator. IEEE Trans. Ind. Electron. 1998, 45, 451–461. [Google Scholar] [CrossRef]
Bechlioulis, C.P.; Doulgeri, Z.; Rovithakis, G.A. Neuro-Adaptive Force/Position Control With Prescribed Performance and Guaranteed Contact Maintenance. IEEE Trans. Neural Netw. 2010, 21, 1857–1868. [Google Scholar] [CrossRef] [PubMed]
Peng, G.; Yang, C.; He, W.; Chen, C.L.P. Force Sensorless Admittance Control With Neural Learning for Robots With Actuator Saturation. IEEE Trans. Ind. Electron. 2020, 67, 3138–3148. [Google Scholar] [CrossRef]
Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; The MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
Barto, A.G.; Sutton, R.S.; Anderson, C.W. Looking Back on the Actor–Critic Architecture. IEEE Trans. Syst. Man Cybern. Syst. 2021, 51, 40–50. [Google Scholar] [CrossRef]
He, W.; Gao, H.; Zhou, C.; Yang, C.; Li, Z. Reinforcement Learning Control of a Flexible Two-Link Manipulator: An Experimental Investigation. IEEE Trans. Syst. Man Cybern. Syst. 2021, 51, 7326–7336. [Google Scholar] [CrossRef]
Vu, V.T.; Dao, P.N.; Loc, P.T.; Huy, T.Q. Sliding Variable-based Online Adaptive Reinforcement Learning of Uncertain/Disturbed Nonlinear Mechanical Systems. J. Control Autom. Electr. Syst. 2021, 32, 281–290. [Google Scholar] [CrossRef]
Zhan, H.; Huang, D.; Chen, Z.; Wang, M.; Yang, C. Adaptive dynamic programming-based controller with admittance adaptation for robot–environment interaction. Int. J. Adv. Robot. Syst. 2020, 17, 1729881420924610. [Google Scholar] [CrossRef]
Zhang, T.; Xiao, M.; Zou, Y.b.; Xiao, J.d.; Chen, S.y. Robotic Curved Surface Tracking with a Neural Network for Angle Identification and Constant Force Control based on Reinforcement Learning. Int. J. Precis. Eng. Manuf. 2020, 21, 869–882. [Google Scholar] [CrossRef]
Liang, L.; Chen, Y.; Liao, L.; Sun, H.; Liu, Y. A novel impedance control method of rubber unstacking robot dealing with unpredictable and time-variable adhesion force. Robot. Comput.-Integr. Manuf. 2021, 67, 102038. [Google Scholar] [CrossRef]
Zhao, X.; Han, S.; Tao, B.; Yin, Z.; Ding, H. Model-Based Actor−Critic Learning of Robotic Impedance Control in Complex Interactive Environment. IEEE Trans. Ind. Electron. 2022, 69, 13225–13235. [Google Scholar] [CrossRef]
Perrusquía, A.; Yu, W.; Soria, A. Position/force control of robot manipulators using reinforcement learning. Ind. Robot. Int. J. Robot. Res. Appl. 2019, 46, 267–280. [Google Scholar] [CrossRef]
Dao, P.N.; Do, D.K.; Nguyen, D.K. Adaptive Reinforcement Learning-Enhanced Motion/Force Control Strategy for Multirobot Systems. Math. Probl. Eng. 2021, 2021, 5560277. [Google Scholar] [CrossRef]
Dao, P.N.; Liu, Y.C. Adaptive reinforcement learning in control design for cooperating manipulator systems. Asian J. Control 2022, 24, 1088–1103. [Google Scholar] [CrossRef]
Katić, D.M.; Rodić, A.D.; Vukobratović, M.K. Hybrid dynamic control algorithm for humanoid robots based on reinforcement learning. J. Intell. Robot. Syst. Theory Appl. 2008, 51, 3–30. [Google Scholar] [CrossRef]
Kormushev, P.; Calinon, S.; Caldwell, D.G. Reinforcement Learning in Robotics: Applications and Real-World Challenges. Robotics 2013, 2, 122–148. [Google Scholar] [CrossRef]
Liu, R.; Nageotte, F.; Zanne, P.; de Mathelin, M.; Dresp-Langley, B. Deep Reinforcement Learning for the Control of Robotic Manipulation: A Focussed Mini-Review. Robotics 2021, 10, 22. [Google Scholar] [CrossRef]
Pantoja-Garcia, L.; Parra-Vega, V.; Garcia-Rodriguez, R.; Vázquez-García, C.E. A Novel Actor—Critic Motor Reinforcement Learning for Continuum Soft Robots. Robotics 2023, 12, 141. [Google Scholar] [CrossRef]
Parra-Vega, V.; Arimoto, S. A passivity-based adaptive sliding mode position-force control for robot manipulators. Int. J. Adapt. Control Signal Process. 1996, 10, 365–377. [Google Scholar] [CrossRef]
Doya, K. Temporal Difference Learning in Continuous Time and Space. In Advances in Neural Information Processing Systems, Proceedings of the 1995 Conference, Denver, CO, USA, 27–30 November 1995; Touretzky, D., Mozer, M., Hasselmo, M., Eds.; MIT Press: Cambridge, MA, USA, 1995; Volume 8, pp. 1073–1079. [Google Scholar]
Madden, G.J.; Mahmoudi, S.; Brown, K. Pavlovian learning and conditioned reinforcement. J. Appl. Behav. Anal. 2023, 56, 498–519. [Google Scholar] [CrossRef]
Perkins, T.; Barto, A. Lyapunov design for safe reinforcement learning control. In Safe Learning Agents: Papers from the 2002 AAAI Symposium, Palo Alto, CA, USA, 25–27 March 2002; AAAI Press: Menlo Park, CA, USA, 2002; pp. 23–30. [Google Scholar]

Figure 1. A robot manipulator in contact with a rigid plane follows the desired trajectory while exerting a desired force on it.

Figure 2. TD error performance: (a) the ITD error converges faster to the origin, (b) the TD error converges in approximately 0.02 s.

Figure 3. Tracking errors: (a) position tracking error convergence for each joint,

Δ q_{i}

; (b) velocity tracking error convergence for each joint,

Δ {\dot{q}}_{i}

, for

i = 1, 2, 3

.

Figure 3. Tracking errors: (a) position tracking error convergence for each joint,

Δ q_{i}

; (b) velocity tracking error convergence for each joint,

Δ {\dot{q}}_{i}

, for

i = 1, 2, 3

.

Figure 4. Tracking errors: (a) force tracking error convergence; (b) real vs. desired Cartesian trajectory.

Figure 5. Error manifolds: (a) orthogonalized extended velocity errors,

S_{r i}

, (b) sliding position manifold,

S_{q q i}

for

i = 1, 2, 3

.

Figure 5. Error manifolds: (a) orthogonalized extended velocity errors,

S_{r i}

, (b) sliding position manifold,

S_{q q i}

for

i = 1, 2, 3

.

Figure 6. (a) Sliding force manifold,

S_{q F}

, (b) control signal performance,

τ_{i}

, for

i = 1, 2, 3

.

Figure 6. (a) Sliding force manifold,

S_{q F}

, (b) control signal performance,

τ_{i}

, for

i = 1, 2, 3

.

Figure 7. Adaptation of neural weights for the Critic NN (a), and the Actor NN (b).

Table 1. Metrics obtained from our proposal and the neurocontroller for contact tasks.

Control Law	Performance Metrics
Control Law	${ITAE}_{pos}$	${ITAE}_{vel}$	$IAC$	$IACV$
Our Proposal	4.796 × 10⁻⁴	9.194 × 10⁻²	258.8	142.9
Neurocontroller	7.142 × 10⁻⁴	9.246 × 10⁻²	259	151.6

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Pantoja-Garcia, L.; Parra-Vega, V.; Garcia-Rodriguez, R. Physical Reinforcement Learning with Integral Temporal Difference Error for Constrained Robots. Robotics 2025, 14, 111. https://doi.org/10.3390/robotics14080111

AMA Style

Pantoja-Garcia L, Parra-Vega V, Garcia-Rodriguez R. Physical Reinforcement Learning with Integral Temporal Difference Error for Constrained Robots. Robotics. 2025; 14(8):111. https://doi.org/10.3390/robotics14080111

Chicago/Turabian Style

Pantoja-Garcia, Luis, Vicente Parra-Vega, and Rodolfo Garcia-Rodriguez. 2025. "Physical Reinforcement Learning with Integral Temporal Difference Error for Constrained Robots" Robotics 14, no. 8: 111. https://doi.org/10.3390/robotics14080111

APA Style

Pantoja-Garcia, L., Parra-Vega, V., & Garcia-Rodriguez, R. (2025). Physical Reinforcement Learning with Integral Temporal Difference Error for Constrained Robots. Robotics, 14(8), 111. https://doi.org/10.3390/robotics14080111

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Physical Reinforcement Learning with Integral Temporal Difference Error for Constrained Robots

Abstract

1. Introduction

2. Preliminaries

2.1. Robot Dynamics and the Problem Statement

2.2. The Actor–Critic Learning Control Problem

3. Learning Actor–Critic Design

3.1. The Value Function and the Temporal Difference Error

3.2. The Critic Neural Network

3.3. The Actor Neural Network

3.4. Model-Free Actor–Critic Control Design

4. Simulations

Results

5. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

Appendix A. Proofs of the Propositions and Theorem

Appendix A.1. Proof of Proposition 1

Appendix A.2. Proof of the Theorem

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI