Event-Trigger Reinforcement Learning-Based Coordinate Control of Modular Unmanned System via Nonzero-Sum Game
Abstract
1. Introduction
2. Background and Related Work
2.1. Reinforcement Learning
2.2. Nonzero-Sum Game
3. Dynamic Model
- (1) The lumped joint friction
- (2) The interconnected dynamic coupling (see the model sketch below)
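The two components listed above enter a joint-space model of the form commonly used for modular robot manipulators with joint-torque feedback. The equation below is an assumed sketch with our own symbols, not the paper's exact model: f_i lumps the joint friction and h_i collects the interconnected dynamic coupling acting on the i-th module.

```latex
% Assumed i-th joint subsystem (illustrative, not reproduced from the paper):
%   I_{mi}\gamma_i : motor inertia reflected through the gear ratio
%   f_i            : lumped joint friction
%   h_i            : interconnected dynamic coupling from the other modules
%   \tau_i         : joint control torque
\begin{equation*}
  I_{mi}\gamma_i\,\ddot{q}_i
  + f_i\!\left(q_i,\dot{q}_i\right)
  + h_i\!\left(q,\dot{q},\ddot{q}\right)
  = \tau_i ,
  \qquad i = 1,\dots,n .
\end{equation*}
```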
4. Event-Trigger Reinforcement Learning-Based Coordinate Control via Nonzero-Sum Game
4.1. Problem Transformation
4.2. Event-Trigger Reinforcement Learning-Based Coordinate Control
- Case 1: The events are not triggered.
- Case 2: The events are triggered, and the difference of (36) is rewritten accordingly (see the sketch below).
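The two cases correspond to holding versus updating the sampled control input. The snippet below is a minimal, generic sketch of a static event-triggering loop, assuming a threshold of the form ||e|| >= alpha*||x||; the toy dynamics A and B, the gain K, and the threshold alpha are illustrative assumptions, not the controller or triggering condition derived in the paper.

```python
# Hedged sketch of a static event-triggering loop (assumed form, not the
# paper's condition). Between events the control is held (Case 1); at an
# event the state is resampled and the control is updated (Case 2).
import numpy as np

def should_trigger(x, x_hat, alpha=0.3):
    """Fire an event when the gap between the current state x and the
    last-sampled state x_hat exceeds a state-dependent threshold."""
    e = x - x_hat                              # event-triggering error
    return np.linalg.norm(e) >= alpha * np.linalg.norm(x)

# Illustrative second-order joint error dynamics and feedback gain.
A = np.array([[0.0, 1.0], [-2.0, -1.0]])
B = np.array([[0.0], [1.0]])
K = np.array([[1.0, 0.5]])

dt, steps = 0.01, 500
x = np.array([1.0, 0.0])                       # initial tracking error state
x_hat = x.copy()                               # state held since the last event
u = -K @ x_hat                                 # control computed at events only
events = 0

for _ in range(steps):
    if should_trigger(x, x_hat):               # Case 2: event fires, resample
        x_hat = x.copy()
        u = -K @ x_hat
        events += 1
    # Case 1: no event, so the control input stays at its last sampled value.
    x = x + dt * (A @ x + B @ u).ravel()

print(f"events triggered: {events} / {steps} possible samples")
```

In schemes of this kind, Zeno behavior is typically excluded by establishing a strictly positive lower bound on the inter-event time, which is the subject of the next subsection.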
4.3. Exclusion of Zeno Behaviors
5. Experiment
5.1. Experimental Setup
5.2. Experimental Results
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
| Method | Mean Absolute Value of Position Error | Mean Absolute Value of Control Torque |
|---|---|---|
| The existing method (Joint 1) | rad | 0.32 Nm |
| The proposed method (Joint 1) | rad | 0.29 Nm |
| The existing method (Joint 2) | rad | 0.30 Nm |
| The proposed method (Joint 2) | rad | 0.26 Nm |
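Reading off the torque column, the proposed method reduces the mean absolute control torque relative to the existing method by roughly 9% for joint 1 ((0.32 − 0.29)/0.32 ≈ 0.094) and roughly 13% for joint 2 ((0.30 − 0.26)/0.30 ≈ 0.133).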
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Liu, Y.; An, T.; Chen, J.; Zhong, L.; Qian, Y. Event-Trigger Reinforcement Learning-Based Coordinate Control of Modular Unmanned System via Nonzero-Sum Game. Sensors 2025, 25, 314. https://doi.org/10.3390/s25020314