Article

Trajectory Tracking Control for Robotic Manipulator Based on Soft Actor–Critic and Generative Adversarial Imitation Learning

School of Computer Science and Technology, Dongguan University of Technology, Dongguan 523808, China
* Author to whom correspondence should be addressed.
Biomimetics 2024, 9(12), 779; https://doi.org/10.3390/biomimetics9120779
Submission received: 10 November 2024 / Revised: 15 December 2024 / Accepted: 18 December 2024 / Published: 21 December 2024
(This article belongs to the Special Issue Bio-Inspired Robotics and Applications)

Abstract

In this paper, a deep reinforcement learning (DRL) approach based on generative adversarial imitation learning (GAIL) and long short-term memory (LSTM) is proposed to resolve tracking control problems for robotic manipulators with saturation constraints and random disturbances, without learning the dynamic and kinematic model of the manipulator. Specifically, the joint torques and joint angles are limited to certain ranges. Firstly, in order to cope with the instability problem during training and obtain a stable policy, soft actor–critic (SAC) and LSTM are combined. The changing trends of joint positions over time are more comprehensively captured and understood by employing an LSTM architecture designed for robotic manipulator systems, thereby reducing instability during the training of robotic manipulators for tracking control tasks. Secondly, the policy obtained by SAC-LSTM is used as expert data for GAIL to learn a better control policy. This SAC-LSTM-GAIL (SL-GAIL) algorithm does not need to spend time exploring unknown environments and directly learns the control strategy from stable expert data. Finally, the simulation results demonstrate that the end-effector tracking task is effectively accomplished by the proposed SL-GAIL algorithm, and superior stability is exhibited in a test environment with interference compared with other algorithms.

1. Introduction

With artificial intelligence technology advancing and the concept of Industry 4.0 emerging, the development of robot manipulators has gained significant attention due to their numerous applications in various sectors [1]. The operational scenarios are no longer fixed, predefined, or well known [2]. Regarding trajectory tracking control for an end effector of a manipulator, it is essential for each position to track a desired trajectory closely [3]. Various methods have been proposed to achieve satisfactory tracking control, including PID control, sliding-mode control, adaptive tracking control, and model predictive control [4,5,6,7,8]. However, many of these methods require a precise model of the robot, and complex controllers often need more control parameters. Fortunately, neural networks demonstrate exceptional approximation capabilities for uncertain mathematical models and are among the effective methods for addressing nonlinear system control problems.
In [9], an adaptive asymptotic prescribed performance approach using a radial basis function neural network (RBFNN) is proposed for a hydraulic manipulator. An end-to-end (E2E) deep learning method is proposed for robot classification and real-time motion control in [10]. Through forward synchronous learning and an adjustable Rectified Linear Unit (ReLU), the stability and robustness of the deep learning algorithm are ensured. However, neural network training for tracking control tasks in robotic manipulators is susceptible to local optima. It is worth mentioning that the emergence of DRL expands the options for designing control algorithms [11]. DRL collects experience data through the agent's interaction with the environment, utilizing these data to optimize the loss function and train the model. Consequently, local-optimum problems can be handled more flexibly.
In [12], the fuzzy Q-learning network, a typical DRL scheme, is used to resolve the trajectory tracking problem of an Uncertain Quadrotor System (UQS). The state–action–reward–state–action (SARSA) algorithm is used to realize the positioning of random and fixed target points for the end effector of a 3-DoF manipulator in [13]. While Q-learning and SARSA are effective for addressing typical reinforcement learning (RL) tasks, they are not suitable for continuous spaces. It should be noted that trajectory tracking control problems typically involve continuous state spaces. Thus, in order to complete the tracking control task of pneumatic musculoskeletal robots, a model-based RL (MBRL) method is proposed in [14]. In [15], a model-based offline RL method is introduced to enhance the control performance, which is combined with a torque controller to implement tracking control for a manipulator. However, learning an effective control policy is more challenging with MBRL than with model-free RL (MFRL). MFRL algorithms do not depend on specific environment models, which enables them to perform control tasks effectively in dynamic and uncertain environments. In [16], a distributed Proximal Policy Optimization (PPO) algorithm [17] based on the LSTM network is proposed to train robotic arms and mobile robots to track a given trajectory. In order to control a 2-DoF manipulator with an unknown deadzone to track a desired trajectory, an actor–critic RL method is adopted in [18]. An end-to-end target tracking method utilizing a DRL approach is proposed to address the complexity of control in free-floating space manipulators (FFSMs) in [19]. The challenge of locating and tracking the eddy center in an uncharted environment using an Underactuated Autonomous Surface and Underwater Vehicle (UASUV) is explored in [20], in which the SAC algorithm and LSTM are combined. While MFRL is extensively studied in tracking control tasks, most existing algorithms necessitate numerous training epochs to derive reasonable control policies.
Fortunately, the introduction of GAIL alleviates the time-consuming training problem in many tasks [21]. GAIL learns the optimal policy directly from expert data by combining Inverse RL (IRL) [22] with Generative Adversarial Networks (GANs) [23]. In [24], a GAIL-based DRL method is introduced to train an Autonomous Underwater Vehicle (AUV) to emulate expert paths. Similarly, in [25], a GAIL-driven navigation system for Unmanned Surface Vehicles (USVs) is proposed, which learns a policy to mimic expert trajectories. The above-mentioned references [24,25] use the on-policy algorithms PPO and Trust Region Policy Optimization (TRPO) [26] as generators, respectively [27]. Although on-policy algorithms are stable, their low sample efficiency makes it challenging to learn a superior control policy during GAIL training. When the robot environment model is unknown, the exploration ability of the algorithm is crucial. Meanwhile, off-policy methods exhibit superior sample efficiency, but their training process is unstable [28]. Fortunately, by utilizing the LSTM network, the evolving patterns of joint positions over time are competently captured and comprehended, thereby mitigating instability during the training process of robotic manipulators for tracking control tasks. Therefore, the main focus of this paper is to embed LSTM into the SAC algorithm and utilize the resulting SAC-LSTM algorithm as a generator for GAIL training. This approach aims to address the trajectory tracking control problem for the end effector of manipulator systems under input saturation and random interference. The principal contributions are summarized as follows:
(1) Inspired by [20], the LSTM is introduced to enhance the stability of the end effector of the manipulator when tracking the target trajectory. During trajectory tracking control, the sequential states of the manipulator are inputted into the LSTM layer to produce hidden states. These hidden states are processed to generate Gaussian action parameters. This method effectively captures sequential dependencies, thereby improving the robustness and adaptability of trajectory tracking control.
(2) In the above-mentioned references [24,25], an on-policy algorithm is selected as the generator of GAIL. In this paper, the off-policy SAC algorithm combined with LSTM is chosen as the generator, and the policies trained by SAC-LSTM are selected as the expert data. Through GAIL training, the agent learns control strategies directly from expert data and is able to quickly learn strategies that are superior to the expert data.
(3) An SL-GAIL algorithm is proposed by combining the SAC-LSTM and GAIL methods. This algorithm is used to train the robot to track the target trajectory. This approach reduces unnecessary exploration, accelerates the acquisition of superior control strategies, and helps improve the efficiency and robustness of trajectory tracking for the end effector of the robotic manipulator.
The controlled objects and control methods used in the above literature are shown in Table 1.

2. System Description

2.1. Dynamics Model

The dynamic equations of the system can be derived from the Lagrangian formulation. The dynamic equation of an n-DoF manipulator is described as follows [29]:
$M(\theta)\ddot{\theta} + C(\theta,\dot{\theta})\dot{\theta} + G(\theta) = \tau + \tau_d$
Here, $\theta$, $\dot{\theta}$, and $\ddot{\theta} \in \mathbb{R}^n$ represent the joint position, joint velocity, and joint acceleration, respectively. $M(\theta) \in \mathbb{R}^{n \times n}$ is the inertia matrix, $C(\theta,\dot{\theta}) \in \mathbb{R}^{n \times n}$ represents the centripetal and Coriolis torques, $G(\theta) \in \mathbb{R}^n$ is the gravitational force, $\tau \in \mathbb{R}^n$ is the vector of joint torques, and $\tau_d \in \mathbb{R}^n$ is a random disturbance. In this paper, the Phantom Omni robot is studied. The robot is shown in Figure 1a, and its schematic diagram, in which the reference frames used in the dynamics are outlined, is shown in Figure 1b [30]. The robot is a 3-DoF manipulator modeled by Equation (1), and the M, C, and G matrices are as follows [31]:
$M(\theta) = \begin{bmatrix} m_{11} & 0 & 0 \\ 0 & m_{22} & m_{23} \\ 0 & m_{32} & m_{33} \end{bmatrix}, \quad G(\theta) = \begin{bmatrix} 0 \\ g k_5 c_2 + g k_6 c_{23} \\ g k_6 c_{23} \end{bmatrix}, \quad C(\theta,\dot{\theta}) = \begin{bmatrix} a_1 \dot{\theta}_2 & a_1 \dot{\theta}_1 & a_2 \dot{\theta}_1 \\ a_1 \dot{\theta}_1 & a_3 \dot{\theta}_3 & a_3 (\dot{\theta}_2 + \dot{\theta}_3) \\ a_2 \dot{\theta}_1 & a_3 \dot{\theta}_2 & 0 \end{bmatrix}$
where $m_{11} = k_1 + k_2 c_2^2 + k_3 c_{23}^2 + 2 k_4 c_2 c_{23}$, $m_{22} = k_2 + k_3 + 2 k_4 c_3$, $m_{23} = k_3 + k_4 c_3$, $m_{32} = m_{23}$, $m_{33} = k_3$, $a_1 = k_2 c_2 s_2 + k_3 c_{23} s_{23} + k_4 c_{2\times 23}$, $a_2 = k_3 c_{23} s_{23} + k_4 c_2 s_{23}$, $a_3 = k_4 s_3$, $s_i = \sin(\theta_i)$, $s_{23} = \sin(\theta_2 + \theta_3)$, $c_i = \cos(\theta_i)$, $c_{23} = \cos(\theta_2 + \theta_3)$, and $c_{2\times 23} = \cos(2\theta_2 + \theta_3)$.
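For readers who wish to reproduce the simulation environment, the sketch below shows one way Equation (1) can be integrated numerically, assuming the M, C, and G terms defined above are available as callables. The function names, the semi-implicit Euler scheme, and the step size are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def step_dynamics(theta, dtheta, tau, tau_d, M, C, G, dt=0.01):
    """One integration step of Equation (1):
    M(th) ddth + C(th, dth) dth + G(th) = tau + tau_d."""
    # Solve for the joint acceleration from the dynamic equation.
    ddtheta = np.linalg.solve(M(theta),
                              tau + tau_d - C(theta, dtheta) @ dtheta - G(theta))
    # Semi-implicit Euler update of velocity and position.
    dtheta_next = dtheta + dt * ddtheta
    theta_next = theta + dt * dtheta_next
    return theta_next, dtheta_next
```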

2.2. Kinematics Model

The real-time joint angle and joint velocity of the manipulator are obtained by solving Equation (1). Thus, the relationships between the end effector position vector and joint space vector are expressed as [32]
$x = f(\theta)$
$\dot{x} = J(\theta)\dot{\theta}$
where $f(\cdot): \mathbb{R}^n \rightarrow \mathbb{R}^n$ is the mapping between the joint space and the task space, and $J(\theta) \in \mathbb{R}^{n \times n}$ is the Jacobian matrix. According to forward kinematics, the Cartesian coordinates of the end effector are expressed as follows [32]:
$x = \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} \left( l_2 \cos(\theta_2) + l_3 \cos(\theta_2 + \theta_3) \right) \cos(\theta_1) \\ \left( l_2 \cos(\theta_2) + l_3 \cos(\theta_2 + \theta_3) \right) \sin(\theta_1) \\ l_2 \sin(\theta_2) + l_3 \sin(\theta_2 + \theta_3) \end{bmatrix}$
The Jacobian matrix, $J$, is expressed as [33]
$J = \begin{bmatrix} -r s_1 & -z c_1 & -l_3 c_1 s_{23} \\ r c_1 & -z s_1 & -l_3 s_1 s_{23} \\ 0 & r & l_3 c_{23} \end{bmatrix}$
where $r = \sqrt{x^2 + y^2}$. According to Equations (3) and (5), $\dot{x}$ is obtained as follows:
$\dot{x} = \begin{bmatrix} \dot{x} \\ \dot{y} \\ \dot{z} \end{bmatrix} = \begin{bmatrix} -r s_1 & -z c_1 & -l_3 c_1 s_{23} \\ r c_1 & -z s_1 & -l_3 s_1 s_{23} \\ 0 & r & l_3 c_{23} \end{bmatrix} \begin{bmatrix} \dot{\theta}_1 \\ \dot{\theta}_2 \\ \dot{\theta}_3 \end{bmatrix}$
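As a concrete illustration of Equations (4)–(6), the following sketch evaluates the forward kinematics and the Jacobian using the link lengths from Table A1; the helper names are illustrative, and the signs follow the Jacobian as reconstructed above.

```python
import numpy as np

L2, L3 = 0.135, 0.13  # link lengths l2, l3 from Table A1 (meters)

def forward_kinematics(theta):
    """End-effector position of Equation (4)."""
    t1, t2, t3 = theta
    reach = L2 * np.cos(t2) + L3 * np.cos(t2 + t3)
    return np.array([reach * np.cos(t1),
                     reach * np.sin(t1),
                     L2 * np.sin(t2) + L3 * np.sin(t2 + t3)])

def jacobian(theta):
    """Jacobian of Equation (5), so that x_dot = J(theta) @ theta_dot."""
    t1, t2, t3 = theta
    x, y, z = forward_kinematics(theta)
    r = np.hypot(x, y)
    s1, c1 = np.sin(t1), np.cos(t1)
    s23, c23 = np.sin(t2 + t3), np.cos(t2 + t3)
    return np.array([[-r * s1, -z * c1, -L3 * c1 * s23],
                     [ r * c1, -z * s1, -L3 * s1 * s23],
                     [ 0.0,     r,       L3 * c23     ]])
```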

2.3. Control Objective

The kinematics of a robotic manipulator is generally nonlinear, and the parameters of the manipulator system are not completely known, which results in modeling uncertainties. In addition, this paper addresses the task of controlling the joint angles by applying torques to the robot arm's joints, subsequently enabling the end effector to track a time-varying curve in the task space, which adds to the complexity and difficulty of the control process. In this paper, the control goal is to design an RL trajectory tracking controller that solves the time-varying curve tracking control problem in the task space of a manipulator with saturation constraints on the joint angles and torque inputs.

3. Preliminaries

3.1. Reinforcement Learning

An RL problem is typically represented as a Markov Decision Process (MDP). An MDP is defined by a tuple D = (S, A, P, R), where S denotes the set of states that the agent might encounter, A signifies the set of actions that the agent chooses from at any given state, P defines the transition dynamics, specifying the probability of moving from one state, $s$, to another state, $s'$, given a specific action, and R assigns rewards for actions taken in a state and transitioning to another state.
An RL agent aims to study a policy that maps states to actions based on interactions with its environment. This learning process is illustrated in Figure 2. The agent perceives its current state, and chooses and executes an action based on its policy at each time step. Then the environment transitions to the next state according to the action, and provides a reward. The policy is updated based on the received rewards to maximize the long-term cumulative reward. The main goal is for the agent to train an optimal control policy that maximizes the cumulative discounted reward [11].
When an RL task conforms to the characteristics of an MDP, the optimization of the task can be approached by resolving the Bellman optimality equations. The optimal state value function, V * ( s ) , which maximizes the expected cumulative rewards, satisfies the following Bellman optimality equation [34]:
$V^*(s) = \max_{a} \sum_{s'} P_{ss'}^{a} \left[ R_{ss'}^{a} + \gamma V^*(s') \right]$
The Bellman optimality equation for the state-action value function, Q * ( s , a ) , is given by:
$Q^*(s,a) = \sum_{s'} P_{ss'}^{a} \left[ R_{ss'}^{a} + \gamma \max_{a'} Q^*(s',a') \right]$
By learning the optimal value function, an agent is able to determine the optimal policy by choosing actions that maximize this value function.
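To make the role of the Bellman optimality equations concrete, the toy sketch below performs value iteration on a small tabular MDP; it is a generic illustration under the assumed array layout, not part of the proposed controller.

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-8):
    """Solve the Bellman optimality equations by repeated backups.
    P[a, s, s'] is the transition probability and R[a, s, s'] the reward."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q*(s, a) = sum_s' P(s'|s, a) [R(s, a, s') + gamma V*(s')]
        Q = np.einsum("asn,asn->as", P, R + gamma * V[None, None, :])
        V_new = Q.max(axis=0)  # V*(s) = max_a Q*(s, a)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q
        V = V_new
```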

3.2. Soft Actor–Critic Algorithm

SAC is a representative actor–critic algorithm with good sample efficiency. The goal of an ordinary RL algorithm is to maximize the expected return $\sum_{t}^{T} \mathbb{E}_{(s_t,a_t)\sim\pi}[r(s_t,a_t)]$, while SAC is based on the maximum entropy RL framework, which adds the expected entropy of the policy to the original RL return as the overall optimization objective [35]:
$\pi^* = \arg\max_{\pi} \sum_{t}^{T} \mathbb{E}_{(s_t,a_t)\sim\pi}\left[ r(s_t,a_t) + \alpha H(\pi(\cdot|s_t)) \right]$
where $H(\pi(\cdot|s_t)) = -\sum_{a_t} \pi(a_t|s_t) \log \pi(a_t|s_t)$ represents the policy entropy in state $s_t$, and $\alpha \ge 0$ is a temperature parameter that balances the entropy term against the reward.
SAC utilizes five neural networks: two soft Q-value networks, two target Q-value networks, and a policy network. The soft Q-value network updates follow Equation (8). For a given Q-value network parameterized by $\theta_i$, the loss function is:
$L_Q(\theta_i) = \mathbb{E}_{(s,a,r,s',d)\sim D}\left[ \frac{1}{2}\left( Q_{\theta_i}(s,a) - y(r,s',d) \right)^2 \right]$
where $y(r,s',d)$ is described as:
$y(r,s',d) = r + \gamma (1-d) \left( \min_{i=1,2} Q_{\bar{\theta}_i}(s',\tilde{a}') - \alpha \log \pi_\phi(\tilde{a}'|s') \right), \quad \tilde{a}' \sim \pi_\phi(\cdot|s')$
SAC employs the smaller value from the two target Q-value networks to ensure stable training. The Q-value networks are updated using stochastic gradient descent:
$\nabla_{\theta_i} L_Q(\theta_i) = \nabla_{\theta_i} Q_{\theta_i}(s,a)\left( Q_{\theta_i}(s,a) - y(r,s',d) \right)$
The policy network, $\pi_\phi(a|s)$, is updated directly by maximizing the soft Q-value under the current policy, i.e., by minimizing the following objective:
$J_\pi(\phi) = \mathbb{E}_{s \sim D}\, \mathbb{E}_{a \sim \pi_\phi}\left[ \alpha \log \pi_\phi(a|s) - Q_{\theta_i}(s,a) \right]$
SAC uses the reparameterization trick to sample actions $a = f_\phi(\xi_a; s)$, where $\xi_a \in \mathbb{R}^{\dim(A)}$ is drawn from a fixed Gaussian distribution and $f_\phi$ represents the reparameterized policy network. The policy network is then updated using stochastic gradient descent:
$\nabla_\phi J_\pi(\phi) = \nabla_\phi \alpha \log \pi_\phi(a|s) + \left( \nabla_a \alpha \log \pi_\phi(a|s) - \nabla_a Q_{\theta_i}(s,a) \right) \nabla_\phi f_\phi(\xi_a; s)$
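The following PyTorch-style sketch summarizes how the critic target, the Q losses, and the policy loss above fit together for one minibatch. It assumes actor and critic modules with the indicated interfaces (e.g., an actor `sample` method returning an action and its log-probability); these names are illustrative rather than the authors' implementation.

```python
import torch

def sac_losses(batch, actor, q1, q2, q1_targ, q2_targ, alpha, gamma):
    """Compute SAC critic and actor losses for one minibatch (sketch)."""
    s, a, r, s2, d = batch  # states, actions, rewards, next states, done flags

    # Critic target y(r, s', d): bootstrap with the smaller target Q minus the entropy term.
    with torch.no_grad():
        a2, logp_a2 = actor.sample(s2)                 # a' ~ pi_phi(.|s') and its log-probability
        q_targ = torch.min(q1_targ(s2, a2), q2_targ(s2, a2))
        y = r + gamma * (1.0 - d) * (q_targ - alpha * logp_a2)

    # Q losses: mean squared error to the target.
    loss_q1 = ((q1(s, a) - y) ** 2).mean()
    loss_q2 = ((q2(s, a) - y) ** 2).mean()

    # Policy loss: maximize Q - alpha*log pi, i.e. minimize alpha*log pi - Q.
    a_new, logp_a_new = actor.sample(s)                # reparameterized sample
    q_pi = torch.min(q1(s, a_new), q2(s, a_new))
    loss_pi = (alpha * logp_a_new - q_pi).mean()
    return loss_q1, loss_q2, loss_pi
```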

3.3. Long Short-Term Memory

Each cell in an LSTM network is composed of three integral components: the forget gate, the input gate, and the output gate. The forget gate controls which information from the previous cell state should be discarded. It is defined as follows [36]:
$f_t = \sigma\left( W_f \cdot [h_{t-1}, x_t] + b_f \right)$
The input gate decides which values are updated and stored in the cell state. This process is broken down into two parts: a candidate value, $\tilde{C}_t$, is created using the tanh function, and the input gate, $\iota_t$, controls the extent to which this candidate value is incorporated into the cell state:
$\iota_t = \sigma\left( W_i \cdot [h_{t-1}, x_t] + b_i \right), \quad \tilde{C}_t = \tanh\left( W_C \cdot [h_{t-1}, x_t] + b_C \right)$
The cell state is then updated as $C_t = f_t \odot C_{t-1} + \iota_t \odot \tilde{C}_t$.
The final component is the output gate, which determines the output of the LSTM cell. This gate combines the current cell state with the output gate activation to produce the final hidden state:
$o_t = \sigma\left( W_o \cdot [h_{t-1}, x_t] + b_o \right), \quad h_t = o_t \odot \tanh(C_t)$
In the above, $\sigma$ is the sigmoid function, $W$ and $b$ represent the weight matrices and bias terms, $h_{t-1}$ is the previous hidden state, $x_t$ is the current input, $[\cdot,\cdot]$ denotes the concatenation operation, $\tilde{C}_t$ is the candidate cell state, and $C_t$ is the cell state. The interaction of these gates enables the LSTM unit to capture long-term dependencies, making it effective for sequential data processing.
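A minimal NumPy transcription of one LSTM step, following the gate equations above; the weight containers `W` and `b` are illustrative placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W and b hold the weights/biases for the forget (f),
    input (i), candidate (C), and output (o) transforms."""
    z = np.concatenate([h_prev, x_t])           # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])           # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])           # input gate
    c_hat = np.tanh(W["C"] @ z + b["C"])         # candidate cell state
    c_t = f_t * c_prev + i_t * c_hat             # cell state update
    o_t = sigmoid(W["o"] @ z + b["o"])           # output gate
    h_t = o_t * np.tanh(c_t)                     # hidden state
    return h_t, c_t
```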

3.4. Generative Adversarial Imitation Learning

GAIL enhances traditional imitation learning by incorporating principles from maximum entropy inverse reinforcement learning. It combines imitation learning with Generative Adversarial Networks (GANs) to directly learn the optimal policy that makes the distribution of the state–action pairs of the agent closely match those of expert trajectories. In GAIL, the role of the discriminator is to differentiate between the state–action pairs generated by the agent and those from the expert. Meanwhile, the generator, which represents the policy of the agent, is trained to produce state–action pairs that the discriminator classifies as expert-like. This adversarial process aims to find a saddle point ( π , D) in the following objective function [37]:
$\mathbb{E}_\pi[\log D(s,a)] + \mathbb{E}_{\pi_E}[\log(1 - D(s,a))] - \lambda H(\pi)$
where $D(s,a)$ represents the discriminator in GAIL. $\mathbb{E}_\pi[\log D(s,a)]$ is the expectation over the policy of the agent, encouraging the generator to produce state–action pairs that the discriminator classifies as expert-like. $\mathbb{E}_{\pi_E}[\log(1 - D(s,a))]$ is the expectation over the policy of the expert, ensuring that the discriminator correctly identifies expert state–action pairs. $H(\pi) \triangleq \mathbb{E}_\pi[-\log \pi(a|s)]$ denotes the entropy of the policy, $\pi$, which encourages exploration and prevents the policy from collapsing to deterministic behaviors. $\lambda$ is the weighting parameter that determines the significance of the entropy term in the overall objective.

4. Control Design

4.1. Design of System Input/Output

In deep RL, the actor network determines an output action based on the current input state. For the trajectory tracking in the task space of the manipulator, the joint angles of the manipulator and the position of the end effector are crucial. The traditional state space includes joint angles and joint angular velocities, which can describe the internal state of the robot, but cannot directly reflect the relationship between the performance of the end effector of the robot and the target. Therefore, adding the mapped end effector position and velocity, as well as the target position and velocity, can provide more relevant information about the end effector and target trajectory, thereby helping the reinforcement learning algorithm to control more accurately. Furthermore, to guarantee that the control policy is able to obtain effective tracking control, there must be error information for trajectory tracking in the state space [38]. In this paper, the primary objective of the controller is to determine the appropriate actions according to the real-time status information of the robotic manipulator to achieve effective tracking control. The state space is described as follows:
$s = \begin{bmatrix} \theta & \dot{\theta} & x & \dot{x} & x_d & \dot{x}_d & e & \dot{e} \end{bmatrix}^T$
where $x$ and $\dot{x}$ are the end-effector position and velocity, respectively, and $x_d$ and $\dot{x}_d$ are the target end-effector position and velocity. $e = x - x_d$ represents the tracking error, and $\dot{e}$ is the velocity error. Additionally, saturation constraints are imposed on the manipulator joints to constrain the swing amplitude when the manipulator tracks a target curve. The limitations are as follows:
$q_i = \begin{cases} q_{\min}, & \text{if } q_i \le q_{\min} \\ q_i, & \text{if } q_{\min} < q_i < q_{\max} \\ q_{\max}, & \text{if } q_i \ge q_{\max} \end{cases}, \quad i = 1, 2, \ldots, n$
where $q_{\max} = \pi$ and $q_{\min} = -\pi$.
In trajectory tracking control of the robot manipulator end effector, the controller's output signals are the torques applied to the joints of the manipulator. The action space is as follows:
$a = \tau^T$
where
$\tau_i = \begin{cases} \tau_{\min}, & \text{if } \tau_i \le \tau_{\min} \\ \tau_i, & \text{if } \tau_{\min} < \tau_i < \tau_{\max} \\ \tau_{\max}, & \text{if } \tau_i \ge \tau_{\max} \end{cases}, \quad i = 1, 2, \ldots, n$
Here, $\tau_{\max}$ is the maximum input torque, $\tau_{\min}$ is the minimum input torque, and $\tau_{\max} = 5$, $\tau_{\min} = -5$.
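In practice, the saturation constraints above amount to element-wise clipping of the joint angles and torques, for example (a sketch, with the bounds reconstructed as symmetric):

```python
import numpy as np

Q_MAX, TAU_MAX = np.pi, 5.0  # saturation bounds from the constraints above

def saturate(q, tau):
    """Apply the joint-angle and torque saturation constraints element-wise."""
    return np.clip(q, -Q_MAX, Q_MAX), np.clip(tau, -TAU_MAX, TAU_MAX)
```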
In RL tasks, the reward is a measure used to assess the effectiveness of the current policy. In the trajectory tracking task of the manipulator, the tracking error, $e(t)$, is the variable of most concern. Therefore, the reward function is set to a nonlinear form and is formulated as:
$r = \dfrac{1}{1 + \exp\left( \varsigma \cdot \left( \bar{e} - \epsilon \right) \right)}$
where $\varsigma$ denotes a sensitivity scale that adjusts the rate of increase, enabling the agent to quickly obtain rewards once a certain performance level is reached, and $\epsilon$ represents a benefit threshold used to indicate the bound of the rewards [39]. In this paper, $\varsigma = 2$ and $\epsilon = 0.02$.
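A minimal sketch of the reward computation as reconstructed above, with $\bar{e}$ taken as the tracking-error magnitude; the function signature is illustrative.

```python
import numpy as np

def reward(e_bar, varsigma=2.0, epsilon=0.02):
    """Sigmoid-shaped reward: rises quickly once the tracking-error
    magnitude e_bar drops below the threshold epsilon."""
    return 1.0 / (1.0 + np.exp(varsigma * (e_bar - epsilon)))
```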

4.2. Control Algorithm Based on SAC-LSTM

The actor–critic architecture is employed in most RL algorithms to effectively address policy gradient-based problems. The actor network will eventually train a policy, π , which is the controller required for the tracking task. This policy is refined employing the policy gradient approach under the guidance of the critic network. Concurrently, the critic network needs to be trained to precisely assess the outputs generated by the actor network.
In this paper, the SAC-LSTM employs the actor–critic architecture, including the policy network and critic network, as detailed in Algorithm 1. The network structure is depicted in Figure 3. The actor network includes an LSTM layer, a fully connected layer, and an output layer. The state sequence of each time step of the robot manipulator in the trajectory tracking task is used as the input of the LSTM, and the hidden state, h, of each time step is used as the output. Then, h at each time step in the entire trajectory is used as the input of the fully connected layer. After passing through the fully connected layer, the Gaussian distribution parameters of the action (torque) are output. During training, the action values are sampled from this distribution, ensuring that the actions of the agent are chosen in a random and exploratory manner. As training progresses, the network converges and the variance decreases gradually. The value network includes a fully connected layer and an output layer, taking the concatenation of state sequences and actions as input. The network output layer outputs a single variable, the value of the state sequence, $V_\phi(S)$.
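The sketch below illustrates the actor structure described above (state sequence → LSTM → fully connected layer → Gaussian torque parameters) in PyTorch. The state dimension (8 quantities × 3 joints) and the LSTM hidden size and layer count follow Section 4.1 and Table A2, while the fully connected width and other details are illustrative assumptions, not the authors' exact network.

```python
import torch
import torch.nn as nn

class LSTMActor(nn.Module):
    """Actor sketch: state sequence -> LSTM -> FC -> Gaussian torque parameters."""
    def __init__(self, state_dim=24, action_dim=3, hidden=128, fc=256):
        super().__init__()
        self.lstm = nn.LSTM(state_dim, hidden, num_layers=1, batch_first=True)
        self.fc = nn.Sequential(nn.Linear(hidden, fc), nn.ReLU())
        self.mu = nn.Linear(fc, action_dim)       # mean of the torque distribution
        self.log_std = nn.Linear(fc, action_dim)  # log standard deviation

    def forward(self, state_seq):
        # state_seq: (batch, T, state_dim); keep the hidden state of every time step.
        h, _ = self.lstm(state_seq)
        z = self.fc(h)
        mu, log_std = self.mu(z), self.log_std(z).clamp(-20, 2)
        return mu, log_std

# Usage: sample exploratory torques for a batch of trajectories.
actor = LSTMActor()
mu, log_std = actor(torch.randn(8, 50, 24))                   # 8 trajectories of 50 steps
a = torch.distributions.Normal(mu, log_std.exp()).rsample()   # reparameterized sample
```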
Algorithm 1 SAC-LSTM
Input: Policy network parameters $\phi$, Q-network parameters $\theta_1$, $\theta_2$
Input: Empty replay buffer $D$
Input: Target network weights $\bar{\theta}_1 \leftarrow \theta_1$, $\bar{\theta}_2 \leftarrow \theta_2$
Output: Optimized parameters
1: for each iteration do
2:     for each interaction with the environment do
3:         Get $s$ and choose $a \sim \pi_\phi(\cdot|s)$
4:         Perform action $a$
5:         Record $s'$, reward $r$, and termination signal $d$
6:         Save $(s, a, r, s', d)$ in replay buffer $D$
7:         if termination signal $d$ is terminal then
8:             Reset state of environment
9:         end if
10:    end for
11:    for each gradient step do
12:        Sample a batch of data $B = (s, a, r, s', d)$ from $D$
13:        Calculate targets for the Q networks following (15)
14:        Update the Q-network parameters by gradient descent following (16)
15:        Update the policy parameters by gradient descent following (18)
16:        Update the target network parameters with $\bar{\theta}_i \leftarrow \rho \bar{\theta}_i + (1-\rho)\theta_i$ for $i = 1, 2$
17:    end for
18: end for
Remark 1. 
Inspired by [20], this paper employs an LSTM network, which is able to efficiently capture sequence information. It generates current outputs by integrating evolving trends from past data, which is more effective for tasks with periodic changes such as robotic manipulator trajectory tracking. In this paper, retaining the outputs of all time steps from the LSTM and passing them to the fully connected layer offers significant advantages for tracking the trajectory of a robotic manipulator. Firstly, this approach fully utilizes the information from the entire sequence, ensuring that the state at each time step is considered, thereby enhancing the precision of trajectory tracking. Secondly, it helps capture long-term dependencies, enabling the model to better handle complex temporal dependencies. Thirdly, using information from the entire sequence can make the control output smoother, reducing jitter and abrupt changes in the robotic arm's movement. Additionally, this method improves the model's robustness when dealing with dynamic environments or uncertain external disturbances, enhancing its ability to adapt and respond to changes.

4.3. GAIL for Robot Manipulator Trajectory Tracking

GAIL is able to learn control policies directly from expert demonstration data, which effectively solves the time-consuming training problem of reinforcement learning. The SAC-LSTM algorithm is used as the generator to train GAIL, implementing SL-GAIL for trajectory tracking control of robotic manipulators, as shown in Figure 4. The discriminator of GAIL is composed of a fully connected layer followed by an output layer. The concatenated state–action pair is used as the input of the discriminator, and the output is a value between 0 and 1. The robotic manipulator agent interacts with the environment employing the current policy $\pi(a|s; \phi)$, which serves as the generator. This interaction generates a trajectory $\Omega = [s_1, a_1, s_2, a_2, \ldots, s_n, a_n]$ consisting of state–action pairs. For each state–action pair $(s, a)$, the discriminator outputs a value, $D(s,a)$, to determine whether it originates from the expert demonstration or from the policy of the agent. Ideally, $D(s,a)$ should be close to 1 for expert trajectories and close to 0 for agent-generated trajectories. The generator (the policy of the agent) is trained employing the SAC-LSTM algorithm to produce state–action pairs that closely match the expert data, thereby fooling the discriminator. The discriminator, conversely, is designed to differentiate between the expert and agent-generated trajectories. Eventually, the generator is trained to maximize $\mathbb{E}_\pi[\log D(s,a)]$ and the discriminator is trained to minimize $\mathbb{E}_\pi[\log D(s,a)] + \mathbb{E}_{\pi_E}[\log(1 - D(s,a))]$. GAIL uses a surrogate reward:
$r(s,a) = \log\left( D(s,a) \right)$
This surrogate reward is used to update the policy of the agent. The ultimate objective is learning an optimal control policy that maximizes the expected cumulative discounted return:
$\pi^* = \arg\max_{\pi \in \Pi} \mathbb{E}_\pi[r(s,a)]$
where $\mathbb{E}_\pi[r(s,a)] \triangleq \mathbb{E}\left[ \sum_{t}^{T} \gamma^t r(s_t,a_t) \right]$ represents the expected discounted return of the trajectory generated by policy $\pi$.
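A PyTorch-style sketch of the discriminator, its loss, and the surrogate reward as described above ($D$ close to 1 on expert pairs, $r(s,a) = \log D(s,a)$); the network sizes and function names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """GAIL discriminator sketch: concatenated (s, a) -> value in (0, 1)."""
    def __init__(self, state_dim=24, action_dim=3, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def discriminator_loss(disc, agent_sa, expert_sa, eps=1e-8):
    # Minimize E_pi[log D] + E_piE[log(1 - D)]:
    # pushes D toward 0 on agent pairs and toward 1 on expert pairs.
    d_agent = disc(*agent_sa)
    d_expert = disc(*expert_sa)
    return torch.log(d_agent + eps).mean() + torch.log(1.0 - d_expert + eps).mean()

def surrogate_reward(disc, s, a, eps=1e-8):
    # r(s, a) = log D(s, a); larger when the pair looks expert-like.
    with torch.no_grad():
        return torch.log(disc(s, a) + eps)
```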
This approach, detailed in Algorithm 2, ensures that the policy of the agent evolves to closely mimic expert behavior while efficiently handling the challenges of trajectory tracking control in robotic manipulators.
Algorithm 2 SL-GAIL
Input: Expert demonstration data $D_E$, policy network parameters $\phi$, discriminator network parameters $\omega$
Output: Optimal policy parameters $\phi$
1: Initialize policy network parameters $\phi$
2: Initialize discriminator network parameters $\omega$
3: for each training iteration do
4:     Sample expert batch $B_E$ from $D_E$
5:     Generate trajectory $\Omega$ using current policy $\pi(a|s; \phi)$
6:     for each state–action pair $(s, a)$ in $\Omega$ do
7:         Compute discriminator output $D(s, a)$
8:         Compute surrogate reward by Equation (24)
9:     end for
10:    Update $\phi$ using surrogate reward $r(s, a)$
11:    Sample state–action pairs $(s, a)$ from $\Omega$ and $B_E$
12:    Update discriminator parameters $\omega$ using Equation (18)
13: end for
Remark 2. 
Different from the above-mentioned references [24,25], this paper employs the off-policy algorithm SAC as the generator for GAIL. The policy entropy regularization of the SAC algorithm enhances the sampling capability and generates smoother, continuous action outputs, which is better suited to systems such as robotic manipulators. In order to address the instability of the off-policy algorithm when training the GAIL policy, the LSTM network is added to the SAC algorithm. LSTM effectively captures the long-term dependencies and dynamic patterns in the sequence data, thereby helping the generator to better predict continuous action sequences, maintain the temporal coherence of the generated samples, and enhance the model's adaptability to dynamic changes in the environment, which improves the stability of the generator in the GAIL framework.

5. Simulation

In this section, simulation experiments are performed in the simulation environment described in Section 2 to verify the capability of the proposed method. The SAC, SAC-LSTM, and SAC-GAIL algorithms are chosen as benchmarks against which to evaluate our SL-GAIL algorithm. Meanwhile, the anti-interference capabilities of the four controllers are evaluated. All experiments are conducted on the Phantom Omni robotic manipulator. The initial state values are defined as $\theta_1(0) = 0.3$, $\theta_2(0) = 0.3$, and $\theta_3(0) = 0.8$. The total number of steps is 3000, with a step size of 0.01 s.
In all the simulation experiments, $x_d$ is selected as
$x_d = \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 0.1\sin(t) + 0.12 \\ 0.1\cos(t) + 0.12 \\ 0.1\sin(t) \end{bmatrix}$
where $t \in (0, t_{limit})$ and $t_{limit} = 30$ s.

5.1. Training Performance

In this experiment, the performance of the SAC-LSTM, SAC-GAIL, and SL-GAIL algorithms is evaluated during the training process. The detailed parameter settings of all algorithms are shown in Appendix A.
The trends in the cumulative reward obtained by the three algorithms, along with the standard deviation and variance of the cumulative reward, are shown in Figure 5. The results show that the cumulative reward of the SL-GAIL agent tends to converge around the 20th episode, after which the learning curve remains smooth. However, the SAC-LSTM and SAC-GAIL agents do not converge until after the 100th episode. At the same time, the variance and standard deviation of SL-GAIL are in a stable state after the 100th episode, whereas those of SAC-LSTM and SAC-GAIL fluctuate considerably throughout training.
Moreover, compared with SAC-GAIL, the SAC-LSTM algorithm is more stable during training. Therefore, when using RL for robotic manipulator tracking control tasks, LSTM enhances the stability of agent learning, while GAIL improves its efficiency.

5.2. Control Performance

The tracking control performance of SL-GAIL is compared with that of SAC, SAC-LSTM, and SAC-GAIL. Figure 6 shows the tracking trajectories, tracking errors, and joint torques obtained with SAC, SAC-GAIL, SAC-LSTM, and SL-GAIL during the control task. Figure 6a–c show the trajectories of the SAC, SAC-GAIL, SAC-LSTM, and SL-GAIL agents along the X-, Y-, and Z-axes, and Figure 6d–f show the corresponding tracking errors of the end effector. It can be seen that the SL-GAIL algorithm achieves superior tracking control performance compared with the baseline methods. Figure 6g–i show the torque trends of the three joints; the torques of the robotic manipulator controlled by SL-GAIL are smaller than those of SAC, SAC-GAIL, and SAC-LSTM.
As can be seen from this experiment, SL-GAIL combines the strengths of GAIL and LSTM and quickly learns policies that surpass those derived from the stable expert data. As a result, it achieves excellent control performance with lower torque requirements. The average absolute tracking error (AAE), the average absolute torque (AAT), and the root-mean-squared error (RMSE) are shown in Table 2. They are calculated as follows:
$\mathrm{RMSE} = \sqrt{\dfrac{1}{T}\sum_{i=1}^{T} e_i^2}$
$\mathrm{AAE} = \dfrac{1}{T}\sum_{i=1}^{T} |e_i|$
$\mathrm{AAT} = \dfrac{1}{T}\sum_{i=1}^{T} |\tau_i|$
where $T$ is the total number of time steps, and the metrics are computed separately for each axis and each joint.
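For reference, the three metrics can be computed from the logged error and torque trajectories as in the following sketch; the array shapes are assumptions.

```python
import numpy as np

def tracking_metrics(e, tau):
    """AAE, AAT, and RMSE over a trajectory; e and tau have shape (T, n)."""
    aae = np.mean(np.abs(e), axis=0)          # average absolute tracking error per axis
    aat = np.mean(np.abs(tau), axis=0)        # average absolute torque per joint
    rmse = np.sqrt(np.mean(e ** 2, axis=0))   # root-mean-squared error per axis
    return aae, aat, rmse
```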

5.3. Anti-Interference Performance

In this simulation experiment, the proposed method is tested for its anti-interference capability. The desired trajectory $x_d$ is the same as in Experiment 1, and the external disturbance, $\tau_d$, is selected as:
$\tau_d = \begin{cases} \begin{bmatrix} 5\sin(t) & 5\cos(t) & 5\sin(t) \end{bmatrix}^T, & a > 0.8 \\ 0, & \text{otherwise} \end{cases}$
where a represents a random variable uniformly distributed in the interval [0.0, 1.0). If this value exceeds 0.80, a nonlinear disturbance term is added to the system input signal, which means that the perturbation has a twenty percent probability of occurring at each time step. The nonlinear disturbance term is a 3-dimensional column vector composed of the sine and cosine values of the current time step, multiplied by a scaling factor (5). By introducing random disturbances, the robustness of the controller in uncertain environments can be tested. In addition, introducing perturbations in the form of sine and cosine curves can simulate many nonlinear effects in mechanical systems and improve the accuracy of the model.
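The disturbance described above can be generated per time step as in the following sketch; the random number generator handling is an illustrative choice.

```python
import numpy as np

def disturbance(t, rng):
    """Random external torque disturbance: a 3D sinusoidal term applied
    with roughly 20% probability at each time step."""
    a = rng.uniform(0.0, 1.0)                  # uniform random variable in [0, 1)
    if a > 0.8:
        return 5.0 * np.array([np.sin(t), np.cos(t), np.sin(t)])
    return np.zeros(3)

# Usage: rng = np.random.default_rng(); tau_d = disturbance(t, rng)
```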
In order to evaluate the anti-interference capability of SL-GAIL, we saved the optimal control policies trained with SAC, SAC-LSTM, SAC-GAIL, and SL-GAIL in the curve-following task; the policies of the first three agents are compared with that of the SL-GAIL algorithm. Figure 7 shows the position tracking of the four algorithms in a disturbance-free environment (DFENV) and a disturbed environment (DENV). The first, second, and third rows are the X, Y, and Z positions of the four algorithms, respectively. It is noted that, compared with the other three algorithms, the position tracking of the SL-GAIL algorithm changes very little between the two environments. The tracking errors and torques of the four algorithms in the DENV are shown in Figure 8. It can be seen that the torques using SL-GAIL are still smaller than those using SAC, SAC-LSTM, and SAC-GAIL. Figure 9 presents the AAE and RMSE of all algorithms in the DFENV and DENV. The AAE and RMSE of SL-GAIL show almost no change between the two environments, whereas those of SAC, SAC-LSTM, and SAC-GAIL change significantly. The changes in AAE under the two environments are shown in Table 3, and the changes in RMSE are shown in Table 4.
From this experiment, it can be seen that, in the robotic arm environment with the same joint angle constraints and maximum torque limits, the anti-interference ability of the proposed SL-GAIL controller is stronger than that of the three other controllers. This indicates again that GAIL and LSTM play an important role in enhancing the robustness of the algorithm.

6. Conclusions

In this paper, we addressed the challenge of trajectory tracking in task space for the end effector of the Phantom Omni manipulator. We proposed the SAC-LSTM algorithm to enhance the performance of the control system, particularly in adapting to time-varying trajectories. The integration of long short-term memory (LSTM) allowed the system to effectively capture temporal dependencies in the trajectory, thereby improving the robustness and adaptability of the controller in dynamic environments. By combining SAC-LSTM with generative adversarial imitation learning (GAIL), the SL-GAIL method is able to directly learn from expert demonstrations, significantly improving the efficiency of policy learning and reducing training time. The simulation results demonstrate that the SL-GAIL method not only achieves faster learning but also enhances the robustness of the control system compared with the baseline algorithms. The experimental outcomes highlight the effectiveness of combining reinforcement learning with imitation learning to solve real-world robotic control problems, suggesting that the proposed approach is a promising solution for time-sensitive and high-precision tasks in robotic manipulation.

Author Contributions

Conceptualization, J.H. and F.W.; Methodology, F.W.; Software, J.H.; Validation, J.H.; Formal analysis, Y.Q.; Resources, X.L. and Y.Q.; Writing original draft, J.H.; Writing review and editing, J.H. and F.W.; Project administration, F.G. and M.J.; Funding acquisition, F.W. and X.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported in part by National Natural Science Foundation of China under grant no. 62203116, 62103106, 62273095, in part by GuangDong Basic and Applied Basic Research Foundation 2024A1515010222, in part by Characteristic Innovation Foundation of Guangdong Education Department under grant 2022ktscx138, 2022ZDZX1031, in part by Liaoning Natural Science Foundation (2022-KF-21-06), in part by Dongguan Science and Technology of Social Development Program under grant no. 20231800935882, SSL Sci-tech Commissioner Program (20234430-01KCJ-G, 20234371-01KCJ-G).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Acknowledgments

This research was supported in part by National Natural Science Foundation of China under grant no. 62203116, 62103106, 62273095, in part by GuangDong Basic and Applied Basic Research Foundation 2024A1515010222, in part by Characteristic Innovation Foundation of Guangdong Education Department under grant 2022ktscx138, 2022ZDZX1031, in part by Liaoning Natural Science Foundation (2022-KF-21-06), in part by Dongguan Science and Technology of Social Development Program under grant no. 20231800935882, SSL Sci-tech Commissioner Program (20234430-01KCJ-G, 20234371-01KCJ-G).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Table A1. Parameter values of the manipulator.
Parameter | Value | Parameter | Value
$g$ | 9.806 m/s² | $l_1$ | 0 m
$l_2$ | 0.135 m | $l_3$ | 0.13 m
$k_1$ | $3.7 \times 10^{-3}$ kg·m² | $k_2$ | $7.0 \times 10^{-3}$ kg·m²
$k_3$ | $8.0 \times 10^{-3}$ kg·m² | $k_4$ | $0.4 \times 10^{-3}$ kg·m²
$k_5$ | $9.1 \times 10^{-3}$ kg·m² | $k_6$ | $5.2 \times 10^{-3}$ kg·m²
Table A2. Parameter values of the algorithms.
Parameter | Value | Parameter | Value
Learning rate of actor | 0.0003 | Learning rate of critic | 0.003
Learning rate of discriminator | 0.001 | Learning rate of $\alpha$ | 0.0003
Soft update parameter | 0.005 | Buffer size | 1,000,000
Minimal size | 5000 | Batch size | 256
LSTM hidden size | 128 | LSTM number of layers | 1
Table A3. Symbol descriptions for the reinforcement learning and control algorithms.
Symbol | Description
$V^*(s)$ | The optimal state-value function, representing the maximum expected return at state $s$.
$\pi(s,a)$ | The policy probability of selecting action $a$ at state $s$.
$P_{ss'}^{a}$ | The transition probability from state $s$ to state $s'$ under action $a$.
$R_{ss'}^{a}$ | The reward for transitioning from state $s$ to state $s'$ under action $a$.
$Q^*(s,a)$ | The optimal action-value function, representing the maximum expected return for taking action $a$ at state $s$.
$H(\pi(\cdot|s_t))$ | The policy entropy, representing the randomness of the policy at state $s_t$.

References

1. Abdelmaksoud, S.I.; Al-Mola, M.H.; Abro, G.E.M.; Asirvadam, V.S. In-Depth Review of Advanced Control Strategies and Cutting-Edge Trends in Robot Manipulators: Analyzing the Latest Developments and Techniques. IEEE Access 2024, 12, 47672–47701.
2. Poór, P.; Broum, T.; Basl, J. Role of collaborative robots in industry 4.0 with target on education in industrial engineering. In Proceedings of the 2019 4th International Conference on Control, Robotics and Cybernetics (CRC), Tokyo, Japan, 27–30 September 2019; pp. 42–46.
3. Hu, Y.; Wang, W.; Liu, H.; Liu, L. Reinforcement learning tracking control for robotic manipulator with kernel-based dynamic model. IEEE Trans. Neural Netw. Learn. Syst. 2020, 31, 3570–3578.
4. Chotikunnan, P.; Chotikunnan, R. Dual design PID controller for robotic manipulator application. J. Robot. Control (JRC) 2023, 4, 23–34.
5. Dou, W.; Ding, S.; Yu, X. Event-triggered second-order sliding-mode control of uncertain nonlinear systems. IEEE Trans. Syst. Man Cybern. Syst. 2023, 53, 7269–7279.
6. Pan, J. Fractional-order sliding mode control of manipulator combined with disturbance and state observer. Robot. Auton. Syst. 2025, 183, 104840.
7. Li, T.; Li, S.; Sun, H.; Lv, D. The fixed-time observer-based adaptive tracking control for aerial flexible-joint robot with input saturation and output constraint. Drones 2023, 7, 348.
8. Cho, M.; Lee, Y.; Kim, K.S. Model predictive control of autonomous vehicles with integrated barriers using occupancy grid maps. IEEE Robot. Autom. Lett. 2023, 8, 2006–2013.
9. Deng, W.; Zhou, H.; Zhou, J.; Yao, J. Neural network-based adaptive asymptotic prescribed performance tracking control of hydraulic manipulators. IEEE Trans. Syst. Man Cybern. Syst. 2022, 53, 285–295.
10. Li, S.; Nguyen, H.T.; Cheah, C.C. A theoretical framework for end-to-end learning of deep neural networks with applications to robotics. IEEE Access 2023, 11, 21992–22006.
11. Zhu, Z.; Lin, K.; Jain, A.K.; Zhou, J. Transfer learning in deep reinforcement learning: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 13344–13362.
12. Tran, V.P.; Mabrok, M.A.; Anavatti, S.G.; Garratt, M.A.; Petersen, I.R. Robust fuzzy Q-learning-based strictly negative imaginary tracking controllers for the uncertain quadrotor systems. IEEE Trans. Cybern. 2023, 53, 5108–5120.
13. Liu, J.; Zhou, Y.; Gao, J.; Yan, W. Visual servoing gain tuning by SARSA: An application with a manipulator. In Proceedings of the 2023 3rd International Conference on Robotics and Control Engineering, Nanjing, China, 12–14 May 2023; pp. 103–107.
14. Xu, H.; Fan, J.; Wang, Q. Model-based reinforcement learning for trajectory tracking of musculoskeletal robots. In Proceedings of the 2023 IEEE International Instrumentation and Measurement Technology Conference (I2MTC), Kuala Lumpur, Malaysia, 22–25 May 2023; pp. 1–6.
15. Li, X.; Shang, W.; Cong, S. Offline reinforcement learning of robotic control using deep kinematics and dynamics. IEEE/ASME Trans. Mechatron. 2023, 29, 2428–2439.
16. Zhang, S.; Pang, Y.; Hu, G. Trajectory-tracking control of robotic system via proximal policy optimization. In Proceedings of the 2019 IEEE International Conference on Cybernetics and Intelligent Systems (CIS) and IEEE Conference on Robotics, Automation and Mechatronics (RAM), Bangkok, Thailand, 18–20 November 2019; pp. 380–385.
17. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347.
18. Hu, Y.; Si, B. A reinforcement learning neural network for robotic manipulator control. Neural Comput. 2018, 30, 1983–2004.
19. Lei, W.; Sun, G. End-to-end active non-cooperative target tracking of free-floating space manipulators. Trans. Inst. Meas. Control 2023, 416, 379–394.
20. Song, D.; Gan, W.; Yao, P. Search and tracking strategy of autonomous surface underwater vehicle in oceanic eddies based on deep reinforcement learning. Appl. Soft Comput. 2023, 132, 109902.
21. Ho, J.; Ermon, S. Generative adversarial imitation learning. Adv. Neural Inf. Process. Syst. 2016, 29.
22. Ning, G.; Liang, H.; Zhang, X.; Liao, H. Inverse-reinforcement-learning-based robotic ultrasound active compliance control in uncertain environments. IEEE Trans. Ind. Electron. 2024, 71, 1686–1696.
23. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144.
24. Jiang, D.; Huang, J.; Fang, Z.; Cheng, C.; Sha, Q.; He, B.; Li, G. Generative adversarial interactive imitation learning for path following of autonomous underwater vehicle. Ocean Eng. 2022, 260, 111971.
25. Chaysri, P.; Spatharis, C.; Blekas, K.; Vlachos, K. Unmanned surface vehicle navigation through generative adversarial imitation learning. Ocean Eng. 2023, 282, 114989.
26. Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; Moritz, P. Trust region policy optimization. In Proceedings of the International Conference on Machine Learning, PMLR, Lille, France, 7–9 July 2015; pp. 1889–1897.
27. Pecioski, D.; Gavriloski, V.; Domazetovska, S.; Ignjatovska, A. An overview of reinforcement learning techniques. In Proceedings of the 2023 12th Mediterranean Conference on Embedded Computing (MECO), Budva, Montenegro, 6–10 June 2023; pp. 1–4.
28. Zhou, Y.; Lu, M.; Liu, X.; Che, Z.; Xu, Z.; Tang, J.; Zhang, Y.; Peng, Y.; Peng, Y. Distributional generative adversarial imitation learning with reproducing kernel generalization. Neural Netw. 2023, 165, 43–59.
29. Spong, M.W.; Hutchinson, S.; Vidyasagar, M. Robot Modeling and Control; John Wiley & Sons: Hoboken, NJ, USA, 2020.
30. Wan, L.; Pan, Y.J.; Shen, H. Improving synchronization performance of multiple Euler–Lagrange systems using nonsingular terminal sliding mode control with fuzzy logic. IEEE/ASME Trans. Mechatron. 2021, 27, 2312–2321.
31. Ma, Z.; Liu, Z.; Huang, P. Fractional-order control for uncertain teleoperated cyber-physical system with actuator fault. IEEE/ASME Trans. Mechatron. 2020, 26, 2472–2482.
32. Forbrigger, S. Prediction-Based Haptic Interfaces to Improve Transparency for Complex Virtual Environments. 2017. Available online: https://dalspace.library.dal.ca/items/d436a139-31ec-4571-8247-4b5d70530513 (accessed on 17 December 2024).
33. Liu, Y.C.; Khong, M.H. Adaptive control for nonlinear teleoperators with uncertain kinematics and dynamics. IEEE/ASME Trans. Mechatron. 2015, 20, 2550–2562.
34. Maheshwari, A.; Rautela, A.; Rayguru, M.M.; Valluru, S.K. Adaptive-optimal control for reconfigurable robots. In Proceedings of the 2023 International Conference on Device Intelligence, Computing and Communication Technologies (DICCT), Dehradun, India, 17–18 March 2023; pp. 511–514.
35. Li, F.; Fu, M.; Chen, W.; Zhang, F.; Zhang, H.; Qu, H.; Yi, Z. Improving exploration in actor–critic with weakly pessimistic value estimation and optimistic policy optimization. IEEE Trans. Neural Netw. Learn. Syst. 2022, 35, 8783–8796.
36. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
37. Zhang, L.; Liu, Q.; Huang, Z.; Wu, L. Learning unbiased rewards with mutual information in adversarial imitation learning. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 4–10 June 2023; pp. 1–5.
38. Huang, F.; Xu, J.; Wu, D.; Cui, Y.; Yan, Z.; Xing, W.; Zhang, X. A general motion controller based on deep reinforcement learning for an autonomous underwater vehicle with unknown disturbances. Eng. Appl. Artif. Intell. 2023, 117, 105589.
39. Wang, T.; Wang, F.; Xie, Z.; Qin, F. Curiosity model policy optimization for robotic manipulator tracking control with input saturation in uncertain environment. Front. Neurorobot. 2024, 18, 1376215.
Figure 1. (a) The 3-DOF Phantom Omni robot. (b) Schematic of the 3-DOF device.
Figure 2. Graph of agent learning with reinforcement learning.
Figure 3. (a) Actor network architecture; (b) critic network architecture.
Figure 4. Illustration of the SL-GAIL method.
Figure 5. The variance and standard deviation of the cumulative reward value for each episode. (a) Cumulative rewards for each episode. (b) Standard deviation of cumulative rewards. (c) Variance of cumulative rewards.
Figure 6. The tracking position, tracking error, and torque in Experiment 1. (a–c) Tracking position of the x-, y-, and z-axes. (d–f) Tracking error of the x-, y-, and z-axes. (g–i) Torque of the three joints.
Figure 7. The position tracking in an environment with interference in Experiment 2. (a–d) Position tracking of the x-axis. (e–h) Position tracking of the y-axis. (i–l) Position tracking of the z-axis.
Figure 8. The tracking error and torque in an environment with interference in Experiment 2. (a–c) Tracking error of the x-, y-, and z-axes. (d–f) Torque of the three joints.
Figure 9. The AAE and RMSE in two environments. (a–c) The AAE in two environments. (d–f) The RMSE in two environments.
Table 1. The controlled objects and their control methods.
Category | Method | Main Constraints | References
Hydraulic Manipulators | RBFNN-based adaptive asymptotic method | Complex mathematical models, uncertain dynamics | [9]
Serial Manipulators | PID, sliding-mode control, adaptive control | High reliance on precise models, many control parameters | [4,5,7]
Free-floating Manipulators | DRL-based SAC-RNN | Unknown dynamics, high control complexity | [19]
3-DoF Manipulators | SARSA algorithm | Weak capability for handling continuous state spaces | [13]
2-DoF Manipulators | Actor–critic RL | Deadzone problem, nonlinear dynamic characteristics | [18]
Pneumatic Musculoskeletal Robots | Model-based RL (MBRL) methods | Difficulty in learning effective control policies | [14]
Autonomous Underwater Vehicles (AUVs) | GAIL-based policy learning | Low data efficiency, challenging generalization to unknown environments | [24]
Unmanned Surface Vehicles (USVs) | GAIL combined with PPO or TRPO | Low sample efficiency, insufficient exploration capability | [25]
Robotic Arms and Mobile Robots | Distributed PPO combined with LSTM | Insufficient handling of temporal state dependencies | [16]
Underactuated Autonomous Surface/Underwater Vehicles (UASUVs) | SAC combined with LSTM | Training instability, input disturbances | [20]
Table 2. AAE, AAT, and RMSE of the SL-GAIL, SAC-LSTM, and SAC-GAIL algorithms in training.
Axis | Metric | SAC | SAC-GAIL | SAC-LSTM | SL-GAIL
X | AAE | 0.0123 | 0.0044 | 0.0062 | 0.0058
X | AAT | 3.5349 | 1.9642 | 2.2820 | 1.1057
X | RMSE | 0.0158 | 0.0112 | 0.0110 | 0.0101
Y | AAE | 0.0088 | 0.0123 | 0.0072 | 0.0066
Y | AAT | 3.4609 | 2.6186 | 1.8884 | 0.7469
Y | RMSE | 0.0206 | 0.0240 | 0.0213 | 0.01975
Z | AAE | 0.0081 | 0.0081 | 0.0063 | 0.0038
Z | AAT | 2.9630 | 2.3313 | 0.9714 | 0.4388
Z | RMSE | 0.0099 | 0.0088 | 0.0116 | 0.0052
Table 3. Comparison of AAE in the two environments.
Axis | Environment | SAC | SAC-GAIL | SAC-LSTM | SL-GAIL
X | DFENV | 0.0123 | 0.0044 | 0.0062 | 0.0058
X | DENV | 0.0176 | 0.0061 | 0.0098 | 0.0059
Y | DFENV | 0.0088 | 0.0123 | 0.0072 | 0.0066
Y | DENV | 0.0137 | 0.0177 | 0.0118 | 0.0066
Z | DFENV | 0.0081 | 0.0081 | 0.0063 | 0.0038
Z | DENV | 0.0099 | 0.0102 | 0.0098 | 0.0042
Table 4. Comparison of RMSE in the two environments.
Axis | Environment | SAC | SAC-GAIL | SAC-LSTM | SL-GAIL
X | DFENV | 0.0158 | 0.0112 | 0.0110 | 0.0101
X | DENV | 0.0221 | 0.0123 | 0.0124 | 0.0102
Y | DFENV | 0.0206 | 0.0240 | 0.0213 | 0.0197
Y | DENV | 0.0274 | 0.0279 | 0.0234 | 0.0196
Z | DFENV | 0.0099 | 0.0088 | 0.0116 | 0.0049
Z | DENV | 0.0128 | 0.0124 | 0.0127 | 0.0052
