Article

LSTM-Enhanced TD3 and Behavior Cloning for UAV Trajectory Tracking Control

1 School of Computer Science, University of Electronic Science and Technology of China, Zhongshan Institute, Zhongshan 528402, China
2 College of Excellent Engineers, Dongguan University of Technology, Dongguan 523820, China
3 Modern Educational Technology Center, Jiaying University, Meizhou 514015, China
* Author to whom correspondence should be addressed.
Biomimetics 2025, 10(9), 591; https://doi.org/10.3390/biomimetics10090591
Submission received: 24 July 2025 / Revised: 19 August 2025 / Accepted: 28 August 2025 / Published: 4 September 2025
(This article belongs to the Special Issue Bio-Inspired Robotics and Applications 2025)

Abstract

Unmanned aerial vehicles (UAVs) often face significant challenges in trajectory tracking within complex dynamic environments, where uncertainties, external disturbances, and nonlinear dynamics hinder accurate and stable control. To address this issue, a bio-inspired deep reinforcement learning (DRL) algorithm is proposed, integrating behavior cloning (BC) and long short-term memory (LSTM) networks. The method learns a high-precision control policy autonomously, without requiring an accurate system dynamics model. Motivated by the memory and prediction functions of biological neural systems, an LSTM module is embedded into the policy network of the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm. This structure captures temporal state patterns more effectively, enhancing adaptability to trajectory variations and resilience to delays or disturbances. Compared to memoryless networks, the LSTM-based design better replicates biological time-series processing, improving tracking stability and accuracy. In addition, behavior cloning is employed to pre-train the DRL policy using expert demonstrations, mimicking the way animals learn from observation. This biologically plausible initialization accelerates convergence by reducing inefficient early-stage exploration. By combining offline imitation with online learning, the TD3-LSTM-BC framework balances expert guidance and adaptive optimization, analogous to innate and experience-based learning in nature. Simulation results confirm the superior robustness and tracking accuracy of the proposed method, demonstrating its potential as a control solution for autonomous UAVs.

1. Introduction

With the development of technology, UAVs have been increasingly used in disaster relief, agricultural plant protection, logistics and distribution, wind farm operation and maintenance, and other fields [1]. Inspired by biomimetic principles, recent studies have sought to enhance UAV adaptability and robustness by mimicking biological learning and control mechanisms. Classic control algorithms such as PID control [2], model predictive control (MPC) [3], and sliding mode control [4] rely on accurate system dynamics modeling and perform well in structured environments. To enhance adaptability under dynamic conditions such as load changes and wind disturbances, an adaptive PID strategy is proposed in [5] to improve flight stability by adjusting control parameters in real time. In [6], a PID-based pitch control system is developed for UAV trajectory tracking near or above the speed of sound. In [7], a stable adaptive PID control scheme for UAV systems is introduced; it aims to accurately estimate the ideal controller for a given control objective and dynamically adjusts its gains through a stable adaptive process. To address the sharp degradation of deception performance caused by sudden changes in UAV trajectory, an MPC-based dynamic trajectory deception method is proposed in [8], which gradually guides the UAV toward the deceiver's predetermined trajectory. For the target trajectory planning problem of UAV swarms in uncertain environments, an optimized MPC method based on a deep neural network (DNN) is proposed in [9]. A UAV trajectory planning algorithm based on a nonlinear MPC (NMPC) scheme is studied in [10]; through control parameterization, the trajectory planning problem is transformed into a nonlinear programming problem, thereby alleviating the heavy computational burden of solving the NMPC optimization online. In [11], a time-delay sliding mode controller is proposed for manipulator control; the sliding surface incorporates the sliding variable value at the previous sampling moment, which enhances tracking performance with fast convergence and minimal steady-state error. In [12], an adaptive fuzzy fixed-time sliding mode formation control method for quadrotor UAVs is proposed, which achieves the prescribed performance under uncertainty and external disturbances. Although traditional control methods are widely used for UAVs, they often require complex parameter tuning and online optimization, and their adaptability is limited. In [13], a safety control method based on event triggering and adaptive dynamic programming is proposed to solve the control problem of nonlinear systems with asymmetric input constraints and state constraints. For unknown nonlinear systems with actuator failures and asymmetric input constraints, a safe and optimal fault-tolerant control method based on a control barrier function and a neural network is proposed in [14], and its effectiveness is verified through theoretical analysis and simulation.
In order to break through the bottlenecks of traditional methods, reinforcement learning (RL) has shown significant advantages by learning optimal strategies through trial and error and interaction with the environment. In [15], a deep reinforcement learning UAV navigation method is proposed that uses adaptive control and an attention mechanism to dynamically balance navigation and obstacle avoidance and optimizes control performance through a speed-constraint loss. In [16], a fixed-wing UAV trajectory tracking control method based on the deep deterministic policy gradient (DDPG) is proposed. In [17], a Z-function decomposition-based RL approach is developed to jointly optimize transmission power and UAV trajectories, aiming to improve target positioning accuracy. To improve the robustness of the nonlinear dynamic inversion (NDI) method under model uncertainty, a control scheme combining a TD3 agent with the NDI method is proposed in [18]. In [19], a hierarchical PPO reinforcement learning method is proposed to improve the maneuverability of UAVs in complex environments by decomposing high-level and low-level tasks. In [20], an RL-based quadrotor control architecture is proposed, using the Soft Actor-Critic algorithm, a model-free off-policy stochastic RL algorithm, to train the agent. However, standard DRL algorithms still struggle with time-dependent tasks that require long-horizon temporal reasoning, such as UAV trajectory tracking.
Therefore, researchers have introduced LSTM networks to enhance the policy network's ability to remember historical states [21]. In [22], the application of DRL to trajectory tracking of a skid-steer mobile robot under terrain constraints is discussed, and LSTM is integrated into the DRL controller to address partial observability issues in navigation. To address the poor obstacle avoidance performance of particle-model-based methods outside idealized environments, a non-particle-model USV obstacle avoidance algorithm based on LSTM-PPO is proposed in [23], together with a training environment adapted to non-ideal conditions. To solve the tracking control problem of a manipulator, a method based on LSTM and generative adversarial imitation learning (GAIL) is proposed in [24]. In [25], an improved deep reinforcement learning algorithm based on LSTM and MATD3 is proposed for training multi-UAV adaptive collaborative formation trajectory planning. In addition, imitation learning methods such as BC are widely used to assist reinforcement learning training, especially when data samples are insufficient or policy performance is poor in the early stage of training. In [26], a BC- and PPO-based approach is proposed to solve the within-visual-range (WVR) air-to-air combat problem of aircraft and missiles under complex nonlinear six-degree-of-freedom (6-DOF) dynamics. In [27], a collaborative method combining reinforcement learning and imitation learning is proposed to address the poor performance of ordinary reinforcement learning when learning navigation policies in partially observable non-Markovian environments.
In summary, combining deep reinforcement learning with temporal modeling structures and imitation learning mechanisms has become an effective way to improve the intelligent control performance of UAVs [28]. This paper aims to build a deep reinforcement learning control framework that integrates an LSTM structure and a behavior cloning mechanism and applies it to the three-dimensional trajectory tracking task of UAVs with six-degree-of-freedom dynamic modeling. The main contributions are as follows:
(1)
Inspired by references [22,23,24], an LSTM network is introduced to enhance the trajectory tracking capability of drones in dynamic environments. By mimicking the memory and temporal processing capabilities of biological neural systems, the LSTM layer extracts time-dependent features from the sequential observations of the UAV. These features are processed by the policy network to generate continuous control actions, which significantly improves the adaptability and control stability of the algorithm in partially observable environments.
(2)
Based on the advantages of the BC method, this paper combines expert demonstration data with reinforcement learning by drawing on the biomimetic principle of learning from observation. Pre-training the policy network to imitate expert behavior not only greatly shortens the exploration time in the early stage of training but also avoids the generation of dangerous control actions.
(3)
A TD3-LSTM-BC algorithm that integrates TD3, LSTM, and BC is proposed for drone trajectory tracking control. The algorithm captures the time-series dependency through LSTM and uses BC to provide expert prior knowledge, achieving the coordinated optimization of control accuracy, learning efficiency, and robustness under the TD3 framework.

2. Unmanned Aerial Vehicle System Modeling

The UAV is modeled as a 6-degree-of-freedom (6-DoF) rigid body with states including position, velocity, Euler angles, and angular velocities. The dynamics are derived using Newton–Euler equations, considering forces and moments acting on the UAV [29]. The position $P = [p_x, p_y, p_z]^{\top}$ and velocity $V = [v_x, v_y, v_z]^{\top}$ evolve according to:
$$\dot{p} = v, \qquad \dot{v} = \frac{F}{m} + g$$
where $m$ is the UAV mass, $F = [F_x, F_y, F_z]^{\top}$ is the applied force (control input), and $g = [0, 0, -9.81]^{\top}\ \mathrm{m/s^2}$ is the gravitational acceleration. In this implementation, the control input is applied directly as a linear acceleration:
$$v_{t+1} = v_t + a_{\mathrm{linear}} \cdot \Delta t, \qquad p_{t+1} = p_t + v_t \cdot \Delta t$$
where $a_{\mathrm{linear}} = F/m$ is the commanded linear acceleration.
The attitude is represented in Euler angles $\Theta = [\phi, \theta, \psi]^{\top}$, and the angular velocity is $\omega = [\omega_x, \omega_y, \omega_z]^{\top}$. The dynamics are given by:
$$\dot{\Theta} = R(\Theta)\,\omega$$
where $R(\Theta)$ is the transformation matrix from body rates to Euler angle derivatives. The angular acceleration is controlled directly:
$$\omega_{t+1} = \omega_t + a_{\mathrm{angular}} \cdot \Delta t, \qquad \Theta_{t+1} = \Theta_t + \omega_t \cdot \Delta t$$
where $a_{\mathrm{angular}}$ is the commanded angular acceleration. To prevent unrealistic attitudes, the Euler angles are constrained:
$$\phi, \theta \in \left[-\tfrac{\pi}{2}, \tfrac{\pi}{2}\right], \qquad \psi \in [-\pi, \pi]$$
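The discrete updates above amount to Euler integration with bounded accelerations. Below is a minimal Python sketch of one such simulation step; the class name, the acceleration bounds, and the assumption that gravity compensation is folded into the commanded linear acceleration are illustrative choices, not the authors' code.

```python
import numpy as np

class UAV6DoF:
    """Minimal 6-DoF model with direct linear/angular acceleration inputs."""

    def __init__(self, dt=0.1, a_max=5.0):
        self.dt, self.a_max = dt, a_max
        self.p = np.zeros(3)      # position [px, py, pz]
        self.v = np.zeros(3)      # velocity [vx, vy, vz]
        self.theta = np.zeros(3)  # Euler angles [phi, theta, psi]
        self.omega = np.zeros(3)  # angular rates [wx, wy, wz]

    def step(self, a_linear, a_angular):
        # Clip commanded accelerations to the actuator bounds.
        a_lin = np.clip(a_linear, -self.a_max, self.a_max)
        a_ang = np.clip(a_angular, -self.a_max, self.a_max)

        # Translational update: p_{t+1} = p_t + v_t*dt, then v_{t+1} = v_t + a*dt.
        self.p = self.p + self.v * self.dt
        self.v = self.v + a_lin * self.dt

        # Rotational update: Theta_{t+1} = Theta_t + omega_t*dt, then omega_{t+1} = omega_t + a*dt.
        self.theta = self.theta + self.omega * self.dt
        self.omega = self.omega + a_ang * self.dt

        # Constrain Euler angles to realistic ranges.
        self.theta[0:2] = np.clip(self.theta[0:2], -np.pi / 2, np.pi / 2)  # roll, pitch
        self.theta[2] = (self.theta[2] + np.pi) % (2 * np.pi) - np.pi      # wrap yaw to [-pi, pi]
        return self.p.copy(), self.v.copy(), self.theta.copy(), self.omega.copy()
```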

3. Preliminaries

3.1. Reinforcement Learning

RL is a learning paradigm where an agent interacts with an environment modeled as a Markov Decision Process (MDP) defined by the tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$ [30], where:
$\mathcal{S}$ is the state space,
$\mathcal{A}$ is the action space,
$P(s_{t+1} \mid s_t, a_t)$ is the transition probability,
$r(s_t, a_t)$ is the reward function,
$\gamma \in [0, 1]$ is the discount factor.
The agent aims to learn a policy $\pi(a_t \mid s_t)$ that maximizes the expected discounted return:
$$J(\pi) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t\, r(s_t, a_t)\right]$$
In continuous control tasks like UAV trajectory tracking, actor-critic methods are widely used, where:
the actor $\pi_\theta(a \mid s)$ outputs continuous actions,
the critic $Q_\phi(s, a)$ estimates the action-value function:
$$Q^\pi(s_t, a_t) = \mathbb{E}_\pi\left[\sum_{t'=t}^{\infty} \gamma^{\,t'-t}\, r(s_{t'}, a_{t'})\right]$$
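To make the discounted-return objective concrete, the following generic snippet evaluates $\sum_{t} \gamma^t r_t$ for a finite reward sequence; it is an illustrative helper, not part of the paper's implementation.

```python
def discounted_return(rewards, gamma=0.99):
    """Compute sum_t gamma^t * r_t for a finite episode."""
    g = 0.0
    for r in reversed(rewards):  # backward accumulation: g_t = r_t + gamma * g_{t+1}
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))  # 1 + 0.9 + 0.81 = 2.71
```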

3.2. TD3 Algorithm

TD3 is an advanced off-policy algorithm that is mainly used in continuous action spaces. It addresses the overestimation bias of Q-values found in DDPG by introducing three key techniques:
  • Clipped Double Q-learning: maintain two critics $Q_{\phi_1}$ and $Q_{\phi_2}$, and use the smaller Q-value in the target:
    $$y_t = r_t + \gamma \cdot \min_{i=1,2} Q_{\phi_i'}\big(s_{t+1},\ \pi_{\theta'}(s_{t+1}) + \epsilon\big)$$
    where $\epsilon \sim \mathrm{clip}(\mathcal{N}(0, \sigma), -c, c)$ adds noise for target smoothing.
  • Delayed Policy Update: to reduce the discrepancy, the actor network is updated less frequently than the critic network.
  • Target Policy Smoothing: adds noise to the target action for smoother Q-function estimation.
The critic loss is defined as:
$$L_Q = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim \mathcal{D}}\left[\big(Q_{\phi_i}(s_t, a_t) - y_t\big)^2\right]$$
The actor is updated to maximize the critic’s estimate:
$$L_\pi = \mathbb{E}_{s_t \sim \mathcal{D}}\left[Q_{\phi_1}\big(s_t, \pi_\theta(s_t)\big)\right]$$
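The three mechanisms above can be combined in a few lines. The following PyTorch-style sketch computes one TD3 critic/actor update from a sampled batch; the network objects, action bounds, and hyperparameter values are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def td3_update(actor, actor_targ, critics, critic_targs, batch,
               gamma=0.99, sigma=0.2, c=0.5, policy_delay=2, step=0):
    s, a, r, s_next, done = batch  # tensors shaped (B, ...)

    with torch.no_grad():
        # Target policy smoothing: clipped Gaussian noise on the target action.
        noise = (torch.randn_like(a) * sigma).clamp(-c, c)
        a_next = (actor_targ(s_next) + noise).clamp(-1.0, 1.0)
        # Clipped double Q-learning: take the minimum of the two target critics.
        q_next = torch.min(critic_targs[0](s_next, a_next),
                           critic_targs[1](s_next, a_next))
        y = r + gamma * (1.0 - done) * q_next

    # Critic loss: mean squared TD error for both critics.
    critic_loss = sum(F.mse_loss(Q(s, a), y) for Q in critics)

    actor_loss = None
    if step % policy_delay == 0:
        # Delayed policy update: maximize Q1 of the current policy action.
        actor_loss = -critics[0](s, actor(s)).mean()
    return critic_loss, actor_loss
```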

3.3. Long Short-Term Memory

To capture temporal dependencies and history in UAV control tasks, this paper embeds an LSTM network in the policy and/or critic. LSTM is a type of recurrent neural network (RNN) designed to alleviate the vanishing gradient problem through a gating mechanism [31]. At each time step $t$, given input $x_t \in \mathbb{R}^d$, previous hidden state $h_{t-1}$, and cell state $C_{t-1}$, the LSTM performs the following updates:
$$\begin{aligned}
f_t &= \sigma\big(W_f [h_{t-1}, x_t] + b_f\big) && \text{(forget gate)} \\
i_t &= \sigma\big(W_i [h_{t-1}, x_t] + b_i\big) && \text{(input gate)} \\
\tilde{C}_t &= \tanh\big(W_C [h_{t-1}, x_t] + b_C\big) && \text{(candidate cell state)} \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && \text{(cell state update)} \\
o_t &= \sigma\big(W_o [h_{t-1}, x_t] + b_o\big) && \text{(output gate)} \\
h_t &= o_t \odot \tanh(C_t) && \text{(new hidden state)}
\end{aligned}$$
where:
$\sigma(\cdot)$ denotes the sigmoid activation,
$\odot$ denotes element-wise multiplication,
$W_f, W_i, W_C, W_o$ and $b_f, b_i, b_C, b_o$ are weight matrices and bias vectors.
The LSTM is integrated into the policy and/or value networks to process sequences of states and actions, which is beneficial in partially observable or history-dependent UAV environments [32].
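To make the gate notation concrete, the following NumPy sketch performs one LSTM step with randomly initialized weights; the dimensions (a 15-dimensional state input and a 32-unit hidden state) are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. Each W[k] maps the concatenation [h_{t-1}, x_t] to a gate pre-activation."""
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W["f"] @ z + b["f"])          # forget gate
    i = sigmoid(W["i"] @ z + b["i"])          # input gate
    c_tilde = np.tanh(W["C"] @ z + b["C"])    # candidate cell state
    c = f * c_prev + i * c_tilde              # cell state update
    o = sigmoid(W["o"] @ z + b["o"])          # output gate
    h = o * np.tanh(c)                        # new hidden state
    return h, c

d, hdim = 15, 32  # e.g. the 15-dimensional UAV state and a 32-unit hidden state
rng = np.random.default_rng(0)
W = {k: 0.1 * rng.standard_normal((hdim, hdim + d)) for k in "fiCo"}
b = {k: np.zeros(hdim) for k in "fiCo"}
h, c = lstm_step(rng.standard_normal(d), np.zeros(hdim), np.zeros(hdim), W, b)
```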

3.4. Behavior Cloning

Behavior cloning is a form of supervised imitation learning that learns a policy from expert demonstrations. Given an expert dataset:
$$\mathcal{D}_E = \big\{(s_i, a_i^*)\big\}_{i=1}^{N}$$
where:
$s_i \in \mathcal{S}$ is the observed state,
$a_i^* \in \mathcal{A}$ is the expert's action.
The policy $\pi_\theta(a \mid s)$ is trained to minimize the mean squared error (or other divergence metrics) between its output and the expert action:
$$L_{BC}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \big\|\pi_\theta(s_i) - a_i^*\big\|^2$$
This provides a warm start for reinforcement learning, accelerating convergence and improving safety during early exploration. BC can also be integrated into TD3, resulting in TD3-BC, by regularizing the policy loss:
$$L_\pi = \lambda \cdot L_{BC} - (1 - \lambda) \cdot \mathbb{E}_{s \sim \mathcal{D}}\left[Q_\phi\big(s, \pi_\theta(s)\big)\right]$$
where:
$\lambda \in [0, 1]$ is a weighting coefficient balancing imitation and value maximization,
$Q_\phi$ is the critic network.
This hybrid objective ensures that the policy remains close to expert behavior while still exploring higher-reward actions via RL optimization [33].
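A minimal sketch of this BC-regularized actor objective is given below, assuming states and expert actions have already been sampled from a replay buffer and an expert buffer; the function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def td3_bc_actor_loss(actor, critic, s_rl, s_exp, a_exp, lam=0.5):
    """Hybrid actor loss: lam * BC imitation term - (1 - lam) * Q-value term."""
    bc_loss = F.mse_loss(actor(s_exp), a_exp)    # imitation of expert actions
    q_term = critic(s_rl, actor(s_rl)).mean()    # critic's value of the current policy
    return lam * bc_loss - (1.0 - lam) * q_term  # minimized w.r.t. the actor parameters
```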

4. Control Design

4.1. Input and Output

The state space is designed to comprehensively capture the UAV’s full motion state and tracking objectives, including position, velocity, orientation (Euler angles), angular velocity, and the target position. This allows the learning algorithm to observe both the current dynamic status and the goal, enabling more accurate and responsive control decisions. The full state vector is:
$$s_t = \big[p,\ v,\ \Theta,\ \omega,\ p_{\mathrm{target}}\big] \in \mathbb{R}^{15}$$
where $p_{\mathrm{target}}$ is the current target position.
The action space, defined as linear and angular accelerations, provides direct and interpretable control over both translational and rotational dynamics. This setup strikes a balance between model complexity and learning efficiency, avoids unnecessary coupling from low-level motor dynamics, and facilitates smooth policy learning in continuous control tasks like trajectory tracking. The agent outputs a six-dimensional continuous action:
$$a_t = \big[a_{\mathrm{linear}},\ a_{\mathrm{angular}}\big] \in \mathbb{R}^{6}$$
To guide the UAV toward accurate trajectory tracking, the reward function is designed based on the position error $e_t$, defined as:
$$e_t = p - p_{\mathrm{target}}$$
A smoothed negative L1-norm reward is used to ensure continuous gradients and robustness:
$$r_t = -\lambda\,\|e_t\|_1 = -\frac{\lambda}{3} \sum_{i=1}^{3} \big|p_i - p_{\mathrm{target},i}\big|$$
where $\lambda \in (0, 1)$ is a weighting coefficient. This formulation penalizes large deviations from the target and encourages the agent to minimize the trajectory tracking error. Since the main goal of the task is to minimize the position error and the reward is essentially a linear penalty on that error, $\lambda$ only normalizes and scales the reward value without changing the training objective or the optimal policy.
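As a concrete illustration, the reward reduces to a scaled mean absolute position error; a minimal sketch follows, with $\lambda$ set to an illustrative value of 0.5.

```python
import numpy as np

def tracking_reward(p, p_target, lam=0.5):
    """Smoothed negative L1 reward: r_t = -(lam / 3) * sum_i |p_i - p_target_i|."""
    e = np.asarray(p) - np.asarray(p_target)
    return -lam * np.abs(e).mean()

print(tracking_reward([1.0, 0.0, 0.0], [0.0, 0.0, 0.0]))  # about -0.167 for a 1 m error on one axis
```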
Remark 1.
In real UAV control tasks, the environmental state has time dependence. Traditional MLP strategies have difficulty capturing this temporal feature, resulting in poor policy generalization. The LSTM module can explicitly model the temporal relationship between state sequences, making the strategy more adaptable in partially observable and dynamic environments, thereby improving control accuracy and stability [24].

4.2. TD3-LSTM Design

Most reinforcement learning algorithms use the actor-critic architecture to efficiently deal with policy gradient problems. In this architecture, the policy network (Actor) is continuously optimized under the evaluation feedback of the value network (Critic) and finally learns a policy π for performing trajectory tracking control tasks. The core task of the Critic network is to evaluate the performance of the current policy and provide learning signals for policy updates.
The TD3-LSTM algorithm designed in this paper is based on the actor-critic framework and is mainly composed of a policy network and a value network. For the trajectory tracking task, the state of the UAV agent at each time step constitutes a state sequence and is input to the LSTM network. The LSTM generates the corresponding hidden state h at each time step and uses it as the input of the subsequent fully connected layer. After being processed by the fully connected network, the policy network outputs the Gaussian distribution parameters of the action at that time step. During the training process, actions are sampled from this distribution to enhance the randomness and exploration ability of the policy.
The input of the value network is the concatenation of the state sequence and the corresponding action. After passing through a fully connected network and an output layer, it outputs a scalar representing the value $V_\phi(S)$ of the current state sequence, which is used to guide the update of the policy network.
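The network layout described above can be sketched as follows in PyTorch. The layer widths, the single LSTM layer, and the output scaling are illustrative assumptions and not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class LSTMActor(nn.Module):
    """Policy network: state sequence -> LSTM hidden state -> fully connected layers -> action."""
    def __init__(self, state_dim=15, action_dim=6, hidden=128, a_max=5.0):
        super().__init__()
        self.lstm = nn.LSTM(state_dim, hidden, batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, action_dim), nn.Tanh())
        self.a_max = a_max

    def forward(self, seq):                 # seq: (batch, k, state_dim)
        h_seq, _ = self.lstm(seq)
        h_t = h_seq[:, -1]                  # hidden state at the last time step
        return self.a_max * self.head(h_t)  # scale to the acceleration bounds

class Critic(nn.Module):
    """Value network: (feature, action) concatenation -> fully connected layers -> scalar."""
    def __init__(self, feat_dim=15, action_dim=6, hidden=128):
        super().__init__()
        self.q = nn.Sequential(nn.Linear(feat_dim + action_dim, hidden), nn.ReLU(),
                               nn.Linear(hidden, hidden), nn.ReLU(),
                               nn.Linear(hidden, 1))

    def forward(self, feat, action):
        return self.q(torch.cat([feat, action], dim=-1))
```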

4.3. TD3-LSTM-BC Design

On the basis of TD3-LSTM, a BC mechanism is further introduced to pre-train the policy network, improving sample efficiency and policy stability in the early stage of learning, as shown in Algorithm 1. In this paper, the expert data primarily come from trajectories generated by policies trained with the TD3 and TD3-LSTM algorithms. These policies have demonstrated good stability and control performance in UAV trajectory tracking tasks and can therefore serve as reliable expert demonstrations. In the BC stage, the expert demonstration data $\mathcal{D}^* = \big\{(s_{t-k:t}^{(j)}, a_t^{*(j)})\big\}_{j=1}^{N}$ are used to train the LSTM-Actor to approximate the expert policy, with the state sequence as input.
Specifically, the state sequence with a time window length of k is first input into the LSTM network to extract the hidden state h t , and then the policy network outputs the action π θ ( h t ) . The training goal is to minimize the mean square error loss between the policy output and the expert action:
$$L_{BC}(\theta) = \frac{1}{N} \sum_{j=1}^{N} \big\|\pi_\theta(h_t^{(j)}) - a_t^{*(j)}\big\|^2$$
The policy network parameters $\theta$ are updated by gradient descent:
$$\theta \leftarrow \theta - \alpha_{BC}\, \nabla_\theta L_{BC}$$
After completing several rounds of BC pre-training, the algorithm enters the online TD3 reinforcement learning phase and continues to optimize the Actor and Critic networks based on the environment interaction data. BC training only acts on the policy network, and the Critic network is still updated from the environment sampling to keep the modules decoupled.
Remark 2.
In deep reinforcement learning, the initial strategy is often random, which can easily lead to inefficient exploration or state collapse in the early stages of training, especially in complex control tasks such as UAVs. By introducing expert data for behavior cloning, the strategy can be quickly guided to a reasonable strategy space, improving the initial performance of the strategy and reducing training interference caused by invalid actions. Combined with the time-dependent features extracted by LSTM, BC can significantly accelerate the convergence of the strategy while improving training stability and final performance.
Algorithm 1 TD3 with LSTM-Actor and BC for UAV Trajectory Tracking
1: Initialize:
2:     LSTM-based policy network $\pi_\theta$, MLP critics $Q_{\psi_1}, Q_{\psi_2}$
3:     Target networks $\pi_{\theta'}, Q_{\psi_1'}, Q_{\psi_2'}$ as copies of $\pi_\theta, Q_{\psi_1}, Q_{\psi_2}$
4:     Replay buffer $\mathcal{D}$, expert demonstration buffer $\mathcal{D}^*$
5:     Time window size $k$, batch size $B$, soft update rate $\tau$
6: Phase 1: Behavior Cloning Pre-Training
7: for epoch = 1 to $N_{BC}$ do
8:     Sample batch $\{(s_{t-k:t}^{(j)}, a_t^{*(j)})\}_{j=1}^{B} \subset \mathcal{D}^*$
9:     for $j = 1$ to $B$ do
10:        Compute LSTM hidden state: $h_t^{(j)} \leftarrow \mathrm{LSTM}_\theta(s_{t-k:t}^{(j)})$
11:    end for
12:    Compute BC loss: $L_{BC} = \frac{1}{B}\sum_{j=1}^{B} \|\pi_\theta(h_t^{(j)}) - a_t^{*(j)}\|^2$
13:    Update $\theta$ using $\nabla_\theta L_{BC}$
14: end for
15: Phase 2: Online TD3 Training
16: for episode = 1 to $M$ do
17:     Initialize time series buffer with $k$ frames: $s_{1:k} \leftarrow$ env.reset()
18:     while not done do
19:         $h_t \leftarrow \mathrm{LSTM}_\theta(s_{t-k:t})$
20:         Select action with exploration noise: $a_t \leftarrow \pi_\theta(h_t) + \mathcal{N}(0, \sigma)$
21:         Execute $a_t$, observe $r_t, s_{t+1}, d_t$
22:         Store transition $(s_{t-k:t}, a_t, r_t, s_{t-k+1:t+1}, d_t)$ in buffer $\mathcal{D}$
23:     end while
24:     if $|\mathcal{D}| \geq B$ then
25:         for each gradient update step do
26:             Sample mini-batch $\{(s_{t-k:t}, a_t, r_t, s_{t-k+1:t+1}, d_t)\}_{j=1}^{B} \subset \mathcal{D}$
27:             Compute $h_t^{(j)} \leftarrow \mathrm{LSTM}_\theta(s_{t-k:t}^{(j)})$
28:             Compute $h_{t+1}^{(j)} \leftarrow \mathrm{LSTM}_\theta(s_{t-k+1:t+1}^{(j)})$
29:             Compute target action: $a_{t+1} \leftarrow \pi_{\theta'}(h_{t+1}^{(j)}) + \mathrm{clip}(\mathcal{N}(0, \tilde{\sigma}), -c, c)$
30:             Compute target Q-value: $y_j = r_t + \gamma (1 - d_t) \cdot \min_{i=1,2} Q_{\psi_i'}(h_{t+1}^{(j)}, a_{t+1})$
31:             Update critics by minimizing: $L_{\psi_i} = \frac{1}{B}\sum_{j=1}^{B} \big(Q_{\psi_i}(h_t^{(j)}, a_t^{(j)}) - y_j\big)^2$
32:         end for
33:         if episode % 2 == 0 then
34:             Update actor by policy gradient: $\nabla_\theta J \approx \frac{1}{B}\sum_{j=1}^{B} \nabla_\theta Q_{\psi_1}(h_t^{(j)}, \pi_\theta(h_t^{(j)}))$
35:             Soft update targets: $\psi_i' \leftarrow \tau \psi_i + (1 - \tau)\psi_i'$, $\theta' \leftarrow \tau \theta + (1 - \tau)\theta'$
36:         end if
37:     end if
38: end for
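Algorithm 1's two phases can be condensed into a short training skeleton. The sketch below covers the behavior cloning pre-training phase and the soft target update used in Phase 2 (the TD3 critic/actor losses follow the sketch in Section 3.2); the data loader, learning rate, and epoch count are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def pretrain_bc(actor, expert_loader, epochs=50, lr=1e-3):
    """Phase 1: supervised behavior cloning of the LSTM actor on expert (sequence, action) pairs."""
    opt = torch.optim.Adam(actor.parameters(), lr=lr)
    for _ in range(epochs):
        for seq, a_star in expert_loader:   # seq: (B, k, state_dim), a_star: (B, action_dim)
            loss = F.mse_loss(actor(seq), a_star)
            opt.zero_grad()
            loss.backward()
            opt.step()

def soft_update(target, source, tau=0.005):
    """Polyak averaging of target network parameters (used after the delayed actor update)."""
    with torch.no_grad():
        for p_t, p in zip(target.parameters(), source.parameters()):
            p_t.mul_(1.0 - tau).add_(tau * p)
```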

5. Simulation

In order to verify the effectiveness of the proposed TD3-LSTM-BC algorithm in the UAV trajectory tracking task, this section presents comparative experiments based on the UAV simulation environment built in Section 2. All simulations of the UAV trajectory tracking task were implemented in Python 3.9 using the PyTorch 2.1.0 framework. The training process was conducted on a workstation equipped with an Intel i5-12500K CPU, 64 GB RAM, and a single NVIDIA RTX 2060S GPU. The total number of steps in the experiment is set to $N = 200$, the step length is $\Delta t = 0.1$ s, and the total simulation time of the system is $T = N \cdot \Delta t = 20$ s. Four benchmark algorithms, DDPG, TD3, TD3-LSTM, and TD3-BC, are selected for comparison. In this paper, a spiral trajectory is selected as the target trajectory, as shown below:
$$\mathrm{TargetTraj} = \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 3\cos(0.3t) \\ 3\sin(0.3t) \\ 0.3t \end{bmatrix}$$
The trajectory defines a spiral motion with a constant radius, where the x- and y-axes form a circular motion with an angular frequency of ω = 0.3 rad/s, and the z-axis climbs at a constant rate of 0.3 m/s. Compared with conventional spiral trajectories, this design enhances the challenge of trajectory tracking through time-varying parameters and can fully test the algorithm’s adaptability to nonlinear dynamic systems [34].
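For reference, the target spiral can be sampled at the simulation rate as follows ($N = 200$ steps of $\Delta t = 0.1$ s, matching the setup above); this is a generic helper, not the authors' code.

```python
import numpy as np

def spiral_trajectory(n_steps=200, dt=0.1):
    """Target spiral: x = 3cos(0.3t), y = 3sin(0.3t), z = 0.3t."""
    t = np.arange(n_steps) * dt
    return np.stack([3.0 * np.cos(0.3 * t),
                     3.0 * np.sin(0.3 * t),
                     0.3 * t], axis=1)      # shape (n_steps, 3)

traj = spiral_trajectory()
print(traj.shape)  # (200, 3)
```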
The experiment will conduct a quantitative analysis from three dimensions: the algorithm’s training effect, the controller’s trajectory tracking accuracy, and the speed accuracy and robustness. The parameters of the UAV are shown in Table 1.

5.1. Training Performance

In order to evaluate the training performance of each algorithm, all methods were trained for 1000 episodes in the same environment, and each experiment was repeated four times to reduce the influence of randomness [35]. As shown in Figure 1, the reward curves of the five algorithms show obvious differences: the DDPG algorithm performs the worst among all methods, with the slowest convergence speed and the highest reward fluctuations throughout training. It frequently falls into local optima and fails to adapt effectively to the time-varying trajectory task due to its inherent limitations in policy updates and the exploration–exploitation balance.
The traditional TD3 algorithm, while outperforming DDPG, still exhibits slower convergence compared to other variants, and its rewards show significant fluctuations even after stabilization, indicating insufficient policy stability in dynamic tasks. In contrast, the TD3-LSTM algorithm, with its LSTM structure, significantly improves training stability. Its reward curve variance is markedly lower than both TD3 and DDPG, demonstrating the effectiveness of LSTM in capturing temporal dependencies and enhancing policy robustness. The TD3-BC algorithm, combined with BC, achieves faster initial convergence by leveraging expert data to initialize the policy, thereby reducing inefficient exploration. However, its long-term stability remains inferior to TD3-LSTM-BC.
The TD3-LSTM-BC algorithm proposed in this paper synthesizes the strengths of both approaches. It converges faster than all baseline algorithms while maintaining the lowest reward fluctuations. This confirms that the prior knowledge of BC and temporal modeling capability of LSTM forms a synergistic effect, jointly optimizing policy convergence and stability in complex time-varying environments.

5.2. Tracking Control Performance

In order to evaluate the control performance of each algorithm, this section conducts a comparative analysis from two dimensions: trajectory tracking accuracy and speed error. Figure 2 shows the position tracking curves of five algorithms in three-dimensional space. The DDPG algorithm exhibits the poorest tracking performance, with severe deviations in all three axes, especially during turns and altitude changes. Its trajectory lags significantly behind the target and fails to recover, accumulating the largest final position error due to unstable policy updates and inadequate dynamic adaptation.
Among the remaining algorithms, TD3-LSTM-BC achieves the highest consistency with the target trajectory, demonstrating superior tracking precision. In contrast, TD3 shows obvious tracking lag during turns and altitude transitions, where its cumulative error is maximal. The tracking effects of TD3-LSTM and TD3-BC are comparable, both outperforming TD3 but slightly inferior to TD3-LSTM-BC. This confirms that while the LSTM structure enhances long-term dependency modeling and BC accelerates initial convergence, their isolated use still leaves room for improvement. Figure 3 shows the position tracking in the x, y, and z directions.
Further observation of the tracking error curve in Figure 4 reveals that TD3-LSTM-BC achieves significantly lower root mean square error (RMSE) and mean absolute error (MAE) compared to both the baseline TD3 and DDPG algorithms while maintaining the most stable error fluctuations across all three axes. DDPG exhibits the most unstable error profile, with its y-axis and z-axis error showing progressive divergence due to the compounding effects of spiral motion. While TD3 performs better than DDPG, it still demonstrates substantial oscillations, particularly in the y-axis direction. Both TD3-LSTM and TD3-BC show noticeable improvements over TD3, with TD3-LSTM displaying superior transient response in the x-axis and TD3-BC achieving better initial convergence—findings that align with the training performance analysis. The MAE and RMSE of the tracking position errors of the five algorithms are shown in Table 2.
For speed tracking performance, TD3-LSTM-BC maintains its advantage, achieving consistently lower speed errors than all algorithms across dynamic maneuvers, as shown in Figure 5. Although DDPG shows a smaller steady-state speed error, this may be because its strategy is more likely to fall into the local optimum and learn a conservative “slow approach” control method. Although this strategy maintains a small speed difference, it leads to a large trajectory position error accumulation due to slow response and limited action range. TD3 improves upon the instability of DDPG but still suffers from speed overshoots in rapidly changing phases, as its deterministic policy lacks explicit memory of past states. In contrast, TD3-LSTM-BC effectively mitigates these issues by combining LSTM-based temporal modeling with BC-guided policy initialization, enabling smoother velocity tracking during complex, time-varying maneuvers. These results highlight that while the local optimum of DDPG is able to yield deceptively favorable speed errors in restricted scenarios, the integrated architecture of TD3-LSTM-BC delivers superior robustness and accuracy across the full trajectory spectrum. The MAE and RMSE of tracking velocity errors of five algorithms are shown in Table 3.

5.3. Anti-Disturbance Performance

In the simulation experiment, in order to more realistically simulate the actuator uncertainty and random disturbances that the UAV control system may encounter in the real environment, Gaussian action noise is introduced into the environment.
$$\tilde{a}_t = \mathrm{clip}\big(a_t + \varepsilon_t,\ -a_{\max},\ a_{\max}\big), \qquad \varepsilon_t \sim \mathcal{N}\big(\mathbf{0},\ \sigma^2 I_d\big)$$
Here, $\tilde{a}_t$ denotes the final action executed in the environment after applying noise and clipping; $a_t$ is the original action output by the policy network; and $\varepsilon_t$ is additive Gaussian noise sampled from a zero-mean multivariate normal distribution with covariance matrix $\sigma^2 I_d$, where $\sigma$ is the noise standard deviation and $I_d$ is the $d \times d$ identity matrix, so each action dimension is perturbed independently. The function $\mathrm{clip}(\cdot, -a_{\max}, a_{\max})$ restricts each element of the noisy action to the valid action range $[-a_{\max}, a_{\max}]$, where $a_{\max}$ is the maximum action magnitude allowed by the environment. In order to evaluate the control performance of each algorithm in this disturbed environment, this section conducts a comparative analysis from two dimensions: trajectory tracking accuracy and speed error.
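The perturbation corresponds to clipped, zero-mean additive noise applied to each action before execution; a minimal sketch with an illustrative noise scale:

```python
import numpy as np

def perturb_action(a, sigma=0.5, a_max=5.0, rng=None):
    """Apply independent zero-mean Gaussian noise to each action dimension and clip to the valid range."""
    if rng is None:
        rng = np.random.default_rng()
    eps = rng.normal(loc=0.0, scale=sigma, size=np.shape(a))  # independent noise per dimension
    return np.clip(np.asarray(a) + eps, -a_max, a_max)

noisy = perturb_action([1.0, -2.0, 0.5, 0.0, 3.0, -4.0])
```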
Figure 6 shows the error curves of the five algorithms after adding Gaussian perturbations. TD3-LSTM-BC achieves significantly lower RMSE and MAE under noise perturbations, and the three-axis error fluctuations are the most stable. DDPG has the most unstable error, especially the y-axis and z-axis errors gradually diverge with the disturbance, reflecting its weak anti-interference ability. Although TD3 has improved compared to DDPG, it still has large oscillations in the y-axis direction. Both TD3-LSTM and TD3-BC have shown significant improvements on the basis of TD3. The former has better transient response on the x-axis, and the latter has faster initial convergence, which is consistent with the training performance results. The MAE comparison of the five algorithms in the disturbance and non-disturbance environments is shown in Table 4, and the RMSE comparison is shown in Table 5. Among them, w/o Dist. means that the index is measured without disturbance, and w/ Dist. means that the index is measured after disturbance.
In terms of speed tracking, as shown in Figure 7, TD3-LSTM-BC still maintains its leading advantage, and the speed error in the dynamic maneuvering stage is significantly lower than that of other algorithms. Although DDPG has a small steady-state speed error, this is because its strategy is prone to fall into local optimality and adopts a “conservative slow-forward” control method, which leads to large accumulation of trajectory position errors due to slow response and limited action amplitude [36]. TD3 is more stable than DDPG, but there is still speed overshoot in the rapid change stage. Its deterministic strategy lacks memory of historical states, which limits its anti-interference ability. In contrast, TD3-LSTM-BC, which combines LSTM time series modeling and behavioral cloning strategy initialization, effectively alleviates the above problems and achieves smoother speed tracking under complex time-varying disturbances. The results show that although the local optimality of DDPG can present good speed error under constrained conditions, the integrated architecture of TD3-LSTM-BC shows stronger robustness and control accuracy over the entire trajectory range. The MAE comparison of the five algorithms in the disturbance and non-disturbance environments is shown in Table 6, and the RMSE comparison is shown in Table 7.

5.4. Generalization Performance

The experiments in this section test the generalization ability of the reinforcement learning agent. In the three experiments described above, the interval between steps in the UAV environment was 0.1 s, corresponding to a 10 Hz update frequency. In this section, the interval is reduced to 0.05 s, corresponding to a 20 Hz update frequency.
The experimental results are shown in Figure 8, Figure 9, Figure 10 and Figure 11. Figure 8 illustrates the trajectory tracking performance of different algorithms in three-dimensional space. It can be seen that TD3-LSTM-BC tracks the reference trajectory most stably and closely, with almost no noticeable deviation in turns and complex trajectory sections. TD3-LSTM and TD3-BC perform second best, remaining roughly close to the target trajectory but exhibiting some lag in areas of rapid change. DDPG and TD3, on the other hand, tend to deviate significantly during dynamic responses, resulting in relatively poor tracking stability.
Figure 9 shows a comparison of tracking trajectories in the x, y, and z directions. The TD3-LSTM-BC curve almost completely overlaps with the target curve, demonstrating strong tracking performance. TD3-LSTM and TD3-BC also track well but exhibit slight phase differences in some sections. In contrast, the DDPG and TD3 curves exhibit significant deviation and significant oscillation.
Figure 10 shows the position tracking errors of each algorithm along the three axes. TD3-LSTM-BC achieves the smallest error amplitude and the narrowest fluctuation range, demonstrating its ability to maintain good accuracy and stability even under high-frequency control. The error curves of TD3-LSTM and TD3-BC are slightly higher but remain within a relatively small range overall. However, the errors of DDPG and TD3 are larger, particularly with significant error peaks during rapid trajectory changes.
Figure 11 illustrates the evolution of velocity error. The results show that TD3-LSTM-BC maintains the smallest fluctuation and the smoothest response curve in velocity tracking. The velocity errors of TD3-LSTM and TD3-BC are relatively small but still exhibit some jitter, while the velocity error curves of DDPG and TD3 exhibit more pronounced oscillations. Overall, TD3-LSTM-BC performed best in this experiment because it combines the temporal modeling capabilities of LSTM with the prior knowledge guidance of behavior cloning. This allows it to capture the temporal dependencies in trajectory control while leveraging expert experience to stabilize the training process. As a result, it significantly outperforms other algorithms in both position accuracy and velocity stability, demonstrating the strongest generalization ability.

6. Conclusions

Target tracking control for UAVs in complex dynamic environments remains a critical challenge due to the limitations of conventional RL algorithms in partially observable and time-varying scenarios. To address this issue, this paper proposes a biomimetically inspired TD3-LSTM-BC framework, which integrates an LSTM network for temporal state modeling and BC for policy initialization. The LSTM module mimics the memory and temporal reasoning capabilities of biological neural systems, enabling the agent to infer system dynamics from sequential observations. Meanwhile, the BC module draws on expert demonstrations to guide early-stage learning, resembling imitation learning in natural organisms and significantly enhancing training efficiency and stability. Simulation results demonstrate that the proposed TD3-LSTM-BC method outperforms baseline algorithms in terms of learning performance and robustness, highlighting its potential as a bio-inspired control solution for autonomous UAVs operating in dynamic and uncertain environments.
Nevertheless, this study has several limitations. First, the proposed method has only been validated in simulation, and deployment verification on actual UAV hardware remains to be explored. Second, the current work focuses on single-UAV trajectory tracking, while performance in multi-UAV collaborative scenarios is yet to be investigated. Third, the computational complexity of the algorithm may affect real-time applicability, which calls for further optimization. These aspects will be the focus of future research to advance the practical deployment of the proposed approach.

Author Contributions

Conceptualization, Y.Q. and J.H.; Methodology, J.H.; Software, J.H.; Validation, Y.Q.; Formal analysis, F.W.; Investigation, G.H.; Resources, F.W.; Data curation, J.H.; Writing—original draft, J.H.; Writing—review and editing, Y.Q. and F.W.; Visualization, G.H.; Supervision, G.H.; Project administration, F.W.; Funding acquisition, Y.Q. All authors have read and agreed to the published version of the manuscript.

Funding

This work is partially supported by the Guangdong Basic and Applied Basic Research Foundation under grant No. 2024A1515012283 and No. 2022A1515240058, the Natural Science Foundation of Guangdong Province under grant No. 2022A1515010178, the Key Project in Higher Education of Guangdong Province, China, under grant No. 2022DZX1045, No. 2022ZDZX4049, No. 2023ZDZX1040, and No. 2024ZDZX1046, the Social Public Welfare and Basic Research Project of Zhongshan City under grant No. 2021B2063, and the research project of Jiaying University under grant No. 2022RC127.

Data Availability Statement

The original contributions presented in this study are included in the article material. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhang, J.; Yang, X.; He, W.; Ren, J.; Zhang, Q.; Zhao, Y.; Bai, R.; He, X.; Liu, J. Scale optimization using evolutionary reinforcement learning for object detection on drone imagery. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 410–418. [Google Scholar]
  2. Lopez-Sanchez, I.; Moreno-Valenzuela, J. PID control of quadrotor UAVs: A survey. Annu. Rev. Control 2023, 56, 100900. [Google Scholar] [CrossRef]
  3. Kouvaritakis, B.; Cannon, M. Model Predictive Control; Springer International Publishing: Cham, Switzerland, 2016; Volume 38, p. 7. [Google Scholar]
  4. Yan, X.; Wang, S.; He, Y.; Ma, A.; Zhao, S. Autonomous Tracked Vehicle Trajectory Tracking Control Based on Disturbance Observation and Sliding Mode Control. Actuators 2025, 14, 51. [Google Scholar] [CrossRef]
  5. Shi, M. Application of PID Control Technology in Unmanned Aerial Vehicles. Appl. Comput. Eng. 2024, 96, 24–30. [Google Scholar] [CrossRef]
  6. Siwek, M.; Baranowski, L.; Ładyżyńska-Kozdraś, E. The Application and Optimisation of a Neural Network PID Controller for Trajectory Tracking Using UAVs. Sensors 2024, 24, 8072. [Google Scholar] [CrossRef] [PubMed]
  7. Boubakir, A.; Souanef, T.; Labiod, S.; Whidborne, J.F. A robust adaptive PID-like controller for quadrotor unmanned aerial vehicle systems. Aerospace 2024, 11, 980. [Google Scholar] [CrossRef]
  8. Hou, B.; Yin, Z.; Jin, X.; Fan, Z.; Wang, H. MPC-Based Dynamic Trajectory Spoofing for UAVs. Drones 2024, 8, 602. [Google Scholar] [CrossRef]
  9. Song, C.; Zhang, X.; She, Y.; Li, B.; Zhang, Q. Trajectory Planning for UAV Swarm Tracking Moving Target Based on an Improved Model Predictive Control Fusion Algorithm. IEEE Internet Things J. 2025, 12, 19354–19369. [Google Scholar] [CrossRef]
  10. Cui, Y.; Li, B.; Shi, M. Nonlinear Model Predictive Control for UAV Trajectory Optimization. In Guidance, Navigation and Control; Springer Nature: Singapore, 2024; pp. 405–412. [Google Scholar]
  11. Yang, J.; Wang, Y.; Wang, T.; Hu, Z.; Yang, X.; Rodriguez-Andina, J.J. Time-delay sliding mode control for trajectory tracking of robot manipulators. IEEE Trans. Ind. Electron. 2024, 71, 13083–13091. [Google Scholar] [CrossRef]
  12. Hu, F.; Ma, T.; Su, X. Adaptive fuzzy sliding-mode fixed-time control for quadrotor unmanned aerial vehicles with prescribed performance. IEEE Trans. Fuzzy Syst. 2024, 32, 4109–4120. [Google Scholar] [CrossRef]
  13. Qin, C.; Jiang, K.; Wang, Y.; Zhu, T.; Wu, Y.; Zhang, D. Event-triggered H∞ control for unknown constrained nonlinear systems with application to robot arm. Appl. Math. Model. 2025, 144, 116089. [Google Scholar] [CrossRef]
  14. Zhang, D.; Wang, Y.; Meng, L.; Yan, J.; Qin, C. Adaptive critic design for safety-optimal FTC of unknown nonlinear systems with asymmetric constrained-input. ISA Trans. 2024, 155, 309–318. [Google Scholar] [CrossRef] [PubMed]
  15. Yin, Y.; Wang, Z.; Zheng, L.; Su, Q.; Guo, Y. Autonomous UAV navigation with adaptive control based on deep reinforcement learning. Electronics 2024, 13, 2432. [Google Scholar] [CrossRef]
  16. Tang, J.; Xie, N.; Li, K.; Liang, Y.; Shen, X. Trajectory tracking control for fixed-wing UAV based on DDPG. J. Aerosp. Eng. 2024, 37, 04024012. [Google Scholar] [CrossRef]
  17. Zhu, Y.; Chen, M.; Wang, S.; Hu, Y.; Liu, Y.; Yin, C. Collaborative reinforcement learning based unmanned aerial vehicle (UAV) trajectory design for 3D UAV tracking. IEEE Trans. Mob. Comput. 2024, 23, 10787–10802. [Google Scholar] [CrossRef]
  18. Hu, W.; Wang, Y.; Chen, Q.; Wang, P.; Wu, E.; Guo, Z.; Hou, Z. TD3 Agent-Based Nonlinear Dynamic Inverse Control for Fixed-Wing UAV Attitudes. IEEE Trans. Intell. Transp. Syst. 2025, 1–12. [Google Scholar] [CrossRef]
  19. Wang, Y.; Jiang, Y.; Xu, H.; Xiao, C.; Zhao, K. Research on Unmanned Aerial Vehicle Intelligent Maneuvering Method Based on Hierarchical Proximal Policy Optimization. Processes 2025, 13, 357. [Google Scholar] [CrossRef]
  20. Mahran, Y.; Gamal, Z.; El-Badawy, A. Reinforcement Learning Position Control of a Quadrotor Using Soft Actor-Critic (SAC). In Proceedings of the 2024 6th Novel Intelligent and Leading Emerging Sciences Conference (NILES), Giza, Egypt, 19–21 October 2024; pp. 72–75. [Google Scholar]
  21. Graves, A. Long short-term memory. In Supervised Sequence Labelling with Recurrent Neural Networks; Springer: Berlin/Heidelberg, Germany, 2012; pp. 37–45. [Google Scholar]
  22. Alcayaga, J.M.; Menéndez, O.A.; Torres-Torriti, M.A.; Vásconez, J.P.; Arévalo-Ramirez, T.; Romo, A.J.P. LSTM-Enhanced Deep Reinforcement Learning for Robust Trajectory Tracking Control of Skid-Steer Mobile Robots Under Terra-Mechanical Constraints. Robotics 2025, 14, 74. [Google Scholar] [CrossRef]
  23. Luo, W.; Wang, X.; Han, F.; Zhou, Z.; Cai, J.; Zeng, L.; Chen, H.; Chen, J.; Zhou, X. Research on LSTM-PPO Obstacle Avoidance Algorithm and Training Environment for Unmanned Surface Vehicles. J. Mar. Sci. Eng. 2025, 13, 479. [Google Scholar] [CrossRef]
  24. Hu, J.; Wang, F.; Li, X.; Qin, Y.; Guo, F.; Jiang, M. Trajectory Tracking Control for Robotic Manipulator Based on Soft Actor–Critic and Generative Adversarial Imitation Learning. Biomimetics 2024, 9, 779. [Google Scholar] [CrossRef]
  25. Xing, X.; Zhou, Z.; Li, Y.; Xiao, B.; Xun, Y. Multi-UAV adaptive cooperative formation trajectory planning based on an improved MATD3 algorithm of deep reinforcement learning. IEEE Trans. Veh. Technol. 2024, 73, 12484–12499. [Google Scholar] [CrossRef]
  26. Li, L.; Zhang, X.; Qian, C.; Zhao, M.; Wang, R. Cross coordination of behavior clone and reinforcement learning for autonomous within-visual-range air combat. Neurocomputing 2024, 584, 127591. [Google Scholar] [CrossRef]
  27. Wang, Z.; Li, J.; Mahmoudian, N. Synergistic Reinforcement and Imitation Learning for Vision-driven Autonomous Flight of UAV Along River. In Proceedings of the 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Abu Dhabi, United Arab Emirates, 14–18 October 2024; pp. 9976–9982. [Google Scholar]
  28. Lockwood, O.; Si, M. A review of uncertainty for deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, Washington, DC, USA, 7–14 February 2022; Volume 18, pp. 155–162. [Google Scholar]
  29. Ahangar, A.R.; Ohadi, A.; Khosravi, M.A. A novel firefighter quadrotor UAV with tilting rotors: Modeling and control. Aerosp. Sci. Technol. 2024, 151, 109248. [Google Scholar] [CrossRef]
  30. Rezaeifar, S.; Dadashi, R.; Vieillard, N.; Hussenot, L.; Bachem, O.; Pietquin, O.; Geist, M. Offline reinforcement learning as anti-exploration. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2022; Volume 36, pp. 8106–8114. [Google Scholar]
  31. Wang, C.; Cao, R.; Wang, R. Learning discriminative topological structure information representation for 2D shape and social network classification via persistent homology. Knowl.-Based Syst. 2025, 311, 113125. [Google Scholar] [CrossRef]
  32. Wang, C.; He, S.; Wu, M.; Lam, S.K.; Tiwari, P.; Gao, X. Looking Clearer with Text: A Hierarchical Context Blending Network for Occluded Person Re-Identification. IEEE Trans. Inf. Forensics Secur. 2025, 20, 4296–4307. [Google Scholar] [CrossRef]
  33. Qin, C.; Ran, X.; Zhang, D. Unsupervised image stitching based on Generative Adversarial Networks and feature frequency awareness algorithm. Appl. Soft Comput. 2025, 183, 113466. [Google Scholar] [CrossRef]
  34. Wang, R.; Lam, S.K.; Wu, M.; Hu, Z.; Wang, C.; Wang, J. Destination intention estimation-based convolutional encoder-decoder for pedestrian trajectory multimodality forecast. Measurement 2025, 239, 115470. [Google Scholar] [CrossRef]
  35. Wang, F.; Hu, J.; Qin, Y.; Guo, F.; Jiang, M. Trajectory Tracking Control Based on Deep Reinforcement Learning for a Robotic Manipulator with an Input Deadzone. Symmetry 2025, 17, 149. [Google Scholar] [CrossRef]
  36. Sumiea, E.H.; Abdulkadir, S.J.; Alhussian, H.S.; Al-Selwi, S.M.; Alqushaibi, A.; Ragab, M.G.; Fati, S.M. Deep deterministic policy gradient algorithm: A systematic review. Heliyon 2024, 10, e30697. [Google Scholar] [CrossRef]
Figure 1. (a) Cumulative reward. (b) Standard deviation of cumulative reward. (c) Variance of cumulative reward.
Figure 2. 3D display of UAV tracking trajectory.
Figure 3. Tracking trajectories in the x, y, and z directions in Experiment 2. (a) Tracking trajectories in the x. (b) Tracking trajectories in the y. (c) Tracking trajectories in the z.
Figure 4. Tracking errors in the x, y, and z directions in Experiment 2. (a) Tracking errors in the x. (b) Tracking errors in the y. (c) Tracking errors in the z.
Figure 5. Tracking error of velocity in x, y, and z directions in Experiment 2. (a) Tracking error of velocity in x. (b) Tracking error of velocity in y. (c) Tracking error of velocity in z.
Figure 6. Tracking errors in the x, y, and z directions in Experiment 3. (a) Tracking errors in the x. (b) Tracking errors in the y. (c) Tracking errors in the z.
Figure 7. Tracking errors of velocity in x, y, and z directions in Experiment 3. (a) Tracking errors of velocity in x. (b) Tracking errors of velocity in y. (c) Tracking errors of velocity in z.
Figure 8. 3D display of UAV tracking trajectory in Experiment 4.
Figure 9. Tracking trajectories in the x, y, and z directions in Experiment 4. (a) Tracking trajectories in the x. (b) Tracking trajectories in the y. (c) Tracking trajectories in the z.
Figure 10. Tracking errors in the x, y, and z directions in Experiment 4. (a) Tracking errors in the x. (b) Tracking errors in the y. (c) Tracking errors in the z.
Figure 11. Tracking error of velocity in x, y, and z directions in Experiment 4. (a) Tracking error of velocity in x. (b) Tracking error of velocity in y. (c) Tracking error of velocity in z.
Table 1. UAV simulation parameters (Newton–Euler model).
Category | Parameter | Value
Physical Properties | Mass (m) | 1.0 kg
Physical Properties | Linear Acceleration Bound | ±5.0 m/s²
Physical Properties | Angular Acceleration Bound | ±5.0 rad/s²
Physical Properties | Euler Angle Constraints | Roll/Pitch: [−π/2, π/2]; Yaw: [−π, π]
State Space | Position (x, y, z) | ℝ³ (unbounded)
State Space | Velocity (vx, vy, vz) | ℝ³
State Space | Attitude (roll, pitch, yaw) | [−π/2, π/2] × [−π/2, π/2] × [−π, π]
State Space | Angular Velocity | ℝ³
Action Space | Linear + Angular Acc. | 6D: [−5.0, 5.0]⁶
Simulation | Time Step (Δt) | 0.1 s
Simulation | Max Episode Steps | 200
Table 2. MAE and RMSE of tracking position errors of five algorithms.
Axis | Metric | TD3-LSTM-BC | TD3-BC | TD3-LSTM | TD3 | DDPG
X | MAE | 0.1249 | 0.1415 | 0.1394 | 0.1877 | 0.5088
X | RMSE | 0.5033 | 0.5057 | 0.5039 | 0.5115 | 0.7492
Y | MAE | 0.0113 | 0.0526 | 0.0477 | 0.0771 | 0.6744
Y | RMSE | 0.0157 | 0.0675 | 0.0600 | 0.0894 | 1.0208
Z | MAE | 0.0148 | 0.0327 | 0.0461 | 0.0630 | 0.7880
Z | RMSE | 0.0183 | 0.0471 | 0.0690 | 0.0800 | 1.1669
Overall | MAE | 0.0503 | 0.0756 | 0.0777 | 0.1093 | 0.6571
Overall | RMSE | 0.1791 | 0.2068 | 0.2110 | 0.2270 | 0.9789
Table 3. MAE and RMSE of tracking velocity errors of five algorithms.
Axis | Metric | TD3-LSTM-BC | TD3-BC | TD3-LSTM | TD3 | DDPG
X | MAE | 0.7169 | 0.7235 | 0.7158 | 0.7416 | 0.7370
X | RMSE | 0.8493 | 0.8570 | 0.8534 | 0.8902 | 0.8492
Y | MAE | 0.5519 | 0.5578 | 0.5595 | 0.5477 | 0.5239
Y | RMSE | 0.6261 | 0.6237 | 0.6275 | 0.6308 | 0.5885
Z | MAE | 0.2962 | 0.2929 | 0.2936 | 0.3049 | 0.2031
Z | RMSE | 0.3124 | 0.3166 | 0.3130 | 0.3418 | 0.2897
Overall | MAE | 0.5216 | 0.5247 | 0.5230 | 0.5314 | 0.4880
Overall | RMSE | 0.5959 | 0.5991 | 0.5980 | 0.6210 | 0.5758
Table 4. Comparison of the MAE of position error with and without interference.
Axis | Condition | TD3-LSTM-BC | TD3-BC | TD3-LSTM | TD3 | DDPG
X | w/o Dist. | 0.1249 | 0.1415 | 0.1394 | 0.1877 | 0.5088
X | w/ Dist. | 0.1291 | 0.1512 | 0.1701 | 0.2240 | 0.5009
Y | w/o Dist. | 0.0113 | 0.0526 | 0.0477 | 0.0771 | 0.6744
Y | w/ Dist. | 0.0125 | 0.0587 | 0.0486 | 0.0861 | 0.6519
Z | w/o Dist. | 0.0148 | 0.0327 | 0.0461 | 0.0630 | 0.7880
Z | w/ Dist. | 0.0162 | 0.0468 | 0.0494 | 0.0769 | 0.7821
Overall | w/o Dist. | 0.0503 | 0.0756 | 0.0777 | 0.1093 | 0.6571
Overall | w/ Dist. | 0.0526 | 0.0856 | 0.0894 | 0.1290 | 0.6450
Table 5. Comparison of the RMSE of position error with and without interference.
Axis | Condition | TD3-LSTM-BC | TD3-BC | TD3-LSTM | TD3 | DDPG
X | w/o Dist. | 0.5033 | 0.5057 | 0.5039 | 0.5115 | 0.7492
X | w/ Dist. | 0.5051 | 0.5208 | 0.5321 | 0.5559 | 0.7591
Y | w/o Dist. | 0.0157 | 0.0675 | 0.0600 | 0.0894 | 1.0208
Y | w/ Dist. | 0.0172 | 0.0750 | 0.0586 | 0.0963 | 0.9653
Z | w/o Dist. | 0.0183 | 0.0471 | 0.0690 | 0.0800 | 1.1669
Z | w/ Dist. | 0.0198 | 0.0578 | 0.0666 | 0.0931 | 1.1552
Overall | w/o Dist. | 0.1791 | 0.2068 | 0.2110 | 0.2270 | 0.9789
Overall | w/ Dist. | 0.1807 | 0.2179 | 0.2191 | 0.2484 | 0.9599
Table 6. Comparison of the MAE of velocity error with and without interference.
Axis | Condition | TD3-LSTM-BC | TD3-BC | TD3-LSTM | TD3 | DDPG
X | w/o Dist. | 0.7169 | 0.7235 | 0.7158 | 0.7416 | 0.7370
X | w/ Dist. | 0.7205 | 0.7340 | 0.7668 | 0.7856 | 0.7795
Y | w/o Dist. | 0.5519 | 0.5578 | 0.5595 | 0.5477 | 0.5239
Y | w/ Dist. | 0.5532 | 0.5653 | 0.5671 | 0.5680 | 0.5186
Z | w/o Dist. | 0.2962 | 0.2929 | 0.2936 | 0.3049 | 0.2031
Z | w/ Dist. | 0.2981 | 0.3094 | 0.3225 | 0.3217 | 0.2593
Overall | w/o Dist. | 0.5216 | 0.5247 | 0.5230 | 0.5314 | 0.4880
Overall | w/ Dist. | 0.5239 | 0.5362 | 0.5521 | 0.5584 | 0.5192
Table 7. Comparison of the RMSE of velocity error with and without interference.
Axis | Condition | TD3-LSTM-BC | TD3-BC | TD3-LSTM | TD3 | DDPG
X | w/o Dist. | 0.8493 | 0.8570 | 0.8534 | 0.8902 | 0.8492
X | w/ Dist. | 0.8510 | 0.8764 | 0.9438 | 0.9376 | 0.8951
Y | w/o Dist. | 0.6261 | 0.6237 | 0.6275 | 0.6308 | 0.5885
Y | w/ Dist. | 0.6275 | 0.6536 | 0.6540 | 0.6619 | 0.6032
Z | w/o Dist. | 0.3124 | 0.3166 | 0.3130 | 0.3418 | 0.2897
Z | w/ Dist. | 0.3142 | 0.3647 | 0.3922 | 0.3814 | 0.3568
Overall | w/o Dist. | 0.5959 | 0.5991 | 0.5980 | 0.6210 | 0.5758
Overall | w/ Dist. | 0.5976 | 0.6316 | 0.6633 | 0.6603 | 0.6184
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
