SCPL-TD3: An Intelligent Evasion Strategy for High-Speed UAVs in Coordinated Pursuit-Evasion

Zhang, Xiaoyan; Yan, Tian; Li, Tong; Liu, Can; Jiang, Zijian; Yan, Jie

doi:10.3390/drones9100685

Open AccessArticle

SCPL-TD3: An Intelligent Evasion Strategy for High-Speed UAVs in Coordinated Pursuit-Evasion

by

Xiaoyan Zhang

¹

,

Tian Yan

^1,*

,

Tong Li

¹,

Can Liu

¹

,

Zijian Jiang

² and

Jie Yan

¹

Unmanned System Research Institute, Northwestern Polytechnical University, Xi’an 710072, China

²

Institute of Acoustics, Chinese Academy of Sciences, Beijing 100080, China

^*

Author to whom correspondence should be addressed.

Drones 2025, 9(10), 685; https://doi.org/10.3390/drones9100685

Submission received: 28 August 2025 / Revised: 24 September 2025 / Accepted: 28 September 2025 / Published: 2 October 2025

Download

Browse Figures

Versions Notes

Abstract

Highlights

What are the main findings?

Proposes the SCPL-TD3 strategy to enable effective evasion for High-Speed UAVs against coordinated pursuers.
Analyzes and classifies the impact of pursuer spacing on evasion difficulty levels.

What is the implication of the main finding?

Achieves superior evasion rates while minimizing resource costs, thereby preserving operational capability for subsequent missions.
Provides a foundational framework and a critical decision-making metric for assessing evasion difficulty and optimizing vehicle trajectory in complex pursuit-evasion scenarios.

Abstract

The rapid advancement of kinetic pursuit technologies has significantly increased the difficulty of evasion for high-speed UAVs (HSUAVs), particularly in scenarios where two collaboratively operating pursuers approach from the same direction with optimized initial space intervals. This paper begins by deriving an optimal initial space interval to enhance cooperative pursuit effectiveness and introduces an evasion difficulty classification framework, thereby providing a structured approach for evaluating and optimizing evasion strategies. Based on this, an intelligent maneuver evasion strategy using semantic classification progressive learning with twin delayed deep deterministic policy gradient (SCPL-TD3) is proposed to address the challenging scenarios identified through the analysis. Training efficiency is enhanced by the proposed SCPL-TD3 algorithm through the employment of progressive learning to dynamically adjust training complexity and the integration of semantic classification to guide the learning process via meaningful state-action pattern recognition. Built upon the twin delayed deep deterministic policy gradient framework, the algorithm further enhances both stability and efficiency in complex environments. A specially designed reward function is incorporated to balance evasion performance with mission constraints, ensuring the fulfillment of HSUAV’s operational objectives. Simulation results demonstrate that the proposed approach significantly improves training stability and evasion effectiveness, achieving a 97.04% success rate and a 7.10–14.85% improvement in decision-making speed.

Keywords:

reinforcement learning; intelligent evasion strategy; pursuit-evasion game; high-speed UAVs

1. Introduction

High-speed unmanned aerial vehicles (HSUAVs) possess distinct characteristics, including high flight speeds and exceptional evasion capabilities [1,2,3]. Therefore, HSUAVs are increasingly utilized for critical missions, including reconnaissance, and rapid mobility [4]. However, advancements in pursuit technologies—particularly the development of sophisticated pursuers with optimized guidance systems—have significantly increased the complexity of evasive maneuvers for HSUAVs [5,6]. This challenge becomes particularly pronounced when two pursuers approach from the same direction with precisely coordinated spatial and temporal intervals [7,8]. The evolving threat landscape underscores the need for the development of intelligent and adaptive evasion strategies for HSUAVs.

Conventional evasive maneuvers for HSUAVs primarily rely on predefined maneuver patterns [9,10]. However, these approaches are highly predictable, making them susceptible to tracking by pursuers while lacking the flexibility required for dynamic threat environments. In response, recent research has explored active evasion strategies. These methods can be broadly divided into two categories: threat-avoidance trajectory optimization—exemplified by the Gaussian pseudospectral method [11,12], convex optimization [13,14,15], and predictor–corrector methods [16,17,18]—and guidance laws based on modern control theory [19,20]. However, threat-avoidance trajectory optimization is highly conservative due to its reliance on complete spatial evasion. This approach often necessitates extensive trajectory deviations, leading to increased travel distance and excessive energy consumption, imposing significant operational constraints on HSUAVs.

Modern control-based evasion guidance laws address spatiotemporal engagement dynamics between HSUAVs and pursuers, primarily using optimal control [21] or differential game theory [22]. Yu et al. [23] developed optimal control-based evasion laws maintaining minimum interceptor distance, and Liu et al. [24] validated optimal evasion via 6-DOF models. However, these methods are only effective when target maneuver patterns are known, yielding unilateral optimal solutions that significantly limit practical applicability. In reality, engagements are inherently dynamic, characterized by continuous interactions between opposing forces in a two-sided game [25]. Differential game theory provides a framework for bilateral optimization [26], as demonstrated in Wang et al.’s [27] midcourse evasion strategy. However, such approaches often require significant simplifications due to computational constraints. Moreover, both methods rely heavily on precise environmental and dynamic modeling, which limits their effectiveness in highly uncertain or unpredictable real-world scenarios.

Deep reinforcement learning (DRL) combines deep learning and reinforcement learning to enable model-free decision-making in complex environments [28,29]. However, its application to HSUAV evasion, especially under coordinated pursuit, remains underexplored. Existing research on flight vehicle evasion has yielded a series of significant findings [30,31]. Li et al. [32] proposed an evasion strategy based on virtual targets and the contextual Markov decision process. To enhance adaptability and learning efficiency, Yan et al. [33] developed an unmanned aerial vehicle evasion guidance law using deep reinforcement learning in an approximate head-on pursuit–evasion (PE) scenario. These studies primarily focus on one-on-one PE scenarios, limiting their applicability to coordinated pursuit engagements. Furthermore, since reinforcement learning is an unguided, experience-free learning paradigm, deep reinforcement learning-based evasion strategies often suffer from prolonged training times and slow convergence in complex adversarial environments.

The literature on HSUAV evasion strategies presents several critical limitations:

Most existing studies focus primarily on one-on-one PE scenarios [30,31,32,33,34], leaving a critical void in strategies against coordinated pursuit, which involves higher-dimensional state spaces and complex interaction dynamics;
Existing methodologies primarily emphasize the successful execution of evasion maneuvers while neglecting the extent of deviation from the intended trajectory [23,24,25,26,27]. This oversight reduces the probability of successfully reaching the target zone post-evasion, thereby limiting mission effectiveness;
Classical methods suffer from computational intractability in high-dimensional state spaces and reliance on idealized assumptions, while existing DRL approaches lack sample efficiency and exhibit slow convergence as well as training instability.

To address these challenges, this study focuses on developing an intelligent evasion strategy for HSUAVS in the coordinated PE scenario where two pursuers approach from the same direction with optimized initial space intervals. The primary contributions of this study are as follows:

A semantic classification progressive learning framework, combined with the Twin Delayed Deep Deterministic Policy Gradient (SCPL-TD3) algorithm, is introduced to improve agent training in complex PE environments. Unlike conventional fixed training paradigms, SCPL-TD3 dynamically adjusts training complexity and leverages semantic classification to prioritize critical state-action patterns, thereby providing targeted guidance during policy learning. Furthermore, by building upon the twin delayed deep deterministic policy gradient framework, the proposed method not only accelerates convergence but also significantly improves training stability and sample efficiency in dynamic and high-dimensional environments;
This work presents an analysis of a coordinated PE scenario involving two pursuers, from which an optimal initial spatial interval is derived to enhance the effectiveness of coordinated pursuit maneuvers. Moreover, an evasion difficulty classification framework is introduced, which systematically categorizes two-pursuer PE scenarios into distinct levels according to their spatial and dynamic constraints. This framework provides a structured method for evaluating and optimizing evasion strategies under cooperative pursuer conditions;
A reward function is proposed that integrates energy consumption constraints and critical miss distance into a unified optimization objective. This multi-objective design not only minimizes evasion cost but also maintains sufficient operational margins, thereby ensuring the reliable accomplishment of subsequent mission objectives. By effectively balancing immediate evasion demands and overall mission requirements, the proposed reward mechanism significantly enhances the practicality and robustness of the learned evasion strategies.

A comparison of the aforementioned methods is presented in Table 1 in order to more clearly illustrate their respective advantages and limitations.

2. Problem Formulation and Vehicle Modeling

This paper investigates the scenario in which a HSUAV is subject to pursuit by two cooperative pursuers. A central focus is the derivation of the optimal initial space interval to enhance cooperative pursuit effectiveness, alongside the establishment of an evasion difficulty classification framework that provides a formal basis for quantifying and evaluating evasion algorithm performance. Building upon this formulation, a reinforcement learning algorithm—SCPL-TD3—is designed to address critical challenges including real-time decision-making, sample efficiency, and training stability in adversarial settings. The proposed algorithm is subsequently evaluated under the scenarios defined by the classification framework.

This section formulates multiple cooperative PE engagements, provides rigorous definitions of key concepts, and introduces the vehicle and engagement dynamics models, thereby laying the theoretical foundation for subsequent algorithm development and performance analysis.

2.1. Coordinated PE Scenario Construction

In a one-on-two confrontation scenario, the initial spatial separation between the two pursuers plays a critical role in determining their ability to sustain a continuous threat. This section classifies these scenarios into two difficulty levels—easy and hard—based on the initial interval and presents the corresponding evasion strategies. To systematically characterize the complexity of one-on-two confrontation scenarios, the following definitions are introduced:

Definition 1.

The Single Pursuer Equivalence Condition (SPEC) is defined as the state in which the space interval

Δ X (t)

between the two pursuers satisfies

Δ X (t) < A_{1}

, where

A_{1}

is a predefined threshold. Under this condition, the pursuers are considered sufficiently close to be treated as a single entity for evasion analysis and strategy design. This concept is illustrated in Figure 1.

Definition 2.

The Non-Continuous Threat Condition (NCTC) is defined as the state in which the space interval between the two pursuers

Δ X (t) > A_{2}

, where

A_{2}

is a predefined threshold such that

A_{2} > A_{1}

. When this condition is met, the distance between the pursuers is sufficiently large to prevent them from forming a continuous threat. Under these circumstances, the HSUAV can successfully evade using a one-on-one evasion strategy, as illustrated in Figure 2.

Definition 3.

The Coordinated Threat Condition (CTC) is defined as the state in which the space interval

Δ X (t)

between the two pursuers satisfies

Δ X (t) \in [A_{1}, A_{2}]

where

A_{1}

and

A_{2}

are predefined thresholds with

A_{2} > A_{1}

. Under this condition, the two pursuers maintain a persistent threat, preventing the HSUAV from escaping either simultaneously or sequentially (as illustrated in Figure 3). In such cases, a specialized evasion strategy must be designed to counteract the coordinated posture.

In this study, the one-on-two confrontation scenario is classified into two levels of difficulty based on the space interval

Δ X

. The SPEC and NCTC are categorized as simple scenarios, whereas the CTC represents a difficult scenario requiring advanced evasion techniques. The method for determining the space interval range in difficult scenarios is detailed in Section 3.1. This paper primarily focuses on the challenges posed by the difficult scenario and proposes targeted strategies to enhance evasion performance under such conditions.

For clarity, the confrontation scenarios are designated as follows: the confrontation scenario under the SPEC is referred to as Scenario 1; the confrontation scenario under the NCTC is referred to as Scenario 2; and the confrontation scenario under the CTC is referred to as Scenario 3.

2.2. Vehicle Modeling

To accurately capture the dynamic characteristics of the vehicle during flight, the kinematic and dynamic models of the HSUAV and interceptors are defined as follows:

\{\begin{cases} \frac{d V_{i}}{d t} = g (n_{i x} - \sin θ_{i}) \\ \frac{d θ_{i}}{d t} = \frac{g}{V_{i}} (n_{i y} - \sin θ_{i}) \\ \frac{d ψ_{V i}}{d t} = - \frac{g}{V_{i} \cos θ_{i}} n_{i z} \end{cases}

(1)

\{\begin{cases} \frac{d x_{i}}{d t} = V_{i} \cos θ_{i} \cos ψ_{V i} \\ \frac{d y_{i}}{d t} = V_{i} \sin θ_{i} \\ \frac{d z_{i}}{d t} = - V_{i} \cos θ_{i} \sin ψ_{V i} \end{cases}

(2)

where

i = H, I_{ς}

and

H

represent HSUAV, and

I_{ς}

represents the

ς th

pursuer (

ς = 1, 2

);

V_{i}

represents the speed of the vehicle;

g

represents the acceleration of gravity;

θ_{i}

denotes the ballistic inclination angle;

ψ_{V i}

denotes the ballistic deflection angle;

n_{i x}

,

n_{i y}

,

n_{i z}

denote the overload components along the three axes in the ballistic coordinate system, whereas

x

,

y

,

z

represent the displacement components in the geographical coordinate system.

To enhance the realism of the confrontation scenario, the autopilot characteristics are incorporated into the modeling process. In accordance with Assumption 1 and for computational efficiency, the autopilot model is approximated as a first-order dynamical system. The relationship between the actual overload and the commanded overload is expressed as follows:

\frac{n_{i} (s)}{n_{i c} (s)} = \frac{1}{1 + τ_{i} s}

(3)

where

n_{i}

represents the actual aircraft overload;

n_{i c}

represents the aircraft overload command;

τ_{i}

is the time constant characterizing the first-order system response.

The pursuers utilize an Augmented Proportional Navigation (APN) guidance law, incorporating both longitudinal and lateral overload commands, given by:

\{\begin{cases} n_{I_{ς} y} = \frac{N V_{H I_{ς}} {\dot{λ}}_{H I_{ς} P}}{g} + \cos θ_{I_{ς}} + \frac{1}{2} n_{H y} \\ n_{I_{ς} z} = - \frac{N V_{H I_{ς}} {\dot{λ}}_{H I_{ς} Y} \cos θ_{I_{ς}}}{g} + \frac{1}{2} n_{H z} \end{cases}

(4)

where

n_{I_{ς} y}

and

n_{I_{ς} z}

denote the longitudinal and lateral overload commands of the

ς th (ς = 1, 2)

pursuer. The parameter

V_{H I_{ς}}

represents the relative velocity between the HSUAV and the pursuer, while

N \in [3, 5]

is the navigation factor.

During the cruise phase, the HSUAV’s maneuvering flight consists of longitudinal and lateral maneuvers, which may occur independently or simultaneously. In this study, lateral maneuvers are prioritized, because lateral maneuvers help mitigate the influence of speed and altitude variations on attitude angle and control system performance. Therefore, the confrontation model is formulated in the lateral plane for simplification.

When two pursuers are present, the schematic representation of the confrontation in the lateral plane, along with relevant angle definitions, is depicted in Figure 4. Without loss of generality, the initial line-of-sight direction of

H - I_{ς}

is aligned with the X-axis, while the Z-axis remains perpendicular to the X-axis within the fixed-altitude plane.

Based on the geometric relationships in Figure 4, and considering the first-order autopilot model, the relative kinematic equation is expressed as follows:

\{\begin{cases} {\dot{r}}_{H I_{ς}} = - V_{H} \cos (ψ_{V H} - λ_{H I_{ς}}) + V_{I_{ς}} \cos (ψ_{V I_{ς}} + λ_{H I_{ς}}) \\ {\dot{λ}}_{H I_{ς}} = (V_{H} \sin (ψ_{V H} - λ_{H I_{ς}}) - V_{I_{ς}} \sin (ψ_{V I_{ς}} + λ_{H I_{ς}})) / r_{H I_{ς}} \\ {\ddot{r}}_{H I_{ς}} = n_{H} g \sin (ψ_{V H} - λ_{H I_{ς}}) + n_{I_{ς}} g \sin (ψ_{V I_{ς}} + λ_{H I_{ς}}) + r_{H I_{ς}} {\dot{λ}}_{H I_{ς}}^{2} \\ {\ddot{λ}}_{H I_{ς}} = (n_{H} g \cos (ψ_{V H} - λ_{H I_{ς}}) + n_{I_{ς}} g \cos (ψ_{V I_{ς}} + λ_{H I_{ς}}) - 2 {\dot{r}}_{H I_{ς}} {\dot{λ}}_{H I_{ς}}) / r_{H I_{ς}} \\ {\dot{ψ}}_{V H} = n_{H} g / V_{H} \\ {\dot{ψ}}_{V I_{ς}} = - n_{I_{ς}} g / V_{I_{ς}} \end{cases}

(5)

For the terminal guidance problem, it is assumed that the confrontation occurs near the initial triangular collision region, where the line-of-sight angle

λ_{H I_{ς}} (ς = 1, 2)

is small with minimal variation. Here, it is assumed that the kinematic confrontation model can be linearized along the X-axis (Assumption 2). Defining the state variable as

X_{H I_{ς}} = {[z_{H I_{ς}}, {\dot{z}}_{H I_{ς}}, n_{H}, n_{I_{ς}}]}^{T}

, where

z_{H I_{ς}} = z_{H} - z_{I_{ς}}

,

{\dot{z}}_{H I_{ς}}

is the is the derivative of

z_{H I_{ς}}

.

n_{H}

and

n_{I_{ς}}

are the normal overloads of HSUAV and pursuers, respectively. The linearized confrontation model is given by:

{\dot{X}}_{H I_{ς}} = A X_{H I_{ς}} + B_{H} n_{H c} + B_{I_{ς}} n_{I_{ς} c}

(6)

where

A = [\begin{matrix} 0 & 1 & 0 & 0 \\ 0 & 0 & g \cos ψ_{V H 0} & g \cos ψ_{V I_{ς} 0} \\ 0 & 0 & - \frac{1}{τ_{H}} & 0 \\ 0 & 0 & 0 & - \frac{1}{τ_{I_{ς}}} \end{matrix}], B_{H} = [\begin{matrix} 0 \\ 0 \\ \frac{1}{τ_{H}} \\ 0 \end{matrix}], B_{I_{ς}} = [\begin{matrix} 0 \\ 0 \\ 0 \\ \frac{1}{τ_{I_{ς}}} \end{matrix}]

(7)

Assumption 1

[23]. This study assumes that both the HSUAV and the pursuers have access to real-time state information of the other, enabled by their respective sensor systems. Specifically, the HSUAV is equipped with a radar warning receiver, while the pursuers employ active radar or infrared seekers.

Assumption 2

[35]. The autopilot dynamics of both the HSUAV and the pursuers are approximated by a first-order system. This is a standard and widely adopted simplification in guidance and control literature for high-performance vehicles. While real-world autopilots are higher-order systems, their dominant closed-loop behavior in the pursuit-evasion context is often effectively captured by a first-order lag, prioritizing computational tractability for large-scale Monte Carlo simulations without significantly sacrificing the fidelity of the engagement realism.

Assumption 3

[36]. The kinematic confrontation model is linearized along the X-axis. This simplification is justified by the endgame scenario characteristics, wherein the extremely high relative velocity (exceeding Mach 10) results in a very short engagement time, and the limited maneuverability (with accelerations commonly bounded within 80 m/s²) ensures that the ballistic declination angles of both adversaries remain nearly constant. This approach, prevalent in terminal-phase guidance studies, is essential for deriving analytical solutions and facilitates the tractable design and analysis of the proposed intelligent evasion strategy.

3. Methods

3.1. Initial Common Space Interval Generation

Due to speed disadvantages relative to HSUAV, pursuers must adopt an approximately head-on posture to maximize effectiveness. This posture is adopted in the present study. This section employs Monte Carlo simulations to determine the common space interval that enables pursuers to achieve a coordinated pursuit posture. Three typical maneuvering patterns of HSUAV are considered, providing a basis for designing an optimal evasion strategy.

A hybrid framework combining offline training and online application is developed to determine the space interval between interceptors.

The offline training phase consists of three key steps. First, three representative maneuver commands for HSUAV are selected: (1) constant-value maneuver (HSUAV maintains a constant maximum available overload). (2) Bang-Bang maneuver (HSUAV initially applies full positive overload for the first interceptor, followed by full negative overload for the second interceptor). (3) HSUAV follows maneuvering strategies documented in the Ref. [9]. Second, Monte Carlo simulations are conducted to explore space intervals, identifying those that facilitate coordinated posture. Finally, neural networks are trained to approximate the common space intervals, leading to the development of a space interval estimation network.

The online application phase utilizes the trained estimation network to determine the appropriate initial interval between the two pursuers in real time. Given a new confrontation scenario, the initial interval between the first pursuer and HSUAV is input into the network, which then estimates the interval required to establish a coordinated pursuit posture. The algorithmic process is illustrated in Figure 5.

3.2. SCPL-TD3 Algorithm

To address the complexities of coordinated PE game and reduce training difficulty while improving algorithm convergence under adverse conditions, this section introduces targeted enhancements to the TD3 algorithm to develop an effective evasion strategy for HSUAV.

Building on this objective, the semantic classification progressive learning with twin delayed deep deterministic policy gradient (SCPL-TD3) algorithm is proposed. This algorithm is designed to enhance learning efficiency, stability, and adaptability, thereby ensuring robust performance in high-dimensional and adversarial environments. To establish a foundation for the proposed improvements, an overview of the traditional TD3 algorithm is first provided, outlining its key mechanisms and inherent limitations.

The TD3 algorithm is an optimization-based extension of DDPG. During training, the agent generates information

(s_{t}, a_{t}, s_{t + 1}, r_{t})

at each step and stores it as a tuple in the experience replay buffer (ERB)

D

. The actor and critic networks update their parameters by randomly selecting small batches of data from this buffer. The critic network is trained by minimizing the following loss function

L (θ^{Q})

:

\{\begin{cases} L (θ^{Q_{i}}) = E_{s_{t}, a_{t}, r_{t}, s_{t + 1}} [{(y_{i} - Q_{i} (s_{i}, a_{i} | θ^{Q_{i}}))}^{2}] \\ y_{i} = r_{i} + γ \min_{i = 1, 2} Q_{i}^{'} (s_{t + 1}, \tilde{a}) \\ \tilde{a} \leftarrow π_{ϕ}^{T} (s_{t + 1}) + ϵ, ϵ \sim c l i p (N (0, \tilde{σ}), - c, c) \end{cases}

(8)

where

\tilde{α}

represents the action space augmented with Gaussian white noise to encourage exploration and prevent the policy from converging to a local optimum,

θ^{Q_{i}}

is the parameter of the critic network,

θ^{Q_{i}^{'}}

is the parameter of the critic target network,

i = 1, 2

,

y_{i}

is the target value, and

c

defines the noise truncation range.

The gradient update for the actor network parameters is expressed as follows:

\nabla_{θ^{μ}} J (θ^{μ}) \approx \frac{1}{N} \sum_{i} \nabla_{a} Q_{i} (s, a | θ^{Q_{i}}) |_{s = s_{i}, a = μ (s_{i})} \nabla_{θ^{μ}} μ (s | θ^{μ}) |_{s_{i}}

(9)

where

θ^{μ}

denotes the parameters of the actor network;

\nabla_{a} Q_{i} (s, a | θ^{Q_{i}})

represents the gradient of the critic network

Q_{i}

with respect to action

a

;

\nabla_{θ^{μ}} μ (s | θ^{μ})

corresponds to the gradient of the actor network with respect to its parameter

θ^{μ}

.

In the TD3 framework, deep neural networks parameterized by

θ^{μ}

,

θ^{Q}

, and

θ^{Q^{'}}

are employed to represent the actor network, critic network, and critic target network, respectively. The optimization process in TD3 is centered on three key modifications: (1) Mitigating overestimation bias in the critic network within the DDPG framework through a strategy similar to the dual Q-network approach. (2) Delaying the actor network update, a technique that improves algorithmic stability. (3) Introducing noise into the actor target network to enhance resilience against perturbations. These modifications collectively define the TD3 algorithm, as illustrated in Figure 6.

While these enhancements improve performance in many continuous control tasks, they are not inherently designed to address the challenges of cooperative interception scenarios, such as rapid adaptation to dynamic threats and efficient exploration of high-dimensional state spaces. To overcome these limitations, further improvements to the TD3 algorithm are proposed in this study.

In cooperative PE scenarios, HSUAVs encounter significant challenges, including high-dimensional state spaces, dynamic adversarial interactions, and the need for rapid adaptation. To address these challenges, progressive learning is integrated into the training framework. Initially, the HSUAV undergoes training in a basic evasion scenario to acquire fundamental strategies such as attitude adjustment and overload control. Once proficiency is achieved, the training advances to a more complex cooperative PE scenario, where the HSUAV must navigate intricate interactions with multiple pursuers. This progressive learning approach not only accelerates the acquisition of evasion tactics but also significantly enhances adaptability to diverse and dynamic confrontation scenarios, providing a robust and scalable solution for real-world deployment.

To further improve algorithmic efficiency and mitigate the limitations associated with random experience selection, the SCPL-TD3 algorithm incorporates an importance- based classification mechanism to optimize experience replay. Specifically, SCPL-TD3 employs three ERBs to categorize and store experience samples based on their relative importance, prioritizing high-importance samples to accelerate learning. The classification methodology is structured as follows:

At the initialization stage, the importance of all samples in the three ERBs is set to zero. When a new experience sample is generated, the first step is to determine whether the agent successfully evades one or two interceptors. If the evasion attempt fails, the sample is assigned an importance value of −1 and stored in the corresponding experience buffer

Ξ_{l o w}

. Conversely, the average immediate reward value of all successful experience samples is updated. The next step involves comparing the sample’s immediate reward value to this updated average. If the sample’s immediate reward exceeds the average of successful experiences, it is assigned an importance value of 1 and placed in the corresponding experience buffer

Ξ_{h i g h}

. Otherwise, the sample is assigned an importance value of 0 and stored in the designated buffer

Ξ_{m e d i u m}

.

During training, a greater proportion of high-importance experiences should be utilized, particularly those stored in the high and medium importance ERBs

Ξ_{h i g h}

and

Ξ_{m e d i u m}

. Accordingly, experience replay is conducted proportionally, as described by the following sampling method:

\{\begin{cases} N_{1} = ℘ (e^{- ϖ}) M \\ N_{2} = l M \\ N_{3} = M - N_{1} - N_{2} \end{cases}

(10)

where

N_{1}

,

N_{2}

, and

N_{3}

represent the quantities sampled from the high, medium, and low importance ERBs, respectively;

M

denotes the total number of samples;

l \in (0, 1)

represents the medium- importance sample sampling rate;

℘ (\cdot)

is the precision function, retaining three decimal places; and

ϖ

denotes the total number of samples. The first term indicates that as the total sample size increases, the proportion of samples drawn from

Ξ_{l o w}

decreases.

Based on these considerations, the framework for constructing the HSUAV evasion model using the SCPL-TD3 algorithm is presented in Figure 7.

In the figure, Subtask 1 corresponds to Scenario 2, while Subtask 2 corresponds to Scenario 3. The term

a_{t} = ℵ_{G a u s s i a n} (μ_{t}, σ^{2})

represents the action space augmented with Gaussian white noise, and

B_{s a m p l e}

denotes the sampled information.

The selection of Scenario 2 as Subtask 1 is based on the following rationale: In Scenario 1, the spatial separation between the two pursuers is sufficiently small, allowing the HSUAV to evade them simultaneously. This characteristic makes Scenario 1 less suitable for designing evasion strategies applicable to coordinated PE scenarios. Scenario 2, which presents a more challenging and realistic context, is therefore selected as Subtask 1 to support the development of effective evasion strategies.

Remark 1.

The training process employs a progressive learning strategy, structured into two distinct phases. Initially, the model is trained on simple scenarios from sub-task 1 until full convergence is achieved. The parameters obtained from this phase are then saved and used to initialize the model for the subsequent phase (sub-task 2), with this process advancing sequentially. The escalation in complexity is primarily manifested through gradual increases in data dimensionality and task difficulty. Transition between phases is determined based on the model’s convergence on the current sub-task, specifically when performance on the validation set meets a predefined threshold. This ensures that knowledge from the preceding phase is thoroughly assimilated before proceeding to more complex scenarios. This approach not only enhances the stability of the training process but also facilitates efficient knowledge transfer in challenging settings.

3.3. Design of State Space and Motion Space

The HSUAV perceives its current state and makes decisions based on available state information. However, the high-dimensional nature of the state space complicates the extraction of relevant information and reduces computational efficiency. Additionally, insufficient state information decreases the distinguishability between states, impairing decision-making accuracy and increasing the likelihood of errors. To address these challenges, a normalized state space is formulated based on the relative motion equation, integrating relative motion parameters, HSUAV state variables, and pursuer position data, as expressed in the following equation:

s_{s t a t e} = {[r_{H I_{1}} / r_{H I_{1} 0}, r_{H I_{2}} / r_{H I_{2} 0}, λ_{H I_{1}}, λ_{H I_{2}}, σ_{1} {\dot{λ}}_{H I_{1}}, σ_{2} {\dot{λ}}_{H I_{2}}, V_{H} / V_{I_{1}}, V_{H} / V_{I_{2}}]}^{T}

(11)

where:

r_{H I_{1}}

and

r_{H I_{2}}

denote the relative distances between the HSUAV and the two pursuers, while

r_{H I_{1} 0}

and

r_{H I_{2} 0}

represent their initial values. The parameters

λ_{H I_{1}}

and

λ_{H I_{2}}

correspond to the line-of-sight angles between the HSUAV and the pursuers. The HSUAV’s flight speed is given by

V_{H}

, while

V_{I_{1}}

and

V_{I_{2}}

denote the flight speeds of the two pursuers. The normalization coefficient

σ_{i} (i = 1, 2)

ensures that the magnitudes of selected state variables remain comparable, thereby facilitating algorithm convergence.

The objective of deep reinforcement learning is to determine the optimal action for each state. Based on guidance requirements, the selected action is defined as the HSUAV’s normal overload

n_{H}

. Accordingly, the following condition must be satisfied:

n_{H} \in [- n_{H m a x}, n_{H m a x}]

(12)

where

n_{H m a x}

represents the overload limit for HSUAV.

3.4. Reward Function Design Considering the Evasion Cost

In HSUAV evasion scenarios, evasion cost is a critical factor in strategy design. Two primary aspects are considered: (1) the HSUAV’s lateral maneuver distance and (2) its energy consumption. The lateral maneuver distance directly influences the feasibility of subsequent missions; excessive deviation may hinder mission completion. Meanwhile, energy consumption affects the ability to execute future evasion.

The design of the reward function plays a crucial role in determining the efficiency and convergence of reinforcement learning algorithms, as it directly shapes the agent’s learning behavior and performance. In this study, a comprehensive two-part reward function is proposed, consisting of process rewards (

R_{1}

,

R_{2}

, and

R_{3}

) and terminal rewards (

R_{4}

and

R_{5}

). The process reward, which quantifies the effectiveness of evasion, is defined as follows:

R_{1} = - β_{1} (\frac{1}{r_{H I_{1}}} - \frac{1}{r_{H I_{1} 0}}) + β_{2} |{\dot{λ}}_{H I_{1}}| - β_{3} (\frac{1}{r_{H I_{2}}} - \frac{1}{r_{H I_{2} 0}}) + β_{4} |{\dot{λ}}_{H I_{2}}|, v_{H I_{1}} < 0 o r v_{H I_{2}} < 0

(13)

where

β_{1}

,

β_{2}

,

β_{3}

, and

β_{4}

are weighting factors. As shown in Equation (13) R₁ consists of four components. The first component

- β_{1} (\frac{1}{r_{H I_{1}}} - \frac{1}{r_{H I_{1} 0}})

and third component

- β_{3} (\frac{1}{r_{H I_{2}}} - \frac{1}{r_{H I_{2} 0}})

components are derived from potential energy functions, where the penalty (i.e., repulsive force) intensifies as the HSUAV approaches an pursuer. The second component

β_{2} |{\dot{λ}}_{H I_{1}}|

and fourth component

β_{4} |{\dot{λ}}_{H I_{2}}|

incentivize maneuvering by promoting an increased rate of change in the line-of-sight angle.

The survival reward

R_{2}

is designed to encourage prolonged survival, assigning a higher reward for extended evasion time. It is expressed as:

R_{2} = β_{5} t

(14)

where

β_{5}

is a weighting factor.

The reward function

R_{3}

is formulated in Equation (15) to discourage the HSUAV from sustaining high overloads over extended periods, thereby minimizing energy consumption and enhancing operational efficiency:

R_{3} = - β_{6} n_{H \max} t, n_{H \max} - |n_{H c}| \leq δ

(15)

where

β_{6}

is the weighting factor.

δ

is a constant set to 0.1 in this study.

In addition to process rewards, a terminal reward function is introduced to evaluate the final outcome of the HSUAV’s evasion strategy. This function consists of two components,

R_{4}

and

R_{5}

, whose design rationale and formulations are detailed below.

The first component,

R_{4}

, represents the final reward for successfully evading the first pursuer. It incentivizes the HSUAV based on the favorability of the terminal state following successful evasion while imposing a penalty of −100 in the event of failure. This component ensures that successful evasion is prioritized and that the HSUAV optimizes its terminal state. The formulation of

R_{4}

is expressed as follows:

R_{4} = \{\begin{cases} e^{\frac{1}{{|r_{H I_{1}} (t_{f_{1}}) - z_{c}|}^{2}}}, v_{H I_{1}} \geq 0 & r_{H I_{1}} (t_{f_{1}}) \geq 1 \\ - 100, v_{H I_{1}} \geq 0 & r_{H I_{1}} (t_{f_{1}}) < 1 \end{cases}

(16)

where

t_{f_{1}}

denotes the moment of intersection between the HSUAV and the first pursuer,

z_{c}

is defined as the maximum distance that satisfies the direct collision condition, serving as the lower limit of the miss distance.

The second component,

R_{5}

, represents the final reward for evading the second pursuer. Similar to

R_{4}

, it rewards the HSUAV based on the favorability of the terminal state upon successful evasion and imposes a penalty of −100 in case of failure.

R_{5}

is formulated as follows:

R_{5} = \{\begin{cases} e^{\frac{1}{{|r_{H I_{2}} (t_{f_{2}}) - z_{c}|}^{2}}}, v_{H I_{2}} \geq 0 & r_{H I_{2}} (t_{f_{2}}) \geq 1 \\ - 100, v_{H I_{2}} \geq 0 & r_{H I_{2}} (t_{f_{2}}) < 1 \end{cases}

(17)

where

t_{f_{2}}

denotes the moment of intersection between the HSUAV and the second pursuer.

The total reward is computed as follows:

R = R_{1} + R_{2} + R_{3} + R_{4} + R_{5}

(18)

For the reward function weight design, a two-stage approach was adopted. The initial weights were determined by the relative importance of each objective, prioritizing survival, then energy consumption, and finally the rewards related to evasion maneuvers, and were subsequently refined through empirical tuning in preliminary simulations to ensure a balanced contribution from each term.

3.5. Network Structure Design

Both the policy network and the value function network are implemented using a fully connected neural network comprising three hidden layers. The ReLU function is adopted as the activation function for the hidden layers.

The proposed network structure is summarized in Table 2. The parameters of the policy and value function networks are optimized using the Adam optimizer.

Remark 2.

This study introduces a space interval calculation approach based on a combined “offline training + online application” framework, allowing real-time generation of initial space intervals for the two pursuers. This approach enhances the robustness and generalizability of the strategy in highly adversarial scenarios.

Remark 3.

Given the complexity of coordinated PE scenarios, this study divides the problem into two subtasks: simple PE scenarios and difficult PE scenarios. The proposed SCPL-TD3 algorithm adopts a progressive learning approach, where the HSUAV first masters evasion strategies for simple scenarios before advancing to more complex ones. This enables the HSUAV to gradually acquire new strategies from relatively simple to increasingly challenging conditions, achieving adaptive evasion across diverse confrontation scenarios. Additionally, to address the inefficiency caused by random experience sampling, SCPL-TD3 designs three prioritized experience replay buffers to classify and store experience samples based on their importance, significantly improving training efficiency.

Remark 4.

The introduction of the critical miss distance concept in the design of

R_{4}

and

R_{5}

strengthens the HSUAV’s evasion strategy. By incentivizing the HSUAV to evade both pursuers with the smallest possible miss distance, the proposed reward structure minimizes lateral maneuvering distance while enhancing the HSUAV’s capability to execute subsequent missions. This distance, the proposed reward structure minimizes lateral maneuvering distance while enhancing the HSUAV’s capability to execute subsequent missions. This optimization ensures efficient energy utilization and improves overall mission effectiveness in complex engagement scenarios. Additionally, the combined effects of

R_{3}

,

R_{4}

, and

R_{5}

effectively reduce evasion costs, supporting both energy efficiency and operational performance.

Remark 5.

Compatibility with Ground Intervention. Although the proposed SCPL-TD3 algorithm is designed for autonomous on-board decision-making, it remains compatible with ground-based guidance updates. The policy can incorporate high-level commands—such as revised waypoints or new mission objectives—received via communication links. This feature enhances operational flexibility, though real-world implementation would need to account for communication delays and reliability, especially in contested environments.

4. Simulation and Analysis

This section presents a comprehensive simulation analysis of the HSUAV’s one-versus-two engagement scenario, focusing on three critical aspects: algorithm effectiveness, robustness, and real-time performance.

The simulations were conducted using MATLAB 2022b on a hardware platform equipped with an Intel Core i5-12500H CPU (3 GHz), an RTX 3060 GPU with 16 GB of memory, and a 1 TB SSD. The detailed simulation parameters are provided in Table 3. Both pursuers were guided by the APN law. The maximum available overload ratio between the HSUAV and the pursuer was set to 1:3, reflecting the pursuer’s superior maneuverability. This substantial disparity in overload capacity resulted in a lower instantaneous maneuvering capability for the HSUAV, thereby increasing the evasion difficulty.

4.1. Effectiveness Validation

4.1.1. Training Efficiency and Performance

A comprehensive simulation framework was established to evaluate the effectiveness and sample efficiency of the proposed SCPL-TD3 algorithm, with a comparative analysis against baseline methods TD3 and PPO. The training for all algorithms was conducted over a maximum of 6000 episodes on the complex Subtask 2.

As illustrated in Figure 8, the proposed SCPL-TD3 algorithm demonstrates clear superiority over the baseline methods, a result attributable to its progressive learning strategy. By leveraging knowledge transferred from Subtask 1, SCPL-TD3 achieved convergence approximately 1400 iterations faster than TD3 (around 2800 versus 4200 iterations) and exhibited significantly smaller post-convergence reward fluctuations, highlighting its improved sample efficiency and training stability. Furthermore, the PPO algorithm suffered from premature convergence to a local optimum. In contrast, SCPL-TD3 continued effective exploration, achieving a final reward value substantially higher than PPO, which underscores its enhanced capability in finding superior policies.

The trained agent was subsequently evaluated through simulation tests in both relatively simple tasks (Scenarios 1 and 2) and a more challenging task (Scenario 3).

4.1.2. Simple Evasion Scenario

A comparative analysis in both Scenario 1 and Scenario 2 is conducted against the baseline method M1 from Ref. [9], which is based on classical game-theoretic approaches, to further validate the practical effectiveness of the proposed intelligent evasion strategy.

The simulation results for Scenario 1 are presented in Figure 9 and Figure 10, corresponding to the proposed SCPL-TD3 strategy and the baseline method M1, respectively. As shown in Figure 9a,b and Figure 10a, both methods achieve successful penetration with miss distances greater than 1 m. However, a comparative analysis of the two-dimensional evasion trajectories in Figure 9a and Figure 10a reveals that the maximum lateral maneuvering distance of the proposed strategy is 0.0072 km, significantly smaller than that of M1 (0.0139 km), indicating a substantial improvement in trajectory optimality. Furthermore, by comparing the overload command curves in Figure 9c with the trajectory result in Figure 10b, it is observed that the SCPL-TD3 strategy avoids the prolonged use of maximum overload maneuvers employed by M1, thereby reducing energy consumption. This is attributed to the intelligent maneuver sequence adopted by the HSUAV based on the proposed strategy: a short-duration positive overload maneuver followed swiftly by a negative overload maneuver. This tactic compresses the pursuers’ reaction time, facilitating a successful escape, as evidenced by the synchronized variation between the overload command in Figure 9c and the ballistic deflection angle in Figure 9d.

The results for Scenario 2, illustrated in Figure 11 and Figure 12, further validate the advantages of the SCPL-TD3 strategy. Both methods again achieve successful evasion, as confirmed by the miss distances exceeding 1 m in Figure 11a,b and Figure 12a. The proposed strategy again demonstrates a significantly smaller maximum lateral maneuvering distance (0.0186 km) compared to M1 (0.0451 km), as evident from the trajectories in Figure 11a and Figure 12a. Moreover, analysis of the overload commands in Figure 11c and Figure 12b indicates that SCPL-TD3 avoids the energy-inefficient, sustained maximum overload maneuvers characteristic of M1. The overload and ballistic deflection angle curves in Figure 11c,d reveal well-timed directional switches in maneuvers, thereby effectively disrupting the pursuers’ tracking and facilitating a successful evasion.

In summary, while both strategies achieve successful evasion against two pursuers in the simple scenarios, the proposed SCPL-TD3 strategy exhibits superior performance by optimizing the flight trajectory through minimized lateral maneuvering distance and reduced energy consumption. This enhanced efficiency significantly lowers the overall evasion cost, underscoring the practical advantage of the SCPL-TD3 algorithm.

4.1.3. Difficult Evasion Scenario

A more comprehensive comparative analysis was conducted in the challenging Scenario 3 to rigorously evaluate the generalization capability and strategic superiority of the proposed SCPL-TD3 strategy. The baseline methods for comparison included the constant-value maneuver, the Bang-Bang maneuver, and the method M1 from Ref. [9]. The simulation results are presented in Figure 13, Figure 14, Figure 15 and Figure 16.

As evidenced by the miss distances greater than 1 m for both pursuers in Figure 13a,b, the HSUAV employing the proposed SCPL-TD3 strategy successfully evaded capture in this demanding scenario. In stark contrast, the other three strategies all failed, as indicated by their miss distances falling below the 1 m threshold in Figure 14a, Figure 15a, and Figure 16a, respectively.

A detailed analysis of the overload commands reveals the underlying reasons for these outcomes. The constant-value maneuver, maintaining a constant positive maximum overload as shown in Figure 14b, successfully evaded Pursuer 1, while also resulting in the largest lateral displacement among all strategies and ultimately failing to evade Pursuer 2. The Bang-Bang maneuver, depicted in Figure 15b, switched the sign of its overload after evading Pursuer 1, which reduced the lateral maneuvering distance compared to the constant-value strategy. However, this switch occurred too late, missing the critical window to effectively respond to Pursuer 2 and ultimately resulting in capture. As shown in Figure 16b, the M1 method evaded Pursuer 1 but then failed against Pursuer 2 despite adjustments to the overload’s magnitude and direction, surviving only 0.2 s longer than the Bang-Bang maneuver.

In contrast, the proposed method demonstrated intelligent and anticipatory decision-making. The overload command and ballistic deflection angle curves in Figure 13c,d show that the HSUAV initially executed a negative overload maneuver for approximately 0.3 s, followed by a zero-overload phase until 0.4 s, before initiating a gradual positive maneuver. This sequence was designed to compress the reaction time of the pursuers strategically. By considering the threat from both pursuers in an integrated manner from the outset, the strategy successfully created a sufficient evasion window against Pursuer 2 while handling Pursuer 1, thereby achieving successful evasion against the coordinated pair.

This comparative analysis under challenging conditions conclusively demonstrates that the SCPL-TD3 strategy is uniquely capable of handling complex, multi-threat scenarios effectively, attributable to its robustness, energy efficiency, and trajectory optimization that collectively enhance mission survivability and success.

4.2. Robustness Verification

To rigorously evaluate the robustness and generalizability of the proposed intelligent evasion strategy, extensive Monte Carlo simulations were conducted under the three engagement scenarios defined previously in Section 2.1. The initial position of pursuer 1 was set to

(x_{I_{1} 0}, y_{I_{1} 0}, z_{I_{1} 0}) = (4, 25, 0) km

.

A total of 5000 Monte Carlo simulations were conducted to validate the robustness of the proposed strategy. The simulations encompassed three distinct scenarios, categorized by increasing difficulty: simple cases (Scenarios 1 and 2) and a difficult case (Scenario 3). Specifically, Scenario 1 (1000 runs) was defined with

Δ X \in [500, 750)

m, while Scenario 2 (1000 runs) with

Δ X \in (2000, 3000]

m. Scenario 3 (3000 runs) was specifically designed to evaluate the policy’s generalization capability by introducing an untrained initial LOS angle perturbation, uniformly distributed between −10° and 10°, in addition to the

Δ X \in [750, 2000]

m.

The corresponding simulation results are presented in Figure 17. A joint analysis of Figure 17a,c confirms the exceptional reliability of the strategy under nominal conditions (Scenarios 1 and 2). The scatter plot in Figure 17a demonstrates that for all 2000 runs in these scenarios, the resulting miss distances are consistently above the 1-m. This visual evidence of perfect success is further corroborated by the corresponding boxplots in Figure 17c. The boxes are compact and located high on the axis, indicating that the strategy produces near-optimal and highly predictable evasion outcomes in both Scenarios 1 and 2 head-on engagements, achieving a 100% success rate.

To evaluate the generalization capability, we further introduced untrained LOS angle perturbations into the setup of Scenario 3. The results are shown in Figure 17b–d. As shown in Figure 17b, the strategy maintains a high evasion success rate of 95.07% even in the presence of untrained LOS angle perturbations. Figure 17c further quantifies the performance distribution: the median miss distance is 1.62 m, with an interquartile range (IQR) of 1.37–1.90 m. Although variability increases compared to nominal scenarios, the results remain within an effective range. Moreover, Figure 17d shows a high median cumulative reward of 1466.99 with a concentrated distribution, indicating that the policy consistently makes high-quality decisions even when confronted with unseen state configurations, highlighting its strong generalization capability and intelligent adaptability.

In summary, the proposed method achieves an aggregate success rate of 97.04% across the entire set of tests, including a deliberately challenging scenario designed to test generalization. This result demonstrates the method’s strong robustness and generalization capability.

4.3. Real-Time Performance Validation

Ensuring real-time performance is a critical requirement for the practical implementation of algorithms in engineering applications. To assess this aspect, the parameter settings in this section were kept identical to those in Section 4.2. A total of 5000 Monte Carlo simulations were conducted to compare the real-time performance of the SCPL-TD3 algorithm with that of the algorithm presented in the Ref. [9]. The simulation results are provided in Figure 18.

The results indicate that the SCPL-TD3 algorithm exhibits significantly improved real-time performance compared to the algorithm from the Ref. [9]. Specifically, the average command generation time is reduced from 0.0169 ms to 0.0157 ms, representing a 7.10% improvement. Furthermore, the optimal command generation time is reduced from 0.0101 ms to 0.0086 ms, corresponding to a 14.85% improvement. This acceleration in command generation delivers critical tactical advantages for HSUAVs operating in real-time confrontation scenarios. This enhancement translates directly into two key system-level benefits:

Firstly, it secures a decisive decision-making advantage. Given the extreme closing velocities (exceeding Mach 10) that define the engagement envelope, the achieved reduction in computational latency allows the evasion strategy to be executed earlier. This head-start is paramount, enabling the HSUAV to initiate and commit to optimal maneuvers that accumulate vital lateral displacement, thereby dramatically increasing the probability of successful penetration against advanced threats.

Secondly, the improved computational efficiency alleviates the burden on the onboard processing unit. The freed computational resources can be reallocated to other critical functions—such as high-fidelity sensor processing, resilient communication links, or more complex predictive models—enhancing overall system capability and robustness without the need for hardware upgrades.

Consequently, within the operational context of high-speed autonomous systems where milliseconds determine mission outcomes, this advancement in processing speed is not just an incremental improvement but a necessary enabler for practical and effective intelligent decision-making.

The proposed SCPL-TD3 algorithm has been rigorously validated across multiple critical dimensions, demonstrating significant advantages:

(1): The proposed SCPL-TD3 algorithm demonstrates superior training efficiency, converging significantly faster than TD3 and achieving higher performance than PPO by effectively avoiding premature convergence.
(2): In evasion effectiveness tests, SCPL-TD3 achieves successful evasion in complex scenarios where all baseline methods fail, while its energy-saving trajectory optimization directly enhances mission success probability by preserving critical kinetic energy for follow-on tasks.
(3): Comprehensive Monte Carlo simulations validate the algorithm’s robustness, showing a 97.04% success rate under significant initial condition perturbations.
(4): The real-time performance validation confirms the method’s practical deployability, generating intelligent evasion commands within stringent computational constraints.

5. Conclusions and Future Work

This study addresses the limitations of existing evasion strategies by proposing a comprehensive framework for HSUAV maneuvering in coordinated PE scenarios. An optimal interval range for initial spatial separation is derived, and the evasion difficulty classification scheme is introduced to categorize evasion scenarios, facilitating effective strategy optimization. A reward function that integrates energy constraints and critical miss distance is designed to minimize evasion cost while maintaining an operational margin. Additionally, the proposed SCPL-TD3 algorithm dynamically adjusts training complexity and employs semantic patterns to prioritize critical state-action relationships, significantly enhancing learning efficiency and stability. Simulation results demonstrate that the SCPL-TD3 algorithm achieves efficient convergence and robust performance, with a 97.04% success rate and a 7.10–14.85% improvement in decision-making speed, confirming its effectiveness in complex pursuit-evasion scenarios.

Despite these achievements, this study has several limitations. The evaluation is confined to a two-pursuer configuration, leaving the performance in scenarios with more pursuers or heterogeneous adversary teams unexamined. Furthermore, practical factors such as environmental disturbances and communication delays were not incorporated into the current model. Future research will focus on the following aspects: (1) extending the approach to support cooperative evasion among multiple HSUAVs under communication constraints; (2) incorporating robust perception and control strategies to accommodate sensor noise and state uncertainty; (3) validating the algorithm through hardware-in-the-loop simulation to verify performance under realistic conditions.

Author Contributions

Conceptualization, all authors; methodology, X.Z. and T.Y.; software, X.Z. and T.L.; validation, X.Z., T.Y. and T.L.; formal analysis, X.Z., T.Y. and T.L.; resources, Z.J. and J.Y.; data curation, Z.J.; writing—original draft preparation, X.Z., T.Y. and T.L.; writing—review and editing, T.L. and C.L.; supervision, J.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Natural Science Foundation of China, grant number 62303380 and the Aeronautical Science Foundation of China, grant number 201907053001.

Data Availability Statement

All the data used to support the findings of this study are contained within the article.

Acknowledgments

The authors greatly appreciate the financial support.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Zhao, Z.Y.; Ma, Y.; Tian, Y.; Ding, Z.J.; Zhang, H.; Tong, S.H. Research on integrated design method of wide-range hypersonic vehicle/engine based on dynamic multi-objective optimization. Aerosp. Sci. Technol. 2025, 159, 110031. [Google Scholar] [CrossRef]
Bu, X.W.; Qi, Q. Fuzzy optimal tracking control of hypersonic flight vehicles via single-network adaptive critic design. IEEE Trans. Fuzzy Syst. 2022, 30, 270–278. [Google Scholar] [CrossRef]
Wang, J.; Zhang, C.; Zheng, C.M.; Kong, X.W.; Bao, J.Y. Adaptive neural network fault-tolerant control of hypersonic vehicle with immeasurable state and multiple actuator faults. Aerosp. Sci. Technol. 2024, 152, 109378. [Google Scholar] [CrossRef]
Ding, Y.B.; Yue, X.K.; Chen, G.S.; Si, J.S. Review of control and guidance technology on hypersonic vehicle. Chin. J. Aeronaut. 2022, 7, 1–18. [Google Scholar] [CrossRef]
Li, B.; Gan, Z.; Chen, D.; Sergey Alekandrovich, D. UAV maneuvering target tracking in uncertain environments based on deep reinforcement learning and meta-learning. Remote Sens. 2020, 12, 3789. [Google Scholar] [CrossRef]
Zhuang, X.; Li, D.; Wang, Y.; Liu, X.; Li, H. Optimization of high-speed fixed-wing UAV penetration strategy based on deep reinforcement learning. Aerosp. Sci. Technol. 2024, 148, 109089. [Google Scholar] [CrossRef]
Fainkich, M.; Shima, T. Cooperative guidance for simultaneous interception using multiple entangled sliding surfaces. J. Guid. Control Dyn. 2025, 48, 591–599. [Google Scholar] [CrossRef]
Zheng, Z.W.; Li, J.Z.; Feroskhan, M. Three-dimensional terminal angle constraint guidance law with class K∞ function-based adaptive sliding mode control. Aerosp. Sci. Technol. 2024, 147, 109005. [Google Scholar] [CrossRef]
Yan, T.; Cai, Y.L.; Xu, B. Evasion guidance algorithms for air-breathing hypersonic vehicles in three-player pursuit-evasion games. Chin. J. Aeronaut. 2020, 33, 3423–3436. [Google Scholar] [CrossRef]
Imado, F.; Uehara, S. High-g barrel roll maneuvers against proportional navigation from optimal control viewpoint. J. Guid. Control Dyn. 1998, 21, 876–881. [Google Scholar] [CrossRef]
Yu, W.B.; Chen, W.C.; Jiang, Z.G.; Zhang, W.Q.; Zhao, P.L. Analytical entry guidance for coordinated flight with multiple no-fly-zone constraints. Aerosp. Sci. Technol. 2018, 84, 273–290. [Google Scholar] [CrossRef]
Wang, C.C.; Wang, Z.L.; Zhang, S.Y.; Tan, J.R. Adam-assisted quantum particle swarm optimization guided by length of potential well for numerical function optimization. Swarm Evol. Comput. 2023, 79, 101309. [Google Scholar] [CrossRef]
Morelli, A.C.; Hofmann, C.; Topputo, F. Robust low-thrust trajectory optimization using convex programming and a homotopic approach. IEEE Trans. Aerosp. Electron. Syst. 2022, 58, 2103–2116. [Google Scholar] [CrossRef]
Mao, Y.Q.; Szmuk, M.; Xu, X.R.; Acikmese, B. Successive convexification: A superlinearly convergent algorithm for non-convex optimal control problems. arXiv 2018, arXiv:1804.06539. [Google Scholar]
Liu, Z.; Zhang, X.G.; Wei, C.Z.; Cui, N.G. High-precision adaptive convex programming for reentry trajectories of suborbital vehicles. Acta Aeronaut. Astronaut. Sin. 2023, 44, 729430. [Google Scholar]
Zhang, J.L.; Liu, K.; Fan, Y.Z.; Yu, Z.Y. A piecewise predictor-corrector re-entry guidance algorithm with no-fly zone avoidance. J. Astronaut. 2021, 42, 122–131. [Google Scholar]
He, R.Z.; Liu, L.H.; Tang, G.J.; Bao, W.M. Entry trajectory generation without reversal of bank angle. Aerosp. Sci. Technol. 2017, 71, 627–635. [Google Scholar] [CrossRef]
Liang, Z.X.; Ren, Z. Tentacle-based guidance for entry flight with no-fly zone constraint. J. Guid. Control Dyn. 2017, 40, 1–10. [Google Scholar] [CrossRef]
Chai, R.Q.; Tsourdos, A.; Savvaris, A.; Cai, S.C.; Xia, Y.Q. High-fidelity trajectory optimization for aeroassisted vehicles using variable order pseudospectral method. Chin. J. Aeronaut. 2021, 34, 237–251. [Google Scholar] [CrossRef]
Rao, H.P.; Zhong, R.; Li, P.J. Fuel-optimal deorbit scheme of space debris using tethered space-tug based on pseudospectral method. Chin. J. Aeronaut. 2020, 34, 210–223. [Google Scholar] [CrossRef]
Hou, L.F.; Li, L.; Chang, L.L.; Wang, Z.; Sun, G.Q. Pattern dynamics of vegetation based on optimal control theory. Nonlinear Dyn. 2025, 113, 1–23. [Google Scholar] [CrossRef]
Wu, S. Linear-quadratic non-zero sum backward stochastic differential game with overlapping information. IEEE Trans. Autom. Control 2023, 68, 1800–1806. [Google Scholar] [CrossRef]
Yu, X.Y.; Wang, X.F.; Lin, H. Optimal penetration guidance law with controllable missile escape distance. J. Astronaut. 2023, 44, 1053–1063. [Google Scholar]
Liu, C.; Sun, S.S.; Tao, C.G.; Shou, Y.X.; Xu, B. Optimizing evasive maneuvering of planes using a flight quality driven model. Sci. China Inf. Sci. 2024, 67, 132206. [Google Scholar] [CrossRef]
Du, Q.F.; Hu, Y.D.; Jing, W.X.; Gao, C.S. Three-dimensional target evasion strategy without missile guidance information. Aerosp. Sci. Technol. 2025, 157, 109857. [Google Scholar] [CrossRef]
Singh, S.K.; Reddy, P.V. Dynamic network analysis of a target defense differential game with limited observations. IEEE Trans. Control Netw. Syst. 2023, 10, 308–320. [Google Scholar] [CrossRef]
Wang, Y.Q.; Ning, G.D.; Wang, X.F. Maneuver penetration strategy of near space vehicle based on differential game. Acta Aeronaut. Astronaut. Sin. 2020, 41, 724276. [Google Scholar]
Wang, X.; Wang, S.; Liang, X.X.; Zhao, D.W.; Huang, J.C.; Xu, X. Deep reinforcement learning: A survey. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 5064–5078. [Google Scholar] [CrossRef]
Kiran, B.R. Deep reinforcement learning for autonomous driving: A survey. IEEE Trans. Intell. Transp. Syst. 2022, 23, 4909–4926. [Google Scholar] [CrossRef]
Chen, J.Y.; Yu, C.; Li, G.S.; Tang, W.H.; Ji, S.L.; Yang, X.Y. Online planning for multi-uav pursuit-evasion in unknown environments using deep reinforcement learning. IEEE Robot. Autom. Lett. 2025, 10, 8196–8203. [Google Scholar] [CrossRef]
Qu, X.Q.; Gan, W.H.; Song, D.L.; Zhou, L.Q. Pursuit-evasion game strategy of USV based on deep reinforcement learning in complex multi-obstacle environment. Ocean Eng. 2023, 273, 114016. [Google Scholar] [CrossRef]
Li, J.S.; Wang, X.F.; Lin, H. Intelligent penetration policy for hypersonic cruise missiles based on virtual targets. Acta Armamentarii. 2024, 45, 3856–3867. [Google Scholar]
Yan, T.; Liu, C.; Gao, M.J.; Jiang, Z.J.; Li, T. A deep reinforcement learning-based intelligent maneuvering strategy for the high-speed UAV pursuit-evasion game. Drones 2024, 8, 309. [Google Scholar] [CrossRef]
Duan, Z.K.; Xu, G.J.; Liu, X.; Ma, J.Y.; Wang, L.Y. Optimal confrontation position selecting games model and its application to one-on-one air combat. Def. Technol. 2024, 31, 417–428. [Google Scholar] [CrossRef]
Mishley, A.; Shaferman, V. Near-optimal evasion from acceleration bounded modern pursuers. J. Guid. Control Dyn. 2025, 48, 793–807. [Google Scholar] [CrossRef]
Hao, Z.M.; Zhang, R.; Li, H.F. Parameterized evasion strategy for hypersonic glide vehicles against two missiles based on reinforcement learning. Chin. J. Aeronaut. 2025, 38, 103173. [Google Scholar] [CrossRef]

Figure 1. Confrontation scenario under SPEC. Where

H_{0}

represents the initial position of the HSUAV;

I_{0}

denotes the initial position of the pursuers.

Figure 1. Confrontation scenario under SPEC. Where

H_{0}

represents the initial position of the HSUAV;

I_{0}

denotes the initial position of the pursuers.

Figure 2. Confrontation scenario under NCTC.

Figure 3. Confrontation scenario under CTC.

Figure 4. X-Z plane adversarial geometry of the three-play confrontation.

Figure 5. Initial common space interval design flow.

Figure 6. TD3 algorithm structure framework.

Figure 7. HSUAV evasion modeling framework based on the SCPL-TD3 algorithm.

Figure 8. Comparison of training among SCPL-TD3, PPO, and TD3.

Figure 9. Simulation results based on SCPL-TD3 in Scenario 1. (a) Two-dimensional plane diagram (b) Curve of relative distance (c) Overload comparison diagram (d) HSUAV ballistic deflection angle.

Figure 10. Simulation results based on the M1 in Scenario 1. (a) Two-dimensional plane diagram (b) Overload comparison diagram.

Figure 11. Simulation results based on SCPL-TD3 in Scenario 2. (a) Two-dimensional plane diagram (b) Overload comparison diagram (c) Overload comparison diagram (d) HSUAV ballistic deflection angle.

Figure 12. Simulation results based on the M1 in Scenario 2. (a) Two-dimensional plane diagram (b) Overload comparison diagram.

Figure 13. Simulation results based on SCPL-TD3. (a) Two-dimensional plane diagram (b) Curve of relative distance (c) Overload comparison diagram (d) HSUAV ballistic deflection angle.

Figure 14. Simulation results based on the constant-value maneuver. (a) Two-dimensional plane diagram (b) Overload comparison diagram.

Figure 15. Simulation results based on the Bang-Bang maneuver. (a) Two-dimensional plane diagram (b) Overload comparison diagram.

Figure 16. Simulation results based on the M1. (a) Two-dimensional plane diagram (b) Overload comparison diagram.

Figure 17. Monte Carlo simulation results. (a) Miss distance versus initial space interval for Scenarios 1 and 2 (b) Miss distance versus initial space interval for Scenario 3. (c) Miss distance distribution (d) Cumulative reward distribution.

Figure 18. Real-time verification simulation results. (a) Simulation results based on the algorithm from the Ref. [9]. (b) Simulation results based on the SCPL-TD3 algorithm.

Table 1. Comparison of HSUAV Evasion Strategies.

Method Category	Advantages	Limitations	Key Distinctions from Proposed Method
Predefined Maneuvers	Simple implementation, computationally efficient	Predictable, inflexible in dynamic environments	SCPL-TD3 uses adaptive learning rather than fixed patterns
Trajectory Optimization	Provides safe trajectories by avoiding threat zones	Overly conservative, high-energy consumption, large deviations	SCPL-TD3 balances evasion with mission constraints
Modern Control-Based Guidance Laws	Rigorous theoretical foundation, optimal solutions under ideal conditions	Requires perfect target knowledge, limited to unilateral solutions	SCPL-TD3 is model-free and computationally efficient for high-dimensional states
Existing DRL Methods	model-free approach, potential for complex environments	Slow convergence, unstable training, limited to one-on-one scenarios	SCPL-TD3 accelerates convergence and handles coordinated pursuit

Table 2. Overall network structure in the simulation.

Type of the Network	Policy Network	Actor Network
Input layer	8	8
Hidden layer 1	256	256
Hidden layer 2	256	256
Hidden layer 3	256	256
Hidden layer 4	128	256
Hidden layer 5	64	128
Output layer	1	1

Table 3. Simulation parameters.

Variable	Value
Initial coordinate value of HSUAV $(x_{H 0}, y_{H 0}, z_{H 0}) / km$	$(0, 25, 0)$
Initial coordinate value of the typical $situation 1 (x_{I_{1, 2} 0}, y_{I_{1, 2} 0}, z_{I_{1, 2} 0}) / km$	$(4, 25, 0), (5.5, 25, 0)$
Initial coordinate value of the typical $situation 2 (x_{I_{1, 2} 0}, y_{I_{1, 2} 0}, z_{I_{1, 2} 0}) / km$	$(4, 25, 0), (4.5, 25, 0)$
Initial coordinate value of the typical $situation 3 (x_{I_{1, 2} 0}, y_{I_{1, 2} 0}, z_{I_{1, 2} 0}) / km$	$(4, 25, 0), (8, 25, 0)$
$Space interval range [A_{1}, A_{2}] / km$	$[0.75, 2]$
Mach number $V_{H}, V_{I_{1, 2}} / Ma$	6, 4, 4
Initial value of the ballistic deviation angle $ψ_{V H 0}, ψ_{V I_{1, 2} 0} /^{\circ}$	0, 180, 180
Time constants of autopilot $τ_{H}, τ_{I_{1, 2}} / s$	0.5
Maximum lateral overload $n_{H m a x}, n_{I_{1, 2} m a x}$	2, 6, 6
Miss distance threshold $z_{c} / m$	1
Sampling time/ms	0.1
Navigation coefficient $N$	4
Size of the ERB	$10^{6}$
Small batch sample size	256
Learning rate of actor network and critic network	$1 \times 10^{- 3}, 1 \times 10^{- 4}$
Initial exploration rate	0.3
Attenuation factor	$6 \times 10^{- 5}$
Discount factor $γ$	0.99

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Zhang, X.; Yan, T.; Li, T.; Liu, C.; Jiang, Z.; Yan, J. SCPL-TD3: An Intelligent Evasion Strategy for High-Speed UAVs in Coordinated Pursuit-Evasion. Drones 2025, 9, 685. https://doi.org/10.3390/drones9100685

AMA Style

Zhang X, Yan T, Li T, Liu C, Jiang Z, Yan J. SCPL-TD3: An Intelligent Evasion Strategy for High-Speed UAVs in Coordinated Pursuit-Evasion. Drones. 2025; 9(10):685. https://doi.org/10.3390/drones9100685

Chicago/Turabian Style

Zhang, Xiaoyan, Tian Yan, Tong Li, Can Liu, Zijian Jiang, and Jie Yan. 2025. "SCPL-TD3: An Intelligent Evasion Strategy for High-Speed UAVs in Coordinated Pursuit-Evasion" Drones 9, no. 10: 685. https://doi.org/10.3390/drones9100685

APA Style

Zhang, X., Yan, T., Li, T., Liu, C., Jiang, Z., & Yan, J. (2025). SCPL-TD3: An Intelligent Evasion Strategy for High-Speed UAVs in Coordinated Pursuit-Evasion. Drones, 9(10), 685. https://doi.org/10.3390/drones9100685

Article Menu

SCPL-TD3: An Intelligent Evasion Strategy for High-Speed UAVs in Coordinated Pursuit-Evasion

Abstract

Highlights

Abstract

1. Introduction

2. Problem Formulation and Vehicle Modeling

2.1. Coordinated PE Scenario Construction

2.2. Vehicle Modeling

3. Methods

3.1. Initial Common Space Interval Generation

3.2. SCPL-TD3 Algorithm

3.3. Design of State Space and Motion Space

3.4. Reward Function Design Considering the Evasion Cost

3.5. Network Structure Design

4. Simulation and Analysis

4.1. Effectiveness Validation

4.1.1. Training Efficiency and Performance

4.1.2. Simple Evasion Scenario

4.1.3. Difficult Evasion Scenario

4.2. Robustness Verification

4.3. Real-Time Performance Validation

5. Conclusions and Future Work

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI