1. Introduction
With the rapid advancement of autonomous systems, the complexity and dynamism of operational environments, both in space and in the air, have increased significantly. In the space domain, next-generation intelligent spacecraft are being developed for complex tasks such as on-orbit servicing, debris removal, and close-proximity inspection [1,2,3]. Similarly, in the aerial domain, autonomous unmanned aerial vehicles (UAVs, or drones) are becoming ubiquitous in surveillance, package delivery, and other mission-oriented applications. The pursuit-evasion game is a fundamental challenge that spans both domains: a dynamic, adversarial scenario in which agents must make real-time decisions to achieve conflicting objectives [4]. For these critical encounters, traditional ground-based or pre-programmed decision-making approaches are often inadequate. Consequently, developing robust autonomous decision-making strategies has become a critical research focus, with deep reinforcement learning (DRL) emerging as a powerful paradigm for learning optimal policies in uncertain and adversarial settings [5].
Orbital and aerial pursuit-evasion games represent a canonical two-agent (or multi-agent) decision-making process with competing goals. Unlike standard trajectory planning, where future states can be predetermined, these games involve strategic interactions in which each agent's optimal move is contingent on its opponent's anticipated actions. This interdependency means that neither the pursuer nor the evader can pre-calculate a fixed optimal trajectory, necessitating adaptive strategies that respond to an intelligent adversary in real time [6,7]. Traditional methods for solving such problems fall primarily into two categories: model-based differential game theory and data-driven reinforcement learning.
In model-based methods, differential game theory and optimal control principles are leveraged to compute Nash equilibrium strategies for both the pursuer and the evader. The pursuit-evasion problem is often formulated as a two-player zero-sum game, in which the Hamilton–Jacobi–Isaacs (HJI) equation characterizes the optimal strategies of both agents [8,9]. In most practical scenarios, however, these equations lack analytical solutions and must be solved numerically using techniques such as the shooting method, pseudo-spectral methods, or collocation [10]. In [11], a pseudo-spectral optimal control approach was proposed to solve spacecraft pursuit-evasion problems, but its computational cost remains a challenge for real-time applications. In [12], high-thrust impulsive maneuvers were incorporated into pursuit-evasion scenarios, significantly increasing the problem's complexity due to the discrete thrust applications. Overall, model-based methods often struggle with high computational cost and an inability to adapt to complex, dynamic, and uncertain environments.
Recent advances in artificial intelligence have established reinforcement learning (RL) as a powerful framework for addressing complex decision-making challenges in uncertain and adversarial environments [13]. RL enables autonomous agents to learn through interaction with their environment, optimizing their strategies via reward-based feedback. This capability makes RL particularly well suited to highly dynamic and time-sensitive adversarial tasks [14,15]. In particular, actor–critic architectures such as Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) have demonstrated superior performance in high-dimensional control problems, balancing exploration and exploitation while ensuring stable learning convergence [16,17]. The application of DRL to aerospace pursuit-evasion has become an active area of research, with studies exploring near-optimal interception strategies, two-stage pursuit missions, and intelligent maneuvers for hypersonic vehicles [18,19,20].
Despite these advances, a review of the current literature reveals several key research gaps that motivate this work. First, the challenge of generalizability is evident: many existing RL applications focus on specific domains, such as planetary landing or orbital transfer, without emphasizing the transferability of the core decision-making architecture [21,22,23]. Few studies have proposed a unified framework applicable to pursuit-evasion games in both space and aerial contexts. Second, many approaches struggle to handle complex, state-dependent terminal constraints. This is often due to the difficulty of designing a reward function that effectively guides agents toward these specific conditions; many rely on simple distance-based rewards that can lead to suboptimal or myopic strategies [24]. To highlight how the proposed framework addresses these gaps relative to recent works, a methodological comparison is provided in Table 1.
Motivated by these challenges, the primary objective of this paper is to develop and validate a single, unified DRL framework that is both effective and generalizable for autonomous pursuit-evasion tasks across different domains. Specifically, this work aims to (1) design a modular architecture that can seamlessly transition from multi-spacecraft to multi-drone scenarios; (2) introduce a training methodology that accelerates learning and enhances policy robustness; and (3) create a reward mechanism capable of handling complex, state-dependent terminal constraints that are common in aerospace missions.
To achieve these objectives, a generalizable DRL-based framework is developed in this paper. The method’s robustness and versatility are demonstrated in both a challenging multi-spacecraft orbital scenario and a multi-drone aerial scenario. The main contributions of this paper are threefold:
A generalizable DRL framework is proposed for autonomous pursuit-evasion, with its modularity demonstrated across both multi-spacecraft and multi-drone domains, directly addressing the challenge of generalizability.
A dynamics-agnostic curriculum learning strategy is integrated to improve the training efficiency and robustness of the learned policies, providing an effective training methodology for complex adversarial games.
A transferable prediction-based reward function is designed to effectively handle complex, state-dependent terminal constraints, a common challenge in aerospace missions.
The remainder of this paper is structured as follows. Section 2 details the problem formulation. Section 3 presents the proposed DRL framework, including the curriculum learning (CL) strategy and the reward design. Section 4 provides a comprehensive experimental validation, and Section 5 concludes the paper.
  2. Problem Definition
This section formulates the pursuit-evasion problem using a multi-spacecraft orbital game as the primary case for validation. While the dynamics are specific to the space domain, the general structure of the game, constraints, and objectives are analogous to those in aerial drone scenarios.
  2.1. Orbital Game Dynamics Modeling
Orbital maneuvering games refer to dynamic interactions between two or more spacecraft within the gravitational field of celestial bodies. These interactions are governed by orbital dynamics constraints, with each spacecraft pursuing its own objectives within its allowable control capabilities and available information.
This study focuses on local orbital maneuvering games within geosynchronous orbit (GEO), where the pursuit-evasion dynamics unfold over distances of thousands of kilometers. Specifically, a multi-spacecraft orbital pursuit-evasion scenario involving n homogeneous active spacecraft (pursuers) and a single non-cooperative target (evader) is investigated. The pursuers, initially located a distance of m km from the target, aim to intercept it within the capture zone, potentially establishing a safe approach corridor for subsequent operations. Meanwhile, the target seeks to evade capture while remaining within l km of its initial orbital position [26].
The orbital pursuit task is considered successful when any pursuer enters the capture zone at the terminal time t_f. To describe the geometry of the engagement, a relative orbital coordinate system is first established. Its origin is a virtual reference point located on a circular geostationary orbit, which serves as the center of the relative motion frame. This reference point defines the nominal orbit, while the target (evader) and the pursuers maneuver relative to this frame. The relative orbital coordinate system follows the traditional Hill's frame and is aligned with the RSW coordinate system described in Ref. [27] (p. 389). Specifically, the S axis points along the orbital velocity direction (along-track), the R axis points from the spacecraft toward the Earth's center (radial), and the W axis completes the right-handed system, pointing along the orbital angular momentum direction (cross-track). This definition ensures consistency with the Clohessy–Wiltshire equations in Equation (1) and with the state transition matrix adopted later in this work. The state of a spacecraft is thus described by its relative position r and velocity v in this frame.
With the reference frame defined, the capture zone can be described as a cone-shaped region with a fixed generatrix length, L, and a half-opening angle, α. The apex of the cone is located at the target's position [28]. The central axis of this cone is fixed along the positive z-axis of the evader's relative orbital frame. This orientation, which directs the approach path towards the Earth (i.e., from the zenith direction), is representative of specific mission profiles. For instance, in missions involving Earth observation or remote sensing, a pursuer might need to maintain a nadir-pointing instrument orientation during its final approach to collect data on the target against the background of the Earth. This geometry is therefore adopted to reflect such operational constraints. The cone maintains this orientation relative to the evader's frame throughout the engagement. A schematic diagram of this orbital maneuvering game is shown in Figure 1.
In this study, the game is formulated within a discrete-time framework where control actions are modeled as impulsive thrusts applied at fixed decision intervals. This modeling choice is a pragmatic necessity driven by the need for computational tractability in a complex, game-theoretic reinforcement learning context. While deterministic optimal control methods often handle continuous-thrust profiles, such approaches are typically intractable for real-time, multi-agent strategic decision-making. By discretizing the control actions, the problem’s complexity is significantly reduced, making it amenable to solution via DRL algorithms like PPO. This abstraction allows the focus to shift from low-level continuous trajectory optimization to the high-level strategic interactions, which is the core of the pursuit-evasion game.
The relative motion between spacecraft is governed by nonlinear dynamics. To provide a challenging and realistic testbed for the DRL agents, a high-fidelity nonlinear model of relative orbital motion is used as the simulation environment in this study. All agent interactions and trajectories are propagated by numerically integrating the full nonlinear equations of motion derived from the two-body problem.
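As an illustration of such a propagation (a minimal sketch, not the authors' simulator), each spacecraft can be integrated under two-body dynamics in an Earth-centered inertial frame with an off-the-shelf adaptive integrator, and the resulting states differenced to obtain the relative state; the GEO radius, tolerances, and the 500 km along-track offset below are placeholder values.
```python
import numpy as np
from scipy.integrate import solve_ivp

MU_EARTH = 3.986004418e14  # Earth's gravitational parameter [m^3/s^2]

def two_body(t, state):
    """Two-body point-mass dynamics; state = [x, y, z, vx, vy, vz] in m and m/s."""
    r = state[:3]
    accel = -MU_EARTH * r / np.linalg.norm(r) ** 3
    return np.hstack((state[3:], accel))

def propagate(state0, dt):
    """Propagate one decision interval with an adaptive RK45 integrator."""
    sol = solve_ivp(two_body, (0.0, dt), state0, rtol=1e-10, atol=1e-9)
    return sol.y[:, -1]

# Relative state of a pursuer with respect to the evader after a 60 s interval.
r_geo = 42164e3                                   # GEO orbital radius [m]
v_geo = np.sqrt(MU_EARTH / r_geo)                 # circular orbital speed [m/s]
evader0 = np.array([r_geo, 0.0, 0.0, 0.0, v_geo, 0.0])
pursuer0 = evader0 + np.array([0.0, 500e3, 0.0, 0.0, 0.0, 0.0])  # 500 km offset
rel_state = propagate(pursuer0, 60.0) - propagate(evader0, 60.0)
```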
For the agent's internal model, however, particularly for the prediction-based reward function, which requires rapid state projections, a linearized approximation is necessary for computational tractability. The well-known Clohessy–Wiltshire (CW) equations are employed for this purpose; their form, given as Equation (1), is specific to the coordinate system defined above and is consistent with authoritative sources [27,29,30]. In these equations, ω denotes the mean orbital motion of the GEO reference orbit. It is crucial to note that this linearized model is used exclusively by the agent for its internal reward calculation; the agent's actual motion within the simulation is governed by the high-fidelity nonlinear dynamics. This creates a realistic model mismatch that adds to the complexity of the learning task and tests the robustness of the learned policy. From the CW equations, the natural acceleration vector a_k under the linearized model at time step k can be evaluated directly from the current relative state. Maneuvers are executed as impulsive thrusts, Δv_k, applied at discrete time steps t_k, and the state between maneuvers is propagated by Euler integration of the linearized dynamics. This Euler propagation is used internally by the agent for its predictions only; it does not replace the high-fidelity propagation used in the main simulation environment.
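For orientation, a minimal sketch of the linearized relative dynamics and of the agent's one-step Euler propagation is given below in the conventional Hill-frame notation (x radial, y along-track, z cross-track, mean motion ω, natural acceleration a_k); the paper's own Equation (1) follows its RSW axis definition, so axis ordering and signs may differ, and the exact discretization used by the agent is likewise an assumption here.
$$\ddot{x} - 2\omega\dot{y} - 3\omega^{2}x = a_x, \qquad \ddot{y} + 2\omega\dot{x} = a_y, \qquad \ddot{z} + \omega^{2}z = a_z;$$
$$\mathbf{v}_k^{+} = \mathbf{v}_k + \Delta\mathbf{v}_k, \qquad \mathbf{r}_{k+1} = \mathbf{r}_k + \mathbf{v}_k^{+}\,\Delta t, \qquad \mathbf{v}_{k+1} = \mathbf{v}_k^{+} + \mathbf{a}_k\,\Delta t.$$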
  2.2. Design of Constraints
To ensure the realism and practicality of the multi-agent maneuvering problem, a set of constraints are incorporated into the decision-making and training process. These constraints, including input saturation, terminal conditions, and collision avoidance, are fundamental to real-world scenarios in both the space and aerial domains.
To reflect realistic actuator limitations, the velocity increment that can be applied at each decision step is constrained. This input saturation is described as
$$\lVert \Delta\mathbf{v}_i(k) \rVert \le \Delta v_{\max},$$
where ‖Δv_i(k)‖ is the magnitude of the velocity increment for spacecraft i at step k and Δv_max is the maximum allowable impulse magnitude per step.
The terminal conditions for a successful pursuit at the terminal time t_f are defined by both a distance and an angle constraint within the capture cone:
$$\lVert \mathbf{d}(t_f) \rVert \le L, \qquad \angle\big(\mathbf{d}(t_f),\, \mathbf{c}\big) \le \alpha,$$
where d(t_f) is the vector from the evader to the pursuer and c is the cone's central axis (defined as the positive z-axis in our setup). The specific values of L and α are mission-dependent; for long-range GEO scenarios they define a relatively large 'approach corridor' rather than a precise docking port, as detailed in the simulation setup. The evader, in turn, aims to violate at least one of these conditions at t_f to achieve escape. The constraints on the pursuer and the evader at the terminal time t_f are depicted in Figure 2.
Furthermore, to ensure safe operations and prevent mission failure due to collisions, safety distance constraints can be described as in [31]:
$$\lVert \mathbf{p}_i(t) - \mathbf{p}_j(t) \rVert \ge d_{pp} \quad \forall\, i \ne j, \qquad \lVert \mathbf{p}_i(t) - \mathbf{p}_e(t) \rVert \ge d_{pe} \quad \forall\, i,$$
where p_i(t) and p_j(t) denote the positions of pursuers i and j at time t and p_e(t) denotes the position of the evader. The parameters d_pp and d_pe represent the minimum safe separation distance between any two pursuers and between any pursuer and the evader, respectively.
As shown in Figure 3, these constraints define a safety distance around each spacecraft to prevent collisions.
  3. Pursuit-Evasion Game Decision Method Based on PPO
This section details the proposed DRL framework. The core of the method is a self-play training architecture in which pursuer and evader agents concurrently learn and evolve their strategies. Proximal Policy Optimization (PPO) was selected as the learning algorithm for its balance of sample efficiency and stable convergence, a critical feature for adversarial multi-agent training. Compared with more complex algorithms such as Soft Actor–Critic (SAC) or off-policy methods such as Deep Q-Networks (DQN), PPO's clipped surrogate objective provides robust performance with a simpler implementation [32]. The framework integrates PPO with two key innovations, a curriculum learning strategy and a prediction-based reward function, which are detailed below.
  3.1. Self-Play Training Architecture
A self-play training strategy is employed to develop the agents’ policies. In this architecture, both the pursuer and evader agents are trained simultaneously, continuously competing against each other. This adversarial process allows each agent’s policy, represented by a neural network, to improve iteratively in response to the evolving strategy of its opponent.
The decision-making process for each agent is as follows. At each time step, the agent receives a state observation from the environment. This state is fed into its policy network, which outputs a continuous action vector that directly represents the impulsive thrust command (i.e., the velocity increment Δv) to be applied. The environment executes this action, transitions to a new state, and returns a reward signal. This interaction loop is depicted in Figure 4, which illustrates the overall training architecture. The framework naturally extends to multi-agent scenarios, facilitating either cooperative or competitive behaviors.
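As a concrete illustration of this decision step, a minimal PyTorch actor of the kind assumed here (the two 128-unit hidden layers, the 12-dimensional observation, and the saturation value are placeholders, not the paper's settings) maps the observed relative state to a Gaussian action distribution and saturates the sampled command to the per-step impulse limit:
```python
import torch
import torch.nn as nn

class GaussianActor(nn.Module):
    """Minimal PPO-style actor: observation -> Gaussian policy over a 3D impulse."""
    def __init__(self, obs_dim: int, act_dim: int, dv_max: float):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.Tanh(),
            nn.Linear(128, 128), nn.Tanh(),
        )
        self.mu = nn.Linear(128, act_dim)              # mean of the action distribution
        self.log_std = nn.Parameter(torch.zeros(act_dim))
        self.dv_max = dv_max                           # per-step impulse limit

    def forward(self, obs: torch.Tensor):
        h = self.body(obs)
        dist = torch.distributions.Normal(self.mu(h), self.log_std.exp())
        raw = dist.sample()                            # raw continuous action
        # Enforce the input-saturation constraint ||delta_v|| <= dv_max.
        norm = raw.norm(dim=-1, keepdim=True).clamp(min=1e-8)
        delta_v = raw * torch.clamp(self.dv_max / norm, max=1.0)
        return delta_v, dist.log_prob(raw).sum(-1)

# One decision step for a pursuer observing a 12-dimensional relative state.
actor = GaussianActor(obs_dim=12, act_dim=3, dv_max=10.0)
dv_cmd, logp = actor(torch.zeros(1, 12))
```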
  3.2. Curriculum Learning for Enhanced Training
To accelerate learning and improve policy robustness, a curriculum learning (CL) strategy is implemented. This approach is dynamics-agnostic: it structures the training process by initially presenting the agents with a simplified task and then gradually increasing the task difficulty over the course of training [33].
The core principle of this CL strategy is to treat the terminal capture conditions not as fixed constraints during training but as adjustable parameters of the learning problem itself. This addresses the challenge of sparse rewards in complex tasks. By starting with a large, easy-to-achieve capture zone (i.e., lenient terminal conditions), agents can more frequently receive positive rewards and learn a basic, viable policy. As training progresses, the capture zone is progressively tightened. This process forces the agents to refine their strategies from coarse maneuvering to precise targeting. It is important to note that this is a training methodology; the final, trained policy is evaluated against the original, fixed mission constraints.
In this work, the curriculum is implemented by dynamically adjusting three key parameters as a function of the training episode number; progressively tightening goal-related parameters in this way is a common technique in CL for control tasks [34]. The generatrix length L and the half-opening angle α of the capture cone are gradually reduced, as illustrated in Figure 5.
The generatrix length L is updated as
$$L(n_e) = L_0 + \frac{n_e}{N}\,(L_f - L_0),$$
where L_0 is the initial generatrix length, L_f is the final generatrix length, N is the total number of training episodes, and n_e is the current training episode number.
Similarly, the capture cone half-opening angle α is updated as
$$\alpha(n_e) = \alpha_0 + \frac{n_e}{N}\,(\alpha_f - \alpha_0),$$
where α_0 is the initial half-opening angle and α_f is the desired final half-opening angle.
Furthermore, the curriculum extends to the prediction horizon, T_pred, used within the terminal prediction reward. This parameter defines the duration over which the future state is predicted to assess capturability with respect to the current L and α. The prediction horizon T_pred is also dynamically adjusted as
$$T_{\mathrm{pred}}(n_e) = T_0 + \frac{n_e}{N}\,(T_f - T_0),$$
where T_0 is the initial, short prediction horizon and T_f is the final, longer prediction horizon. This progressive extension of T_pred encourages the development of more far-sighted strategies.
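A minimal sketch of this linear curriculum schedule is given below; the symbols follow the notation introduced above, and the concrete numbers in the example call are placeholders rather than the values reported in the simulation section.
```python
def curriculum(episode: int, total_episodes: int,
               L0: float, Lf: float,
               alpha0: float, alphaf: float,
               T0: float, Tf: float):
    """Linearly interpolate capture-cone size and prediction horizon with training progress."""
    frac = min(episode / total_episodes, 1.0)      # clamp once training is complete
    L = L0 + frac * (Lf - L0)                      # generatrix length (shrinks)
    alpha = alpha0 + frac * (alphaf - alpha0)      # half-opening angle (shrinks)
    T_pred = T0 + frac * (Tf - T0)                 # prediction horizon (grows)
    return L, alpha, T_pred

# Example: curriculum parameters one third of the way through training.
L, alpha, T_pred = curriculum(20000, 60000, L0=200.0, Lf=50.0,
                              alpha0=60.0, alphaf=20.0, T0=60.0, Tf=600.0)
```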
  3.3. PPO Algorithm Design
PPO is an actor–critic algorithm that optimizes the policy and value functions concurrently. It employs two neural networks: a policy network (the actor, π_θ) that maps states to action probabilities and a value network (the critic) that estimates the state-value function. The state-value function, denoted V^π(s), is trained to approximate the expected return (cumulative discounted reward) from a given state s. Mathematically, the value function under a policy π is defined as in [35]:
$$V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} R_{t+1} \,\middle|\, S_0 = s\right],$$
where γ is the discount factor and R is the reward at each time step. In our implementation, the value network with parameters φ approximates this function, i.e., V_φ(s) ≈ V^π(s). This value serves as a crucial baseline to reduce the variance of policy-gradient estimates.
The PPO algorithm optimizes a clipped surrogate objective. The overall loss function, a composite of the policy loss, the value loss, and an entropy bonus, is defined as [32]:
$$\mathcal{L}(\theta,\phi) = \hat{\mathbb{E}}_t\!\left[\mathcal{L}^{\mathrm{CLIP}}_t(\theta) - c_1\,\mathcal{L}^{\mathrm{VF}}_t(\phi) + c_2\, S[\pi_\theta](s_t)\right],$$
where Ê_t indicates the expectation over a batch of samples, c_1 and c_2 are weighting coefficients, and S denotes an entropy bonus that encourages exploration.
The core of PPO is the clipped surrogate objective for the policy, L^CLIP, which is given by
$$\mathcal{L}^{\mathrm{CLIP}}_t(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\big(r_t(\theta)\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\big)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},$$
where r_t(θ) is the probability ratio between the new and old policies and ε is a hyperparameter defining the clipping range.
The advantage function, Â_t, quantifies how much better an action is than the average action at a state. It is computed using Generalized Advantage Estimation (GAE) [36]:
$$\hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^{l}\,\delta_{t+l}, \qquad \delta_t = R_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t),$$
where γ is the discount factor, λ is the GAE parameter, and δ_t is the temporal-difference (TD) error.
The value function loss, L^VF, is typically a squared-error loss:
$$\mathcal{L}^{\mathrm{VF}}_t(\phi) = \big(V_\phi(s_t) - \hat{R}_t\big)^{2},$$
where R̂_t is the actual return calculated from the trajectory. The network parameters θ and φ are updated by gradient-based optimization of the total loss.
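For illustration, a compact PyTorch version of this composite loss is sketched below; the clip range and coefficient values are typical defaults and are assumptions rather than the paper's tuned settings.
```python
import torch

def ppo_loss(new_logp, old_logp, advantages, values, returns, entropy,
             eps=0.2, c1=0.5, c2=0.01):
    """Composite PPO loss: clipped policy term, value term, and entropy bonus."""
    ratio = torch.exp(new_logp - old_logp)                 # probability ratio r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()    # maximize the clipped objective
    value_loss = (values - returns).pow(2).mean()          # squared-error value loss
    return policy_loss + c1 * value_loss - c2 * entropy.mean()

# Dummy batch illustrating the call signature.
n = 4
loss = ppo_loss(torch.zeros(n), torch.zeros(n), torch.randn(n),
                torch.zeros(n), torch.zeros(n), torch.full((n,), 1.0))
```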
The specific steps of the PPO-based self-play training with curriculum learning are depicted in Algorithm 1.
Algorithm 1 PPO-based Self-Play Training with Curriculum Learning
1: Input: initial policy parameters, value-function parameters, and curriculum learning parameters
2: Output: trained pursuer and evader policy networks
3: Initialize the policy and value networks of the pursuers and the evader
4: for episode = 1, 2, … do
5:    Update the curriculum parameters (L, α, T_pred)
6:    Initialize a data buffer D for each agent (pursuers and evader)
7:    Collect a set of trajectories for all agents by running the current policies in the environment; store the transitions (s, a, r, s′) in D
8:    // Compute advantages and returns
9:    for each agent do
10:       Compute advantage estimates Â for all time steps in D using GAE
11:       Compute returns R̂ for all time steps in D
12:   end for
13:   // Update networks
14:   for K optimization epochs do
15:      for each agent do
16:         Compute the total PPO loss on the current batch of data from D
17:         Update the agent's actor and critic parameters by a gradient step on this loss
18:      end for
19:   end for
20: end for
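As a worked example of steps 10 and 11 of Algorithm 1, the function below computes GAE advantages and discounted returns for a single stored trajectory; the γ and λ values are typical defaults, assumed here rather than taken from the paper.
```python
import numpy as np

def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Return GAE advantages and discounted-return targets for one trajectory."""
    values = np.append(values, last_value)               # bootstrap with V(s_T)
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error delta_t
        gae = delta + gamma * lam * gae                           # discounted sum of deltas
        advantages[t] = gae
    returns = advantages + values[:-1]                    # targets for the value loss
    return advantages, returns

# Example on a three-step trajectory.
adv, ret = gae_advantages([1.0, 0.0, 2.0], [0.5, 0.4, 0.3], last_value=0.0)
```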
  3.4. Reward Function Design
The design of the reward function is paramount for guiding agents toward complex mission objectives, especially those with sparse terminal conditions. A key innovation of this framework is a prediction-based reward component, r_pred, engineered to provide dense, forward-looking guidance. This component is crucial for enabling agents to learn effective strategies that satisfy state-dependent terminal constraints and is designed to be generalizable across different system dynamics.
The total reward for the pursuers, R_P, is constructed as a weighted sum of three components: a prediction reward (r_pred), a terminal reward (r_term), and an efficiency penalty (r_eff). The overall structure is
$$R_P = w_1\, r_{\mathrm{pred}} + w_2\, r_{\mathrm{term}} - w_3\, r_{\mathrm{eff}},$$
where w_1, w_2, and w_3 are positive weighting hyperparameters that balance the influence of predictive guidance, terminal-objective satisfaction, and control-effort minimization. Their specific values are determined through empirical tuning and are reported in the simulation section.
Reflecting the adversarial nature of the game, the evader's reward function, R_E, is designed with opposing signs on these components, and its weighting hyperparameters are likewise positive.
To provide a more intelligent, forward-looking signal than simple distance-based rewards, the prediction reward, r_pred, evaluates the capturability of the current state. It does so by projecting the system's state forward in time over a short, curriculum-defined horizon, T_pred, assuming zero control input during this interval. This projection uses the state transition matrix (STM), Φ(T_pred), which represents the analytical solution of the linearized CW equations (Equation (1)) over the time interval T_pred. The STM is a 6 × 6 matrix that maps an initial relative state to a future state, and its form is well established in astrodynamics [27]. It can be expressed in partitioned form as
$$\Phi(T_{\mathrm{pred}}) = \begin{bmatrix} \Phi_{rr} & \Phi_{rv} \\ \Phi_{vr} & \Phi_{vv} \end{bmatrix},$$
where the 3 × 3 sub-matrices relate position to position (Φ_rr), velocity to position (Φ_rv), and so on. For brevity, the full analytical expressions for these sub-matrices are provided in Appendix A.
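For orientation, a widely used textbook form of these sub-matrices (written here in the conventional Hill frame with the radial axis pointing away from the Earth, mean motion ω, and τ = T_pred) is sketched below; the exact expressions in Appendix A may differ in axis ordering and sign according to the RSW definition adopted in this work.
$$\Phi_{rr} = \begin{bmatrix} 4-3\cos\omega\tau & 0 & 0 \\ 6(\sin\omega\tau-\omega\tau) & 1 & 0 \\ 0 & 0 & \cos\omega\tau \end{bmatrix}, \qquad
\Phi_{rv} = \begin{bmatrix} \tfrac{\sin\omega\tau}{\omega} & \tfrac{2(1-\cos\omega\tau)}{\omega} & 0 \\ \tfrac{2(\cos\omega\tau-1)}{\omega} & \tfrac{4\sin\omega\tau-3\omega\tau}{\omega} & 0 \\ 0 & 0 & \tfrac{\sin\omega\tau}{\omega} \end{bmatrix},$$
$$\Phi_{vr} = \begin{bmatrix} 3\omega\sin\omega\tau & 0 & 0 \\ 6\omega(\cos\omega\tau-1) & 0 & 0 \\ 0 & 0 & -\omega\sin\omega\tau \end{bmatrix}, \qquad
\Phi_{vv} = \begin{bmatrix} \cos\omega\tau & 2\sin\omega\tau & 0 \\ -2\sin\omega\tau & 4\cos\omega\tau-3 & 0 \\ 0 & 0 & \cos\omega\tau \end{bmatrix}.$$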
The prediction process begins with the current six-dimensional relative state vector, x_k, which comprises the relative position and velocity:
$$\mathbf{x}_k = \begin{bmatrix} \mathbf{r}_k \\ \mathbf{v}_k \end{bmatrix},$$
where r_k and v_k are the relative position and velocity, respectively. Using the state transition matrix, the predicted state, x̂_k, is then calculated as
$$\hat{\mathbf{x}}_k = \Phi(T_{\mathrm{pred}})\,\mathbf{x}_k.$$
The predicted state vector contains the predicted relative position, r̂_k, and velocity, v̂_k:
$$\hat{\mathbf{x}}_k = \begin{bmatrix} \hat{\mathbf{r}}_k \\ \hat{\mathbf{v}}_k \end{bmatrix}.$$
The prediction reward is then calculated from the predicted position component, r̂_k. It combines a distance term and an angle term, both evaluated against the current capture-cone parameters and shaped exponentially, where w_d and w_θ are the weights of the distance and angle components and the associated shaping factors control how sharply the reward decays. Here, L and α are shorthand for L(n_e) and α(n_e), the curriculum-dependent parameters at the current episode.
The overall concept is illustrated in Figure 6. This structure incentivizes actions that lead to favorable future states, not just those that myopically reduce the current error.
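A hedged sketch of how such a prediction reward can be computed is given below. The exponential shaping, the shaping factors k_d and k_theta, and the unit choices are assumptions consistent with the textual description, not the paper's exact formulation; the capture-cone axis is taken along +z as stated above.
```python
import numpy as np

def prediction_reward(x_rel, Phi, L, alpha, w_d=1.0, w_theta=1.0, k_d=1e-3, k_theta=2.0):
    """x_rel: 6D pursuer state relative to the evader [r; v]; Phi: 6x6 CW state transition matrix."""
    x_pred = Phi @ x_rel                                 # coast prediction over T_pred
    r_pred = x_pred[:3]                                  # predicted relative position
    dist = np.linalg.norm(r_pred)
    cone_axis = np.array([0.0, 0.0, 1.0])                # capture-cone axis (+z of the evader frame)
    cos_angle = np.clip(r_pred @ cone_axis / max(dist, 1e-9), -1.0, 1.0)
    angle = np.arccos(cos_angle)                         # predicted angle to the cone axis [rad]
    # Exponentially shaped terms: largest when the predicted state already satisfies
    # the current curriculum-dependent capture conditions (L, alpha).
    reward_d = w_d * np.exp(-k_d * max(dist - L, 0.0))
    reward_theta = w_theta * np.exp(-k_theta * max(angle - alpha, 0.0))
    return reward_d + reward_theta

# Example (units assumed to be km and km/s; the identity STM corresponds to a zero horizon).
r = prediction_reward(np.array([0.0, 0.0, 300.0, 0.0, 0.0, -0.01]),
                      np.eye(6), L=200.0, alpha=np.deg2rad(20.0))
```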
The terminal reward, r_term, provides a large, discrete reward or penalty based on the final state at the end of an episode, directly enforcing the mission objective. It is formulated as
$$r_{\mathrm{term}} = \begin{cases} R_s, & \text{if the capture conditions are satisfied at } t_f, \\ -k_p\left(w'_d\, e_d + w'_\theta\, e_\theta\right), & \text{otherwise,} \end{cases}$$
where R_s is a substantial positive reward for success, k_p is a penalty coefficient, e_d and e_θ denote the terminal distance and angle errors with respect to the capture cone, and w'_d and w'_θ are weights for the distance and angle error terms.
The efficiency penalty, r_eff, discourages excessive control effort. It is defined as a quadratic penalty on the velocity increment, Δv_k, applied at the current decision step:
$$r_{\mathrm{eff}} = \kappa\,\lVert \Delta\mathbf{v}_k \rVert^{2},$$
where κ is a positive weighting factor.
  4. Simulation and Results
In this section, the proposed framework is validated through a two-stage simulation process. First, a detailed analysis is conducted in a multi-spacecraft orbital game to rigorously test the core algorithm. Then, the same method is applied to a multi-drone aerial scenario to demonstrate its generalizability and confirm that its performance advantages are transferable across domains.
  4.1. Simulation Setup
To evaluate the performance of the proposed method, multi-agent pursuit-evasion game simulations were conducted within the challenging context of local orbital maneuvering near GEO. These simulations benchmarked our proposed PPO algorithm (with curriculum learning and prediction reward) against a traditional PPO algorithm (without these features) and the Soft Actor–Critic (SAC) algorithm, a state-of-the-art off-policy method included for a comprehensive comparison. To ensure a fair comparison, all algorithms shared identical hyperparameter settings where applicable.
The orbital motion was simulated by numerically integrating the high-fidelity nonlinear equations of relative motion derived from the two-body problem. The CW equations, as detailed in Section 2.1, were used as a linearized approximation only within the agent's internal model for the prediction-based reward calculation. The reference frame is centered on a point in GEO at (45° E, 0° N). For each evaluation episode, the initial positions were set as follows to ensure varied engagement geometries: the evader started from a fixed nominal position, while the two pursuers were randomly and independently sampled from a spherical shell region defined relative to the evader. The key mission parameters are listed in Table 2, and the DRL hyperparameters are detailed in Table 3.
It is important to discuss the selection of the terminal capture-cone parameters. The final values of L and α are chosen to represent a long-range engagement scenario in GEO. In this context, the objective is not precise docking but the establishment of a safe inspection or co-location corridor, which is consistent with the operational scale of this orbital regime.
The curriculum learning strategy dynamically adjusted several key parameters based on the training episode number, facilitating a gradual increase in task difficulty. The specific initial and final values of these parameters, which were linearly interpolated over a total of 60,000 training episodes, are summarized in Table 4.
The simulation training was conducted on a 64-bit Windows 10 operating system with 32 GB of RAM, an Intel® Core™ i7-11800H CPU (Intel Corporation, Santa Clara, CA, USA), and an NVIDIA T600 Laptop GPU (NVIDIA Corporation, Santa Clara, CA, USA).
  4.2. Training Process and Ablation Study
To rigorously evaluate the effectiveness of the two key innovations proposed in this framework, curriculum learning (CL) and the prediction-based reward (r_pred), a comprehensive ablation study was conducted. The performance of our full proposed method was benchmarked against several ablated versions and a standard PPO baseline. Four distinct configurations of the pursuer agents were trained, while the evader agent consistently employed the full method to serve as a challenging and consistent opponent:
Baseline PPO: A standard PPO implementation where the reward is based only on the current state error, without the benefits of CL or .
PPO + CL: The baseline PPO augmented solely with the curriculum learning strategy.
PPO + r_pred: The baseline PPO augmented solely with the prediction-based reward function.
Ours (Full Method): The complete proposed framework, integrating both CL and r_pred.
The learning dynamics of these configurations are visualized in Figure 7. This plot shows the average cumulative episodic reward of the pursuer agents, smoothed over a window of training episodes. To facilitate a clearer comparison of the learning trends on a consistent scale, the reward values for the spacecraft scenario shown in this figure have been scaled by a factor of 0.01 (i.e., divided by 100). This scaling brings the y-axis values into a range that highlights the relative differences in convergence, while the drone-scenario plots (e.g., Figure 13) display the original, unscaled reward values.
The results depicted in Figure 7 clearly illustrate the contribution of each component. The Baseline PPO agent struggles significantly with the task's complexity and the sparse nature of the terminal reward, resulting in slow and unstable learning progress. The introduction of curriculum learning (PPO + CL) provides a notable improvement, especially in the initial training phase: by starting with an easier task, the agent quickly learns a foundational policy, leading to a much faster initial rise in reward. Separately, augmenting the baseline with only the prediction reward (PPO + r_pred) enables the agent to eventually reach a substantially higher reward plateau, confirming that the dense, forward-looking signal from r_pred provides superior guidance toward a more optimal policy.
Crucially, the combination of both components in our full method (Ours) yields the best performance. It benefits from the rapid initial learning kick-started by CL and from the high-quality, long-term guidance of r_pred, resulting in both the fastest convergence and the highest final reward. This demonstrates a clear synergistic effect, in which the two innovations complement each other to overcome the challenges of the complex pursuit-evasion task.
To provide a quantitative validation of these findings, the final pursuit success rate of each configuration was evaluated over 1000 independent test scenarios after training was complete. The results are summarized in Table 5.
The quantitative results in Table 5 strongly corroborate the insights from the learning curves. Both CL and r_pred individually provide significant improvements over the baseline, increasing the success rate by 20.4 and 32.8 percentage points, respectively. Their combination, however, achieves the highest performance with a 90.7% success rate, a 45.4 percentage point improvement over the baseline. This analysis validates that both proposed components are critical and effective contributors to the overall performance of our framework.
  4.3. Performance Evaluation and Comparison
Following the training process, a comprehensive evaluation consisting of 1000 scenarios was conducted. One representative engagement is illustrated in Figure 8, demonstrating the successful approach of the pursuers into the evader's capture zone. The relative distance and relative angle between the pursuers and the evader over time for this engagement are shown in Figure 9 and Figure 10, respectively, confirming that the terminal conditions were met.
To demonstrate the enhanced game-playing capabilities of the trained agents, 1000 scenarios were simulated with agents employing different algorithms. The SAC algorithm was included in our comparative analysis. The pursuit success rates of the pursuers are summarized in Table 6.
The superiority of the proposed algorithm is clearly demonstrated by the results in Table 6. When the pursuers use the proposed algorithm, they achieve higher success rates against all evader strategies than with traditional PPO or SAC. Conversely, when the evader uses the proposed algorithm, it is significantly more effective at evading (i.e., the pursuers' success rate is lowest). For instance, a pursuer using the proposed algorithm has an 85.6% success rate against an equally intelligent evader, but this rate drops to 78.4% if the pursuer uses traditional PPO. These findings confirm that the proposed method enhances the strategic capabilities of both pursuers and evaders and is a more effective approach to this multi-agent game than both traditional PPO and SAC.
In addition to the success rate, a more comprehensive performance assessment was conducted by evaluating key efficiency metrics: the average time to capture and the average control effort (ΔV). These metrics were averaged over all successful pursuit scenarios for each algorithm match-up in which the pursuer used a given strategy against the strongest evader. For clarity, the strongest evader refers to an agent trained with our full proposed algorithm but possessing exactly the same physical constraints (e.g., maximum ΔV) as all other agents, to ensure a fair comparison. The results are presented in Table 7.
The data in Table 7 reveal further advantages of our method: not only is it more successful, it is also more efficient. Our algorithm achieves capture approximately 9.1% faster than traditional PPO and 5.0% faster than SAC. More importantly, it is significantly more fuel-efficient, requiring 25.5% less ΔV than traditional PPO and 14.4% less than SAC. This superior efficiency can be attributed to the proactive, non-myopic policies learned via the prediction reward, which enable agents to plan smoother, more direct trajectories and avoid costly last-minute corrections. This comprehensive analysis confirms that our proposed framework yields policies that are not only more effective but also more efficient and practical for real-world applications.
To further illustrate the strategic differences between the learned policies, a qualitative comparison of their typical behaviors was performed. Figure 11 shows a representative trajectory generated by the SAC-trained pursuers under the same initial conditions as the case shown in Figure 8.
By visually comparing the trajectory from our proposed method (Figure 8) with that from SAC (Figure 11), a clear difference in strategy can be observed. The policy learned by our framework generates smoother, more proactive, and seemingly more coordinated trajectories; the pursuers appear to anticipate the evader's future position and move toward a future intercept point. In contrast, the SAC-trained agents, while still effective, tend to exhibit more reactive or hesitant maneuvers, often involving more pronounced course corrections. This qualitative evidence corroborates the quantitative findings in Table 7, suggesting that our framework's superior efficiency stems from its ability to learn more strategically sound and less myopic policies.
  4.4. Robustness Test Under Perturbations
To evaluate the robustness of the learned policies against unmodeled dynamics and external disturbances, a further set of experiments was conducted. In these tests, a stochastic perturbation force was applied to all agents at each decision step. This perturbation was modeled as a random acceleration vector whose components were independently sampled from a zero-mean Gaussian distribution with standard deviation σ_p. The standard deviation σ_p was set to a small fraction of the maximum control capability, in order to simulate a range of realistic, unmodeled effects such as minor thruster misalignments, sensor noise leading to control execution errors, or other unpredictable external forces.
The performance of our proposed algorithm, along with the baselines, was evaluated over 1000 test scenarios in both the nominal (no perturbation) and the perturbed environments. The resulting pursuit success rates are compared in Table 8.
The results presented in Table 8 highlight the superior robustness of our proposed framework. While all algorithms experience a performance degradation in the presence of perturbations, the policy learned by our method is significantly more resilient: its success rate drops by only 8.3%, whereas traditional PPO and SAC suffer much larger drops of 23.1% and 18.0%, respectively.
This enhanced robustness can be attributed to two factors. First, the prediction-based reward function implicitly encourages the agent to seek states that are not just currently good, but also inherently stable and less sensitive to small deviations. Second, the curriculum learning strategy, by exposing the agent to a wide range of scenarios (from easy to hard), may lead to a more generalized policy that is less overfitted to the ideal dynamics. These findings suggest that our framework is not only effective in nominal conditions but also possesses a high degree of robustness, making it more suitable for real-world deployment.
  4.5. Scalability to Larger Team Sizes
To evaluate the scalability of the proposed framework, the experiments were extended to include a larger team size. Specifically, the DRL framework was applied to a 3-vs.-1 engagement scenario. The core algorithm, including the curriculum learning strategy and the prediction-based reward, remained unchanged. The only adaptation required was a modification of the neural network’s input layer to accommodate the state information of the additional pursuer. The model was subsequently retrained from scratch for this new scenario.
The performance of the fully trained policy was evaluated over 500 test scenarios. The key performance metrics are summarized and compared with the original 2-vs.-1 baseline in Table 9. A representative successful trajectory is visualized in Figure 12, demonstrating effective multi-agent coordination.
The results demonstrate the excellent scalability of our framework. In the 3-vs.-1 scenario, the pursuers effectively leverage their numerical advantage to achieve a higher success rate (95.2%) and a faster capture time compared to the 2-vs.-1 baseline. The successful application of our framework to this more complex scenario with minimal modifications strongly supports its potential as a scalable solution for multi-agent autonomous systems.
  4.6. Generalizability to Multi-Drone Airspace Pursuit-Evasion
While this study primarily validates the proposed DRL framework within a multi-spacecraft orbital game, its core methodology is designed to be generalizable. The framework’s modularity allows for its direct application to multi-drone pursuit-evasion scenarios in airspace by primarily substituting the underlying dynamics model. This section outlines how the key architectural components, such as the self-play PPO structure, curriculum learning, and the predictive reward function, are translated to a drone-based application.
  4.6.1. Dynamics Model and State-Action Spaces
The primary adaptation for the multi-drone scenario involves substituting the orbital dynamics with an appropriate model of UAV flight. A full six-degree-of-freedom (6-DOF) dynamic model of a quadrotor is complex, involving aerodynamic effects and rotor dynamics [37]. While such high-fidelity models are essential for low-level controller design, they introduce significant complexity that can be prohibitive for learning high-level, multi-agent strategic policies.
Therefore, for the purpose of this study, which focuses on the strategic decision-making aspect of the pursuit-evasion game, a simplified point-mass kinematic model is employed. This is a common and effective abstraction in multi-UAV planning and game-theory research [38]. In this model, the state of drone i is represented by its position p_i and velocity v_i, and the agent's action a_i corresponds to a desired acceleration command, which is assumed to be tracked perfectly by a low-level controller. The state propagation over a time step Δt is then given by
$$\mathbf{p}_i(k+1) = \mathbf{p}_i(k) + \mathbf{v}_i(k)\,\Delta t + \tfrac{1}{2}\,\mathbf{a}_i(k)\,\Delta t^{2}, \qquad \mathbf{v}_i(k+1) = \mathbf{v}_i(k) + \mathbf{a}_i(k)\,\Delta t.$$
This abstraction allows the DRL framework to focus on learning the high-level strategic interactions, which is the primary contribution of this work, rather than the intricacies of flight control. The agent's state observation consists of the relative positions and velocities between agents, analogous to the spacecraft scenario.
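A minimal sketch of this propagation step, including the acceleration saturation mentioned in the constraint discussion, is shown below; the time step and acceleration limit are illustrative placeholders.
```python
import numpy as np

def drone_step(p, v, a_cmd, dt=0.1, a_max=5.0):
    """Advance one decision interval of the point-mass (double-integrator) drone model."""
    a_norm = np.linalg.norm(a_cmd)
    a = a_cmd if a_norm <= a_max else a_cmd * (a_max / a_norm)   # actuator saturation
    p_next = p + v * dt + 0.5 * a * dt ** 2
    v_next = v + a * dt
    return p_next, v_next

# Example: one step of a pursuer accelerating laterally while cruising at 10 m/s.
p, v = np.zeros(3), np.array([10.0, 0.0, 0.0])
p, v = drone_step(p, v, a_cmd=np.array([0.0, 3.0, 0.0]))
```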
  4.6.2. Constraint and Objective Adaptation
The game's objectives and constraints translate directly. The pursuer's goal is to maneuver into a capture cone, a virtual conical region centered on the evader. This cone is defined by a length L and a half-opening angle α, and its central axis is typically oriented relative to the evader's state (e.g., opposing its velocity vector), representing a common engagement criterion in aerial adversarial scenarios [23]. Collision-avoidance constraints between drones (Equation (7)) and actuator saturation limits (e.g., a maximum acceleration) are also directly applicable.
  4.6.3. Prediction Reward Function Transferability
The key innovation of the prediction reward function (r_pred) is highly transferable. The prediction of the future relative state, originally performed using the CW state transition matrix, instead uses a simple linear kinematic projection based on the current relative position r and relative velocity v_rel [39]:
$$\hat{\mathbf{r}} = \mathbf{r} + \mathbf{v}_{\mathrm{rel}}\, T_{\mathrm{pred}}.$$
The reward is then calculated using the same shaped exponential function, based on this predicted state r̂ and the current capture-cone parameters (L, α). This demonstrates that the predictive guidance principle is independent of the specific system dynamics.
  4.6.4. Curriculum Learning Application
The curriculum learning strategy is entirely dynamics-agnostic. Starting with a large, easy-to-achieve capture zone (large L and α) and gradually tightening these parameters to their final, more challenging values (Equations (8) and (9)) is a general principle for accelerating training in goal-oriented tasks. It is applied identically in the multi-drone simulation to guide the agents from simple to complex maneuvering strategies.
  4.6.5. Illustrative Simulation for Multi-Drone Engagement
To empirically validate the framework's generalizability, a concise simulation of a 2-vs.-1 multi-drone pursuit-evasion scenario was conducted with randomized initial conditions to ensure diverse engagement geometries. The simulation uses the point-mass kinematic model described above, with parameters scaled appropriately for an aerial engagement, as detailed in Table 10. The identical DRL architecture, including the PPO algorithm, prediction reward, and curriculum learning, was applied without modification.
The training process demonstrated stable convergence, akin to the spacecraft scenario, as shown in Figure 13. To conduct a meaningful performance evaluation, a validation test of 500 scenarios per match-up was performed. As presented in Table 11, the comparative analysis for the drone engagement yields results that are highly consistent with those of the spacecraft scenario, confirming that the strategic advantages of the proposed algorithm are preserved in the drone domain. For instance, when both sides use the proposed algorithm, the pursuers achieve an 88.5% success rate. This rate increases to 94.2% against a traditional PPO evader, while a traditional PPO pursuer's success rate drops to 80.7% against the proposed evader. A representative trajectory of a successful capture is depicted in Figure 14.
In addition to success rates, the average capture time was also evaluated to provide a more comprehensive assessment of policy efficiency. The results, averaged over all successful scenarios for each match-up, are presented in Table 12.
The capture time results provide further insights. Our proposed algorithm, when used by the pursuers against the strongest evader (also using our proposed algorithm), achieves a fast capture time of 12.8 s. This demonstrates that the high success rate is not achieved at the cost of prolonged engagement times. The analysis of these multiple performance metrics in the drone domain further strengthens the conclusion that our framework’s advantages are robustly transferable.
  4.6.6. Validation in an Obstacle-Rich Environment
To further test the adaptability of our framework in more complex and diverse aerial environments, static obstacles were introduced into the 2-vs.-1 drone scenario. The environment was populated with several large, spherical obstacles. Handling this required two minor adaptations to the DRL agent. First, the state representation was augmented to include the relative position of, and distance to, the nearest obstacle. Second, a collision penalty term, r_col, was added to the reward function to encourage obstacle avoidance. This penalty grows as a pursuer approaches an obstacle, where d_obs is the distance to the surface of the nearest obstacle and c is a positive scaling hyperparameter. The agent's total reward in this environment is the sum of the previously defined rewards and r_col.
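The exact shaping of r_col follows the paper's own equation; one plausible form consistent with the description above (a smooth penalty that grows as d_obs shrinks, scaled by the hyperparameter c, with an assumed decay length d_0 that is not part of the original text) is, for illustration:
$$r_{\mathrm{col}} = -\,c\,\exp\!\left(-\frac{d_{\mathrm{obs}}}{d_{0}}\right).$$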
The model was then retrained in this new, more challenging environment. A representative trajectory from a successful capture in the obstacle-rich environment is shown in Figure 15. The figure clearly illustrates the policy's ability to navigate the cluttered environment, guiding the pursuers to avoid obstacles effectively while coordinating their attack.
The quantitative performance in this cluttered environment is compared with the obstacle-free case in Table 13.
As expected, the presence of obstacles makes the task more challenging, resulting in a slightly lower success rate and a longer average capture time. Nevertheless, the algorithm still achieves a high success rate of 89.6%, demonstrating that the core principles of our framework—particularly the forward-looking guidance from the prediction reward—can be effectively adapted to learn complex, multi-objective behaviors such as simultaneous pursuit and collision avoidance. This result further underscores the versatility and robustness of the proposed approach.
In summary, adapting the proposed framework for multi-drone pursuit-evasion primarily involves substituting the environment’s physics engine. As demonstrated by the illustrative simulation, the core learning algorithm, predictive reward structure, and curriculum-based training strategy remain robust and highly effective in a fundamentally different physical domain. This empirical validation demonstrates that the method’s comparative advantages are preserved across different domains. Such a result strongly underscores the generalizability of our approach for developing autonomous decision-making systems for diverse adversarial games.
  5. Conclusions
In this paper, a generalizable deep reinforcement learning framework was proposed for autonomous pursuit-evasion games and validated in both multi-spacecraft and multi-drone scenarios. The framework integrates a self-play PPO architecture with two key innovations: a dynamics-agnostic curriculum learning (CL) strategy and a transferable prediction-based reward function. Comprehensive simulation results confirmed that this synergistic approach yields highly effective, efficient, and robust policies. In the primary spacecraft validation, the proposed method achieved a high pursuit success rate of 90.7%. Furthermore, it demonstrated remarkable resilience, with a performance drop of only 8.3% under stochastic perturbations. These results represent a significant improvement over standard DRL baselines. The CL strategy was shown to accelerate training, while the prediction reward proved crucial for achieving these high success rates by effectively handling complex terminal constraints.
Despite the promising results, this study has several limitations that offer avenues for future research. The dynamics models, while realistic, still represent a simplification of the true physics. Specifically, for the spacecraft scenario, a key limitation is the model mismatch between the high-fidelity nonlinear environment and the agent’s internal prediction model, which relies on the linearized CW equations. A quantitative analysis of the performance degradation directly caused by this model error was not conducted in this study. Investigating the bounds of this error and developing policies that are even more robust to such model inaccuracies (e.g., by incorporating online system identification or more precise onboard models) is a critical direction for future work. Furthermore, real-world factors such as communication delays and sensor noise were not considered, and the framework’s scalability to larger and more complex team engagements requires further validation.
Building upon these findings and limitations, future research will proceed along several key directions. First, a primary thrust will be the validation of the framework in higher-fidelity simulation environments (e.g., considering orbital perturbations and complex aerodynamics) and eventually on physical hardware testbeds. Second, future work will focus on extending the framework to handle more complex scenarios, including the introduction of environmental factors like communication delays and the deployment of heterogeneous agent teams. Finally, investigating the strategic robustness of the trained policies against a wider range of sophisticated, unforeseen adversarial tactics remains a crucial area for ensuring real-world viability.