Article

Distributed Pursuit–Evasion Game Decision-Making Based on Multi-Agent Deep Reinforcement Learning

by Yanghui Lin 1, Han Gao 1,2,* and Yuanqing Xia 1
1 School of Automation, Beijing Institute of Technology, Beijing 100081, China
2 Advanced Technology Research Institute, Beijing Institute of Technology, Beijing 100081, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(11), 2141; https://doi.org/10.3390/electronics14112141
Submission received: 14 April 2025 / Revised: 17 May 2025 / Accepted: 22 May 2025 / Published: 24 May 2025
(This article belongs to the Special Issue Advanced Control Strategies and Applications of Multi-Agent Systems)

Abstract:
Pursuit–evasion games are a fundamental framework for advancing autonomous decision-making and cooperative control in multi-UAV systems. However, the application of reinforcement learning to pursuit–evasion games involving fixed-wing UAVs remains challenging due to constraints such as minimum velocity, limited turning radius, and high-dimensional continuous action spaces. To address these issues, this paper proposes a method that integrates automatic curriculum learning with multi-agent proximal policy optimization. A self-play mechanism is introduced to simultaneously train both pursuers and evaders, enabling dynamic and adaptive encirclement strategies. In addition, a reward structure specifically tailored to the encirclement task is designed to guide the pursuers in gradually achieving the encirclement of the evader while ensuring their own safety. To further improve training efficiency and convergence, this paper develops a subgame curriculum learning framework that progressively exposes agents to increasingly complex scenarios, facilitating experience accumulation and skill transfer. The simulation results demonstrate that the proposed approach improves learning efficiency and cooperative pursuit performance under realistic fixed-wing UAV dynamics. This work provides a practical and scalable solution for multiple fixed-wing UAV pursuit–evasion missions in complex environments.

1. Introduction

With the continuous advancement of unmanned systems, unmanned aerial vehicles (UAVs) have garnered significant attention from researchers due to their low cost, high operational flexibility, and reduced risk. Compared to a single UAV, multi-UAV systems exhibit enhanced cooperative capabilities, improved task execution efficiency, and broader operational coverage, making them highly promising for applications such as search and rescue [1], target tracking [2], and border defense [3]. Among these applications, pursuit–evasion games (PEGs) represent a crucial research domain within multi-UAV systems, as they involve adversarial interactions among intelligent agents and serve as a fundamental framework for studying autonomous decision-making and cooperative control in UAV swarms. Typically, a PEG consists of two factions: the pursuer and the evader. The goal of the pursuer is to capture the evader as quickly as possible, while the evader aims to evade capture for as long as possible.
Traditionally, PEGs have been studied using differential game theory [4], with control strategies derived via optimal control [5], Model Predictive Control (MPC) [6], and other advanced control techniques. For instance, in [5], an algorithm integrating geometric methods and differential game theory was proposed to derive the optimal strategies for PEGs of players with damped double-integrator dynamics. In [6], a cooperative pursuit problem involving multiple pursuers and a single evader in an unbounded two-dimensional environment was studied. A robust MPC framework was proposed to guarantee that the evader was consistently constrained within the convex hull formed by the pursuers. In [7], an open-loop two-on-one PEG was investigated, where the evader was chosen as the leader, and a numerical solution for the time-optimal constrained trajectory was obtained using the open-loop Stackelberg approach. While these methods offer strong theoretical guarantees, they typically require accurate system models and intensive computation, posing limitations in real-time applications involving uncertain or dynamic environments. Furthermore, their scalability is severely constrained when handling large-scale multi-agent systems.
In recent years, the emergence of reinforcement learning (RL) has provided a novel solution for control and decision-making in nonlinear systems. Unlike traditional methods that rely on precise environmental models, RL algorithms directly optimize policies based on feedback obtained from interactions with the environment. Through continuous learning, agents gradually acquire the ability to take optimal actions in different states, thereby maximizing cumulative rewards. Furthermore, RL enables an offline training and onboard execution paradigm, effectively bypassing the computational complexity associated with real-time decision-making, which makes it well suited for real-time guidance in lightweight UAVs. Extensive and in-depth research has been conducted by scholars worldwide on the application of RL techniques to PEGs [8,9,10,11,12,13]. For instance, in [8], a reinforcement learning-based distributed cooperative pursuit strategy that enables agents to autonomously learn cooperative control and communication policies was designed. An online motion planning algorithm based on a target prediction network and the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm was proposed to address the UAV PEG in [9]. Xiao et al. [12] introduced an asynchronous multi-stage deep reinforcement learning framework to tackle the challenge of safe evader navigation in the presence of physical attacks from multiple pursuers. It should be noted that while extensive research has been conducted on UAV pursuit–evasion games using RL methods, most studies have focused on dynamically simpler models, such as quadcopter UAVs or particle-based models, which are not well suited for fixed-wing UAVs [14].
In certain complex environments, fixed-wing UAVs demonstrate greater endurance and resilience compared to rotary-wing UAVs. However, the design of autonomous decision-making systems for fixed-wing platforms presents significant challenges. Due to the necessity of maintaining a minimum forward speed to generate lift, fixed-wing UAVs are incapable of hovering or executing arbitrary directional turns. Moreover, they require simultaneous control over multiple degrees of freedom and parameters such as roll, yaw, and velocity, which leads to high system dimensionality and strong coupling in dynamic game scenarios [15]. These characteristics substantially increase the difficulty of applying pursuit–evasion strategies to fixed-wing UAVs. Research on PEGs involving fixed-wing UAVs has primarily focused on the air combat domain [16,17,18,19,20,21,22]. In [16], a short-range air combat autonomous operation decision-making model for UAVs was developed, leveraging reinforcement learning to enable autonomous decision-making in air combat scenarios. Building on this foundation, a communication mechanism was integrated into reinforcement learning in [19], enabling multi-UAV information sharing based on self-learning communication. In [20], a strategy combining line-of-sight (LOS) angle rate correction techniques with DRL was employed to learn high-speed UAV pursuit–evasion strategies. Nevertheless, RL is also subject to the curse of dimensionality, which poses substantial challenges in complex tasks involving numerous agents, high-dimensional state spaces, and continuous action domains. In such scenarios, RL training often suffers from convergence issues, low sample efficiency, and limited policy exploration, all of which significantly hinder its practical applicability. In [21], an algorithm combining hierarchical reinforcement learning and behavior cloning was proposed to improve the stability and learning efficiency of autonomous decision-making training in within-visual-range air combat scenarios. However, the heavy reliance on expert knowledge limits the algorithm’s adaptability and generalization capability. In [22], a training framework combining Particle Swarm Optimization (PSO) with the Minimax Multi-Agent Deep Deterministic Policy Gradient (M3DDPG) algorithm was proposed, where PSO is used to optimize the sample data set, thereby improving sample efficiency and convergence speed. However, the training is conducted solely against opponents with fixed maneuvering strategies, which limits the robustness and adaptability of the learned policy when facing more diverse or adaptive adversaries in dynamic and uncertain environments.
More recently, the emergence of curriculum learning [23] has provided a novel approach to address the aforementioned challenges. Within the curriculum learning framework, agents are initially trained in lower-complexity environments to master fundamental strategies. As training progresses, the complexity of the tasks is gradually increased. By leveraging this approach, the convergence performance and sample efficiency of the policy training process can be improved [24,25,26]. In [24,25], the task difficulty was adjusted by setting parameters such as motion speed and encirclement radius, allowing the agent to undergo training from easy to more complex tasks, thereby enhancing the robustness of the entire training process. In [26], task-specific curriculum learning was proposed to accelerate the training of collision avoidance and clustering strategies for fixed-wing UAVs. Additionally, curriculum learning has been introduced in the domain of fixed-wing UAV pursuit–evasion games to expedite the training of MARL [27,28,29]. In [27], the task was simplified by gradually weakening the strategy of the opponent to design an appropriate curriculum sequence, and a curriculum evaluation method was proposed to verify the effectiveness of the generated curriculum. In [28], the target task was divided into multiple subtasks with distinct reward structures, allowing the agent to achieve progressive learning by sequentially mastering each subtask. It is worth noting that the aforementioned curriculum learning approaches typically rely on heuristics or prior domain knowledge for curriculum design, with task sequences manually crafted. Such methods are often subjective and inflexible, lacking the adaptability to dynamically respond to changing environments or evolving task complexities.
To bridge these gaps, this paper proposes a novel framework that integrates automatic subgame curriculum learning with multi-agent proximal policy optimization (MAPPO) to address the PEG problem among multiple fixed-wing UAVs. Specifically, we introduce a self-play training mechanism to co-evolve strategies for both pursuers and evaders, enhancing policy robustness against adaptive opponents. Additionally, a reward structure tailored for encirclement is designed to guide cooperative strategies while ensuring agent safety. To improve learning efficiency, we develop a subgame curriculum learning module that adaptively samples learning scenarios based on difficulty estimation, facilitating faster and more stable policy training. The primary contributions of this paper are as follows.
  • This paper investigates the dynamic PEG problem for multiple fixed-wing UAVs and proposes a distributed control method based on automatic subgame curriculum learning and MAPPO. In contrast to previous studies that typically assume fixed or simplistic evader strategies, our method employs a self-play mechanism to jointly train both pursuer and evader policies. This enables adaptive and dynamic encirclement of more capable and intelligent evaders.
  • A reward structure tailored for encirclement missions is introduced in this paper. Unlike methods that rely solely on distance or UAV orientation to determine task success, our approach takes into account the dynamic surrounding of the evader while ensuring the safety of the pursuer, guiding the pursuer to perform effective encirclement without compromising its own security.
  • To address the challenges of self-play solution difficulties and low sample efficiency, a training framework based on subgame curriculum learning and MAPPO is proposed. Compared to manually designed curricula or empirically determined subgame sequences, the proposed framework strategically plans the learning order by integrating self-guided subgame sampling. This enhances both sampling and learning efficiency while reducing reliance on human expertise and subjective bias.
This paper is organized as follows: Section 2 introduces the preliminaries required for this paper, including the fixed-wing UAV model, Markov games, and proximal policy optimization. In Section 3, the multiple fixed-wing UAVs PE scenario considered in this paper is described in detail, along with a comprehensive introduction to the proposed curriculum-based MARL algorithm. Section 4 covers the presentation and analysis of the simulation results, while Section 5 offers the conclusions.

2. Preliminaries

2.1. UAV Dynamics Model

To address the autonomous decision-making challenges in multi-UAV pursuit–evasion scenarios, a simplified yet representative dynamics model is essential for computational efficiency and real-time maneuver planning. In the real world, the motion kinematics and dynamics of a fixed-wing UAV can be described using a six-degrees-of-freedom (DOF) model. However, in this study, the primary focus is on the cooperative encirclement control of the pursuer UAVs against the evader UAV in the horizontal plane, where altitude variations have a limited impact on the overall encirclement performance. Therefore, it is assumed that each UAV maintains a constant flight altitude, and a simplified 4-DOF model [14,26] is employed to describe their motion.
The fixed-wing UAV model is established within a two-dimensional inertial coordinate system, as shown in Figure 1. In this system, the x-axis points east, and the y-axis points north. The state of the UAV is defined as $[x, y, v, \psi, \varphi]^T$, where $(x, y)$ represents the position of the UAV in the two-dimensional inertial coordinate system; $v$ is a scalar denoting the velocity of the UAV; and $\psi$ and $\varphi$ denote the yaw angle and roll angle of the UAV, respectively. With these five variables, the state of the UAV at any given moment is fully determined. The centroid kinematic equations of the UAV are defined as follows.
$$\dot{x} = v \cos\psi, \qquad \dot{y} = v \sin\psi.$$
Disregarding the attitude control of the UAV, its control variables can be represented by two parameters, $(n_x, \varphi)$, where $n_x$ denotes the overload in the direction of the velocity of the UAV. Thus, the dynamic model of the UAV is defined as follows.
$$\dot{v} = g n_x, \qquad \dot{\psi} = \frac{g \tan\varphi}{v},$$
where $g$ is the acceleration of gravity. From the dynamic model (2), it can be seen that of the two control variables, $n_x$ controls the magnitude of the velocity, while $\varphi$ controls the change in the direction of the velocity.
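As a minimal illustration of Equations (1) and (2), the following Python sketch advances the simplified fixed-wing state by one Euler step; the function and variable names are illustrative and not taken from the paper's implementation.

```python
import numpy as np

G = 9.81  # gravitational acceleration (m/s^2)

def step_uav(state, n_x, phi, dt):
    """Advance the simplified fixed-wing state [x, y, v, psi] by one Euler step
    using the kinematics of Eq. (1) and the dynamics of Eq. (2)."""
    x, y, v, psi = state
    x_dot = v * np.cos(psi)          # Eq. (1): planar kinematics
    y_dot = v * np.sin(psi)
    v_dot = G * n_x                  # Eq. (2): overload along the velocity controls speed
    psi_dot = G * np.tan(phi) / v    # Eq. (2): roll angle controls the turn rate
    return np.array([x + x_dot * dt,
                     y + y_dot * dt,
                     v + v_dot * dt,
                     psi + psi_dot * dt])
```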

2.2. Markov Game

The Markov game is a widely used framework for addressing multi-agent decision-making and control problems. It can be viewed as an extension of the Markov Decision Process (MDP) to a multi-agent environment, which allows multiple participants to interact, compete, or collaborate within a shared environment. Generally, a Markov game is defined as a tuple $\mathrm{MG} = \langle \mathcal{N}, \mathcal{S}, \mathcal{A}, P, R, \gamma, \rho \rangle$, where $\mathcal{N} = \{1, \ldots, N\}$ is the set of $N$ agents; $\mathcal{S}$ is the state space, which is the set of all possible states of the agents; $\mathcal{A} = \prod_{i=1}^{N} \mathcal{A}_i$ is the joint action space, with $\mathcal{A}_i$ being the action space of agent $i$; $P: \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ denotes the state transition probability function; $R = (R_1, \ldots, R_N): \mathcal{S} \times \mathcal{A} \to \mathbb{R}^N$ represents the joint reward function, with $R_i$ being the reward function of agent $i$; $\gamma \in [0, 1]$ is the discount factor of rewards; and $\rho$ is the distribution of initial states. For a given state $s$ and joint action $a = (a_1, \ldots, a_N)$, the game moves to the next state $s'$ with probability $P(s' \mid s, a)$, and agent $i$ receives a reward $R_i(s, a)$.
For the Markov game $\mathrm{MG}$, a subgame $\mathrm{MG}(s)$ refers to the game initiated from a specific state $s$ [30], meaning the initial state distribution satisfies $\rho(s) = 1$. Thus, selecting a subgame essentially corresponds to determining the starting state of the overall Markov game.
In a zero-sum game, the interests of the participating agents are completely opposed, meaning that for all state–action pairs the total reward of all agents sums to zero, i.e., $R_1(s, a) + \cdots + R_N(s, a) = 0,\ \forall (s, a) \in \mathcal{S} \times \mathcal{A}$. It is assumed that agent $i$ follows a strategy $\pi_i: \mathcal{S} \to \mathcal{A}_i$ to generate actions and maximize its own accumulated reward. To formalize this objective, the value function and Q-function are introduced.
$$V_i^{\pi}(s) = \mathbb{E}\left[\sum_{t} \gamma^t R_i(s_t, a_t) \,\middle|\, s_0 = s\right],$$
$$Q_i^{\pi}(s, a) = \mathbb{E}\left[\sum_{t} \gamma^t R_i(s_t, a_t) \,\middle|\, s_0 = s,\ a_0 = a\right],$$
where $V_i^{\pi}(s)$ represents the value function, which measures the expected return from the current state $s$ under the policy $\pi$, and $Q_i^{\pi}(s, a)$ denotes the Q-function, which evaluates the expected return for a given state–action pair $(s, a)$ under the policy $\pi$.
The solution to a zero-sum game is defined as a Nash equilibrium (NE). A joint policy $\pi^* = (\pi_1^*, \ldots, \pi_N^*)$ is a Nash equilibrium of a zero-sum game if, for all initial states $s_0$ with $\rho(s_0) > 0$, the following condition holds.
$$\pi_i^* = \arg\max_{\pi_i} V_i^{(\pi_i, \pi_{-i}^*)}(s_0), \quad i = 1, \ldots, N,$$
where the subscript $-i$ denotes all agents other than agent $i$.
Then, the minimax nature of zero-sum games can be inferred.
$$V_i^*(s) = \max_{\pi_i} \min_{\pi_{-i}} \mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[Q_i^*(s, a)\right],$$
$$Q_i^*(s, a) = R_i(s, a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\left[V_i^*(s')\right].$$

2.3. Proximal Policy Optimization

In the previously established Markov game environment, RL aims to obtain optimal policies by continuously interacting with the environment to maximize the expected cumulative reward gained by the agent. In DRL, the policy of the agent is typically represented by an actor network, denoted as π θ with θ being the trainable parameters of the neural network. Thus, the task of finding the optimal policy π * of the agent can be formulated as an optimization problem over the parameters of the actor network, that is,
$$\theta^* = \arg\max_{\theta} J(\theta) = \arg\max_{\theta}\ \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{k=0}^{N-1} \gamma^k R_k\right],$$
where $J(\theta)$ represents the objective function of the agent, $N$ denotes the total number of timesteps, and $\gamma \in [0, 1]$ is the discount factor of the rewards.
Building upon the aforementioned optimization objective, proximal policy optimization (PPO) [31] is a widely used policy gradient method that enhances training stability and sample efficiency by constraining the magnitude of policy updates. Specifically, in PPO, the objective defined in (8) can be rewritten as follows.
$$\theta^* = \arg\max_{\theta}\ \left[J^{\mathrm{clip}}(\theta) + c\, S(\theta)\right],$$
where $J^{\mathrm{clip}}(\theta)$ represents the objective function for the policy update, obtained by reweighting the data sampled from the behavior policy through importance weights and clipping operations. $S$ represents the entropy term, which is incorporated to encourage the agent to explore the environment fully and to prevent premature convergence to a suboptimal solution. $c$ is the hyperparameter that controls the relative importance of the two terms.
In addition, PPO employs the actor–critic architecture, where the actor network is responsible for policy execution, while the critic network evaluates the value of states to guide policy updates. This structure helps improve learning efficiency by reducing variance in policy gradient estimation, leading to more stable and effective training. The objective of the critic network can be defined as follows.
$$\phi^* = \arg\max_{\phi}\ \big({-}H(\phi)\big),$$
where $\phi$ denotes the parameters of the critic network, and $H$ is the objective function of the value network, defined as the mean squared error of the value function.
Taking the actor network as an example, stochastic gradient ascent (SGA) is employed to update the parameters $\theta$. Based on Equation (9), the update law of $\theta$ is defined as $\theta \leftarrow \theta + \eta_\theta \nabla_\theta J(\theta)$, with $\eta_\theta$ being the learning rate. The gradient $\nabla_\theta J(\theta)$ is reformulated in a policy-dependent manner, defined as follows:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{k=0}^{N-1} \nabla_\theta \log \pi_\theta(a_k \mid s_k)\left(J^{\mathrm{clip}}(\theta) + c\, S(\theta)\right)\right].$$
Updating the parameter θ based on Equation (11) increases the probability of selecting actions that yield positive feedback while decreasing the probability of choosing actions that result in negative feedback. This iterative process gradually optimizes the policy toward the optimal policy π * .
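The clipped surrogate objective with an entropy bonus in Equation (9) can be sketched as follows in PyTorch; the tensor names and default coefficients are assumptions, not values from the paper.

```python
import torch

def ppo_actor_objective(log_probs_new, log_probs_old, advantages, entropy,
                        clip_eps=0.2, c_entropy=0.01):
    """Clipped surrogate objective J_clip(theta) + c * S(theta), to be maximized."""
    ratio = torch.exp(log_probs_new - log_probs_old)                # importance weights
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    j_clip = torch.min(surr1, surr2).mean()                         # clipped policy term
    return j_clip + c_entropy * entropy.mean()                      # entropy bonus
```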

3. Problem Formulation and Methods

This paper addresses the problem of PEG involving multiple fixed-wing UAVs and proposes an autonomous decision-making method based on subgame curriculum learning and multi-agent reinforcement learning. First, a Markov game model for the PEG scenario is established, derived from the kinematics and dynamics models of the fixed-wing UAVs. Next, the overall framework of the algorithm is introduced, followed by a detailed explanation of the subgame curriculum learning mechanism and the multi-agent proximal policy optimization under the centralized training with decentralized execution (CTDE) framework.

3.1. Problem Statement

We consider a PEG involving multiple fixed-wing UAVs in a two-dimensional continuous space, where a group of pursuers aims to cooperatively capture an evader. The game is modeled as a Markov game, defined by the tuple $\mathrm{MG} = \langle \mathcal{N}, \mathcal{S}, \mathcal{O}, \mathcal{A}, P, R, \gamma, \rho \rangle$.
The objective of each agent i is to maximize its expected discounted return:
$$J_i(\pi_i) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_i(s_t, a_t^i)\right],$$
where π i is the policy of agent i.
To improve training efficiency in this multi-agent setting, a subgame curriculum learning framework is introduced. Let $s$ denote the initial state of a subgame. A subgame curriculum sampler $\Psi$ is defined to select subgames over time:
$$s \sim \Psi(D_\pi),$$
where D π evaluates the importance of each subgame based on the current policy π . The curriculum policy seeks to order subgames to facilitate progressive learning and faster convergence.
The training objective becomes the following:
$$\max_{\pi}\ \mathbb{E}_{s \sim \Psi(D_\pi)}\left[\sum_{i \in \mathcal{I}} J_i(\pi_i \mid s)\right].$$
This formulation integrates curriculum learning into multi-agent reinforcement learning (MARL), where the learning sequence is guided by estimated subgame importance.

3.2. Pursuit–Evasion Scene

To construct the Markov game environment for the multi-UAV pursuit–evasion scenario, it is necessary to specifically define the tuple $\mathrm{MG} = \langle \mathcal{N}, \mathcal{S}, \mathcal{O}, \mathcal{A}, P, R, \gamma, \rho \rangle$ of the Markov game. This includes defining the state representation, action space, reward structure, and other relevant aspects of the environment. The established Markov game environment is illustrated in Figure 2.
The Markov game environment established in this study models a PEG of multiple fixed-wing UAVs, where multiple pursuers coordinate to capture a single evader. As illustrated in Figure 2, the pursuers aim to move toward the evader from arbitrary initial positions, leveraging local observations while ensuring collision avoidance and cooperative encirclement. The success criterion for the pursuers is defined as achieving a uniform encirclement around the evader within a specified proximity. Conversely, the evader’s objective is to maximize its distance from the pursuers and avoid being encircled. The environment is formulated under a zero-sum game framework to model the adversarial nature of the pursuit–evasion interactions.

3.2.1. State Representation and Observation Model

In the PE game scenario, the state space provides a complete representation of the environment, where the situation at any given moment can be fully described by the environmental state. It is assumed that there are $N$ drones in the environment, consisting of $n_{\mathrm{adv}}$ pursuers and $(N - n_{\mathrm{adv}})$ evaders, indexed by $i$, where $i < n_{\mathrm{adv}}$ denotes a pursuer and the remaining drones are evaders. Based on the previously established UAV kinematics model (1) and dynamics model (2), the state of UAV $i$ is chosen as $s_i = [x, y, v, \psi]^T$. Finally, by incorporating the current timestep of the environment, the state representation is defined as follows.
$$S = [s_1, \ldots, s_N, n_{\mathrm{env}}]^T,$$
where $N$ represents the number of UAVs in the environment, and $n_{\mathrm{env}}$ denotes the current step count of the environment.
From the kinetic model (1) and dynamics model (2) of UAVs, the state transition equation for the fixed-wing UAV i is derived as follows.
$$s_i^{t+1} = s_i^t + \begin{bmatrix} v^t \cos\psi^t \\ v^t \sin\psi^t \\ g n_x^t \\ \dfrac{g \tan\varphi^t}{v^t} \end{bmatrix} \mathrm{d}t,$$
where the superscript $t$ denotes the timestep of the environment, and $\mathrm{d}t$ represents the time interval between consecutive environment steps.
Next, an observation model is established, consisting of the combined observations of each agent. In the PE game scenario, agents need to perceive the current game situation from their observations and make maneuvering decisions accordingly. It is assumed that each agent can access its own state as well as the relative states of the other agents with respect to itself. To improve training efficiency and avoid the difficulties neural networks face when learning complex trigonometric functions, angle-related quantities in the observation are represented directly by their trigonometric values rather than by the raw angles. Therefore, the observation of agent $i$ is defined as $o_i = [\bar{s}_{ij}^T, n_{\mathrm{env}}, s_i^T]$, $j = 0, \ldots, N$, $j \neq i$, with $\bar{s}_{ij} = [x_j - x_i, y_j - y_i, v_j, \cos\psi_j, \sin\psi_j]^T$ being the relative state of agent $j$ with respect to agent $i$.
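A possible construction of the observation vector described above is sketched below; the exact ordering of the components and the use of trigonometric values for the agent's own heading are assumptions.

```python
import numpy as np

def build_observation(states, i, n_env):
    """Assemble agent i's observation from the relative states of all other
    agents, the environment step counter, and its own state, with angles
    encoded as (cos, sin) pairs."""
    x_i, y_i, v_i, psi_i = states[i]
    rel = []
    for j, (x_j, y_j, v_j, psi_j) in enumerate(states):
        if j == i:
            continue
        rel.extend([x_j - x_i, y_j - y_i, v_j, np.cos(psi_j), np.sin(psi_j)])
    own = [x_i, y_i, v_i, np.cos(psi_i), np.sin(psi_i)]
    return np.array(rel + [n_env] + own, dtype=np.float32)
```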

3.2.2. Action Space

In the scenario studied in this paper, each UAV must make maneuvering decisions in the pursuit–evasion game based on its own state and the relative states of the other UAVs. The evader aims to escape, while the pursuers aim to surround and intercept it. Within this context, the focus is on position control of the UAVs, without considering attitude control. Based on the dynamic model of the fixed-wing UAVs defined in Equation (2), the control variables of the UAVs are set as $(n_x, \varphi)$, i.e., the tangential overload and the roll angle. Consequently, the environmental action is defined as follows.
$$a = [n_x, \varphi]^T.$$

3.2.3. Reward Structure

The reward function serves as the cornerstone of the environment, directly shaping agent behavior through its incentive mechanism. A well-designed reward function not only accelerates convergence and enhances learning stability but also ensures that the agent’s behavior aligns with the underlying task objectives. In the multi-UAV pursuit–evasion game studied in this paper, the objective of the pursuers is to form and maintain a uniform encirclement around the evader, while the evader aims to avoid being surrounded through agile and evasive maneuvers. To achieve this control objective, the reward is decomposed into three intuitive components: a relative distance-related reward, an encirclement formation-related reward, and a mission success-related reward. Each component is crafted based on domain-specific considerations to effectively guide the agents’ behavior toward the desired outcomes.
  • Relative distance-related reward. To encourage the pursuers to approach the evader and the evader to distance itself from the pursuers, rewards and penalties are assigned based on the distance between the two parties. In addition, a stepwise reward and penalty scheme based on relative distance is designed. This metric is defined as follows.
    $$R_{d,p} = -c_d \sum_{i=0}^{n_{\mathrm{adv}}} |r_i - r_e| + \begin{cases} -50, & \text{if } |r_i - r_e| > 5\ \mathrm{km} \\ 10, & \text{if } |r_i - r_e| \le 2\ \mathrm{km} \\ 0, & \text{otherwise}, \end{cases}$$
    $$R_{d,e} = c_d \sum_{i=0}^{n_{\mathrm{adv}}} |r_i - r_e| + \begin{cases} 50, & \text{if } |r_i - r_e| > 5\ \mathrm{km} \\ -10, & \text{if } |r_i - r_e| \le 2\ \mathrm{km} \\ 0, & \text{otherwise}, \end{cases}$$
    where $R_{d,p}$ and $R_{d,e}$ represent the relative distance-related rewards of the pursuers and the evader, respectively; $r_i$ and $r_e$ are the position vectors of pursuer $i$ and the evader, respectively; and $c_d$ denotes the weight coefficient of the relative distance-related reward term. When the relative distance between the pursuer and the evader exceeds the threshold of 5 km, the evader is considered to have achieved victory: a significant reward is assigned to the evader, while a substantial penalty is imposed on the pursuer to encourage it to close the distance. When the relative distance falls below 2 km, the pursuer is deemed to have partially achieved its objective and is accordingly granted a reward. The thresholds of 2 km and 5 km are determined by the farthest attack range and the field of view of the UAVs, respectively. Under this reward design, pursuers are encouraged to approach the evader while avoiding excessive proximity, thereby maintaining their own safety.
  • Encirclement formation-related reward. To guide pursuers toward forming an encircling configuration around the evader, we employ a cosine-based metric to measure angular uniformity. This geometric formulation penalizes colinear configurations and encourages dispersion around the target. The reward term is defined as follows.
    $$R_{e,p} = -c_e \sum_{i \neq j} \frac{(r_i - r_e) \cdot (r_j - r_e)}{|r_i - r_e|\,|r_j - r_e|},$$
    $$R_{e,e} = c_e \sum_{i \neq j} \frac{(r_i - r_e) \cdot (r_j - r_e)}{|r_i - r_e|\,|r_j - r_e|},$$
    where $R_{e,p}$ and $R_{e,e}$ represent the encirclement formation-related rewards of the pursuers and the evader, respectively, and $c_e$ denotes the weight coefficient of the encirclement formation-related reward term. This formulation promotes pursuers spreading out in opposing directions around the evader, increasing the probability of successful encirclement.
  • Mission success-related reward. A reward is provided for successful task completion. In this study, the pursuers succeed if they form a uniform encirclement around the evader within a certain distance; otherwise, the evader is considered successful. Therefore, task success is determined in conjunction with the encirclement-related rewards defined in (20) and (21). The specific reward for task completion is defined as follows.
    $$R_{m,p} = \begin{cases} 50, & \text{if } (r_i - r_e) \cdot (r_j - r_e) < 0\ \ \forall i \neq j\ \text{ and }\ \max_i |r_i - r_e| < d_{\min} \\ 0, & \text{otherwise}, \end{cases}$$
    $$R_{m,e} = \begin{cases} -50, & \text{if } (r_i - r_e) \cdot (r_j - r_e) < 0\ \ \forall i \neq j\ \text{ and }\ \max_i |r_i - r_e| < d_{\min} \\ 0, & \text{otherwise}, \end{cases}$$
    where $R_{m,p}$ and $R_{m,e}$ represent the mission success-related rewards of the pursuers and the evader, respectively. The criteria for task success include constraints on both the capture formation and the relative distances. Specifically, the task is considered successfully completed if the angles between the direction vectors from the evader to each pursuer are all obtuse and the maximum distance between the evader and any pursuer does not exceed $d_{\min} = 2\ \mathrm{km}$. With this task success reward, the pursuers are guided to form an encirclement near the farthest attack range of the evader, effectively constraining the escape space of the evader while maximizing their own safety.
Taking into account the relative distance between the two parties, the encirclement formation, mission success, and the penalty of collisions, the overall reward function is designed as follows:
$$R_p = R_{d,p} + R_{e,p} + R_{m,p},$$
$$R_e = -R_p = R_{d,e} + R_{e,e} + R_{m,e}.$$
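For illustration, a single-step pursuer reward consistent with the components above could be computed as in the following sketch; the sign conventions and the any/all interpretation of the distance thresholds are assumptions inferred from the surrounding text, and the collision penalty is omitted.

```python
import numpy as np

def pursuer_reward(p_pos, e_pos, c_d=1.0, c_e=10.0, d_min=2.0):
    """Total pursuer reward R_p = R_{d,p} + R_{e,p} + R_{m,p} for one step.
    p_pos: (n_adv, 2) pursuer positions in km; e_pos: (2,) evader position in km."""
    dists = np.linalg.norm(p_pos - e_pos, axis=1)

    # Relative distance term: continuous penalty plus a stepwise bonus/penalty.
    r_d = -c_d * dists.sum()
    r_d += -50.0 if np.any(dists > 5.0) else (10.0 if np.all(dists <= 2.0) else 0.0)

    # Encirclement term: penalize aligned bearings, reward angular dispersion.
    unit = (p_pos - e_pos) / dists[:, None]
    cos_sum = sum(unit[i] @ unit[j]
                  for i in range(len(unit)) for j in range(len(unit)) if i != j)
    r_e = -c_e * cos_sum

    # Mission success term: all pairwise bearings obtuse and all pursuers close.
    success = np.all(dists < d_min) and all(
        unit[i] @ unit[j] < 0.0
        for i in range(len(unit)) for j in range(len(unit)) if i != j)
    r_m = 50.0 if success else 0.0

    return r_d + r_e + r_m
```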

3.3. Framework Design

The proposed algorithm framework is designed to address the PEG involving multiple fixed-wing UAVs through a self-play training scheme, in which both the pursuers and the evader simultaneously learn their strategies. The algorithm integrates automatic curriculum learning and MARL to enhance learning efficiency and robustness in complex, adversarial environments. A framework of the proposed algorithm is shown in Figure 3.
In the training stage, both the pursuers and the evaders are trained using a curriculum-based MARL framework. Through self-play, each side continuously adapts its strategy in response to the behavior of the opponents, encouraging the emergence of robust and dynamic policies that can generalize to strong adversaries. To improve sample efficiency and training stability, a subgame curriculum learning mechanism is employed, based on a subgame sampling metric and a subgame sampler. Specifically, the agents evaluate the training demand of different subgame initial states using the subgame sampling metric, which quantifies the learning value of each state. Based on this evaluation, a subgame sampler autonomously determines the learning sequence, thereby enabling self-organized curriculum scheduling that guides the training process from simpler to more complex tasks. Finally, the policy training is conducted using the MAPPO algorithm under the CTDE framework, which allows the system to leverage centralized information for learning while maintaining decentralized decision-making during execution.

3.4. Subgame Curriculum Learning

As defined earlier, a subgame MG ( s ) refers to a Markov game MG that begins from a fixed state s. In the training of MARL, the underlying environment MG typically has a fixed initial state distribution ρ , from which each trajectory can be viewed as a subgame starting from a state sampled from ρ . However, not all subgames have the same training demand. For some subgames, the policies of agents may already be close to a Nash equilibrium, and further training yields limited improvement. In contrast, other subgames remain under-trained and exhibit a significant gap from equilibrium strategies; training in these regions can lead to substantial policy improvement. Relying solely on sampling from a fixed initial distribution may require numerous samples before encountering subgames that are critical for improving the policy, leading to poor sample efficiency. In contrast, prioritizing subgames that better improve the current policy can significantly boost sampling efficiency. To this end, a subgame curriculum learning method is proposed. This approach evaluates and records the training demand of visited states, and uses this information to guide the sampling of subgames during training, thereby allocating more learning effort to the most impactful regions of the state space.

3.4.1. Subgame Sampling Metric

To properly determine the learning sequence of subgames, it is crucial to evaluate their importance and prioritize them accordingly. As previously discussed, the training demand of a subgame is positively correlated with the gap between the current policy and the Nash equilibrium policy in that subgame. In zero-sum games, this gap is commonly measured by exploitability [32], which is defined as follows.
$$\delta_i(\pi) = \hat{r}_i(\pi_i^*, \pi_{-i}) - \hat{r}_i(\pi),$$
where $\delta_i(\pi)$ represents the exploitability of agent $i$ under the joint policy $\pi$, $\hat{r}_i$ denotes the expected return of agent $i$, and $\pi_i^*$ denotes the best response to the opponents' policy $\pi_{-i}$. This metric reflects the performance loss of agent $i$ when using the current strategy instead of the best response against the given opponents.
For a given state s, the expected return of agent i is represented by the value function V i π i ( s ) . Therefore, the exploitability at state s can be reformulated as follows:
$$\delta_i(\pi, s) = V_i^{(\pi_i^*, \pi_{-i})}(s) - V_i^{\pi}(s) = V_i^*(s) - V_i^{\pi}(s).$$
Accordingly, the training demand of a given state s is quantified as the average of the squared exploitability values across all agents, that is,
$$D(s) = \frac{1}{2} \sum_{i=1}^{2} \left(V_i^*(s) - V_i(s)\right)^2.$$
Then, the probability weight of state s is defined as follows.
$$w(s) = D(s) = \frac{1}{2} \sum_{i=1}^{2} \left(V_i^*(s) - V_i(s)\right)^2 = \mathbb{E}_i^2\left[V_1^*(s) - \tilde{V}_i(s)\right] + \mathrm{Var}_i\left[V_1^*(s) - \tilde{V}_i(s)\right],$$
where $\mathbb{E}_i$ denotes the expectation with respect to the index $i$, which in this context is the average over $i$, and $\mathrm{Var}_i$ denotes the variance with respect to $i$. $\tilde{V}_i(s) = (-1)^{i-1} V_i(s)$ is a standardized notation based on the zero-sum nature of the established PE game scenario, i.e., $V_1^*(s) = -V_2^*(s)$, which unifies the indices in a consistent format for easier interpretation and representation.
Since the NE value of a given state is a constant, it does not affect the variance, that is, $\mathrm{Var}_i\left[V_1^*(s) - \tilde{V}_i(s)\right] = \mathrm{Var}_i\left[\tilde{V}_i(s)\right]$. Additionally, the NE value of a state is generally difficult to obtain, which makes Equation (29) hard to evaluate. Considering that the policy training process tends to fluctuate less near the optimal value and more when there is a larger deviation from it, the mean squared change in value between two consecutive policy optimizations can be used to approximate the mean squared error between the value and the NE value. Accordingly, Equation (29) is replaced by the following form.
$$\tilde{w}(s) = \alpha\, \mathbb{E}_i^2\left[\tilde{V}_i^{\pi_i^t}(s) - \tilde{V}_i^{\pi_i^{t-1}}(s)\right] + \mathrm{Var}_i\left[\tilde{V}_i(s)\right],$$
where $\alpha$ is a hyperparameter controlling the relative importance of the two terms in Equation (30), and $\pi_i^t$ and $\pi_i^{t-1}$ denote the current policy and the policy before the last update, respectively.
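A minimal sketch of the approximate weight in Equation (30) for the two-player zero-sum case is given below; the array layout and the default value of α are assumptions.

```python
import numpy as np

def subgame_weight(v_cur, v_prev, alpha=1.0):
    """Approximate sampling weight w~(s) of Eq. (30), using the zero-sum
    convention V~_i = (-1)^(i-1) V_i.
    v_cur, v_prev: length-2 arrays of the two agents' value estimates at state s
    under the current policy and the policy before the last update."""
    sign = np.array([1.0, -1.0])
    v_tilde_cur = sign * np.asarray(v_cur, dtype=float)
    v_tilde_prev = sign * np.asarray(v_prev, dtype=float)
    mean_sq_change = np.mean(v_tilde_cur - v_tilde_prev) ** 2   # E_i^2[...] term
    variance = np.var(v_tilde_cur)                              # Var_i[V~_i(s)] term
    return alpha * mean_sq_change + variance
```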

3.4.2. Subgame Sampler

After calculating the weight of each state, initial-state sampling is performed based on these weights, which determines the sequence of subsequent subgames. However, in the multi-UAV PE game scenario, the continuity of the state and action spaces leads to an infinite number of possible subgames, so it is impractical to directly compute and sample weights over the whole state space. Instead, a state buffer $\mathcal{B}$ with maximum length $K$ is established to store the visited states together with their weights during policy training. In this setup, the effectiveness of curriculum selection largely depends on how well the state buffer represents the state space. The states in the buffer should ideally cover the entire state space and maintain a uniform distribution, which means that the stored states should be as far apart from each other as possible. Thus, the objective of the established state buffer can be expressed as follows.
$$\max_{\mathcal{B} \subset \mathcal{S}}\ \sum_{s \in \mathcal{B}}\ \min_{s' \in \mathcal{B},\, s' \neq s} \| s - s' \|,$$
where $\mathcal{S}$ is the state space, and the distance between states is measured by the Euclidean distance.
Due to the continuous nature of UAV motion, the states collected during a single run in the environment tend to be concentrated and similar to each other. To address this issue, farthest-point sampling (FPS) [33] is used when the state buffer overflows, retaining the states with greater differences in the buffer, thereby improving the performance of the state buffer in representing the overall state space. As training progresses and more states are visited, the states in the buffer will gradually converge to the ideal distribution that covers the whole state space as broadly and uniformly as possible.
Then, the sampling probability for a certain state can be given as follows.
$$P(s) = \frac{\tilde{w}(s)}{\sum_{s' \in \mathcal{B}} \tilde{w}(s')}.$$
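The buffer maintenance and sampling steps described above might be implemented as follows; the greedy FPS variant and the choice of seeding with the heaviest state are assumptions, and the sketch assumes the target size does not exceed the number of stored states.

```python
import numpy as np

def farthest_point_sampling(states, weights, k):
    """Keep k stored states that are maximally spread out (greedy FPS),
    together with their weights. states: (n, d) array; weights: (n,) array."""
    selected = [int(np.argmax(weights))]          # seed with the heaviest state
    dist = np.linalg.norm(states - states[selected[0]], axis=1)
    while len(selected) < k:
        idx = int(np.argmax(dist))                # farthest from the current set
        selected.append(idx)
        dist = np.minimum(dist, np.linalg.norm(states - states[idx], axis=1))
    return states[selected], weights[selected]

def sampling_probabilities(weights):
    """Normalize subgame weights into the sampling distribution of Eq. (32)."""
    w = np.asarray(weights, dtype=float)
    return w / w.sum()
```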
Combining the subgame sampling metric and the subgame sampler, the subgame curriculum learning framework is summarized in Algorithm 1.
It is worth noting that when the policies trained via subgame curriculum learning converge, the resulting policies correspond to the NE of all the subgames involved in training. In Algorithm 1, subgame sampling does not depend solely on the constructed state buffer; instead, initial states are sampled from the state buffer $\mathcal{B}$ with probability $p$ and directly from the original initial state distribution $\rho$ with probability $1 - p$. As a result, the training process still covers all subgames initiated from initial states $s_0$ with $\rho(s_0) > 0$. Therefore, the NE obtained from training over these subgames remains an NE of the original Markov game. In this way, the proposed subgame curriculum learning strategy preserves the convergence property of the original MARL algorithm.
Algorithm 1 Subgame Curriculum Learning
Require: State buffer $\mathcal{B}$ with capacity $K$, probability $p$ of sampling the initial state from the state buffer, experience replay buffer $\mathcal{D}$, and Markov game environment $\mathrm{MG}$.
1: Initialize the sampling probability $P(s)$ and the policy $\pi_i$ and value function $V_i$ for each player $i = 1, 2$;
2: repeat
3:   $V_i' \leftarrow V_i$, $i = 1, 2$;
4:   for each parallel environment do
5:     Sample $s_0 \sim P(s)$ (initial state from the state buffer) with probability $p$; otherwise, sample $s_0 \sim \rho(\cdot)$ (initial state from the original distribution);
6:     Roll out in $\mathrm{MG}(s_0)$ and collect samples into $\mathcal{D}$;
7:   end for
8:   Train $\{\pi_i, V_i\}_{i=1}^{2}$ via MAPPO;
9:   $\tilde{w}_t \leftarrow \alpha\, \mathbb{E}_i^2\big[\tilde{V}_i^{\pi_i^t}(s) - \tilde{V}_i^{\pi_i^{t-1}}(s)\big] + \mathrm{Var}_i\big[\tilde{V}_i(s)\big]$, $t = 0, \ldots, T$;
10:  $\mathcal{B} \leftarrow \mathcal{B} \cup \{(s_t, \tilde{w}_t)\}_{t=0}^{T}$;
11:  if $|\mathcal{B}| > K$ then
12:    $\mathcal{B} \leftarrow \mathrm{FPS}(\mathcal{B}, K)$;
13:  end if
14:  Update the sampling probability $P(s)$ with the new weights $\tilde{w}_t$ and the new state buffer $\mathcal{B}$;
15: until $(\pi_1, \pi_2)$ converges.

3.5. Multi-Agent Reinforcement Learning

Building on the foundation of subgame curriculum learning, the multi-agent proximal policy optimization (MAPPO) algorithm is used for multi-agent policy learning. MAPPO is an extended version of the PPO algorithm within the CTDE framework. The core idea of MAPPO is to jointly optimize the learning process of multiple agents by using a centralized training approach while allowing for decentralized execution.
As a multi-agent decision-making framework that lies between centralized and decentralized decision-making, the CTDE framework allows agents to share global information during training to simplify the training process, while maintaining decentralized and independent decision-making during execution. PPO is based on the actor–critic (AC) framework, requiring both a critic network to estimate values and an actor network to make decisions. Since the critic network is not needed during decision-making, in the training phase the critic network takes the joint observation $o^t = \{o_1^t, \ldots, o_N^t\}$ and joint action $a^t = \{a_1^t, \ldots, a_N^t\}$ as inputs. The actor network, on the other hand, is built on local observations to preserve the distributed nature of execution. In this setup, the value optimization objective $L(\phi)$ is defined as follows.
$$L(\phi) = \frac{1}{BN} \sum_{t=1}^{B} \sum_{i=1}^{N} \max\left[\left(V_\phi(s_t) - \hat{R}_t\right)^2,\ \left(\mathrm{clip}\left(V_\phi(s_t),\, V_{\phi_{\mathrm{old}}}(s_t) - \varepsilon,\, V_{\phi_{\mathrm{old}}}(s_t) + \varepsilon\right) - \hat{R}_t\right)^2\right],$$
where $B$ and $N$ denote the batch size and the number of agents, respectively; $\hat{R}_t$ is the discounted sum of rewards after step $t$, i.e., the return target at state $s_t$; and $\mathrm{clip}(V_\phi(s_t), V_{\phi_{\mathrm{old}}}(s_t) - \varepsilon, V_{\phi_{\mathrm{old}}}(s_t) + \varepsilon)$ constrains the value of $V_\phi(s_t)$ to the range $[V_{\phi_{\mathrm{old}}}(s_t) - \varepsilon, V_{\phi_{\mathrm{old}}}(s_t) + \varepsilon]$, with $\varepsilon$ being the hyperparameter that controls the update magnitude of the parameter optimization.
The objective of the actor network $L(\theta)$ is defined as follows.
$$L(\theta) = \frac{1}{BN} \sum_{t=1}^{B} \sum_{i=1}^{N} \min\left[r_{\theta,i}^t A_i^t,\ \mathrm{clip}\left(r_{\theta,i}^t,\, 1 - \epsilon,\, 1 + \epsilon\right) A_i^t\right] + \sigma\, \frac{1}{BN} \sum_{t=1}^{B} \sum_{i=1}^{N} S\left(\pi_\theta(o_i^t)\right),$$
where $r_{\theta,i}^t = \dfrac{\pi_\theta(a_i^t \mid o_i^t)}{\pi_{\theta_{\mathrm{old}}}(a_i^t \mid o_i^t)}$ is the importance ratio, $S$ is the policy entropy, $\sigma$ is the entropy coefficient hyperparameter, and $A_i^t$ is the advantage function. The generalized advantage estimation (GAE) is applied to estimate the advantage function in the following form.
$$\hat{A}_i^t = \sum_{t'=t}^{n-1} (\gamma \lambda)^{t'-t}\left(R_i^{t'} + \gamma V_\phi(s_i^{t'+1}) - V_\phi(s_i^{t'})\right),$$
where $\lambda$ is the tunable hyperparameter of the GAE.
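The value loss of Equation (33) and the GAE recursion of Equation (35) can be sketched as follows; the clipped-value formulation shown here is algebraically equivalent to clipping V around the old value estimate, and the default γ, λ, and ε values are assumptions.

```python
import torch

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation (Eq. (35)) for one agent's trajectory.
    rewards: (T,); values: (T+1,), including a bootstrap value for the last state."""
    T = rewards.shape[0]
    adv = torch.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        last = delta + gamma * lam * last                       # discounted accumulation
        adv[t] = last
    return adv

def critic_loss(v_new, v_old, returns, eps=0.2):
    """Clipped value loss of Eq. (33): the worse of the clipped and unclipped errors."""
    v_clipped = v_old + torch.clamp(v_new - v_old, -eps, eps)
    return torch.max((v_new - returns) ** 2, (v_clipped - returns) ** 2).mean()
```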
Algorithm 2 summarizes the training framework. First, the actor and critic networks are initialized, and an empty experience pool D is created. The iterative training process then begins. In each iteration, each agent generates actions based on its local observations and interacts with the environment. The environment transitions to the next state and generates a reward R. These data are stored in the experience pool D , and the advantage function is estimated using Equation (35). Once sufficient data have been collected, the parameters are updated, and the experience pool is cleared before repeating the process.
Algorithm 2 MAPPO
Require: Markov game environment $\mathrm{MG}$, empty experience replay buffer $\mathcal{D}$, and learning rates $\eta_\theta$, $\eta_\phi$ for the actor and critic networks.
1: Initialize $\theta$, the parameters of the actor $\pi$, and $\phi$, the parameters of the critic $V$, by orthogonal initialization;
2: while $step \le step_{\max}$ do
3:   $\mathcal{D} \leftarrow \{\}$;
4:   for each iteration do
5:     $\tau \leftarrow [\,]$;
6:     Reset the environment and obtain the initial observation $o^t$;
7:     for each environment step $t$ do
8:       for each agent $i$ do
9:         $a_i^t \leftarrow \pi_{\theta,i}(o_i^t)$;
10:      end for
11:      $V^t \leftarrow V_\phi(o^t)$;
12:      Execute the joint action $a^t$; obtain the rewards $R^t$, state $s^{t+1}$, and observation $o^{t+1}$;
13:      $\tau \leftarrow \tau + [s^t, o^t, a^t, R^t, s^{t+1}, o^{t+1}]$;
14:    end for
15:    Compute the advantage estimate $\hat{A}$ via Equation (35) on the trajectory $\tau$;
16:    Compute the reward-to-go $\hat{R}$ on $\tau$ and normalize it;
17:    $\mathcal{D} \leftarrow \mathcal{D} \cup \{\tau, \hat{A}, \hat{R}\}$;
18:  end for
19:  for each PPO epoch do
20:    $\phi \leftarrow \phi - \eta_\phi \nabla_\phi L(\phi)$ (update the critic network by Equation (33));
21:    $\theta \leftarrow \theta + \eta_\theta \nabla_\theta L(\theta)$ (update the actor network by Equation (34));
22:  end for
23: end while

4. Results

In this section, the multiple fixed-wing UAV PE environment is implemented based on the Gym framework to validate the effectiveness of the proposed method.

4.1. Simulation Setup

4.1.1. Environmental Parameters

In this work, a multi-UAV PE training environment was developed based on the OpenAI Gym interface, and the pursuing and evading strategies were trained using PyTorch (Python 3.9). Specifically, the simulation scenario consists of three pursuer UAVs (UAV0, UAV1, and UAV2) and one evader UAV (UAV3). As illustrated in Figure 4, the evader is randomly initialized within the light blue region of the environment, i.e., $\{(x, y) \mid -3 \le x \le -1,\ -3 \le y \le -1\}$, while the pursuers are randomly initialized within the light red region, i.e., $\{(x, y) \mid 1 \le x \le 3,\ 1 \le y \le 3\}$. The initial yaw and roll angles of all the UAVs are set to zero, i.e., $\psi_0 = \varphi_0 = 0$. The initial speed of each UAV is set to its minimum speed, i.e., $v_0 = v_{\min}$. The maximum number of environment steps is limited to 200, with a time interval of 0.5 s between consecutive steps. To enhance the accuracy of the discretized model, each environment step is further divided into five computational sub-steps to perform kinematic calculations, thereby approximating continuous-time integration. For the PE environment, the objective of the pursuers is to collaboratively encircle the evader within the maximum number of steps, maintaining a close and uniform formation around the evader to prevent escape. Meanwhile, the evader aims to maximize its distance from the pursuers and avoid being encircled. To ensure that the reward for task success is comparable in magnitude to the penalty for task failure, and that the reward for achieving proximity is similar to that for forming a uniform angular distribution around the evader, the reward weights are set as $c_d = 1$ and $c_e = 10$.
To ensure a meaningful and solvable pursuit–evasion interaction, the evader is designed to have slightly inferior maneuverability compared to the pursuers. This design choice addresses a key issue arising from the absence of environmental boundaries: if the evader has equal or superior maximum speed, it could trivially evade capture by continuously moving away from the pursuers, rendering the game unsolvable and uninformative. For example, in the initial configuration shown in Figure 4, the evader is initialized at the bottom-left corner of the map, while the pursuers are placed in the top-right corner. That is, all pursuers are initially located at a relative position to the upper right of the evader. In this setting, the evader’s optimal escape strategy is to move steadily toward the bottom left, maintaining maximum distance from the pursuers. Since there is a non-zero initial distance between the two sides, if the evader has equal or greater speed than the pursuers, this distance would remain constant or even increase over time, making it impossible for the pursuers to ever approach, let alone encircle or capture the evader. By limiting the evader’s speed, we encourage strategic behavior from both sides and facilitate effective training of pursuit policies.
Accordingly, the maximum speeds of the pursuer UAVs and the evader UAV are set to $v_{\max,p} = 400\ \mathrm{m/s}$ and $v_{\max,e} = 300\ \mathrm{m/s}$, respectively, and the minimum speeds are $v_{\min,p} = v_{\min,e} = 100\ \mathrm{m/s}$. For the control space, the tangential overload is set to $n_x \in [-1, 2]$ and the roll angle to $\varphi \in [-\arccos(1/8), \arccos(1/8)]$.
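For reference, the environment parameters listed above can be gathered into a single configuration object, as in the illustrative sketch below; the field names are hypothetical, and the signs of the initialization regions and control bounds follow the reconstruction given in the text.

```python
import math

# Illustrative configuration collecting the scenario parameters above;
# field names are hypothetical and not taken from the original implementation.
ENV_CONFIG = {
    "n_pursuers": 3,
    "n_evaders": 1,
    "evader_init_region_km": {"x": (-3.0, -1.0), "y": (-3.0, -1.0)},
    "pursuer_init_region_km": {"x": (1.0, 3.0), "y": (1.0, 3.0)},
    "initial_yaw_rad": 0.0,
    "initial_roll_rad": 0.0,
    "initial_speed": "v_min",            # each UAV starts at its minimum speed
    "max_steps": 200,
    "step_dt_s": 0.5,
    "integration_substeps": 5,
    "v_max_pursuer_mps": 400.0,
    "v_max_evader_mps": 300.0,
    "v_min_mps": 100.0,
    "n_x_range": (-1.0, 2.0),            # tangential overload bounds (assumed signs)
    "roll_range_rad": (-math.acos(1.0 / 8.0), math.acos(1.0 / 8.0)),
    "reward_weights": {"c_d": 1.0, "c_e": 10.0},
}
```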

4.1.2. Hyperparameters of Algorithm

The hyperparameters involved in the proposed algorithm are summarized in Table 1, including those related to MAPPO [34] as well as the subgame curriculum learning method proposed in this paper. The hyperparameters related to MAPPO were set based on those used in similar encirclement tasks [28]. Only minor adjustments were made to the number of parallel threads and PPO epochs according to hardware constraints. Additionally, the hyperparameters related to the subgame curriculum learning were determined empirically through multiple trial experiments.
It is worth noting that, due to the high-dimensional nature of the observation space where each agent perceives not only its own state but also the relative states of all other agents, this study employs a Transformer-based architecture for constructing both the actor and critic networks. Specifically, the input observations are first processed through a LayerNorm layer, and then the normalized observations are partitioned into different agents, including the agent itself, collaborators, and opponents. Each agent is then passed through a fully connected layer with 32 units to obtain its embedding, with the weights shared among entities of the same type. Subsequently, the embedding of each agent is concatenated with its corresponding observation and processed by a self-attention network. The outputs of the attention module are normalized and concatenated with the self-embedding to obtain the final representations. These representations are then passed through two fully connected hidden layers, each with 64 units, followed by separate network heads: one for producing the value estimation (critic) and another for generating the action distribution (actor).
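A simplified PyTorch sketch of this attention-based actor–critic is given below; the 32-unit embeddings, 64-unit hidden layers, and LayerNorm placement follow the description above, while the use of nn.MultiheadAttention, mean pooling over entities, and a single shared entity-embedding layer are simplifying assumptions.

```python
import torch
import torch.nn as nn

class AttentionActorCritic(nn.Module):
    """Sketch of the attention-based actor-critic: LayerNorm on the observation,
    per-entity embeddings, self-attention over entities, and separate heads."""

    def __init__(self, own_dim, entity_dim, action_dim,
                 embed_dim=32, hidden_dim=64):
        super().__init__()
        self.norm_own = nn.LayerNorm(own_dim)
        self.norm_entity = nn.LayerNorm(entity_dim)
        self.own_embed = nn.Linear(own_dim, embed_dim)
        self.entity_embed = nn.Linear(entity_dim, embed_dim)   # weights shared across entities
        self.attn = nn.MultiheadAttention(embed_dim, num_heads=1, batch_first=True)
        self.norm_attn = nn.LayerNorm(embed_dim)
        self.trunk = nn.Sequential(
            nn.Linear(2 * embed_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU())
        self.actor_head = nn.Linear(hidden_dim, action_dim)    # action-distribution parameters
        self.critic_head = nn.Linear(hidden_dim, 1)            # state-value estimate

    def forward(self, own_obs, entity_obs):
        """own_obs: (B, own_dim); entity_obs: (B, n_entities, entity_dim)."""
        own_e = self.own_embed(self.norm_own(own_obs))              # (B, E)
        ent_e = self.entity_embed(self.norm_entity(entity_obs))     # (B, N, E)
        tokens = torch.cat([own_e.unsqueeze(1), ent_e], dim=1)      # self + other entities
        attn_out, _ = self.attn(tokens, tokens, tokens)             # self-attention
        pooled = self.norm_attn(attn_out.mean(dim=1))               # aggregate entity features
        feat = self.trunk(torch.cat([pooled, own_e], dim=-1))       # concat with self-embedding
        return self.actor_head(feat), self.critic_head(feat)
```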
For subgame curriculum learning, a state buffer $\mathcal{B}$ is established to store visited states and to sample from these states for subgame training. Since the environment developed in this study does not impose any spatial boundaries on the UAVs, the corresponding state space is theoretically infinite, and constructing a buffer that covers the entire state space would thus be impractical. Instead, a localized state buffer is built around the initial state region, specifically including only those states whose positions lie within the range $\{(x, y) \mid -3 \le x \le 3,\ -3 \le y \le 3\}$. A weighting mechanism is applied to these stored states to guide subgame sampling. Compared to global state-space sampling, this localized approach focuses subgame curriculum learning around the initial phase of the episode. This not only maintains sufficient curriculum learning performance but also significantly improves computational efficiency and the representational capacity of the state buffer under a fixed capacity.

4.2. Results and Discussion

Simulation validation was conducted based on the environment and algorithm settings described above. Under the subgame curriculum framework, training was performed for a total of 250 million environment steps. The obtained results are presented as follows. Figure 5 shows the single-step average rewards obtained by both the pursuers and the evader during the training process. Figure 6 illustrates the evolution of the encirclement success rate of the pursuers over the course of training.
As shown in Figure 5, after training for 250 million environment steps, the single-step average reward of the pursuers approaches 0. Since the pursuers receive penalties when they are far from the evader and the initial distance between the pursuers and the evader in testing environments is set to be greater than the penalty threshold distance, the pursuers initially accumulate substantial penalties. The single-step average reward is calculated by averaging the total rewards obtained throughout the episode over the total number of environment steps. Therefore, when the single-step average reward approaches zero, it indicates that the pursuers consistently achieve successful captures in the later stages of the task, with the rewards obtained compensating for the penalties incurred earlier. Due to the self-play training framework employed between the pursuers and the evader, the reward curves are not strictly monotonically increasing or decreasing. Overall, the evolution of the rewards can be divided into three stages: 0–80M, 80M–160M, and 160M–250M steps. In the first stage, the pursuers learn to approach the evader in order to reduce penalties and gain rewards, resulting in a rapid increase in their average reward. In the second stage, although the pursuers develop effective approaching policies, learning the more complex coordinated encirclement policies proves to be challenging, whereas the training of the evader's policy is comparatively easier. As a result, the rewards of the pursuers slightly decline. In the third stage, the pursuers gradually master effective encirclement policies, preventing the evader from breaking out of the encirclement and thereby earning higher rewards. As a result, the average reward of the pursuers rises again. However, due to the improved evading policy at this stage, the growth rate of the reward is relatively moderate.
As shown in Figure 6, the trend of the encirclement success rate curve corresponds closely to the reward curve and can also be approximately divided into the three aforementioned stages. In the first stage, due to the limited decision-making capability of the evader, the capture success rate rises rapidly. However, the success rate only increases to about 60 % , indicating that the pursuers have not yet learned effective encirclement policies; the rapid rise in rewards is largely attributed to the achievement of secondary objectives (reducing the distance to the evader). In the second stage, as the decision-making ability of the evader improves, the encirclement success rate declines. In the third stage, the pursuers progressively master effective encirclement policies, resulting in a significant increase in the encirclement success rate to over 80 % .
To validate the effectiveness of the subgame curriculum learning method, the initial positions of the evader in the sampled subgames were recorded during the training process, as shown in Figure 7. In the figure, both the horizontal and vertical axes are in units of kilometers (km). As illustrated in Figure 7, as training progresses, the initial positions of the evader gradually shift from an initial uniform distribution to being concentrated near the edges of the state buffer and eventually toward the lower-left corner. This evolution indicates that, during the training process, the initial positions of the evaders gradually move farther away from the pursuers, meaning that the selection of subgames progressively shifts from simpler scenarios to more challenging ones.
To better demonstrate the effectiveness of the pursuers in capturing the evader, the trained model was tested, and the UAV trajectories at various stages of task execution are illustrated in Figure 8. In the figure, both the horizontal and vertical axes are in units of km. The starting positions of the agents are indicated by dots, while the final positions are marked by crosses. As shown, at the beginning of the task, the average initial distance between the pursuers and the evader exceeds 5 km . Since both the pursuers and the evader initially move in the positive x-axis direction, the pursuers experience significant penalties during the early phase of the task. This also explains why the final per-step average reward of the pursuers remains around 0. During the first 65 steps, the evader adjusts its heading to move away from the pursuers, while the pursuers modify their trajectories to close the gap with the evader. By step 130, it can be observed that the pursuers have successfully encircled the evader. Due to the relatively even distribution of the pursuers around the evader, it becomes difficult for the evader to break through the encirclement. Constrained by its minimum speed and turning radius, the evader is forced into circular motion to maintain its current reward and avoid further penalties. In the later stages of the task, the pursuers consistently maintain the encirclement until the task concludes. From the above process, it can be observed that the pursuers successfully carried out an effective encirclement of the evader, forming a uniform surrounding formation at a close distance and effectively preventing the evader from escaping further.
Figure 9 compares the proposed algorithm with two baseline algorithms, MAPPO [34] and PPO [31], in terms of the pursuers' reward curves and encirclement success rates after the same number of training steps. In the figure, the proposed method is abbreviated as SCL-MAPPO. Due to the zero-sum nature of the game, the reward curve of the evader is simply the pursuers' reward curve mirrored across the x-axis; for clarity, only the pursuers' rewards are shown. The hyperparameter settings for MAPPO and PPO are consistent with those listed in Table 1. As shown in Figure 9a, after 250 million training steps, the training curves of MAPPO and PPO can each be roughly divided into two phases: 0–170 M and 170 M–250 M for MAPPO, and 0–210 M and 210 M–250 M for PPO. These phases approximately correspond to the first two training stages of the proposed method: 0–80 M, where the pursuers' rewards increase, and 80 M–160 M, where the rewards decrease slightly. Figure 9b presents the encirclement success rates throughout training. The proposed method consistently outperforms PPO and only slightly underperforms MAPPO during a brief period in the middle of training. By the end of training, SCL-MAPPO achieves significantly higher success rates than both baselines. Taken together, these results demonstrate that the proposed subgame curriculum learning framework effectively improves both sampling efficiency and convergence speed.
To evaluate the robustness of the proposed algorithm, we tested its performance under different random seeds. A total of 1000 trials were conducted, and the reward distribution statistics are presented in Figure 10. In the figure, the blue box represents the interquartile range, capturing the middle 50% of the data; the red line inside the box indicates the median, the green arrow marks the mean, and the hollow circles denote outliers that fall outside the typical data range. As shown in the figure, the interquartile range lies between −10 and 10 with few outliers, indicating a dense reward distribution. This suggests that the proposed algorithm maintains relatively stable policy performance across different random seeds, with a high median reward, demonstrating good generalization and robustness. In addition, the algorithm achieved a capture success rate of 94.1% over the 1000 trials, providing further evidence of its effectiveness.
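The seed-robustness study can be reproduced in outline as follows; `evaluate_episode(seed)` is a hypothetical helper that rolls out the trained policy once under a given seed and returns the episode reward, and the plot is only an approximation of the style of Figure 10:

```python
import numpy as np
import matplotlib.pyplot as plt

def seed_robustness_plot(evaluate_episode, num_trials=1000):
    """Run trials under different random seeds and summarize the rewards."""
    rewards = np.array([evaluate_episode(seed) for seed in range(num_trials)])
    fig, ax = plt.subplots()
    # Box = interquartile range, line = median, marker = mean, circles = outliers.
    ax.boxplot(rewards, showmeans=True)
    ax.set_ylabel("Episode reward")
    print(f"median = {np.median(rewards):.2f}, mean = {rewards.mean():.2f}")
    plt.show()
    return rewards
```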
Considering that the constructed model involves several simplifications, we introduce Gaussian noise to compensate for unmodeled dynamics and external disturbances encountered during UAV flight, thereby reducing the gap between simulation and real-world scenarios. Noise [η_x, η_y, η_v, η_ψ]^T is injected independently into the state variables s = [x, y, v, ψ]^T of the fixed-wing UAVs. All noise terms are sampled from zero-mean Gaussian distributions with standard deviations σ_x = σ_y = σ_v = 0.02 km and σ_ψ = 0.2 rad, obtained by proportionally scaling the uncertainty parameters used in [26]. We retrained the algorithm in this uncertainty-augmented environment; the resulting average rewards and encirclement success rates are shown in Figure 11. During the early stages of training, the reward curves and encirclement success rates of agents trained in the deterministic and uncertain environments are quite similar, indicating comparable performance. In the later stages of training, however, agents trained under uncertainty exhibit faster reward growth and higher capture success rates. This improvement is possibly due to the uncertainty disrupting the local "stalemate" between the pursuers and the evader, enabling the agents to escape local optima and discover more effective strategies. Overall, the introduction of uncertainty does not degrade the performance of the algorithm; on the contrary, it leads to noticeable improvements. This further demonstrates the robustness of the proposed method and provides a preliminary step toward sim-to-real transfer.
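The noise model is straightforward to reproduce. The sketch below injects independent zero-mean Gaussian noise with the stated standard deviations into the state vector s = [x, y, v, ψ]; the heading-wrapping step is our own assumption and is not stated in the paper:

```python
import numpy as np

# Standard deviations from the paper: 0.02 km for x, y, v and 0.2 rad for psi.
NOISE_STD = np.array([0.02, 0.02, 0.02, 0.2])

def add_state_noise(state, rng=None):
    """Inject independent zero-mean Gaussian noise into s = [x, y, v, psi]."""
    rng = rng if rng is not None else np.random.default_rng()
    noisy = np.asarray(state, dtype=float) + rng.normal(0.0, NOISE_STD)
    # Wrapping the heading to [-pi, pi) is an assumption for this sketch.
    noisy[3] = (noisy[3] + np.pi) % (2 * np.pi) - np.pi
    return noisy
```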
Furthermore, regarding the generalizability of the proposed method, we believe it demonstrates a certain degree of robustness. The reasons are as follows: First, the training environment is configured with challenging initial conditions, where all pursuers are initialized on the same side of the evader and at a disadvantageous distance. Learning under such difficult scenarios encourages the development of robust coordination strategies. Second, both pursuers and the evader rely solely on relative observations, including the positions and headings of other agents relative to themselves. This design ensures that the learned policy is invariant to global position shifts or rotations, allowing it to generalize naturally to environments with different initial placements that preserve the underlying task structure. Additionally, we find that the learned policy transfers effectively to environments with different numbers of agents in multi-pursuer single-evader settings. Since the core objective—achieving coordinated encirclement—remains unchanged, the trained agents can adapt to varying team sizes with minimal performance degradation. However, in multi-evader scenarios, the nature of the task changes significantly, requiring more complex decision-making such as target assignment and dynamic regrouping, which are not addressed by the current model. Extending the framework to handle such cases constitutes an important direction for future research.
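To make the invariance argument concrete, the sketch below computes a relative observation of one UAV from the perspective of another: the displacement is rotated into the observer's body frame and the heading difference is wrapped, so the resulting features are unchanged by a common translation or rotation of all agents. The exact observation vector used in the paper may contain additional terms:

```python
import numpy as np

def relative_observation(own, other):
    """Relative position and heading of `other` in the body frame of `own`.

    Both states follow the layout [x, y, v, psi].  The returned features are
    invariant to a common translation and rotation applied to both UAVs.
    """
    dx, dy = other[0] - own[0], other[1] - own[1]
    c, s = np.cos(-own[3]), np.sin(-own[3])
    # Rotate the displacement into the observer's body frame.
    rel_x = c * dx - s * dy
    rel_y = s * dx + c * dy
    # Wrapped heading difference in (-pi, pi].
    rel_heading = np.arctan2(np.sin(other[3] - own[3]), np.cos(other[3] - own[3]))
    return np.array([rel_x, rel_y, rel_heading, other[2]])
```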

5. Conclusions

This paper focuses on the problem of distributed autonomous decision-making for fixed-wing UAVs in PEGs based on multi-agent reinforcement learning. A Markov PEG model is established that accounts for the dynamics and relevant constraints of fixed-wing UAVs. Building on a subgame curriculum learning mechanism and a self-play framework, systematic training and validation are conducted using the MAPPO algorithm under the CTDE paradigm. Simulation results demonstrate that the proposed reward structure effectively guides the pursuers to learn efficient strategies within the designed pursuit–evasion scenarios, enabling them to achieve a high capture success rate and prevent the evader from escaping further. In addition, the introduced subgame curriculum learning mechanism further facilitates the training process by progressively guiding the agents from simple to more complex tasks. Finally, testing and analysis confirm the control effectiveness of the trained model in practical pursuit–evasion missions, highlighting the feasibility and potential of the proposed method.
While the proposed framework demonstrates promising performance in simulated 2D pursuit–evasion scenarios with fixed-wing UAVs, several limitations remain in the current work. First, all experiments are conducted in a two-dimensional simulation environment, which may not fully reflect the dynamics and challenges of real-world three-dimensional UAV deployments. Second, this study is restricted to many-to-one pursuit scenarios under zero-sum game assumptions. This limits the generality of the proposed approach, especially in scenarios requiring target assignment strategies or cooperative behaviors among agents. Additionally, practical issues such as sensing noise, communication constraints, and real-time control delays are not considered and may affect real-world deployment.
To address the above limitations, potential future research directions include the following: (1) extending the framework to three-dimensional environments to better reflect realistic UAV motion and control challenges; (2) adapting the method to handle multi-target cooperative pursuit tasks, which will require integrating target allocation and coordination mechanisms; (3) exploring non-zero-sum settings to model more general interaction dynamics beyond strictly competitive games; and (4) investigating practical deployment challenges by incorporating hardware-in-the-loop simulation, sensor noise modeling, and communication constraints to evaluate the real-world feasibility of the proposed approach.

Author Contributions

Conceptualization, all authors; methodology, Y.L. and H.G.; software, Y.L. and H.G.; validation, Y.L. and H.G.; formal analysis, Y.L. and H.G.; investigation, Y.L. and H.G.; resources, H.G. and Y.X.; writing—original draft preparation, Y.L. and H.G.; writing—review and editing, Y.L. and H.G.; supervision, H.G. and Y.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (Grant Number: 62103043), the National Key Laboratory of Space Intelligent Control (Grant Number: 2024-CXPT-GF-JJ-012-14), the Natural Science Foundation of Shandong Province (Grant Number: ZR2024QF006), the National Key Laboratory of Science and Technology on Space Born Intelligent Information Processing (Grant Number: TJ-01-22-08) and Beijing Institute of Technology Research Fund Program for Young Scholars.

Data Availability Statement

The data are contained within this article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Schedl, D.C.; Kurmi, I.; Bimber, O. An autonomous drone for search and rescue in forests using airborne optical sectioning. Sci. Robot. 2021, 6, eabg1188.
2. Wang, S.; Jiang, F.; Zhang, B.; Ma, R.; Hao, Q. Development of UAV-based target tracking and recognition systems. IEEE Trans. Intell. Transp. Syst. 2019, 21, 3409–3422.
3. Von Moll, A.; Garcia, E.; Casbeer, D.; Suresh, M.; Swar, S.C. Multiple-pursuer, single-evader border defense differential game. J. Aerosp. Inf. Syst. 2020, 17, 407–416.
4. Isaacs, R. Differential Games: A Mathematical Theory with Applications to Warfare and Pursuit, Control and Optimization; Courier Corporation: North Chelmsford, MA, USA, 1999.
5. Li, S.; Wang, C.; Xie, G. Optimal strategies for pursuit-evasion differential games of players with damped double integrator dynamics. IEEE Trans. Autom. Control 2023, 69, 5278–5293.
6. Wang, C.; Chen, H.; Pan, J.; Zhang, W. Encirclement guaranteed cooperative pursuit with robust model predictive control. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021; pp. 1473–1479.
7. Zhang, Y.; Zhang, P.; Wang, X.; Song, F.; Li, C.; Hao, J. An open loop Stackelberg solution to optimal strategy for UAV pursuit-evasion game. Aerosp. Sci. Technol. 2022, 129, 107840.
8. Wang, Y.; Dong, L.; Sun, C. Cooperative control for multi-player pursuit-evasion games with reinforcement learning. Neurocomputing 2020, 412, 101–114.
9. Zhang, R.; Zong, Q.; Zhang, X.; Dou, L.; Tian, B. Game of drones: Multi-UAV pursuit-evasion game with online motion planning by deep reinforcement learning. IEEE Trans. Neural Netw. Learn. Syst. 2022, 34, 7900–7909.
10. Kokolakis, N.M.T.; Vamvoudakis, K.G. Safety-aware pursuit-evasion games in unknown environments using Gaussian processes and finite-time convergent reinforcement learning. IEEE Trans. Neural Netw. Learn. Syst. 2022, 35, 3130–3143.
11. Xiong, H.; Zhang, Y. Reinforcement learning-based formation-surrounding control for multiple quadrotor UAVs pursuit-evasion games. ISA Trans. 2024, 145, 205–224.
12. Xiao, J.; Feroskhan, M. Learning multi-pursuit evasion for safe targeted navigation of drones. IEEE Trans. Artif. Intell. 2024, 5, 2691–4581.
13. Chen, Y.; Shi, Y.; Dai, X.; Meng, Q.; Yu, T. Pursuit-evasion game with online planning using deep reinforcement learning. Appl. Intell. 2025, 55, 512.
14. Hung, S.M.; Givigi, S.N. A Q-learning approach to flocking with UAVs in a stochastic environment. IEEE Trans. Cybern. 2016, 47, 186–197.
15. Zhuang, X.; Li, D.; Li, H.; Wang, Y.; Zhu, J. A dynamic control decision approach for fixed-wing aircraft games via hybrid action reinforcement learning. Sci. China Inf. Sci. 2025, 68, 132201.
16. Yang, Q.; Zhang, J.; Shi, G.; Hu, J.; Wu, Y. Maneuver decision of UAV in short-range air combat based on deep reinforcement learning. IEEE Access 2019, 8, 363–378.
17. Zhang, H.; Zhou, H.; Wei, Y.; Huang, C. Autonomous maneuver decision-making method based on reinforcement learning and Monte Carlo tree search. Front. Neurorobot. 2022, 16, 996412.
18. Wang, H.; Zhou, Z.; Jiang, J.; Deng, W.; Chen, X. Autonomous air combat maneuver decision-making based on PPO-BWDA. IEEE Access 2024, 12, 119116–119132.
19. Luo, D.; Fan, Z.; Yang, Z.; Xu, Y. Multi-UAV cooperative maneuver decision-making for pursuit-evasion using improved MADRL. Def. Technol. 2024, 35, 187–197.
20. Yan, T.; Liu, C.; Gao, M.; Jiang, Z.; Li, T. A deep reinforcement learning-based intelligent maneuvering strategy for the high-speed UAV pursuit-evasion game. Drones 2024, 8, 309.
21. Li, L.; Zhang, X.; Qian, C.; Zhao, M.; Wang, R. Cross coordination of behavior clone and reinforcement learning for autonomous within-visual-range air combat. Neurocomputing 2024, 584, 127591.
22. Zhang, Y.; Ding, M.; Zhang, J.; Yang, Q.; Shi, G.; Lu, M.; Jiang, F. Multi-UAV pursuit-evasion gaming based on PSO-M3DDPG schemes. Complex Intell. Syst. 2024, 10, 6867–6883.
23. Wang, X.; Chen, Y.; Zhu, W. A survey on curriculum learning. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 4555–4576.
24. De Souza, C.; Newbury, R.; Cosgun, A.; Castillo, P.; Vidolov, B.; Kulić, D. Decentralized multi-agent pursuit using deep reinforcement learning. IEEE Robot. Autom. Lett. 2021, 6, 4552–4559.
25. Li, F.; Yin, M.; Wang, T.; Huang, T.; Yang, C.; Gui, W. Distributed pursuit-evasion game of limited perception USV swarm based on multiagent proximal policy optimization. IEEE Trans. Syst. Man Cybern. Syst. 2024, 54, 6435–6446.
26. Yan, C.; Wang, C.; Xiang, X.; Low, K.H.; Wang, X.; Xu, X.; Shen, L. Collision-avoiding flocking with multiple fixed-wing UAVs in obstacle-cluttered environments: A task-specific curriculum-based MADRL approach. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 10894–10908.
27. Wang, L.; Zheng, S.; Tai, S.; Liu, H.; Yue, T. UAV air combat autonomous trajectory planning method based on robust adversarial reinforcement learning. Aerosp. Sci. Technol. 2024, 153, 109402.
28. Li, B.; Wang, J.; Song, C.; Yang, Z.; Wan, K.; Zhang, Q. Multi-UAV roundup strategy method based on deep reinforcement learning CEL-MADDPG algorithm. Expert Syst. Appl. 2024, 245, 123018.
29. Zhao, B.; Zhao, Y.; Jia, S.; Li, Z.; Huo, M.; Qi, N. Curriculum based reinforcement learning for pursuit-escape game between UAVs in unknown environment. SSRN Prepr. 2024. Available online: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5036940 (accessed on 21 May 2025).
30. Chen, J.; Xu, Z.; Li, Y.; Yu, C.; Song, J.; Yang, H.; Fang, F.; Wang, Y.; Wu, Y. Accelerate multi-agent reinforcement learning in zero-sum games with subgame curriculum learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 11320–11328.
31. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347.
32. Timbers, F.; Bard, N.; Lockhart, E.; Lanctot, M.; Schmid, M.; Burch, N.; Schrittwieser, J.; Hubert, T.; Bowling, M. Approximate exploitability: Learning a best response in large games. arXiv 2020, arXiv:2004.09677.
33. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Advances in Neural Information Processing Systems, Volume 30.
34. Yu, C.; Velu, A.; Vinitsky, E.; Gao, J.; Wang, Y.; Bayen, A.; Wu, Y. The surprising effectiveness of PPO in cooperative multi-agent games. Adv. Neural Inf. Process. Syst. 2022, 35, 24611–24624.
Figure 1. The 4-DOF fixed-wing UAV model in a 2D inertial frame.
Figure 2. Markov game environment for multi-agent pursuit–evasion involving fixed-wing UAVs.
Figure 3. Architecture of the proposed curriculum-based MARL algorithm. (a) The PEG environment for fixed-wing UAVs offers complex pursuit–evasion scenarios and an interactive setting for MARL; (b) the subgame curriculum learning module illustrates the structure of the proposed training framework based on subgame prioritization; and (c) the MAPPO-based decision-making module depicts the process of policy optimization and action selection within the CTDE and self-play framework.
Figure 4. Illustration of initial states for pursuers and evaders.
Figure 5. Reward curves of the pursuers and the evader: (a) Stepwise reward received by the pursuers. (b) Stepwise reward received by the evader.
Figure 6. Evolution of the encirclement success rate of the pursuers.
Figure 7. Visualization of the evader's initial position heatmap generated by subgame curriculum learning at different training timesteps. (a) Initial distribution: uniform across the environment. (b) Distribution at timestep 112 M: concentrated along the edges. (c) Distribution at timestep 250 M: concentrated in the corners.
Figure 8. Motion trajectories of fixed-wing UAVs at different timesteps in a three-versus-one PE scene: (a) Initial positions of all agents. (b) Trajectories up to timestep 65. (c) Trajectories up to timestep 130. (d) Trajectories up to timestep 200.
Figure 9. Performance comparison with baseline algorithms (MAPPO and PPO). SCL-MAPPO denotes the proposed subgame curriculum learning-based MAPPO. (a) Training reward curves. (b) Encirclement success rates.
Figure 10. Sensitivity analysis of the proposed algorithm across 1000 random seeds.
Figure 11. Performance comparison between training results in the uncertain environment and the deterministic environment. (a) Training reward curves. (b) Encirclement success rates.
Table 1. Hyperparameters related to the proposed algorithm.
Hyperparameter | Value
Learning rate of actor network (η_θ) | 5 × 10^−4
Discount rate (γ) | 0.99
GAE parameter (λ_GAE) | 0.95
Gradient clipping | 10.0
Adam stepsize | 1 × 10^−5
Entropy coefficient (σ) | 0.01
Parallel threads | 100
PPO clipping range (ε) | 0.2
PPO epochs | 5
Probability of subgame sampling (p) | 0.7
Capacity of the state buffer (K) | 10,000
Weight of the value difference (α) | 0.7
