1. Introduction
In recent years, unmanned aerial vehicles (UAVs) have played a significant role in various military operations, including the Russia–Ukraine conflict, and have become a pivotal force in modern warfare [1]. Owing to their affordability, rapid deployment, and high maneuverability, UAVs have been widely applied in surveillance [2], search [3], communication [4], and transportation [5]. Nevertheless, an individual UAV is inherently limited by restricted payload capacity, short endurance, and constrained onboard functionality [6], making it challenging to accomplish diverse missions in complex environments. In contrast, multiple UAVs can achieve enhanced flexibility, robustness, and broader task coverage through cooperative sensing, distributed decision-making, and task allocation [7], thus becoming a research frontier in intelligent control and swarm intelligence.
With the rapid development of deep reinforcement learning (DRL) [8], multi-agent deep reinforcement learning (MADRL) has been increasingly applied to UAV encirclement tasks, aiming to overcome the limitations of traditional rule-based and model predictive control approaches in high-dimensional, nonlinear, and adversarial environments [9]. Early studies such as Value-Decomposition Networks (VDN) [10] and QMIX [11] enabled multi-agent cooperation through value function decomposition, which proved effective in discrete action spaces. However, these approaches cannot be directly applied to physical systems with continuous control inputs, such as UAVs. To address this issue, Lowe et al. proposed Multi-Agent Deep Deterministic Policy Gradient (MADDPG) [12], which incorporates the centralized training and decentralized execution (CTDE) paradigm. This framework allows multiple agents to efficiently perform cooperative learning and dynamic coordination in continuous action spaces. Owing to these advantages, MADDPG has shown strong potential in UAV swarm control, formation, and encirclement. Subsequently, variants such as COMA [13] and MAPPO [14] were developed to improve credit assignment and stability; nevertheless, MADDPG remains one of the most representative and practical policy gradient methods for continuous UAV control.
Encirclement (pursuit–evasion) is a representative multi-agent cooperative control problem with critical applications in battlefield target interception, counter-UAV operations, and sensitive area defense [15]. The encirclement task requires a UAV swarm to jointly capture a maneuverable target in complex scenarios, which places high demands on maneuverability, task allocation, and global coordination. For example, Du et al. [16] applied curriculum learning with multirotor UAVs to achieve cooperative encirclement of intruders in urban airspace. Fang et al. [17] investigated the interception of a high-speed evader by distributed encircling UAVs, confirming the advantages of decentralized control in high-speed environments. Zhang et al. [18] proposed the GM-TD3 algorithm for intercepting high-speed targets with multiple fixed-wing UAVs in obstacle-rich environments, further enhancing stability and coordination.
In complex environments, UAV swarms must achieve mission objectives while avoiding collisions with both environmental obstacles and teammates. Traditional obstacle avoidance methods, such as geometric approaches, artificial potential fields, and optimization-based control, are often limited in handling high-dimensional continuous state–action spaces and are prone to local optima in unstructured environments [19]. In contrast, MADRL-based methods have demonstrated stronger adaptability and flexibility. Xiang and Xie [20] introduced coupled rules in adversarial tasks, significantly improving the success rate of the UAV swarm. Li et al. [21] proposed a dual-layer experience replay mechanism to enhance policy generalization. Zhao et al. [22] combined meta-learning with MADDPG and proposed the MW-MADDPG framework to strengthen group decision-making. Cao and Chen [23] incorporated attention mechanisms into MADDPG, accelerating convergence and improving decision accuracy. Xie et al. [24] addressed obstacle avoidance for multirotor UAV formations in environments with static and dynamic obstacles, demonstrating the adaptability of DRL in diverse scenarios.
In addition to task complexity, UAV platform characteristics are also critical to cooperative control. Multirotor UAVs, while offering hovering capability and high maneuverability, are limited in range and speed, making them unsuitable for high-speed encirclement. By contrast, fixed-wing UAVs have distinct advantages in endurance and velocity, making them more suitable for long-range cruising and fast target interception [25]. However, fixed-wing UAVs also introduce nonlinear and coupled control challenges due to constraints such as a minimum airspeed, a minimum turning radius, and the inability to hover [26]. To address these challenges, DRL methods have been increasingly applied to fixed-wing UAV decision-making and control. For example, Yan et al. [27] proposed the PASCAL algorithm, which integrates curriculum learning with MADRL to achieve collision-free formation flight of large-scale fixed-wing UAV swarms. Yan et al. [28] further enhanced efficiency and scalability by employing attention-based DRL with parameter sharing. Zhuang et al. [29] studied penetration strategies for high-speed fixed-wing UAVs, validating the effectiveness of DRL in maneuvering tasks. Li et al. [30] combined DQN with PID control for autonomous longitudinal landing, demonstrating the potential of DRL in low-level flight dynamics control.
Despite these advances, several limitations remain. First, many studies focus on single-task scenarios, such as those involving only encirclement or only obstacle avoidance, whereas real-world missions typically require UAVs to simultaneously execute multiple tasks, including encirclement, obstacle avoidance, and formation maintenance. Second, although curriculum learning has shown promise in improving training efficiency and convergence in MADRL [31], its application to fixed-wing UAV cooperative missions remains underexplored, particularly in terms of systematically integrating composite tasks with dynamic constraints.
Based on the above background, this paper proposes a composite deep reinforcement learning method with curriculum learning to address the challenges faced by fixed-wing UAV swarms in simultaneously executing encirclement and obstacle avoidance tasks in complex environments. The main contributions are summarized as follows:
Task modeling: A cooperative multi-agent fixed-wing UAV environment is developed, integrating encirclement and obstacle avoidance with realistic fixed-wing dynamics, maneuverable multirotor evaders, and laser-based local perception;
Method design: A composite MADDPG framework enhanced with curriculum learning is introduced, employing progressively staged tasks and reward optimization to accelerate convergence and enhance policy generalization;
Performance validation: Comparative experiments are conducted in a simulation environment, evaluating metrics such as encirclement duration, capture success rate, and obstacle avoidance safety rate, thereby verifying the effectiveness and superiority of the proposed method in complex scenarios.
2. Problem Formulation
This paper investigates multi-agent deep reinforcement learning (MADRL) for encirclement tasks in obstacle-rich environments, focusing on a scenario where multiple fixed-wing UAVs encircle a multirotor UAV. At the beginning of each episode, the positions and orientations of the UAVs, as well as the positions and radii of the obstacles, are randomly initialized. The scenario setup is illustrated in Figure 1a. The pursuit is considered successful only if all fixed-wing encircling UAVs (UAVs_e), without colliding or leaving the bounded area, keep the multirotor target UAV (UAV_t) within the capture range for a specified number of steps, as illustrated in Figure 1b.
The environment is modeled as a rectangular area of dimensions $L \times W$. The UAVs_e swarm consists of $N$ agents, where the position of the $i$-th fixed-wing UAV is denoted $\mathbf{p}_i$ and its safety radius $r_{\mathrm{safe}}$. The UAV_t is defined as the target, whose position is denoted $\mathbf{p}_t$, with a capture radius $r_{\mathrm{cap}}$.
The environment also contains multiple clusters of trees and restricted zones, abstracted as static circular obstacles. The total number of obstacles is $M$, where the center of the $j$-th obstacle is denoted $\mathbf{o}_j$ and its radius $r_j$.
The perception of each UAV is modeled using laser beams. A total of $K$ beams are employed, with a detection range of $d_{\max}$. The direction of the $k$-th beam is denoted $\mathbf{e}_k$, and its detected distance is $d_k$.
2.1. UAV Modeling
Establishing the UAV motion equations involves coordinate transformations. The ground coordinate system $Ox_gy_gz_g$ is fixed on the Earth, defined according to the North–East–Down (NED) convention, with its origin located at the takeoff point. The $Ox_g$ axis points north, the $Oz_g$ axis points vertically downward, and the $Oy_g$ axis is perpendicular to the $x_gOz_g$ plane, pointing east.
The trajectory coordinate system $Ox_ky_kz_k$ has its origin at the UAV's center of mass. The $Ox_k$ axis is aligned with the velocity direction; the $Oz_k$ axis lies in the vertical plane containing $Ox_k$, perpendicular to it, and points downward; and the $Oy_k$ axis is perpendicular to the $x_kOz_k$ plane and points to the right.
The primary forces acting on a fixed-wing UAV originate from engine thrust and aerodynamic forces. The aerodynamic forces depend on the deflection angles of the ailerons, rudder, and elevator, as well as on the UAV's airspeed and body shape. Neglecting the three-dimensional attitude of the fixed-wing UAV, the acceleration in the trajectory coordinate system is simplified to a point-mass balance of thrust, drag, lift, lateral force, and gravity. Here, the flight path angle $\gamma$ is defined as the angle between the $Ox_k$ axis and the horizontal plane $x_gOy_g$, positive upward; the thrust $T$ is generated by the engine along the body axis; the drag $D$ opposes the velocity direction; gravity has magnitude $mg$ and points vertically downward; the lift $L$ is generated by the wings and is perpendicular to the incoming airflow; and the lateral force $Z$ is generated by banking and the yaw control surfaces. The coordinate systems together with the forces and attitude angles are illustrated in Figure 2.
In the simplified model, the velocity coordinate system coincides with the body coordinate system, and the flight altitude is assumed to be low, with the ground approximated as a plane. The transformation matrix from the trajectory coordinate system to the ground coordinate system is obtained by composing elementary rotations, where the heading angle $\psi$ is defined as the angle between the projection of the $Ox_k$ axis onto the horizontal plane $x_gOy_g$ and the $Ox_g$ axis, positive to the right, and $R_x$, $R_y$, $R_z$ are the rotation matrices about the $x$, $y$, and $z$ axes, respectively.
The UAV velocity in the ground frame is then obtained by applying this transformation to the velocity expressed in the trajectory coordinate system. Since the $Oz_g$ axis of the ground coordinate system points downward, opposite to the plotting coordinate system used for visualization, in which the vertical axis points upward, a final axis flip is applied before plotting. For multirotor UAVs, the kinematic equations are formulated directly in the plotting (image) coordinate system.
To reduce computational complexity and highlight the core design of the control algorithm, all UAV dynamic models in this study are based on simplified assumptions, without considering actuator dynamics, wind field effects, or six-degree-of-freedom (6 DoF) coupling.
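As an illustration of the planar kinematics described above, the following minimal Python sketch maps the speed, flight path angle, and heading angle to ground-frame velocity components and applies the axis flip used for plotting. The function names and the specific plotting axis assignment are assumptions for illustration, not the implementation used in this study.

```python
import numpy as np

def trajectory_to_ground_velocity(v, gamma, psi):
    """Velocity components in the NED ground frame for the simplified point-mass model.

    v     : speed along the trajectory x_k axis
    gamma : flight path angle, positive upward
    psi   : heading angle measured from north, positive to the right
    """
    v_north = v * np.cos(gamma) * np.cos(psi)
    v_east = v * np.cos(gamma) * np.sin(psi)
    v_down = -v * np.sin(gamma)  # NED: positive z points downward
    return np.array([v_north, v_east, v_down])

def ned_to_plot(p_ned):
    """Flip the vertical axis so that altitude is positive upward for visualization."""
    north, east, down = p_ned
    return np.array([east, north, -down])  # assumed plotting convention: x = east, y = north, z = up
```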
2.2. Constraint Settings and State Calculation
In reality, the magnitude of the available forces and the payload capacity of UAVs are limited, so the velocity cannot change abruptly. In this study, the action space is defined in terms of acceleration, and the commanded acceleration vector is clipped component-wise as $\mathbf{a} \leftarrow \mathrm{clip}(\mathbf{a}, \mathbf{a}_{\min}, \mathbf{a}_{\max})$, where $\mathrm{clip}(\cdot)$ denotes the interval clipping function, which saturates a value to the boundary if it exceeds the upper or lower limit. The maximum and minimum accelerations along the three axes differ for fixed-wing UAVs, while they are identical for multirotor UAVs. The acceleration limits are computed in simplified form from the time step $\Delta t$, a time scaling coefficient, the maximum and minimum UAV speeds $v_{\max}$ and $v_{\min}$, and the minimum turning radius $R_{\min}$.
For multirotor UAVs, the minimum speed is zero. Fixed-wing UAVs require a certain forward flight speed, so the admissible velocity range is restricted: after each update, the speed magnitude is saturated to the interval $[v_{\min}, v_{\max}]$ by a velocity saturation operator. The position is then advanced by the discrete-time update $\mathbf{p}_{t+1} = \mathbf{p}_t + \mathbf{v}_{t+1}\,\Delta t$.
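The following Python sketch illustrates this constraint pipeline: acceleration clipping, speed saturation, and the forward Euler position update. The numerical limits and the symmetric acceleration bound are placeholders for illustration; the actual values used in this study are listed in Table 1.

```python
import numpy as np

# Illustrative limits only; see Table 1 for the values used in the experiments.
V_MIN, V_MAX = 20.0, 32.0   # admissible fixed-wing speed range, m/s (assumed)
A_MAX = 2.5                 # symmetric acceleration bound, m/s^2 (assumed)
DT = 0.1                    # integration time step, s (assumed)

def step_kinematics(pos, vel, accel_cmd):
    """One discrete update: clip acceleration, saturate speed, integrate position."""
    accel = np.clip(accel_cmd, -A_MAX, A_MAX)             # component-wise interval clipping
    vel = vel + accel * DT
    speed = np.linalg.norm(vel)
    if speed > 1e-6:
        vel = vel / speed * np.clip(speed, V_MIN, V_MAX)  # speed saturation to [V_MIN, V_MAX]
    pos = pos + vel * DT                                  # forward Euler position update
    return pos, vel
```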
2.3. Laser-Based Environment Perception
The core of radar-based detection is to acquire spatial information by analyzing the interaction between electromagnetic waves and targets or the environment. In this work, to ensure computational efficiency while maintaining reasonable positioning accuracy, the radar beam propagation and environmental obstacles are geometrically simplified, allowing real-time computation of beam lengths at different angles.
The simplified model abstracts the radar-emitted electromagnetic waves as straight rays of length $d_{\max}$, divided into $K$ beams.
Figure 3a shows the intersections of the laser rays with a circular obstacle. Using analytic geometry, the intersection points between the $k$-th ray and the surface of the $j$-th obstacle are obtained by substituting the parameterized ray $\mathbf{p} = \mathbf{p}_i + t\,\mathbf{e}_k$ into the circle equation $\lVert\mathbf{p} - \mathbf{o}_j\rVert = r_j$ and solving the resulting system for $t$, where $\mathbf{p}$ denotes the intersection point coordinates between the UAV laser ray and the circular obstacle, $\mathbf{e}_k$ is the direction vector of the laser, and $t$ is the parameterized distance along the ray.
For a rectangular environment with radar-detectable boundaries, the intersections of each ray with the top, bottom, left, and right boundaries are evaluated individually in the same parameterized form. As an example, Figure 3b illustrates the intersection points with the top and right boundaries.
Since a ray may be intercepted by a nearby obstacle first, the real-time ray length $d_k$ is taken as the minimum of all valid intersection distances and the detection range $d_{\max}$.
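A minimal 2D Python sketch of this geometric perception model is given below, assuming unit-length ray directions and a ray origin inside the arena; the function names are illustrative rather than those of the simulation environment.

```python
import numpy as np

def ray_circle_distance(origin, direction, center, radius):
    """Smallest positive t with ||origin + t*direction - center|| = radius, else inf."""
    oc = origin - center
    b = np.dot(oc, direction)                 # direction is assumed to be unit length
    c = np.dot(oc, oc) - radius ** 2
    disc = b * b - c
    if disc < 0.0:
        return np.inf                         # ray misses the circle
    t = -b - np.sqrt(disc)                    # nearer intersection
    if t <= 0.0:
        t = -b + np.sqrt(disc)                # origin inside the circle: take the far root
    return t if t > 0.0 else np.inf

def ray_boundary_distance(origin, direction, x_min, x_max, y_min, y_max):
    """Distance along the ray to the rectangular boundary (origin assumed inside)."""
    hits = []
    for axis, (lo, hi) in enumerate([(x_min, x_max), (y_min, y_max)]):
        if abs(direction[axis]) > 1e-9:
            for bound in (lo, hi):
                t = (bound - origin[axis]) / direction[axis]
                if t > 0.0:
                    hits.append(t)
    return min(hits) if hits else np.inf

def beam_length(origin, heading, beam_angle, obstacles, bounds, d_max):
    """Real-time beam length: nearest obstacle or boundary hit, capped at d_max."""
    angle = heading + beam_angle
    direction = np.array([np.cos(angle), np.sin(angle)])
    d = ray_boundary_distance(origin, direction, *bounds)
    for center, radius in obstacles:
        d = min(d, ray_circle_distance(origin, direction, np.asarray(center), radius))
    return min(d, d_max)
```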
3. Proposed Method
This section first introduces the basic modeling approach for multi-UAV collision avoidance, i.e., the Partially Observable Markov Game (POMG), and defines the state space, action space, and reward function within this framework. Subsequently, the application mechanism of Multi-Agent Deep Deterministic Policy Gradient (MADDPG), a representative framework in MADRL, is presented for collaborative pursuit and obstacle avoidance tasks.
Furthermore, the concept of Curriculum Learning (CL) is incorporated, employing a graded environment difficulty design and dynamic task scheduling to significantly enhance training stability and the generalization capability of the learned policies. Based on this idea, a curriculum-learning-enhanced composite multi-agent deep reinforcement learning algorithm (MADDPG_CL) is proposed.
This method not only effectively addresses collision avoidance and pursuit challenges of fixed-wing UAVs in highly dynamic target and complex obstacle environments, but also demonstrates advantages in collaboration efficiency and learning convergence.
3.1. MADRL
3.1.1. POMG
The multi-UAV collaborative obstacle avoidance and pursuit problem is a typical cooperative MADRL task. It is commonly modeled as a POMG [32], formally defined as an 8-tuple $\langle \mathcal{N}, \mathcal{S}, \mathcal{A}, \mathcal{O}, R, P, \Omega, \gamma \rangle$, where $\mathcal{N}$ denotes the set of agents; $\mathcal{S}$ is the joint state space; $\mathcal{A}$ is the joint action space; $\mathcal{O}$ is the joint observation space; $R$ is the joint reward function; $P$ is the state transition probability function; $\Omega$ is the observation function; and $\gamma$ is the discount factor.
At each time step $t$, each agent $i$ receives a local observation $o_i$ from the environment and selects an action $a_i$ according to its policy. All agents act simultaneously without sequential dependencies. The joint action $\mathbf{a} = (a_1, \ldots, a_N)$ drives the state transition $s \rightarrow s'$, and the team receives a shared reward $r$. The objective is to find an optimal policy $\pi^{*}$ that maximizes the expected cumulative discounted return $J(\pi) = \mathbb{E}\big[\sum_{t} \gamma^{t} r_t\big]$ under the environment dynamics.
3.1.2. State Space and Action Space
The state of the $i$-th UAV combines an environment state and a UAV state, where the environment state includes the boundaries and obstacle positions, and the UAV state contains the UAV's position and velocity.
Since the environment is partially observable, the environment state is perceived by the UAV through laser-based detection, the states of other UAVs_e are obtained via inter-UAV communication, and the UAV_t’s state is provided by the ground station.
The observation of the $i$-th UAV_e concatenates four parts: a self-observation representing the UAV's own position and velocity; a team observation representing the positions of the other UAVs_e; a goal observation representing the distance and relative angle to the UAV_t; and a laser observation representing the lengths of the rays emitted in multiple directions.
The observation of the UAV_t additionally includes an enemy observation, which denotes the distances from each UAV_e to the target.
The action space of the fixed-wing UAVs is defined in the trajectory coordinate system as acceleration commands, while the action space of the multirotor UAV is defined directly in the plotting (image) coordinate system. Furthermore, since different state variables may have different dimensions and ranges, using them directly can lead to unstable or inefficient learning. In deep reinforcement learning, the state and observation values are therefore normalized to a common scale as a preprocessing step.
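As a minimal sketch of this preprocessing step, the snippet below scales each observation component by its known physical range to [-1, 1]; the exact normalization formula and the dictionary of ranges are assumptions for illustration.

```python
import numpy as np

def normalize(x, lo, hi):
    """Scale each entry of x from its physical range [lo, hi] to [-1, 1]."""
    x = np.asarray(x, dtype=np.float32)
    return 2.0 * (x - lo) / (hi - lo) - 1.0

def build_observation(self_obs, team_obs, goal_obs, laser_obs, ranges):
    """Concatenate the four observation parts and normalize them jointly.

    `ranges` is an assumed dictionary holding the lower/upper bounds of each part.
    """
    raw = np.concatenate([self_obs, team_obs, goal_obs, laser_obs])
    lo = np.concatenate([ranges["self_lo"], ranges["team_lo"],
                         ranges["goal_lo"], np.zeros_like(laser_obs)])
    hi = np.concatenate([ranges["self_hi"], ranges["team_hi"],
                         ranges["goal_hi"], np.full_like(laser_obs, ranges["laser_range"])])
    return normalize(raw, lo, hi)
```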
3.1.3. MADDPG
MADDPG is a typical actor–critic based multi-agent reinforcement learning method, extending single-agent DDPG to multi-agent scenarios. The core idea is that each agent’s Critic can access information from all other agents, enabling centralized training with decentralized execution (CTDE). Specifically, during training, a centralized Critic with global observation guides the Actor’s learning to improve stability, while during execution, only the Actor with local observation is used, aligning with real-world conditions.
The Actor network, i.e., the policy network, is defined as $a_i = \mu_i(o_i)$, indicating that the Actor selects an action based solely on its own observation $o_i$. The Critic network $Q_i(\mathbf{o}, \mathbf{a})$ evaluates the action value using the joint observation and joint action of all agents.
Since the update relies on bootstrapping, directly using the online network parameters to compute the target values would couple the target function with the parameters being updated, which can lead to instability or divergence during training. To mitigate this, a target network mechanism is introduced. The target Actor and target Critic networks are updated via a soft update, $\theta^{Q'} \leftarrow \tau\,\theta^{Q} + (1-\tau)\,\theta^{Q'}$ and $\theta^{\mu'} \leftarrow \tau\,\theta^{\mu} + (1-\tau)\,\theta^{\mu'}$, where $\tau$ is the update rate and $\theta^{Q'}$, $\theta^{\mu'}$ denote the parameters of the target Q-network and target policy network, respectively.
The Critic network of agent $i$ is updated by minimizing the temporal-difference (TD) loss $L(\theta_i^{Q}) = \mathbb{E}_{(\mathbf{o},\mathbf{a},r,\mathbf{o}') \sim \mathcal{D}}\big[(Q_i(\mathbf{o},\mathbf{a}) - y_i)^2\big]$ with TD target $y_i = r + \gamma (1-d)\, Q_i'(\mathbf{o}', \mu_1'(o_1'), \ldots, \mu_N'(o_N'))$, where $\mathcal{D}$ is the experience replay buffer and $d$ indicates episode termination. Minimizing the mean squared error between the predicted value and the TD target enables the Critic to learn accurate value estimates.
Simultaneously, the Actor network is optimized using the deterministic policy gradient $\nabla_{\theta_i^{\mu}} J = \mathbb{E}_{\mathbf{o} \sim \mathcal{D}}\big[\nabla_{\theta_i^{\mu}} \mu_i(o_i)\, \nabla_{a_i} Q_i(\mathbf{o}, a_1, \ldots, a_N)\big|_{a_i = \mu_i(o_i)}\big]$.
During action selection, exploration noise is added in a decaying manner during training, while deterministic outputs are used during evaluation. The action executed in the training phase is $a_i = \mu_i(o_i) + \epsilon\,\mathcal{N}_t$, where $\mathcal{N}_t$ is a random noise term within a bounded range and $\epsilon$ is a decaying scaling factor.
The overall process is illustrated in Figure 4.
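To make the CTDE update concrete, the following condensed PyTorch sketch performs one MADDPG update step for agent i: a critic update against the soft-updated target networks, a deterministic policy gradient step for the actor, and the soft update itself. Network sizes, hyperparameter values, and function names are illustrative assumptions, not the implementation used in this paper.

```python
import torch
import torch.nn.functional as F

def soft_update(target_net, online_net, tau):
    """theta_target <- tau * theta_online + (1 - tau) * theta_target."""
    for p_t, p in zip(target_net.parameters(), online_net.parameters()):
        p_t.data.mul_(1.0 - tau).add_(tau * p.data)

def maddpg_update(i, batch, actors, critics, target_actors, target_critics,
                  actor_opts, critic_opts, gamma=0.95, tau=0.01):
    """One centralized-training update for agent i from a sampled minibatch.

    `batch` holds per-agent lists of tensors: observations, actions, rewards,
    next observations, plus a shared termination flag.
    """
    obs, acts, rews, next_obs, done = batch
    # Critic: TD target built from the target actors and the target critic.
    with torch.no_grad():
        next_acts = [mu_t(o) for mu_t, o in zip(target_actors, next_obs)]
        q_next = target_critics[i](torch.cat(next_obs + next_acts, dim=-1))
        y = rews[i] + gamma * (1.0 - done) * q_next
    q = critics[i](torch.cat(obs + acts, dim=-1))
    critic_loss = F.mse_loss(q, y)
    critic_opts[i].zero_grad()
    critic_loss.backward()
    critic_opts[i].step()
    # Actor: deterministic policy gradient through the centralized critic.
    acts_i = list(acts)
    acts_i[i] = actors[i](obs[i])
    actor_loss = -critics[i](torch.cat(obs + acts_i, dim=-1)).mean()
    actor_opts[i].zero_grad()
    actor_loss.backward()
    actor_opts[i].step()
    # Soft-update the target networks.
    soft_update(target_critics[i], critics[i], tau)
    soft_update(target_actors[i], actors[i], tau)
```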
3.2. Curriculum Learning Strategy Design
During the pursuit process, the core challenge for fixed-wing UAVs compared to multirotor UAVs lies in their flight characteristics: they cannot hover or fly backward when encountering obstacles. Determining how to perform deceleration and turning control in advance based on radar detection distances becomes a critical challenge for deep reinforcement learning algorithms.
Due to the size of the pursuit area, the obstacle distribution, and dynamic interference from other UAVs, collisions may occur and can terminate training episodes prematurely. If highly complex environments are used from the initial stage, UAVs may adopt locally suboptimal strategies, such as circling in place to avoid severe collision penalties. Frequent collisions during exploration also hinder the search for globally optimal solutions, significantly increasing training difficulty. Therefore, a strategy of progressively increasing task and environment complexity is more reasonable, allowing the UAV agents to gradually optimize their parameters and eventually master effective pursuit control policies.
To incrementally enhance task difficulty and generalization, we employ a CL mechanism to design the pursuit task in a staged manner. By gradually introducing obstacles, target mobility, and spatial constraints, the stability and convergence efficiency of training are effectively improved.
The task is divided into four curriculum levels, Level 0 to Level 3:
Level 0: basic environment with no obstacles, a stationary target, a small initial spawn range, and a large capture radius;
Level 1: introduces obstacles, enlarges the initial spawn range, and reduces the capture radius;
Level 2: enables the UAV_t to move, further enlarges the initial spawn range, and reduces the capture radius again;
Level 3: the initial spawn range covers almost the entire scene, with the smallest capture radius.
To reduce the number of training steps, a performance-driven curriculum scheduler is introduced. This mechanism dynamically adjusts task difficulty based on task performance and training progress. The system adaptively increases the curriculum level according to the pursuit success rate over a recent window of episodes, computed as the fraction of successful episodes among the last $W$ episodes, where $W$ is the size of the sliding window. An episode is considered successful if all UAVs enter the UAV_t's capture range and remain within it for a specified number of steps; if any UAV leaves the capture range, the step count is reset to zero.
To prevent premature level promotion, a dual-condition mechanism is adopted: level switching occurs only when both the performance threshold and the training quantity condition are satisfied. The training quantity condition requires a minimum amount of experience: the number of episodes in the current stage must be at least twice the evaluation window, and the cumulative steps must exceed twice the replay buffer capacity. The pseudocode is shown in Algorithm 1.
Algorithm 1: MADDPG_CL (Multi-Agent Deep Deterministic Policy Gradient with Curriculum Learning)
Input: Level = 0, number of agents N, hyperparameters.
Output: Environmental rewards, task success rate.
1. Initialize the replay buffer D
2. for each agent i = 1, ..., N do
3.   Initialize the actor, critic, target actor, and target critic networks with parameters θ_i^μ, θ_i^Q, θ_i^μ′, θ_i^Q′
4. end for
5. for each episode do
6.   Initialize the environment and agent states at the current Level
7.   for each time step do
8.     Each agent selects an action with decaying exploration noise via its actor network
9.     The environment updates, returning the reward and next state s′
10.    Store the transition into D
11.    if the episode terminates (capture completed, collision, or step limit reached) then
12.      break
13.    end if
14.    if D contains enough samples then
15.      for each agent i = 1, ..., N do
16.        Sample a random minibatch from D
17.        Update the critic by minimizing the TD loss
18.        Update the actor using the deterministic policy gradient
19.        Soft-update the target networks
20.      end for
21.    end if
22.  end for
23.  Compute the success rate over the recent evaluation window
24.  if the success rate exceeds the threshold and the minimum-experience condition is met then
25.    Level ← Level + 1
26.  end if
27. end for
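The level-switching logic of the performance-driven scheduler can be summarized in a short Python sketch; the window size, success threshold, and buffer capacity below are placeholders rather than the paper's settings.

```python
from collections import deque

class CurriculumScheduler:
    """Performance-driven level promotion with a dual performance/experience condition."""

    def __init__(self, window=100, threshold=0.8, buffer_capacity=100_000, max_level=3):
        self.window = window                    # sliding evaluation window (episodes)
        self.threshold = threshold              # required success rate (placeholder value)
        self.buffer_capacity = buffer_capacity  # replay buffer capacity (placeholder value)
        self.max_level = max_level
        self.level = 0
        self.results = deque(maxlen=window)     # recent episode outcomes (True/False)
        self.episodes_in_level = 0
        self.steps_in_level = 0

    def report_episode(self, success, steps):
        """Record one finished episode and promote the level if both conditions hold."""
        self.results.append(bool(success))
        self.episodes_in_level += 1
        self.steps_in_level += steps
        if self.level < self.max_level and self._should_promote():
            self.level += 1
            self.results.clear()
            self.episodes_in_level = 0
            self.steps_in_level = 0
        return self.level

    def _should_promote(self):
        # Experience condition: at least twice the window of episodes and
        # twice the replay buffer capacity of cumulative steps in this stage.
        enough_experience = (self.episodes_in_level >= 2 * self.window
                             and self.steps_in_level >= 2 * self.buffer_capacity)
        # Performance condition: success rate over the sliding window.
        success_rate = sum(self.results) / max(len(self.results), 1)
        return enough_experience and success_rate >= self.threshold
```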
3.3. Reward Function Design
The reward function is the core mechanism guiding agents to learn effective pursuit strategies. It needs to account for multiple aspects of UAV behavior, including approaching the target, collision avoidance, and cooperative encirclement. In pursuit tasks, the primary capture reward is typically sparse, so shaping dense intermediate rewards is introduced to provide continuous feedback and prevent learning stagnation.
To keep the sparse and dense rewards on a comparable scale, a terminal discounted reward is computed from the discount factor $\gamma$, the episode length $T$, the sparse terminal reward, and the continuous (dense) reward.
The terminal discounted reward also adjusts the relative weighting of the different behavioral rewards and differentiates the reward logic between the UAVs_e and the UAV_t, forming an adversarial learning framework. The per-step reward is defined as a weighted sum of behavior terms, $r_t = \sum_{m} k_m\,\hat{r}_m$, where each coefficient $k_m$ can be positive or negative and $\hat{r}_m$ is the normalized reward for the corresponding behavior.
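A minimal sketch of this weighted composition is shown below; the behavior names and coefficient values are placeholders, not the coefficients used in this study.

```python
# Placeholder coefficients; the signs follow the description in Section 3.3.1.
REWARD_COEFFS = {
    "collision": -1.0,       # penalties carry negative coefficients
    "distance": 0.3,
    "distance_rate": 0.3,
    "encirclement": 0.2,
    "capture": 0.2,
    "time": -0.1,
}

def per_step_reward(normalized_terms):
    """Weighted sum of normalized behavior rewards, r_t = sum_m k_m * r_hat_m."""
    return sum(REWARD_COEFFS[name] * value for name, value in normalized_terms.items())
```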
3.3.1. Reward Function for Encircling UAVs
In collision avoidance and pursuit tasks, multiple UAVs_e need to autonomously plan optimal trajectories to efficiently encircle the UAV_t. This requires simultaneously satisfying safety constraints, time optimality, and cooperative encirclement objectives. The total reward for the UAVs_e is the sum of a collision penalty, a distance reward, a distance-rate reward, an encirclement reward, a capture reward, a completion reward, and a time penalty. The components are defined as follows:
Collision penalty: scaled by a collision penalty coefficient and triggered by a collision indicator;
Distance reward: scaled by a distance reward coefficient and computed from the current distance between the UAV_e and the UAV_t;
Distance-rate reward: scaled by a rate coefficient and computed from the change in distance relative to the previous step;
Encirclement reward: computed from the areas of the triangles formed by the UAV_t and any two UAVs_e, compared with the area of the triangle formed by the three UAVs_e (a geometric sketch is given at the end of this subsection);
Time penalty: applied at every time step, with a coefficient equal to the negative sum of the other positive reward coefficients;
Capture reward: scaled by a reward coefficient granted for entering the capture radius;
Completion reward: a sparse reward, scaled by its own coefficient, granted when all UAVs_e remain within the capture radius for the required number of steps.
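The encirclement term above admits a simple geometric reading: for three pursuers, the target lies inside the triangle they form exactly when the areas of the three (target, pursuer, pursuer) sub-triangles sum to the area of the pursuer triangle. The Python sketch below implements this check as a dense encirclement signal; it is one plausible instantiation, not the paper's exact reward formula.

```python
def tri_area(a, b, c):
    """Area of the triangle spanned by the 2D points a, b, c."""
    return 0.5 * abs((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1]))

def encirclement_ratio(target, pursuers):
    """Ratio of the pursuer-triangle area to the sum of target sub-triangle areas.

    Equals 1.0 exactly when the target lies inside the triangle formed by the
    three pursuers, and drops below 1.0 as the target escapes the formation.
    """
    e1, e2, e3 = pursuers
    sub_areas = (tri_area(target, e1, e2)
                 + tri_area(target, e2, e3)
                 + tri_area(target, e3, e1))
    full_area = tri_area(e1, e2, e3)
    return full_area / sub_areas if sub_areas > 1e-9 else 0.0

# Example: three pursuers around a target at the origin -> ratio == 1.0
print(encirclement_ratio((0.0, 0.0), [(1.0, 0.0), (-0.5, 0.9), (-0.5, -0.9)]))
```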
3.3.2. Reward Function for Target UAV
The reward mechanism for the UAV_t is adversarial to that of the UAVs_e, aiming to avoid encirclement while performing autonomous collision avoidance. The total reward is the sum of an encirclement penalty, a center guidance reward, and a collision penalty, each scaled by its respective coefficient: the encirclement penalty coefficient, the center guidance reward coefficient, and the collision penalty coefficient.
4. Experiments and Results
The experiments were conducted in a custom UAV simulation environment built on Python 3.10.17 and Gymnasium 1.1.1, incorporating dynamics models for both fixed-wing and multirotor UAVs.
In terms of computational performance, all experiments were carried out on a laptop equipped with an Intel Core Ultra 9 275HX (2.70 GHz) processor (Intel, Santa Clara, CA, USA), an NVIDIA RTX 5060 GPU with CUDA acceleration (NVIDIA, Santa Clara, CA, USA), and 32 GB of RAM. During training, the GPU utilization was approximately 80%, with CPU usage around 10% and memory consumption of about 3 GB. Under these hardware conditions, the average training efficiency was approximately 10,000 episodes per 6 h, and the complete curriculum training of 40,000 episodes took about 24 h in total.
4.1. Parameter Settings
Table 1 summarizes the parameters of the environment, UAVs, and onboard radar sensors, balancing scenario realism with computational feasibility.
To validate the rationality of the adopted simplified model parameters, a comparative analysis was conducted based on the standard coordinated turn model. For a level coordinated turn, the governing relations are $\tan\phi = v^{2}/(gR)$ and $n = 1/\cos\phi$, where $v$ is the flight speed, $R$ is the turning radius, $g$ is the gravitational acceleration, $\phi$ is the roll angle, and $n$ is the load factor.
Considering the maneuvering characteristics of small fixed-wing UAVs, when the flight speed is within 20–32 m/s and the maximum lateral acceleration is 2.5 m/s², the corresponding roll angle is approximately 14.3° and the load factor is approximately 1.03, which represents a light-load coordinated turn. This parameter range is consistent with typical small fixed-wing UAVs, whose load factors during level coordinated turns usually range between 1.0 and 1.4. These results demonstrate that the simplified point-mass model adopted in this study provides reasonable dynamic consistency and physical interpretability for planar flight tasks.
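These figures can be reproduced with a few lines of Python from the coordinated-turn relations above; the turning-radius line is a derived illustration that assumes the lower bound of the stated speed range.

```python
import math

g = 9.81        # gravitational acceleration, m/s^2
a_lat = 2.5     # maximum lateral acceleration, m/s^2

phi = math.atan(a_lat / g)   # roll angle of a level coordinated turn
n = 1.0 / math.cos(phi)      # load factor
r_turn = 20.0 ** 2 / a_lat   # turning radius at 20 m/s and maximum lateral acceleration

print(f"roll angle ≈ {math.degrees(phi):.1f} deg")   # ≈ 14.3 deg
print(f"load factor ≈ {n:.2f}")                      # ≈ 1.03
print(f"turning radius ≈ {r_turn:.0f} m")            # 160 m at 20 m/s
```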
The hyperparameters of MADDPG are given in Table 2, with most values adopted from common practice in multi-agent reinforcement learning. The learning rate and discount factor were tuned through preliminary experiments to ensure stable convergence. The parameter settings of the four curriculum levels are listed in Table 3, where task complexity is progressively increased by adjusting the UAV initial positions, obstacle density, and target maneuverability. The coefficients of the reward function are detailed in Section 3.3. All experiments were conducted with fixed random seeds to ensure reproducibility while retaining randomized initial conditions.
4.2. Experimental Results
A collision failure is declared if any UAV approaches an obstacle, another UAV, or the environment boundary more closely than the predefined safety radius. An encirclement step is counted when all UAVs_e enter the UAV_t's capture range without collision and form a surrounding pattern, as illustrated in Figure 1b. Within the maximum number of steps allowed per episode, if the UAVs_e swarm completes the required number of encirclement steps, the pursuit is regarded as successful. During training, several issues may arise, including constraints imposed by the environment boundaries, collisions with obstacles, collisions among UAVs, and becoming stuck in local optima. These issues are partially illustrated in Figure 5.
To more intuitively demonstrate the learning effect on the pursuit task, we plotted successful pursuit scenarios at different training stages. The results are generated autonomously by the trained neural network policies, and GIF animations are provided in the Supplementary Materials. To ensure clarity, the illustrations for Level 0 and Level 1 are displayed up to the final step of task completion, as shown in Figure 6. In Level 0, the UAVs_e enter the capture range and circle around the UAV_t; in Level 1, the UAVs can avoid obstacles while approaching the UAV_t.
The illustrations for Level 2 and Level 3 show successful pursuits in which more than 10 capture steps were accumulated, after which the episode terminates early. The Level 2 results are shown in Figure 7 and the Level 3 results in Figure 8. In Level 2, the fixed-wing UAVs turn in time to continue the pursuit when the multirotor target reverses direction. In Level 3, even when the initial positions span the entire scene, the UAVs can still successfully complete the pursuit task.
The evolution of the reward during training is shown in Figure 9. In Level 0, the UAVs_e first learn to fly within the environment boundaries and successfully encircle a stationary UAV_t. After entering Level 1, obstacles are introduced into the environment, resulting in negative rewards for collisions, and the UAVs gradually learn obstacle avoidance behavior. In Level 2, the UAV_t is capable of motion, making early-stage pursuit difficult and keeping rewards relatively low; through learning, the UAVs_e gradually master target identification and tracking strategies. Level 3 further enlarges the initial generation range and reduces the capture area, significantly increasing task difficulty. The final reward stabilizes around 125 points. As the task level increases, the time required for pursuit increases, and the cumulative rewards decrease due to time penalties.
In the experimental evaluation, 1000 episodes were conducted for each level, and capture success rates were computed for every 100 episodes. To ensure robustness, error bars are included. The results are shown in Figure 10, illustrating the differences in performance across levels and capture step requirements. Overall, the success rate remains above 70%, with failures mainly due to collisions with obstacles, teammates, or the UAV_t; the remaining failures are mostly due to not completing the pursuit within the step limit. In obstacle-free, stationary-target scenarios, the success rate exceeds 95%. As the capture step requirement increases, success rates decline slightly. In the most complex Level 3 scenario (global random initial positions, random initial headings, a maximum of 200 steps, and a capture range within 100 m), the single-step encirclement success rate exceeds 80%, and the 10-step success rate remains around 70%, demonstrating the method's effectiveness and robustness in complex environments.
5. Conclusions
This paper addresses the challenges of fixed-wing UAVs performing encirclement and obstacle avoidance tasks in complex environments, and proposes a curriculum-based composite deep reinforcement learning method. A multi-agent fixed-wing UAV collaborative task environment is constructed, integrating pursuit and obstacle avoidance, while fully accounting for UAV dynamic constraints, multirotor target UAV maneuverability, and local perception in obstacle-rich scenarios, thereby validating the effectiveness of the proposed approach.
Experimental results demonstrate that the composite MADDPG_CL framework effectively mitigates issues such as local optima and slow convergence during training, substantially enhancing policy stability and generalization. The multirotor target UAVs are also controlled via deep reinforcement learning, exhibiting strong maneuvering and adversarial capabilities that often induce collisions for the encircling UAVs, further increasing task complexity. In obstacle-free environments, fixed-wing encircling UAVs complete encirclement tasks efficiently, achieving success rates exceeding 95%. Even in complex environments with moving targets and random initial conditions, success rates remain around 70%. The proposed method also demonstrates robust performance in collision avoidance and extended capture duration.
Overall, this study provides a novel solution for the intelligent collaborative control of fixed-wing UAV swarms in high-risk, complex environments. Although the UAVs_e used in the current training are homogeneous, the training framework itself supports heterogeneity in both parameters and dynamics models. Moreover, the training results with the UAV_t demonstrate that the same neural network architecture can adapt to cooperative tasks involving different types of UAVs. Future work will focus on more realistic scenarios, incorporating communication constraints, sensor noise, and dynamic obstacles, and will explore transfer learning and imitation learning to further enhance the adaptability and robustness of the framework in real-world systems. Validation experiments will be conducted on the PX4-SITL platform and the Gazebo 6-DOF simulation environment to achieve seamless integration and testing of the algorithms within a high-fidelity flight control system.