Multi-UAV Cooperative Path Planning Method Based on an Improved MADDPG Algorithm

Zhang, Feiqiao; Wang, Qian; Ma, Xin

doi:10.3390/electronics15081632

Open AccessArticle

Multi-UAV Cooperative Path Planning Method Based on an Improved MADDPG Algorithm

by

Feiqiao Zhang

¹,

Qian Wang

² and

Xin Ma

^2,*

¹

College of Economics and Management, Civil Aviation Flight University of China, Deyang 618311, China

²

College of Air Traffic Management, Civil Aviation Flight University of China, Deyang 618311, China

^*

Author to whom correspondence should be addressed.

Electronics 2026, 15(8), 1632; https://doi.org/10.3390/electronics15081632

Submission received: 13 March 2026 / Revised: 2 April 2026 / Accepted: 11 April 2026 / Published: 14 April 2026

Download

Browse Figures

Versions Notes

Abstract

To address cooperative path planning for multiple UAVs in complex environments, this paper proposes an improved multi-agent deep deterministic policy gradient algorithm, named Prioritized Experience Multi-Agent Deep Deterministic Policy Gradient (PE-MADDPG). An urban low-altitude inspection environment is first constructed within a reinforcement-learning framework, in which dynamic constraints, safety-separation requirements, and formation-cooperation objectives are incorporated into a partially observable Markov decision process. To improve training effectiveness, prioritized experience replay is introduced to increase the utilization of informative samples, an adaptive exploration-noise strategy is designed to regulate exploration intensity, and a multi-head attention mechanism is embedded in the Critic network to enhance the representation of inter-agent interactions. Simulation results in a three-dimensional urban inspection scenario show that PE-MADDPG outperforms the selected benchmark methods in task completion rate, formation maintenance, flight efficiency, and energy consumption. These results provide an effective solution for urban low-altitude inspection tasks.

Keywords:

multi-UAV; cooperative path planning; multi-agent reinforcement learning; prioritized experience replay; MADDPG; urban low-altitude inspection

1. Introduction

Unmanned aerial vehicles (UAVs) have demonstrated considerable potential in complex mission scenarios because of their high maneuverability, relatively low deployment cost, and ability to reduce human exposure to hazardous environments [1]. However, as mission scale and environmental complexity increase, a single UAV inevitably faces limitations in sensing coverage, task parallelism, and fault tolerance. In contrast, multi-UAV systems can substantially enhance operational robustness and mission efficiency through distributed perception and cooperative decision-making. Accordingly, autonomous coordination and swarm intelligence have become important research directions in this area [2].

Compared with single-UAV planning, cooperative path planning for multi-UAV systems must satisfy not only platform-level requirements, including dynamic feasibility, obstacle avoidance, and mission effectiveness, but also system-level coupling constraints such as formation maintenance, collision separation, and communication topology constraints [3,4,5]. Moreover, with the development of low-altitude traffic management [6] and the growing demand for safe operations, both feasibility and operational safety within constrained airspace must be considered. Therefore, achieving efficient, stable, and scalable cooperative planning and control under local information constraints remains a critical challenge [7].

1.1. Related Work

Existing studies on multi-UAV cooperative path planning can be broadly categorized into classical model-based approaches and learning-based approaches. Traditional approaches to path planning and cooperative collision avoidance mainly include classical graph-search methods and heuristic optimization algorithms. Dijkstra and A* provide clear optimality or near-optimality guarantees on discrete graphs. However, when these methods are applied to three-dimensional continuous airspace with multiple constraints, they often suffer from high-dimensional state spaces and high computational burden. Although D* can update paths when obstacles or no-fly zones change dynamically, it still relies on repeated replanning, which incurs substantial computational cost and makes real-time control difficult to achieve [8,9,10,11]. Another important category consists of heuristic optimization methods, such as genetic algorithms (GAs), particle swarm optimization (PSO), and ant colony optimization (ACO) [12,13,14]. By mimicking natural evolution or swarm intelligence, these methods can alleviate the curse of dimensionality to some extent and produce approximately optimal solutions. However, they generally require many iterations, exhibit limited adaptability to dynamic environments and coupled constraints, and are prone to slow convergence or entrapment in local optima [15]. Representative model-based studies have shown that graph-based geometric planning can be combined with clothoid smoothing and mixed-integer speed optimization to generate collision-free and dynamically feasible formation trajectories [16]. Similarly, for reconnaissance-oriented missions in dynamic interference environments, non-convex path planning with initial path guidance and rapid replanning has been explored to improve timeliness and safety, but such methods still depend heavily on explicit modeling and repeated optimization under changing conditions [17].

With the rapid development of artificial intelligence, deep reinforcement learning (DRL) has provided a new paradigm for learning end-to-end decision policies under incomplete information through iterative trial-and-feedback interactions. The deep Q-network (DQN) proposed in [18] achieved remarkable results on Atari tasks, demonstrating the effectiveness of deep neural networks for high-dimensional state representation and decision-making. The deep deterministic policy gradient (DDPG) algorithm proposed in [19] has exhibited strong capability in continuous-control problems and has since been validated in robotic motion control [20] and autonomous driving [21]. For multi-agent cooperation, MADDPG was proposed in [22] under the centralized training and decentralized execution (CTDE) framework, where a centralized Critic is used to alleviate the non-stationarity caused by multiple learning agents, thereby providing a feasible learning paradigm for cooperative control. In related studies, MAPPO was proposed in [23], to improve training stability through policy clipping, whereas SAC was introduced in [24] to enhance exploration efficiency and reduce premature convergence through entropy regularization. Beyond these general MARL advances, collision-aware multi-agent pursuit with dynamic target allocation has been shown to depend strongly on both target-assignment strategy and policy balance across agents [25]. In application-specific multi-UAV reconnaissance, DRL-based swarm trajectory planning has also been used together with grid partitioning and risk-aware patrol design to address practical limitations such as energy constraints and coordination requirements [26]. More recently, it has been shown that integrating multi-head attention with dynamic priority experience replay can substantially improve feature extraction, sample utilization, and convergence efficiency in complex UAV path-planning environments [27]. Beyond reinforcement-learning-based planning, alternative learning-based control paradigms have also shown value for autonomous path following under environmental uncertainty. For example, iterative learning control and related uncertainty-aware path-following methods have been explored in autonomous-system studies to improve disturbance adaptation and obstacle-avoidance performance, while event-triggered control has been used to reduce unnecessary communication and computation and to support the transition from algorithmic design to experimental validation [28,29]. They provide complementary insights into uncertainty adaptation, communication efficiency, and real-world implementation. Beyond the MARL architectures discussed above, recent years have also seen the emergence of more complex approaches, including Transformer-based multi-agent reinforcement learning and constrained or safe MARL methods. These architectures often achieve strong performance on large-scale or safety-critical benchmarks, but typically at the cost of substantially increased model complexity, higher computational overhead, and more demanding hyperparameter tuning. In contrast, the proposed PE-MADDPG framework is designed to maintain the relative simplicity of the original MADDPG while incorporating three targeted enhancements—prioritized experience replay, adaptive exploration noise, and an attention-enhanced Critic. As a result, the value of PE-MADDPG lies in offering a more computationally efficient and easier-to-deploy solution for practical urban low-altitude inspection tasks where moderate swarm sizes and real-time responsiveness are critical.

1.2. Motivations and Contributions

Despite these advances, existing DRL-based methods for multi-UAV cooperative planning in dense three-dimensional environments with strongly coupled constraints still face several challenges, particularly in terms of sample efficiency, exploration stability, and inter-agent interaction modeling. These limitations make it difficult to achieve stable and efficient coordinated policy learning when dynamic feasibility, obstacle avoidance, safety separation, and formation cooperation must be handled simultaneously.

To address these limitations, this study develops a task-oriented PE-MADDPG framework for multi-UAV cooperative path planning in urban low-altitude inspection scenarios. The novelty of this work lies in jointly integrating prioritized experience replay, adaptive exploration-noise adjustment, and an attention-enhanced centralized Critic within a unified CTDE framework, while explicitly embedding dynamic feasibility, obstacle avoidance, safety separation, and formation cooperation into the same POMDP formulation. Comparative experiments and ablation studies are further conducted to evaluate the effectiveness of this integrated design in terms of mission completion rate, formation maintenance, flight efficiency, and energy proxy.

Based on the above motivation and methodological design, the main contributions of this paper are summarized as follows:

A learning-based cooperative path-planning method for urban low-altitude multi-UAV inspection scenarios is developed under the MADDPG-based CTDE framework, where dynamic constraints, obstacle-avoidance safety, and formation cooperation are jointly incorporated into the POMDP formulation and reward design.
A prioritized experience replay mechanism is introduced, in which sample priorities are constructed by aggregating multi-agent TD errors, thereby improving the utilization of critical samples and enhancing training convergence efficiency.
An adaptive exploration-noise mechanism is proposed to automatically regulate exploration intensity without changing the deterministic policy-gradient structure, which helps alleviate sparse-reward and local-optimum problems while improving training stability.
A centralized Critic network enhanced by a multi-head attention mechanism is designed to explicitly model inter-agent interaction dependencies and to improve the accuracy of value estimation for cooperative decision-making.
Extensive comparative experiments, scalability analysis, and ablation studies are conducted in a three-dimensional urban inspection simulation environment. The effectiveness and robustness of the proposed method are validated using metrics including mission completion rate, formation maintenance rate, flight efficiency, and energy proxy.

2. Problem Description and Modeling

To clearly define the research scope and associated constraints, this section formulates the urban low-altitude cooperative inspection task within a mathematical framework. Specifically, the inspection scenario is introduced first, followed by the formalization of UAV dynamics and the decision-making process, and finally the construction of a performance evaluation framework.

2.1. Urban Low-Altitude Cooperative Inspection Scenario

Consider a three-dimensional urban low-altitude airspace, whose feasible flight domain is represented as a bounded set. The environment contains a set of static obstacles and a set of dynamic obstacles. Static obstacles are described by axis-aligned cuboids corresponding to urban buildings; dynamic obstacles are described by cylinders to represent temporarily controlled airspace or dynamic threats from non-cooperative aircraft. In order to unify collision-avoidance constraints and reduce computational complexity, each obstacle is approximated by an equivalent bounding-sphere radius with an additional safety margin. This conservative approximation trades off geometric precision for computational efficiency, which is acceptable for task-level planning validation. Accordingly, the

k

-th obstacle is characterized by its center position

o_{k}

and equivalent radius

r_{k}

, together with an additional safety margin.

Let the position of the

i

-th UAV be denoted by

p_{i}

. Then, the minimum safe-separation requirement between a UAV and an obstacle can be expressed as follows:

‖ p_{i} - p_{o b s, k} ‖_{2} \geq R_{o b s, k} + d_{s a f e}, \forall k \in O_{s t a t i c} \cup O_{d y n a m i c} .

(1)

The multi-UAV system is composed of

N

UAVs, including one leader UAV and

N - 1

followers. The leader index is indexed by 1 and the follower set is

I_{f} = {2,3, \dots, N}

. The mission requires the leader UAV to fly from the start point to the goal point while satisfying all operational constraints. Meanwhile, each follower UAV is required to maintain a predefined formation topology and relative pose with respect to the leader, while simultaneously satisfying obstacle avoidance and safety separation requirements. Through this coordinated process, the inspection mission is completed collaboratively. In the present study, this cooperative formation mechanism is evaluated under nominal actuator and communication conditions. A schematic illustration of the considered scenario is presented in Figure 1.

2.2. UAV Dynamics Model and Constraints

To achieve a balance between model fidelity and real-time computational efficiency, a three-dimensional point-mass kinematic model is employed, in which both yaw and pitch motions are explicitly taken into account. The position of the

i

-th UAV at time

t

is defined as

p_{i} (t) = {[x_{i} (t), y_{i} (t), z_{i} (t)]}^{T},

(2)

where

x_{i} (t)

,

y_{i} (t)

, and

z_{i} (t)

are coordinates along the

x

,

y

and

z

axes, respectively; the coordinate origin is denoted by

O

. Let the speed magnitude be

v_{i} (t)

, the yaw angle be

ψ_{i} (t)

, and the pitch angle be

θ_{i} (t)

. Then the unit direction vector is

n_{i} (t) = {[\cos ψ_{i} (t) \cos θ_{i} (t), \sin ψ_{i} (t) \cos θ_{i} (t), \sin θ_{i} (t)]}^{T},

(3)

and the velocity vector is

{\dot{p}}_{i} (t) = v_{i} (t) n_{i} (t),

(4)

The longitudinal acceleration

a_{i, v} (t)

, yaw rate

ω_{i, ψ} (t)

, and pitch rate

ω_{i, θ} (t)

are used as control inputs; thus the action vector of the

i

-th UAV is

a_{i} (t) = {[a_{i, v} (t), ω_{i, ψ} (t), ω_{i, θ} (t)]}^{T},

(5)

Under a discrete timestep

Δ t

, Euler integration is used for state updates:

\{\begin{array}{l} v_{i} (t + 1) = clip (v_{i} (t) + a_{i, v} (t) Δ t, v_{m i n}, v_{m a x}) \\ ψ_{i} (t + 1) = ψ_{i} (t) + ω_{i, ψ} (t) Δ t \\ θ_{i} (t + 1) = clip (θ_{i} (t) + ω_{i, θ} (t) Δ t, θ_{m i n}, θ_{m a x}) \\ p_{i} (t + 1) = p_{i} (t) + v_{i} (t + 1) \cdot n_{i} (t + 1) \cdot Δ t \end{array} .

(6)

where

clip (\cdot)

denotes the clipping function.

Euler integration is adopted here as a computationally efficient discrete-time approximation for task-level cooperative path-planning simulation, so as to balance model simplicity and simulation efficiency. Accordingly, the present training and execution framework is implemented on a fixed-time-step basis to maintain synchronized multi-agent interaction and a consistent simulation environment.

During flight, UAVs must also satisfy the following safety constraint:

\{\begin{array}{l} ‖ p_{i} (t) - p_{j} (t) ‖_{2} \geq d_{i n t e r}, \forall i \neq j \\ p_{i} (t) \in X, \forall i \in {1, 2, \dots, N} \end{array} .

(7)

where

d_{i n t e r}

is the minimum safe separation distance between UAVs.

2.3. POMDP Decision Model

Based on the above scenario and dynamics, the multi-UAV cooperative path planning problem is formulated as a partially observable Markov decision process (POMDP), thereby providing a standardized decision-making framework for algorithm design. The model is represented as a seven-tuple

(S, O, A, P, O, R, γ)

, defined as follows.

2.3.1. State Space

The global state

s \in S

is used for centralized value evaluation in the CTDE training stage and is formed by concatenating all individual UAV states. The individual state of UAV

i

is

s_{i} = {[{flag}_{i}, x_{i}, y_{i}, z_{i}, v_{i}, ψ_{i}, θ_{i}]}^{T},

(8)

where

{flag}_{i}

is a UAV type indicator (0 for leader and 1 for follower). The global state is

s = {[s_{1}, s_{2}, \dots, s_{N}]}^{T} .

(9)

2.3.2. Observation Space

During execution, each UAV only obtains local observations. In order to maintain a consistent observation size for neural-network training and inference, obstacle information is truncated to a fixed dimension. Let the observation of UAV

i

be

o_{i}

, including its own dynamic state, goal guidance information, obstacle information, and relative-position information of teammates. For the leader UAV:

o_{1} = {[s_{1}, p_{g o a l} - p_{1}, O_{o b s, 1}, {p_{j} - p_{1}}_{j \in F}]}^{T},

(10)

For a follower UAV:

o_{i} = {[s_{i}, p_{1} - p_{i}, O_{o b s, i}, {p_{j} - p_{i}}_{j \neq i}]}^{T}, i \in F .

(11)

where

O_{obs, i}

denotes obstacle information within the sensing range of UAV

i

.

2.3.3. Action Space

A continuous action space is adopted for each UAV. The action dimension for a single UAV is 3. The joint action space for the multi-UAV system is the Cartesian product of all individual action spaces:

A = A_{1} \times A_{2} \times \dots \times A_{N} .

(12)

The joint action is

a = {[a_{1}, a_{2}, \dots, a_{N}]}^{T} .

(13)

2.3.4. Transition Probability

A stochastic difference form of the system state update is

s (t + 1) = f (s (t), a (t)) + ξ (t),

(14)

where

ξ (t)

is zero-mean Gaussian noise, representing environmental and dynamic uncertainty. The corresponding transition probability is

P (s^{'} | s, a) = ℙ (s (t + 1) = s^{'} ∣ s (t) = s, a (t) = a) .

(15)

2.3.5. Observation Probability

A UAV’s local observation is a stochastic mapping of its state and environmental information:

o_{i} (t) = g (s (t)) + ζ_{i} (t),

(16)

where

ζ_{i} (t)

is observation noise representing sensor measurement errors. The corresponding observation probability is

O (o_{i} | s) = ℙ (o_{i} (t) = o_{i} ∣ s (t) = s) .

(17)

2.3.6. Multi-Objective Reward Functions

To guide UAVs to learn cooperative path planning efficiently under constraints of obstacle avoidance, safety separation, and cooperative formation, a comprehensive multi-objective reward function is designed, with differentiated optimization for the leader and followers. The leader reward is

r_{1} = w_{1} r_{b o u n d} + w_{2} r_{o b s} + w_{3} r_{g o a l} + w_{4} r_{f o r m}^{(1)} + w_{5} r_{v e l}^{(1)} .

(18)

where

w_{1}, w_{2}, w_{3}, w_{4}, w_{5}

are weights satisfying

\sum_{k = 1}^{5} w_{k} = 1

, and

w_{k} > 0

. The reward weights were selected according to two practical principles: maintaining comparable numerical scales among the sub-reward terms and reflecting the operational priorities of the present urban inspection scenario. In particular, safety-related objectives were given relatively higher importance than purely efficiency-related terms, while target-reaching and coordination objectives were balanced to support both mission completion and cooperative stability during training. The sub-rewards are defined as follows:

Boundary reward: This penalizes flying outside feasible airspace

r_{b o u n d} = \{\begin{array}{l} - R_{b o u n d}, & p_{1} \notin X \\ 0, & p_{1} \in X d (p_{1}, \partial X) \geq d_{t h r e s h} \\ - k_{b o u n d} \cdot d (p_{1}, \partial X), & otherwise \end{array},

(19)

where

\partial X

is the feasible airspace boundary,

d_{t h r e s h}

is the boundary buffer threshold,

R_{b o u n d}

is the out-of-bounds penalty value,

k_{bound}

is the buffer penalty coefficient.

2.: Obstacle avoidance reward: used to guide the UAV to maintain a safe distance from obstacles

r_{o b s} = \{\begin{array}{l} - R_{o b s}, & d_{o b s, 1} \leq d_{s a f e} \\ k_{o b s} \cdot (d_{o b s, 1} - d_{s a f e}), & d_{s a f e} < d_{o b s, 1} \leq d_{s a f e} + d_{b u f f e r} \\ 0, & d_{o b s, 1} > d_{s a f e} + d_{b u f f e r} \end{array},

(20)

where

d_{b u f f e r}

is the obstacle avoidance buffer scale,

R_{o b s}

is the collision penalty value,

k_{obs}

is the buffer reward coefficient, and

d_{obs, 1}

is the minimum clear gap between the UAV and the obstacle.

3.: Inspection target reward: used to guide the leader UAV to fly to the target point

r_{g o a l} = \{\begin{array}{l} R_{g o a l}, & ‖ p_{1} - p_{g o a l} ‖_{2} \leq d_{a r r i v e} \\ k_{g o a l} \cdot (d_{l a s t} - d_{c u r r e n t}), & d_{a r r i v e} < ‖ p_{1} - p_{g o a l} ‖_{2} \leq d_{m a x} \\ - R_{g o a l}, & ‖ p_{1} - p_{g o a l} ‖_{2} > d_{m a x} \end{array},

(21)

where

d_{arrive}

is the arrival judgment threshold,

d_{\max}

is the maximum effective distance,

r_{arrive}

is the target arrival reward,

d_{l a s t}

and

d_{c u r r e n t}

are the distances from the UAV to the target point at the previous and current moments respectively, and

k_{goal}

is the distance reward coefficient.

4.: Formation maintenance reward: used to guide follower UAVs to maintain the preset formation distance with the leader UAV. For any follower UAV $i \in F$ , define its distance from the leader UAV as $d_{1 i} = ‖ p_{1} - p_{i} ‖_{2}$ , its formation maintenance sub-reward is

r_{f o r m}^{(i)} = \{\begin{array}{l} R_{f o r m}, & | d_{1 i} - d_{f o r m}^{*} | \leq ε_{f o r m} \\ - k_{f o r m} \cdot | d_{1 i} - d_{f o r m}^{*} |, & else \end{array},

(22)

To make the reward of the leader UAV consistent with the overall formation, its formation maintenance reward is defined as the mean value of the formation rewards of all follower UAVs:

r_{f o r m}^{(1)} = \frac{1}{| F |} \sum_{i \in F} r_{f o r m}^{(i)} .

(23)

5.: Velocity cooperation reward: used to guide multi-UAV to maintain the consistency of speed and heading. For the leader UAV, define the average speed and average heading angle of the follower UAVs as ${\bar{v}}_{f} = \frac{1}{| F |} \sum_{i \in F} v_{i}$ and ${\bar{ψ}}_{f} = \frac{1}{| F |} \sum_{i \in F} ψ_{i}$ respectively, then the velocity cooperation sub-reward of the leader UAV is

r_{v e l}^{(1)} = k_{v e l} \cdot (\exp (- | v_{1} - {\bar{v}}_{f} |) + w_{ψ} \cdot \exp (- | ψ_{1} - {\bar{ψ}}_{f} |)),

(24)

For any follower UAV

i \in F

, its velocity cooperation sub-reward is defined as the consistency constraint with the leader UAV:

r_{v e l}^{(i)} = k_{v e l} \cdot (\exp (- | v_{i} - v_{1} |) + w_{ψ} \cdot \exp (- | ψ_{i} - ψ_{1} |)),

(25)

where

k_{v e l}

is the velocity cooperation reward coefficient, and

w_{ψ}

is the heading weight.

The reward function of the follower UAV is

r_{i} = w_{1} r_{b o u n d} + w_{2} r_{o b s} + w_{4} r_{f o r m}^{(i)} + w_{5} r_{v e l}^{(i)}, i \in F .

(26)

where the definition of each sub-reward is consistent with that of the leader UAV.

2.4. Performance Evaluation Metrics

To comprehensively evaluate the proposed algorithm in terms of task completion, cooperative stability, efficiency, and energy consumption, the following performance metrics are defined.

(1): Mission Completion Rate (MCR):

MCR = \frac{N_{success}}{N_{all}} \times 100 %

(27)

where

N_{success}

is the number of episodes that successfully complete the inspection task, and

N_{all}

is the total number of episodes.

(2): Obstacle Avoidance Rate (OAR):

OAR = \frac{N_{avoid}}{N_{encounter}} \times 100 %

(28)

where

N_{encounter}

is the number of obstacle encounter events recorded in the statistical window, and

N_{avoid}

is the corresponding number of successful avoidance times.

(3): Flight Time (FT):

FT = T_{end} \cdot Δ t

(29)

where

T_{end}

is the discrete time step index at the end of the task.

(4): Flight Distance (FD):

FD = \sum_{t = 0}^{t_{e n d} - 1} ‖ p_{1} (t + 1) - p_{1} (t) ‖_{2}

(30)

(5): Energy Proxy (EP):

{EP}_{i} = \sum_{t = 0}^{t_{e n d} - 1} (λ_{v} v_{i} {(t)}^{2} + λ_{a} ‖ a_{i} (t) ‖_{2}^{2}) Δ t

(31)

where

λ_{v}

,

λ_{a}

denote the weighting coefficients. To facilitate cross-algorithm comparison, this study employs the cluster-average energy consumption proxy indicator, denoted by

EP = \frac{1}{N} \sum_{i = 1}^{N} {EP}_{i}

. This proxy captures the primary sources of mechanical energy consumption in a point-mass model, i.e., changes in kinetic energy and work against drag. It serves as a consistent and computationally efficient metric for comparing the relative energy efficiency of different path-planning policies.

(6): Formation Keeping Rate (FKR):

FKR = \frac{1}{N_{t o t a l} \cdot t_{e n d}} \sum_{k = 1}^{N_{t o t a l}} \sum_{t = 0}^{t_{e n d} - 1} \prod_{i \in F} I (| d_{1 i} (t) - d_{f o r m}^{*} | \leq ε_{f o r m})

(32)

where

I (\cdot)

is the indicator function, and

ε_{f o r m}

is the formation error tolerance threshold.

3. Cooperative Path Planning Method Based on PE-MADDPG

Building on the POMDP formulation presented in Section 2, this section focuses on the algorithmic realization of PE-MADDPG, including the CTDE framework of MADDPG, the prioritized experience replay mechanism, the adaptive exploration-noise mechanism, the attention-enhanced Critic design, and the complete training procedure.

3.1. MADDPG Basic Framework

The MADDPG algorithm adopts the core framework of Centralized Training with Decentralized Execution (CTDE). In the training phase, the Critic network of each agent can access the global state and joint action information of all agents; while the Actor network only uses the local observation information of the agent itself. In the execution phase, each agent independently executes the actions output by its own Actor network without relying on global information.

Let the policy of the

i

-th agent be

π_{ϕ_{i}} (a_{i} | o_{i})

, where

ϕ_{i}

is the parameter of the Actor network. In the execution phase, each agent outputs deterministic actions according to local observations to obtain a smooth flight trajectory. In the training phase, exploration noise is superimposed on the deterministic action to improve exploration efficiency, as detailed in Section 3.3.

The centralized Critic network is used to evaluate the value of the joint action in the global state, and its

i

-th component is denoted as

Q_{f_{i}} (s, a)

, where

f_{i}

is the parameter of the Critic network. The basic optimization objective of the Actor network is to maximize the expected cumulative reward:

J (θ_{i}) = E_{s ~ ρ_{π}, a ~ π_{θ}} [\sum_{t = 0}^{\infty} γ^{t} r_{i} (t)],

(33)

where

ρ_{π}

is the state distribution induced by the policy

π

. The corresponding policy gradient is

\nabla_{θ_{i}} J (θ_{i}) = E_{s, o, a} [\nabla_{θ_{i}} π_{θ_{i}} (a_{i} | o_{i}) \cdot \nabla_{a_{i}} Q_{f_{i}} (s, a)] .

(34)

However, the standard MADDPG mainly faces three core problems in the multi-UAV cooperative path planning task: insufficient utilization of critical samples caused by uniform experience replay, difficulty in adaptive adjustment of fixed exploration intensity, and insufficient modeling ability of the Critic network for multi-agent interaction information. To this end, this paper introduces the following three improvements on the CTDE framework to construct the PE-MADDPG algorithm.

3.2. Prioritized Experience Replay Mechanism

To alleviate the low utilization of informative samples and the slow convergence caused by uniform experience replay in standard MADDPG, a multi-agent prioritized experience replay mechanism is introduced. Specifically, priorities are assigned to joint transitions in the replay buffer, samples are drawn with non-uniform probabilities, and importance sampling weights are used to correct the bias introduced by prioritized sampling. For a joint transition

e_{t} = (s_{t}, o_{t}, a_{t}, r_{t}, s_{t + 1}, {done}_{t})

, the TD target and TD error of the

i

-th agent are

y_{i, t} = r_{i, t} + γ (1 - {done}_{t}) Q_{ϕ_{i}^{-}} (s_{t + 1}, {a^{'}}_{t}),

(35)

δ_{i, t} = y_{i, t} - Q_{ϕ_{i}} (s_{t}, a_{t}),

(36)

where

{a^{'}}_{t}

denotes the target joint action generated by the target Actor network,

θ_{i}^{-}

and

ϕ_{i}^{-}

denote the parameters of the target Actor and target Critic networks of agent respectively.

Under the CTDE setting, the replay unit corresponds to a joint transition of the entire multi-UAV system rather than the local experience of a single agent. Therefore, the mean absolute TD-error across agents is adopted as the sample-level error signal to measure the overall cooperative deviation of the joint transition. In the considered task, coordinated team behavior is more directly related to mission completion than the performance fluctuation of any isolated UAV. From this perspective, mean aggregation serves as a team-level prioritization scheme:

δ_{t} = \frac{1}{N} \sum_{i = 1}^{N} | δ_{i, t} |,

(37)

However, this formulation may smooth rare but safety-critical anomalies dominated by only one or a few agents, such as local collision risk or abrupt formation break. Compared with max-based aggregation, mean aggregation is less sensitive to isolated extreme errors from a single agent and is therefore more stable for the present cooperative task, although it may partially average out agent-specific variance differences in TD-error estimation.

The sample priority is defined as follows:

p_{t} = {(δ_{t} + ϵ)}^{α_{p e r}},

(38)

where

ϵ

is a small constant to avoid zero priority, and

α

controls the concentration degree of priority. The sampling probability is

P (t) = \frac{p_{t}}{\sum_{k = 1}^{K} p_{k}} .

(39)

where

K

is the total number of transitions stored in the replay buffer.

To correct the deviation introduced by non-uniform sampling, the importance sampling weight is introduced:

w_{t} = {(\frac{1}{K \cdot P (t)})}^{β},

(40)

where

β

is the deviation correction coefficient, and

w_{t}

is normalized by

\max_{k} w_{k}

. Accordingly, the weighted mean-squared TD loss of the Critic network is given by

L (f_{i}) = \frac{1}{M} \sum_{t = 1}^{M} w_{t} \cdot δ_{i, t}^{2} .

(41)

where

M

is the batch sample size.

3.3. Adaptive Exploration Noise Mechanism

To solve the problems of insufficient exploration and easy falling into local optimum of the algorithm in complex urban scenarios, we adopt an adaptive exploration noise mechanism in the training phase, which realizes automatic adjustment of exploration intensity without changing the deterministic policy gradient framework of MADDPG, so as to balance exploration and exploitation and improve training stability.

In the training phase, the action of the

i

-th agent is obtained by superimposing the output of the deterministic policy and zero-mean Gaussian noise:

a_{i, t} = clip (μ_{θ_{i}} (o_{i, t}) + ε_{t}, a_{\min}, a_{\max}), ε_{t} ~ N (0, σ_{e}^{2} I)

(42)

where

σ_{e}

is the exploration noise intensity corresponding to the

e

-th training episode, and

clip (\cdot)

is used to crop the action to the legal range. In the test phase, to obtain a stable and reproducible path planning strategy, the deterministic action is directly used.

To realize the adaptive adjustment of exploration intensity, this paper updates the noise intensity online according to whether the episode successfully completes the task. Let the success indicator variable of the

e

-th episode be

I_{s u c c} (e)

, then the noise intensity is updated as follows:

σ_{e + 1} = \{\begin{array}{l} clip ((1 - η_{σ}) σ_{e}, σ_{\min}, σ_{\max}), & I_{s u c c} (e) = 1 \\ clip ((1 + η_{σ}) σ_{e}, σ_{\min}, σ_{\max}), & I_{s u c c} (e) = 0 \end{array}

(43)

where

η_{σ}

is the noise adjustment step size,

σ_{\min}

and

σ_{\max}

are the lower and upper bounds of the noise intensity respectively. The imposed lower bound on the noise scale prevents exploration from vanishing completely, even when consecutive successful episodes are observed, while the upper bound avoids excessively aggressive perturbations during unsuccessful phases. Since this update is driven by a binary episode-level outcome, it may be less sensitive to gradual performance improvement within an episode and may respond more strongly to occasional failures in later training.

3.4. Attention-Enhanced Critic Network

Traditional centralized Critics based on direct state-action concatenation often have limited ability to represent dynamic inter-agent dependencies. To address this limitation, a multi-head attention-enhanced Critic is designed to adaptively capture interaction relevance among agents.

Inter-agent spatial relations in the present setting are mainly reflected by relative distance, formation deviation, local collision relevance, and heading–velocity consistency. These factors are already contained in the joint state–action representation through teammate relative-position information, obstacle-related features, and cooperative constraints. The attention mechanism therefore emphasizes agents that are more relevant to the current decision, improving the Critic’s representation of spatial interaction dependencies. It selectively reweights existing interaction cues for value estimation.

Firstly, feature embedding is performed on the input of each agent to obtain a feature vector with uniform dimension:

h_{i} = ReLU (W_{e} \cdot [s_{i}, a_{i}] + b_{e}),

(44)

where

W_{e}

,

b_{e}

are the parameters of the embedding layer,

ReLU

is the activation function, and

h_{i}

is the embedding feature of the

i

-th agent.

Then, the multi-head dot-product attention mechanism is used to model the interaction of the embedded features of all agents. For the

h

-th attention head, define the query, key, and value projection matrices as

W_{Q}^{h}

,

W_{K}^{h}

,

W_{V}^{h}

respectively, then the corresponding projection features are

Q^{m} = H W_{Q}^{m}, K^{m} = H W_{K}^{m}, V^{m} = H W_{V}^{m},

(45)

where

H = {[h_{1}, h_{2}, \dots, h_{N}]}^{T} \in ℝ^{N \times d_{e m b}}

is the embedding feature matrix of all agents, and the dimension of the key vector is consistent with the dimension of the attention head.

The attention weight and context vector of the

h

-th attention head are

A t t^{h} = Softmax (\frac{Q^{h} {(K^{h})}^{T}}{\sqrt{d_{k}}}),

(46)

C^{h} = A t t^{h} \cdot V^{h} .

(47)

The outputs of all attention heads are concatenated and fused through a linear layer to obtain the final global context feature:

C = Concat (C^{1}, C^{2}, \dots, C^{H}) \cdot W_{o} .

(48)

where

W_{o}

is the output projection matrix, and

H

is the number of attention heads. Finally, the value component corresponding to the current agent is obtained by inputting the global context feature and the individual embedding feature into the subsequent fully connected value network:

Q_{i} (s, a) = f_{f c} ([C, h_{i}]) .

(49)

where

f_{f c} (\cdot)

is the fully connected value network.

Figure 2 provides a structural illustration of the attention-enhanced centralized Critic. As shown in the figure, the interaction cues relevant to cooperative path planning, such as relative position and distance, local collision relevance, formation deviation, and heading–velocity consistency, are not introduced as additional raw inputs beyond the CTDE setting. Instead, they are already contained in the joint state-action representation and are selectively reweighted through agent-wise embedding, query-key-value projection, multi-head attention, and output projection. The resulting fused context feature is then used to support interaction-aware state-action value estimation for the current agent. The Critic can thus place greater emphasis on agents that are more relevant to the current decision context, thereby improving the representation of inter-agent dependencies while preserving decentralized execution of the Actor networks.

3.5. PE-MADDPG Algorithm Flow

Based on the components described in Section 3.1, Section 3.2, Section 3.3 and Section 3.4, the complete training procedure of PE-MADDPG can be summarized as an integrated multi-agent actor–critic learning process. This subsection focuses on the integrated training and update flow, while the detailed definitions of the POMDP variables and reward functions follow the problem formulation in Section 2. During environment interaction, each UAV agent selects its action according to its own local observation, which preserves decentralized executability in practical deployment. Meanwhile, the centralized Critic is trained with global state information and joint actions of all agents to alleviate the non-stationarity caused by simultaneous policy updates in the multi-agent setting.

Algorithm 1 presents the complete training procedure of PE-MADDPG. At each interaction step, each agent generates its action through the corresponding Actor network, while adaptive Gaussian exploration noise is added during training to improve exploration efficiency. The resulting joint transition, including the global state, local observations, joint actions, reward vector, next state, and next observations, is then stored in the prioritized replay buffer. Once the number of collected samples exceeds the mini-batch threshold, prioritized sampling is performed, and the sampled transitions are used to update the Critic and Actor networks of all agents. In this process, the target values are computed using the target networks, the TD errors of all agents are evaluated for each sampled transition, and the aggregated multi-agent TD error is further used to update the sampling priority of that transition in the replay buffer. The Critic is optimized by minimizing the importance-weighted TD loss, and the Actor is updated using the deterministic policy gradient. After each training step, the target networks are softly updated to maintain learning stability. At the end of each episode, the exploration noise intensity is adaptively adjusted according to the episode-level task success indicator.

Algorithm 1. Complete training procedure of PE-MADDPG

Input Number of agents

N

; Actor network

{π_{ϕ_{i}}}_{i = 1}^{N}

; Critic network

{Q_{θ_{i}}}_{i = 1}^{N}

; target networks

{π_{ϕ_{i}^{'}}}_{i = 1}^{N}

,

{Q_{θ_{i}^{'}}}_{i = 1}^{N}

; prioritized replay buffer

D

; initial exploration noise scale

σ_{0}

; mini-batch size

B

; discount factor

γ

; soft update coefficient

τ

Output Trained Actor networks target networks

{π_{ϕ_{i}}}_{i = 1}^{N}

1: Initialize Actor network

π_{ϕ_{i}}

, Critic network

Q_{θ_{i}}

, and corresponding target networks

π_{ϕ_{i}^{'}}

,

Q_{θ_{i}^{'}}

for each agent

i = 1, \dots, N

2: Initialize prioritized replay buffer

D

and set episode index

k = 0

3: Initialize exploration noise scale

σ_{k} = σ_{0}

4. for each episode

k = 1, 2, \dots, K

do
5. Reset the environment and obtain the initial global state

s_{0}

and local observations

o_{0} = {o_{1, 0}, \dots, o_{N, 0}}

6. for each time step

t = 0, 1 \dots, T - 1

do
7. for each agent

i = 1, \dots, N

do
8. Select action

a_{i, t} = clip (π_{ϕ_{i}} (o_{i, t}) + σ_{k} N (0, 1), a_{\min}, a_{\max})

9. end for
10. Execute the joint action

a_{t} = {a_{1, t}, \dots, a_{N, t}}

, and observe reward vector

r_{t}

, next global state

s_{t + 1}

, next local observations

o_{t + 1}

, and termination flag

d_{t}

11. Store transition

e_{t} = (s_{t}, a_{t}, r_{t}, s_{t + 1}, o_{t + 1}, d_{t})

Into

D

with initial priority

p_{t}

12. if

|D| \geq B

then
13. Sample a mini-batch

{e_{b}}_{b = 1}^{B}

from

D

according to prioritized sampling probabilities
14. Compute importance-sampling weights

w_{b}

for all sampled transitions
15. for each sampled transition

e_{b}

do
16. Generate target joint action

{a^{'}}_{b} = {{a^{'}}_{1, b}, \dots, {a^{'}}_{N, b}}

with target Actors, where

{a^{'}}_{i, b} = π_{{ϕ^{'}}_{i}} ({o^{'}}_{i, b})

17. Use the attention-enhanced Critic to estimate

Q_{θ_{i}} (s_{b}, a_{b})

and target value

y_{i, b} = r_{i, b} + γ (1 - d_{b}) Q_{θ_{i}^{'}} ({s^{'}}_{b}, {a^{'}}_{b})

for each agent

i

18. Compute per-agent TD errors

δ_{i, b} = y_{i, b} - Q_{θ_{i}} (s_{b}, a_{b})

19. Aggregate the multi-agent TD errors into a sample-level priority signal

{\bar{δ}}_{b} = \frac{1}{N} \sum_{i = 1}^{N} |δ_{i, b}|

, and update the priority of

e_{b}

in

D

20. end for
21. for each agent

i = 1, \dots, N

do
22. Update the attention-enhanced Critic by minimizing the weighted TD loss

L (θ_{i}) = \frac{1}{B} \sum_{b = 1}^{B} w_{b} {(y_{i, b} - Q_{θ_{i}} (s_{b}, a_{b}))}^{2}

23. Update the Actor using the deterministic policy gradient
24. Softly update the target networks:

ϕ_{i}^{'} \leftarrow τ ϕ_{i} + (1 - τ) ϕ_{i}^{'}, θ_{i} \leftarrow τ θ_{i} + (1 - τ) θ_{i}

25.    end for
26.   end if
27.   Set

s_{t} \leftarrow s_{t + 1}

o_{t} \leftarrow o_{t + 1}

28. if

d_{t} = 1

then break
29. end for
30. Determine the episode-level task success indicator

f_{k}

31. Adapt the exploration noise scale

σ_{k + 1}

according to

f_{k}

32. end for
33. Return the trained Actor networks

{π_{ϕ_{i}}}_{i = 1}^{N}

Figure 3 further illustrates how prioritized experience replay, adaptive exploration noise, and the attention-enhanced centralized Critic are coupled within the PE-MADDPG training loop. This figure highlights the functional interaction among the three enhancement modules during off-policy learning. Specifically, prioritized replay increases the reuse of informative transitions, the attention-enhanced Critic improves interaction-aware value estimation, and the adaptive exploration rule regulates the exploration intensity across training episodes.

Therefore, the three enhancement modules operate cooperatively within the same training framework. Prioritized replay improves sample efficiency, the attention-enhanced Critic strengthens the representation of interaction dependencies, and adaptive exploration noise stabilizes the exploration–exploitation balance throughout training. Through the coordinated action of these mechanisms, PE-MADDPG is able to achieve more stable convergence and more effective cooperative path planning in complex urban low-altitude environments.

4. Simulation Settings

To verify the effectiveness of the PE-MADDPG algorithm, a series of simulation experiments is conducted in this section to provide a comprehensive evaluation from the perspectives of training convergence, comparative performance, scalability, and module effectiveness. The entire experimental procedure is performed in a controllable and repeatable environment based on a unified simulation platform.

4.1. Experimental Environment Setup

4.1.1. Simulation Environment

The simulation scenario is constructed in a 3000 m × 3000 m × 300 m 3D urban low-altitude environment, including 35 static building obstacles (axis-aligned bounding boxes) with sizes randomly varying from 50 m × 50 m × 50 m to 200 m × 200 m × 250 m; another 8 cylindrical no-fly zones with a radius of 80 m and a height of 200 m are set to simulate urban controlled airspace and temporary no-fly areas, as shown in Figure 4.

4.1.2. UAV Parameter Settings

The default inspection platform in this paper is the DJI Matrice 350 RTK (M350 RTK). According to official data, its maximum horizontal speed is 23 m/s, maximum wind resistance is Level 7, maximum flight time without load is about 55 min, and the protection level is IP55.

To fit the safety margin and load impact in the urban inspection scenario, we adopt a conservative amplitude limiting coefficient for maneuverability at the planning and control level, so as to map the UAV specifications to the simulation dynamic constraints. Considering that the leader UAV is equipped with a pan-tilt load resulting in slightly limited maneuverability, we set slightly differentiated control limit parameters for the leader and follower UAVs (the rest of the constraints are consistent), as shown in Table 1.

4.1.3. Training Settings

The training of all algorithms is completed on a computer equipped with an Intel Core i7-12700H processor and 16 GB RAM. Python 3.9 is used as the programming language, implemented in conjunction with the PyTorch 2.5.1 deep learning framework; if equipped with an NVIDIA GPU, CUDA 12.1 is enabled for training acceleration. To ensure the fairness of the comparative experiments, all algorithms are run under the same environment and the same training budget. The core parameter settings of PE-MADDPG are shown in Table 2.

The comparison algorithms include DDPG, MAPPO, and standard MADDPG. These baselines were selected to represent three relevant reference settings for the present task, namely single-agent deterministic control, off-policy multi-agent actor–critic learning, and on-policy multi-agent policy optimization. To ensure a controlled comparison, all methods were implemented under the same simulation environment, training budget, and basic network scale, while retaining only the algorithm-specific settings required by each method. All comparison algorithms adopt the same basic network structure and core training hyperparameters as PE-MADDPG, and only retain the core hyperparameter differences of the algorithm itself to ensure the fairness of the comparative experiments. This unified setting was adopted to control the comparison conditions across algorithms.

5. Results

5.1. Comparative Analysis of Training Process

To evaluate the convergence speed and stability of the proposed PE-MADDPG in the training phase, and verify the role of each improved module in the learning process, this section compares five training configurations: standard MADDPG-Baseline, MADDPG-PER with only prioritized experience replay, MADDPG-Attention with only attention-enhanced Critic, MADDPG-Entropy with entropy regularization exploration as control, and the complete PE-MADDPG. All methods are trained under the same training budget and keep the network scale consistent. Each configuration is trained independently for 5 times, and the mean and standard deviation of the reward per episode are recorded. The results are shown in Figure 5.

Figure 5 presents the average reward curves of PE-MADDPG and its principal control variants over 500 training episodes, where the shaded region represents the variability across multiple independent training runs. The results show that the full PE-MADDPG model enters the positive-reward regime more rapidly in the early stage, achieves stable improvement after approximately 200–250 episodes, and ultimately converges near a reward level of 1000. In contrast, MADDPG-Baseline exhibits both slower convergence and a lower final reward. Although introducing PER alone (MADDPG-PER) yields some benefit in the middle and late stages of training, the overall improvement remains limited. By comparison, introducing the attention-enhanced Critic alone (MADDPG-Attention) fails to obtain stable positive returns under the present task setting, indicating that interaction modeling must operate in coordination with sample efficiency and exploration mechanisms. This result suggests that the attention mechanism mainly improves interaction representation in the Critic, whereas stable learning still depends on whether informative samples can be sufficiently replayed and whether adequate exploration can be maintained during training.

The convergence statistics of 500 episodes of training are shown in Table 3. Among them, the convergence episode is the first episode where the sliding mean reaches 95% of the final mean and remains for 30 episodes.

Figure 6 illustrates the evolution of the total reward and its sub-reward components during the training process of PE-MADDPG. In the early stage of training, the agents remain primarily in the exploration phase, and both the total reward and the sub-rewards fluctuate substantially at relatively low levels. As training proceeds, the total reward gradually increases and becomes stable in the later stage, while the sub-reward curves of the leader UAV and follower UAVs improve synchronously and eventually enter stable ranges. These results indicate that the PE-MADDPG algorithm can converge effectively and that the agents successfully learn a near-optimal cooperative path-planning policy under multiple operational constraints.

5.2. Sensitivity Analysis

To examine the stability and parameter sensitivity of the adaptive exploration rule adopted in this paper, we conducted a sensitivity analysis by varying only the noise-update step size while keeping all other training settings unchanged. Three representative settings were considered, namely

η_{σ} = 0 . 01

,

η_{σ} = 0 . 02

, and

η_{σ} = 0 . 05

. We trained each setting under the same simulation environment and task configuration as those used in the main experiments, and repeated the 500-episode training process 5 times to obtain the mean trajectories and variance statistics. The resulting curves are shown in Figure 7.

As shown in Figure 7a, the adaptive exploration rule is clearly sensitive to the choice of the update step size. When

η_{σ} = 0 . 01

, the reward curve improves only slowly and remains at a relatively low level during the later stage of training, indicating that an overly small update step weakens exploration adaptability and limits policy improvement. Increasing the step size to

η_{σ} = 0 . 05

, it led to a substantial reward gain over the conservative setting, but its overall reward level still remains below that of the default setting. Among the tested settings,

η_{σ} = 0 . 02

achieves the highest reward level in the later stage of training, while maintaining a relatively stable convergence trend.

Figure 7b shows that the exploration noise remains bounded under all tested settings and follows distinct adaptation patterns for different step sizes. The setting

η_{σ} = 0 . 01

changes more slowly, whereas

η_{σ} = 0 . 05

produces more aggressive adjustments. By comparison,

η_{σ} = 0 . 02

exhibits a more balanced adaptation trajectory and corresponds to the most favorable reward performance in Figure 7a. These results indicate that, within the tested range, the adopted default step size provides the most effective trade-off between exploration responsiveness and final training performance under the present simulation setting.

5.3. Performance Comparison Experiments of Different Algorithms

To comprehensively evaluate the performance of the proposed algorithm, this section compares PE-MADDPG with three selected representative baselines, namely DDPG, MAPPO, and standard MADDPG. All comparative experiments are conducted during the testing phase after training, with the discrete time step set to 0.1 s. For the benchmark formation task considered in this section, the start and end points are randomly sampled across the entire domain using Latin hypercube sampling (LHS), thereby ensuring that the test set covers diverse combinations of mission geometry and obstacle-avoidance difficulty. Each algorithm is evaluated on 100 LHS test cases, and the mean and standard deviation of each performance metric are reported. The results are presented in Table 4.

As shown in Table 4, PE-MADDPG achieves the most favorable overall performance among the selected baseline algorithms in the multi-UAV cooperative path-planning scenario. Its mission completion rate reaches 92.0%, which is 7 and 14 percentage points higher than those of standard MADDPG and MAPPO, respectively, and substantially higher than that of DDPG. The formation keeping rate also shows a clear improvement over the three baselines, indicating that the proposed framework is more effective at maintaining coordinated multi-UAV behavior under the present setting. In terms of flight efficiency, PE-MADDPG yields the lowest average flight time and flight distance among the compared methods, suggesting more efficient path generation in the tested urban environment. It also achieves the lowest energy proxy, indicating a more favorable balance between cooperative performance and relative energy expenditure within the current modeling framework.

As shown in Figure 8, PE-MADDPG exhibits a more favorable convergence trend, higher final reward level, and lower variability across repeated runs than other baseline algorithms under the present setting.

5.4. Scalability Verification

To verify the scalability of the proposed algorithm, this section reports its performance under different numbers of UAVs across 100 LHS test cases.

Figure 9 shows that as the number of UAVs increases from 2 to 5, the task completion rate decreases slightly from 94.0% to 88.0% but consistently remains above 85%. The formation maintenance rate declines more noticeably, from 68.5% to 28.6%, which can be attributed to the substantial increase in constraint complexity as the swarm size expands. Meanwhile, flight time, flight distance, and average energy consumption increase only moderately, and no significant performance degradation is observed. These results indicate that under larger formation scales, PE-MADDPG still maintains a relatively stable task completion rate and satisfactory cooperative performance within the tested range of 2–5 UAVs. As the number of UAVs increases, the convergence characteristics of the centralized Critic and the associated training-time information-processing burden may change substantially because the joint state–action representation and interaction modeling become more complex. The present scalability analysis is therefore limited to the tested swarm sizes of 2–5 UAVs, and whether similar performance trends can be maintained in larger-scale formations requires further validation.

5.5. Cooperative Behavior and Trajectory Visualization Analysis

To deeply analyze the cooperative performance of the algorithm, this section shows the change curve of the cooperative relationship between the leader UAV and the follower UAVs in a successful test, as well as the path planning results of different algorithms.

In the three-UAV formation scenario, the evolution of the cooperative performance indicators for the leader and follower UAVs is shown in Figure 10. The results indicate that the relative distance, speed difference, and heading difference of the follower UAVs converge rapidly to the prescribed ranges, indicating effective coordination in distance, speed, and heading.

The path visualizations in Figure 11 and Figure 12 further show that the trajectory generated by PE-MADDPG is smoother and more compact, without obvious local detours or oscillations. Moreover, stable formation spacing is maintained throughout the mission, and the safety-separation constraints are satisfied over the entire flight process. By contrast, the trajectories produced by MADDPG and MAPPO exhibit noticeable local fluctuations and inefficient detours, resulting in poorer formation stability. Owing to the absence of an explicit cooperation mechanism, DDPG generates dispersed multi-UAV trajectories and even produces risky paths near obstacles, making stable formation cooperation difficult to achieve. These visualization results further confirm the advantages of PE-MADDPG in coordination, safety, and path optimality.

5.6. Ablation Experiment

To evaluate the contribution of each enhanced module in PE-MADDPG and the synergistic effect among them, this section conducts ablation experiments based on the 500-episode training data of five algorithmic variants. All variants are run independently five times to reduce random effects, and the results are recorded as mean ± standard deviation.

The five algorithm variants are set as follows:

MADDPG-Baseline: Standard MADDPG algorithm.
MADDPG-PER: Only the prioritized experience replay mechanism is introduced.
MADDPG-Noise: Only the adaptive exploration noise mechanism is introduced.
MADDPG-Attention: Only the multi-head attention-enhanced Critic network is introduced.
PE-MADDPG: The complete algorithm including all improved modules.

For quantitative comparison, this paper also summarizes the statistical results of key indicators of each variant in the test phase (100 LHS tests), as shown in Table 5.

To assess whether the observed performance improvements exceed the variability across independent runs, we conducted independent two-sample t-tests to compare PE-MADDPG with each variant (MADDPG-Baseline, MADDPG-Noise, MADDPG-PER, and MADDPG-Attention) based on the results from five independent training runs, where each run was evaluated on the same 100 test cases. The results indicate that PE-MADDPG achieves statistically significant improvements (p < 0.05) over all four variants in mission completion rate (MCR), obstacle avoidance rate (OAR), formation keeping rate (FKR), flight distance (FD), and energy proxy (EP). For flight time (FT), the improvement over MADDPG-PER does not reach the 0.05 significance level (p = 0.08), whereas the differences relative to the other three variants are statistically significant (p < 0.05). These statistical analyses provide quantitative support that several of the observed gains exceed the variability across independent runs and are consistent with the complementary effects of the three enhancement modules under the present setting. To unify the scales of indicators with different dimensions, this paper normalizes each performance indicator in Figure 13.

The ablation results show that PE-MADDPG achieves the most favorable overall performance among the tested variants under the present setting. Among the individual enhancement modules, prioritized experience replay (PER) is associated with comparatively larger gains in mission completion, obstacle avoidance, and flight-efficiency-related metrics, suggesting that improved sample reuse is particularly beneficial under the present setting. The adaptive exploration-noise mechanism is associated with improved training stability and moderate gains in cooperative performance, suggesting a more balanced exploration–exploitation process under the present setting. The attention-enhanced Critic network is associated with more pronounced improvements in formation-related performance, suggesting that interaction-aware value representation is more beneficial to coordination-sensitive metrics under the present setting. At the same time, the results show that the performance gains brought by any single enhancement remain limited, whereas the combination of all three modules yields a more favorable overall outcome.

Taken together, these results suggest that the three modules provide complementary benefits under the present setting, and that their combined use yields a more favorable overall outcome than any single enhancement alone.

6. Conclusions

To address the challenges of low sample efficiency, unstable exploration, and insufficient inter-agent interaction modeling in multi-UAV cooperative path planning for complex urban environments, this paper proposes an improved MADDPG algorithm that integrates prioritized experience replay, an adaptive exploration-noise mechanism, and a multi-head attention mechanism. By constructing a three-dimensional urban inspection simulation environment with strong engineering relevance, the paper incorporates UAV dynamic constraints, safety-separation requirements, and formation-cooperation objectives into a unified POMDP framework, thereby enabling end-to-end learning of a cooperative path-planning policy. Comparative experiments indicate that PE-MADDPG demonstrates favorable performance over DDPG, MAPPO, and standard MADDPG in key metrics such as task completion rate, formation maintenance, flight efficiency, and energy consumption control. Scalability tests and ablation experiments further support the robustness of the proposed method under different swarm sizes, as well as the effectiveness and synergy of the three enhancement modules.

Nevertheless, this study has several limitations. The current validation is restricted to simulation, and practical factors such as communication delays, sensor noise, and non-cooperative aircraft are not yet considered. The modeling framework adopts engineering simplifications, including Euler discretization, fixed-dimensional obstacle observations, and bounding-sphere obstacle approximations, which may introduce conservative distortions in highly anisotropic urban environments. Additionally, the priority formulation based on mean absolute TD-error is a stability-oriented design choice, and the multi-objective reward weights are task-specific. Moreover, the current obstacle-avoidance and boundary-handling strategy is implemented mainly through reward shaping and soft penalties. Under this formulation, the learned policy cannot be interpreted as providing hard safety guarantees in the formal sense.

Future work will proceed along three directions. First, the current validation framework will be extended to more realistic sensing and communication conditions, including communication delay, sensor noise, non-Gaussian disturbances, urban wind effects, and more realistic nonlinear aerodynamic uncertainty. Possible remedies such as domain randomization, robust control, and disturbance-compensation mechanisms will also be considered to improve sim-to-real transfer and practical deployability. Second, the methodological design will be further refined through alternative priority aggregation schemes, broader sensitivity analysis of reward and exploration parameters, and richer interaction-modeling mechanisms. In addition, integrating hard safety mechanisms, such as control barrier functions or shielded reinforcement learning, will be an important direction for strengthening safety assurance beyond soft-penalty learning. Third, deployment-oriented extensions will be investigated, including event-triggered execution, fault-tolerant cooperative formation, and real-flight experimental validation. In addition, uncertainty-aware path-following ideas from related autonomous-system studies may be incorporated to improve disturbance adaptation and practical deployability in complex low-altitude environments.

Author Contributions

Conceptualization, F.Z.; Methodology, F.Z.; Software, Q.W.; Validation, Q.W.; Formal analysis, X.M.; Data curation, Q.W. and X.M.; Writing—original draft, Q.W.; Writing—review & editing, F.Z., Q.W. and X.M.; Visualization, Q.W.; Supervision, X.M.; Project administration, F.Z. and X.M.; Funding acquisition, F.Z. and X.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Laboratory of Unmanned Aerial Vehicle Technology in NPU (Grant No. WRFX-202502) and the Science and Technology Program of Xizang Autonomous Region (Grant No. XZ202403ZY0014). The APC was funded by the authors’ supporting projects.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

Tarekegn, G.B.; Tesfaw, B.A.; Juang, R.-T.; Saha, D.; Tarekegn, R.B.; Lin, H.-P.; Tai, L.-C. Trajectory control and fair communications for multi-UAV networks: A federated multi-agent deep reinforcement learning approach. IEEE Trans. Wirel. Commun. 2025, 24, 7598–7611. [Google Scholar] [CrossRef]
Liu, Y.; Liu, Z. Flexible multi-uav formation control via integrating deep reinforcement learning and affine transformations. Aerosp. Sci. Technol. 2025, 157, 109934. [Google Scholar] [CrossRef]
Nazemi Jenabi, M.; Asharioun, H.; Pourgholi, M. 3D UAV path planning based on an improved TD3 deep reinforcement learning for data collection in an urban environment. J. Netw. Comput. Appl. 2025, 244, 104336. [Google Scholar] [CrossRef]
Ivić, S.; Crnković, B.; Grbčić, L.; Matleković, L. Multi-UAV trajectory planning for 3D visual inspection of complex structures. Autom. Constr. 2023, 147, 104709. [Google Scholar] [CrossRef]
Guo, J.; Harmati, I. Lane-changing system based on deep q-learning with a request–respond mechanism. Expert Syst. Appl. 2024, 235, 121242. [Google Scholar] [CrossRef]
Hamissi, A.; Dhraief, A. A survey on the unmanned aircraft system traffic management. ACM Comput. Surv. 2023, 56, 1–37. [Google Scholar] [CrossRef]
Zhang, Y.; Ding, M.; Yuan, Y.; Zhang, J.; Yang, Q.; Shi, G.; Jiang, J. Large-scale UAV swarm path planning based on mean-field reinforcement learning. Chin. J. Aeronaut. 2025, 38, 336–349. [Google Scholar] [CrossRef]
Zhan, F.B.; Noon, C.E. Shortest path algorithms: An evaluation using real road networks. Transp. Sci. 1998, 32, 65–73. [Google Scholar] [CrossRef]
Hart, P.E.; Nilsson, N.J.; Raphael, B. A formal basis for the heuristic determination of minimum cost paths. IEEE Trans. Syst. Sci. Cybern. 1968, 4, 100–107. [Google Scholar] [CrossRef]
Stentz, A. Optimal and efficient path planning for partially-known environments. In Proceedings of the 1994 IEEE International Conference on Robotics and Automation, San Diego, CA, USA, 8–13 May 1994; pp. 3310–3317. [Google Scholar]
Xie, K.; Qiang, J. Research and optimization of D-start lite algorithm in track planning. IEEE Access 2020, 8, 161920–161928. [Google Scholar] [CrossRef]
Holland, J.H. Adaptation in Natural and Artificial System, 2nd ed.; MIT Press: Cambridge, MA, USA, 1992. [Google Scholar]
Coello, C.C.; Lechuga, M.S. MOPSO: A proposal for multiple objective particle swarm optimization. In Proceedings of the 2002 IEEE Congress on Evolutionary Computation, Honolulu, HI, USA, 12–17 May 2002; pp. 1051–1056. [Google Scholar]
Dorigo, M.; Maniezzo, V.; Colorni, A. Ant system: Optimization by a colony of cooperating agents. IEEE Trans. Syst. Man Cybern. B 1996, 26, 29–41. [Google Scholar] [CrossRef]
Ugwoke, K.C.; Nnanna, N.A. Simulation-based review of classical, heuristic, and metaheuristic path planning algorithms. Sci. Rep. 2025, 15, 12643. [Google Scholar] [CrossRef]
Blasi, L.; D’Amato, E. Clothoid-based path planning for a formation of fixed-wing UAVs. Electronics 2023, 12, 2204. [Google Scholar] [CrossRef]
Han, B.; Shi, L. Multi-agent multi-target pursuit with dynamic target allocation and actor network optimization. Electronics 2023, 12, 4613. [Google Scholar] [CrossRef]
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef]
Sumiea, E.H.; Abdulkadir, S.J.; Alhussian, H.S.; Al-Selwi, S.M.; Alqushaibi, A.; Ragab, M.G.; Fati, S.M. Deep deterministic policy gradient algorithm: A systematic review. Heliyon 2024, 10, e30697. [Google Scholar] [CrossRef] [PubMed]
Recht, B. A tour of reinforcement learning: The view from continuous control. Annu. Rev. Control Robot. Auton. Syst. 2019, 2, 253–279. [Google Scholar] [CrossRef]
Liu, Q.; Xiong, P.; Zhu, Q.; Xiao, W.; Wang, K.; Hu, G.; Li, G. A DDPG-based path following control strategy for autonomous vehicles by integrated imitation learning and feedforward exploration. Chin. J. Mech. Eng. 2025, 38, 174. [Google Scholar] [CrossRef]
Lowe, R.; Wu, Y.I.; Tamar, A.; Harb, J.; Abbeel, P.; Mordatch, I. Multi-agent actor-critic for mixed cooperative-competitive environments. In NIPS’17: Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Red Hook, NY, USA, 2017; pp. 6379–6390. [Google Scholar]
Yu, C.; Velu, A.; Vinitsky, E.; Gao, J.; Wang, Y.; Bayen, A.; Wu, Y. The surprising effectiveness of PPO in cooperative multi-agent games. In NIPS’22: Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; Curran Associates Inc.: Red Hook, NY, USA, 2022; pp. 24611–24624. [Google Scholar]
Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), Stockholm, Sweden, 10–15 July 2018; pp. 1861–1870. [Google Scholar]
Cheng, L.; Shang, W. Multi-head attention dqn and dynamic priority for path planning of unmanned aerial vehicles oriented to penetration. Electronics 2025, 15, 167. [Google Scholar] [CrossRef]
Demir, K.; Tumen, V. A deep reinforcement learning algorithm for trajectory planning of swarm UAV fulfilling wildfire reconnaissance. Electronics 2024, 13, 2568. [Google Scholar] [CrossRef]
Liu, H.; Long, X.; Li, Y.; Yan, J.; Li, M.; Chen, C.; Gu, F.; Pu, H.; Luo, J. Adaptive multi-UAV cooperative path planning based on novel rotation artificial potential fields. Knowl.-Based Syst. 2025, 317, 113429. [Google Scholar] [CrossRef]
Zhang, G.; Sun, Z.; Li, J.; Huang, J.; Qiu, B. Iterative learning control for path-following of ASV with the ice floes auto-select avoidance mechanism. IEEE Trans. Intell. Transp. Syst. 2025, 26, 13927–13938. [Google Scholar] [CrossRef]
Zhang, G.; Yin, S. Game-based event-triggered control for unmanned surface vehicle: Algorithm design and harbor experiment. IEEE Trans. Cybern. 2025, 55, 2729–2741. [Google Scholar] [CrossRef] [PubMed]

Figure 1. Conceptual illustration of the multi-UAV cooperative inspection scenario in an urban low-altitude environment. T1–T6 denote six inspection target points.

Figure 2. Structural illustration of the attention-enhanced centralized Critic. Interaction cues in the joint state-action representation are embedded and projected into query, key, and value spaces, after which multi-head attention reweights inter-agent relevance. The resulting fused context feature is combined with the current agent representation for interaction-aware state-action value estimation.

Figure 3. Overall framework of the proposed PE-MADDPG algorithm for multi-UAV cooperative path planning.

Figure 4. Schematic diagram of the three-dimensional urban inspection environment used for simulation.

Figure 5. Average reward curves under different training configurations.

Figure 6. PE-MADDPG sub-reward curves.

Figure 7. Sensitivity analysis of the adaptive exploration rule under different noise-update step sizes. (a) Average reward curves under settings, where the shaded region indicates the variability across repeated training runs. (b) Evolution of the adaptive exploration noise scale under the corresponding step-size settings.

Figure 8. Training reward comparison among different reinforcement learning algorithms.

Figure 9. Scalability analysis of PE-MADDPG under different numbers of UAV agents. Colored bars represent the average values of MCR, FKR, FT, FD, and EP for different numbers of UAVs, while the black square markers denote the corresponding mean values and the error bars represent the standard deviations.

Figure 10. Evolution of cooperative performance indicators during a representative mission.

Figure 11. Three-dimensional flight trajectories generated by different algorithms. Panels (a–d) correspond to DDPG, MAPPO, MADDPG, and PE-MADDPG, respectively. Yellow squares indicate the starting points of the UAV flight paths, green circles indicate the inspection target point, cyan cuboids denote buildings, and pink cylinders denote no-fly zones.

Figure 12. Two-dimensional top-view flight paths under different planning algorithms. Panels (a–d) correspond to DDPG, MAPPO, MADDPG, and PE-MADDPG, respectively. Yellow squares indicate the starting points of the UAV flight paths, green circles indicate the inspection target points, T1–T7 denote the labeled inspection target points, cyan rectangles denote buildings, and pink circles denote no-fly zones.

Figure 13. Normalized performance comparison of algorithm variants in the ablation study.

Table 1. UAV dynamic parameter settings.

Parameter	Leader UAV	Follower UAV
Velocity (m/s)	[0, 18]	[0, 18]
Acceleration (m/s²)	[−1, 1]	[−1.5, 1.5]
Angular velocity (rad/s)	[−0.55, 0.55]	[−0.6, 0.6]
Optimal formation distance (m)	300	300
Pitch angle range (rad)	$[- \frac{π}{4}, \frac{π}{4}]$	$[- \frac{π}{4}, \frac{π}{4}]$
Minimum safety distance (m)	30	30

Table 2. PE-MADDPG algorithm training parameter settings.

Core Hyperparameter	Leader UAV
Actor network learning rate	1 × 10⁻⁴
Critic network learning rate	2 × 10⁻⁴
Initial value of exploration noise	0.2
Lower bound of exploration noise	2 × 10⁻²
Upper bound of exploration noise	0.5
Adaptive step size of noise	2 × 10⁻²
Soft update rate $τ$	0.5 × 10⁻²
Discount factor $γ$	0.95
Maximum training episodes	500
Batch Size	128
Maximum steps per episode	1000
Maximum capacity of experience pool	2 × 10⁴
Actor network	Input → Hidden (256, 256, ReLU) → Output (3, Tanh)
Critic network	Input → Embedding → Multi-head Attention → Hidden (256, 256, ReLU) → Output (1)
Weight initialization	Normal distribution (mean = 0, std = 0.1)

Table 3. Training convergence statistics.

Variant	Avg. Reward (Last 50 Episodes)	Avg. Error Band	Convergence Episode	Improvement vs. Baseline
MADDPG-Baseline	487.08	153.52	321	+0.00%
MADDPG-Entropy	−470.37	137.2	-	−196.57%
MADDPG-PER	494.51	153.52	411	+1.52%
MADDPG-Attention	−2.12	137.2	-	−100.44%
PE-MADDPG	982.88	100	234	+101.7%

Table 4. Performance comparison of PE-MADDPG and benchmark algorithms.

Algorithm	MCR (%)	OAR (%)	FKR (%)	FT (Step)	FD (m)	EP
DDPG	45.0 ± 5.5	72.3 ± 6.2	2.3 ± 1.2	650.8 ± 95.6	1580.2 ± 245.4	680.7 ± 125.5
MAPPO	78.0 ± 4.2	89.5 ± 4.8	28.5 ± 3.8	480.5 ± 65.3	1120.8 ± 185.2	520.3 ± 95.7
MADDPG	85.0 ± 3.5	91.3 ± 4.1	32.1 ± 3.2	410.7 ± 58.4	980.6 ± 150.3	450.8 ± 85.4
PE-MADDPG	92.0 ± 2.1	98.6 ± 1.5	45.8 ± 2.8	350.4 ± 42.6	820.3 ± 120.5	380.6 ± 75.2

Table 5. Performance indicators of different algorithm strategies.

Variant	MCR (%)	OAR (%)	FKR (%)	FT (Step)	FD (m)	EP
MADDPG-Baseline	85.1 ± 3.4	92.3 ± 3.5	32.2±3.1	411.5 ± 57.8	988.7 ± 151.6	451.3 ± 84.7
MADDPG-Noise	87.6 ± 3.0	94.5 ± 2.8	34.8 ± 2.9	395.8 ± 52.3	952.4 ± 142.3	432.6 ± 80.5
MADDPG-PER	90.2 ± 2.5	96.8 ± 2.1	35.1 ± 3.0	372.4 ± 48.6	895.3 ± 130.5	408.5 ± 78.2
MADDPG-Attention	88.5 ± 2.8	95.2 ± 2.6	38.4 ± 2.6	388.6 ± 50.1	926.8 ± 138.4	425.7 ± 82.1
PE-MADDPG	92.3 ± 2.1	98.7 ± 1.4	45.6 ± 2.7	352.6 ± 43.5	832.5 ± 122.8	381.2 ± 74.9

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Zhang, F.; Wang, Q.; Ma, X. Multi-UAV Cooperative Path Planning Method Based on an Improved MADDPG Algorithm. Electronics 2026, 15, 1632. https://doi.org/10.3390/electronics15081632

AMA Style

Zhang F, Wang Q, Ma X. Multi-UAV Cooperative Path Planning Method Based on an Improved MADDPG Algorithm. Electronics. 2026; 15(8):1632. https://doi.org/10.3390/electronics15081632

Chicago/Turabian Style

Zhang, Feiqiao, Qian Wang, and Xin Ma. 2026. "Multi-UAV Cooperative Path Planning Method Based on an Improved MADDPG Algorithm" Electronics 15, no. 8: 1632. https://doi.org/10.3390/electronics15081632

APA Style

Zhang, F., Wang, Q., & Ma, X. (2026). Multi-UAV Cooperative Path Planning Method Based on an Improved MADDPG Algorithm. Electronics, 15(8), 1632. https://doi.org/10.3390/electronics15081632

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Multi-UAV Cooperative Path Planning Method Based on an Improved MADDPG Algorithm

Abstract

1. Introduction

1.1. Related Work

1.2. Motivations and Contributions

2. Problem Description and Modeling

2.1. Urban Low-Altitude Cooperative Inspection Scenario

2.2. UAV Dynamics Model and Constraints

2.3. POMDP Decision Model

2.3.1. State Space

2.3.2. Observation Space

2.3.3. Action Space

2.3.4. Transition Probability

2.3.5. Observation Probability

2.3.6. Multi-Objective Reward Functions

2.4. Performance Evaluation Metrics

3. Cooperative Path Planning Method Based on PE-MADDPG

3.1. MADDPG Basic Framework

3.2. Prioritized Experience Replay Mechanism

3.3. Adaptive Exploration Noise Mechanism

3.4. Attention-Enhanced Critic Network

3.5. PE-MADDPG Algorithm Flow

4. Simulation Settings

4.1. Experimental Environment Setup

4.1.1. Simulation Environment

4.1.2. UAV Parameter Settings

4.1.3. Training Settings

5. Results

5.1. Comparative Analysis of Training Process

5.2. Sensitivity Analysis

5.3. Performance Comparison Experiments of Different Algorithms

5.4. Scalability Verification

5.5. Cooperative Behavior and Trajectory Visualization Analysis

5.6. Ablation Experiment

6. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI