Article

Fusing Adaptive Game Theory and Deep Reinforcement Learning for Multi-UAV Swarm Navigation

1 Air Force Early Warning Academy, Wuhan 430019, China
2 School of Electronic and Electrical Engineering, Wuhan Textile University, Wuhan 430200, China
* Authors to whom correspondence should be addressed.
Drones 2025, 9(9), 652; https://doi.org/10.3390/drones9090652
Submission received: 28 July 2025 / Revised: 25 August 2025 / Accepted: 30 August 2025 / Published: 16 September 2025

Abstract

To address issues such as inadequate robustness in dynamic obstacle avoidance, instability in formation morphology, severe resource conflicts in multi-task scenarios, and challenges in global path planning optimization for unmanned aerial vehicles (UAVs) operating in complex airspace environments, this paper examines the advantages and limitations of conventional UAV formation cooperative control theories. A multi-UAV cooperative control strategy is proposed, integrating adaptive game theory and deep reinforcement learning within a unified framework. By employing a three-layer information fusion architecture—comprising the physical layer, intent layer, and game-theoretic layer—the approach establishes models for multi-modal perception fusion, game-theoretic threat assessment, and dynamic aggregation-reconstruction. This optimizes obstacle avoidance algorithms, facilitates interaction and task coupling among formation members, and significantly improves the intelligence, resilience, and coordination of formation-wide cooperative control. The proposed solution effectively addresses the challenges associated with cooperative control of UAV formations in complex traffic environments.

1. Introduction

With the growing adoption of unmanned aerial vehicle (UAV) formations in various modern applications, highly complex operational environments are placing greater demands on system reliability and collaborative capability [1]. Although existing cooperative control theories for UAV formations have achieved considerable progress, they still exhibit notable limitations when confronted with highly dynamic conditions and evolving mission requirements [2]. To advance the intelligence of multi-UAV cooperation, this paper proposes a cooperative control algorithm that integrates adaptive game theory with deep reinforcement learning within a unified framework. By introducing novel adaptive mechanisms to dynamically adjust inter-agent interaction models in real time, the method optimizes key decision-making processes, thereby significantly enhancing the intelligence, adaptability, and collaborative performance of UAV formation systems in highly complex operational settings [3].
Nevertheless, conventional approaches continue to reveal critical shortcomings in highly dynamic airspace environments: bio-inspired control models (e.g., Boids) lack situational awareness and adaptability in the presence of dynamic obstacles and sudden disturbances; distributed coordination strategies suffer from computational and communication overhead that scales quadratically or exponentially with swarm size, impairing real-time applicability and scalability; deep reinforcement learning-based methods exhibit unpredictable behaviors and insufficient safety guarantees in critical scenarios, coupled with high training costs and limited generalizability; furthermore, most existing methods focus on isolated objectives and lack a unified framework for dynamically balancing competing tasks in multi-objective operational contexts.
These limitations collectively underscore a critical research gap: the urgent need for a new cooperative control framework that simultaneously ensures real-time performance, scalability, adaptability to dynamic environments, and inherent safety. To address these challenges, this study proposes a hybrid solution integrating adaptive game theory with deep reinforcement learning, offering a systematic approach toward more reliable and intelligent cooperative control of UAV formations.
The key innovations of this study are summarized as follows:
A Novel Fusion Framework: We propose a unified hierarchical control architecture that seamlessly integrates adaptive game theory with deep reinforcement learning for multi-UAV cooperative control. This framework effectively addresses the limitations of conventional methods in terms of dynamic adaptability, real-time performance, and safety assurance.
(1) An Efficient Mean-Field Game Acceleration Algorithm: By incorporating a dynamic density threshold and a Shared Nearest Neighbor (SNN) similarity measure into a state aggregation strategy, we significantly reduce the computational complexity of large-scale swarm interactions from $O(N^2)$ to $O(K^2)$ while maintaining high clustering accuracy.
(2) A Multi-Scale Attention-Based Policy Network: We design a hybrid spatial–temporal attention mechanism to enhance environmental perception. This model achieves an obstacle detection accuracy of 95.1% and reduces the response time to dynamic threats to under 0.3 s.
(3) Formal Safety Guarantees: By embedding Control Barrier Functions (CBFs) into the policy network, we establish a safe reinforcement learning framework that confines trajectory deviations to within 2 m and provides formal safety guarantees throughout dynamic operations.

2. Traditional Theories of UAV Formation Cooperative Control

2.1. Rule-Based Bionic Control Method

Rule-based bionic control achieves UAV formation control by mimicking the self-organizing behaviors of biological groups such as bird flocks and fish schools. Each UAV follows simple local interaction rules, and the desired global formation emerges from the collective interactions of these individual behaviors [4]. The core rules include separation, alignment, and cohesion.
The separation rule maintains a minimum distance between drones to avoid collisions. Define the relative position vector from drone i to drone j as $\mathbf{r}_{ij}$, the distance between them as $\|\mathbf{r}_{ij}\|$, and the repulsion coefficient as $k_{sep}$. The magnitude of the repulsive force is inversely proportional to the square of the separation distance, and its direction is opposite to $\mathbf{r}_{ij}$.
The formula is as follows:
$$\mathbf{a}_{sep} = -k_{sep} \sum_{j \in N_i} \frac{\mathbf{r}_{ij}}{\|\mathbf{r}_{ij}\|^{3}}$$
Alignment enables coordinated movement through velocity matching. Let $\mathbf{v}_j$ represent the velocity vector of a neighboring drone j, and let $|N_i|$ denote the number of neighbors of drone i. Velocity synchronization is achieved when the target velocity of drone i matches the average velocity of all its neighboring drones [5].
The formula is as follows:
$$\mathbf{v}_{align} = \frac{1}{|N_i|} \sum_{j \in N_i} \mathbf{v}_{j}$$
The aggregation rule refers to relying on attractive forces to make the group gather together.
The formula is as follows:
$$\mathbf{F}_{coh} = k_{coh} \sum_{j \in N_i} \frac{\mathbf{r}_{ij}}{\|\mathbf{r}_{ij}\|}$$
In the formula, $k_{coh}$ is the aggregation intensity coefficient, indicating the magnitude of the attractive force. Each term $\mathbf{r}_{ij}/\|\mathbf{r}_{ij}\|$ is a unit vector pointing from UAV i toward its neighbor j, so the resultant of these vectors points toward the centroid of the neighboring UAVs. The attraction therefore drives UAV i toward the aggregation center of its neighbors, achieving the gathering and clustering of the formation.
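To make the three rules concrete, the following Python sketch combines separation, alignment, and cohesion into a single acceleration command for one UAV; the coefficient values and example positions are illustrative placeholders, not parameters from the paper.

```python
import numpy as np

def boids_acceleration(p_i, v_i, neighbor_pos, neighbor_vel,
                       k_sep=1.0, k_align=0.5, k_coh=0.2):
    """Combine separation, alignment, and cohesion for one UAV.
    Coefficients are illustrative placeholders, not values from the paper."""
    a_sep = np.zeros(3)
    a_coh = np.zeros(3)
    for p_j in neighbor_pos:
        r_ij = p_j - p_i                      # relative position vector (from i to j)
        dist = np.linalg.norm(r_ij)
        a_sep -= k_sep * r_ij / dist**3       # repulsion, opposite to r_ij
        a_coh += k_coh * r_ij / dist          # attraction toward the neighbors' centroid
    v_align = np.mean(neighbor_vel, axis=0)   # average neighbor velocity
    a_align = k_align * (v_align - v_i)       # steer toward the average velocity
    return a_sep + a_align + a_coh

# Example: one UAV with two neighbors
p_i, v_i = np.array([0., 0., 10.]), np.array([1., 0., 0.])
neighbors_p = [np.array([3., 1., 10.]), np.array([-2., 4., 10.])]
neighbors_v = [np.array([1., 0.5, 0.]), np.array([0.8, -0.2, 0.])]
print(boids_acceleration(p_i, v_i, neighbors_p, neighbors_v))
```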
The rule-based bionic control method requires less than 0.8 ms per decision cycle, exhibiting low computational overhead and excellent hardware portability. However, owing to its fixed rules, the method demonstrates limited adaptability to dynamic environments and performs inadequately in handling moving obstacles.

2.2. Distributed Cooperative Control Method

This method incorporates a consensus protocol and a graph-theoretic optimization model, enabling global objectives to be achieved through local information exchange [6]. Iterative algorithms can further enforce formation constraints, transforming the formation control problem into a distributed optimization framework where all UAVs attain consensus.
The consensus protocol achieves global consistency by computing a weighted average of the states of neighboring UAVs, expressed as follows:
$$x_i(t+1) = x_i(t) + \epsilon \sum_{j \in N_i} w_{ij}\bigl(x_j(t) - x_i(t)\bigr)$$
In the formula, $x_i(t)$ represents the state (such as position or velocity) of UAV i at time t; $\epsilon$ is the convergence step size, which controls the magnitude of each state adjustment; $w_{ij}$ is the communication weight, indicating the degree to which UAV i trusts its neighboring UAV j. Under the consensus protocol, the state increment of UAV i is proportional to the weighted sum of the state differences between itself and its neighbors, which ultimately drives the states of all UAVs to converge.
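A minimal sketch of this consensus update for scalar states (for example, a common altitude), assuming a fixed, fully connected weight matrix and a step size chosen only for illustration:

```python
import numpy as np

def consensus_step(x, W, eps=0.1):
    """One synchronous consensus update: x_i += eps * sum_j w_ij * (x_j - x_i)."""
    n = len(x)
    x_next = x.copy()
    for i in range(n):
        x_next[i] += eps * sum(W[i, j] * (x[j] - x[i]) for j in range(n))
    return x_next

# Example: three UAVs converging to a common altitude
x = np.array([100.0, 120.0, 90.0])            # initial scalar states (e.g., altitude)
W = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]], dtype=float)         # fully connected communication graph
for _ in range(50):
    x = consensus_step(x, W)
print(x)   # all entries approach the average of the initial states
```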
The graph theory optimization model models the formation problem and converts it into an optimization problem that satisfies constraint conditions. The expression is as follows:
$$\min_{\{x_i\}} \; \sum_{i} C_i(x_i) + \sum_{(i,j)\in E} \omega_{ij}\,\bigl\|x_i - x_j - d_{ij}^{*}\bigr\|_2^2$$
In the formula, min denotes the minimization operation, whose objective is to find the set of UAV state variables (such as position and velocity) that minimizes the total cost; $C_i(x_i)$ is the individual cost function covering terms such as energy consumption and obstacle avoidance; $d_{ij}^{*}$ is the desired relative displacement between UAV i and UAV j; and the weight $\omega_{ij}$ reflects the priority of the corresponding formation constraint. The essence of the graph-theoretic optimization model is to keep the formation in an optimal state by minimizing this global cost function.
Distributed cooperative control maintains performance in large-scale formations comprising hundreds of UAVs and tolerates approximately 30% partial failures. However, owing to significant computational load and high complexity, its real-time performance is considerably limited; its average decision-making latency reaches 120 ms, which fails to meet the real-time requirement (e.g., <100 ms) for highly dynamic scenarios.

2.3. DRL Control Method (Deep Reinforcement Learning)

The DRL control method is grounded in a reinforcement learning framework, which allows agents to interact with environmental dynamics in complex scenarios. It formulates the formation control problem as a Markov Decision Process (MDP) and seeks optimal control policies through independent and collaborative trial-and-error learning by UAVs [7]. This approach emphasizes three core elements: state space, action space, and reward function.
The state space comprises environmental information such as the UAV’s own state and the positions of nearby obstacles, serving as the foundation for control decisions. For UAV i, p i denotes its position, v i represents the velocity vector, and O i indicates other state variables. The state space can be formally defined as:
$$s_t = \{\,p_i,\ v_i,\ O_i\,\}$$
The action space is a formal expression of UAV control commands. To ensure physical feasibility, control commands are generally continuous values, with a range limited to ±5 m/s². If $a_t$ represents the acceleration command of the UAV at time t, the action space can be expressed as follows:
$$a_t \in [-5,\, 5]\ \mathrm{m/s^2}$$
The reward function is used to represent the optimized behaviors of the UAV agent that encourage approaching the target, reducing energy consumption, and maintaining formation. It can strengthen rewards for recent behaviors through an exponential decay term.
The expression is as follows:
$$r_t = w_p\, e^{-\|p_i - p_{target}\|} + w_e\, e^{-E_{consumed}} + w_f\, I_{formation}$$
In the formula, $w_p$, $w_e$, and $w_f$ are weight coefficients that balance the different optimization objectives; $p_{target}$ is the target position to be reached; $E_{consumed}$ is the cumulative energy consumption; and $I_{formation}$ represents indicators such as formation shape matching and integrity.
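A minimal sketch of such a reward computation; the weight values and the exact form of the formation indicator are illustrative assumptions rather than the paper's tuned settings:

```python
import numpy as np

def formation_reward(p_i, p_target, energy_consumed, formation_score,
                     w_p=1.0, w_e=0.2, w_f=0.5):
    """Reward shaped by distance-to-target, energy use, and formation integrity.
    The exponential terms decay the contribution as distance / energy grow."""
    r_pos = w_p * np.exp(-np.linalg.norm(p_i - p_target))   # encourage approaching the target
    r_energy = w_e * np.exp(-energy_consumed)               # penalize cumulative energy use
    r_form = w_f * formation_score                          # formation-keeping indicator in [0, 1]
    return r_pos + r_energy + r_form

print(formation_reward(np.array([10., 50., 20.]), np.array([12., 50., 20.]),
                       energy_consumed=0.5, formation_score=0.9))
```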
The DRL control method achieves a 96.3% obstacle avoidance success rate in complex environments and exhibits strong adaptability. However, it involves high training costs and faces challenges in ensuring safety.
In summary, conventional UAV formation cooperative control methods suffer from significant limitations: (1) Due to its fixed rules, the rule-based bionic control method exhibits a significant performance drop in dynamic environments, with its obstacle avoidance success rate decreasing from 95% in static scenarios to 68% when facing dynamic obstacles, indicating a 28.4% reduction in environmental adaptability. (2) Distributed cooperative control is constrained by high computational complexity and heavy reliance on communications, leading to inadequate real-time performance and difficulties in addressing nonlinear constraints. (3) Despite its adaptability, DRL control encounters bottlenecks such as high training costs, difficulties in safety assurance, and limited generalization capability [8]. (4) All three approaches exhibit critical shortcomings in dynamic adaptability, real-time performance, safety, and generalization, underscoring the urgent demand for new solutions.

3. A Fusion Framework of Adaptive Game Theory and Deep Reinforcement Learning

To overcome the defects of traditional UAV formation cooperative control methods in dynamic environment adaptability, real-time performance, safety, and generalization ability, a UAV formation cooperative control method based on adaptive game theory and deep reinforcement learning is proposed [9]. This method overcomes the limitations of traditional cooperative control approaches while retaining several of their advantages, providing a new avenue for UAV formation cooperative control.

3.1. Improvement of Dynamic Game Theory

This section details the enhancement strategies grounded in dynamic game theory, focusing on three key aspects: incomplete information dynamic game modeling, mean field game acceleration algorithms, and dynamic weight adjustment mechanisms. Beginning with game modeling, proceeding to game acceleration, and concluding with weight adjustment, a comprehensive theoretical improvement framework is established.
(1) Incomplete Information Dynamic Game Modeling
Moving beyond the complete information assumption of classical game theory, as illustrated in Figure 1, a three-tier information fusion architecture is constructed, comprising the physical layer, the intention layer, and the game layer [10]. The UAV formation gathers environmental data through various sensors at the physical layer and relays this information to the intention layer to support group decision-making. The intention layer processes and synthesizes information from the physical layer, forms preliminary decisions, and transmits them to the game layer. The game layer integrates these decision schemes, conducts threat assessment using multi-source uncertain information, and provides a logical foundation for evasion and other tactical maneuvers by the UAV formation.
Physical Layer: The Interactive Multi-Model Extended Kalman Filter (IMM-EKF) algorithm with multi-sensor fusion is employed, integrating data from GPS, IMU, and visual sensors to construct a covariance matrix updated via Markov switching. Even under strong electromagnetic interference, the positioning error remains within 0.3 m (RMS). Multi-sensor fusion improves positioning accuracy by approximately 62% compared to single-sensor systems.
Intention Layer: An Interactive Partially Observable Markov Decision Process (I-POMDP) is designed to predict adversarial tactical actions through an opponent modeling module. This module combines a Support Vector Machine (SVM) classifier and a Hidden Markov Model (HMM), achieving a prediction accuracy of 87.2% for aerial targets.
Game Layer: The Dempster–Shafer evidence theory is introduced to formulate a threat payoff matrix. Multi-source uncertain information is incorporated into threat assessment through evidence fusion. Using Bayesian methods for multi-source credibility fusion, threat assessment accuracy is improved by nearly 41.3%.
(2) Mean Field Game Acceleration Algorithm
To address the high computational complexity of traditional cooperative formation control, a mean field approximation method is further optimized based on state aggregation and strategy propagation:
State Aggregation: By incorporating a dynamic density threshold and integrating the Shared Nearest Neighbor (SNN) similarity measure from the DBSCAN algorithm, the clustering method used for multi-scale game aggregation is enhanced [11].
The dynamic clustering process using the improved DBSCAN algorithm is described as follows:
Dynamic Density Threshold:
$$\epsilon(t) = \epsilon_0 \exp\!\left(-\lambda\,\frac{\mathrm{count}\bigl(N_{eps}(x_i)\bigr)}{N}\right)$$
In the formula, $\epsilon_0$ is the initial threshold, $\lambda$ is the attenuation coefficient, and $\mathrm{count}(N_{eps}(x_i))$ represents the number of samples within the neighborhood of sample $x_i$. The exponential form makes the threshold change dynamically with the neighborhood sample density: when the UAVs in the neighborhood are dense (for example, when the cluster contracts during obstacle avoidance), $\mathrm{count}(N_{eps}(x_i))$ increases and the threshold $\epsilon(t)$ decreases, which avoids grouping widely separated UAVs into the same cluster and thereby preserves clustering accuracy.
Formula for Shared Nearest Neighbor (SNN) similarity:
$$SNN(x_i, x_j) = \bigl|\,N_k(x_i) \cap N_k(x_j)\,\bigr|$$
In the formula, $N_k(x_i)$ denotes the k-nearest-neighbor set of sample $x_i$, so the similarity counts the neighbors shared by the two samples. When this measure is used to partition the game, the computational scale is reduced from $O(N^2)$ to $O(K^2)$ (where K is the number of clusters and K << N); at a scale of around 200 UAVs, processing efficiency improves by approximately 58 times.
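The following sketch illustrates the idea of combining a dynamic density threshold with SNN similarity in a simplified DBSCAN-style clustering routine; the threshold parameters, neighborhood radius, and example coordinates are illustrative assumptions, and the routine is a simplification of the paper's aggregation strategy rather than its exact algorithm:

```python
import numpy as np

def knn_sets(points, k):
    """Return the index set of the k nearest neighbors for every point."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return [set(np.argsort(row)[:k]) for row in d]

def snn_similarity(nbrs, i, j):
    """Shared-nearest-neighbor similarity: size of the intersection of k-NN sets."""
    return len(nbrs[i] & nbrs[j])

def dynamic_eps(points, i, eps0=10.0, lam=2.0, radius=10.0):
    """eps(t) = eps0 * exp(-lam * count(N_eps(x_i)) / N): shrink the distance
    threshold as the local neighborhood becomes denser."""
    d = np.linalg.norm(points - points[i], axis=1)
    count = np.sum((d < radius) & (d > 0))
    return eps0 * np.exp(-lam * count / len(points))

def snn_clusters(points, k=3, min_shared=2):
    """Group UAVs whose SNN similarity is at least min_shared and whose distance
    is below each point's dynamic threshold (simplified SNN/DBSCAN-style clustering)."""
    n = len(points)
    nbrs = knn_sets(points, k)
    labels = [-1] * n
    cluster = 0
    for i in range(n):
        if labels[i] != -1:
            continue
        labels[i] = cluster
        stack = [i]
        while stack:
            p = stack.pop()
            for q in range(n):
                if labels[q] == -1 \
                   and np.linalg.norm(points[p] - points[q]) < dynamic_eps(points, p) \
                   and snn_similarity(nbrs, p, q) >= min_shared:
                    labels[q] = cluster
                    stack.append(q)
        cluster += 1
    return labels

# Example: two spatially separated groups of UAVs
pts = np.array([[0, 0], [1, 0], [0, 1], [1, 1],
                [30, 30], [31, 30], [30, 31], [31, 31]], dtype=float)
print(snn_clusters(pts))   # -> [0, 0, 0, 0, 1, 1, 1, 1]
```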
Strategy Propagation Model: A Spatiotemporal Graph Convolutional Network (ST-GCN) is established to simulate adjacency relationships, and temporal convolution is used to model state evolution, thereby mapping the optimal process from local decision-making to global coordination [12].
The expression of the Spatiotemporal Graph Convolutional Network (ST-GCN) is as follows:
$$X_{t+1}^{l} = \sigma\bigl(A\, X_t^{l}\, W_s^{l}\bigr)$$
Among them, $A$ is the adjacency matrix, $X_t^{l}$ is the feature of the l-th layer at time t, and $W_s^{l}$ is the spatial weight matrix.
When a UAV detects an obstacle, the entry corresponding to the obstacle in the adjacency matrix $A$ is set to 1. The ST-GCN aggregates the obstacle position features to adjacent UAVs through $A X_t^{l}$ and, after the weight transformation, generates obstacle avoidance-related features (such as the priority of the obstacle avoidance direction).
Let $\mathrm{TCN}(\cdot)$ denote the temporal convolution operation and $\tau$ the length of the time window. The temporal convolution is expressed as follows:
$$X_{t+1}^{l} = \sigma\bigl(\mathrm{TCN}(X_t^{l}, X_{t-1}^{l}, \ldots, X_{t-\tau}^{l})\, W_t^{l}\bigr)$$
The motion patterns in historical trajectories (such as uniform speed and turning) can be extracted through the convolution kernel $W_t^{l}$ to predict future positions. For example, when the UAV group is in a "V" formation, temporal convolution can predict the formation offset at the next step based on the trajectories of the previous three time steps. At the same time, it can combine historical threat trajectories (such as missile flight paths) to predict the threat position 0.3 s in advance, allowing the UAVs to adjust the obstacle avoidance strategy ahead of time.
Using the strategy propagation model, the formation reconstruction time can be shortened to 0.8 s.
(3) Dynamic Weight Adjustment Mechanism
Dynamic weighting is implemented using a twin-delayed deep deterministic policy gradient approach. The Critic network estimates long-term returns, while under the guidance of the Actor network, weight parameters for safety/survival and mission objectives are dynamically balanced to an optimal ratio. This enables swift transition to safe operational modes using near-optimal policies [13].
The hybrid function integrating game theory and reinforcement learning is expressed as follows:
$$U_{total} = \alpha\, U_{game} + (1-\alpha)\, U_{DRL}$$
where $\alpha \in [0, 1]$ is the fusion weight adjusted dynamically by the mechanism described above.

3.2. Enhanced Design of Deep Reinforcement Learning

To address the challenges of low exploration efficiency and compromised safety in multi-UAV formations operating in complex environments, this section proposes three enhanced deep reinforcement learning strategies: a dual-channel experience replay mechanism, a multi-scale attention policy network, and formal safety constraints. These approaches collectively improve both exploration effectiveness and operational safety under demanding conditions.
(1) Dual-Channel Experience Replay Mechanism
The dual channels consist of two complementary mechanisms: priority sampling and curriculum learning [14].
Priority Channel: A joint weighting method based on TD error and game-theoretic payoff functions dynamically optimizes sampling probability, increasing the replay likelihood of critical samples.
The mechanism is mathematically expressed as follows:
$$p_i = \frac{(\delta_i + \epsilon)^{\alpha}\,(R_i + 1)^{\beta}}{\sum_j (\delta_j + \epsilon)^{\alpha}\,(R_j + 1)^{\beta}}$$
Among them, $p_i$ is the jointly weighted sampling probability based on the TD error and the payoff function, $\delta_i$ is the TD error, $R_i$ is the payoff value, and $\alpha$ and $\beta$ are weight coefficients. This strategy increases the replay probability of key samples by nearly 6.3 times.
Curriculum Channel: Reflecting the practical progression from static obstacle avoidance to dynamic confrontation as the environment changes, a task sequence of gradually increasing difficulty is proposed to adaptively control the policy-gradient learning process.
The task sequence function is as follows:
$$d(t) = d_0 + (d_{\max} - d_0)\,\frac{1}{1 + \exp\bigl(-k\,(t - t_0)\bigr)}$$
In the formula, $d(t)$ is the task difficulty at time t, $d_0$ and $d_{\max}$ are the initial and maximum difficulty, $t_0$ is the midpoint of the schedule, and $k$ is a scalar steepness parameter. This schedule accelerates policy convergence by close to 72% and raises sample efficiency to $1.5 \times 10^{3}$ frames per policy update.
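A minimal sketch of both replay channels, combining the joint TD-error/payoff sampling weight with the sigmoid curriculum schedule; all numerical parameters below (alpha, beta, d0, d_max, k, t0) are placeholders rather than values reported in the paper:

```python
import numpy as np

def priority_probabilities(td_errors, payoffs, alpha=0.6, beta=0.4, eps=1e-3):
    """Joint TD-error / payoff weighting: p_i proportional to (|delta_i|+eps)^alpha * (R_i+1)^beta."""
    w = (np.abs(td_errors) + eps) ** alpha * (payoffs + 1.0) ** beta
    return w / w.sum()

def curriculum_difficulty(t, d0=0.1, d_max=1.0, k=0.05, t0=100):
    """Sigmoid schedule d(t) = d0 + (d_max - d0) / (1 + exp(-k (t - t0)))."""
    return d0 + (d_max - d0) / (1.0 + np.exp(-k * (t - t0)))

# Example: sample a replay batch and query the current task difficulty
td = np.array([0.9, 0.1, 0.4, 2.0])            # TD errors of stored transitions
payoff = np.array([0.2, 0.0, 1.5, 0.8])         # game-theoretic payoffs of those transitions
probs = priority_probabilities(td, payoff)
batch_idx = np.random.choice(len(td), size=2, p=probs, replace=False)
print(probs, batch_idx, curriculum_difficulty(t=150))
```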
(2) Multi-Scale Attention Policy Network
A hybrid attention mechanism is introduced to enhance environmental perception capability (Figure 2).
Spatial Attention: A Squeeze-and-Excitation (SE) network is employed to recalibrate extracted features and emphasize high-risk regions, raising obstacle detection accuracy to 95.1%, a 12.4% improvement over the baseline model [15]. The spatial attention module comprises two operations: squeeze (global average pooling) and excitation (channel-wise weight generation).
The squeeze operation compresses spatial feature descriptors via global average pooling:
$$Z_c = F_{sq}(X) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} X_{i,j,c}$$
Among them, $X \in \mathbb{R}^{H \times W \times C}$ is the input feature map (H and W are the spatial dimensions, C is the number of channels), and $Z_c$ is the global feature of the c-th channel.
The excitation operation (channel weight generation) captures the interdependence between channels through a two-layer fully connected network:
$$s_c = F_{ex}(z, W) = \sigma\bigl(W_2\, \delta(W_1 z)\bigr)$$
In the above formula, $W_1 \in \mathbb{R}^{C/r \times C}$ is the dimensionality-reduction matrix, r is the compression ratio (typically 16), $\delta$ is the ReLU activation function, $W_2 \in \mathbb{R}^{C \times C/r}$ is the dimensionality-restoration matrix, $\sigma$ is the Sigmoid activation function, and $s_c$ is the resulting channel weight. During reduction, $W_1$ maps the pooled channel descriptor from C dimensions down to C/r dimensions (1/16 of the original when r = 16); during restoration, $W_2$ maps it back to C dimensions, and the Sigmoid squashes each channel weight into (0, 1).
Feature Recalibration: The channel weights are multiplied with the original features to achieve adaptive feature enhancement [16], expressed as $\tilde{X}_{i,j,c} = X_{i,j,c} \times s_c$. Experimental results show that this module raises obstacle recognition accuracy to 95.1%, about 12.4% higher than the 82.7% obtained without channel-weight recalibration, and greatly improves the detection rate of radar-stealth targets (by about 18.3%).
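A minimal sketch of the squeeze, excitation, and recalibration steps on a single feature map, using small illustrative dimensions and random matrices in place of trained weights:

```python
import numpy as np

def se_block(X, W1, W2):
    """Squeeze-and-Excitation recalibration on a feature map X of shape (H, W, C).
    W1: (C/r, C) reduction matrix; W2: (C, C/r) restoration matrix."""
    z = X.mean(axis=(0, 1))                     # squeeze: global average pooling -> (C,)
    hidden = np.maximum(W1 @ z, 0.0)            # excitation step 1: reduce + ReLU
    s = 1.0 / (1.0 + np.exp(-(W2 @ hidden)))    # excitation step 2: restore + Sigmoid -> (C,)
    return X * s                                # recalibrate: scale each channel by its weight

# Example with C = 8 channels and compression ratio r = 4 (illustrative sizes)
rng = np.random.default_rng(0)
X = rng.normal(size=(16, 16, 8))
W1 = rng.normal(size=(2, 8)) * 0.1
W2 = rng.normal(size=(8, 2)) * 0.1
print(se_block(X, W1, W2).shape)   # (16, 16, 8)
```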
Temporal Attention: Threat patterns are modeled from historical trajectories using a Transformer, and the multi-head attention mechanism captures long-term dependencies. The time required for the policy to respond to dynamic threats is reduced to less than 0.3 s. The formula is as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V$$
In the formula, $Q$, $K$, and $V$ respectively represent the query, key, and value matrices, and $d_k$ is the key dimension.
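A minimal sketch of this scaled dot-product attention over a short trajectory sequence; the token count and embedding size are illustrative:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V over a sequence of trajectory tokens."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))   # stable softmax
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Example: 5 historical trajectory tokens with 4-dimensional embeddings
rng = np.random.default_rng(1)
Q, K, V = (rng.normal(size=(5, 4)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)   # (5, 4)
```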
In summary, the multi-scale attention strategy, by improving the accuracy and efficiency of environmental perception in the fusion framework, largely enhances the coordination ability of UAV clusters in dynamic obstacle avoidance. This strategy integrates spatiotemporal key information, optimizes the decision input of the intention layer, and provides a high-precision perception basis for game theory threat assessment and DRL strategy generation, thereby realizing rapid reconstruction of formation and safe obstacle avoidance in complex scenarios.
(3) Formal Safety Constraint Layer
Formal safety constraints involve embedding the Control Barrier Function (CBF) into the strategy network to construct a safe reinforcement learning framework, with the formula expressed as follows:
$$\dot{h}(x) \geq -\alpha\bigl(h(x)\bigr)$$
where $h(x)$ is the safety state function and $\alpha(\cdot)$ is an extended class-$\mathcal{K}$ function. Through Lyapunov stability theory, this mechanism ensures that the trajectory offset is always ≤2 m, thereby guaranteeing the safety of dynamic aggregation and cooperative obstacle avoidance of multiple UAVs.
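A minimal one-dimensional sketch of how such a CBF condition filters a nominal command, assuming single-integrator dynamics toward a static obstacle; the function name and the value of alpha are illustrative, while the 5 m margin matches the safety threshold used later in Section 4.5:

```python
def cbf_safe_velocity(x, x_obs, u_nominal, d_safe=5.0, alpha=1.0):
    """Filter a nominal velocity command with a control barrier function.
    Single-integrator model x_dot = u, approaching a static obstacle at x_obs.
    h(x) = (x_obs - x) - d_safe; safety requires h_dot >= -alpha * h, i.e. u <= alpha * h."""
    h = (x_obs - x) - d_safe
    u_max = alpha * h               # largest approach velocity that keeps h >= 0 invariant
    return min(u_nominal, u_max)

# Example: UAV at x = 0 m, obstacle at x = 12 m, nominal command 10 m/s
for x in [0.0, 4.0, 6.5]:
    print(x, cbf_safe_velocity(x, x_obs=12.0, u_nominal=10.0))
# As the UAV closes in, the admissible velocity shrinks and reaches 0 at the 5 m boundary.
```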

4. Design of Dynamic Aggregation Cooperative Obstacle Avoidance Algorithm

The dynamic aggregation cooperative obstacle avoidance algorithm is designed to bridge theory and practice, addressing the challenge of engineering implementation for dynamic cooperative obstacle avoidance in UAV formations.

4.1. System Architecture and Mathematical Formulation

The dynamic aggregation cooperative obstacle avoidance system adopts a three-layer information fusion architecture—comprising the physical layer, intention layer, and game layer—within a hierarchical control structure. This framework establishes a closed loop from environmental perception to strategy execution [17].
If the system state space is represented in generalized coordinates, the state of each UAV can be described by its kinematic state:
$$s_i = [\,x_i,\ y_i,\ z_i,\ \dot{x}_i,\ \dot{y}_i,\ \dot{z}_i,\ \theta_i\,]^{T}$$
In the formula, the first three components $(x_i, y_i, z_i)$ represent the spatial position, the middle three components $(\dot{x}_i, \dot{y}_i, \dot{z}_i)$ represent the velocity, and $\theta_i$ represents the heading angle.

4.2. Multi-Modal Perception Fusion Model

This section designs an evidence theory fusion mechanism and a spatiotemporal feature extraction network to establish a sensor-based multi-modal perception fusion model, aiming to solve the perception problem during dynamic cooperative obstacle avoidance of UAVs.
(1) Evidence Theory Fusion Mechanism
To address the drawback that a single sensor cannot cope with complex environmental conditions [18], a multi-source sensor data fusion rule is designed based on the Dempster–Shafer evidence theory. Let the basic probability assignment functions over environmental states from the millimeter-wave radar and the vision system be $m_1$ and $m_2$, respectively; the fused probability assignment function is then:
$$m(A) = \frac{1}{1-k} \sum_{B \cap C = A} m_1(B)\, m_2(C)$$
In the formula, the conflict coefficient $k = \sum_{B \cap C = \varnothing} m_1(B)\, m_2(C)$ quantifies the degree of evidence conflict between the two sensors. When k > 0.7, the Yager correction rule is triggered and the conflicting mass is reassigned to the universal set $\Omega$ to avoid misjudgment caused by evidence conflict:
$$m'(\Omega) = m(\Omega) + k, \qquad m'(A) = m(A) \quad (A \neq \Omega)$$
The physical meaning of the conflict coefficient k (as defined in Equation (21)) is to quantify the level of confidence and consistency among different sensor evidence sources. Its value range is [0, 1], where a higher value indicates more severe conflicts between sensor observations.
When k ≈ 0, it indicates high consistency among sensor observations, and the fusion result exhibits high confidence. When k → 1, it signifies a fundamental contradiction between sensor observations, suggesting either sensor failure or extreme ambiguity in the perception environment (e.g., millimeter-wave radar generating false alarms under strong interference, while an optical camera fails due to heavy fog, leading to missed detections). In this study, the threshold k > 0.7 was determined based on statistical analysis of extensive experimental data. This critical value implies that if the D-S combination rule were applied directly, the probability of the fusion result producing misleading decisions would exceed the acceptable safety tolerance. At this point, triggering Yager's modification rule (Equation (22)) carries a deeper physical significance: the system actively acknowledges the high uncertainty in the current perception and reassigns the conflict mass $m(\varnothing)$, instead of simply discarding it, to the frame of discernment $\Omega$. This essentially transforms evidential conflict into an explicit quantification of uncertainty, equivalent to the system outputting an "undetermined" conclusion.
Based on this, the system can trigger degraded safety strategies (such as cautious deceleration, hovering, or requesting human intervention), thereby avoiding fatal decisions under inaccurate perception conditions and fundamentally enhancing the system’s robustness and safety.
In summary, k > 0.7 serves as a self-diagnostic indicator of the system’s perceptual reliability, while Yager’s modification acts as a corresponding fault-tolerant mechanism that ensures safe operation even under the most adverse perceptual conditions.
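A minimal sketch of Dempster–Shafer combination with the Yager-style fallback over a two-hypothesis frame; the sensor masses are made-up example values, and representing hypotheses as frozensets is an implementation choice rather than the paper's formulation:

```python
def ds_fuse(m1, m2, conflict_threshold=0.7):
    """Dempster-Shafer combination of two basic probability assignments (dicts mapping
    frozenset hypotheses to mass). Falls back to a Yager-style correction when the
    conflict coefficient k exceeds the threshold: conflict mass goes to the full frame."""
    frame = frozenset().union(*m1.keys(), *m2.keys())
    fused, k = {}, 0.0
    for B, mB in m1.items():
        for C, mC in m2.items():
            inter = B & C
            if inter:
                fused[inter] = fused.get(inter, 0.0) + mB * mC
            else:
                k += mB * mC                        # conflicting mass
    if k > conflict_threshold:                      # Yager correction: keep unnormalized masses,
        fused[frame] = fused.get(frame, 0.0) + k    # assign the conflict to the universal set
        return fused, k
    return {A: v / (1.0 - k) for A, v in fused.items()}, k   # standard Dempster normalization

# Example: radar and camera evidence over {obstacle, clear}
radar  = {frozenset({"obstacle"}): 0.7, frozenset({"obstacle", "clear"}): 0.3}
camera = {frozenset({"obstacle"}): 0.6, frozenset({"clear"}): 0.2,
          frozenset({"obstacle", "clear"}): 0.2}
print(ds_fuse(radar, camera))
```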
(2) Spatiotemporal Feature Extraction Network
Referring to the ST-GCN architecture [19], an environmental perception model that fuses spatial topology and time-series features is designed. The spatial convolution part uses the adjacency matrix $A$ to represent the topological relationship between UAVs and obstacles:
$$A_{ij} = \begin{cases} 1, & \text{if } \|s_i - s_j\| < 50\ \mathrm{m} \\ 0, & \text{otherwise} \end{cases}$$
Here $D$ is the degree matrix, and the adjacency matrix is normalized as $\hat{A} = D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$. An entry equal to 1 indicates a threat relationship and an entry equal to 0 indicates no threat. The forward propagation process can be expressed as $X_{t+1}^{l} = \sigma(\hat{A} X_t^{l} W_s^{l})$. The condition for an element $A_{ij} = 1$, indicating a "threat relationship" between node i and node j, is defined by the simultaneous satisfaction of two quantitative criteria. Distance threshold: the Euclidean distance $d_{ij}$ between obstacle j and UAV i must be less than the threat distance $d_{threat}$, set to 20 m; this value is significantly larger than the safety distance (5 m), thereby providing sufficient reaction time for cluster-wide cooperative decision-making and maneuvering. Relative velocity threshold: the magnitude of the relative velocity vector $V_r$ must be greater than 1 m/s, and its direction must be towards UAV i (i.e., yielding a finite positive time to collision, TTC); this criterion filters out static or co-directionally moving targets that, despite being nearby, pose no imminent collision risk, ensuring the system only reacts to obstacles that constitute a genuine immediate threat. A threat relationship ($A_{ij}$ = 1) is established, triggering subsequent graph convolutional information aggregation and cooperative obstacle avoidance strategies, only if both conditions are met simultaneously (see the sketch at the end of this subsection). For temporal convolution, a bidirectional gated recurrent unit (Bi-GRU) is used to model the temporal dependence of the state sequence:
$$\overrightarrow{h}_t = \mathrm{GRU}\bigl(\overrightarrow{h}_{t-1},\, X_t^{l} W_t^{l}\bigr)$$
$$\overleftarrow{h}_t = \mathrm{GRU}\bigl(\overleftarrow{h}_{t+1},\, X_t^{l} W_t^{l}\bigr)$$
The fused temporal feature $h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$ is used for long-term prediction and dynamic evaluation of obstacle trajectories, forming a closed technical loop with the idea of "modeling state evolution using temporal convolution" in the strategy propagation model.
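A minimal sketch of the two threat criteria described above, building the threat adjacency entries from positions and velocities; the function name and the example states are illustrative, while the 20 m and 1 m/s thresholds follow the text:

```python
import numpy as np

def threat_adjacency(uav_pos, uav_vel, obs_pos, obs_vel,
                     d_threat=20.0, v_min=1.0):
    """Set A_ij = 1 only when obstacle j is both close to UAV i (distance < d_threat)
    and closing on it (relative speed > v_min with a positive closing component)."""
    n, m = len(uav_pos), len(obs_pos)
    A = np.zeros((n, m), dtype=int)
    for i in range(n):
        for j in range(m):
            r = uav_pos[i] - obs_pos[j]           # vector from obstacle to UAV
            d = np.linalg.norm(r)
            v_rel = obs_vel[j] - uav_vel[i]        # obstacle velocity relative to the UAV
            closing = np.dot(v_rel, r / d) > 0     # moving toward the UAV -> finite positive TTC
            if d < d_threat and np.linalg.norm(v_rel) > v_min and closing:
                A[i, j] = 1
    return A

# Example: one obstacle approaching head-on, one receding to the side
uav_p = np.array([[0., 0., 10.]]); uav_v = np.array([[5., 0., 0.]])
obs_p = np.array([[15., 0., 10.], [10., 30., 10.]])
obs_v = np.array([[-3., 0., 0.], [0., 4., 0.]])
print(threat_adjacency(uav_p, uav_v, obs_p, obs_v))   # [[1 0]]
```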

4.3. Game Theory-Based Threat Assessment Model

This section develops a threat assessment model based on game theory, incorporating incomplete information dynamic games and mean field game control laws to address threat evaluation in UAV formation obstacle avoidance.
(1) Modeling of Incomplete Information Dynamic Game
A cooperative obstacle avoidance model under incomplete information dynamic game is formulated, which abstracts multi-UAV cooperative obstacle avoidance into a triple comprising participants, strategy space, and utility function [20].
The set of participants comprises the UAV cluster $U = \{U_1, U_2, \ldots, U_N\}$ and the threat sources $T = \{T_1, T_2, \ldots, T_M\}$.
The strategy space $A_i$ is a set of typical obstacle avoidance actions such as three-dimensional acceleration, steering, and hovering, subject to the dynamic constraint of the UAV's maximum acceleration of 2 m/s².
The utility function is defined as follows:
$$u_i = w_1\, \mathrm{detect}_i - w_2\, \mathrm{false}_i - w_3\, \mathrm{dist}_i$$
Among them, $\mathrm{detect}_i$ denotes the threat detection probability, $\mathrm{false}_i$ represents the false alarm rate, and $\mathrm{dist}_i$ indicates the Euclidean distance between the UAV and the threat source. The weight vector $w = [w_1, w_2, w_3]^{T}$ dynamically balances the contributions of threat detection probability, false alarm rate, and Euclidean distance within the model [21]. As the "payoff index" of the game model, the utility function transforms the multi-UAV dynamic obstacle avoidance problem into solving an equilibrium of a non-cooperative game. Each UAV aims to maximize its own utility while accounting for the strategies of other UAVs, such as interference from neighboring UAVs' avoidance maneuvers, ultimately leading to an emergent cooperative obstacle avoidance strategy. The values of the weight vector are determined through a multi-objective Pareto optimization, as detailed below.
The assignment of values to the weight vector $w = [w_1, w_2, w_3]^{T}$ is crucial, as it determines the relative importance of threat detection probability, false alarm rate, and relative distance in the utility function. To determine its optimal values scientifically, we adopted an optimization framework based on multi-objective Pareto optimality. The problem is formulated as a three-objective optimization task with the following objective functions: (1) maximizing the threat detection probability (F1); (2) minimizing the false alarm rate (F2); (3) minimizing the average relative distance to threat sources (F3). These three objectives conflict with one another (e.g., improving detection probability may increase the false alarm rate), so no single global optimal solution exists; instead, a set of Pareto optimal solutions is obtained.
Algorithm selection: The non-dominated sorting genetic algorithm (NSGA-II) [22], a classic multi-objective evolutionary algorithm, was employed to solve this Pareto front. NSGA-II efficiently obtains a well-distributed set of Pareto optimal solutions through fast non-dominated sorting, crowding distance calculation, and elitist preservation.
Optimization workflow: A training set encompassing various typical threat scenarios (e.g., differing threat densities, speeds, and motion patterns) was constructed in a simulation environment. In each scenario, the NSGA-II algorithm was executed, with the weight vector w as the decision variable and the performance of the three objective functions (F1, F2, F3) as the fitness criteria for iterative evolution.
Analysis of solution set and weight selection: After iterations, the algorithm output a set of Pareto optimal solutions, where each solution (i.e., a specific weight combination) represents an optimal trade-off among the three objectives. Finally, based on the fuzzy membership function method [23], a compromise solution that best meets engineering requirements was selected from the solution set (w1 = 0.5, w2 = 0.3, w3 = 0.2). This weight combination achieves a robust balance between high threat detection rates and low false alarm rates across most test scenarios, ensuring the rationality and robustness of the utility function evaluation results.
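A minimal sketch of the utility evaluation with the selected weights (w1 = 0.5, w2 = 0.3, w3 = 0.2); normalizing the distance term to [0, 1] is an assumption made here so that the three terms are comparable:

```python
def threat_utility(detect_prob, false_alarm, dist, w=(0.5, 0.3, 0.2)):
    """Game payoff u_i = w1*detect_i - w2*false_i - w3*dist_i.
    dist is assumed normalized to [0, 1] (an assumption of this sketch)."""
    w1, w2, w3 = w
    return w1 * detect_prob - w2 * false_alarm - w3 * dist

# Example: compare two candidate avoidance actions for one UAV
print(threat_utility(0.9, 0.10, 0.4))   # aggressive sensing action
print(threat_utility(0.6, 0.02, 0.8))   # conservative stand-off action
```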
(2) Mean Field Game Control Law
To achieve state aggregation and solve the dimensionality disaster problem of cluster games, the mean field game acceleration algorithm is used to approximate the interaction of individuals as changes in the statistical characteristics of the group, and the equation of the density distribution ρ ( x , t ) within the group is obtained:
$$\frac{\partial \rho}{\partial t} + \nabla \cdot (\rho v) = \frac{1}{2}\,\nabla^{2}(\rho D)$$
This formula can be combined with the dynamic density threshold to obtain the macroscopic control law of the average velocity $v$ and the diffusion coefficient $D$. When the number of points in the neighborhood increases, the group converts the individual game process into an inter-cluster game process and adaptively reduces the corresponding threshold to preserve clustering accuracy, thereby reducing the computational complexity of the algorithm.

4.4. Adaptive Formation Control Algorithm

An adaptive control algorithm is designed to achieve both formation maintenance and in-flight reconfiguration of UAV formations.
(1) Improved Virtual Force Field Model
This model integrates bionic control principles with a variable-parameter virtual force field to balance obstacle avoidance safety and formation stability. It defines both repulsive and attractive interaction models between UAVs:
$$F_{ij}^{r} = k_r(t)\left(\frac{1}{\|r_{ij}\|^{2}} - \frac{1}{r_0^{2}}\right)\frac{r_{ij}}{\|r_{ij}\|}, \qquad F_{ij}^{a} = k_a(t)\bigl(\|r_{ij}\| - r_0\bigr)\frac{r_{ij}}{\|r_{ij}\|}$$
In the formula, $k_r(t)$ and $k_a(t)$ are coefficients dynamically adjusted with the environmental threat level, $\|r_{ij}\| = \|s_i - s_j\|$ is the relative distance, and $r_0 = 10$ m is the expected distance. When the UAV formation detects a threat, the dynamic weight adjustment mechanism increases the repulsion coefficient $k_r(t)$ so that neighboring UAVs are pushed apart quickly for obstacle avoidance. After the threat is cleared, the attraction coefficient $k_a(t)$ increases to help the formation recover quickly.
The Lyapunov function $V = \frac{1}{2}\sum_{i<j}\left(\frac{1}{\|r_{ij}\|^{2}} - \frac{1}{r_0^{2}}\right)^{2}$ is complementary to cluster-cohesion theory [23]. If its time derivative satisfies $\dot{V} \leq 0$, the system asymptotically converges to the desired formation.
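A minimal sketch of the variable-parameter virtual force between two UAVs, implementing the equation above; the coefficient values passed in are illustrative, and in the full controller k_r(t) and k_a(t) would be set by the dynamic weight adjustment mechanism:

```python
import numpy as np

def virtual_forces(s_i, s_j, k_r, k_a, r0=10.0):
    """Variable-parameter virtual force field between UAV i and UAV j.
    k_r and k_a are illustrative placeholders; the controller adjusts them with the threat level."""
    r_ij = s_j - s_i
    dist = np.linalg.norm(r_ij)
    unit = r_ij / dist
    f_rep = k_r * (1.0 / dist**2 - 1.0 / r0**2) * unit   # repulsive term
    f_att = k_a * (dist - r0) * unit                     # attractive term
    return f_rep + f_att

# Example: neighbor at 4 m (closer than the 10 m expected spacing);
# the combined force pushes UAV i back toward the expected spacing.
print(virtual_forces(np.array([0., 0., 20.]), np.array([4., 0., 20.]), k_r=5.0, k_a=0.5))
```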
(2) Dynamic Aggregation Reconstruction Strategy
The dynamic aggregation reconstruction strategy quantifies the degree of dispersion of the formation through an index. Let $s_i^{ref}$ represent the expected state; the formula is as follows:
$$C = \frac{1}{N} \sum_{i=1}^{N} \frac{\bigl\|s_i - s_i^{ref}\bigr\|}{\bigl\|s_i^{ref}\bigr\|}$$
When C > 0.3, the dynamic aggregation and reorganization strategy is activated. The process consists of three stages: first, UAVs are dynamically clustered to identify new cluster centers; then, a Spatio-Temporal Graph Convolutional Network (ST-GCN) captures both spatial topology and temporal evolution features to predict the behavior of neighboring UAVs; finally, the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm optimizes individual trajectories under constraints to determine the optimal control policy.
This closed-loop “detection–prediction–execution” framework ensures robust formation stability in dynamic environments.
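A minimal sketch of the dispersion index and the 0.3 reconfiguration trigger; the reference slots and current states are illustrative:

```python
import numpy as np

def dispersion_index(states, ref_states):
    """Formation dispersion C = (1/N) * sum ||s_i - s_i_ref|| / ||s_i_ref||."""
    states, ref_states = np.asarray(states), np.asarray(ref_states)
    errors = np.linalg.norm(states - ref_states, axis=1)
    norms = np.linalg.norm(ref_states, axis=1)
    return float(np.mean(errors / norms))

# Example: three UAVs, one far from its assigned slot -> C exceeds the 0.3 trigger
ref = [[10., 0., 20.], [0., 10., 20.], [-10., 0., 20.]]
cur = [[11., 0., 20.], [0., 9., 20.], [5., 12., 20.]]
C = dispersion_index(cur, ref)
print(C, "reconfigure" if C > 0.3 else "hold formation")
```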

4.5. Hybrid Decision Engine Design

A hybrid decision engine is designed to ensure the decision-making of UAV formation actions during flight, enabling the UAV formation to coordinately analyze and handle situations.
(1) AG-DRL Fusion Strategy
Combining game theory with the adaptive optimization capability of deep reinforcement learning, a hybrid decision function is designed:
$$\pi(a \mid s) = (1-\omega)\,\pi_{game}(a \mid s) + \omega\,\pi_{drl}(a \mid s)$$
Among them, $\pi_{game}$ is the equilibrium strategy solved by game theory, and $\pi_{drl}$ is the adaptive strategy generated by DRL. The dynamic weight $\omega \in [0, 1]$ is updated through the TD3 framework, $\alpha$ is the learning rate used in that update, and $s$ represents the current state vector of the UAV.
The mathematical expression is as follows:
$$\omega_t = \omega_{t-1} + \alpha\, \nabla_{\omega} Q(s, a; \omega)$$
$Q(s, a; \omega)$ represents the value function fusing game-theoretic payoffs and DRL rewards. If $r$ is the immediate reward, $\gamma$ is the discount factor, and $Q_i'$ and $\mu'$ are the value and policy functions of the target networks, then the Critic network evaluates the long-term return of the state-action pair as follows:
$$Q(s, a; \omega) = \mathbb{E}\Bigl[\, r + \gamma \min_{i=1,2} Q_i'\bigl(s', \mu'(s'); \omega\bigr) \Bigr]$$
The Critic network is used to evaluate the long-term return of state-action pairs, fusing game theory payoffs and deep reinforcement learning (DRL) rewards, and balances the obstacle avoidance strategy and task execution strategy by dynamically adjusting the weight ω [24]. Thus, the coordinated optimization of safety and efficiency of multiple UAVs in complex environments is realized.
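A minimal sketch of the policy blending step over a small discrete set of candidate maneuvers; the discrete action set is a simplification (the paper's action space is continuous), and the probability values and omega settings are illustrative:

```python
import numpy as np

def hybrid_action_distribution(pi_game, pi_drl, omega):
    """Blend the game-equilibrium policy and the DRL policy over a discrete action set:
    pi = (1 - omega) * pi_game + omega * pi_drl."""
    pi = (1.0 - omega) * np.asarray(pi_game) + omega * np.asarray(pi_drl)
    return pi / pi.sum()   # renormalize against numerical drift

# Example: four candidate maneuvers {climb, turn-left, turn-right, hover}
pi_game = [0.10, 0.60, 0.20, 0.10]   # equilibrium strategy from the game layer
pi_drl  = [0.40, 0.20, 0.30, 0.10]   # adaptive strategy from the DRL policy network
for omega in (0.2, 0.8):             # omega is tuned online by the TD3 critic
    print(omega, hybrid_action_distribution(pi_game, pi_drl, omega))
```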
(2) Formal Safety Constraint Layer
To ensure safety during the entire dynamic obstacle avoidance process, the CBF control barrier function is embedded into the strategy network, converting obstacle avoidance constraints into soft constraint terms of the optimization problem, and designing safe constraint conditions to realize the coordinated optimization of “safety-efficiency” [25].
Assume that $f(x)$ and $g(x)$ describe the system dynamics, $u$ is the control input, $h(x) = d(x, \mathrm{obs}) - d_{safe}$ is the safety state function, and $\alpha(\cdot)$ is a class-$\mathcal{K}$ function (such as $\alpha(s) = ks$, where $k > 0$ is the safety coefficient); then the constraint is expressed as follows:
$$\nabla h(x)\bigl(f(x) + g(x)\,u\bigr) \geq -\alpha\bigl(h(x)\bigr)$$
Among them, $h(x) = d(x, \mathrm{obs}) - d_{safe}$, where $d(x, \mathrm{obs})$ is the distance between the UAV and surrounding obstacles and $d_{safe} = 5$ m is the safety threshold. The setting of $d_{safe}$ is not an empirical value but is derived from the UAV system's dynamic constraints and safety margins, as follows. First, the braking distance: under the UAV model used in this study (maximum velocity $v_{\max} = 10$ m/s, maximum deceleration $a_{\max} = 2$ m/s²), the theoretical minimum distance required to brake from maximum velocity to a complete stop is $S_{brake} = \frac{v_{\max}^{2}}{2 a_{\max}} = 25$ m. Then, the system delay distance is considered: taking into account the full system reaction time (including perception, decision-making, and communication links, with a tested mean $t_{delay} \approx 0.3$ s), during which the UAV continues moving at its current speed, the additional displacement is $S_{delay} = v\, t_{delay} \approx 3$ m. Finally, a safety margin is incorporated: to account for sensor noise, model uncertainties, and extreme scenarios, an additional buffer distance of approximately 2 m is introduced.
In summary, d s a f e = 5 m is a conservative threshold that balances safety and feasibility in engineering practice, ensuring sufficient space for successful obstacle avoidance under conditions including delays and uncertainties. This setting principle aligns with existing research.
Based on Lyapunov stability theory:
$$V(x) = \frac{1}{2}\, h(x)^{2}$$
The time derivative satisfies $\dot{V}(x) = h(x)\,\nabla h(x)\,f(x) \leq -\alpha\bigl(h(x)\bigr)\,h(x)$. When $h(x) > 0$, $\dot{V}(x) \leq 0$, and the system state converges to the safe set $\{\,x \mid h(x) \geq 0\,\}$. Test data show that the trajectory offset under this mechanism remains within 2 m. In the formal safety constraint layer, the overall loss function is a weighted sum of the reinforcement learning loss and the safety loss, which gives priority to the safety constraints during policy updates.

5. Simulation Verification

To evaluate the practicality of the multi-UAV dynamic aggregation cooperative obstacle avoidance technology proposed in this study, a high-fidelity multi-UAV simulation platform based on MATLAB/Simulink R2023a was employed to construct complex spatial scenarios with multiple obstacles, enabling intuitive demonstration of the advantages of the novel obstacle avoidance approach.

5.1. Construction of Simulation Environment

In the simulation tests, to further validate the technical scalability and meet the density requirements of tactical reconnaissance formations, the formation scale was expanded to 30 UAVs with an initial inter-UAV spacing of 15 m. A comparative formation with conventional spacing was also configured. The simulation environment was built based on complex multi-obstacle spatial scenarios to verify the effectiveness of the proposed multi-UAV dynamic aggregation cooperative obstacle avoidance technology. Key initial parameters were set as follows: 25 environmental obstacles each with a radius of 2 m and height of 2 m; the ground scene covers a horizontal area of 100 × 100 m.

5.2. Design of Comparative Experiments

Multi-scenario tests in dynamic environments are conducted to validate the effectiveness of the proposed optimization method in task allocation, formation stability, and path planning for UAV cooperative control. The testing environment incorporates randomly generated obstacles and multi-task dynamic targets (e.g., uniform-velocity approaching enemies and lateral interception threats in typical combat modes). The UAV physical model adheres to kinematic constraints (maximum speed: 10 m/s, acceleration: 2 m/s²), energy consumption limits, and communication range constraints (50 m). The experimental methodology follows the approach of Ju et al. [26].
Evaluation metrics include task completion rate, average execution time, path length efficiency, energy efficiency, and formation maintenance error. Three control groups are designed using the proposed control method for path planning scenarios [27]. Key simulation parameters are as follows: start point: (10, 50) m; end point: (90, 50) m; obstacle areas: rectangle 1: (40, 40)–(40, 50) m, rectangle 2: (60, 40)–(60, 50) m; UAV model: DJI mini-drone; quantity gradient: 1 to 30, incrementally; kinematic constraints: maximum speed 10 m/s, maximum acceleration 2 m/s²; communication range: 50 m; number of trials: 30.
The traditional game-theoretic method employs an incomplete information dynamic game model without deep reinforcement learning [28]. Incomplete information may lead to inaccurate opponent strategy prediction and delayed responses. For instance, when encountering sudden air defense attacks, the static game model fails to update threat levels in real time, significantly reducing obstacle avoidance success in high-threat-density environments.
The independent DRL method (e.g., PPO algorithm) lacks game-theoretic modeling [29], resulting in insufficient understanding of adversarial behavior and suboptimal performance.
The proposed scheme deeply integrates adaptive game theory with deep reinforcement learning, leveraging their complementary advantages [30]. By incorporating opponent confrontation modeling, it enables accurate behavior prediction and supports efficient decision-making for UAV formations in complex environments [31].
Initial conditions (including starting positions, formation state, and threat density) are identical across all three control groups [32]. Each task is repeated 30 times to measure obstacle avoidance success rate, average response time, and computational resource consumption.

5.3. Empirical Results

Repeated experiments were conducted to ensure statistical accuracy and eliminate contingency-induced errors [33]. As shown in Figure 3, in the cooperative control of a three-UAV formation, the traditional game-theoretic method yielded the longest path with the most avoidance maneuvers, primarily relying on obstacle circumvention; the independent DRL approach generated shorter paths with fewer turns, combining circumvention with occasional direct traversal over obstacles (e.g., at (20, 18)); the integrated adaptive game theory and deep reinforcement learning method produced the shortest path with the least maneuvers, employing strategic circumvention complemented by direct overflight when efficient (e.g., at (60, 65)).
After 30 repeated experiments, the average success rate of obstacle avoidance, resource consumption, and computational efficiency of the three methods were statistically analyzed, as shown in Figure 4. The results indicate that the traditional game-theoretic method exhibited overall low efficiency, with an average obstacle avoidance success rate of 52%, high resource consumption of 25×, and low computational efficiency of 100 fps; the independent DRL method demonstrated relatively high overall efficiency, achieving a 70% obstacle avoidance success rate, moderate resource consumption of 15×, and computational efficiency of 200 fps; the method deeply integrating adaptive game theory with deep reinforcement learning showed high overall efficiency, with an average obstacle avoidance success rate of 91%, low resource consumption of 1.385×, and relatively high computational efficiency of 300 fps. In summary, these findings clearly demonstrate the superior performance and effectiveness of the proposed UAV formation control method compared to previous approaches.
The test results demonstrate that the multi-UAV dynamic aggregation cooperative obstacle avoidance method, based on the integrated framework of adaptive game theory and deep reinforcement learning, achieves a significantly higher obstacle avoidance success rate, with specific data illustrated in Figure 4. When the number of UAVs reaches 20, the success rate of the proposed method remains at 91%, as detailed in Figure 5. Large-scale formation tests further reveal that as the number of UAVs gradually increases, the traditional method’s obstacle avoidance success rate drops to approximately 52% at 30 UAVs, while the method combining adaptive game theory and deep reinforcement learning maintains a success rate exceeding 91%. This performance is attributed to the incorporation of adaptive capabilities in complex environments and uninterrupted decision-making operational mechanisms, which collectively ensure collision-free operations in UAV formations.
To reduce overhead, this scheme rigorously controls computational complexity through mean field approximation and strategy decoupling mechanisms. With 30 UAVs, the computational load increases by only approximately 38.5%. In contrast, traditional methods require more than 25 times the computing resources of a single UAV when the number of UAVs reaches 30, severely limiting their applicability. As shown in Figure 5, the increase in computational overhead and the obstacle avoidance success rate under different formation scales are presented. Based on multi-modal perception fusion and safe reinforcement learning algorithms, the cluster effectively operates in adverse conditions such as GPS-denied environments and strong electromagnetic interference, maintains a relatively stable formation configuration, and achieves a trajectory deviation distance of ≤1.2 m.
The clustering threshold λ exhibits an inversely correlated exponential relationship with UAV density (Formula (9): $\lambda(t) = \lambda_0 \exp(-\beta\, n_s(t))$). Its core physical significance is to achieve adaptive control of clustering precision. At low density (small $n_s(t)$): $\lambda \approx \lambda_0$, allowing the formation of larger clusters to promote macroscopic coordination and improve computational and communication efficiency. At high density (large $n_s(t)$): λ decreases dynamically, forcing large clusters to split into finer sub-clusters to avoid misclassification and to ensure obstacle avoidance safety and decision-making accuracy in high-density scenarios [34].
This mechanism intelligently maps physical spatial density to feature space processing granularity, transforming computationally complex inter-individual games into efficient inter-cluster games and intra-cluster consensus. As shown in Figure 6, this is the key reason why the computational overhead of the algorithm increases only linearly (38.5%) rather than exponentially when scaling up, fundamentally solving the trade-off between precision and efficiency in large-scale cooperative control.
The results presented in Figure 6 reveal a key characteristic of the proposed AG-DRL model: achieving higher success rates and superior scalability requires correspondingly greater computational resources. This stands in sharp contrast to traditional methods, where performance deteriorates sharply while computational costs surge dramatically [35].
(1) Parallel Growth in System Overhead and Success Rate
This is an expected phenomenon, rooted in the linear relationship between our model's computational complexity and swarm size. As described in Section 3.1, the mean-field game algorithm reduces computational complexity from $O(N^2)$ to $O(K^2)$. Therefore, an increase in the number of UAVs (N) linearly raises the demand for neighbor interactions, state perception, and policy computation. These additional computations are primarily used to operate the multi-scale attention network for more accurate environmental awareness, execute more complex game-theoretic strategy solutions, and maintain collaborative decision-making within the swarm. This is a necessary cost for achieving high-performance collaborative intelligence. It is precisely these "extra" computational investments that enable each UAV to make smarter decisions, thereby maintaining high success rates even as the scale expands. In other words, the linear increase in computational overhead is fundamental to sustaining high performance. This fundamentally distinguishes our approach from traditional game-theoretic methods (where computational costs grow exponentially while performance declines linearly) and centralized methods such as MILP (where computational requirements are prohibitively high, rendering them impractical despite high performance).
(2) Explanation of Resource Consumption
At a scale of 30 UAVs, the proposed AG-DRL method incurs an increase in computational overhead of approximately 38.5% compared to a single-UAV baseline. In contrast, traditional game-theoretic methods exhibit an overhead increase of up to 2500% (25 times). Thus, the additional computational overhead of the proposed method is only about 1/65th (≈1.5%) of that of traditional methods.
In summary, our method achieves consistently high performance with an acceptable, linear increase in computational cost. In contrast, traditional approaches are plagued by explosive growth in computational overhead alongside continuously worsening performance. This clearly demonstrates the excellent balance between computational efficiency and performance achieved by the proposed integrated framework.
As shown in Figure 7, the ACO algorithm can find near-optimal paths in small-scale swarms of 20 drones (with an average success rate of ~85%), but its performance declines sharply as the swarm size increases. When the number of drones rises to 30, the success rate drops below 65%. This is because ACO's pheromone mechanism is prone to local optima in high-dimensional and dynamic environments, and its slow convergence (with an average decision time exceeding 500 ms) fails to meet the real-time requirements of large-scale swarms. In contrast, the AG-DRL framework directly maps states to actions through a deep neural network, reducing decision time to just tens of milliseconds. Its distributed architecture ensures that performance does not degrade significantly with scale, highlighting its dual advantages in real-time responsiveness and scalability.
As shown in Figure 7, the MILP method provides globally optimal solutions in small-scale scenarios (100% success rate, shortest path length), serving as a performance upper bound for the other algorithms. However, its computational complexity grows exponentially: for a swarm of 30 drones, MILP's solution time exceeds 10 min, making it impractical for real-time use, and its rigid model struggles to handle dynamic obstacles and sudden threats. Although the proposed AG-DRL method trails MILP slightly in optimality (91% vs. 100% success rate), it outperforms MILP by several orders of magnitude in computational efficiency and offers the dynamic-environment adaptability and distributed resilience that MILP lacks. This demonstrates that the AG-DRL framework achieves an excellent balance between solution quality and computational efficiency, making it better suited to complex, dynamic real-world applications [25].
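To illustrate why the MILP formulation scales so poorly, the short sketch below counts binary collision-avoidance variables under a common big-M encoding (the figure of four binaries per UAV pair per time step and the 20-step horizon are illustrative assumptions, not the formulation benchmarked here); worst-case branch-and-bound effort grows with the number of binaries, which itself grows quadratically in swarm size:

```python
# Illustrative (not from the paper): a common big-M MILP encoding of pairwise
# collision avoidance uses ~4 binary variables per UAV pair per time step
# (one per separating half-space in 2D). Worst-case branch-and-bound effort
# grows with 2**(number of binaries).
BINARIES_PER_PAIR_PER_STEP = 4
HORIZON = 20  # planning steps

for n_uav in (10, 20, 30):
    pairs = n_uav * (n_uav - 1) // 2
    n_bin = BINARIES_PER_PAIR_PER_STEP * HORIZON * pairs
    print(f"N={n_uav:2d}: {pairs:3d} pairs -> {n_bin:6d} binaries "
          f"(worst-case search space ~2^{n_bin})")
```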
In summary, comparisons with ACO and MILP confirm that the AG-DRL framework combines the distributed advantages of swarm intelligence algorithms with the strong decision-making capabilities of centralized optimization methods. It simultaneously overcomes the core limitations of slow convergence and local optima in ACO and the high computational complexity and inflexibility of MILP, establishing itself as a more suitable solution for large-scale, highly dynamic, and constrained real-world scenarios in UAV swarm cooperative control.
Overall, the simulation experiments demonstrate that the proposed scheme is both efficient and flexible. The UAV formation maintains orderly flight in its initial state and upon threat detection, and through autonomous cluster perception it rapidly avoids potential threats such as multiple obstacles in complex scenarios, thereby ensuring successful mission accomplishment.

6. Conclusions and Outlook

The proposed multi-UAV dynamic aggregation-disaggregation cooperative obstacle avoidance framework, which integrates adaptive game theory and deep reinforcement learning, employs multi-sensor fusion and adversarial-sample defense mechanisms. Its optimized obstacle avoidance algorithms and reduced real-time decision-making complexity give the system stronger robustness and evasion capability in complex, dynamic threat environments, enabling larger-scale swarm coordination and improving the transportation efficiency and safety of UAV clusters in modern traffic ecosystems. Nevertheless, to meet the escalating demands of evolving scenarios, breakthroughs are needed in two dimensions:
Technical Deepening: Dynamic adaptability to intelligent adversarial threats (e.g., autonomous interception by UAV-like targets) requires enhancement. Integrating Meta-Reinforcement Learning (Meta-RL) will dynamically refine the opponent modeling module (I-POMDP) in the intent layer, improving action prediction accuracy while reducing decision latency for large-scale formations. Concurrently, federated learning architectures and lightweight Spatiotemporal Graph Convolutional Networks (ST-GCN) will be explored to overcome the reconstruction-time and computational bottlenecks of hundred-unit cluster collaboration.
Engineering Implementation: A full lifecycle safety verification system must be established. Formal verification tools (e.g., UPPAAL) will enable aviation-level certification (DO-178C) for the safety constraint layer (e.g., Control Barrier Functions), while damage-resistant ad hoc network protocols will elevate communication topology fault tolerance (currently 30% node failure tolerance) to support critical scenarios like smart cities and rapid-response missions. Future work will concentrate on adversarial meta-learning, federated edge computing, and human–machine hybrid augmentation, advancing UAV cooperative control toward greater intelligence and universal applicability.
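As a minimal sketch of the kind of condition such a safety constraint layer enforces (the single-integrator dynamics, barrier definition, and gain below are illustrative assumptions rather than the certified implementation envisaged above), a control barrier function h(x) = ||p - p_obs||² - d_min² can be used to filter a nominal velocity command so that ḣ + αh ≥ 0 holds at every step:

```python
import numpy as np

D_MIN, ALPHA = 5.0, 1.0   # illustrative minimum separation (m) and CBF gain

def cbf_filter(p_own, p_obs, v_nominal):
    """Minimally correct the nominal velocity whenever it would violate
    h(x) = ||p_own - p_obs||^2 - D_MIN^2 >= 0 with hdot + ALPHA*h >= 0
    (single-integrator dynamics assumed purely for illustration)."""
    diff = p_own - p_obs
    h = diff @ diff - D_MIN**2
    h_dot = 2.0 * diff @ v_nominal          # hdot = 2 (p_own - p_obs)^T v
    if h_dot + ALPHA * h >= 0.0:
        return v_nominal                    # nominal command is already safe
    # Closed-form minimal correction along the barrier gradient.
    grad = 2.0 * diff
    lam = -(h_dot + ALPHA * h) / (grad @ grad + 1e-9)
    return v_nominal + lam * grad

v_safe = cbf_filter(np.array([0.0, 0.0]), np.array([6.0, 0.0]), np.array([2.0, 0.0]))
print(v_safe)   # the velocity component toward the obstacle is reduced
```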

Author Contributions

Investigation, G.Y.; resources, L.G. and H.L.; data curation, G.Y.; writing—original draft preparation, G.Y.; writing—review and editing, L.G. and H.L.; visualization, F.W.; supervision, L.G.; project administration, F.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of Hubei Province (2023AFB1028) and the Military Science Project of the National Social Science Foundation of China (2024-SKJJ-B-044).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

Authors Guangyi Yao, Lejiang Guo, and Fan Wu were employed by the Air Force Early Warning Academy; Haibin Liao was employed by the School of Electronic and Electrical Engineering. All authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

1. Zhang, Y.; Zhang, J.; Wang, Y. Game Theoretic Approach for MultiUAV Coordination under Communication Constraints. J. Guid. Control. Dyn. 2024, 47, 583–594.
2. Dhuheir, M.; Baccour, E.; Erbad, A.; Al-Obaidi, S.S.; Hamdi, M. Deep Reinforcement Learning for Trajectory Path Planning and Distributed Inference in Resource-Constrained UAV Swarms. IEEE Internet Things J. 2023, 10, 8185–8201.
3. Hutter, F.; Kotthoff, L.; Vanschoren, J. Automated Machine Learning: Methods, Systems, Challenges; Springer: Berlin/Heidelberg, Germany, 2019; p. 219.
4. Wen, L.; Zhen, Z.; Ding, J.; Tao, C.; Liu, S. Distributed self-organizing fencing strategy with UAV swarm under incomplete information. Adv. Eng. Inform. 2025, 68 Pt A, 103587.
5. Medani, K.; Gherbi, C.; Mabed, H.; Aliouat, Z. Energy-efficient Q-learning-based path planning for UAV-Aided data collection in agricultural WSNs. Internet Things 2025, 33, 101698.
6. Cao, Z.; Chen, G. Enhanced deep reinforcement learning for integrated navigation in multi-UAV systems. Chin. J. Aeronaut. 2025, 38, 103497.
7. Liu, X.; Yi, W.; Chen, P.; Tan, Y. Flight path planning of UAV-driven refinement inspection for construction sites based on 3D reconstruction. Autom. Constr. 2025, 177, 106360.
8. Amodu, O.A.; Althumali, H.; Hanapi, Z.M.; Jarray, C.; Mahmood, R.A.R.; Adam, M.S.; Bukar, U.A.; Abdullah, N.F.; Luong, N.C. A Comprehensive Survey of Deep Reinforcement Learning in UAV-Assisted IoT Data Collection. Veh. Commun. 2025, 55, 100949.
9. Shi, Z.; Wang, L.; Lin, Y.; Cai, A.; Fan, J.; Liu, C. Dynamic offloading strategy in SAGIN-based emergency VEC: A multi-UAV clustering and collaborative computing approach. Veh. Commun. 2025, 55, 100952.
10. Basavegowda, D.H.; Schleip, I.; Bellingrath-Kimura, S.D.; Weltzien, C. UAV-assisted deep learning to support results-based agri-environmental schemes: Facilitating Eco-Scheme 5 implementation in Germany. Biol. Conserv. 2025, 309, 111323.
11. Chen, H.; Liu, J.; Wang, Y.; Zhu, J.; Feng, D.; Xie, Y. Teaching in adverse scenes: A statistically feedback-driven threshold and mask adjustment teacher-student framework for object detection in UAV images under adverse scenes. ISPRS J. Photogramm. Remote Sens. 2025, 227, 332–348.
12. Lyu, C.; Lin, S.; Lynch, A.; Zou, Y.; Liarokapis, M. UAV-based deep learning applications for automated inspection of civil infrastructure. Autom. Constr. 2025, 177, 106285.
13. Xu, W.; Zhang, X.; Miao, Z. Cooperative trajectory tracking control of USV-UAV with non-singular sliding mode surface and RBF neural network. Ocean Eng. 2025, 337, 121872.
14. Jia, R.; Li, H.; Sun, P.; Zheng, Z.; Li, M. UAV trajectory optimization for visual coverage in mobile networks using matrix-based differential evolution. Knowl.-Based Syst. 2025, 324, 113797.
15. Liu, H.; Long, X.; Li, Y.; Yan, J.; Li, M.; Chen, C.; Gu, F.; Pu, H.; Luo, J. Adaptive multi-UAV cooperative path planning based on novel rotation artificial potential fields. Knowl.-Based Syst. 2025, 317, 113429.
16. Zhang, Z.; Li, N.; Yan, G.; Li, W. The development of distributed cooperative localization algorithms for Multi-UAV systems in the past decade. Measurement 2025, 256 Pt A, 118040.
17. Pan, Z.; Wang, K.; Liu, Y.; Guan, X.; Chen, C.; Liu, J.; Wang, Z.; Li, F.; Ma, G.; Yao, Y.; et al. Deep learning-enhanced safety system for real-time in-situ blade damage monitoring in UAV using triboelectric sensor. Nano Energy 2025, 140, 111063.
18. Ngo, Q.H.; Luu, T.H.; Nguyen, P.V.; El Makrini, I.; Vanderborght, B.; Cao, H.-L. InterDuPa-UAV: A UAV-based dataset for the classification of intercropped durian and papaya trees. Data Brief 2025, 61, 111843.
19. Liang, J.; He, Q. Joint optimization of VNF deployment and UAV trajectory planning in Multi-UAV-enabled mobile edge networks. Comput. Netw. 2025, 262, 111163.
20. Lu, Z.; Zhai, L.; Zhou, W.; Xue, K.; Gao, X. Beamforming design and trajectory optimization for integrated sensing and communication supported by multiple UAVs based on DRL. Veh. Commun. 2025, 54, 100932.
21. Liu, Y.; Chen, Y.; Hu, M.; Zhang, W. Resilient multi-UAV path planning for effective and reliable information collection. Phys. Commun. 2025, 71, 102685.
22. Deb, K.; Pratap, A.; Agarwal, S.; Meyarivan, T. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans. Evol. Comput. 2002, 6, 182–197.
23. Zimmermann, H.J. Fuzzy programming and linear programming with several objective functions. Fuzzy Sets Syst. 1978, 1, 45–55.
24. Yang, G.; Mo, Y.; Lv, C.; Zhang, Y.; Li, J.; Wei, S. A dual-layer task planning algorithm based on UAVs-human cooperation for search and rescue. Appl. Soft Comput. 2025, 181, 113488.
25. Zarychta, R.; Zarychta, A. Application of geostatistical approach in generating DEM for relief studies using UAV in forest areas. Geomorphology 2025, 487, 109916.
26. Ju, T.; Li, L.; Liu, S.; Zhang, Y. A multi-UAV assisted task offloading and path optimization for mobile edge computing via multi-agent deep reinforcement learning. J. Netw. Comput. Appl. 2024, 229, 103919.
27. Aloqaily, M.; Bouachir, O.; Al Ridhawi, I. UAV-supported communication: Current and prospective solutions. Veh. Commun. 2025, 54, 100923.
28. Grindley, B.; Phillips, K.; Parnell, K.J.; Cherrett, T.; Scanlan, J.; Plant, K.L. Avoiding automation surprise: Identifying requirements to support pilot intervention in automated Uncrewed Aerial Vehicle (UAV) flight. Appl. Ergon. 2025, 127, 104516.
29. Kutpanova, Z.; Kadhim, M.; Zheng, X.; Zhakiyev, N. Multi-UAV path planning for multiple emergency payloads delivery in natural disaster scenarios. J. Electron. Sci. Technol. 2025, 23, 100303.
30. Xiong, Y.; Zhou, Y.; She, J.; Yu, A. Collaborative coverage path planning for UAV swarm for multi-region post-disaster assessment. Veh. Commun. 2025, 53, 100915.
31. Yue, S.; Zheng, D.; Wei, M.; Chu, Z.; Lin, D. Behavior-based cooperative control method for fixed-wing UAV swarm through a virtual tube considering safety constraints. Chin. J. Aeronaut. 2025, 38, 103445.
32. Zhang, Y.; Li, S.; Gu, Y.; He, Q.; Zhou, P.; Zhang, A. UAV fault diagnosis based on collaborative sharing of generic and task-oriented features. Expert Syst. Appl. 2025, 296, 128940.
33. Darchini-Tabrizi, M.; Pakdaman-Donyavi, A.; Entezari-Maleki, R.; Sousa, L. Performance enhancement of UAV-enabled MEC systems through intelligent task offloading and resource allocation. Comput. Netw. 2025, 264, 111280.
34. Xiao, J.; Guo, H.; Zhou, J.; Zhao, T.; Yu, Q.; Chen, Y.; Wang, Z. Tiny object detection with context enhancement and feature purification. Expert Syst. Appl. 2023, 211, 118665.
35. Guo, Y.; Zhou, J.; Dong, Q.; Li, B.; Xiao, J.; Li, Z. Refined high definition map model for roadside rest area. Transp. Res. Part A Policy Pract. 2025, 195, 104463.
Figure 1. Flowchart of the three-level information fusion architecture.
Figure 2. Multi-scale attention policy network architecture.
Figure 3. Comparison of UAV path planning under different control methods.
Figure 4. Comparison of average experimental results.
Figure 5. Performance comparison in complex obstacle scenario (20 UAVs).
Figure 6. Performance of cooperative obstacle avoidance across different formation scales.
Figure 7. Comparison of different methods.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
