Article

Research on Decision-Making Strategies for Multi-Agent UAVs in Island Missions Based on Rainbow Fusion MADDPG Algorithm

1 School of Artificial Intelligence, OPtics and ElectroNics (iOPEN), Northwestern Polytechnical University, Xi’an 710072, China
2 School of Human Settlements and Civil Engineering, Xi’an Jiaotong University, Xi’an 710049, China
* Author to whom correspondence should be addressed.
Drones 2025, 9(10), 673; https://doi.org/10.3390/drones9100673
Submission received: 8 August 2025 / Revised: 9 September 2025 / Accepted: 18 September 2025 / Published: 25 September 2025
(This article belongs to the Section Artificial Intelligence in Drones (AID))

Highlights

What are the main findings?
  • This study presents an enhanced algorithm that integrates the Rainbow module to improve the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm for multi-agent UAV cooperative and competitive scenarios.
  • The proposed algorithm incorporates Prioritized Experience Replay (PER) and multi-step TD updating to optimize long-term reward perception and enhance learning efficiency. Behavioral cloning is also employed to accelerate convergence during initial training.
What is the implication of the main finding?
  • Experimental results on a UAV island capture simulation demonstrate that the enhanced algorithm outperforms the original MADDPG, showing a 40% increase in convergence speed and a doubled combat power preservation rate.
  • The algorithm proves to be a robust and efficient solution for complex, dynamic, multi-agent game environments.

Abstract

To address the limitations of the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm in autonomous control tasks, including low convergence efficiency, poor training stability, inadequate adaptability of confrontation strategies, and difficulty in handling sparse-reward tasks, this paper proposes an enhanced algorithm that integrates the Rainbow module. The proposed algorithm improves long-term reward optimization through prioritized experience replay (PER) and multi-step TD updating mechanisms. Additionally, a dynamic reward allocation strategy is introduced to enhance the collaborative and adaptive decision-making capabilities of agents in complex adversarial scenarios. Furthermore, behavioral cloning is employed to accelerate convergence during the initial training phase. Extensive experiments are conducted on the MaCA simulation platform for 5 vs. 5 to 10 vs. 10 UAV island capture missions. The results demonstrate that the Rainbow-MADDPG outperforms the original MADDPG in several key metrics: (1) The average reward value improves across all confrontation scales, with notable enhancements in the 6 vs. 6 and 7 vs. 7 tasks, which achieve reward values of 14, representing 6.05-fold and 2.5-fold improvements over the baseline, respectively. (2) The convergence speed increases by 40%. (3) The combat effectiveness preservation rate doubles that of the baseline. Moreover, the algorithm achieves the highest average reward value in quasi-rectangular island scenarios, demonstrating its strong adaptability to large-scale dynamic game environments. This study provides an innovative technical solution to the challenges of strategy stability and efficiency imbalance in multi-agent autonomous control tasks, with significant application potential in UAV defense, cluster cooperative tasks, and related fields.

1. Introduction

In recent years, the rapid development of UAV technology has promoted its wide application in the military field. In scenarios such as island capture and target reconnaissance, the cooperative operation capability of UAV clusters has become a research focus. By exploiting swarm intelligence, UAV clusters can realize efficient collaboration and confrontation in complex tasks [1,2,3,4]. However, under conditions of resource constraints, dynamic adversarial interactions, and complex, evolving environments, the formulation of efficient group confrontation and collaboration strategies remains a critical challenge in multi-agent systems research [5,6,7]. Researchers worldwide have carried out systematic studies on this issue.
Among game-theoretic approaches, Isler et al. [8] proposed a randomized pursuit strategy that first solves the capture problem for a single pursuer in polygonal environments, then introduces a two-pursuer strategy, and further extends it to more complex settings such as rooms with doors and line-of-sight communication scenarios. Kaneshige et al. [9] proposed a UAV maneuver selection method for air combat based on the Artificial Immune System (AIS); by simulating the adaptability and memory of the biological immune system, the method enables UAVs to make intelligent maneuver decisions in aerial confrontation. Chen et al. [10] proposed a game model for multi-agent collaborative coalition combat and used a particle swarm optimization algorithm to solve the mixed-strategy Nash equilibrium of the multi-agent coalition collaboration problem, verifying the validity of the method through simulation. Duan et al. [11] presented the Predator–Prey Particle Swarm Optimization (PP-PSO) method based on game theory to solve the dynamic task allocation problem of multiple UAVs in military missions; by modeling the problem as a two-player game and solving a mixed Nash equilibrium, the effectiveness of the method in air combat was verified, and it is applicable to confrontation scenarios between attacking and defending parties. Zhou et al. [12] studied multi-agent air combat systems and improved the robustness and mission success rate of UAV clusters in aerial combat by modeling the aerodynamic characteristics of the aircraft and the threat area, combining an ant colony algorithm with an autonomous control algorithm.
Multi-Agent Reinforcement Learning (MARL), as an adaptive intelligence approach, offers possibilities for UAVs to optimize collaborative and competitive strategies in adversarial island capture scenarios [13,14,15,16]. The MADDPG algorithm proposed by Lowe et al. [17] enables multiple agents to collaborate or compete in a shared environment and learn optimal strategies by combining Deep Deterministic Policy Gradient (DDPG) with a centralized training, distributed execution framework. AlphaGo, AlphaGo Zero, AlphaZero, and AlphaStar [1,18,19,20] rely mainly on deep reinforcement learning and demonstrate strong adaptivity and strategy optimization capabilities through self-play, Monte Carlo tree search, and multi-agent collaboration. Chen et al. [21] proposed a multi-agent collaborative attack and defense decision-making method based on multi-agent reinforcement learning, combined with an actor–critic algorithm to improve autonomous learning ability and the effectiveness of collaborative combat; simulation results verified its effectiveness. Li et al. [22] proposed an improved multi-agent collaborative decision-making algorithm that combines the actor–critic framework, GRU, and an attention mechanism to optimize decision-making performance and convergence speed. Zhang et al. [23] proposed a deep reinforcement learning method based on an attention mechanism and individual reward shaping to optimize collaborative strategies and improve the efficiency of multi-UAV short-range aerial combat missions; simulation results verified the effectiveness and advantages of the method. Gong et al. [24] proposed a multi-agent collaborative air combat decision-making method based on MARL theory, which optimizes the UAV collaboration strategy by combining VDNs with the NDec-POMDP framework; simulation results verify the effectiveness of the method. Zhou et al. [25] proposed the PK-MADDPG algorithm, which significantly improves the efficiency of multi-agent game confrontation by incorporating prior knowledge and optimizing the training strategy; the algorithm improves training efficiency and effectiveness and performs well in real competitions. The COMA method proposed by Wang et al. [26] significantly improves the decision-making performance of UAV clusters in confrontation game tasks and addresses the multi-agent credit assignment problem, showing strong practicality and robustness.
Several studies have examined how behavioral cloning can assist reinforcement learning. Aler et al. [27] modeled the low-level behaviors of human players and applied them to Robosoccer agent control through behavioral cloning, enabling agents to imitate human behavior in the game. Ho et al. [28] proposed a novel model-free imitation learning algorithm that significantly improves performance on complex behavioral imitation by extracting strategies directly from data. Wang et al. [29] proposed a novel iterative learning algorithm that significantly improves performance on mixed-quality datasets by identifying expert trajectories and performing behavioral cloning, outperforming existing offline reinforcement learning and imitation learning methods. Sun et al. [30] proposed a decision-making framework for ASVs that combines behavioral cloning and deep reinforcement learning, using behavioral cloning to train ASVs to cope with underactuated dynamics and optimizing interception strategies, thereby improving interception efficiency.
Current research faces three primary challenges: (1) inadequate learning efficiency of existing algorithms in dynamic environments with sparse rewards; (2) absence of an effective long-term reward optimization mechanism integrating prior knowledge; and (3) insufficient balance between multi-agent collaboration and confrontation. This study presents an enhanced algorithm incorporating the Rainbow module to address these limitations. The proposed algorithm employs Prioritized Experience Replay (PER) and Multi-step Temporal Difference (TD) updating techniques to optimize long-term reward perception, thereby significantly enhancing strategy learning efficiency, stability, and robustness. Furthermore, Behavior Cloning is implemented to accelerate convergence during initial training phases. Experimental results demonstrate the algorithm’s superior performance and its substantial application potential in military defense and cooperative swarm warfare domains.

2. Algorithm Design

2.1. Theoretical Framework Analysis

Multi-agent Reinforcement Learning (MARL) is a process in which agents learn through interaction in a shared environment: each agent receives rewards based on the state of the environment and its actions, and uses this information to update its strategy in order to maximize long-term returns [31,32,33,34]. Figure 1 illustrates the combination of the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithmic framework (blue dotted box) with the Rainbow optimization module (yellow dotted box), while behavioral cloning (green dotted box) is introduced to accelerate convergence in the early stages of training using expert data. In this framework, each agent generates policies via an Actor network and evaluates the value of actions via a Critic network, and the introduction of target networks further stabilizes the learning process. In addition, Prioritized Experience Replay increases the sampling frequency of important experiences [35], thus accelerating learning, while Multi-Step Temporal Difference (TD) updating enhances the ability of the strategies to learn from long-term dependencies [36].
Consider a multi-agent system of UAV agents. At time step $t$, each agent $i$ aims to maximize its cumulative discounted long-term reward by learning a policy $\pi_i(a_t^i \mid s_t)$, where $a_t^i$ is the action selected by agent $i$ in state $s_t$. The goal of each agent is to optimize the cumulative payoff (long-term reward), which is usually represented by the following objective function:
$$J_i(\pi_i) = \mathbb{E}\left[\sum_{t=0}^{T} \gamma^t R_t^i\right],$$
where $\gamma \in (0, 1)$ is the discount factor and $R_t^i$ is the reward of agent $i$ at time step $t$. The basic problem of reinforcement learning is how to learn, through interaction, optimal policies that maximize the long-term reward of each agent in the environment; a common goal is to maximize this expected value.
In Multi-agent Reinforcement Learning (MARL), the reward of each agent not only depends on its own behavior but is also significantly influenced by the behaviors of other agents, which makes the interactions among multiple agents inherently complex and dynamic [37,38,39,40]. To effectively address this challenge, this paper adopts the Centralized Training and Distributed Execution (CTDE) framework. This framework optimizes the global policy by sharing information in the training phase, while maintaining the independence of each agent in the execution phase, thus balancing the needs of global collaboration and individual decision-making.

2.1.1. Multi-Agent Markov Game Formalization

The UAV confrontation task can be formalized as a Multi-agent Markov Game (MMG) [41]:
$$\mathcal{M} = \left\langle \mathcal{N}, \mathcal{S}, \{\mathcal{A}_i\}_{i \in \mathcal{N}}, P, \{R_i\}_{i \in \mathcal{N}}, \gamma \right\rangle,$$
where $\mathcal{N} = \{1, 2, \ldots, n\}$ is the set of agents (UAVs), $\mathcal{S}$ is the global state space, $\mathcal{A}_i$ is the action space of agent $i$, $P: \mathcal{S} \times \mathcal{A}_1 \times \cdots \times \mathcal{A}_n \to \Delta(\mathcal{S})$ is the state transition probability, $R_i: \mathcal{S} \times \mathcal{A}_1 \times \cdots \times \mathcal{A}_n \to \mathbb{R}$ is the reward function of agent $i$, and $\gamma \in [0, 1)$ is the discount factor.
Each agent i aims to maximize its expected cumulative reward:
$$J_i(\pi) = \mathbb{E}\left[\sum_{t=0}^{T} r_t^i(s_t, a_t^1, \ldots, a_t^n)\right].$$
The key challenges in this framework are: (1) Non-stationarity: the environment appears non-stationary from each agent’s perspective due to other agents’ policy changes; (2) Credit assignment: team rewards must be decomposed into individual contributions; (3) Scalability: the joint action space $|\mathcal{A}|^n$ grows exponentially with the number of agents; (4) Sparse rewards: in island capture tasks, meaningful rewards are often delayed.
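For readers who prefer code, the tuple above can be captured in a minimal Python sketch; the class and field names below are illustrative assumptions rather than part of the MaCA platform or the paper’s implementation.

from dataclasses import dataclass
from typing import Callable, List, Sequence

@dataclass
class MarkovGame:
    # M = (N, S, {A_i}, P, {R_i}, gamma); only the pieces needed for a rollout are kept.
    n_agents: int                                                  # |N|, number of UAVs
    gamma: float                                                   # discount factor in [0, 1)
    transition: Callable[[object, Sequence[object]], object]       # samples s' ~ P(. | s, a_1..a_n)
    rewards: List[Callable[[object, Sequence[object]], float]]     # R_i(s, a_1..a_n) for each agent

def discounted_return(game: MarkovGame, agent: int, trajectory: Sequence) -> float:
    # Monte Carlo estimate of J_i from one recorded trajectory of (state, joint_action) pairs.
    return sum((game.gamma ** t) * game.rewards[agent](s, a)
               for t, (s, a) in enumerate(trajectory))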

2.1.2. Component Synergy Mechanism Analysis

PER Synergy in Multi-agent Settings: In multi-agent environments, the TD error for agent i is:
$$\delta_t^i = r_t^i + \gamma \max_{a'} Q_{\pi_i}(s_{t+1}, a') - Q_{\pi_i}(s_t, a_t^i).$$
The priority-based sampling probability becomes:
$$P(i) = \frac{\left(|\delta_t^i| + \varepsilon\right)^{\alpha}}{\sum_j \left(|\delta_t^j| + \varepsilon\right)^{\alpha}}.$$
In multi-agent settings, high TD error samples contain crucial information about other agents’ policy changes. PER automatically focuses learning on these critical transitions [35].
PER calculation in multi-agent: We address prioritization by using global TD errors. The centralized critic computes a unified TD error using team rewards and global state information. This global priority is then applied to all agents’ experiences from the same timestep, ensuring consistent learning signals across the multi-agent system while leveraging the centralized training advantage.
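As a rough illustration of this global prioritization, the sketch below computes one team-level TD error per timestep from a centralized critic and attaches the resulting priority to every agent’s transition; the function and argument names are assumptions for illustration only, not the paper’s actual code.

def global_priority(critic_q, team_reward, state, joint_action,
                    next_state, next_joint_action, gamma=0.95, eps=1e-6):
    # critic_q(state, joint_action) -> scalar Q estimate from the centralized critic.
    td_error = (team_reward
                + gamma * critic_q(next_state, next_joint_action)
                - critic_q(state, joint_action))
    # The same priority is stored with every agent's experience from this timestep,
    # keeping the learning signal consistent across the multi-agent system.
    return abs(td_error) + eps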
Under the PER mechanism, the sample complexity of multi-agent learning reduces from $O\!\left(|\mathcal{S}||\mathcal{A}|^n / \epsilon^2\right)$ to $O\!\left(|\mathcal{S}||\mathcal{A}|^n \log(1/\delta) / (\epsilon^2 p_{\min})\right)$, where $p_{\min}$ is the minimum sampling probability.
Multi-step TD Variance Reduction: The n-step return variance in sparse reward environments is:
$$\mathrm{Var}\!\left[G_t^{(n)}\right] \le \sigma_R^2 \,(1-\gamma)^2\, \frac{1-\gamma^{2n}}{1-\gamma^2}.$$
When the reward sparsity satisfies $\sigma < 1/n$, n-step TD achieves variance $O(n)$ times lower than single-step TD and convergence acceleration $O(n)$ times faster [36].
Multi-step TD and CTDE coordination: Multi-step returns are harmonized with CTDE through dual-level computation. Each agent calculates local multi-step returns using individual rewards and observations for policy updates. Simultaneously, the centralized critic computes global multi-step returns incorporating team rewards and complete state information for value estimation. This design preserves distributed execution capabilities while maximizing centralized training benefits.
BC-RL Synergistic Learning: Let the expert policy $\pi^*$ have a performance gap $\varepsilon_{\mathrm{expert}}$ from the optimal policy. BC pre-training reduces the RL convergence time from $O(1/\varepsilon^2)$ to:
$$T_{\mathrm{BC+RL}} = O\!\left(\frac{\varepsilon_{\mathrm{expert}}}{\varepsilon} + \log\frac{1}{\varepsilon}\right).$$

2.1.3. Multi-Agent Challenge Solutions

Credit Assignment Problem: Given the team reward $R_{\mathrm{team}} = f(R_1, \ldots, R_n)$, individual contributions are decomposed using the Difference Rewards mechanism [42]:
$$D_i(s, a) = R_{\mathrm{team}}(s, a_i, a_{-i}) - R_{\mathrm{team}}(s, \bar{a}_i, a_{-i}),$$
where $\bar{a}_i$ is agent $i$’s default action.
Non-stationarity Problem: Addressed through the CTDE framework with a convergence guarantee; under CTDE, the joint policy converges to an ε-Nash equilibrium:
$$\mathbb{E}\!\left[J_i(\pi_i', \pi_{-i})\right] - \mathbb{E}\!\left[J_i(\pi_i, \pi_{-i})\right] \le \varepsilon, \quad \forall i,\ \forall \pi_i'.$$
Sparse Reward Learning Problem: Addressed by combining PER with multi-step TD updates, which improves convergence time in sparse reward environments.

2.1.4. Convergence Guarantees

The proposed Rainbow-MADDPG algorithm converges to ε-optimal joint policy with sample complexity:
$$T_{\mathrm{total}} = O\!\left(\frac{|\mathcal{S}||\mathcal{A}|^n H^3}{\varepsilon^2 (1-\gamma)^4 \alpha_{\min}} \, C_{\mathrm{synergy}}\right),$$
where $H$ is the episode length, $\alpha_{\min}$ is the minimum learning rate, and $C_{\mathrm{synergy}} < 1$ is the synergy coefficient.
Synergy Effect Quantification: the total improvement is $\Delta_{\mathrm{total}} = \Delta_{\mathrm{PER}} + \Delta_{\mathrm{TD}} + \Delta_{\mathrm{BC}} + \Delta_{\mathrm{synergy}}$, where $\Delta_{\mathrm{synergy}} > 0$ represents positive synergistic effects.

2.2. Rainbow Algorithm Improvement

To enhance MADDPG’s performance in UAV confrontation tasks, this study integrates the core modules of the Rainbow algorithm into the MADDPG framework, thereby improving its strategic learning capability and environmental adaptability. The pseudo-code of the enhanced algorithm is presented in Algorithm 1. The Rainbow algorithm, which synthesizes multiple reinforcement learning techniques, offers several improvements to learning efficiency and stability. In multi-agent reinforcement learning (MARL), the inherent complexity and dynamic nature of tasks necessitate algorithms that can effectively adapt to environmental changes while demonstrating robust exploration capabilities and rapid convergence. Consequently, incorporating the Rainbow algorithm’s core modules into the MADDPG framework significantly enhances its performance in UAV adversarial tasks, particularly in complex and dynamic adversarial scenarios. The integration maintains CTDE principles by separating information flow: global TD errors ensure consistent PER prioritization, dual-level multi-step returns preserve both local autonomy and global coordination, and centralized information sharing occurs only during training while execution remains distributed. The advantages of this integration are primarily manifested in the following aspects:
(1) Prioritized experience replay: Traditional experience replay methods employ uniform random sampling from historical experiences for agent training, assuming equal contribution from all experience samples to policy learning. However, empirical evidence demonstrates that certain samples, particularly those with substantial temporal difference (TD) errors, exert significantly greater influence on policy optimization. Uniform sampling across all experiences may result in training inefficiencies and underutilization of high-value samples, thereby hindering accelerated policy convergence.
To overcome the limitations of traditional experience replay methods, this study incorporates the Prioritized Experience Replay (PER) module from the Rainbow algorithm. As illustrated in Figure 2, the PER mechanism assigns a priority to each sample in the experience replay pool, typically determined by the sample’s temporal difference (TD) error. Samples with larger TD errors indicate greater deviations between the current policy and the optimal policy, rendering them more valuable for policy optimization. By assigning higher sampling probabilities to samples with elevated TD errors, the PER method ensures their more frequent utilization in policy updates, thereby substantially enhancing learning efficiency and policy optimization outcomes.
In the experience replay buffer, each sample is denoted by the tuple $(s_t, a_t, R_t, s_{t+1})$ [43], and its TD error is computed as:
$$\delta_t = R_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t),$$
where $R_t$ is the current reward, $\gamma$ is the discount factor, and $\max_{a'} Q(s_{t+1}, a')$ is the maximum Q value of the next state. A priority is then assigned to each experience sample:
$$p_t = |\delta_t| + \epsilon,$$
where $\epsilon$ is a small constant that prevents the priority from being zero. The priority $p_t$ of each sample in the experience replay pool thus reflects its importance. Based on these priorities, the PER method uses a priority-based sampling strategy to select experience samples for training:
$$P(s_t, a_t) = \frac{p_t^{\alpha}}{\sum_j p_j^{\alpha}},$$
where $\alpha$ is a hyperparameter that controls the influence of prioritization. A larger $\alpha$ implies a greater tendency to sample experiences with high priority, while a smaller $\alpha$ approaches uniform sampling. By prioritizing the sampling of high-TD-error experiences, PER greatly improves the efficiency of policy updating.
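The prioritization and sampling rule above can be sketched in a few lines of NumPy; $\alpha$ follows the text, while the importance-sampling correction beta is a conventional companion term added here for illustration and is not prescribed by the equations above.

import numpy as np

def sample_prioritized(td_errors, batch_size, alpha=0.6, beta=0.4, eps=1e-6):
    priorities = np.abs(td_errors) + eps                 # p_t = |delta_t| + eps
    probs = priorities ** alpha
    probs /= probs.sum()                                 # P(t) = p_t^alpha / sum_j p_j^alpha
    idx = np.random.choice(len(td_errors), size=batch_size, p=probs)
    weights = (len(td_errors) * probs[idx]) ** (-beta)   # corrects the sampling bias
    weights /= weights.max()
    return idx, weights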
In multi-agent environments, the implementation of Prioritized Experience Replay (PER) not only expedites individual agent learning but also enables agents to focus on strategically crucial experiences during adversarial interactions. For instance, in UAV confrontation tasks, when an agent’s action results in significant consequences (e.g., destruction), the corresponding experience samples exhibit substantially increased temporal difference (TD) errors. By prioritizing the replay of these high-error samples, agents can rapidly identify and rectify critical strategic errors, thereby enhancing task completion rates and overall performance. This mechanism proves particularly valuable in multi-agent adversarial scenarios, as it facilitates faster adaptation and optimization of collaborative and competitive strategies within complex, dynamic environments.
(2) Multi-step TD update: Traditional single-step temporal difference (TD) updating methods calculate an agent’s Q-value update based solely on the immediate reward of the current state and the Q-value of the subsequent state. While computationally efficient and straightforward to implement, this approach fails to adequately capture long-term reward accumulation, particularly in adversarial scenarios where agents often require multiple time steps to obtain meaningful feedback. In environments with sparse or delayed rewards, single-step TD updating may result in inefficient strategy learning or even complete failure to acquire effective behaviors. To address this limitation, this study implements the n-step TD update mechanism from the Rainbow algorithm. As illustrated in Figure 3, the n-step TD update incorporates not only the immediate rewards of the current and next states but also integrates reward information from multiple future time steps for Q-value updates. This method effectively captures long-term reward signals, particularly in tasks with delayed reward feedback, thereby significantly enhancing learning efficiency and policy optimization.
In the n-step TD update, the Q-value is updated with the following equation:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[\sum_{k=0}^{n-1} \gamma^k R_{t+k+1} + \gamma^n \max_{a'} Q(s_{t+n}, a') - Q(s_t, a_t)\right],$$
where $n$ is the update step, $\gamma$ is the discount factor, $R_{t+k+1}$ is the reward at time step $t+k+1$, and $\gamma^n \max_{a'} Q(s_{t+n}, a')$ is the discounted maximum Q value at time step $t+n$. By introducing n-step TD updating, the agent is able to obtain reward information from more time steps, particularly in tasks that require a long time to obtain rewards, and this updating method greatly improves the adaptability of the agent to sparse reward scenarios.
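A minimal sketch of the bracketed n-step target is given below, assuming the training loop has already collected the next n rewards and the target critic’s bootstrap value; the argument names are illustrative.

def n_step_target(rewards, q_bootstrap, gamma=0.95, n=3):
    # rewards = [R_{t+1}, ..., R_{t+n}], q_bootstrap = max_a' Q(s_{t+n}, a').
    target = sum((gamma ** k) * rewards[k] for k in range(n))
    return target + (gamma ** n) * q_bootstrap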
The n-step TD update mechanism demonstrates substantial advantages in multi-agent adversarial games. Consider the UAV island capture task as an illustrative example: agents typically require multiple iterations to achieve their objectives, with each successful action potentially yielding minimal immediate rewards. In such scenarios, n-step TD updating mitigates issues of insufficient exploration or premature convergence caused by excessive reliance on short-term rewards. By aggregating reward signals across multiple time steps, this approach enables agents to optimize their actions based on long-term strategic considerations. The n-step TD update effectively addresses the limitations inherent in traditional single-step TD methods, proving particularly advantageous in environments characterized by delayed reward feedback. This mechanism significantly enhances the convergence rate and strategic stability of multi-agent systems operating in complex task environments.
(3) Behavioral cloning assistance: Behavior Cloning (BC), an imitation learning technique, accelerates agent strategy acquisition by replicating expert behaviors. Within traditional reinforcement learning frameworks, agents predominantly rely on random exploration to accumulate experience and refine strategies. However, this approach often generates numerous inefficient or invalid experiences during initial learning phases, particularly in complex environments, resulting in suboptimal learning efficiency and slow early-stage convergence. To address this limitation, Behavior Cloning serves as an effective auxiliary method, leveraging expert demonstrations to guide initial agent learning. By circumventing the need for random exploration from first principles, BC significantly enhances learning efficiency and expedites strategy convergence.
The core idea of behavioral cloning is to perform policy learning by imitating the behavior of an expert. Specifically, behavioral cloning directly learns a mapping function from states s i to actions a i by using the expert’s state-action pair dataset ( s i , a i ) . This mapping function is typically trained using supervised learning methods such as minimizing the following loss function:
$$L_{\mathrm{BC}} = \frac{1}{N} \sum_{i=1}^{N} \left\| \pi_\theta(s_i) - a_i \right\|^2,$$
where $\pi_\theta(s_i)$ is the action output by the agent in state $s_i$, $a_i$ is the actual action of the expert in state $s_i$, $N$ is the number of samples in the dataset, and $\theta$ denotes the parameters of the policy network. The goal of this loss function is to minimize the gap between the agent’s strategy and the expert’s strategy so that the agent can mimic the expert’s behavior in the initial stage.
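A minimal PyTorch sketch of this supervised warm start is shown below; the epoch count and learning rate are illustrative assumptions rather than the paper’s settings.

import torch
import torch.nn as nn

def bc_pretrain(actor: nn.Module, expert_states: torch.Tensor,
                expert_actions: torch.Tensor, epochs: int = 50, lr: float = 1e-3) -> nn.Module:
    optimizer = torch.optim.Adam(actor.parameters(), lr=lr)
    mse = nn.MSELoss()                      # L_BC = (1/N) * sum_i ||pi_theta(s_i) - a_i||^2
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = mse(actor(expert_states), expert_actions)
        loss.backward()
        optimizer.step()
    return actor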
Algorithm 1 Rainbow-MADDPG algorithm
Input: Initialize actor networks $\theta_\pi$, critic network $\theta_c$, replay buffer $D$, the number of UAVs $N$, the number of episodes $M$, the maximum episode length $T$, the n-step value $n$, the priority exponent $\beta$, and the importance-sampling correction $\alpha$.
for episode = 1 : M do
  Set up the UAVs’ adversarial game environment.
  Initialize state $s_0$ at step $t = 0$; initialize local observations $s_0^i$ for each UAV;
  initialize actions $a_0^i = [r_\varphi, r_\phi, F]$ for each UAV;
  for step t = 1 : T do
    for UAV i = 1 : N do
      Obtain local observation state $s_t^i$;
      $a_t^i$ = Actor($a_{t-1}^i$, $s_{t-1}^i$, $\theta_i$);
    end for
    Execute joint action $a = [a_t^1, a_t^2, \ldots, a_t^N]$;
    Observe rewards $R_t$, next states $s_{t+1}^i$, and completion flags $d_t$;
    TD_error = Calculate_TD_Error($a_t^i$, $s_t^i$, $R_t$, $s_{t+1}^i$, $\theta_c$);
    Store experience $[s_t^i, a_t^i, R_t, s_{t+1}^i, d_t]$ with priority $|\mathrm{TD\_error}|^\beta$ in replay buffer $D$;
    if all $d_t$ are true then
      Reset environment.
    end if
    samples = Sample_Experience_Batch($D$, $\alpha$);
    for UAV i = 1 : N do
      target = Compute_n_Step_TD($a_t^i$, $R_t$, $s_{t+1}^i$, $n$, $\theta_c$);
      loss = Compute_Critic_Loss($s_t^i$, $a_t^i$, target, $\theta_c$);
      Update_Critic_Network($\theta_c$, loss);
      advantage = Compute_Advantage($s_t^i$, $a_t^i$, $\theta_c$);
      actor_gradient = Compute_Actor_Gradient(advantage, $s_t^i$, $a_t^i$, $\theta_i$);
      Update_Actor_Network($\theta_i$, actor_gradient);
    end for
    if fixed_step_update then
      $\theta_\pi \leftarrow \theta_\pi + \alpha \cdot \delta\theta_\pi$;
    end if
  end for
end for
Output: The strategy distribution $\pi_\theta$ of all UAVs.
Behavioral cloning eliminates dependency on reward signals and accelerates learning by replicating expert behaviors to establish initial strategies for subsequent reinforcement learning processes. In this study, expert data for behavioral cloning is generated from the converged baseline MADDPG model. We collect high-quality trajectories with task success rates above 75%, creating a dataset that captures effective multi-agent coordination strategies. The behavioral cloning phase initializes Rainbow-MADDPG policy networks through supervised learning on these expert demonstrations, providing learned coordination patterns as a warm start. This initialization approach ensures consistency with our multi-agent framework while facilitating faster convergence in the subsequent reinforcement learning phase. Within the Multi-agent Reinforcement Learning (MARL) framework, behavioral cloning is integrated with reinforcement learning techniques (e.g., MADDPG and Rainbow algorithms) to mitigate ineffective exploration and expedite agent adaptation and strategy optimization.

3. UAV Modeling

The UAV autonomous control task represents a complex class of multi-agent dynamic game problems characterized by high-dimensional state spaces, intricate action decisions, dynamic environments, and sparse rewards. To effectively implement reinforcement learning algorithms, precise modeling of key elements in the island capture confrontation scenario on the MaCA platform is essential, encompassing state representation, action space, environmental dynamics, and reward functions. Figure 4 illustrates the UAV modeling framework.

3.1. State Space Modeling

The state space is the core description of the UAV’s state information in the mission environment and provides the inputs for the agent’s decision-making. It mainly includes the UAV’s position $(x, y)$, velocity $(v_x, v_y)$, and flight direction and attitude (e.g., heading angle $\varphi$ and roll angle $\phi$). The upper half of Figure 4 shows the flight state of the attack aircraft, while the lower half models the reconnaissance aircraft; the red and blue circles denote the detection range, and the detection radius describes the perception capability of the reconnaissance aircraft. In addition, the state space considers external environment information, such as dynamic obstacles and the relative position, speed, and heading of enemy UAVs, to help model complex confrontation scenarios. In practice, certain tasks may require UAVs to have environmental memory capabilities, which are realized through state history or augmented perception variables. Combining these elements, the state space $S = \{(x, y), (v_x, v_y), r, \varphi, \phi\}$ provides comprehensive environment awareness for the agent to ensure efficient decision-making in complex missions. The state space should also contain information about the dynamics of enemy UAVs to help agents infer enemy strategies and future actions.

3.2. Action Space Modeling

The action space defines the collection of UAV behaviors in the mission environment and is the direct output of the agent’s strategy execution. In UAV confrontation missions, the action space mainly includes the following categories: flight control actions, which realize precise control of position and attitude by adjusting parameters such as speed and heading angle; attack actions, which involve the selection of target UAVs and the execution of attack behaviors; evasion actions, which respond to enemy threats by changing the flight path, speed, or altitude in order to safeguard the UAV’s own safety; and collaborative actions, which support information sharing and task division within UAV groups to improve overall mission efficiency. The integrated design of these action types provides a flexible strategy selection space for the agent and can effectively meet the dynamic decision-making requirements of complex confrontation scenarios. Based on the UAV dynamics model proposed in the literature [44], we establish the following kinematic model by applying the momentum theorem:
$$\begin{aligned}
\phi &= \phi + r_\phi \, dt, \quad -30^\circ < \phi < 30^\circ, \\
r_\varphi &= \frac{9.81\, m}{F \, dt} \tan\phi, \\
\varphi &= \varphi + r_\varphi \, dt, \quad -180^\circ < \varphi < 180^\circ, \\
v_x &= \sin\varphi \cdot \frac{F}{m} \, dt, \qquad v_y = \cos\varphi \cdot \frac{F}{m} \, dt, \\
x &= x + v_x \, dt, \qquad y = y + v_y \, dt,
\end{aligned}$$
where $\phi$ denotes the roll angle, $r_\phi$ the roll angular velocity, $m$ the UAV mass, $F$ the driving force, $\varphi$ the heading angle, $v_x$ and $v_y$ the UAV speeds along the x and y axes, respectively, $(x, y)$ the UAV’s coordinate position, and $dt$ the time differential.
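A minimal sketch of one kinematic update implied by these equations is given below, assuming angles in radians and clipping/wrapping applied to enforce the roll and heading limits; the mass, time step, and clipping scheme are illustrative choices rather than the MaCA settings.

import math

def uav_step(x, y, roll, heading, r_roll, F, m=1.0, dt=0.1):
    # Roll update with the +/-30 degree limit from the text.
    roll = max(-math.radians(30), min(math.radians(30), roll + r_roll * dt))
    # Heading rate induced by the roll angle: r_phi = 9.81 * m / (F * dt) * tan(roll).
    r_heading = 9.81 * m / (F * dt) * math.tan(roll)
    heading = ((heading + r_heading * dt + math.pi) % (2 * math.pi)) - math.pi  # wrap to [-pi, pi)
    vx = math.sin(heading) * F / m * dt
    vy = math.cos(heading) * F / m * dt
    return x + vx * dt, y + vy * dt, roll, heading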

3.3. Strategy Design

This study designs a rule-based algorithm system tailored to the characteristics of MaCA confrontation tasks, incorporating professional player strategies to enhance and refine prior knowledge. The algorithm provides detailed strategies for various combat actions, including attack patterns, interference frequency settings, and evasion tactics. To improve algorithm adaptability during training, the original posture data from the simulated confrontation environment is reconstructed, thereby better supporting agent learning and decision optimization in complex dynamic environments.
(1)
Attack strategies: For enemy units beyond the attack range of our forces, the system compares relative distances between fighters and assigns tracking responsibilities to the nearest available attack unit. Concurrently, a constraint is implemented to limit the number of attack units tracking a single enemy, ensuring both tracking effectiveness and preservation of remaining attack resources. Regarding enemy units within attack range, available combat units are allocated through a coordinated mechanism based on the following operational principles:
  • Attack all enemy units within our attack range if possible;
  • To conserve resources, limit the number of our attacking units when attacking the same enemy unit;
  • In order to increase the efficiency of the attack, the radar range of the reconnaissance unit is expanded to guide the unit on the en route mission to complete the relevant maneuver tasks.
(2)
Interference frequency setting strategy: Given the periodic variation pattern of enemy radar frequencies, the interference frequency strategy primarily employs an online learning approach. This learning process initiates upon entering the simulation and continues throughout the entire engagement. The specific implementation procedure is as follows:
  • Obtain changes in the radar frequency of an enemy aircraft, using the changes over three consecutive time points as a sample;
  • Combine the first two frequencies as features in time order, predict and store the probability distribution of the third frequency.
(3)
Avoidance strategies: Our reconnaissance units acquire enemy posture information across two consecutive time steps, enabling the calculation of potential enemy maneuvering directions. This data, combined with our units’ previous state information, facilitates the projection of probable enemy trajectories and tracking patterns. Based on these estimations, our corresponding reconnaissance and attack units execute evasive maneuvers accordingly.
(4)
Detection unit posture reconstruction: In this study, the state information of our two detection units in the island capture confrontation environment is structured as follows:
  • Our basic attributes: the survival status of this operator, X coordinate, Y coordinate, heading, radar status, and radar frequency;
  • Friendly basic information: distance to another friendly detection unit, distance to all friendly attack units;
  • Basic enemy information: Distance to all enemy units detected by the radar.
(5)
Attack unit posture reconstruction: The state information of our attack unit in the game-adversarial island capture environment is organized as follows:
  • Our basic attributes: operator survival state, X coordinate, Y coordinate, heading, radar state, radar frequency point, jamming radar state, jamming radar frequency point;
  • Friendly basic information: distance to all friendly detection units, distance to other surviving friendly attack units;
  • Basic information about the enemy: distance from the enemy unit actively observed by the radar, distance from the enemy unit passively observed by the jamming radar, direction of the enemy unit and radar frequency of the enemy unit.

3.4. Assessment of Indicators

The average reward value serves as a crucial metric for evaluating the overall strategic effectiveness of a multi-agent system. It is computed by dividing the cumulative rewards obtained by all agents in each episode by the number of participating agents, thereby quantifying system performance per episode. The underlying reward design (Table 1) reflects military tactical priorities: detection rewards (+1) encourage information gathering, while destruction penalties (−14) reflect high operational costs. The survival reward (+14) emphasizes unit preservation as the primary objective. This asymmetric structure (14:1:14 ratio) guides agents toward reconnaissance-based strategies rather than aggressive engagement, transforming sparse mission success signals into dense behavioral guidance for stable multi-agent policy learning. The resulting average reward metric comprehensively reflects algorithm optimization in task accomplishment, resource utilization, and strategic coordination. The specific reward values are detailed in Table 1.
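The asymmetric structure can be written down directly; the event names in the sketch below are hypothetical labels, while the values (+1 detection, −14 destruction, +14 survival) follow Table 1 as described above.

# Illustrative mapping of mission events to shaped rewards (values from Table 1).
REWARDS = {"detect_enemy": 1.0, "own_unit_destroyed": -14.0, "own_unit_survived": 14.0}

def episode_reward(events):
    # Accumulates the shaped reward of one agent from a list of event labels.
    return sum(REWARDS.get(event, 0.0) for event in events)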
The average reward value is calculated by dividing the cumulative rewards of all agents in each episode by the number of agents, yielding the mean reward per episode. This metric reflects the comprehensive performance of the UAV swarm within a single episode, effectively evaluating the algorithm’s capability in coordinating swarm behavior and optimizing mission outcomes. Through analysis of the average reward value trend, we can determine whether the algorithm successfully enhances swarm collaboration efficiency and achieves superior task performance.
$$\text{Average Reward} = \frac{1}{M} \sum_{i=1}^{M} R_i,$$
where $M$ is the number of agents and $R_i$ is the reward of the $i$-th agent in the round. The average reward per round is obtained by summing the rewards of all agents and dividing by the number of agents.
By using methods such as λ return, agents are better able to infer optimal strategies from limited reward signals, which in turn improves their long-term performance. The gradual increase in the average reward value as training proceeds indicates that the agent gradually learns how to optimize its behavior in sparse reward scenarios.
$$G_t^{\lambda} = \sum_{n=0}^{N} \gamma^n R_{t+n} + \gamma^N V(s_{t+N}),$$
where $G_t^{\lambda}$ is the λ-based n-step return, which helps agents better optimize their strategies and adapt effectively in sparse reward scenarios.
Battle Force Preservation Rate: The Battle Force Preservation Rate serves as a crucial metric for evaluating agent survivability in UAV confrontation missions, reflecting the algorithm’s capacity to protect UAV resources while accomplishing mission objectives. In multi-agent confrontation scenarios, agents must not only achieve the primary objective of island capture but also optimize resource utilization and minimize combat losses. This rate quantifies the proportion of surviving UAVs relative to the initial deployment, thereby assessing the algorithm’s effectiveness in resource management and protection during engagements. The Battle Force Preservation Rate is calculated using the following formula:
$$S_{\mathrm{survival}} = \frac{N_{\mathrm{survive}}}{N_{\mathrm{initial}}},$$
where $N_{\mathrm{initial}}$ is the number of UAVs initially deployed (including reconnaissance and attack aircraft) and $N_{\mathrm{survive}}$ is the number of surviving UAVs at the end of the mission. A higher combat power preservation rate indicates stronger agent survivability in the mission, showing that the algorithm can effectively protect its own UAVs during the confrontation, reduce losses, and improve the overall effect of mission execution.
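Both evaluation metrics reduce to one-line computations; the sketch below restates them for clarity, with function names chosen for illustration.

def average_reward(episode_rewards):
    # (1/M) * sum_i R_i over all M agents in one episode.
    return sum(episode_rewards) / len(episode_rewards)

def preservation_rate(n_survive, n_initial):
    # S_survival = N_survive / N_initial.
    return n_survive / n_initial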

4. Experimental Simulation Design

4.1. Configuration and Operating Instructions

The MaCA environment is compatible with Linux 64-bit, macOS, and Windows 10 x64 operating systems [45]. Python (3.9) environment configuration is implemented through PyCharm (2024.1), utilizing the PyTorch (2.0.1) reinforcement learning framework. To execute the system, run the relevant Python files located in the MaCA root directory, ensuring the “Working Directory” is set to the MaCA root directory.

4.2. Parameter Design

The reinforcement learning framework employs both Actor and Critic networks with two hidden layers, each containing 64 neurons. Table 2 presents the training hyperparameters: maximum episodes (100), time steps per episode (13,000), Actor learning rate (1 × 10−4), Critic learning rate (1 × 10−3), discount factor (0.95), soft-update parameter (0.01), experience replay buffer size (5 × 105), batch size (256), optimizer (Adam), and activation function (ReLU). Table 3 details the environment parameters: map dimensions (800 × 500), random initialization range (50), drone speed (6), health points (100), steering range (0.26), and attack deviation (1). Table 4 specifies the drone parameters: attacker detection range (70), damage value (100), and damage range (150); scout detection range (100). The PER exponent α = 0.6 was set following Schaul et al. [35], as it achieves the best balance between preferential sampling and experience diversity among the rank-based variants. The n-step TD length was set to 3 following Hessel et al. [46], which represents the optimal bias-variance trade-off in most reinforcement learning settings.
All experiments were conducted on Intel Core i9-14900 processor with NVIDIA RTX 4090 GPU (24 GB VRAM). The complete training process required 48 h with peak memory usage of 18.5 GB VRAM for the full dataset across 100 episodes with 13,000 time steps per episode. These results suggest that the proposed PER + TD + BC framework, while incurring additional overhead compared to standard TD learning, remains computationally feasible on a single high-performance workstation.

4.3. Heterogeneous Multi-Agent Environment Setting

In the game confrontation scenario, both red and blue UAVs must engage in adversarial interactions while coordinating to accomplish the island capture mission. This scenario presents greater complexity than simple confrontation tasks, as agents must simultaneously ensure successful island capture, maintain their own survival, and effectively engage enemy UAVs. Figure 5 illustrates UAV cluster confrontations for island capture across three distinct island configurations (Quasi-elliptical, Quasi-rectangle, and Quasi-triangle). Each episode begins with red and blue UAVs deploying from identical initial positions, where reconnaissance aircraft are positioned at the center of their respective formations, surrounded by attack aircraft. The objective for both sides involves eliminating enemy drones while protecting their own units and securing the island. This mission evaluates not only the drones’ countermeasure capabilities and tactical adaptability but also their operational effectiveness and survivability in complex environments.
This study establishes a UAV confrontation task environment using the MaCA (Multi-Agent Competitive and Cooperative Environment) simulation platform to evaluate the performance of enhanced algorithms. The experiment designs a multi-agent UAV confrontation scenario simulating red and blue UAVs engaging in island capture missions, enabling performance comparison of different algorithms in complex adversarial environments. The island is configured in a Quasi-rectangular shape, with red and blue UAVs initiating from predefined positions. To assess algorithm performance across varying confrontation scales, missions ranging from 5 vs. 5 to 10 vs. 10 UAV engagements are implemented, with each side deploying two reconnaissance aircraft responsible for enemy position detection and tactical support for attack units. The experiment compares the standard MADDPG algorithm with an enhanced version incorporating the Rainbow module, which integrates multiple reinforcement learning techniques to significantly improve strategy learning efficiency and stability, demonstrating superior adaptability in complex confrontation scenarios. To further validate algorithm adaptability, comparative experiments are conducted across rectangular, elliptical, and triangular island configurations in 7 vs. 7 scenarios, assessing the algorithm’s robustness and strategic optimization capabilities in diverse environments.

4.4. Analysis of Experimental Results

The battle force preservation and the average reward value were employed as evaluation metrics to analyze the performance of UAVs in island capture scenarios. The red side unmanned drone cluster utilized both the standard MADDPG algorithm and its enhanced version incorporating the Rainbow algorithm module, while the blue side employed the standard MADDPG algorithm.
Notably, in all experiments, the standard MADDPG algorithm exhibited low average reward values during the initial training phase, with particularly slow learning progress in the 5 vs. 5 and 6 vs. 6 adversarial tasks, where reward values remained stagnant for extended periods. In contrast, the Rainbow-enhanced MADDPG algorithm demonstrated superior performance early in the training process, showing a steady increase and stabilization of reward values. This improvement can be attributed to the Rainbow module’s preferential experience replay (PER) and multi-step temporal difference (TD) updating mechanisms, which enhance sample utilization efficiency and long-term reward perception, thereby accelerating the learning process and strategy optimization.
Combat effectiveness evaluation across 100 island capture missions, as shown in Figure 6, revealed significant performance disparities through systematic ablation analysis. The baseline MADDPG achieved only a 40% combat power preservation rate, while systematic component integration demonstrates progressive improvements: MADDPG + PER reached 50% preservation, MADDPG + PER + multi-step TD achieved 63% preservation, and the complete Rainbow-MADDPG attained an 80% preservation rate. This substantial enhancement stems from the synergistic effects of each Rainbow component: PER contributes the most significant initial improvement by prioritizing strategically crucial experiences during adversarial interactions, n-step TD learning adds essential long-term reward perception for effective tactical planning, and behavioral cloning provides optimal policy initialization through expert knowledge integration. The collective integration of these mechanisms enables enhanced path planning and attack avoidance capabilities, which effectively reduce battle losses while ensuring mission success, demonstrating that the superior performance results from systematic component synergy rather than individual algorithmic modifications.
As illustrated in Figure 7 and Table 5, the enhanced algorithm exhibits substantial performance advantages over standard MADDPG, particularly in the 6 vs. 6 and 7 vs. 7 scenarios, where it achieves peak rewards of 14 (6.05-fold and 2.5-fold improvements, respectively), while the 5 vs. 5 scenario reaches a reward of 13.5 (4.4-fold improvement). However, the improvement factor diminishes in larger-scale scenarios, such as 8 vs. 8 (1.88-fold), 9 vs. 9 (1.54-fold), and 10 vs. 10 (0.71-fold), suggesting increased complexity in more demanding environments. Overall, the proposed method consistently outperforms the baseline, demonstrating superior scalability and robustness in multi-agent tasks.
The performance gain diminishes from 6.05× (6 vs. 6) to 0.71× (10 vs. 10) due to: (1) exponential state-space growth with interactions among up to 20 agents, (2) reduced Rainbow sample efficiency in high-dimensional environments, and (3) network capacity limitations for large-scale coordination. The baseline MADDPG’s own reward increase in larger scenarios (from 2.5 to 7.3) confirms that environment complexity, rather than the algorithm, is the primary bottleneck.
Figure 8 presents a comparative analysis of average reward values across different task scales (5 vs. 5 to 10 vs. 10). The Rainbow-enhanced algorithm consistently maintained average reward values above 10 across all task sizes, with particularly outstanding performance in 6 vs. 6 and 7 vs. 7 scenarios, where it achieved stable reward values approaching 14 with minimal fluctuations. This consistent performance demonstrates the algorithm’s rapid convergence and effective strategy optimization capabilities. Furthermore, in smaller-scale tasks (e.g., 5 vs. 5), the enhanced algorithm exhibited superior environmental adaptability and strategy stability, achieving higher reward values in shorter timeframes, thereby significantly improving agent collaboration efficiency and adversarial capabilities.
The impact of island morphology on algorithm performance is demonstrated in Figure 9. Rectangular islands provided the most favorable environment, with the enhanced algorithm achieving rapid reward value increases and stabilizing near 14 in later training stages. Conversely, Quasi-elliptical and Quasi-triangle island configurations presented greater challenges, with slower reward growth and final values around 10, indicating that complex terrain features increased task difficulty and reduced algorithm convergence speed. Nevertheless, the enhanced algorithm maintained stable performance across all island types, demonstrating its robustness in diverse operational environments.
The experimental results comprehensively demonstrate the enhanced algorithm’s superiority over the standard MADDPG in three key aspects: (1) task completion efficiency, (2) reward convergence speed, and (3) training stability. The significant improvement in combat power preservation rate underscores the effectiveness of the Rainbow module in enhancing sample utilization, optimizing long-term reward perception, and strengthening strategy robustness. These advancements establish the MADDPG + Rainbow as a highly efficient and stable solution, demonstrating considerable potential for application in complex adversarial scenarios.

5. Conclusions

This study presents an enhanced algorithm that integrates the Rainbow module to overcome the limitations of the MADDPG algorithm in training efficiency and strategy stability for multi-agent UAV adversarial tasks. The proposed algorithm introduces three key improvements: (1) a prioritized experience replay mechanism to optimize sample utilization efficiency, (2) a multi-step temporal difference (TD) updating technique to strengthen long-term reward perception, and (3) a behavioral cloning technique leveraging pre-trained models to expedite strategy convergence during initial training. Experimental results reveal that the enhanced algorithm substantially surpasses the original MADDPG in critical performance metrics, including average reward value, combat power preservation rate, and training convergence speed, while demonstrating superior strategy quality and stability. The algorithm also exhibits remarkable robustness in dynamic adversarial scenarios, effectively balancing mission completion rates and combat power preservation in complex environments. These findings substantiate the algorithm’s potential applicability in multi-agent UAV island capture missions and offer a novel technical approach with theoretical underpinnings for related domains.
However, this study presents certain limitations. While our current evaluation focuses on MADDPG as the baseline algorithm due to its proven effectiveness in continuous action space UAV coordination tasks, future work will expand to comprehensive comparisons with other recent MARL algorithms including QMIX, MAPPO, COMA, and VDN across diverse scenarios to better contextualize our improvements within the broader multi-agent reinforcement learning landscape. The experiments were conducted on the MaCA (Multi-Agent Competitive and Cooperative Environment) simulation platform, which may not fully replicate the complexities of real-world dynamic scenarios or communication constraints. In particular, practical UAV deployments often involve communication latency, sensor noise, and imperfect or incomplete information, which could significantly affect coordination and decision-making. As part of our future work, we plan to incorporate high-fidelity simulators with realistic noise and delay models, and further validate the proposed algorithm through small-scale UAV experiments, in order to strengthen its applicability to real-world scenarios. Moreover, the performance of the enhanced algorithm in large-scale multi-agent tasks (e.g., >20 vs. 20) requires further verification. Future research directions encompass: (1) integrating meta-learning frameworks to enhance the algorithm’s adaptability in heterogeneous environments and multi-task scenarios; (2) developing communication delay models and conducting hardware-in-the-loop (HITL) experiments to assess practical deployment feasibility; and (3) incorporating dynamic disturbances, such as extreme weather and GPS denial, to improve robustness under non-ideal conditions. This study lays a theoretical foundation for the engineering application of UAV swarm countermeasure technologies and provides a comprehensive technical roadmap for future investigations.

Author Contributions

Conceptualization, C.Y.; methodology, C.Y.; software, C.Y.; validation, B.Z., M.Z. and Q.W.; formal analysis, B.Z.; investigation, C.Y.; resources, B.Z.; data curation, C.Y.; writing—original draft preparation, C.Y.; writing—review and editing, C.Y.; visualization, P.Z.; supervision, Q.W.; project administration, B.Z.; funding acquisition, B.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the Open Research Subject of State Key Laboratory of Intelligent Game [grant number ZBKF-24-09] and the Military-Civilian Integration Project of the Department of Science and Technology of Inner Mongolia Autonomous Region [project number 2023YFJM0003] and Natural Science Basic Research Program of Shaanxi Province [grant number 2024JC-YBQN-0356] and National Natural Science Foundation of China [grant number 42401446].

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy reasons.

Acknowledgments

The authors would like to thank the Cognitive and Intelligent Technology Key Laboratory, China Electronics Technology Group Corporation, for providing the MaCA environment used in this research and supporting the simulation experiments.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Silver, D.; Huang, A.; Maddison, C.J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. Mastering the game of go with deep neural networks and tree search. Nature 2016, 529, 484–489. [Google Scholar] [CrossRef]
  2. Pang, Z.J.; Liu, R.Z.; Meng, Z.Y.; Zhang, Y.; Yu, Y.; Lu, T. On reinforcement learning for full-length game of starcraft. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 4691–4698. [Google Scholar]
  3. Li, S.; Chen, M.; Wang, Y.; Wu, Q.; He, J. Human-computer gaming decision-making method in air combat under an incomplete strategy set. Sci. Sin. Inform. 2022, 52, 2239–2253. [Google Scholar] [CrossRef]
  4. Yan, F.; Zhu, X.; Zhou, Z.; Tang, Y. Real-time task allocation for a heterogeneous multi-uav simultaneous attack. Sci. Sin. Inform. 2019, 49, 555–569. [Google Scholar] [CrossRef]
  5. Schrittwieser, J.; Antonoglou, I.; Hubert, T.; Simonyan, K.; Sifre, L.; Schmitt, S.; Guez, A.; Lockhart, E.; Hassabis, D.; Graepel, T.; et al. Mastering atari, go, chess and shogi by planning with a learned model. Nature 2020, 588, 604–609. [Google Scholar] [CrossRef]
  6. Bi, Z.; Xiao, F.; Kong, D.; Song, X.; Jia, Z.; Lin, T. A data-driven modeling method for game adversity agent. J. Syst. Simul. 2022, 33, 2838–2845. [Google Scholar]
  7. Barriga, N.; Stanescu, M.; Buro, M. Combining strategic learning with tactical search in real-time strategy games. In Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, Little Cottonwood Canyon, UT, USA, 5–9 October 2017; Volume 13, pp. 9–15. [Google Scholar]
  8. Isler, V.; Kannan, S.; Khanna, S. Randomized pursuit-evasion in a polygonal environment. IEEE Trans. Robot. 2005, 21, 875–884. [Google Scholar] [CrossRef]
  9. Kaneshige, J.; Krishnakumar, K. Artificial immune system approach for air combat maneuvering. In Intelligent Computing: Theory and Applications V; SPIE: Bellingham, WA, USA, 2007; Volume 6560, pp. 68–79. [Google Scholar]
  10. Chen, X.; Wang, Y.F. Study on multi-uav air combat game based on fuzzy strategy. Appl. Mech. Mater. 2014, 494, 1102–1105. [Google Scholar] [CrossRef]
  11. Duan, H.; Li, P.; Yu, Y. A predator-prey particle swarm optimization approach to multiple ucav air combat modeled by dynamic game theory. IEEE/CAA J. Autom. Sin. 2015, 2, 11–18. [Google Scholar] [CrossRef]
  12. Zhou, W.; Zhu, J.; Kuang, M. An unmanned air combat system based on swarm intelligence. Sci. China Inf. Sci. 2020, 50, 363–374. [Google Scholar] [CrossRef]
  13. Zhao, X.; Yang, R.; Zhong, L.; Hou, Z. Multi-UAV path planning and following based on multi-agent reinforcement learning. Drones 2024, 8, 18. [Google Scholar] [CrossRef]
  14. Fernando, X.; Gupta, A. Analysis of unmanned aerial vehicle-assisted cellular vehicle-to-everything communication using markovian game in a federated learning environment. Drones 2024, 8, 238. [Google Scholar] [CrossRef]
  15. Yang, J.; Yang, X.; Yu, T. Multi-unmanned aerial vehicle confrontation in intelligent air combat: A multi-agent deep reinforcement learning approach. Drones 2024, 8, 382. [Google Scholar] [CrossRef]
  16. Khan, M.R.; Premkumar, G.R.V.; Van Scoy, B. Robust UAV-oriented wireless communications via multi-agent deep reinforcement learning to optimize user coverage. Drones 2025, 9, 321. [Google Scholar] [CrossRef]
  17. Lowe, R.; Wu, Y.; Tamar, A.; Harb, J.; Abbeel, P.; Mordatch, I. Multi-agent actor-critic for mixed-cooperative-competitive environments. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 6379–6389. [Google Scholar]
  18. Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; et al. Mastering the game of go without human knowledge. Nature 2017, 550, 354–359. [Google Scholar] [CrossRef]
  19. Silver, D.; Hubert, T.; Schrittwieser, J.; Antonoglou, I.; Lai, M.; Guez, A.; Lanctot, M.; Sifre, L.; Kumaran, D.; Graepel, T.; et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv 2017, arXiv:1712.01815. [Google Scholar]
  20. Vinyals, O.; Babuschkin, I.; Czarnecki, W.M.; Mathieu, M.; Dudzik, A.; Chung, J.; Choi, D.H.; Powell, R.; Ewalds, T.; Georgiev, P.; et al. Grandmaster level in starcraft II using multi-agent reinforcement learning. Nature 2019, 575, 350–354. [Google Scholar] [CrossRef]
  21. Chen, C.; Mo, L.; Zheng, D.; Cheng, Z.; Lin, D. Asymmetric maneuverability Multi-UAV intelligent coordinated attack and defense confrontation. Acta Aeronaut. Astronaut. Sin. 2020, 41, 324152. [Google Scholar]
  22. Li, S.; Jia, Y.; Yang, F.; Qin, Q.; Gao, H.; Zhou, Y. Collaborative decision-making method for multi-uav based on multiagent reinforcement learning. IEEE Access 2022, 10, 91385–91396. [Google Scholar] [CrossRef]
  23. Zhang, T.; Qiu, T.; Liu, Z.; Pu, Z.; Yi, J.; Zhu, J.; Hu, R. Multi-uav cooperative short-range combat via attention-based reinforcement learning using individual reward shaping. In Proceedings of the 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Kyoto, Japan, 23–27 October 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 13737–13744. [Google Scholar]
  24. Gong, Z.; Xu, Y.; Luo, D. Uav cooperative air combat maneuvering confrontation based on multi-agent reinforcement learning. Unmanned Syst. 2023, 11, 273–286. [Google Scholar] [CrossRef]
  25. Zhou, J.; Sun, Y.; Xue, Y.; Xiang, Q.; Wu, Y.; Zhou, X. Research on heterogeneous multi-agent reinforcement learning algorithm integrating prior knowledge. Command Control Simul. 2023, 45, 99–107. [Google Scholar]
  26. Wang, E.; Chen, J.; Hong, C.; Liu, F.; Chen, A.; Jing, H. Introducing a counterfactual baseline for the UAV cluster adversarial game approach. Sci. Sin. Inform. 2024, 54, 1175. [Google Scholar]
  27. Aler, R.; Valls, J.M.; Camacho, D.; Lopez, A. Programming robosoccer agents by modeling human behavior. Expert Syst. Appl. 2009, 36, 1850–1859. [Google Scholar] [CrossRef]
  28. Ho, J.; Ermon, S. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2016; Volume 29, pp. 4572–4580. [Google Scholar]
  29. Wang, Q.; Cheng, D.; Jia, F.; Li, B.; Bo, L. Improving behavioural cloning with positive unlabeled learning. In Proceedings of the Conference on Robot Learning, Atlanta, GA, USA, 6–9 November 2023; PMLR: Brookline, MA, USA, 2023; pp. 3851–3869. [Google Scholar]
  30. Sun, S.; Li, T.; Chen, X.; Dong, H.; Wang, X. Cooperative defense of autonomous surface vessels with quantity disadvantage using behavior cloning and deep reinforcement learning. Appl. Soft Comput. 2024, 164, 111968. [Google Scholar] [CrossRef]
  31. Foerster, J.; Assael, I.A.; De Freitas, N.; Whiteson, S. Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2016; Volume 29, pp. 2145–2153. [Google Scholar]
  32. Gupta, J.K.; Egorov, M.; Kochenderfer, M. Cooperative multi-agent control using deep reinforcement learning. In Proceedings of the Autonomous Agents and Multiagent Systems: AAMAS 2017 Workshops, Best Papers, São Paulo, Brazil, 8–12 May 2017; Springer: Berlin/Heidelberg, Germany, 2017; pp. 66–83. [Google Scholar]
  33. Omidshafiei, S.; Pazis, J.; Amato, C.; How, J.P.; Vian, J. Deep decentralized multi-task multi-agent reinforcement learning under partial observability. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; PMLR: Brookline, MA, USA, 2017; pp. 2681–2690. [Google Scholar]
  34. Wang, E.; Liu, F.; Hong, C.; Guo, J.; Zhao, L.; Xue, J. Masac-based confrontation game method of uav clusters. Sci. Sin. Inform. 2022, 52, 2254–2269. [Google Scholar] [CrossRef]
  35. Schaul, T.; Quan, J.; Antonoglou, I.; Silver, D. Prioritized experience replay. arXiv 2015, arXiv:1511.05952. [Google Scholar]
  36. De Asis, K.; Hernandez-Garcia, J.; Holland, G.; Sutton, R. Multi-step reinforcement learning: A unifying algorithm. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32, pp. 2887–2894. [Google Scholar]
  37. Buşoniu, L.; Babuška, R.; De Schutter, B. Multi-agent reinforcement learning: An overview. In Innovations in Multi-Agent Systems and Applications-1; Springer: Berlin/Heidelberg, Germany, 2010; pp. 183–221. [Google Scholar]
  38. Littman, M.L. Reinforcement learning improves behaviour from evaluative feedback. Nature 2015, 521, 445–451. [Google Scholar] [CrossRef] [PubMed]
  39. Wang, H.; Wang, X.; Zhang, X.; Yu, Q.; Hu, X. Effective service composition using multi-agent reinforcement learning. Knowl. Based Syst. 2016, 92, 151–168. [Google Scholar] [CrossRef]
  40. Palmer, G.; Tuyls, K.; Bloembergen, D.; Savani, R. Lenient multi-agent deep reinforcement learning. arXiv 2017, arXiv:1707.04402. [Google Scholar]
  41. Littman, M.L. Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the Eleventh International Conference on Machine Learning, New Brunswick, NJ, USA, 10–13 July 1994; pp. 157–163. [Google Scholar]
  42. Wolpert, D.; Tumer, K. Optimal payoff functions for members of multiagent teams. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2001; Volume 12, pp. 265–279. [Google Scholar]
  43. Liu, W.; Zhang, D.; Wang, X.; Hou, J.; Liu, L. A decision making strategy for generating unit tripping under emergency circumstances based on deep reinforcement learning. Proc. CSEE 2018, 38, 109–119. [Google Scholar]
  44. McGrew, J.S.; How, J.P.; Williams, B.; Roy, N. Air-combat strategy using approximate dynamic programming. J. Guid. Control Dyn. 2010, 33, 1641–1654. [Google Scholar] [CrossRef]
  45. China Electronics Technology Group. Multi-Agent Combat Arena (MACA). 2021. Available online: https://github.com/SJTUwbl/MaCA (accessed on 2 February 2018).
  46. Hessel, M.; Modayil, J.; Van Hasselt, H.; Schaul, T.; Ostrovski, G.; Dabney, W.; Horgan, D.; Piot, B.; Azar, M.; Silver, D. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence; Association for the Advancement of Artificial Intelligence: Palo Alto, CA, USA, 2018. [Google Scholar]
Figure 1. Network structure of MADDPG algorithm incorporating Rainbow algorithm module.
Figure 2. Prioritized experience replay flowchart.
Figure 3. Multi-step TD update flowchart.
Figure 4. UAV modeling diagram.
Figure 5. Schematic diagram of UAV clusters in three island shapes for island capture confrontation.
Figure 6. Combat power preservation rate chart.
Figure 7. Comparison of average reward values between the MADDPG+Rainbow and MADDPG for different confrontation sizes (island shape: Quasi-rectangle). Subfigure (a) shows the comparison for 5 vs. 5 and 6 vs. 6 confrontation sizes, subfigure (b) for 7 vs. 7 and 8 vs. 8, and subfigure (c) for 9 vs. 9 and 10 vs. 10.
Figure 8. Average reward values for different UAV cluster confrontation sizes (5 vs. 5 to 10 vs. 10) using the proposed method for the condition where the middle island is shaped as a Quasi-rectangle.
Figure 9. Average reward values for different island shapes (Quasi-elliptical, Quasi-rectangle, Quasi-triangle) under the conditions of using the proposed method.
Table 1. Incentive value settings.

Unit Type | Categorization | Meaning | Reward Value
Attack unit | Attack Result Returns | Enemy destroyed | 1
Attack unit | Attack Result Returns | Failure to fight the enemy | −1
Attack unit | Detection Returns | Enemy units detected | 1
Attack unit | Destroyed Returns | Attack unit destroyed | −14
Sensor unit | Detection Returns | Enemy units detected | 1
Sensor unit | Destroyed Returns | Detection unit destroyed | −14
Common | Survival Returns | Survival of own unit | 14
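To make the use of Table 1 concrete, the following is a minimal, hypothetical Python sketch of how the listed incentive values could be accumulated into a per-step reward for each UAV. The event flags and the function name are illustrative assumptions for exposition, not fields of the MaCA platform's actual interface.

```python
# Hypothetical per-step reward assembly based on the incentive values in Table 1.
# The event flags (enemy_destroyed, attack_missed, ...) are illustrative assumptions.

REWARDS = {
    "enemy_destroyed": 1.0,    # attack unit: enemy destroyed
    "attack_missed": -1.0,     # attack unit: failure to fight the enemy
    "enemy_detected": 1.0,     # attack or sensor unit: enemy units detected
    "unit_destroyed": -14.0,   # own attack/detection unit destroyed
    "unit_survived": 14.0,     # common: survival of own unit
}

def step_reward(events: dict) -> float:
    """Sum the incentive values for the events that occurred during this step."""
    return sum(REWARDS[name] for name, occurred in events.items() if occurred)

# Example: an attack UAV that detects and destroys one enemy and survives the step.
print(step_reward({"enemy_destroyed": True, "attack_missed": False,
                   "enemy_detected": True, "unit_destroyed": False,
                   "unit_survived": True}))  # 1 + 1 + 14 = 16
```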
Table 2. Hyperparameter settings.

Parameter | Value
Max-episode | 100
Time-steps | 13,000
Lr-actor | 1 × 10⁻⁴
Lr-critic | 1 × 10⁻³
γ | 0.95
τ | 0.01
Buffer size | 5 × 10⁵
Batch size | 256
Optimizer | Adam
Activation function | ReLU
α (PER) | 0.6
n (multi-step TD) | 3
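As an illustration of how the last rows of Table 2 interact, the snippet below is a minimal sketch of a 3-step TD target with γ = 0.95 and of proportional PER sampling probabilities with α = 0.6. The variable names and the simplified priority formula are assumptions made for exposition, not the exact implementation used in this work.

```python
import numpy as np

GAMMA = 0.95   # discount factor (Table 2)
N_STEP = 3     # multi-step TD horizon (Table 2)
ALPHA = 0.6    # PER priority exponent (Table 2)

def n_step_target(rewards, bootstrap_q):
    """n-step TD target: discounted reward sum plus a discounted bootstrap value.

    rewards      -- the next N_STEP rewards r_t, ..., r_{t+n-1}
    bootstrap_q  -- critic estimate Q(s_{t+n}, a_{t+n}) from the target network
    """
    target = sum(GAMMA ** k * r for k, r in enumerate(rewards))
    return target + GAMMA ** len(rewards) * bootstrap_q

def per_probabilities(td_errors, eps=1e-6):
    """Proportional PER: p_i = |delta_i|^alpha / sum_j |delta_j|^alpha."""
    priorities = (np.abs(td_errors) + eps) ** ALPHA
    return priorities / priorities.sum()

# Example: three observed rewards and a bootstrapped Q-value of 2.0.
print(n_step_target([1.0, 0.0, 14.0], bootstrap_q=2.0))  # 1 + 0.95**2 * 14 + 0.95**3 * 2
print(per_probabilities(np.array([0.5, 2.0, 0.1])))      # larger TD errors sampled more often
```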
Table 3. Environmental parameter settings.

Parameter | Value
Map_x_limit | 800
Map_y_limit | 500
Random_limit | 50
Speed | 6
Bloods | 100
Turn_range | 0.26
Attack_bias | 1
Table 4. Drone parameter settings.

Parameter | Value
Fighter_attack_percent | 1
Fighter_detect_range | 70
Fighter_damage | 100
Fighter_damage_range | 150
Fighter_turn_range | 3.14
Reconnaissance_detect_range | 100
Reconnaissance_turn_range | 3.14
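For reference, the environment and drone parameters in Tables 3 and 4 could be collected into a single configuration structure along the lines of the sketch below. The dataclass layout and field grouping are our own illustrative assumptions, not the MaCA platform's native configuration format.

```python
from dataclasses import dataclass

@dataclass
class EnvConfig:
    """Environmental parameters (Table 3)."""
    map_x_limit: int = 800
    map_y_limit: int = 500
    random_limit: int = 50
    speed: int = 6
    bloods: int = 100
    turn_range: float = 0.26
    attack_bias: int = 1

@dataclass
class DroneConfig:
    """Drone parameters (Table 4)."""
    fighter_attack_percent: float = 1.0
    fighter_detect_range: int = 70
    fighter_damage: int = 100
    fighter_damage_range: int = 150
    fighter_turn_range: float = 3.14
    reconnaissance_detect_range: int = 100
    reconnaissance_turn_range: float = 3.14

# Instantiate both configurations with the tabulated defaults.
env_cfg, drone_cfg = EnvConfig(), DroneConfig()
print(env_cfg, drone_cfg)
```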
Table 5. Performance improvement comparison of MADDPG and MADDPG+Rainbow under different confrontation scales.

Scale of Confrontation | MADDPG | MADDPG + Rainbow | Improvement (Times)
5 vs. 5 | 2.5 | 13.5 | 4.4
6 vs. 6 | 2 | 14.1 | 6.05
7 vs. 7 | 4 | 14 | 2.5
8 vs. 8 | 4.1 | 11.8 | 1.88
9 vs. 9 | 5 | 12.7 | 1.54
10 vs. 10 | 7.3 | 12.5 | 0.71
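The "Improvement (Times)" column in Table 5 is consistent with the relative gain (R_Rainbow − R_MADDPG) / R_MADDPG; the short check below reproduces the tabulated factors under that assumption.

```python
# Reproduce the "Improvement (Times)" column of Table 5, assuming it denotes the
# relative gain of MADDPG+Rainbow over the MADDPG baseline.
rows = {  # scale: (MADDPG, MADDPG+Rainbow) average reward
    "5 vs. 5": (2.5, 13.5),
    "6 vs. 6": (2.0, 14.1),
    "7 vs. 7": (4.0, 14.0),
    "8 vs. 8": (4.1, 11.8),
    "9 vs. 9": (5.0, 12.7),
    "10 vs. 10": (7.3, 12.5),
}
for scale, (baseline, rainbow) in rows.items():
    improvement = (rainbow - baseline) / baseline
    print(f"{scale}: {improvement:.2f}x")  # e.g. 5 vs. 5 -> 4.40x, 6 vs. 6 -> 6.05x
```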
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

