Article

Optimization of Multi-Agent Strategies for UAV Adversarial Tasks Based on MADDPG-SASP

1 School of Electronic and Electrical Engineering, Wuhan Textile University, Wuhan 430200, China
2 School of Information Science and Engineering, Xinjiang College of Science and Technology, Korla 841000, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Information 2025, 16(12), 1050; https://doi.org/10.3390/info16121050
Submission received: 21 October 2025 / Revised: 26 November 2025 / Accepted: 27 November 2025 / Published: 1 December 2025

Abstract

In intelligent multi-agent systems, particularly in drone combat scenarios, the challenges posed by rapidly changing environments and incomplete information significantly hinder effective strategy optimization. Traditional multi-agent reinforcement learning (MARL) approaches often struggle to adapt to the dynamic nature of adversarial environments, especially when enemy strategies evolve continuously, which complicates agents' ability to respond effectively. To address these challenges, this paper introduces an enhanced MARL framework, MADDPG-SASP, which integrates an improved self-attention mechanism and self-play into the MADDPG algorithm, thereby facilitating superior strategy optimization. The self-attention mechanism enables agents to adaptively extract critical environmental features, enhancing both the speed and accuracy of perception and decision-making. Concurrently, the adaptive self-play mechanism iteratively refines agent strategies through continuous adversarial interactions, bolstering the stability and flexibility of their responses. Empirical results indicate that after 600 rounds, the win rate of agents employing this framework rose substantially, from 26.17% with the original MADDPG to 100%. Further validation through comparative experiments underscores the method's efficacy, demonstrating considerable advantages in strategy optimization and agent performance in complex, dynamic environments. Moreover, in the predator–prey combat scenario, when the enemy side employs a multi-agent strategy, the win rate of the drone agents reaches 98.5% and 100%.

1. Introduction

Reinforcement learning (RL) is a fundamental and pivotal subfield of machine learning, demonstrating substantial promise in addressing complex, dynamic decision-making problems in recent years. In single-agent settings, RL has been extensively applied to areas such as robotic control, game-playing agents, and drone path planning. However, real-world applications frequently involve multi-agent interactions, which introduce complexities that far exceed those of single-agent scenarios. Although traditional single-agent methods, including Q-learning [1], Policy Gradient [2], and Actor–Critic [3], serve as the theoretical underpinnings of multi-agent reinforcement learning (MARL) [4], their direct application to multi-agent systems often leads to issues such as non-stationarity and limited observability, which result in unstable and suboptimal performance. These challenges have spurred the evolution of MARL, with its central goal being to foster cooperation and strategic decision-making among agents through advanced policy optimization techniques, ultimately facilitating the efficient emergence of collective intelligence.
In multi-agent systems, agents must operate under partial observability, substantially compounding the complexity of sequential decision-making. The core challenges inherent to these domains can be summarized as follows: (1) Environmental non-stationarity and partial observability: The confluence of incomplete information and a dynamically shifting environment—primarily due to the simultaneous learning and adaptation of other agents—fundamentally undermines the predictability of state transitions and reward signals. (2) Curse of dimensionality: The joint state-action space grows exponentially with the number of agents, rendering traditional exploration and optimization techniques computationally intractable and inefficient. (3) Mixed incentive structures: The complex interplay between cooperative and competitive objectives among agents introduces profound strategic uncertainty, necessitating sophisticated mechanisms for credit assignment and equilibrium selection to facilitate effective decentralized policies.
To address these challenges, significant advancements have been made in deep MARL. A cornerstone of this progress is the centralized training with decentralized execution (CTDE) paradigm. The multi-agent deep deterministic policy gradient (MADDPG) algorithm [5], an extension of the single-agent deep deterministic policy gradient (DDPG) [6], is a seminal CTDE approach that stabilizes learning by leveraging a centralized critic to mitigate non-stationarity. For value function estimation, techniques from single-agent RL such as the twin delayed deep deterministic policy gradient (TD3) [7] are often incorporated to reduce overestimation bias through dual critics and delayed policy updates. Beyond actor–critic methods, architectures equipped with self-attention mechanisms have emerged, enabling agents to dynamically weigh the importance of information from other agents and the environment, thus enhancing coordination under partial observability.
The advancement of drone technology has created new applications for MARL. In dynamic 3D environments, drones must perform complex tasks such as target tracking, obstacle avoidance, and combat. Traditional methods such as rule-based systems or model predictive control (MPC) [8] lack adaptability in these high-dimensional scenarios. In contrast, reinforcement learning (RL) enables adaptive decision-making through interactive learning, demonstrating stronger generalization and robustness. For example, MADDPG has been used for drone formation control, while TD3 excels in single-drone trajectory optimization. However, a key challenge remains in multi-drone adversarial tasks: efficiently extracting relevant features from high-dimensional observations to enhance combat effectiveness, which requires further investigation.
The rapid advancement of the low-altitude economy has created extensive opportunities for the application of unmanned aerial vehicle (UAV)-related technologies. In this context, aerial surveillance represents a critical application scenario, which demands precise positioning and safe navigation of UAVs under dynamic disturbances. Such requirements have accelerated the development of semantic perception and active navigation technologies. The deep reinforcement learning-based semantic perception path planning framework introduced in [9] improves UAV navigation robustness by evaluating the perceptual value of scene semantic information, thereby offering reliable technical support for aerial surveillance tasks.
Meanwhile, as a key focus within the low-altitude economy, Integrated Sensing and Communication (ISAC) overcomes the limitations of conventional systems where communication and sensing functions are segregated. In this framework, UAVs are capable of not only delivering communication services to ground users but also performing sensing tasks in target areas [10]. This integrated approach demonstrates significant potential in emergency communications and intelligent surveillance applications. For example, in remote regions, it facilitates simultaneous data transmission, environmental monitoring, and target tracking. The joint maneuvering and beamforming design proposed in that work capitalizes on the Line-of-Sight (LoS) link advantages and agile mobility of UAVs. This approach maintains stable sensing performance while optimizing communication throughput, thereby establishing a foundation for the engineering implementation of ISAC technology. Nevertheless, several challenges persist, including difficulties in dynamically adjusting semantic weights and insufficient system robustness.
To address the challenges of complex multi-agent collaborative decision-making, this paper proposes an innovative MARL framework. The core contribution lies in the organic integration of an enhanced self-attention mechanism with a self-play strategy via a learnable weighting function. In this framework, the adversarial agent adopts the TD3 algorithm as its core strategy, thereby imposing stricter robustness requirements on our approach. Consequently, the proposed architecture builds upon the MADDPG baseline. The incorporated self-attention module accurately captures dynamic inter-agent interactions, while the self-play mechanism enhances policy complexity by having agents compete against their historical versions trained with TD3. A dynamic weighted fusion scheme adaptively balances the contributions of both techniques in challenging settings such as those with TD3-based adversaries. This design leads to significant improvements in collaborative learning efficiency and final policy stability.
In order to deal with multiple factors in the three-dimensional adversarial task, this study designs a comprehensive reward function that incorporates distance, victory, and height difference rewards. This approach ensures that the agent not only considers horizontal distance but also accounts for the impact of height differences, thereby broadening the decision-making scope within a multidimensional space. By optimizing the reward mechanism, the agent demonstrates enhanced learning efficiency and improved decision-making capabilities in the air combat task.
The main contributions of this study are as follows:
(1)
Enhancement of perception ability via self-attention mechanism: In this work, the self-attention mechanism is integrated into the Actor component of the MADDPG framework, improving the agent’s ability to perceive key environmental factors by dynamically weighting the important features in the state. This mechanism enables the agent to adapt its strategy in response to environmental changes, thereby enhancing performance in complex adversarial tasks.
(2)
Optimization of multi-agent antagonistic strategies: This study combines an enhanced MADDPG algorithm with the self-attention mechanism to optimize the agent’s antagonistic strategies through multi-agent self-play training. Over 600 training rounds, the agent’s win rate improved from 26.17% to 100%, demonstrating that the optimization process enables the agent to gain a significant advantage in complex confrontational tasks. Notably, when the enemy’s strategy changes more drastically, the agent’s adaptability and stability are markedly improved.
(3)
Efficient adversarial training framework: This paper proposes an optimized MADDPG approach that incorporates self-attention and self-play mechanisms to construct decision-making models for unmanned aerial combat agents. MADDPG serves as the foundational decision-making framework for the agents, augmented by a self-attention module integrated into the Actor network. By computing self-attention weights across state features, the proposed method enhances the agent's capacity to extract critical information from the environment. In parallel, an automated strategy pool is established to facilitate adversarial training. This pool engages in self-play by competing against historical policy samples with a preset probability of 30%, and against policies generated by the TD3 algorithm with a preset probability of 70%. Furthermore, linear weighting coefficients α and β are introduced to balance the contributions of the self-attention mechanism in feature extraction and the self-play mechanism in adversarial training, respectively. Through this dual-weighting strategy, the proposed framework collectively improves the adaptability and stability of agent decision-making in dynamic and adversarial scenarios.

2. Related Work

In the domain of unmanned aerial vehicle (UAV) control leveraging deep reinforcement learning (DRL), extensive research efforts have been undertaken, which can be systematically classified into four principal algorithmic paradigms: value-based methods, policy gradient methods, actor–critic frameworks, and model-based approaches. Each paradigm offers distinct advantages while encountering its own set of technical limitations and practical challenges.
Value-based methods optimize an agent’s policy by approximating a cumulative reward function. Representative techniques include Q-learning and its numerous deep learning extensions. For instance, Duan employed deep Q-networks (DQNs) to model multi-UAV cooperative air combat, iteratively refining joint tactics through Q-value updates [11]. Wang applied the DQN framework to path planning in solar-powered UAVs, emphasizing energy allocation strategies to ensure operational continuity under dynamic environmental conditions [12].
Policy gradient methods directly parameterize and optimize the policy, bypassing explicit Q-value estimation. Shen et al. introduced a multi-objective optimization framework based on the golden turtle algorithm for multi-UAV cooperative path planning, where integration with policy gradient techniques enhanced planning efficiency [13].
Actor–critic methods have gained prominence in MARL but often exhibit instability in multi-UAV, IoT, and other distributed systems, particularly when coordination across agents is required in dynamic, partially observable environments. Chen et al. developed an autonomous multi-UAV path planning scheme under incomplete information, achieving improved efficiency through actor–critic optimization [14]. Sun et al. enhanced deep deterministic policy gradient (DDPG) methods for IoT resource allocation, improving system stability [15]. Li et al. introduced FS-DDPG, a safety-constrained DRL method for optimal fan cooling system control [16], while Lu et al. integrated KNN-DDPG for energy-efficient joint computation and trajectory planning [17].
Model-based methods—primarily applied in planning—are extensively used in UAV path optimization to enhance computational efficiency and control precision. Polyakov applied nonlinear feedback control to develop a time-stable UAV control mechanism, improving both response speed and accuracy [18]. Labbadi et al. proposed a fractional-order global sliding mode control method to counteract disturbances and uncertainties [19]. Zeng and Zhang combined trajectory optimization with DRL to balance communication quality and energy efficiency in UAV networks [20].
Despite these advancements, significant challenges remain for DRL-based UAV control in complex, multi-agent environments. Even state-of-the-art frameworks, such as MADDPG, face unresolved issues in strategic interaction modeling, particularly in adversarial and dynamically evolving contexts. Common limitations include suboptimal inter-agent coordination, limited generalization of learned strategies, and low efficiency in transferring decision-making policies.
To address these challenges, this study proposes a novel optimization framework that integrates attention mechanisms with self-play training, aimed at enhancing MADDPG performance in high-complexity environments. The attention mechanism enables agents to prioritize critical decision variables, thereby refining the policy generation process. Simultaneously, self-play facilitates continuous adaptation by exposing agents to diverse adversarial strategies, improving robustness and adaptability in dynamic operational scenarios. This integrated approach is particularly relevant to real-world applications such as UAV collaborative task offloading and path optimization, where high decision accuracy, coordination efficiency, and adaptability are paramount. This method is called MADDPG-SASP.

3. Our Method

The framework of the proposed approach is illustrated in Figure 1, which incorporates a self-attention mechanism and a self-play module into the intelligent agent, based on an improved MADDPG algorithm. From the perspective of strategy optimization, the enhanced MADDPG algorithm, self-play mechanism, and attention mechanism collaborate synergistically to improve the performance of the agent’s strategy. The improved MADDPG algorithm utilizes a centralized training and decentralized execution mechanism, enabling the agent to continuously adapt within the environment. The introduction of the self-play mechanism allows the agent to engage in interactions with itself, generating adversarial data for further strategy optimization. Additionally, the self-attention mechanism dynamically weights input features, guiding the agent to focus on the most relevant decision-making information. This combination results in more accurate and efficient strategic decisions. Overall, this strategy optimization framework enables the intelligent agent to achieve more effective strategy refinement during gameplay, thereby enhancing its success rate.
This method is called MADDPG-SASP, and the modules in this strategy will be described in detail in Section 3.1, Section 3.2, Section 3.3 and Section 3.4.

3.1. Improved MADDPG Algorithm

Figure 2 depicts the improved MADDPG algorithm presented in this paper. Compared with the original MADDPG algorithm, it employs an adaptive target network update method to avoid policy oscillations.

3.1.1. Actor Critic Framework

In the Improved MADDPG algorithm, the actor and the critic are responsible for the strategy and evaluation components, respectively. Specifically, the actor maps a given state $s$ to an action $a$ via the policy function $\pi_{\theta}(s)$, while the critic evaluates the quality of taking action $a$ in state $s$ through the Q-value function $Q_{w}(s, a)$. This Q-value function is updated according to the Bellman equation [21]:

$$Q_{w}(s, a) = \mathbb{E}\left[ R_t + \gamma Q_{w'}(s', a') \right]$$

where $R_t$ denotes the immediate reward obtained at time step $t$ and $\gamma$ is the discount factor, which quantifies the importance the agent places on future rewards: a smaller $\gamma$ indicates that the agent prioritizes immediate rewards, while a larger $\gamma$ places more weight on long-term rewards.

$Q_{w'}(s', a')$ represents the Q-value function of the target critic, which is used to compute the target Q-value, and $w'$ denotes the parameters of the target critic network.

3.1.2. Centralized Training and Decentralized Execution

The core of the MADDPG algorithm lies in centralized training and decentralized execution. During the training phase, the policy and Q-value functions of each agent are iteratively optimized by utilizing the states and actions of all other agents. However, during execution, each agent makes decisions solely based on its own state. This design significantly enhances both the learning stability and efficiency within a multi-agent environment.
For each agent $i$, whose goal is to maximize the expected return under the current policy $\pi_{\theta_i}(s_i)$, the policy parameters are updated by:

$$\theta_i \leftarrow \theta_i + \alpha \nabla_{\theta_i} \mathbb{E}\left[ Q_{w_i}(s_i, a_i) \right]$$

where $\theta_i$ is the policy network parameter of agent $i$, and $\alpha$ is the learning rate, which determines the magnitude of each parameter update.

3.1.3. Adaptive Target Network Update Mechanism

In deep reinforcement learning, the introduction of target networks serves to mitigate the non-stationarity problem in value function estimation—a challenge particularly pronounced in game-theoretic settings. However, the conventional target network architecture employed in the MADDPG algorithm remains prone to policy oscillations in complex drone combat scenarios.
To address this limitation, this paper introduces an enhanced target network mechanism for MADDPG. In the baseline algorithm, the target network is updated via fixed soft-update rules, which lack adaptability to the evolving training dynamics. We therefore propose an adaptive update mechanism guided by policy performance, allowing the target network to dynamically modulate its update frequency in response to the convergence behavior of the current policy.
Let the parameters of the current actor network be $\theta_i$ and those of the critic network be $w_i$; the corresponding target network parameters are $\theta_i'$ and $w_i'$. The improved soft update is performed with an adaptive update coefficient:

$$\theta_i' \leftarrow \beta_{adapt} \cdot \theta_i + (1 - \beta_{adapt}) \cdot \theta_i'$$

$$w_i' \leftarrow \beta_{adapt} \cdot w_i + (1 - \beta_{adapt}) \cdot w_i'$$

where the adaptive update coefficient $\beta_{adapt}$ is defined as:

$$\beta_{adapt} = \beta_0 \cdot \frac{1}{1 + t / t_{scale}} \cdot \min\!\left(1, \frac{\Delta Q}{Q_{th}}\right)$$

where $\beta_0$ is the initial update rate, $t$ is the training step count, $t_{scale}$ is the temporal scale parameter, $\Delta Q$ denotes the recent Q-value change magnitude, and $Q_{th}$ is the threshold for Q-value changes.
This improved mechanism offers the following advantages: First, in the initial training phase where policy adjustments are substantial, larger β a d a p t values facilitate rapid knowledge transfer. Second, as training advances, the update magnitude progressively decreases, thereby improving training stability. Finally, by monitoring Q-value variations and dynamically regulating the update frequency, the update rate is reduced during policy convergence to mitigate excessive parameter fluctuations.
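To make the adaptive update concrete, the following minimal Python sketch computes $\beta_{adapt}$ and applies the soft update to NumPy parameter arrays. The numeric values of $\beta_0$, $t_{scale}$, and $Q_{th}$ are illustrative placeholders; the paper specifies only the symbols, not their settings.

```python
import numpy as np

def beta_adapt(beta0, t, t_scale, delta_q, q_th):
    """Adaptive coefficient: shrinks as training step t grows and is capped
    by the recent Q-value change relative to the threshold Q_th."""
    return beta0 * (1.0 / (1.0 + t / t_scale)) * min(1.0, delta_q / q_th)

def adaptive_soft_update(target_params, online_params, beta):
    """theta' <- beta * theta + (1 - beta) * theta', applied per parameter array."""
    return [beta * p + (1.0 - beta) * tp
            for p, tp in zip(online_params, target_params)]

# Example with hypothetical hyperparameters and toy parameter arrays.
rng = np.random.default_rng(0)
online = [rng.normal(size=(4, 4)), rng.normal(size=(4,))]
target = [np.zeros((4, 4)), np.zeros((4,))]
beta = beta_adapt(beta0=0.01, t=5000, t_scale=10000, delta_q=0.3, q_th=0.5)
target = adaptive_soft_update(target, online, beta)
```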

3.1.4. Critic Updates and Loss Function

The objective of the critic is to minimize the error between the current Q-value and the target Q-value, thereby improving the approximation of the true Q-value. The target Q-value is computed from the immediate reward and the Q-value estimate provided by the target critic; a smaller loss indicates a more accurate value estimate. The critic's loss function is expressed as follows:

$$L(w_i) = \mathbb{E}\left[ \left( Q_{w_i}(s_i, a_i) - y_i \right)^2 \right]$$

where $y_i$ is the target Q-value, computed as:

$$y_i = r_i + \gamma Q_{w_i'}(s', a')$$

where $r_i$ denotes the immediate reward at the current step, and $Q_{w_i'}(s', a')$ is the target critic's estimate of the Q-value for the next state $s'$ and next action $a'$.

The critic parameters are updated by minimizing the loss function:

$$w_i \leftarrow w_i - \mu \nabla_{w_i} L(w_i)$$

where $\mu$ denotes the learning rate of the critic, which controls the update rate of the critic network.

3.1.5. Actor Updates and Experience Repertoire

The objective of the actor network is to maximize the Q-value estimated by the critic network, thereby guiding the agents to select actions that yield higher long-term rewards. This improves the quality of the policy and enables more efficient exploration of the environment. The actor's objective is therefore defined as follows:

$$L(\theta_i) = \mathbb{E}\left[ Q_{w_i}\big(s_i, \pi_{\theta_i}(s_i)\big) \right]$$

In the above formula, $\pi_{\theta_i}(s_i)$ denotes the action output by the actor network for the current state $s_i$, and $Q_{w_i}(s_i, a_i)$ denotes the Q-value evaluated by the critic network for state $s_i$ and action $a_i$.

By maximizing this objective through gradient ascent, the actor performs the policy update:

$$\theta_i \leftarrow \theta_i + \alpha \nabla_{\theta_i} L(\theta_i)$$

In a multi-agent environment [22], Experience Replay (a replay buffer) is utilized to store the transitions generated during an agent's interaction with the environment, enabling the agent to learn from past experiences. Each transition is stored as a tuple $(s, a, r, s', d)$, where $s$ represents the current state, $a$ denotes the action taken by the agent, $r$ is the immediate reward, $s'$ is the next state, and $d$ indicates whether the episode has terminated.
In the Improved MADDPG, the agent is trained through alternating exploration and exploitation strategies. During the exploration phase, the agent introduces noise to its actions to explore a variety of actions; during the exploitation phase, the agent selects actions based on its learned policy.
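The following Python sketch illustrates the replay buffer and the exploration-phase action noise described above. The buffer capacity, batch layout, and Gaussian noise scale are illustrative assumptions rather than the authors' exact implementation.

```python
import random
from collections import deque

import numpy as np

class ReplayBuffer:
    """Stores (s, a, r, s', d) transitions and samples uniform mini-batches."""
    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = map(np.asarray, zip(*batch))
        return states, actions, rewards, next_states, dones

def explore(policy_action, noise_std=0.1, low=-1.0, high=1.0):
    """Exploration phase: perturb the deterministic action with Gaussian noise."""
    noisy = policy_action + np.random.normal(0.0, noise_std, size=policy_action.shape)
    return np.clip(noisy, low, high)
```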

3.2. Self-Play Training

Self-play optimizes the strategy of an intelligent agent through multiple rounds of competitive learning. In this paradigm, the agent and its adversary engage in continuous interaction, dynamically adjusting their combat decisions in response to each other's strategies. When applied to strategy optimization within the enhanced MADDPG algorithm, this approach leverages the agent's self-adjustment and competitive learning characteristics. Over several rounds of confrontation, it not only enables the agent to progressively refine and optimize its strategic behaviors but also allows for dynamic adaptation based on the evolving strategies of the adversary. This process enhances the agent's ability to navigate and perform effectively in complex combat environments.

3.2.1. Mathematical Description of the Self-Play Mechanism

Consider a scenario involving two intelligent agents: the primary agent (agent) and the adversary (enemy). At each time step t, both agents select actions a t and interact with the environment’s state s t through these actions. The environment generates rewards based on the actions taken by both the agent and the enemy. The objective of the agent is to maximize its long-term cumulative rewards while simultaneously adapting to the strategy employed by the enemy.
The goal of each agent is to maximize its expected reward through its own strategy $\pi_a$; the expected return is expressed as:

$$J(\pi_a) = \mathbb{E}\left[ \sum_{t=0}^{T} \gamma^t r_t \right]$$

where $J(\pi_a)$ is the expected return of agent $a$, $r_t$ is the reward received at time step $t$, and $\gamma$ is the discount factor.

3.2.2. Strategy Updates in Self-Plays

In the self-play framework, the intelligent agent engages in interaction with its adversary, where the agent’s strategy is influenced not only by the state of the environment but also by the strategy employed by the adversary. To achieve success in this competitive setting, the agent must continuously adapt its strategy in response to the evolving tactics of the adversary.
In a multi-agent system, an agent's strategy update can be represented by the following optimization problem:

$$\pi_a^{*} = \arg\max_{\pi_a} \; \mathbb{E}_{\pi_a, \pi_e}\left[ J(\pi_a) \right]$$

where $\pi_e$ represents the enemy's strategy and $\pi_a^{*}$ denotes the agent's optimal strategy; the agent maximizes its payoff in the confrontation by adjusting its strategy at the appropriate time.

3.2.3. Agent Strategy Evaluation and Optimization

To update its policy, an intelligent agent typically employs a value function (e.g., the Q-function) to assess the quality of a given state-action pair. Let $Q(s_t, a_t)$ represent the Q-value associated with taking action $a_t$ in state $s_t$. Using the policy gradient approach, the objective of the agent is to maximize the Q-value, i.e., to minimize:

$$L(\pi) = -\mathbb{E}_{\pi_a, \pi_e}\left[ Q\big(s_t, \pi_a(s_t)\big) \right]$$

where $Q(s_t, \pi_a(s_t))$ is estimated by the critic network and represents the expected payoff the agent obtains by taking action $\pi_a(s_t)$ in state $s_t$. By minimizing this loss function (equivalently, maximizing the expected Q-value), the agent gradually optimizes its behavioral strategy and thus improves its performance in self-play.

3.2.4. Strategy Stability in Self-Plays

Throughout the self-play training process, both the intelligent agent and the adversarial agent continuously adjust their respective strategies. This adversarial training process aims to approximate a Nash Equilibrium, which is defined as an equilibrium where no participant can improve their payoff by unilaterally altering their strategy. The mathematical formulation of a Nash Equilibrium [23] is as follows:
$$\pi_a^{*}(s_t) = \arg\max_{\pi_a} \; \mathbb{E}\left[ r_t \mid \pi_a, \pi_e^{*} \right]$$

$$\pi_e^{*}(s_t) = \arg\max_{\pi_e} \; \mathbb{E}\left[ r_t \mid \pi_e, \pi_a^{*} \right]$$

where $\pi_a^{*}$ and $\pi_e^{*}$ are the strategies of the agent and the adversarial agent, respectively, at the Nash equilibrium. At this point, the two strategies are mutually optimal, and neither side can further improve its payoff by unilaterally adjusting its strategy.

By incorporating the self-play mechanism, the intelligent agents are enabled to engage in competitive learning with adversarial agents over multiple rounds within a dynamic environment. Each agent not only adapts its strategy based on its own experience but must also adjust to the strategy changes of the adversarial agent, thereby optimizing its long-term performance in the competition. The essence of this mechanism lies in the fact that, through repeated interactions and continuous strategy updates, each agent progressively converges towards a Nash equilibrium, thereby achieving self-optimization and self-enhancement of its strategies.
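The opponent-selection step of the self-play scheme can be sketched as follows, using the 30%/70% split between historical self-policies and the TD3 adversary stated in Section 1. The snapshot interval and class interface are illustrative assumptions.

```python
import copy
import random

class OpponentPool:
    """Self-play opponent selection: with probability p_history the agent faces a
    snapshot of its own past policies; otherwise it faces the TD3-based adversary."""

    def __init__(self, td3_policy, p_history=0.3, snapshot_every=50):
        self.td3_policy = td3_policy
        self.p_history = p_history        # 30% historical self-play per Section 1
        self.snapshot_every = snapshot_every
        self.history = []

    def maybe_snapshot(self, episode, current_policy):
        """Periodically store a frozen copy of the agent's current policy."""
        if episode % self.snapshot_every == 0:
            self.history.append(copy.deepcopy(current_policy))

    def sample_opponent(self):
        if self.history and random.random() < self.p_history:
            return random.choice(self.history)   # play against a historical self
        return self.td3_policy                   # play against the TD3 adversary (70%)
```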

3.3. Attention Mechanism

Incorporating the self-attention mechanism into the MADDPG framework effectively mitigates challenges related to decision efficiency in aerial combat strategies, particularly when agents operate in high-dimensional and dynamic state spaces [24]. Conventional self-attention mechanisms are often limited in such contexts due to their static structural nature and inherent information bottlenecks. To address these limitations, this section presents key enhancements to the standard self-attention mechanism, leading to the development of a more adaptive decision optimizer tailored for air combat scenarios.
Figure 3 depicts the enhanced self-attention mechanism, which incorporates dynamic scaling and information reconstruction into the conventional framework. It performs dual calibration of the input states using statistical features and historical weighting, enabling adaptive fusion of multi-level attention through a gating mechanism. The model leverages residual connections and an MLP reconstruction network to refine feature representation, culminating in an optimized final output.

3.3.1. Dynamic Adaptive Scaling Factor

To address the limitation of conventional approaches that rely on a fixed scaling factor $\sqrt{d_k}$ incapable of adapting to dynamic aerial combat conditions, a dynamically adaptive scaling factor is introduced. This factor is generated in real time via a lightweight feedforward network, as defined by the following expression:

$$\alpha = \mathrm{Linear}_2\!\left( \mathrm{ReLU}\!\left( \mathrm{Linear}_1\left( \bar{X} \right) \right) \right)$$

where $\bar{X}$ corresponds to statistical measures (e.g., mean and variance) of the input state $X$, and "Linear" refers to a fully connected layer. This design enables real-time adjustment of the scaling factor based on the current state, which enhances the model's perceptual acuity across diverse air combat environments without compromising numerical stability.

3.3.2. Information Reconstruction Module Based on Residual Connections

To mitigate the loss of essential original state information during weighted aggregation, an information reconstruction module was introduced into the existing self-attention architecture via a residual connection, thereby preserving vital input features.
The output of this module is formulated as follows:

$$O = A \cdot V + X + \mathrm{Reconstruct}(A, V)$$

$$\mathrm{Reconstruct}(A, V) = \mathrm{MLP}\big( \mathrm{Concat}(A \cdot V,\; V) \big)$$

where $\mathrm{Reconstruct}(A, V)$ denotes a compact multi-layer perceptron. It performs a nonlinear fusion of the attention-weighted and original value vectors, which enables the reconstruction of details and the recovery of previously neglected critical information, thus ensuring a more comprehensive state representation.
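A PyTorch sketch of the dynamic scaling factor (Section 3.3.1) combined with the residual reconstruction branch (Section 3.3.2) is given below. The hidden sizes, the softplus used to keep the learned scale positive, and the toy input shapes are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaledReconstructAttention(nn.Module):
    """Self-attention with a learned scaling factor alpha and a residual
    reconstruction branch: O = A*V + X + MLP(concat(A*V, V))."""

    def __init__(self, dim, hidden=32):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        # Dynamic scaling factor generated from input statistics (mean, variance).
        self.scale_net = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        # Reconstruction MLP fusing attention-weighted and raw value vectors.
        self.reconstruct = nn.Sequential(nn.Linear(2 * dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, x):                                   # x: (batch, seq_len, dim)
        q, k, v = self.q(x), self.k(x), self.v(x)
        flat = x.flatten(1)
        stats = torch.stack([flat.mean(dim=1), flat.var(dim=1)], dim=-1)
        alpha = F.softplus(self.scale_net(stats)) + 1e-6    # positive per-sample scale
        scores = torch.matmul(q, k.transpose(-2, -1)) / alpha.view(-1, 1, 1)
        attn = torch.softmax(scores, dim=-1)
        weighted = torch.matmul(attn, v)                    # A * V
        recon = self.reconstruct(torch.cat([weighted, v], dim=-1))
        return weighted + x + recon                         # residual output O

out = ScaledReconstructAttention(dim=16)(torch.randn(8, 5, 16))
```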

3.3.3. Multi-Level Attention Selection Mechanism for Task Perception

In aerial combat missions, where decision-making priorities vary across different operational phases, we propose a multi-level attention mechanism for mission perception. To this end, a gating network G(X) is incorporated to adaptively activate or combine multiple independent attention modules:
$$A_{final} = \sum_{i=1}^{N} g_i \cdot \mathrm{Attention}_i(Q, K, V)$$

$$g = \mathrm{softmax}\big(G(X)\big), \qquad G(X) = W_g \cdot \mathrm{Pool}(X)$$

$$\mathrm{Pool}(X) = W_p \cdot \left[ \frac{1}{L} \sum_{i=1}^{L} x_i \;\Big\Vert\; \max_{j=1,\dots,L}(x_j) \right]$$

where $g_i$ and $\mathrm{Pool}$ represent the gating weight and the global pooling operation (a concatenation of the mean and maximum over the sequence), respectively; $\Vert$ denotes vector concatenation; $W_p$ is a learnable weight matrix for feature fusion and dimensionality reduction; $L$ is the sequence length; and $x_i$ is the $i$-th feature vector. This architecture enables the agent to prioritize global features in the search phase and local features in the combat phase, consequently realizing dynamic optimization and efficient computational resource allocation.
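A minimal PyTorch sketch of this gated multi-level attention is shown below, using torch.nn.MultiheadAttention as a stand-in for each Attention_i module; the number of modules and the pooling projection size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GatedMultiAttention(nn.Module):
    """Task-aware attention selection: N attention modules whose outputs are mixed
    by gating weights g = softmax(W_g * Pool(X)), where Pool concatenates the
    sequence mean and maximum."""

    def __init__(self, dim, n_modules=3):
        super().__init__()
        self.attn_modules = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
             for _ in range(n_modules)]
        )
        self.pool_proj = nn.Linear(2 * dim, dim)   # W_p: fuse mean || max
        self.gate = nn.Linear(dim, n_modules)      # W_g: gating logits

    def forward(self, x):                          # x: (batch, seq_len, dim)
        pooled = torch.cat([x.mean(dim=1), x.max(dim=1).values], dim=-1)
        g = torch.softmax(self.gate(self.pool_proj(pooled)), dim=-1)          # (batch, N)
        outs = torch.stack([m(x, x, x)[0] for m in self.attn_modules], dim=1)  # (batch, N, seq, dim)
        return (g.unsqueeze(-1).unsqueeze(-1) * outs).sum(dim=1)              # weighted fusion

out = GatedMultiAttention(dim=16)(torch.randn(4, 6, 16))
```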

3.3.4. Closed-Loop Feedback Attention Weight Optimization

To achieve online self-optimization of the strategy, we further introduce a closed-loop feedback mechanism that incorporates historical decision performance into the current attention computation. This mechanism refines the current query-key interactions through a feedback function ffb:
$$F = \mathrm{softmax}\!\left( \frac{Q K^{T}}{\sqrt{d_k}} + \eta \cdot f_{fb}(W_{prev}, R_t) \right)$$

$$f_{fb}(W_{prev}, R_t) = \lambda \cdot \sigma\!\left( \frac{R_t - \mu}{\sigma + \epsilon} \right) \cdot W_{prev}$$

where $W_{prev}$ denotes the attention weights from the previous time step; $\sigma(\cdot)$ represents the sigmoid function; $\mu$ and $\sigma$ indicate the moving average and standard deviation of recent rewards, respectively; $\lambda$ is the feedback strength coefficient; $\epsilon$ is a small constant included for numerical stability; $R_t$ is the immediate reward from the previous decision; and $\eta$ is the feedback coefficient. Based on the success of historical decisions, the function $f_{fb}$ amplifies effective attention patterns while suppressing ineffective ones, thus enabling the agent to dynamically adjust its perceptual focus during adversarial interactions and thereby continuously improve decision quality.
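The feedback-weighted attention computation can be sketched in a few lines of Python; the values of eta and lam and the toy tensors are illustrative assumptions.

```python
import math
import torch

def feedback_attention(q, k, w_prev, reward, reward_hist, eta=0.1, lam=0.5, eps=1e-6):
    """Closed-loop feedback attention: the previous weights W_prev are amplified or
    suppressed according to how the latest reward R_t compares with recent rewards."""
    d_k = q.shape[-1]
    mu, sigma = reward_hist.mean(), reward_hist.std()
    f_fb = lam * torch.sigmoid((reward - mu) / (sigma + eps)) * w_prev
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k) + eta * f_fb
    return torch.softmax(scores, dim=-1)

q, k = torch.randn(1, 5, 16), torch.randn(1, 5, 16)
w_prev = torch.full((1, 5, 5), 0.2)                 # previous step's attention weights
weights = feedback_attention(q, k, w_prev, reward=torch.tensor(3.0),
                             reward_hist=torch.tensor([1.0, 2.5, 2.0, 3.5]))
```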
The self-attention mechanism enhances the win rate of the intelligent agent by:
Dynamic adaptive perception for enhanced decision-making: By incorporating a dynamically adaptive scaling factor, the agent gains the capability to adjust its perceptual sensitivity in response to real-time combat conditions. This mechanism facilitates more precise extraction of critical information from high-dimensional state spaces, thereby significantly improving the accuracy and relevance of tactical decisions.
Information retention for robust state representation: Through an information reconstruction module based on residual connections, the agent maintains essential features from the original state space while integrating newly processed attention-weighted information. This architectural design effectively mitigates information loss during feature transformation, preventing tactical blind spots and ensuring comprehensive situational awareness.
Resource-aware strategy evolution: The integration of a multi-level attention selection mechanism with closed-loop feedback optimization enables efficient allocation of computational resources across different mission phases. Simultaneously, the feedback mechanism incorporates historical decision performance to refine attention patterns, resulting in continuous online adaptation and evolution of combat strategies that progressively enhance adversarial performance.
The self-attention mechanism provides a flexible and efficient means for agents in reinforcement learning to weigh different state features, enabling them to dynamically focus on the most critical features in complex environments, thereby improving the quality of their decisions. By leveraging this mechanism, agents can make more accurate decisions in multi-round adversarial tasks, significantly increasing their win rates.

3.4. Strategy Optimization Model Building

By integrating MADDPG, the self-attention mechanism, and the self-play mechanism within the strategy optimization process, the distinct advantages of each component can be fully leveraged, thereby significantly enhancing the overall decision-making capability of intelligent agents. In the initial phases, MADDPG provides a robust foundational strategy that enables effective learning in simpler environments. However, as the complexity of the environment escalates, relying solely on MADDPG often proves insufficient to meet the demands of more intricate tasks. The incorporation of the self-attention mechanism enables the agent to focus on critical state features during the decision-making process, thereby improving its capacity to process complex information, particularly in dynamically evolving environments. Simultaneously, the self-play mechanism bolsters the adaptability and robustness of the strategies by allowing the agent to engage in training against its own historical strategies, thus mitigating the risk of converging to suboptimal solutions. Through the weighted combination of these mechanisms, the relative importance of each can be dynamically adjusted according to the specific phase of training and the demands of the task. This approach facilitates continuous strategy optimization within a changing environment, ultimately enhancing the long-term performance and adaptability of the intelligent agent.
Due to the simplicity of the linear weighting method's computational process and its high real-time performance, it enables more timely decision-making in dynamic air combat environments. Furthermore, the physical significance of the weighting parameters is relatively clear, making debugging more intuitive. Therefore, to provide a more comprehensive consideration of both the original strategy and its optimization framework, and to enable the dynamic adaptation of various variables within the strategy, a novel strategy model is developed. In this model, the optimized strategy is denoted as $y$, the original MADDPG strategy as $x_1$, the strategy derived from the self-attention mechanism as $x_2$, and the strategy generated through the self-play mechanism as $x_3$, with $k_1$, $k_2$, and $k_3$ denoting the weights assigned to each mechanism. Consequently, the optimized strategy $y$ can be expressed as:
$$y = k_1 x_1 + k_2 x_2 + k_3 x_3$$

$$x_1 = \pi_{MADDPG}(s)$$

$$x_2 = \mathrm{softmax}\!\left( \frac{Q K^{T}}{\sqrt{d}} \right) V$$

$$x_3 = \arg\max_{\pi_{selfplay}} \; \mathbb{E}\left[ R_{selfplay}(\pi_{selfplay}) \right]$$

Let $\pi_{MADDPG}$ represent the strategy of the MADDPG agent in state $s$; $Q$, $K$, and $V$ denote the query, key, and value vectors, respectively, with $d$ representing the dimension of these vectors. Additionally, $\pi_{selfplay}$ refers to the strategy employed by the agent within the self-play mechanism, and $R_{selfplay}(\pi_{selfplay})$ indicates the reward received by the agent from its interaction with its own historical strategies.
In the strategy optimization process, the coefficients $k_1$, $k_2$, and $k_3$ quantify the relative importance of the MADDPG strategy, the self-attention mechanism, and the self-play mechanism, respectively, in the final optimized strategy. The dynamic adjustment of these coefficients aims to calibrate the contribution of each mechanism based on environmental feedback and the agent's performance. This approach enables the agent to adaptively modulate the influence of each mechanism, ensuring that the strategy remains optimal in the face of changing environmental conditions.
When the reward associated with a given mechanism is high, it indicates that the mechanism has made a substantial contribution to the agent’s strategy, necessitating an increase in the corresponding coefficient. Conversely, if the reward is low, the coefficient for that mechanism should be decreased. The adjustment process is formalized by the following formula:
$$k_1(t+1) = k_1(t) \cdot \left( 1 + \alpha_1 \frac{R_1(t)}{|R_1(t)|} \right)$$

$$k_2(t+1) = k_2(t) \cdot \left( 1 + \alpha_2 \frac{R_2(t)}{|R_2(t)|} \right)$$

$$k_3(t+1) = k_3(t) \cdot \left( 1 + \alpha_3 \frac{R_3(t)}{|R_3(t)|} \right)$$

Let $R_1(t)$, $R_2(t)$, and $R_3(t)$ denote the rewards associated with the MADDPG, self-attention mechanism, and self-play mechanism at time step $t$, respectively. The terms $|R_1(t)|$, $|R_2(t)|$, and $|R_3(t)|$ serve as normalization terms for these rewards, while $\alpha_1$, $\alpha_2$, and $\alpha_3$ control the adjustment rates for each mechanism.
When the loss associated with a particular mechanism is substantial, it indicates that the mechanism has not effectively contributed to the enhancement of the strategy, warranting a reduction in its corresponding coefficient. The adjustment formula for loss feedback is as follows:
$$k_1(t+1) = k_1(t) \cdot \big( 1 - \beta_1 L_1(t) \big)$$

$$k_2(t+1) = k_2(t) \cdot \big( 1 - \beta_2 L_2(t) \big)$$

$$k_3(t+1) = k_3(t) \cdot \big( 1 - \beta_3 L_3(t) \big)$$

Let $L_1(t)$, $L_2(t)$, and $L_3(t)$ represent the loss values associated with the MADDPG, self-attention mechanism, and self-play mechanism at time step $t$, respectively. The hyperparameters $\beta_1$, $\beta_2$, and $\beta_3$ regulate the loss feedback for each mechanism.
The success rate of a strategy serves as an indicator of an agent’s ability to complete a given task. When a mechanism contributes to an increased success rate, it implies that the mechanism has a greater impact on the strategy, thereby justifying an increase in its corresponding coefficient. The adjustment formula for success rate feedback is as follows:
$$k_1(t+1) = k_1(t) \cdot \left( 1 + \gamma_1 \frac{S_1(t)}{|S_1(t)|} \right)$$

$$k_2(t+1) = k_2(t) \cdot \left( 1 + \gamma_2 \frac{S_2(t)}{|S_2(t)|} \right)$$

$$k_3(t+1) = k_3(t) \cdot \left( 1 + \gamma_3 \frac{S_3(t)}{|S_3(t)|} \right)$$

In the above equations, $S_1(t)$, $S_2(t)$, and $S_3(t)$ represent the task success rates associated with the MADDPG, self-attention mechanism, and self-play mechanism at time step $t$, respectively. The terms $|S_1(t)|$, $|S_2(t)|$, and $|S_3(t)|$ serve as normalization terms for the success rates, while $\gamma_1$, $\gamma_2$, and $\gamma_3$ are hyperparameters that govern the adjustment of these success rates.
To effectively integrate the feedback from rewards, losses, and success rates, these factors can be combined for the purpose of coefficient adjustment as follows:
$$k_1(t+1) = k_1(t) \cdot \left( 1 + \alpha_1 \frac{R_1(t)}{|R_1(t)|} + \gamma_1 \frac{S_1(t)}{|S_1(t)|} - \beta_1 L_1(t) \right)$$

$$k_2(t+1) = k_2(t) \cdot \left( 1 + \alpha_2 \frac{R_2(t)}{|R_2(t)|} + \gamma_2 \frac{S_2(t)}{|S_2(t)|} - \beta_2 L_2(t) \right)$$

$$k_3(t+1) = k_3(t) \cdot \left( 1 + \alpha_3 \frac{R_3(t)}{|R_3(t)|} + \gamma_3 \frac{S_3(t)}{|S_3(t)|} - \beta_3 L_3(t) \right)$$
In this manner, the adjustment of the coefficients incorporates the integrated feedback from rewards, losses, and success rates, enabling a more holistic optimization of all components of the strategy. This approach ensures that the intelligent agent continuously adapts the contribution of each mechanism, thereby enhancing its ability to effectively respond to environmental changes throughout the training process.
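A compact Python sketch of the combined coefficient update is given below. The step sizes, the clipping of negative coefficients, and the final renormalization are illustrative assumptions; the paper defines only the update rule itself.

```python
def update_coefficients(k, rewards, losses, successes,
                        alpha=(0.1, 0.1, 0.1), beta=(0.05, 0.05, 0.05),
                        gamma=(0.1, 0.1, 0.1), eps=1e-8):
    """Combined reward/loss/success-rate feedback for the weights k1, k2, k3 of
    the MADDPG, self-attention, and self-play components."""
    new_k = []
    for i in range(3):
        factor = (1.0
                  + alpha[i] * rewards[i] / (abs(rewards[i]) + eps)
                  + gamma[i] * successes[i] / (abs(successes[i]) + eps)
                  - beta[i] * losses[i])
        new_k.append(max(k[i] * factor, 0.0))
    total = sum(new_k) + eps
    return [v / total for v in new_k]      # renormalize so the weights stay comparable

# Example: each component's recent reward, loss, and success rate feed the update.
k = update_coefficients(k=[0.4, 0.3, 0.3],
                        rewards=[120.0, 80.0, 95.0],
                        losses=[0.8, 1.2, 0.5],
                        successes=[0.6, 0.5, 0.7])
```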

3.5. Introduction to Reward Function

The reward function in this paper consists of three parts: a distance reward, a winning reward, and a height-difference reward.

3.5.1. Distance Rewards

The distance between the agent and the adversary plays a critical role in the reward calculation [25]. The greater the distance to the adversary, the lower the reward becomes. This design incentivizes the agent to reduce the distance to the adversary, thereby encouraging direct confrontation.
If the current positions of the agent and the enemy are $s_{agent} = (x_{agent}, y_{agent}, z_{agent})$ and $s_{enemy} = (x_{enemy}, y_{enemy}, z_{enemy})$, then the distance $d$ between the agent and the enemy is given by the Euclidean distance formula:

$$d = \sqrt{ (x_{agent} - x_{enemy})^2 + (y_{agent} - y_{enemy})^2 + (z_{agent} - z_{enemy})^2 }$$
The design of $r_{distance}$ takes into account the proximity of the agent to the enemy and assigns a lower reward as the distance grows, encouraging the agent to move towards the enemy. To increase the sensitivity to distance, we introduce a quadratic decay function to calculate the reward, specifically defined as:

$$r_{distance} = \lambda \cdot e^{-\gamma d^{2}}$$

where $\lambda$ is the scaling factor of the distance reward and $\gamma$ is the parameter controlling the distance decay rate. This function makes the reward smaller when the distance is greater, while the reward gradually increases as the distance is shortened.

3.5.2. Winner Rewards

When the distance between the agent and the adversary falls below a predefined threshold m, the agent receives an additional reward, signaling the completion of the confrontation task. This also indicates that the agent has successfully approached the adversary and achieved victory. The specific formulation is as follows:
$$I_{win} = \begin{cases} 1 & \text{if } \lVert s_{agent} - s_{enemy} \rVert < m \\ 0 & \text{otherwise} \end{cases}$$

Based on the above indicator function, the winning reward is set to:

$$r_{win} = 100 \cdot I_{win}$$

When $I_{win} = 1$, the agent receives a reward of 100; otherwise, the reward is 0. This reward component directly encourages the agent to approach the enemy and win quickly.

3.5.3. Height Difference Rewards

To further enhance the agent’s strategy optimization in the vertical direction, we also design a reward function based on the altitude difference. The altitude difference reward encourages the agent to not only focus on horizontal attacks during the confrontation but also to consider the spatial changes in height. In three-dimensional space, the calculation of the altitude difference reward r h e i g h t is based on the difference between the agent and the adversary along the z-axis. The specific formula is as follows:
$$r_{height} = \alpha \cdot (z_{agent} - z_{enemy})^{\beta}$$

where $\alpha$ and $\beta$ represent the scaling factor and exponential factor of the height-difference reward, respectively, and control the effect of the height difference on the reward function. The formula makes the effect of the height difference on the reward nonlinear and improves the agent's decision-making ability in the vertical dimension.

3.5.4. Total Rewards

Combining all the above reward mechanisms, the final total reward function can be expressed as:
$$r_{total} = r_{distance} + r_{win} + r_{height}$$

The above formula can be expanded to:

$$r_{total} = \lambda e^{-\gamma d^{2}} + 100 \cdot I_{win} + \alpha (z_{agent} - z_{enemy})^{\beta}$$

In this function, the first term is the reward based on the distance between the agent and the enemy, the second term is the reward for a winning approach, and the third term is the reward based on the height difference. By taking these factors into consideration, the reward function not only focuses on the relative position of the agent and the enemy, but also guides the agent to optimize its strategy in three-dimensional space through the height-difference reward.

When the agent's reward is greater than the enemy's reward, the agent wins that round; otherwise, the enemy wins.

4. Enemy Strategy Introduction

The enemy’s strategy is the TD3 strategy. In deep reinforcement learning, TD3 is a common policy gradient algorithm. TD3 introduces a series of improvements aimed at solving the overestimation problem in traditional algorithms, thereby improving the stability and training efficiency of the algorithm.
The core idea of the algorithm is to reduce the bias in Q-value estimation by using a dual Q-value network, and to make the training process more stable by using a delayed update strategy and target action smoothing. Its target Q-value is expressed as:
$$y = r_t + \gamma \min_{i=1,2} Q_{\theta_i'}\big( s_{t+1}, \pi_{\varphi'}(s_{t+1}) + \varepsilon \big)$$

where $r_t$ represents the immediate reward, $\gamma$ is the discount factor, $Q_{\theta_1'}$ and $Q_{\theta_2'}$ are the two target Q-networks, $\pi_{\varphi'}(s_{t+1})$ is the target action generated by the target policy network, and $\varepsilon$ is the noise introduced by target action smoothing.
The update rules for the TD3 policy network depend on the estimation of the Q-values, with the goal of maximizing the output of the Q-value network. Therefore, gradient ascent is used for policy updates:
$$\nabla_{\varphi} J(\pi_{\varphi}) = \mathbb{E}_{s_t \sim D}\left[ \nabla_{a} Q_{\theta_1}(s_t, a_t) \big|_{a_t = \pi_{\varphi}(s_t)} \, \nabla_{\varphi} \pi_{\varphi}(s_t) \right]$$

where $D$ is the experience replay pool, $\pi_{\varphi}(s_t)$ is the policy network, and $Q_{\theta_1}(s_t, a_t)$ is the estimate of the Q-value network.
In the target network, soft updates are used to ensure training stability:
$$\theta' = \tau \theta + (1 - \tau)\theta'$$

$$\varphi' = \tau \varphi + (1 - \tau)\varphi'$$

where $\tau$ is the soft update step size, $\theta$ and $\varphi$ are the parameters of the current Q-value network and policy network, respectively, and $\theta'$ and $\varphi'$ are the parameters of the target networks.
In addition, for the adversarial multi-agent setting, MATD3 was chosen as the enemy's multi-agent method and applied in the corresponding experimental environment.

Let $y$ denote the target Q-value and $r$ denote the reward obtained by the agents at the current time step. Then the expression for $y$ is:

$$y = r + \gamma \min_{k=1,2} Q_k'\big( s', a' + \varepsilon \big)$$

where $Q_k'$ denotes the $k$-th target network, $s'$ represents the joint state of the agents at the next time step, $a'$ indicates the target actions of all agents at the next time step, and $\varepsilon$ represents the clipped noise used for target policy smoothing.
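The clipped double-Q target with target policy smoothing used by the TD3 adversary (and, on joint states and actions, by MATD3) can be sketched as follows. The noise scale, clipping range, and toy networks are common defaults chosen for illustration, not values from the paper.

```python
import torch

def td3_target(reward, next_state, next_action_fn, target_q1, target_q2,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, act_limit=1.0):
    """y = r + gamma * min_k Q_k'(s', pi'(s') + clipped noise)."""
    with torch.no_grad():
        a_next = next_action_fn(next_state)
        noise = (torch.randn_like(a_next) * noise_std).clamp(-noise_clip, noise_clip)
        a_next = (a_next + noise).clamp(-act_limit, act_limit)
        q_next = torch.min(target_q1(next_state, a_next), target_q2(next_state, a_next))
        return reward + gamma * q_next

# Example with toy linear critics and a toy target policy.
q1 = lambda s, a: s.sum(dim=-1, keepdim=True) + a.sum(dim=-1, keepdim=True)
q2 = lambda s, a: s.sum(dim=-1, keepdim=True) - 0.1 * a.sum(dim=-1, keepdim=True)
pi = lambda s: torch.tanh(s[:, :2])
y = td3_target(reward=torch.ones(4, 1), next_state=torch.randn(4, 6),
               next_action_fn=pi, target_q1=q1, target_q2=q2)
```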

5. Experiment

5.1. Setting of Experimental Parameters

The experimental platform employed in this study is OpenAI Gym, a widely adopted open-source environment in reinforcement learning research. It provides a standardized interactive framework for drone game tasks by configuring state spaces—including the drone’s real-time position, velocity, and environmental state—and explicitly defining the agent’s action space with parameters such as flight direction and speed. A structured reward mechanism is implemented, where positive rewards are assigned for advantageous states like higher altitude relative to the adversary, and penalties are imposed for unfavorable conditions such as lower energy levels. The platform supports real-time interaction and automatically records state transitions and rewards at each step, supplying reliable data for algorithm training and iteration. By leveraging OpenAI Gym, researchers avoid developing complex simulation environments from scratch and can concentrate on optimizing algorithms and strategies. The unified interface also facilitates objective performance comparisons across different methods, promoting progress in drone game research.
To ensure the complexity of the operational environment, this paper employs the 1976 U.S. Standard Atmosphere model for the tropospheric layer (0–11 km) over the primary UAV operational airspace, combined with a Weibull-distributed random wind field, as the UAV combat environment. The former encapsulates analytical formulas through parameterized functions, using the sea-level reference parameters $T_0 = 288.15$ K, $P_0 = 101{,}325$ Pa, and a tropospheric temperature lapse rate $L = 0.0065$ K/m. The temperature $T(h) = T_0 - Lh$, the pressure $P(h) = P_0 \left( T(h)/T_0 \right)^{gM/(RL)}$, and the air density $\rho(h)$ at height $h$ are computed sequentially, the last via the ideal gas equation of state; real conditions are approximated through temperature perturbations of ±0.5 °C per 100 m and pressure fluctuations of ±20 Pa per 500 m. The latter involves calibrating the Weibull distribution parameters (shape $k = 2.0$, scale $\lambda = 6.0$ m/s) according to MH/T 1065–2017 to ensure that 95% of wind speeds fall within the drone's safety threshold of 0–15 m/s. After sampling via scipy.stats.weibull_min, a 3-step moving average smoothing is applied, and wind direction is corrected using a Markov chain (adjacent-step deviation ≤ 30°) to ensure continuity. During dynamic coupling, atmospheric parameters update every 50 ms to synchronize with the UAV's dynamic step size, providing inputs for lift and drag calculations. Wind fields refresh every 5 s and are adapted for urban (50% wind-speed attenuation in built-up areas) and offshore ($k$ adjusted to 1.8) scenarios, balancing realism and efficiency to support game strategy validation.
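The atmosphere and wind-field sampling described above can be reproduced with a short Python sketch; the constants below are the standard-atmosphere values cited in the text, while the random seed and sample size are illustrative.

```python
import numpy as np
from scipy.stats import weibull_min

# Sea-level constants of the 1976 U.S. Standard Atmosphere (troposphere, 0-11 km).
T0, P0, L = 288.15, 101325.0, 0.0065      # K, Pa, K/m
G, M, R = 9.80665, 0.0289644, 8.31446     # m/s^2, kg/mol, J/(mol*K)

def atmosphere(h):
    """Temperature, pressure, and density at altitude h via the lapse-rate model
    and the ideal gas equation of state."""
    T = T0 - L * h
    P = P0 * (T / T0) ** (G * M / (R * L))
    rho = P * M / (R * T)
    return T, P, rho

def wind_speeds(n, k=2.0, lam=6.0, seed=0):
    """Weibull wind speeds (shape k, scale lam in m/s) with 3-step moving-average
    smoothing, as described in the environment setup."""
    raw = weibull_min.rvs(k, scale=lam, size=n, random_state=seed)
    return np.convolve(raw, np.ones(3) / 3.0, mode="same")

T, P, rho = atmosphere(1500.0)             # conditions at 1.5 km altitude
wind = wind_speeds(100)
```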
The parameter settings for this experiment are shown in Table 1.

5.2. Original Air Combat Analysis

In the original air combat strategy, TD3 was employed for the enemy’s strategy and MADDPG was employed for the agent’s strategy, followed by the training of the air combat model. The final results indicated that, across 600 rounds of the game, the agent won 157 times, resulting in a win rate of 26.17% for the agent. The relationship between the agent’s and the enemy’s angle, speed, altitude, energy, and sampling points is illustrated in Figure 4, Figure 5, Figure 6, Figure 7 and Figure 8.
Figure 4, Figure 5, Figure 6 and Figure 7 illustrate the variations in angle, altitude, velocity, and energy for both agents across game rounds when the MADDPG strategy is employed. They show that the adversary exhibits multifaceted advantages throughout the game: in terms of angle, the adversary displays a significantly larger range and higher frequency of fluctuations than the agent, reflecting superior dynamic adaptability, whereas the agent's angle tends to stabilize, indicating limited flexibility. In the altitude dimension, the adversary consistently maintains a higher altitude than the agent, with a maximum of nearly 10 m compared to the agent's peak of about 8 m, demonstrating a sustained altitude advantage. In the velocity dimension, the adversary reaches a maximum velocity of 22.5 m/s, while the agent attains only 15 m/s. The adversary's velocity fluctuations and peak values consistently surpass those of the agent, allowing it to more effectively compress the agent's reaction space through its velocity advantage. In the energy dimension, although the agent's total energy slightly exceeds that of the adversary, the adversary exhibits more compact energy fluctuations and higher management efficiency. This enables the adversary to maintain offensive initiative through stable energy output.

5.3. Analysis of the Improved Air Combat Environment

In this setting, TD3 was employed for the enemy's strategy and MADDPG-SASP was employed for the agent's strategy, followed by the training of the air combat model. The final results indicated that, across 600 rounds of the game, the agent won all 600 times. The relationship between the agent's and the enemy's angle, speed, altitude, energy, and sampling points is illustrated in Figure 8, Figure 9, Figure 10 and Figure 11.
The optimization of the agent’s strategy to the MADDPG-SASP policy, as illustrated in Figure 8, Figure 9, Figure 10 and Figure 11, endows it with comprehensive multidimensional advantages during gameplay. In the angular dimension, the agent exhibits greater stability, in contrast to the adversary’s pronounced and unstable variations, thereby ensuring precise control for tactical execution. Pertaining to altitude, the agent operates stably between 7.5 and 10.5 m, whereas the adversary frequently descends below 0 m, reaching as low as −10.5 m—a clear altitude superiority that reinforces the agent’s strategic dominance. Regarding velocity, the agent maintains a consistent 1.25–1.75 m/s, significantly surpassing the adversary’s fluctuating 0.25–1 m/s, which secures a persistent lead in movement and interaction. In energy, the agent’s stable fluctuations (0.012–0.014 kJ) starkly exceed the adversary’s limited range (0.002–0.006 kJ). Consequently, the agent achieves comprehensive suppression over the adversary across all considered metrics.
Figure 12 shows the spatial trajectories of the agent and the opponent in the 100th, 200th, 300th, 400th, 500th, and 600th rounds of the 600-round game, in each of which the agent was victorious. It is clear that after combining the MADDPG algorithm with the agent's self-play strategy and the attention-mechanism optimization, the agent always maintains an advantageous position in terms of height, angle, and other relevant metrics in its encounters with the opponent, enabling it to effectively counter and attack the enemy. As a result, the win rate increases dramatically, from the initial 26.17% to 100%, reflecting a significant improvement in performance.
As shown in Figure 13, the intelligent agent consistently outperforms the adversary across multiple rounds. The majority of the agent’s reward values are concentrated between 10,000 and 15,000, with peaks exceeding 15,000. In contrast, the adversary’s reward remains consistently negative, fluctuating between −15,000 and −8000, indicating poor performance throughout the battle. This suggests that, in most rounds, the intelligent agent effectively optimized its strategy, achieving positive rewards, whereas the adversary’s reward values were significantly lower, reflecting suboptimal performance. Consequently, the agent demonstrated a marked superiority in combat, with the strategy optimization process becoming increasingly evident.

5.4. Ablation Experiment and Comparative Experiment

First, in the ablation experiment section, based on the MADDPG-SASP framework used in this paper, different parts of the framework were removed individually for experimentation, and the results are shown in Figure 14 and Table 2.
Figure 14 illustrates the relationship between the total reward of the agent and the number of training rounds across four algorithms: SA-MADDPG, SP-MADDPG, MADDPG-SASP, and MADDPG.
Specifically, MADDPG-SASP is the full strategy method proposed in this paper, SP-MADDPG is this method with the self-attention mechanism module removed, and SA-MADDPG is this method with only the self-play module removed.
Subsequently, in Table 2, we present the win rates and average rewards of the four algorithms—MADDPG, SP-MADDPG, SA-MADDPG, and MADDPG-SASP—over 600 rounds.
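As a rough illustration of how the ablation variants relate to each other, the sketch below derives all four configurations from two feature flags; the flag names and the build_agent/evaluate helpers are hypothetical and do not reflect the exact configuration interface of our code.

```python
# Sketch: the four ablation variants expressed as two feature flags.
# Flag names and the build_agent()/evaluate() helpers are illustrative assumptions.
ABLATION_VARIANTS = {
    "MADDPG":      dict(use_self_attention=False, use_self_play=False),
    "SP-MADDPG":   dict(use_self_attention=False, use_self_play=True),
    "SA-MADDPG":   dict(use_self_attention=True,  use_self_play=False),
    "MADDPG-SASP": dict(use_self_attention=True,  use_self_play=True),
}

def run_ablation(env, build_agent, evaluate, rounds=600):
    """Train each variant and collect its win rate and average reward."""
    results = {}
    for name, flags in ABLATION_VARIANTS.items():
        agent = build_agent(env, **flags)              # construct the variant
        results[name] = evaluate(env, agent, rounds)   # e.g., (win_rate, average_reward)
    return results
```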
Second, in the comparative experiment, the PPO, DQN, and DDPG algorithms were compared with the MADDPG-SASP method proposed in this paper. The results are shown in Figure 15 and Table 3.
As illustrated in Figure 15 and Table 3, within the comparative experiments, the intelligent agent employing our MADDPG-SASP framework consistently achieves higher reward values than those obtained by the PPO, DQN, and DDPG algorithms, even after 600 rounds. Moreover, the winning rate of MADDPG-SASP surpasses that of the other three algorithms, further confirming its superior performance. These results collectively underscore the robustness and reliability of the proposed framework.

5.5. Sensitivity Analysis

The robustness of the proposed method and the validity of its optimal configuration were assessed through a sensitivity analysis of key hyperparameters: the discount factor (gamma), critic learning rate (critic_lr), actor learning rate (actor_lr), and soft update parameter (tau). This analysis is crucial because performance is highly sensitive to these parameters, and quantifying their impact provides essential evidence for the algorithm’s stable deployment in drone game tasks.
The results in Figure 16 indicate that the algorithm exhibits varying sensitivity to different hyperparameters. Among these, the actor learning rate (actor_lr) demonstrates the highest sensitivity, with its optimal value residing within a narrow window around 0.001. Deviations from this value lead to a noticeable decline in performance. The critic learning rate (critic_lr) maintains stable performance between 0.0008 and 0.0011, exhibiting greater tolerance. Higher values of the discount factor (gamma) improve the agent’s win rate and cumulative reward, with 0.99 being optimal in this study. In contrast, tau exhibits negligible impact on performance between 0.008 and 0.01, indicating robust behavior in this parameter. Ultimately, within the optimal parameter range, the agent achieves a 100% win rate and high average reward, demonstrating the algorithm’s superiority.
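A one-at-a-time sweep around the base configuration of Table 1 is sufficient to reproduce this kind of analysis. The sketch below shows one possible layout; the candidate value grids are illustrative choices around the ranges discussed above, and evaluate is a hypothetical helper returning the win rate and cumulative reward for a given configuration.

```python
# Sketch of a one-at-a-time hyperparameter sweep; the candidate grids are
# illustrative values around the ranges discussed in the text.
BASE_CONFIG = dict(gamma=0.99, tau=0.01, actor_lr=1e-3, critic_lr=1e-3)

SWEEP = {
    "actor_lr":  [5e-4, 8e-4, 1e-3, 1.2e-3, 1.5e-3],
    "critic_lr": [8e-4, 9e-4, 1e-3, 1.1e-3],
    "gamma":     [0.90, 0.95, 0.98, 0.99],
    "tau":       [0.008, 0.009, 0.01],
}

def sensitivity_analysis(evaluate):
    """Vary one hyperparameter at a time and record (win_rate, cumulative_reward)."""
    results = {}
    for param, values in SWEEP.items():
        for value in values:
            config = {**BASE_CONFIG, param: value}   # change only one parameter
            results[(param, value)] = evaluate(config)
    return results
```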

5.6. Multiplayer Combat Environment Analysis

To further validate the proposed strategy, we construct a predator–prey game scenario [26] involving drones. The intelligent agent drone is designated as the predator, characterized by relatively slower flight speed and acceleration; the adversarial drone serves as the prey, possessing faster flight speed and acceleration. Within this game scenario, multiple slower-moving cooperative predator drones must coordinate to pursue the faster prey drone.
The drone predator–prey game is further divided into two scenarios. In Scenario 1, there are 2 agent drones, 1 enemy drone, and 3 obstacles; the specific parameters are shown in Table 4. In Scenario 2, there are 4 agent drones, 2 enemy drones, and 3 obstacles; the specific parameters are shown in Table 5.
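The two scenarios differ only in the team sizes and the prey’s maximum speed. A minimal sketch of how these settings can be encoded is given below; the ScenarioConfig dataclass and its field names are illustrative rather than the actual environment interface, with values taken from Tables 4 and 5.

```python
from dataclasses import dataclass

@dataclass
class ScenarioConfig:
    """Illustrative container for the predator-prey scenario parameters (Tables 4 and 5)."""
    n_predators: int            # agent (predator) drones
    n_prey: int                 # adversarial (prey) drones
    n_obstacles: int
    predator_max_speed: float   # m/s
    prey_max_speed: float       # m/s
    predator_accel: float       # m/s^2
    prey_accel: float           # m/s^2
    predator_turn_rate: float   # deg/s
    prey_turn_rate: float       # deg/s
    predator_radius: float      # collision radius, m
    prey_radius: float          # collision radius, m

SCENARIO_1 = ScenarioConfig(2, 1, 3, 8.0, 12.0, 2.0, 3.5, 120.0, 150.0, 1.5, 1.2)
SCENARIO_2 = ScenarioConfig(4, 2, 3, 8.0, 20.0, 2.0, 3.5, 120.0, 150.0, 1.5, 1.2)
```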
The 3D flight map for Scenario 1 is shown in Figure 17, the reward map is shown in Figure 18, and the final results are presented in Table 6.
As quantified in Table 6 and visualized in Figure 17 and Figure 18, the proposed MADDPG-SASP algorithm demonstrates dominant performance in the predator–prey scenario (Scenario 1) against an adversary employing the MATD3 strategy. The results show that our agent secured victories in 591 out of 600 episodes, achieving a win rate of 98.50% and a substantial average reward of 8391.43. This decisive outcome confirms that the MADDPG-SASP strategy maintains a significant competitive advantage, enabling the consistent and efficient accomplishment of the capture objective with remarkable strategic profitability and operational stability.
Further analysis of the trajectory evolution in Figure 17 reveals that the agent’s policy matured from initial exploratory behaviors into a sophisticated strategy featuring coordinated encirclement and effective obstacle avoidance. The corresponding reward curves underscore this strategic superiority, with the agent sustaining high rewards while the adversary’s rewards remain persistently negative, thereby highlighting the excellent convergence and robustness of our method. Critically, this superior performance is achieved despite the adversary’s advantages in key maneuvering parameters such as maximum speed, acceleration, and turning rate. This affirms the critical role of the agent’s multi-agent coordination and its integrated capability for environmental and adversarial decision-making in securing favorable outcomes under constrained conditions.
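For reference, assuming an episode is scored as a capture once any predator comes within the combined collision radii of itself and the prey, a minimal check could look like the sketch below; the radii follow Tables 4 and 5, and the environment’s actual termination rule may differ.

```python
import numpy as np

def prey_captured(predator_positions, prey_position,
                  predator_radius=1.5, prey_radius=1.2):
    """Illustrative capture test: True if any predator is within the combined collision radii.

    Radii are taken from Tables 4 and 5; the environment's actual win condition may differ.
    """
    capture_distance = predator_radius + prey_radius
    distances = np.linalg.norm(
        np.asarray(predator_positions, dtype=float) - np.asarray(prey_position, dtype=float),
        axis=1,
    )
    return bool(np.any(distances <= capture_distance))
```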
The 3D flight map for Scenario 2 is shown in Figure 19, the reward map is shown in Figure 20, and the final results are presented in Table 7.
As summarized in Table 7 and illustrated in Figure 19 and Figure 20, in the competitive scenario involving 4 predator agents against 2 adversarial prey drones, the proposed method achieved a decisive victory with a 100% win rate over 600 episodes, accompanied by an average reward of 4743.01. It is noteworthy that this performance was attained despite the adversaries possessing significant advantages in key maneuverability parameters, specifically a maximum speed of 20.0 m/s compared to 8.0 m/s and an acceleration of 3.5 m/s² versus 2.0 m/s². These results robustly demonstrate that the superior collaborative decision-making capability of the agents effectively compensates for hardware disadvantages, and they validate the strategic superiority of the proposed approach in asymmetric multi-agent settings.
As observed from the trajectory plots in Figure 19, spanning episodes 100 to 600, the agents’ strategy evolves from an initial phase of dispersed exploration to a highly coordinated encirclement pattern, demonstrating continuous improvement in both obstacle avoidance and target acquisition capabilities. The corresponding reward curve further reveals that the agents maintain consistently high and stable rewards, whereas the adversaries’ rewards remain persistently negative throughout the training process. These results collectively highlight that the MADDPG-SASP framework not only achieves excellent training convergence and robustness in complex 3D multi-agent environments with obstacles but also effectively establishes comprehensive tactical dominance over adversaries through efficient environmental perception and collaborative strategic planning.

6. Conclusions

In this paper, we propose an enhanced multi-agent reinforcement learning framework, MADDPG-SASP, which integrates an improved self-attention mechanism and a self-play training scheme into the MADDPG algorithm to optimize strategies in complex UAV confrontation tasks. By leveraging the self-play training mechanism, the agent can dynamically adjust its strategies within a three-dimensional adversarial environment, enabling self-enhancement of its strategic capabilities. The results demonstrate that, over the course of 600 training rounds, adopting the optimized strategy framework increases the agent’s win rate from 26.17% to 100%, significantly improving its decision-making ability and learning efficiency in adversarial tasks. Moreover, in the adversarial Predator–Prey Scenario, when the agent employs the MADDPG-SASP strategy and the adversary adopts a multi-agent strategy, the agent achieves a win rate of 98.5% in the scenario with 2 agents versus 1 adversary and 100% in the scenario with 4 agents versus 2 adversaries.
This study not only offers a reliable solution for UAV countermeasure strategies but also contributes to the exploration of reinforcement learning applications in high-dimensional, dynamic environments. The incorporation of the self-attention mechanism enhances the agent’s ability to perceive and adapt to environmental features, providing a robust foundation for future advancements in multi-agent cooperation and strategy optimization.

Author Contributions

Methodology, Z.X. and F.L.; Software, Z.X. and F.L.; Visualization, Q.W.; Writing—original draft, Z.X. and F.L.; Writing—review & editing, Z.X. and F.L.; Formal analysis, Z.X. and F.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Central Guidance for Local Science and Technology Development Fund ZYYD2025QY19.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Xu, J.; Liang, G. Enhanced Q learning and deep reinforcement learning for unmanned combat intelligence planning in adversarial environments. Sci. Rep. 2025, 15, 28364. [Google Scholar] [CrossRef]
  2. Tedeschi, G.; Papini, M.; Metelli, A.M.; Restelli, M. Search or split: Policy gradient with adaptive policy space. Mach. Learn. 2025, 114, 186. [Google Scholar] [CrossRef]
  3. Zhao, Z.; Wang, Y. Soft actor-critic algorithm and improved GNN model in secure access control of disaggregated optical networks. Sci. Rep. 2025, 15, 29358. [Google Scholar] [CrossRef] [PubMed]
  4. Arulkumaran, K.; Deisenroth, M.P.; Brundage, M.; Bharath, A.A. Deep Reinforcement Learning: A Brief Survey. IEEE Signal Process. Mag. 2017, 34, 26–38. [Google Scholar] [CrossRef]
  5. Amerimehr, M.H.; Efazati, S.; Amani, N. A New DDPG-Based Algorithm for Sum Throughput Maximization in UAV-Assisted IRS-Empowered Wireless Sensor Network. Wirel. Pers. Commun. 2025, 141, 543–570. [Google Scholar] [CrossRef]
  6. Waseem, M.; Chang, Q. From Nash Q-learning to nash-MADDPG: Advancements in multiagent control for multiproduct flexible manufacturing systems. J. Manuf. Syst. 2024, 74, 129–140. [Google Scholar] [CrossRef]
  7. Zhou, T.; Liu, Z.; Jin, W.; Han, Z. Intelligent maneuver decision-making for UAVs using the TD3-LSTM reinforcement learning algorithm under uncertain information. Front. Robot. AI 2025, 12, 1645927. [Google Scholar] [CrossRef]
  8. Kan, Y.; Yang, M.; Qian, R.; Jiang, W.; He, Y.; Zhang, L. Research on DP-MPC control strategy based on active equalization system of bidirectional flyback transformer. Ionics 2025, 31, 11265–11280. [Google Scholar] [CrossRef]
  9. Bartolomei, L.; Teixeira, L.; Chli, M. Semantic-Aware Active Perception for UAVs Using Deep Reinforcement Learning. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021; pp. 3101–3108. [Google Scholar]
  10. Lyu, Z.; Zhu, G.; Xu, J. Joint Maneuver and Beamforming Design for UAV-Enabled Integrated Sensing and Communication. IEEE Trans. Wirel. Commun. 2023, 22, 2424–2440. [Google Scholar] [CrossRef]
  11. Duan, H.B.; Huo, M.Z.; Fan, Y.M. Flight verification of multiple UAVs collaborative air combat imitating the intelligent behavior in hawks. Control Theory Appl. 2018, 35, 1812–1819. [Google Scholar]
  12. Wang, X.; Yang, Y.; Wu, D.; Zhang, Z.; Ma, X. Mission-Oriented 3D Path Planning for High-Altitude Long-Endurance Solar-Powered UAVs With Optimal Energy Management. IEEE Access 2020, 8, 227629–227641. [Google Scholar] [CrossRef]
  13. Shen, Q.; Zhang, D.; He, Q.; Ban, Y.; Zuo, F. A novel multi-objective dung beetle optimizer for Multi-UAV cooperative path planning. Heliyon 2024, 10, e37286. [Google Scholar] [CrossRef] [PubMed]
  14. Chen, Y.; Dong, Q.; Shang, X.; Wu, Z.; Wang, J. Multi-UAV Autonomous Path Planning in Reconnaissance Missions Considering Incomplete Information: A Reinforcement Learning Method. Drones 2023, 7, 10. [Google Scholar] [CrossRef]
  15. Sun, Y.; Xia, H.; Su, C.; Zhang, R.; Wang, J.; Jia, K. A multi-agent enhanced DDPG method for federated learning resource allocation in IoT. Comput. Commun. 2025, 233, 108066. [Google Scholar] [CrossRef]
  16. Li, C.; Fu, Q.; Chen, J.; Lu, Y.; Wang, Y.; Wu, H. FS-DDPG: Optimal Control of a Fan Coil Unit System Based on Safe Reinforcement Learning. Buildings 2025, 15, 226. [Google Scholar] [CrossRef]
  17. Lu, Y.; Xu, C.; Wang, Y. Joint Computation Offloading and Trajectory Optimization for Edge Computing UAV: A KNN-DDPG Algorithm. Drones 2024, 8, 564. [Google Scholar] [CrossRef]
  18. Polyakov, A. Nonlinear Feedback Design for Fixed-Time Stabilization of Linear Control Systems. IEEE Trans. Autom. Control 2012, 57, 2106–2110. [Google Scholar] [CrossRef]
  19. Labbadi, M.; Boubaker, S.; Djemai, M.; Mekni, S.; Bekrar, A. Fixed-Time Fractional-Order Global Sliding Mode Control for Nonholonomic Mobile Robot Systems under External Disturbances. Fractal Fract. 2022, 6, 177. [Google Scholar] [CrossRef]
  20. Zeng, Y.; Zhang, R. Energy-Efficient UAV Communication With Trajectory Optimization. IEEE Trans. Wirel. Commun. 2017, 16, 3747–3760. [Google Scholar] [CrossRef]
  21. Feng, Z.; Wu, Z.; Zou, J.; Cheng, L.; Zhao, X.; Zhang, X.; Lu, J.; Wang, C.; Wang, Y.; Wang, H.; et al. Memristive Bellman solver for decision-making. Nat. Commun. 2025, 16, 4925. [Google Scholar] [CrossRef]
  22. Gadiraju, D.S.; Karmakar, P.; Shah, V.K.; Aggarwal, V. GLIDE: Multi-Agent Deep Reinforcement Learning for Coordinated UAV Control in Dynamic Military Environments. Information 2024, 15, 477. [Google Scholar] [CrossRef]
  23. Liu, Q.; Yan, H.; Chen, K.; Wang, M.; Li, Z. Distributed Nash equilibrium solution for multi-agent game in adversarial environment: A reinforcement learning method. Automatica 2025, 178, 112342. [Google Scholar] [CrossRef]
  24. Wu, D.; Gao, Q. Intelligent detection method of small targets in UAV based on attention mechanism and edge enhancement filtering. Alex. Eng. J. 2025, 115, 201–209. [Google Scholar] [CrossRef]
  25. Chansuparp, M.; Jitkajornwanich, K. A Novel Augmentative Backward Reward Function with Deep Reinforcement Learning for Autonomous UAV Navigation. Appl. Artif. Intell. 2022, 36, 2084473. [Google Scholar] [CrossRef]
  26. Liu, K.; Zhao, Y.; Wang, G.; Peng, B. Self-Attention-Based Multi-Agent Continuous Control Method in Cooperative Environments. Inf. Sci. 2022, 585, 454–470. [Google Scholar] [CrossRef]
Figure 1. Strategy optimization framework of MADDPG-SASP.
Figure 2. The framework of the improved MADDPG.
Figure 3. The framework of the improved self-attention mechanism.
Figure 4. Changes in the angle of both parties over the game rounds under the MADDPG strategy.
Figure 5. Changes in the height of both parties over the game rounds under the MADDPG strategy.
Figure 6. Changes in the speed of both parties over the game rounds under the MADDPG strategy.
Figure 7. Changes in the energy of both parties over the game rounds under the MADDPG strategy.
Figure 8. Changes in the angle of both parties over the game rounds under the MADDPG-SASP strategy.
Figure 9. Changes in the height of both parties over the game rounds under the MADDPG-SASP strategy.
Figure 10. Changes in the speed of both parties over the game rounds under the MADDPG-SASP strategy.
Figure 11. Changes in the energy of both parties over the game rounds under the MADDPG-SASP strategy.
Figure 12. Trajectory diagram of agent and enemy. (a) Trajectory of both sides after 100 rounds; (b) Trajectory of both sides after 200 rounds; (c) Trajectory of both sides after 300 rounds; (d) Trajectory of both sides after 400 rounds; (e) Trajectory of both sides after 500 rounds; (f) Trajectory of both sides after 600 rounds.
Figure 13. Agent and enemy’s rewards (MADDPG-SASP).
Figure 14. Results of the ablation experiment.
Figure 15. Results of the comparative experiment.
Figure 16. Sensitivity analysis results. (a) Sensitivity analysis of parameter actor_lr; (b) Sensitivity analysis of parameter critic_lr; (c) Sensitivity analysis of parameter gamma; (d) Sensitivity analysis of parameter tau.
Figure 17. Trajectory diagram of agent and enemy (2 agents vs. 1 enemy in the Predator–Prey Scenario). (a) Trajectory of both sides after 100 rounds; (b) Trajectory of both sides after 200 rounds; (c) Trajectory of both sides after 300 rounds; (d) Trajectory of both sides after 400 rounds; (e) Trajectory of both sides after 500 rounds; (f) Trajectory of both sides after 600 rounds. (In the trajectories of the agent and the enemy, ‘.’ denotes the starting point, while ‘×’ denotes the endpoint).
Figure 18. Agent and enemy’s rewards (2 agents vs. 1 enemy in the Predator–Prey Scenario).
Figure 19. Trajectory diagram of agent and enemy (4 agents vs. 2 enemies in the Predator–Prey Scenario). (a) Trajectory of both sides after 100 rounds; (b) Trajectory of both sides after 200 rounds; (c) Trajectory of both sides after 300 rounds; (d) Trajectory of both sides after 400 rounds; (e) Trajectory of both sides after 500 rounds; (f) Trajectory of both sides after 600 rounds. (In the trajectories of the agent and the enemy, ‘.’ denotes the starting point, while ‘×’ denotes the endpoint).
Figure 20. Agent and enemy’s rewards (4 agents vs. 2 enemies in the Predator–Prey Scenario).
Table 1. Configuration of experimental parameters.
Parameters | Values
space_size | [−1000, 1000]
gamma | 0.99
tau | 0.01
actor_lr | 1 × 10⁻³
critic_lr | 1 × 10⁻³
batch_size | 10
replay_buffer_max_size | 100,000
rounds | 600
state_dim | 3
action_dim | 3
max_action | 1
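For convenience, the settings in Table 1 can be collected into a single configuration object; the dictionary below merely restates those values, and the key names are illustrative rather than the exact code interface.

```python
# Table 1 restated as a plain configuration dictionary; key names are illustrative.
EXPERIMENT_CONFIG = {
    "space_size": (-1000, 1000),
    "gamma": 0.99,
    "tau": 0.01,
    "actor_lr": 1e-3,
    "critic_lr": 1e-3,
    "batch_size": 10,
    "replay_buffer_max_size": 100_000,
    "rounds": 600,
    "state_dim": 3,
    "action_dim": 3,
    "max_action": 1,
}
```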
Table 2. Ablation experiments for different methods (600 rounds).
Methods | MADDPG | SP-MADDPG | SA-MADDPG | MADDPG-SASP
Average reward | 1298.228 | −582.772 | 5319.161 | 15,176.07
Agent’s wins | 124 | 229 | 598 | 600
Enemy’s wins | 476 | 371 | 2 | 0
Win rate | 26.17% | 38.17% | 99.67% | 100.00%
Table 3. Comparative experiments for different methods (600 rounds).
Methods | PPO | DQN | DDPG | MADDPG-SASP
Average reward | −4665.701 | −9396.234 | −3779.358 | 15,176.07
Agent’s wins | 380 | 294 | 526 | 600
Enemy’s wins | 220 | 306 | 74 | 0
Win rate | 63.33% | 49.00% | 87.67% | 100.00%
Table 4. Predator–prey game scenario 1 parameters for drones.
Parameter | Predator (Agents) | Prey (Enemies) | Unit
UAV count | 2 | 1 | units
Obstacles | 3 | 3 | units
Max speed | 8.0 | 12.0 | m/s
Acceleration | 2.0 | 3.5 | m/s²
Turn rate | 120 | 150 | °/s
Collision radius | 1.5 | 1.2 | m
Table 5. Predator–prey game scenario 2 parameters for drones.
Parameter | Predator (Agents) | Prey (Enemies) | Unit
UAV count | 4 | 2 | units
Obstacles | 3 | 3 | units
Max speed | 8.0 | 20.0 | m/s
Acceleration | 2.0 | 3.5 | m/s²
Turn rate | 120 | 150 | °/s
Collision radius | 1.5 | 1.2 | m
Table 6. Intelligent agent combat results of Scenario 1.
Metric | Total Rounds | Agent Wins | Agent Win Rate | Average Reward
Value | 600 | 591 | 98.50% | 8391.43
Table 7. Intelligent agent combat results of Scenario 2.
Metric | Total Rounds | Agent Wins | Agent Win Rate | Average Reward
Value | 600 | 600 | 100.00% | 4743.01
