Graph Neural Network-Enhanced Multi-Agent Reinforcement Learning for Intelligent UAV Confrontation

Hu, Kunhao; Pan, Hao; Han, Chunlei; Sun, Jianjun; An, Dou; Li, Shuanglin

doi:10.3390/aerospace12080687

Open AccessArticle

Graph Neural Network-Enhanced Multi-Agent Reinforcement Learning for Intelligent UAV Confrontation

by

Kunhao Hu

¹,

Hao Pan

^2,3,*,

Chunlei Han

^2,3,

Jianjun Sun

^2,3,

Dou An

¹

and

Shuanglin Li

^2,3

¹

School of Automation Science and Engineering, Xi’an Jiaotong University, Xi’an 710049, China

²

National Key Laboratory of Multi-Domain Data Collaborative Processing and Control, Xi’an 710068, China

³

Xi’an Research Institute of Navigation Technology, Xi’an 710068, China

^*

Author to whom correspondence should be addressed.

Aerospace 2025, 12(8), 687; https://doi.org/10.3390/aerospace12080687

Submission received: 8 June 2025 / Revised: 17 July 2025 / Accepted: 21 July 2025 / Published: 31 July 2025

(This article belongs to the Special Issue New Perspective on Flight Guidance, Control and Dynamics)

Download

Browse Figures

Versions Notes

Abstract

Unmanned aerial vehicles (UAVs) are widely used in surveillance and combat for their efficiency and autonomy, whilst complex, dynamic environments challenge the modeling of inter-agent relations and information transmission. This research proposes a novel UAV tactical choice-making algorithm utilizing graph neural networks to tackle these challenges. The proposed algorithm employs a graph neural network to process the observed state information, the convolved output of which is then fed into a reconstructed critic network incorporating a Laplacian convolution kernel. This research first enhances the accuracy of obtaining unstable state information in hostile environments. The proposed algorithm uses this information to train a more precise critic network. In turn, this improved critic network guides the actor network to make decisions that better meet the needs of the battlefield. Coupled with a policy transfer mechanism, this architecture significantly enhances the decision-making efficiency and environmental adaptability within the multi-agent system. Results from the experiments show that the average effectiveness of the proposed algorithm across the six planned scenarios is 97.4%, surpassing the baseline by 23.4%. In addition, the integration of transfer learning makes the network convergence speed three times faster than that of the baseline algorithm. This algorithm effectively improves the information transmission efficiency between the environment and the UAV and provides strong support for UAV formation combat.

Keywords:

unmanned aerial vehicle; decision making; multi-agent reinforcement learning; graph convolutional network; transfer learning

1. Introduction

Unmanned aerial vehicles (UAVs) are taking on an increasingly necessary role in station exploration and military processes [1]. As typical cyber-physical systems, unmanned combat aircraft integrate capabilities such as surveillance, target recognition, and precision strikes [2]. By leveraging advanced control technologies and intelligent decision-making algorithms, these platforms can process complex battlefield information in real time [3]. This enables more effective tactical decision making and significantly enhances one’s ability to gain and maintain air superiority [4].

In recent years, autonomous decision making for UAVs has received increasing attention [5]. A defining characteristic of decision making in such scenarios is the prevalence of incomplete and uncertain information [6]. The combat process lacks fixed rules, with only constraints arising from weapon configurations and maneuverability. Both parties strategically conceal their intentions and remain unaware of their opponent’s tactics, a situation from which a distinctly dynamic and adversarial environment ensues. To tackle such challenges, present strategies can typically be divided into three primary categories, which are (1) game theory-based methods [7], (2) optimization theory-based methods [8], and (3) artificial intelligence-based methods [9].

The game theory-based methods view aerial combat as a process in which multiple decision makers engage in both cooperation and competition to maximize their individual task completion [10]. Park et al. [11] and Zhang et al. [12] recommended a hierarchical decision-making system for one-on-one aerial combat scenarios, as mentioned in differential game explanation, where the upper layer is responsible for tactical intent decisions such as attack and defense, while the lower layer is responsible for aircraft maneuver decisions. However, game theory-based methods have limitations despite their strong theoretical foundations. They often struggle to accurately model real-world scenarios and handle the inherent uncertainties of dynamic environments like air combat. Consequently, these methods face significant challenges in managing high-dimensional decision spaces and deriving truly optimal strategies. This ultimately limits their practical effectiveness in highly complex environments. Optimization theory-based methods formulate aerial combat decision making as a multi-objective optimization problem and solve it using mathematical optimization techniques [13]. Virtanen et al. [14] employed reachable set theory and the adaptive tuning of target state weights to address the issue of incomplete information and to make tactical maneuver decisions grounded in fuzzy logic and target intent prediction. However, they often fail to meet real-time requirements, limiting their applicability in online combat scenarios.

The complexity of modern battlefields increases the demands on UAV decision-making algorithms, particularly in extracting key information and making timely decisions [15]. Game theory-based methods struggle with accurately modeling real scenarios and handling large decision spaces, while optimization theory-based methods fail to meet real-time requirements. In contrast, artificial intelligence-based methods present substantial advantages for UAV combat decision making. They can effectively process complex battlefield information that often exceeds the capabilities of game theory-based methods. Furthermore, they can generate strategies with the real-time responsiveness that typically eludes optimization theory-based methods [16,17,18].

Current mainstream methods based on artificial intelligence include value decomposition methods, communication-based methods, and hierarchical reinforcement learning. The value decomposition method promotes cooperation among agents by decomposing the centralized global Q function into the local Q functions of each agent. Do et al. [19] proposed a novel approach called action-branching QMIX based on multi-agent deep reinforcement learning, which employs a new long short-term memory module to effectively control long sequences and enables it to adapt to changing environmental variables in real time. Pan et al. [20] propose the Qedgix framework, which combines GNN and the QMIX algorithm to achieve distributed optimization of the age of information for users in unknown scenarios. The framework utilizes GNN to extract information from UAVs, users within the observable range, and other UAVs within the communicable range, thus enabling effective UAV trajectory planning. To ensure the effectiveness of decentralized execution, these approaches usually require imposing strong structural constraints on the relationship between the global value function and the local value function. Such constraints may prevent the algorithm from learning certain optimal strategies. The communication-based method allows agents to directly coordinate their behaviors by learning explicit communication protocols. Agents exchange and receive information during the decision-making process to achieve more explicit coordination. Agents not only need to learn how to act, but also need to learn when and what to send and to whom, thereby forming an effective communication protocol. Han et al. [21] formulated the communication-constrained multi-UAV air combat problem as a Markov game and proposed a novel sparse-inferred intention sharing multi-agent reinforcement learning algorithm to improve the winning rate of multi-UAV air combat. It enables efficient sparsity in communication without causing performance penalties, but it also brings communication overhead. Hierarchical reinforcement learning solves the “dimension disaster” and sparse reward problems faced by traditional reinforcement learning by decomposing complex long-term tasks into multiple simpler and shorter subtasks. Li et al. [22] proposed a hierarchical reinforcement learning framework oriented towards unmanned combat aircraft to address the challenge of temporal abstraction in autonomous air combat for UAV within visual range. They incorporate the maximum-entropy objective into the MEOL framework to optimize the autonomous low-level tactical discovery and high-level option selection which simplifies long-term planning and coordination through task decomposition. However, its effectiveness largely depends on the quality of the task decomposition. This process is not only time-consuming and labor-intensive, but its final outcome is also entirely dependent on the domain knowledge and intuition of experts.

To address the above challenges, this paper proposes a new multi-agent reinforcement learning framework for tactical maneuver optimization, achieved by redefining the state space, action space, and reward function. Building on this, a multi-agent deep deterministic policy gradient algorithm (G-MADDPG) enhanced by graph convolutional networks is introduced, in which the critic network leverages graph convolutional networks (GCNs) to effectively extract structural features from complex and dynamic battlefield environments. To further improve learning efficiency, a transfer learning strategy is integrated into the framework, allowing the reuse of policy knowledge from a trained source domain in a new target domain, thereby accelerating convergence and improving adaptability. The simulation results of six UAV combat scenarios show that the proposed approach markedly enhances win rates and cumulative rewards, while also achieving faster convergence than the baseline algorithms.

The main contributions of this paper are summarized as follows:

1. We developed a maneuvering model for multi-UAV combat and formalized the decision-making process using a Markov decision process (MDP) framework. This framework defines the state, action, and a composite reward function based on the angle, distance, and outcome advantages. By integrating tactical and strategic factors, this reward function better captures the complexities of air combat and accelerates the algorithm convergence by providing more informative, goal-aligned feedback.

2. A novel multi-agent reinforcement learning algorithm enhanced with graph convolutional networks (GCNs) is proposed to enhance the representation and assessment of complex environmental states in UAV coordination and confrontation tasks.

3. A transfer learning mechanism is introduced to accelerate convergence and facilitate policy adaptation in different UAV combat scenarios, and improving overall mission performance by reducing the training time and computational resources. Through simulations conducted across six UAV combat scenarios, the results demonstrate that the introduction of graph neural networks effectively enhances the win rates and reward returns of multi-agent algorithms, while the incorporation of transfer learning techniques significantly accelerates the convergence speed.

The remainder of this paper is organized as follows: In Section 2, the UAV maneuvering model is analyzed, along with the definitions of the agent’s state and action. In Section 3, the GCN-based intelligent decision-making algorithm for UAVs is proposed. Additionally, transfer learning is applied to enhance the training efficiency of the model by leveraging a source domain policy. In Section 4, simulations are conducted to demonstrate the effectiveness of the proposed algorithm and the convergence acceleration achieved through transfer learning.

2. Mathematical Statements

This section details the mathematical foundations underpinning our research. We begin by establishing the physical dynamics of the aircraft through a comprehensive UAV maneuver model. Building upon this physical framework, we then formalize the multi-agent adversarial problem as a Markov decision process, which provides the structure for our reinforcement learning approach. This formalization includes a precise definition of the state space, action space, and the composite reward function that guides the agents’ learning process.

2.1. UAV Maneuver Model

The maneuver model is the basis of intelligent UAV decision making [23]. It is used to define the aircraft’s dynamic responses and interactions within the simulation environment, which is crucial for the effective training of G-MADDPG algorithm [24]. The UAV maneuver model is shown in Figure 1.

The overload of the UAV reflects the maneuverability of an aircraft during flight and is defined as

n = \frac{N}{W}

(1)

where N denotes the combined aerodynamic and thrust forces, W is the gravity of the aircraft.

Assuming that the UAV flies without sideslip and the engine thrust is aligned with the flight velocity vector, the projection of the center of mass motion in the trajectory coordinate system is described by

\{\begin{matrix} m \frac{d v}{d t} = T - D - W sin γ \\ m V cos \frac{d φ}{d t} = L sin μ \\ - m V \frac{d γ}{d t} = - L cos μ + W cos γ \end{matrix}

(2)

where L is the lift force, T is the engine thrust, and D is the air resistance along the velocity direction. The

γ

,

φ

, and

μ

represent the UAV flight path inclination, heading and roll angles, respectively. Then, the projected components of the overload on the axial coordinate system can be obtained by

\{\begin{matrix} n_{x} = \frac{T - D}{W} \\ n_{y} = \frac{L sin μ}{W} \\ n_{z} = \frac{L cos μ}{W} \end{matrix}

(3)

where

n_{x}

denotes the tangential overload and

n_{y}

and

n_{z}

denote vectors that are orthogonal to the velocity vector. The vector sum of

n_{y}

and

n_{z}

is the normal overload

n_{n}

, which can be expressed as

n_{n} = \sqrt{n_{y}^{2} + n_{z}^{2}}

(4)

By combining Equations (4) and (3), the relationship between overload and UAV motion can be written as

\{\begin{matrix} \dot{v} = g (n_{x} - sin γ) \\ \dot{γ} = \frac{g}{v} (n_{n} cos μ - cos γ) \\ \dot{φ} = \frac{g n_{n} sin μ}{v cos γ} \end{matrix}

(5)

Subsequently, based on the kinematic equations, the UAV velocity along the x, y, and z directions are formulated as

\{\begin{matrix} \dot{x} = v cos γ sin φ \\ \dot{y} = v cos γ cos φ \\ \dot{z} = v sin γ \end{matrix}

(6)

2.2. Markov Decision Process for the UAV Game Confrontation Policy

Building on the above model, the UAV adversarial strategy is formulated as an MDP, defined by a state space, action space, and reward function that together capture the environment, agent behavior, and optimization objectives [25]. The components of the process are described in detail below.

2.2.1. State Space

To support autonomous decision making in multi-agent air combat scenarios, a structured state space is designed to capture the essential information required by each UAV. The state space includes the self-information, friendly relative information, and enemy relative information. The details are shown in Figure 2.

In this study, aircraft self-information contains the aircraft serial number and the aircraft’s state information [26]. One-hot encoding is used to process the aircraft serial number information to extend the feature to the Euclidean vector space, and a certain value of the feature corresponds to a certain coordinate point in the Euclidean vector space, which is denoted as

O n e - {H o t}^{i} = [a_{0}, a_{1}, a_{2}, \dots, a_{j}, \dots, a_{n}] = \{\begin{matrix} 1 & j = i \\ 0 & j \neq i \end{matrix}

(7)

An aircraft’s state information

S_{t}

at time t includes the current position, attitude angles, and velocity, which are collectively denoted as

S_{t} = [x_{t}, y_{t}, φ_{t}, v_{t}]

(8)

The relative information of the friendly force is the information about the friendly forces that can be observed by the aircraft [27], including the serial number information, the relative position information, and the relative velocity information. In this study, when a friendly aircraft is outside the observation range of the current aircraft, the relative information of the friendly force is 0, and when a friendly aircraft is shot down within the observation range of the surviving aircraft, both the relative position and velocity of the friendly aircraft are assigned a value of

- 1

. When a friendly aircraft is active within the observation range of the surviving aircraft

d_{d e t e c t}

, the information about the friendly force includes the serial number

O n e - H o t^{i}

, the relative information of the friendly force in the x direction

\bar{x_{a}}

, the relative information of the friendly force in the y direction

\bar{y_{a}}

, the relative distance of the friendly force

\bar{y_{a}}

, the relative velocity of the friendly force

\bar{v_{a}}

, and the heading angle of the friendly force

\bar{φ_{a}}

, which are expressed as

f_{t} = \{\begin{matrix} [O n e - H o t_{a}, 0, \dots, 0] & d > d_{d e t e c t} \\ [O n e - H o t_{a}, - 1, \dots, - 1] & d < d_{d e t e c t a n d t h e t a r g e t a i r c r a f t w a s s h o t d o w n .} \\ [O n e - H o t_{a}, \bar{x_{a}}, \bar{y_{a}}, φ_{a}, \bar{d_{a}}, \bar{v_{a}}] & e l s e \end{matrix}

(9)

Similarly, the enemy-force relative position information is designed in the same way as the friendly-force relative position information, which is expressed as

e_{t} = \{\begin{matrix} [O n e - H o t_{e}, 0, \dots, 0] & d > d_{d e t e c t} \\ [O n e - H o t_{e}, - 1, \dots, - 1] & d < d_{d e t e c t a n d t h e t a r g e t a i r c r a f t w a s s h o t d o w n .} \\ [O n e - H o t_{e}, \bar{x_{e}}, \bar{y_{e}}, φ_{e}, \bar{d_{e}}, \bar{v_{e}}] & e l s e \end{matrix}

(10)

2.2.2. Action Space

To allow the algorithm to control the UAV in the simulation environment, this study simplifies the aircraft maneuver model and designs a two-dimensional continuous action space composed of the aircraft velocity and the aircraft heading angular acceleration, which is expressed as

A = \{\begin{matrix} a_{v} \in [v_{m i n}, v_{m a x}] \\ a_{Δ φ} \in [{- Δ φ}_{m a x}, {Δ φ}_{m a x}] \end{matrix}

(11)

where A represents the complete action space available to the UAV, and

a_{v}

represents the speed.

a_{Δ φ}

refers to the action space of the heading angle acceleration.

v_{m i n}

and

v_{m a x}

indicate the minimum and maximum speeds, respectively.

{Δ φ}_{m a x}

denotes the maximum heading angle acceleration.

2.2.3. Reward Function

Here, based on the basic rules of UAV game confrontation, it is assumed that drones have three situational advantages.

(1) Angle advantage: To gain an advantage in air combat, the UAV should minimize its exposure to the enemy’s attack angle while maximizing the likelihood of the enemy entering its own attack angle range. Based on the rules of UAV engagement, the angle advantage is defined as

r_{a} = \frac{ϕ_{t s} - ϕ_{s t}}{180^{\circ}}

(12)

ϕ_{s t} = ϕ - φ_{s}

(13)

ϕ_{t s} = ϕ - φ_{t}

(14)

where

r_{a}

denotes the angular advantage.

ϕ_{s t}

and

ϕ_{t s}

represent the relative angles between the two UAV, measured with respect to the heading directions of the UAV and the target, respectively.

ϕ

denotes the angle between the two UAVs.

φ_{s}

and

φ_{t}

denote the heading angles of the UAV and the target, respectively.

(2) Distance advantage: To destroy the enemy, our UAV is required to allow the enemy to enter its attack range while maintaining a safe distance to ensure operational security, which is expressed as

r_{d} = m i n (2, \frac{d_{a t t a c k}}{d})

(15)

where

r_{d}

is the distance advantage value,

d_{a t t a c k}

denotes the attack range of our UAV, and d denotes the distance between the two UAVs. The introduction of 2 ensures that the distance between our UAV and the enemy UAV is not less than twice the attack distance, protecting the safety of our UAV.

(3) Outcome advantage: To guide the aircraft to achieve the tactical goal, when the aircraft completes a task, a certain positive reward should be provided. On the contrary, when the aircraft fails the task, a certain negative reward should be provided as a punishment. This process is expressed as

r_{f} = \{\begin{matrix} r^{*} & The aircraft accomplished the tactical goal . \\ - r^{*} & The aircraft did not achieve the tactical goal . \end{matrix}

(16)

where

r^{*}

and

- r^{*}

, respectively, denote the reward values for achieving and not achieving the tactical goal, and they are fixed values that have been preset in advance.

The angle advantage, distance advantage, and outcome advantage are not mutually exclusive. Instead, they are complementary and collectively guide the UAV to achieve victory in unmanned aerial combat. Consequently, we propose an overall reward function that integrates these three distinct reward components, which is engineered to effectively facilitate the training process of the G-MADDPG algorithm. It is expressed as

r = r_{a} + r_{d} + r_{f}

(17)

3. Methodology

In this part, we put forward a UAV decision-making algorithm based on graph convolutional networks (G-MADDPG), along with a GCN-enhanced critic network model designed for multi-agent environments. The critic network uses GCN to evaluate actions through local observations and alleviates the problem of incomplete global information by enhancing the interaction among local nodes. In addition, experience replay and periodic updates of the target network are combined to ensure training stability and improve the overall policy effectiveness.

3.1. Intelligent Decision-Making Algorithm Based on GCN

The G-MADDPG algorithm is built upon the multi-agent actor–critic framework. In this setting, each UAV is considered an individual agent, with the separate deployment of actor and critic networks. The comprehensive view of the algorithm’s structure is presented in Figure 3.

Specifically, the actor network comprises an input layer, an autoencoding layer, a fully connected layer, and an output layer. During the decision-making process, the actor network receives the agent’s local observation as input and generates the corresponding action. The critic network comprises an input layer, a graph convolutional layer, an information selection layer, an autoencoding layer, a fully connected layer, and an output layer. During training, the critic network receives a graph structure that includes the agent’s local observations and action information as input, and outputs an evaluation of the agent’s current action.

3.2. Multi-Agent State Extraction Based on GCN

In real-world air combat, the constraints of detection systems lead to the incomplete knowledge of enemy units, making it difficult to acquire precise the global state information necessary for effectively training the critic network [28]. Considering the above problems, this paper constructs a critic network leveraging graph neural networks to accurately approximate the Q-function in the context of unmanned aerial combat [29]. This study embeds the inter-agent graph structure into the critic network’s input representation. We assume a fully connected communication graph where each agent can send and receive information from all the other agents without delay or distortion [30,31].

Under this assumption, the input form of the critic relationships is changed from a one-dimensional vector into a two-dimensional graph structure vector serving the multi-agent contacts, where the row value of the graph structure vector represents the agent label and the row vector represents the agent’s local observation information. After the state information fusion of the graph neural network, the network still generates a two-dimensional graph structure vector. Therefore, it is necessary to add an information selection layer after the graph neural network to select the required agent feature vector, as shown in Figure 4.

The actor network maps each agent’s local observation to a continuous action. It consists of an input layer, two fully connected hidden layers with 64 neurons each using ReLU activation functions, and an output layer with two neurons. This output layer generates continuous control signals for velocity and heading angular acceleration, which are scaled to the range of

[- 1, 1]

using the Tanh activation function. In contrast, the GCN-based critic network evaluates the state-action value by processing a graph-structured input that represents the joint states and actions of all agents. This input is passed through a graph convolutional layer that applies a Laplacian kernel, as defined in Equation (23), together with a ReLU activation function, resulting in a 64-dimensional feature vector for each node. A dedicated selection mechanism then extracts the feature vector corresponding to the agent being evaluated. This vector is fed into a fully connected hidden layer with 64 neurons and ReLU activation, and subsequently into a final output layer consisting of a single neuron with linear activation to estimate the Q-value. By aggregating information from neighboring nodes, the graph convolution enables the critic to capture both local interactions and global coordination patterns effectively.

Building on the GCN, this study adopts the Laplace equation as the convolution kernel to integrate node information. In a fully connected graph structure, every node has connections to all other nodes. From the graph structure, one can derive the adjacency matrix A and degree matrix D as

A = [\begin{matrix} 0 & 1 & 1 & \dots & 1 \\ 1 & 0 & 1 & \dots & 1 \\ 1 & 1 & 0 & \dots & 1 \\ \dots & \dots & \dots & \dots & \dots \\ 1 & 1 & 1 & \dots & 0 \end{matrix}]

(18)

D = [\begin{matrix} n - 1 & 0 & 0 & \dots & 0 \\ 0 & n - 1 & 0 & \dots & 0 \\ 0 & 0 & n - 1 & \dots & 0 \\ \dots & \dots & \dots & \dots & \dots \\ 0 & 0 & 0 & \dots & n - 1 \end{matrix}]

(19)

Then, to prevent nodes from losing their own information, a self-adjacency matrix

\tilde{A}

and a self-degree matrix

\tilde{D}

are defined as

\tilde{A} = A + I_{N} = [\begin{matrix} 1 & 1 & 1 & \dots & 1 \\ 1 & 1 & 1 & \dots & 1 \\ 1 & 1 & 1 & \dots & 1 \\ \dots & \dots & \dots & \dots & \dots \\ 1 & 1 & 1 & \dots & 1 \end{matrix}]

(20)

\tilde{D} = \sum_{j} {\tilde{A}}_{i j} = [\begin{matrix} n & 0 & 0 & \dots & 0 \\ 0 & n & 0 & \dots & 0 \\ 0 & 0 & n & \dots & 0 \\ \dots & \dots & \dots & \dots & \dots \\ 0 & 0 & 0 & \dots & n \end{matrix}]

(21)

Finally, according to the self-adjacency matrix and the self-degree matrix, a graph convolution kernel based on the Laplace Equation L can be obtained as

L = {\tilde{D}}^{- \frac{1}{2}} \tilde{A} {\tilde{D}}^{- \frac{1}{2}}

(22)

In this way, the graph convolution formula in this paper is defined as

H^{l + 1} = f (L H^{l} W^{l})

(23)

where

H^{l + 1}

represents the node feature matrix at layer

l + 1

,

H^{l}

represents the node feature matrix at layer l,

W^{l}

represents the weight matrix at layer l, and

f (\cdot)

represents the activation function.

3.3. Training Process of Intelligent Decision-Making Algorithm Based on GCN

The critic network relies on a substantial volume of labeled data during training. In order to efficiently leverage interactive experience for training the reinforcement learning algorithm, the G-MADDPG algorithm adopts the experience playback mechanism. After each interaction with the environment, the quadruple containing the current local observation information, action information, environmental reward, and local observation information at the next moment is stored in the experience pool to facilitate subsequent model training. Furthermore, during the training process, to eliminate the correlation between each experience, the algorithm uses simple random sampling to select training data from the experience pool to train the neural network.

To improve the critic network’s training stability, the G-MADDPG algorithm also maintains a target critic network to fit the state-value function. According to the Bellman equation, the critic network’s loss function is formulated as

L (θ_{i}) = E [{(Q_{i}^{u} (o_{1}, o_{2}, o_{3}, \dots, o_{n}, a_{1}, a_{2}, a_{3}, \dots, a_{n}) - (r + Q_{i}^{u^{'}} ({o_{1}}^{'}, {o_{2}}^{'}, {o_{3}}^{'}, \dots, {o_{n}}^{'}, a_{1}, a_{2}, a_{3}, \dots, a_{n})))}^{2}]

(24)

where

Q_{i}^{u}

is the action-value function for the i-th agent at current time step, which evaluates the expected return given the current observations

o_{1}, o_{2}, o_{3}, \dots, o_{n}

and actions

a_{1}, a_{2}, a_{3}, \dots, a_{n}

of all agents. r denotes the immediate reward received after executing the action, and

Q_{i}^{u^{'}}

is the estimated action-value function at the next time step, based on the observations at the next time step

{o_{1}}^{'}, {o_{2}}^{'}, {o_{3}}^{'}, \dots, {o_{n}}^{'}

and the current actions

a_{1}, a_{2}, a_{3}, \dots, a_{n}

. For notational convenience, we denote the joint set of all agents’ local observations as

o = (o_{1}, o_{2}, o_{3}, \dots, o_{n})

and the joint action as

a = (a_{1}, a_{2}, a_{3}, \dots, a_{n})

. The observations and actions at the next timestep are denoted as

o^{'}

and

a^{'}

, respectively.

To improve the convergence rate and training stability of the actor network algorithm, the G-MADDPG algorithm uses the critic network’s output to guide the actor network’s training. Thus, the loss function of the actor network is

\nabla_{θ_{a}} J = E_{s} [\nabla_{θ_{i}} μ_{i} (a_{i} | o_{i}) \nabla_{a_{i}} Q_{i}^{u} (o_{1}, o_{2}, o_{3}, \dots, o_{n}, a_{1}, a_{2}, a_{3}, \dots, a_{n}) | a_{i} = μ_{i} (o_{i})]

(25)

where

μ_{i} (a_{i} | o_{i})

is the policy function of the i-th agent, representing the probability distribution over actions

a_{i}

given the observation

o_{i}

. The term

\nabla_{θ_{i}} μ_{i} (a_{i} | o_{i})

signifies the parameter-wise gradient of the policy function, and

\nabla_{a_{i}} Q_{i}^{u}

signifies the derivative of the Q-value function concerning the action input.

Algorithm 1 provides a detailed description of the implementation process of the G-MADDPG algorithm, and the specific training process is as follows:

Step 1: Each agent is equipped at initialization with an actor network, a critic network, and a target critic network. The actor network is used to fit strategies and make decisions about agent actions. The critic network approximates the action-value function, assesses the agent’s actions, and provides feedback for training the actor network. The target critic network helps train the critic network by fitting the target value according to the Equation (24).

Step 2: The agents decide on the current actions and interact with the environment according to the actions selected by their own actor networks and the preset exploration-utilization strategy. To avoid the algorithm converging to a local optimum, a noise exploration strategy mechanism for the agents is designed. The noise exploration strategy mechanism is described as

a_{t}^{i} = A c t o r (o_{t}^{i}) + ε N (0, 1)

(26)

The formula indicates that, at the beginning of the training, due to the influence of Gaussian white noise, the agents mainly rely on exploration to make decisions. With an increase in training time, the influence of white Gaussian noise gradually decreases under the action of the noise discount factor

ε

, and the agents mainly rely on the MARL algorithm proposed by us for decision making. After the agents complete the process of selecting actions, the agents’ local observation information, action information, and reward return and the agents’ local observation information at the next moment are stored in the experience pool.

Step 3: The critic network and the target critic network take data from the experience pool for training.

Step 4: The actor network is trained using information from the updated critic network.

Step 5: Steps 1 to 4 are repeated to continuously update the actor network and the critic network until the algorithm converges or terminates.

Algorithm 1 G-MADDPG Algorithm

Input: Environment model

E n v

, Number of training steps l, Number of episodes N, Discount factor

γ

, Experience pool size M, Sample size m, Target Network Update

τ

, Exploration-exploitation strategy

ϵ

Output: Optimal policy

π

1:: Initialize the actor network, critic network, and target critic network.
2:: for episode $i = 0$ to N do do
3:: Initialize the environment model.
4:: for train step $j = 0$ to l do
5:: for each UAV k do do
6:: Obtain local observation $o_{k}^{j}$ from environment.
7:: Select action $a_{k}^{j}$ using actor network and strategy $ϵ$ .
8:: end for
9:: Generate action set $A_{j}$ .
10:: Environment transitions to $S_{j + 1}$ using action $A_{j}$ and transition matrix P.
11:: for each UAV k do do
12:: Receive reward $r_{k}^{j}$ and new observation $o_{k}^{j + 1}$ .
13:: end for
14:: Generate observation set $O_{j}$ , reward set $R_{j}$ , and new observation set $O_{j + 1}$ .
15:: if experience pool is full then
16:: Update experience pool.
17:: else
18:: Store experience $(O_{j}, A_{j}, R_{j}, O_{j + 1})$ in pool.
19:: end if
20:: Sample m experiences from pool.
21:: Update critic network using Equation (24).
22:: Update actor network using Equation (25).
23:: Soft update target network using Equation (24) with $τ$ as the soft update parameter.
24:: end for
25:: end for

3.4. MARL Based on Transfer Learning

Under the constraints of computing resources, the above training process has the disadvantages of long training times and slow algorithm convergence. In the UAV game confrontation environment, the algorithm must be retrained if the scenario changes, which further amplifies the shortcomings of the training process. As a result, enhancing the convergence rate of multi-agent algorithms during training is a key challenge in UAV game theory applications.

The prior policy network obtained by the algorithm in a single-game confrontation scenario is constructed as the source domain set

M_{s} = \{m_{s} | m_{s} \in M_{s}\}

in transfer learning, and the policy network trained in the current scenario is constructed as the target domain in transfer learning, denoted as

M_{t}

. If the target and source domains are identical, then knowledge can be completely transferred between them, which is expressed as

M_{s} = M_{t}

. If the target and source domains are different, then knowledge can only be partially transferred between them. In that case, transfer learning aims to acquire the optimal policy

π^{*}

in the target domain using the internal information set

D_{s}

in the source domain

M_{s}

and the internal information set

D_{t}

in the target domain

M_{t}

. The optimization Equation for

π^{*}

is

π^{*} = arg max E_{s ∽ s_{0}^{t}, a ∽ π} [Q_{M}^{π} (s, a)]

(27)

In the formula, the policy

π = ϕ (D_{s} ∽ M_{s}, D_{t} ∽ M_{t}) : S^{t} \to A^{t}

is the mapping of states to actions in the target domain

M_{t}

. The policy is learned from

D_{s}

and

D_{t}

.

In different combat scenarios, when the aircraft is in the same situation, the same action decision should have the same value. That is, the state–action value function of each action in the same state is the same. Based on this, the critic networks of different policy networks should have similar distributions. Therefore, the objective of transfer learning is to transfer the knowledge of the critic network in the source domain to the critic network in the target domain, thereby increasing the critic network’s training speed in the target domain, as shown in Figure 5.

Therefore, we integrate the output of the source domain’s critic network into the loss function, using it as the learning target for the target domain’s critic network. The loss function is expressed as

L (θ_{t}) = {L o s s}_{1} + β {L o s s}_{2}

(28)

{L o s s}_{1} = E [{(Q_{i}^{u} (o, a) - (r + Q_{i}^{u^{'}} (o^{'}, a)))}^{2}]

(29)

{L o s s}_{2} = E [{(Q_{t}^{u} (o, a) - Q_{s} (o^{'}, a))}^{2}]

(30)

where

Q_{t}^{u}

represents the target domain’s critic network, r is the action reward representing environmental feedback,

Q_{i}^{u^{'}}

represents the target domain’s target critic network,

Q_{s}

represents the source domain’s critic network, and

β

is the transfer rate. Equation (28) shows that, under the transfer learning framework, the loss function of the target domain’s critic network consists of two parts of loss.

{L o s s}_{1}

is a Bellman equation, as shown in Equation (29), and its role is to use the environment’s action reward to fit the critic network in the target domain.

{L o s s}_{2}

represents knowledge transfer, as shown in Equation (30), and its role is to use the knowledge of the source domain’s critic network to fit the target domain’s critic network. Using

{L o s s}_{1}

, the target domain’ critic network can obtain information from the environmental reward to learn the mapping relationship between states and actions in a specific battlefield environment. Using

{L o s s}_{2}

, the target domain’ critic network can learn the source domain’s knowledge to update its own distribution characteristics.

The efficacy of our transfer learning approach is predicated on the principle of task structure similarity between the source and target domains. We assess this similarity based on the invariant components of the Markov decision process across scenarios. While the number of agents changes, the fundamental mechanics including state and action space definitions, flight physics, and reward function principles remain consistent. This structural consistency ensures that the value estimations learned by the critic network in a simpler source domain provide a highly relevant and beneficial inductive bias for a more complex target domain.

Our method controls the degree of knowledge transfer through a two-part strategy. First, we manually select source–target pairs with high structural similarity, specifically transferring from an

N - v s - N

to an

N - v s - (N + 1)

configuration. This initial selection acts as a coarse-grained filter to maximize the potential for positive transfer. Second, for these high-similarity pairs, the hyperparameter

β

in Equation (28) is empirically set to finely balance the influence of the source domain’s distilled knowledge against new experiences from the target domain. A higher value for

β

is used when the source and target tasks are nearly identical, allowing for more aggressive knowledge transfer, while a more conservative value is chosen for pairs with greater divergence. This mechanism provides an informed starting point that prevents inefficient random exploration, while the Bellman loss component allows the policy to fine-tune and adapt to the specific dynamics of the new scenario.

4. Simulation Verification

4.1. Simulation Settings

This paper assumes that there is a formation of our red UAVs and a formation of the enemy blue UAVs. At the same time, each UAV makes decisions and takes action based on the decision-making strategy, and then interacts with the environment. Based on the comparison of UAVs forces on both sides, combat scenarios can be classified into balanced combat scenarios and asymmetric combat scenarios. In balanced combat scenarios, the number of UAVs from both sides is the same. However, in asymmetric combat scenarios, the number of blue UAVs is greater than that of the red UAVs, and thus, the red UAVs are in a disadvantaged position on the battlefield. From the perspective of the number of red UAVs, the battlefield scenarios can be divided into small, medium, and large scenarios. In the experiments, the red UAVs utilize the proposed algorithm for its maneuver and decision-making processes, whereas the blue UAVs operate using a fixed default strategy for level flight.

4.1.1. Parameter Settings

Initially, both the red and blue UAVs are configured with identical combat capabilities, including the same maximum speed, minimum speed, and attack range. In addition, they are initialized with the same health values and health decay rates, meaning that when a UAV enters the enemy’s attack range, it loses health at the same rate per unit time. Detailed parameter settings are provided in Table 1.

Table 2 provides a detailed overview of the comprehensive hyperparameters for the G-MADDPG model and its training process. The selection of model hyperparameters was guided by a two-part methodology. Foundational parameters were adopted from seminal works in multi-agent reinforcement learning, ensuring that our study aligns with established best practices. The remaining task-specific parameters were then empirically determined through a series of preliminary experiments designed to optimize for both stable convergence and peak performance within our custom scenarios. The training protocol is set for a maximum of 5000 episodes, with each episode containing up to 500 simulation steps. For network optimization, we utilize the Adam optimizer with a consistent learning rate of 0.01 for both actor and critic networks. To balance exploration and exploitation, we implement a Gaussian noise strategy for action selection; this noise is initialized with a standard deviation of 1.0 and is annealed with a decay rate of 0.9995 per episode.

4.1.2. Scenarios Settings

In order to evaluate the performance of the proposed G-MADDPG algorithm, we designed six UAV combat scenarios, including 2v2, 2v3, 4v4, 4v6, 6v6, and 6v9, following the experimental setup as illustrated in Figure 6. The six combat scenarios are systematically designed to evaluate the algorithm’s scalability and robustness across varying scales and force configurations, from balanced to asymmetric engagements. As a baseline adversary policy, the blue UAVs operate under a rule-based reactive pursuit strategy. Each blue UAV continuously identifies the nearest red UAV and flies directly towards its current position without predicting future motion. To avoid redundant engagements, if multiple blue UAVs target the same red UAV, only the closest maintains pursuit while others reassign their targets. The red UAVs utilize the proposed algorithm for its maneuver and decision-making processes. Complementing this, for both the red and blue UAVs, we employed an idealized sensor model, assuming perfect, omnidirectional detection for all agents within a 10,000-m radius, devoid of noise, error, or occlusion. Finally, consistent with the centralized training paradigm of G-MADDPG, a perfect communication network was assumed, facilitating the instantaneous and reliable information sharing among all agents for the centralized critic, without considering potential real-world impairments such as latency, bandwidth limitations, or packet loss.

Note that, when the UAV leaves the map range, the confrontation system determines that the aircraft has been shot down. In addition, a general death rule is that when, the UAV enters the enemy’s attack range, the health of the UAV will decrease over time, which is expressed as

h p_{t + 1} = h p_{t} - Δ h p \cdot Δ t

(31)

where

h p_{t + 1}

is the health of the UAV at time

t + 1

,

h p_{t}

is the health of the UAV at time t,

Δ h p

is the health loss of UAV per unit time, and

Δ t

represents the duration of one simulation time step.

4.2. Experimental Results

In order to evaluate the performance of the proposed G-MADDPG algorithm, we compared the winning rates of the G-MADDPG algorithm with the multi-agent deep deterministic policy gradient (MADDPG) algorithm [32,33,34,35] in different scenarios in the last 100 iterations.

To more precisely compare the performance of the two algorithms, this study adopts two evaluation metrics for air combat outcomes: the average return per UAV per 100 episodes and the win rate per 100 episodes. The average return per 100 episodes refers to the average amount of reward a single UAV obtains per episode over a span of 100 episodes. The win rate per 100 episodes indicates the proportion of victories achieved by the UAV formation over 100 engagements. The specific calculation formulas are provided in Equations (32) and (33).

R e w a r d = r e w a r d / 100 / n

(32)

W i n r a t e = w i n c o u n t s / 100

(33)

where

R e w a r d

denotes the average return per 100 episodes for a single UAV,

r e w a r d

represents the total accumulated return of the UAV formation over 100 episodes, n is the number of UAVs in the formation,

W i n r a t e

indicates the win rate of the UAV formation over 100 episodes, and

w i n c o u n t s

is the number of victories achieved by the formation within those 100 episodes.

The win rate per 100 episodes provides a macroscopic evaluation of the combat performance of the UAV formation, thereby reflecting the algorithm’s overall control effectiveness at the formation level. In contrast, the average return per UAV per 100 episodes offers a microscopic assessment of individual UAV performance, serving as an indicator of the algorithm’s control capability over each UAV within the formation.

To ensure a robust and statistically sound comparison, all experiments were repeated for 10 independent runs using different random seeds. For each performance metric (win rate and average reward), we report the mean and standard deviation (STD) across these 10 runs. Figure 7 shows a comparison of the average win rate and standard deviation of the win rate between MADDPG and G-MADDPG in these ten experiments, while Figure 8 shows a comparison of the average return and standard deviation of the return between MADDPG and G-MADDPG in these ten experiments. Figure 7 shows a comparison of the average win rate and STD of the win rate between MADDPG and G-MADDPG in these ten experiments, while Figure 8 shows a comparison of the average reward and STD of the reward between MADDPG and G-MADDPG in these ten experiments.

To determine whether the performance differences between MADDPG and G-MADDPG are statistically significant, we conducted an independent two-sample t-test on the experimental results of the two algorithms. Table 3 presents the experimental results with standard deviations for the two algorithms after 10 runs in six scenarios, along with their corresponding p-values. A p-value less than 0.05 is considered statistically significant.

In the 2v2 scenario, both algorithms demonstrated a similarly strong performance, with negligible differences observed. A p-value of 0.215 suggests that, under this relatively simple setting, there is insufficient statistical evidence to support a significant performance difference between the two methods. However, in the 2v3 scenario, where the first level of asymmetry was introduced, the performance of MADDPG exhibited a slight but measurable decline. The p-value decreased to 0.036, falling below the commonly accepted significance threshold of 0.05. This indicates that the performance advantage of G-MADDPG no longer attributable to random variation, but is instead statistically significant for the first time. As the environment became increasingly complex in the 4v4, 4v6, 6v6, and 6v9 scenarios, the performance gap between the two algorithms widened substantially, eventually exhibiting a pronounced divergence. In all of these complex configurations, especially in the 6v9 scenario, the p-values were consistently well below 0.001, providing strong statistical evidence that G-MADDPG outperforms MADDPG in handling high-dimensional and highly interactive environments.

The statistical results presented in Table 3 provide a robust quantitative comparison of the final converged performance of the G-MADDPG and MADDPG algorithms across 10 independent runs. However, to gain a deeper understanding of how each algorithm achieves these outcomes, it is instructive to analyze their learning dynamics throughout the training process. While the average win rate and average reward can summarize overall trends, they often smooth out and obscure the fine-grained, trial-to-trial learning behaviors, such as initial instability, exploration efficiency, and the precise point of convergence. Therefore, to qualitatively illustrate these crucial aspects, Figure 9 presents the learning curves from a single, representative experimental run for each of the six scenarios. These plots are intended to visually compare the typical learning trajectories, highlighting the key differences in convergence speed and stability between G-MADDPG and MADDPG as they learn and adapt within their respective environments.

The results in Figure 7 and Figure 9 show that, in the small-map scenarios, whether in the balanced 2v2 scenario or the asymmetric 2v3 scenario, both the G-MADDPG and the MADDPG can cause the aircraft formation to achieve a high win rate and can also cause a single aircraft to obtain high rewards. In the medium-map scenarios, when the aircraft formation is in the balanced 4v4 scenario, G-MADDPG can effectively control the aircraft formation to achieve a 99.1% win rate, while MADDPG achieves a 89.7% win rate. In the asymmetric 4v6 scenario, the G-MADDPG can control the aircraft to achieve a 97.8% win rate, while the MADDPG can only achieve a 78.5% win rate. In the large-map scenarios, when the aircraft formation is in the balanced 6v6 scenario, the G-MADDPG can achieve a 96.1% win rate, while the MADDPG can only achieve a 70.4% win rate. In the asymmetric 6v9 scenario, the G-MADDPG can achieve a 92.3% win rate, while the MADDPG algorithm can only achieve a 10.5% win rate. In contrast to the MADDPG algorithm, the proposed G-MADDPG method demonstrates superior control over UAV formations in large-scale scenarios, with an average accuracy of 97.4% in six scenarios, representing a 23.4% enhancement over the MADDPG algorithm.

Furthermore, in the 2v3 scenario, the proposed algorithm uses the prior policy network obtained in the 2v2 scenario to analyze the effectiveness of transfer learning. The results of the simulation are presented in Figure 10. In the absence of transfer learning, the convergence process is noticeably slower and exhibits greater instability, particularly during the initial iterations. This can be attributed to the model being trained from scratch, where the random initialization of parameters often leads to suboptimal starting points and increased oscillations in the early stages of training. As a result, more iterations are required to achieve convergence, and the trajectory towards stability is less smooth. In contrast, when transfer learning is employed, the algorithm benefits from pre-trained parameters derived from a related source domain. The incorporation of prior knowledge enables a better initialization, leading to faster convergence and improved learning stability. The convergence curve in this case demonstrates a significantly faster approach to the optimal solution, with reduced fluctuation and improved robustness throughout training.

As shown in Figure 9, in the 2v3 scenario, after the algorithm deploys transfer learning, the reward converges at approximately the 500th iteration, while if transfer learning is not deployed, the algorithm converges at approximately the 2000th iteration. And as shown in Table 4, in the medium scenarios, the integration of transfer learning makes the network convergence speed three times faster than the baseline algorithm. Therefore, the introduction of transfer learning can triple the convergence speed of the algorithm and effectively address the issue of slow convergence commonly encountered in reinforcement learning.

The experimental results offer a robust validation of the algorithm’s key operational characteristics. The high win rate in the most complex asymmetric scenarios, particularly the 92.3% win rate in the 6v9 confrontation, serves as a direct testament to the algorithm’s superior coordination effectiveness. This outcome demonstrates that the G-MADDPG successfully enables the synergistic decision making required to overcome a significant numerical disadvantage. Concurrently, the transfer learning experiments reveal a marked improvement in computational efficiency. The results show that introducing a pre-trained policy accelerates the convergence speed by nearly three-fold, proving the method’s ability to significantly reduce training overhead and address a key challenge in reinforcement learning.

5. Conclusions

In this paper, we introduce a GCN-based multi-agent reinforcement learning decision-making algorithm and a transfer learning-enhanced training method for intelligent UAV control, namely G-MADDPG. A Markov decision process is first constructed to model multi-UAV confrontations. The proposed method utilizes GCNs to aggregate local observations from multiple UAVs, enabling the effective extraction of battlefield information. To improve the convergence speed under varying scenarios, a transfer learning approach is introduced, where a pre-trained source policy is used to accelerate training in a target domain. The results in six UAV combat scenarios show that the proposed G-MADDPG is statistically 23.4% better than the baseline MADDPG. In the most challenging 6v9 asymmetric scenario, G-MADDPG maintained a robust 92.3% win rate, whereas the performance of MADDPG declined sharply, with a final win rate of 10.5%. In addition, the use of transfer learning improves convergence speed by nearly 20%, effectively reducing training time. Despite these promising results, we acknowledge the limitation that our model operates under the assumption of an ideal, lossless communication network among agents, which does not account for real-world challenges like signal delay, disruption, or agent attrition. Future research endeavors will focus on conducting a detailed quantitative analysis of emergent tactical behaviors and integrating a self-play mechanism, enabling the simultaneous adversarial training of both red and blue agents using the proposed algorithm, to further enhance its performance and applicability.

Author Contributions

Conceptualization, K.H., H.P., and D.A.; methodology, K.H., H.P., and D.A.; software, K.H. and C.H.; validation, K.H. and H.P.; formal analysis, K.H. and C.H.; investigation, H.P. and J.S.; resources, H.P., C.H., and J.S.; data curation, C.H. and S.L.; writing—original draft preparation, K.H. and H.P.; writing—review and editing, K.H. and H.P.; visualization, J.S. and S.L.; supervision, H.P. and S.L.; project administration, H.P. and D.A.; funding acquisition, H.P., C.H., and J.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Key Laboratory of Multi-domain Data Collaborative Processing and Control Foundation of China (Grant No. MDPC-20240303).

Data Availability Statement

Data available upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Elmeseiry, N.; Alshaer, N.; Ismail, T. A detailed survey and future directions of unmanned aerial vehicles (uavs) with potential applications. Aerospace 2021, 8, 363. [Google Scholar] [CrossRef]
Chen, Y.; Zhang, J.; Li, Z.; Zhang, H.; Chen, J.; Yang, W.; Yu, T.; Liu, W.; Li, Y. Manufacturing technology of lightweight fiber-reinforced composite structures in aerospace: Current situation and toward intellectualization. Aerospace 2023, 10, 206. [Google Scholar] [CrossRef]
Sar, A.; Choudhury, T.; Singh, R.K.; Kumar, A.; Mahdi, H.F.; Vishnoi, A. Disruptive Technologies in Cyber-Physical Systems in War. In Cyber-Physical Systems for Innovating and Transforming Society 5.0; Scrivener Publishing LLC: Beverly, MA, USA, 2025; pp. 233–250. [Google Scholar]
Navickiene, O.; Bekesiene, S.; Vasiliauskas, A.V. Optimizing Unmanned Combat Air Systems Autonomy, Survivability, and Combat Effectiveness: A Fuzzy DEMATEL Assessment of Critical Technological and Strategic Drivers. In Proceedings of the 2025 International Conference on Military Technologies (ICMT), Brno, Czech Republic, 27–30 May 2025; pp. 1–8. [Google Scholar]
Insaurralde, C.C.; Blasch, E. Ontological Airspace-Situation Awareness for Decision System Support. Aerospace 2024, 11, 942. [Google Scholar] [CrossRef]
Sandino Mora, J.D. Autonomous Decision-Making for UAVs Operating Under Environmental and Object Detection Uncertainty. Ph.D. Thesis, Queensland University of Technology, Brisbane, Australia, 2022. [Google Scholar]
Ni, J.; Tang, G.; Mo, Z.; Cao, W.; Yang, S.X. An improved potential game theory based method for multi-UAV cooperative search. IEEE Access 2020, 8, 47787–47796. [Google Scholar] [CrossRef]
Wang, L.; Lu, D.; Zhang, Y.; Wang, X. A complex network theory-based modeling framework for unmanned aerial vehicle swarms. Sensors 2018, 18, 3434. [Google Scholar] [CrossRef] [PubMed]
Sarkar, N.I.; Gul, S. Artificial intelligence-based autonomous UAV networks: A survey. Drones 2023, 7, 322. [Google Scholar] [CrossRef]
Mkiramweni, M.E.; Yang, C.; Li, J.; Zhang, W. A survey of game theory in unmanned aerial vehicles communications. IEEE Commun. Surv. Tutor. 2019, 21, 3386–3416. [Google Scholar] [CrossRef]
Park, H.; Lee, B.Y.; Tahk, M.J.; Yoo, D.W. Differential game based air combat maneuver generation using scoring function matrix. Int. J. Aeronaut. Space Sci. 2016, 17, 204–213. [Google Scholar] [CrossRef]
Zheng, H.; Deng, Y.; Hu, Y. Fuzzy evidential influence diagram and its evaluation algorithm. Knowl.-Based Syst. 2017, 131, 28–45. [Google Scholar] [CrossRef]
Gao, Y.; Zhang, L.; Wang, C.; Zheng, X.; Wang, Q. An evolutionary game-theoretic approach to unmanned aerial vehicle network target assignment in three-dimensional scenarios. Mathematics 2023, 11, 4196. [Google Scholar] [CrossRef]
Virtanen, K.; Raivio, T.; Hamalainen, R.P. Modeling pilot’s sequential maneuvering decisions by a multistage influence diagram. J. Guid. Control. Dyn. 2004, 27, 665–677. [Google Scholar] [CrossRef]
Jiang, S.; Xie, W.; Zhang, L.; Zhang, G.; Liang, X.; Zheng, Y. Current status and development trends of research on autonomous decision-making methods for unmanned swarms. In Proceedings of the International Conference on Optics, Electronics, and Communication Engineering (OECE 2024), Wuhan, China, 26–28 July 2024; SPIE: Bellingham, WA, USA, 2024; Volume 13395, pp. 132–138. [Google Scholar]
Canese, L.; Cardarilli, G.C.; Di Nunzio, L.; Fazzolari, R.; Giardino, D.; Re, M.; Spanò, S. Multi-agent reinforcement learning: A review of challenges and applications. Appl. Sci. 2021, 11, 4948. [Google Scholar] [CrossRef]
Zhang, K.; Yang, Z.; Başar, T. Multi-agent reinforcement learning: A selective overview of theories and algorithms. In Handbook of Reinforcement Learning and Control; Springer: Cham, Switzerland, 2021; pp. 321–384. [Google Scholar]
Hernandez-Leal, P.; Kartal, B.; Taylor, M.E. Is multiagent deep reinforcement learning the answer or the question? A brief survey. Learning 2018, 21, 22. [Google Scholar]
Do, Q.T.; Hua, T.D.; Tran, A.T.; Won, D.; Woraphonbenjakul, G.; Noh, W.; Cho, S. Multi-UAV aided energy-aware transmissions in mmWave communication network: Action-branching QMIX network. J. Netw. Comput. Appl. 2024, 230, 103948. [Google Scholar] [CrossRef]
Pan, Y.; Wang, X.; Xu, Z.; Cheng, N.; Xu, W.; Zhang, J.J. GNN-Empowered Effective Partial Observation MARL Method for AoI Management in Multi-UAV Network. IEEE Internet Things J. 2024, 11, 34541–34553. [Google Scholar] [CrossRef]
Han, J.; Yan, Y.; Zhang, B. Towards Efficient Multi-UAV Air Combat: An Intention Inference and Sparse Transmission Based Multi-Agent Reinforcement Learning Algorithm. IEEE Trans. Artif. Intell. 2025. [Google Scholar] [CrossRef]
Li, Y.; Dong, W.; Zhang, P.; Zhai, H.; Li, G. Hierarchical Reinforcement Learning with Automatic Curriculum Generation for Unmanned Combat Aerial Vehicle Tactical Decision-Making in Autonomous Air Combat. Drones 2025, 9, 384. [Google Scholar] [CrossRef]
Yoon, N.; Lee, D.; Kim, K.; Yoo, T.; Joo, H.; Kim, H. STEAM: Spatial Trajectory Enhanced Attention Mechanism for Abnormal UAV Trajectory Detection. Appl. Sci. 2024, 14, 248. [Google Scholar] [CrossRef]
Wang, X.; Wang, Y.; Su, X.; Wang, L.; Lu, C.; Peng, H.; Liu, J. Deep reinforcement learning-based air combat maneuver decision-making: Literature review, implementation tutorial and future direction. Artif. Intell. Rev. 2024, 57, 1. [Google Scholar] [CrossRef]
Skarka, W.; Ashfaq, R. Hybrid Machine Learning and Reinforcement Learning Framework for Adaptive UAV Obstacle Avoidance. Aerospace 2024, 11, 870. [Google Scholar] [CrossRef]
Zhang, Z.; Huang, Y.; Huang, W.; Pan, W.; Liao, Y.; Zhou, S. An auto-upgradable end-to-end pre-authenticated secure communication protocol for UAV-aided perception intelligent system. IEEE Internet Things J. 2024, 11, 30187–30203. [Google Scholar] [CrossRef]
Guo, Y.; Xue, W.; Ma, W.; Ju, Z.; Li, Y. Research on Methods for Presenting Battlefield Situation Hotspots. In Proceedings of the 2024 11th International Conference on Dependable Systems and Their Applications (DSA), Taicang, China, 2–3 November 2024; pp. 313–319. [Google Scholar]
Rashid, T.; Samvelyan, M.; De Witt, C.S.; Farquhar, G.; Foerster, J.; Whiteson, S. Monotonic value function factorisation for deep multi-agent reinforcement learning. J. Mach. Learn. Res. 2020, 21, 1–51. [Google Scholar]
Wang, Q.; Hao, Y.; Cao, J. ADRL: An attention-based deep reinforcement learning framework for knowledge graph reasoning. Knowl.-Based Syst. 2020, 197, 105910. [Google Scholar] [CrossRef]
Akinwale, O.; Mojisola, D.; Adediran, P. Consensus issues in multi-agent-based distributed control with communication link impairments. Niger. J. Technol. Dev. 2024, 21, 85–93. [Google Scholar] [CrossRef]
Liu, Y.; Wang, W.; Hu, Y.; Hao, J.; Chen, X.; Gao, Y. Multi-agent game abstraction via graph attention neural network. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 7211–7218. [Google Scholar]
Li, B.; Wang, J.; Song, C.; Yang, Z.; Wan, K.; Zhang, Q. Multi-UAV roundup strategy method based on deep reinforcement learning CEL-MADDPG algorithm. Expert Syst. Appl. 2024, 245, 123018. [Google Scholar] [CrossRef]
Zhang, Y.; Wu, Z.; Ma, Y.; Sun, R.; Xu, Z. Research on autonomous formation of multi-UAV based on MADDPG algorithm. In Proceedings of the 2022 IEEE 17th International Conference on Control & Automation (ICCA), Naples, Italy, 27–30 June 2022; pp. 249–254. [Google Scholar]
Li, B.; Liang, S.; Gan, Z.; Chen, D.; Gao, P. Research on multi-UAV task decision-making based on improved MADDPG algorithm and transfer learning. Int. J.-Bio-Inspired Comput. 2021, 18, 82–91. [Google Scholar] [CrossRef]
Zhang, Y.; Mou, Z.; Gao, F.; Jiang, J.; Ding, R.; Han, Z. UAV-enabled secure communications by multi-agent deep reinforcement learning. IEEE Trans. Veh. Technol. 2020, 69, 11599–11611. [Google Scholar] [CrossRef]

Figure 1. UAV maneuvering model.

Figure 2. Details of the designed state space.

Figure 3. The proposed G-MADDPG for UAV intelligent decision making.

Figure 4. The critic network based on the GCN.

Figure 5. Structure diagram of MARL based on transfer learning.

Figure 6. UAV game confrontation scenarios. (a) 2v2 scenario. (b) 2v3 scenario. (c) 4v4 scenario. (d) 4v6 scenario. (e) 6v6 scenario. (f) 6v9 scenario.

Figure 7. Average win rate comparison between MADDPG and G-MADDPG.

Figure 8. Average reward comparison between MADDPG and G-MADDPG.

Figure 9. (a) The 2v2 scenario. (b) The 2v3 scenario. (c) The 4v4 scenario. (d) The 4v6 scenario. (e) The 6v6 scenario. (f) The 6v9 scenario.

Figure 10. Comparison of the effects of transfer learning and non-transfer learning (2v3 scenario).

Table 1. UAV parameters.

UAV Parameters	Value
Maximum speed	250 m/s
Minimum speed	10 m/s
Maximum detection range	10,000 m
Maximum attack range	3000 m
Maximum yaw angular acceleration	20°
Maximum health	1
Loss rate of health	0.2

Table 2. Hyperparameter specifications for training.

Parameter Category	Parameter	Value	Description
General RL Parameters	Optimizer	Adam	Optimizer for both actor and critic networks.
	Actor learning rate	0.01	Learning rate for the actor network optimizer.
	Critic learning rate	0.01	Learning rate for the critic network optimizer.
	Discount factor $γ$	0.99	Reward discount factor.
	Experience pool size	1,000,000	Maximum number of experiences stored.
	Experience pool sample size	1024	Number of experiences sampled for each update.
	Target network update $τ$	0.01	Soft update parameter for the target critic network.
Training schedule	Max training episodes	5000	Fixed termination condition for training.
	Max steps per episode	500	Maximum simulation steps within one episode.
	Warm-up episodes	100	Episodes using random actions to populate the buffer before training starts.
Exploration strategy	Noise type	Gaussian $N (0, σ^{2})$	Additive noise for action exploration.
	Initial noise std. $(σ)$	1.0	Initial standard deviation of the Gaussian noise.
	Noise decay rate	0.9995	Multiplicative decay factor applied to $σ$ after each episode.
	Minimum noise std. $(σ_{m i n})$	0.05	Floor value for the noise standard deviation.
Transfer learning	Transfer rate $(β)$	0.5	Weighting factor for the source critic loss term in Equation (28).

Table 3. Performance comparison and statistical analysis.

Scenario	Method	Win Rate (Mean ± STD)	Average Reward (Mean ± STD)	p-Value (vs. MADDPG)
2v2	G-MADDPG	$99.8 % \pm 0.5 %$	$725 \pm 38.1$	$p = 0.215$
2v2	MADDPG	$98.9 % \pm 1.1 %$	$702 \pm 45.2$	$p = 0.215$
2v3	G-MADDPG	$99.5 % \pm 0.7 %$	$788 \pm 42.9$	$p = 0.036$
2v3	MADDPG	$96.2 % \pm 2.8 %$	$715 \pm 60.4$	$p = 0.036$
4v4	G-MADDPG	$99.1 % \pm 0.9 %$	$482 \pm 45.3$	$p < 0.001$
4v4	MADDPG	$89.7 % \pm 5.4 %$	$361 \pm 88.6$	$p < 0.001$
4v6	G-MADDPG	$97.8 % \pm 1.8 %$	$461 \pm 59.2$	$p < 0.001$
4v6	MADDPG	$78.5 % \pm 9.2 %$	$125 \pm 101.7$	$p < 0.001$
6v6	G-MADDPG	$96.1 % \pm 2.6 %$	$528 \pm 65.1$	$p < 0.001$
6v6	MADDPG	$70.4 % \pm 11.5 %$	$103 \pm 112.9$	$p < 0.001$
6v9	G-MADDPG	$92.3 % \pm 4.1 %$	$551 \pm 80.3$	$p < 0.001$
6v9	MADDPG	$10.5 % \pm 5.8 %$	$- 715 \pm 155.4$	$p < 0.001$

Table 4. The influence of transfer learning on the convergence speed.

Scenario	Iterations to Converge (TL)	Iterations to Converge (No TL)
2v2	649	2034
2v3	673	2031
4v4	698	2489

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Hu, K.; Pan, H.; Han, C.; Sun, J.; An, D.; Li, S. Graph Neural Network-Enhanced Multi-Agent Reinforcement Learning for Intelligent UAV Confrontation. Aerospace 2025, 12, 687. https://doi.org/10.3390/aerospace12080687

AMA Style

Hu K, Pan H, Han C, Sun J, An D, Li S. Graph Neural Network-Enhanced Multi-Agent Reinforcement Learning for Intelligent UAV Confrontation. Aerospace. 2025; 12(8):687. https://doi.org/10.3390/aerospace12080687

Chicago/Turabian Style

Hu, Kunhao, Hao Pan, Chunlei Han, Jianjun Sun, Dou An, and Shuanglin Li. 2025. "Graph Neural Network-Enhanced Multi-Agent Reinforcement Learning for Intelligent UAV Confrontation" Aerospace 12, no. 8: 687. https://doi.org/10.3390/aerospace12080687

APA Style

Hu, K., Pan, H., Han, C., Sun, J., An, D., & Li, S. (2025). Graph Neural Network-Enhanced Multi-Agent Reinforcement Learning for Intelligent UAV Confrontation. Aerospace, 12(8), 687. https://doi.org/10.3390/aerospace12080687

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Graph Neural Network-Enhanced Multi-Agent Reinforcement Learning for Intelligent UAV Confrontation

Abstract

1. Introduction

2. Mathematical Statements

2.1. UAV Maneuver Model

2.2. Markov Decision Process for the UAV Game Confrontation Policy

2.2.1. State Space

2.2.2. Action Space

2.2.3. Reward Function

3. Methodology

3.1. Intelligent Decision-Making Algorithm Based on GCN

3.2. Multi-Agent State Extraction Based on GCN

3.3. Training Process of Intelligent Decision-Making Algorithm Based on GCN

3.4. MARL Based on Transfer Learning

4. Simulation Verification

4.1. Simulation Settings

4.1.1. Parameter Settings

4.1.2. Scenarios Settings

4.2. Experimental Results

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI