1. Introduction
Unmanned aerial vehicles (UAVs) have attracted significant attention because they effectively reduce mission costs and pilot casualties [1]. Owing to recent advances in electronics, computing, and communication technology, UAVs have become more autonomous and are now commonly used for missions such as reconnaissance, surveillance, and communications relay [2]. Nevertheless, current UAV autonomy remains inadequate for highly complex and dynamic tasks. Compared with a single UAV, multiple UAVs (multi-UAV) can execute richer tasks with greater efficiency and redundancy. Consequently, comprehensive research on coordinated decision-making methods for multi-UAV systems is necessary to enhance their autonomous cooperative abilities. Cooperative air combat, with its demanding and continuously evolving complexity, places particularly heavy demands on the decision-making capabilities of multi-UAV systems and therefore constitutes an ideal setting in which cooperative decision-making methods can be tested. Current research on automatic air combat decision-making can be divided into two categories: single-UAV air combat and multi-UAV cooperative air combat (MCAC).
Scholars have proposed numerous methods for autonomous air combat of a single UAV based on various theories, including game theory, expert systems, optimization algorithms, and machine learning. Game theory is used to address the optimal decision-making problem in air combat; in particular, Park designed automatic maneuver generation methods for within-visual-range air-to-air combat of UAVs based on differential game theory [3]. Expert systems have also been explored: for example, Wang proposed an evolutionary expert system tree approach that integrates genetic algorithms and expert systems to solve decision problems in two-dimensional air combat [4]. Optimization-based approaches formalize air combat decision problems as multi-objective optimization problems; Huang, for example, utilized Bayesian inference to develop an air situational model that adjusts the weights of maneuver decision-making factors according to the situational evaluation results, thereby creating a more reasonable objective function [5]. Machine learning has likewise shown its potential in single-UAV air combat. For instance, Fu employed a stacked sparse auto-encoder network and a long short-term memory (LSTM) network as a maneuver decision-maker to support aircraft maneuver decisions [6]. Reinforcement learning methods, such as Q-Learning [7], Policy Gradient [8], and Deep Q-Network [9], have been adopted to establish maneuver decision models; these methods either construct value functions that evaluate the value of different actions in each state and choose actions accordingly, or construct policy functions that map states to actions. Isci implemented air combat decisions based on the deep deterministic policy gradient (DDPG) algorithm and the proximal policy optimization algorithm in a purpose-built air combat environment, incorporating energy-state rewards in the reward function [10]. Li applied an improved DDPG algorithm to autonomous maneuver decision-making of UAVs [11]. Sun proposed a multi-agent hierarchical policy gradient algorithm for air combat decision-making [12]; the complex hybrid action problem was solved using adversarial self-play learning, and the resulting strategy performance was enhanced. Lockheed Martin proposed an approach that combines a hierarchical structure with maximum entropy reinforcement learning to integrate expert knowledge through reward shaping and to support modular policies [13].
MCAC is more challenging and involves higher model complexity than single-UAV air combat, as it requires consideration not only of the tactics of every UAV but also of their tactical cooperation. To address these challenges, researchers have proposed a variety of methods. For example, Gao introduced rough set theory into the tactical decision-making of cooperative air combat and proposed an attribute reduction algorithm to extract key data and tactical decision rules from air combat data [14]. Fu divided the MCAC decision process into four parts: situational assessment, attack arrangement, target assignment, and maneuver decision [15]. Zhang presented an evaluation method based on an improved group generalized intuitionistic fuzzy soft set; the method constructs a threat assessment system and indicators for air targets, introduces a generalized parameter matrix provided by multiple experts into the generalized intuitionistic fuzzy soft set, and uses subjective and objective weights to achieve more reasonable target threat assessment results [16]. Meng proposed an approach for identifying the tactical intent of multi-UAV coordinated air combat that uses dynamic Bayesian networks, a threat assessment model, and a radar model to extract key features of maneuver occupancy; these features were used to train support vector machines for attack intent prediction [17]. Ruan investigated the task assignment problem under time continuity constraints in cooperative air combat and developed a task assignment and integer programming model for cooperative air combat subject to time and continuity constraints [18]. Peng established an MCAC optimization model based on a destruction probability threshold and time-window constraints and then solved the target assignment problem with a hybrid multi-target discrete particle swarm optimization (DPSO) algorithm [19]. Lastly, Li created a multi-UAV cooperative occupation model by constructing an advantage function to evaluate the air combat situation; the model was solved with an improved DPSO, yielding a multi-UAV cooperative occupation scheme [20].
In recent years, multi-agent reinforcement learning (MARL) has achieved significant successes; for instance, AlphaStar [21] and OpenAI Five [22] have achieved remarkable results in games. Multi-UAV systems, as a typical class of multi-agent systems, therefore have enormous application potential when coupled with MARL. Li developed a MARL-based algorithm with a centralized critic, an actor using the gated recurrent unit model, and a dual experience replay mechanism for solving multi-UAV collaborative decision-making problems and evaluated it through simulations in a multi-UAV air combat environment [23]. Liu designed a missile attack zone model and combined it with the multi-agent proximal policy optimization (MAPPO) method to achieve beyond-visual-range air combat among multiple UAVs [24]. References [23,24] investigate the MCAC decision-making problem in discrete action spaces. However, the MCAC decision-making problem in continuous action spaces is substantially more intricate and better suited to the decision-making requirements of UAVs.
In the research above, game theory and expert systems are primarily employed in discrete environments; they lack learning capability and are unsuitable for changing scenarios. Optimization techniques may struggle to meet dynamic decision-making requirements because they must solve an optimization problem in real time. In contrast, machine learning methods, particularly reinforcement learning, offer considerable promise for multi-UAV cooperative decision-making due to their strong learning ability. However, overestimation bias and error accumulation can be problematic for many reinforcement learning algorithms with an actor-critic framework and may lead to suboptimal strategies during the update procedure [25]. In the function approximation setting, estimation bias is unavoidable because the estimator is inexact, and temporal-difference (TD) learning exacerbates the problem by updating the value estimate from the estimates of subsequent states, so that inaccurate estimates accumulate over successive updates [26]. Overestimation bias and error accumulation also arise in MARL algorithms that employ the actor-critic framework and TD learning, such as multi-actor-attention-critic (MAAC) [27], MAPPO [28], multi-agent soft actor-critic (MASAC) [29], and multi-agent deep deterministic policy gradient (MADDPG) [30].
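For a single critic trained with TD learning, the regression target is $y = r + \gamma\,Q_{\bar{\phi}}(s', a')$, so any overestimation in the target critic $Q_{\bar{\phi}}$ is propagated backward through every update. A common mitigation, used by clipped double-critic methods (and combined with an entropy term in soft actor-critic variants), is to regress both critics toward the minimum of two target critics:

$$y = r + \gamma \min_{i=1,2} Q_{\bar{\phi}_i}(s', a'), \qquad a' \sim \pi_{\theta}(\cdot \mid s').$$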
Therefore, this paper proposes the multi-agent double-soft actor-critic (MADSAC) algorithm to solve the multi-UAV cooperative decision-making problem in a three-dimensional continuous action space and to address the overestimation bias and error accumulation problems in MARL. The contributions of this paper are threefold:
- (1) This paper proposes the MADSAC algorithm, built on a decentralized partially observable Markov decision process (Dec-POMDP) under the centralized training with decentralized execution (CTDE) framework. To mitigate the overestimation problem in MARL and to enhance scalability, the algorithm uses double centralized critics based on an attention mechanism;
- (2) To reduce the impact of error accumulation on algorithm performance, this paper employs a delayed policy update mechanism, which updates the policy network only after the critic networks have become stable, thereby reducing the policy divergence caused by overestimation in the value networks;
- (3) To address the challenge of sparse rewards during decision-making, this paper designs a segmented reward function, which accelerates convergence and yields strong strategies. Additionally, a new observation space is designed based on the air combat geometry model.
The rest of this paper is organized as follows. Section 2 establishes the multi-UAV cooperative simulation environment. Section 3 reviews the necessary MARL background, proposes the MADSAC algorithm, and describes the red-blue confrontation scenario. In Section 4, MADSAC and the comparison algorithms are trained, the training results are compared and analyzed, three sets of confrontation experiments between MADSAC and the comparison algorithms are conducted, and the experimental results and the strategies obtained by MADSAC are analyzed. Finally, Section 5 presents the conclusions and outlines future work.
2. Multi-UAV Cooperative Simulation Environment
To facilitate algorithm learning, this section establishes a multi-UAV cooperative simulation environment that supports reinforcement learning interaction. In reality, communication among UAVs may suffer interference, causing delays in data transmission and reception, particularly in broadcast mode, which further intensifies communication stress; communication between UAVs therefore requires information compression techniques [31] to reduce the communication load. To simplify the communication model, this paper assumes smooth communication with negligible delay. The environment comprises a UAV model, a sensor observation model, an aerial situation model, an attack model, and a fixed strategy, all solved with a 0.1 s time step.
2.1. UAV Model
The motion of the UAV is described in the earth-fixed reference frame $O_g x_g y_g z_g$, a north-east-down (NED) coordinate system with origin $O_g$, in which $x_g$ is the horizontal axis pointing north, $y_g$ is the horizontal axis pointing east, and $z_g$ is the vertical axis pointing toward the ground. Similarly, the UAV body frame is denoted $O_b x_b y_b z_b$, with its origin at the UAV's center of mass and $x_b$, $y_b$, and $z_b$ denoting the body x-axis, y-axis, and z-axis, respectively. The relationship between the two coordinate systems is shown in Figure 1, where $\theta$ and $\psi$ denote the pitch and yaw angles of the UAV, respectively.
The motion of the UAV is modeled as that of a point mass located at the UAV's center of gravity and expressed in the frame $O_g x_g y_g z_g$, as shown in Equation (1), where $v_x$, $v_y$, and $v_z$ denote the UAV's velocity components along the corresponding axes and $V$ denotes the UAV's speed.
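In the NED frame defined above, with the pitch angle $\theta$ and yaw angle $\psi$ describing the velocity direction, these components take the standard point-mass form

$$\dot{x}_g = v_x = V\cos\theta\cos\psi, \qquad \dot{y}_g = v_y = V\cos\theta\sin\psi, \qquad \dot{z}_g = v_z = -V\sin\theta,$$

where the minus sign reflects the downward-pointing $z_g$ axis.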
Assuming that the engine thrust vector is aligned with the UAV's body x-axis and points in its positive direction, the velocity direction coincides with the nose direction, implying that the sideslip angle is zero; the effect of wind on the UAV is not considered. Equation (2) gives the UAV's center-of-mass dynamics [23], where $a_t$ and $a_n$ denote the tangential and normal accelerations of the UAV, generated by the tangential and normal components of the total force acting on it, respectively; $g$ is the gravitational acceleration, $a$ the acceleration of the UAV, $\dot{\theta}$ the pitch angle rate, and $\dot{\psi}$ the yaw angle rate. The commanded pitch angle, yaw angle, and speed of the UAV are denoted by $\theta_c$, $\psi_c$, and $V_c$, respectively, and the controller converts these commands into the corresponding control quantities.
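If the total normal acceleration $a_n$ is decomposed into a component $a_{n,\theta}$ in the vertical maneuver plane and a component $a_{n,\psi}$ in the horizontal plane (a decomposition introduced here purely for illustration), a point-mass dynamics consistent with the quantities above is

$$\dot{V} = a_t, \qquad \dot{\theta} = \frac{a_{n,\theta}}{V}, \qquad \dot{\psi} = \frac{a_{n,\psi}}{V\cos\theta},$$

since $a_t$ and $a_n$ are defined from the total force acting on the UAV, including gravity.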
2.2. Sensor Observation Model of UAV
Given the errors inherent in sensor data during UAV flight, an observation model is developed to simulate sensor data acquisition: Gaussian noise is added to the observed position, attitude, and velocity. We further assume that the data acquisition frequency of each sensor matches the operating frequency of the UAV system.
The position information of the UAV is obtained as shown in Equation (3), where $\tilde{p}$ is the observed position of the UAV, obtained by adding Gaussian noise to the UAV's real position $p$. The noise term $k_p n_p$ represents the Gaussian noise, where $k_p$ is a scaling factor and $n_p$ is a random number following a Gaussian distribution and restricted to a bounded range.
Similar to the position information, the attitude information of the UAV is obtained as shown in Equation (4): the observed attitude is obtained by adding Gaussian noise to the UAV's real attitude, where the noise is again the product of a scaling factor $k_a$ and a bounded Gaussian random number $n_a$.
Finally, Gaussian noise is added to the real speed $V$ to obtain the observed speed $\tilde{V}$ of the UAV, as shown in Equation (5), where $k_v$ is the scaling factor and the Gaussian noise is again restricted to a bounded range.
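As a concrete illustration of the observation process in Equations (3)–(5), a minimal sketch is given below; the scaling factors, the truncation bound, and the state layout are placeholders rather than the values used in this paper.

```python
import numpy as np

def bounded_gaussian(n, bound=1.0):
    """n samples from a standard Gaussian, truncated to [-bound, bound]."""
    return np.clip(np.random.randn(n), -bound, bound)

def observe(true_state, k_pos=5.0, k_att=0.01, k_vel=1.0):
    """Simulate one sensor reading by perturbing the true state with bounded
    Gaussian noise, in the spirit of Equations (3)-(5). `true_state` holds the
    real position (3,), attitude (pitch, yaw), and speed; the scaling factors
    k_pos, k_att, and k_vel are illustrative placeholders."""
    return {
        "pos": true_state["pos"] + k_pos * bounded_gaussian(3),            # Equation (3)
        "att": true_state["att"] + k_att * bounded_gaussian(2),            # Equation (4)
        "vel": true_state["vel"] + k_vel * float(bounded_gaussian(1)[0]),  # Equation (5)
    }
```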
Simulating sensor noise effectively enhances the robustness of the policy network in reinforcement learning: the noise, particularly in position, increases the variety of the sampled data during training and reduces the risk of overfitting.
2.3. Aerial Situation
Developing a UAV situational awareness process is critical in air combat because genuine air situation information must be collected in real time; it is the cornerstone of timely UAV decision-making. The situational information in the MCAC process includes information about both friendly and enemy UAVs. Information on friendly UAVs, covering position, speed, altitude, weaponry, and fuel levels, can be obtained through datalinks and radar; the courses of action taken by friendly UAVs can also be inferred from their present situation. Data on enemy UAVs, such as position and speed, can only be provided by radar and other sensing technologies. Accordingly, the situational awareness module directly accesses information about other UAVs from the simulation environment and treats it as the UAV's awareness information.
In air combat theory [32], the relative position of two UAVs flying at the same altitude is often described using the antenna train angle (ATA), aspect angle (AA), and horizontal crossing angle (HCA). When evaluating the complete positional relationship in three-dimensional space, the height line-of-sight angle (HA) is incorporated to describe the vertical position relationship. Together, these four angles comprehensively convey the positional relationship of two UAVs within three-dimensional space, as depicted in Figure 2, and can be calculated using Equation (6).
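Denoting the positions of the own and target UAVs by $p_o$ and $p_t$, their velocity vectors by $\mathbf{v}_o$ and $\mathbf{v}_t$, and the line-of-sight vector by $\mathbf{d} = p_t - p_o$, these angles can be computed in the following standard way (notation introduced here for illustration):

$$\mathrm{ATA} = \arccos\frac{\mathbf{v}_o \cdot \mathbf{d}}{\|\mathbf{v}_o\|\,\|\mathbf{d}\|}, \quad \mathrm{AA} = \arccos\frac{\mathbf{v}_t \cdot \mathbf{d}}{\|\mathbf{v}_t\|\,\|\mathbf{d}\|}, \quad \mathrm{HCA} = \arccos\frac{\mathbf{v}_o^{h} \cdot \mathbf{v}_t^{h}}{\|\mathbf{v}_o^{h}\|\,\|\mathbf{v}_t^{h}\|}, \quad \mathrm{HA} = \arcsin\frac{-d_z}{\|\mathbf{d}\|},$$

where $\mathbf{v}^{h}$ denotes the horizontal projection of a velocity vector and $d_z$ is the downward (NED) component of $\mathbf{d}$.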
The radar scanning range is generally a sector area ahead of the UAV's nose. During air combat, a UAV can launch missiles only after its fire-control radar has locked onto the target; it is therefore extremely dangerous to be within the range of enemy radar. In Figure 3, the red UAV is in a dominant position relative to blue UAV A, but it can easily lose that target because A is at the edge of its radar detection range and A's velocity is perpendicular to its axis. The red UAV holds an absolute advantage over UAV B and is in a balanced position relative to UAV C, because it has locked onto UAV C but is also locked on by UAV C. Therefore, the only way to lock enemy UAVs without being locked by them is to approach from the rear or the side, i.e., when the absolute values of both ATA and HA are less than 30 degrees and the absolute value of AA is less than 90 degrees [13].
2.4. Attack Model
Weapon firing and hitting must satisfy certain constraints. To simulate the weapon attack process of UAVs in air combat, this section establishes a weapon launch model and a weapon hit model. In reality, a UAV firing at a target must satisfy certain distance and angle constraints; the firing model designed according to these factors is shown in Equation (7), where $ATA_{max}$ and $HA_{max}$ are the attack angle constraints, denoting the maximum attack ATA and the maximum attack HA, respectively, and $d_{min}$ and $d_{max}$ are the attack distance constraints, denoting the minimum and maximum attack distances, respectively.
The weapon hit model is probabilistic: whether an attack hits is determined by three factors, the distance $d$, the ATA, and the HA, as shown in Equation (8), where the noise term follows a bounded Gaussian distribution, $k_{ATA}$ and $k_{HA}$ are the ATA and HA noise scaling factors, respectively, and $d_{hit}$ is the effective hitting distance.
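To make the firing and hit logic concrete, a minimal sketch in the spirit of Equations (7) and (8) is given below; all threshold values and noise scaling factors are illustrative placeholders, and the exact form of the hit condition is an assumption rather than the model used in the paper.

```python
import numpy as np

ATA_MAX = np.radians(30)       # maximum attack ATA (placeholder value)
HA_MAX = np.radians(30)        # maximum attack HA (placeholder value)
D_MIN, D_MAX = 200.0, 3000.0   # attack distance band in metres (placeholders)
D_HIT = 1000.0                 # effective hitting distance (placeholder)
K_ATA, K_HA = 0.2, 0.2         # ATA/HA noise scaling factors (placeholders)

def can_fire(d, ata, ha):
    """Firing condition in the spirit of Equation (7): the target must lie
    inside the attack cone and inside the allowed distance band."""
    return abs(ata) <= ATA_MAX and abs(ha) <= HA_MAX and D_MIN <= d <= D_MAX

def is_hit(d, ata, ha):
    """Probabilistic hit model in the spirit of Equation (8): the shot hits
    when the distance is within the effective hitting distance and the
    noise-perturbed ATA and HA stay inside the attack cone."""
    noise = np.clip(np.random.randn(), -1.0, 1.0)   # bounded Gaussian noise
    return (d <= D_HIT
            and abs(ata) + K_ATA * abs(noise) <= ATA_MAX
            and abs(ha) + K_HA * abs(noise) <= HA_MAX)
```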
2.5. Fixed Strategy
This section presents a straightforward and efficient fixed strategy that focuses on attacking the nearest UAV during formation operations, thereby creating a local advantage, particularly when the opposing UAVs are dispersed. In the initial phase, the blue UAV gathers the positional data of all red UAVs and, in each simulation step, selects the nearest red UAV as its pursuit target, switching to a new red UAV whenever the nearest one changes. The same pursuit behavior is maintained during the attack phase, and the attack model described in Section 2.4 is used in real time to determine whether to fire a missile and to judge whether the target is destroyed after firing. Figure 4 illustrates the pursuit strategy of the blue UAV, in which Condition 1 is "Is the red UAV alive?", Condition 2 is "Is the target UAV alive?", Condition 3 is "Is a missile fired?", and Condition 4 is "Is the target UAV destroyed?".
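A minimal sketch of this fixed strategy, following the decision flow of Figure 4 and reusing the can_fire condition sketched above, is given below; the UAV attributes and methods it calls (alive, pos, pursue, missile_fired, relative_geometry, fire) are illustrative.

```python
import numpy as np

def fixed_strategy_step(blue_uav, red_uavs):
    """One simulation step of the blue fixed strategy: pursue the nearest
    living red UAV and fire when the firing conditions are met."""
    alive = [u for u in red_uavs if u.alive]                 # Condition 1: any red UAV still alive?
    if not alive:
        return None                                          # nothing left to engage
    # Re-select the nearest red UAV at every step; the target may change.
    target = min(alive, key=lambda u: np.linalg.norm(u.pos - blue_uav.pos))
    blue_uav.pursue(target)                                  # fly toward the current target
    d, ata, ha = blue_uav.relative_geometry(target)
    if not blue_uav.missile_fired and can_fire(d, ata, ha):  # firing model of Section 2.4
        blue_uav.fire(target)                                # hit outcome resolved by the hit model
    return target
```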
4. Agent Training and Results Analysis
All algorithms, including MADSAC, MAAC, MAPPO, MADDPG, MASAC, and SAC, were trained in the red-blue confrontation simulation scenario established in the previous section; their characteristics are summarized in Table 3. Specifically, MAAC was modified to fit the continuous action space and to incorporate the shared-parameter mechanism, and it is still referred to as MAAC below. These algorithms served as the red side and confronted a blue side employing the fixed strategy during training. Subsequently, the performance of MADSAC was compared with the other algorithms, and the obtained strategies were discussed.
4.1. Agent Training
MADSAC employs a two-hidden-layer MLP with 256 units per layer for all actors. For the critics, two attention heads, each with two hidden layers of 256 units, are implemented based on the attention mechanism. The Adam optimizer is used with a learning rate of 0.0001, a replay buffer size of $10^6$, and a minibatch size of 1024 transitions per update. The target networks are updated with a smoothing coefficient $\tau$, and the discount factor $\gamma$ and temperature parameter $\alpha$ are set to 0.99 and 1/10, respectively. The experiments were carried out on a computer equipped with an NVIDIA RTX 2080s GPU and an AMD Ryzen 9 5950X CPU. Every two training cycles, the current strategy was evaluated once in each of twenty environments initialized with different seeds; the average score over the twenty simulations is recorded as the strategy's average score. The win criterion requires the red side to destroy all blue UAVs, and the ratio of the number of red wins in the twenty simulations to twenty is recorded as the strategy's win rate. As shown in Figure 8, the learning curve of each algorithm represents the average return over five training runs with different seeds; the shaded region represents the return fluctuation within a 95% confidence interval.
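For reference, the training configuration stated above can be collected in a single dictionary; the key names are illustrative, and the target-network smoothing coefficient is omitted here.

```python
# Training configuration for MADSAC as described in the text (names are illustrative).
MADSAC_CONFIG = {
    "actor_hidden_layers": [256, 256],    # two-hidden-layer MLP per actor
    "critic_hidden_layers": [256, 256],   # per attention head
    "attention_heads": 2,
    "optimizer": "Adam",
    "learning_rate": 1e-4,
    "replay_buffer_size": int(1e6),
    "batch_size": 1024,                   # transitions per update
    "discount_gamma": 0.99,
    "temperature_alpha": 0.1,             # 1/10
    "eval_every_cycles": 2,               # test once every two training cycles
    "eval_episodes": 20,                  # twenty environments with different seeds
}
```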
During the entire training process, each algorithm sampled over eight million steps. MADSAC showed a rapid increase in average return within the first million steps; after only 50,000 steps, its average return approached 0 and its win rate was close to 0.5, indicating that the red UAVs had reached combat effectiveness equivalent to the blue UAVs. At around two million steps, MADSAC converged, maintaining an average return of about 41 and a win rate close to 100%. The average return of MAAC fluctuated around −35 for the first half-million steps but increased rapidly between 0.5 and 1.5 million steps; at around 1.2 million steps, its average return approached 0 with a win rate close to 0.5, and at 3.5 million steps MAAC converged with an average return of roughly 20 and a win rate of roughly 95%. Although the baselines MAPPO and MADDPG showed slight improvements during training, their average return barely surpassed 0 over the entire process, and their win rate never exceeded 50%. SAC and MASAC barely learned effective strategies, with learning curves fluctuating around −35 and win rates close to 0. Overall, MADSAC showed significantly better convergence performance and high sample efficiency.
In the red-blue confrontation scenario, MAAC outperformed MADDPG, MAPPO, MASAC, and SAC in terms of convergence speed, average return, and win rate; combining maximum entropy theory with an attention-based critic significantly improved the convergence speed and sampling efficiency of MARL. However, MADSAC converges more quickly than MAAC, and its average return is roughly 20 points higher, with smaller fluctuations after convergence, indicating lower variance. As displayed in Figure 8 and Figure 9, MADSAC lost an average of 0.6 UAVs per battle, whereas MAAC's losses averaged 2.5 UAVs and fluctuated more. Although MADSAC's win rate was only about 5% higher than MAAC's, it achieved its victories at a much lower cost: MAAC lost roughly five times as many UAVs per win as MADSAC. Clearly, the double-soft critic and delayed policy update mechanisms effectively improved the algorithm's performance.
4.2. Strategy Testing
Three experiments were conducted to demonstrate the performance of the strategies produced by MADSAC. In each experiment, the red side was controlled by MADSAC, while the blue side was controlled by MADDPG, MAPPO, and MAAC, respectively. Since MASAC and SAC did not develop effective strategies, they were not included in the comparison. The results of the experiments are presented in Figure 10, and the experimental video is available at https://bhpan.buaa.edu.cn:443/link/0B83BE79F08C104F9D935B0F251F9517 (accessed on 26 April 2023).

In Section 4.1, each algorithm was trained five times. Strategies that achieved a win rate above 70% during training were saved in the strategy pool, along with the most recent strategies. As neither MADDPG nor MAPPO achieved a win rate above 70%, the latest strategies from their five training runs were used as the blue-side test strategies. From the MADSAC strategies with a 100% win rate, thirty were randomly selected as red-side strategies. Each red strategy was then pitted against every blue-side strategy 100 times to calculate its win rate, and the win rates of all red strategies were averaged to obtain the red side's win rate. MADSAC achieved a 100% win rate against MADDPG and a 94% win rate against MAPPO.
During testing, it was found that the MADDPG strategy did not learn to fight within the designated area, so most of its UAVs were judged dead after leaving the engagement area, as shown in Figure 11. This is mainly attributed to the algorithm's insufficient exploration, which results in convergence to a local optimum. Because the MADDPG strategy is trained against a fixed strategy that flies directly to the nearest target and attacks, it fails to find an effective attack strategy and obtains no positive return. In the reward function, the penalty for flying directly out of the engagement area is smaller than the penalty for being shot down by the blue UAVs; since the UAVs always seek higher rewards, they are more inclined to fly straight out of the engagement area. MAPPO, on the other hand, showed signs of overfitting, with all five strategies hovering in place, as shown in Figure 12. Although this hovering strategy can withstand the fixed strategy to some extent, it performs poorly against MADSAC; it is worth noting, however, that some of MADSAC's strategies are still unable to defeat it.
For MADSAC and MAAC, thirty strategies were randomly selected from each algorithm's pool of strategies with a 100% win rate. Each MADSAC strategy was played 100 times against all MAAC strategies, giving MADSAC a win rate of 87%. Figure 13 shows one replay of MADSAC against MAAC: MAAC demonstrates an effective offensive strategy and is able to inflict losses on the red UAVs in most cases, and occasionally to destroy them completely. Overall, however, MADSAC still holds a significant advantage over MAAC.
Furthermore, the win rates of all algorithms in the adversarial tests are presented in Table 4, in which the first column lists the blue strategy, the first row lists the red strategy, and the entries give the red side's win rate.

It is important to note that the success criterion for the scenario is the complete elimination of the blue UAVs, and a draw is counted as a failure for the red side. Hence, in Table 4, when an algorithm is pitted against itself, the win rate may be less than 50%; likewise, within the same group of tests, the sum of the red and blue win rates may be less than one. The win-rate data for MAAC versus MAPPO and for MAPPO against itself are particularly interesting: the former shows that the strategies obtained by MAAC perform poorly against MAPPO's strategies, while the latter arises because the hovering strategy obtained by MAPPO does not actively attack, making it difficult to achieve victory under the existing win criterion.
Analysis of the experimental results in Section 4.1 and this section leads to the following conclusions: (1) MADSAC achieves faster convergence, higher average return, higher win rates, and smaller variance than the compared algorithms; (2) the improvements made in MADSAC against overestimation and error accumulation are effective; (3) maximum entropy theory, double centralized critics based on attention mechanisms, and delayed policy updates can effectively improve MARL's performance on complex problems.
4.3. Strategy Analysis
Tests of the strategies obtained from MADSAC training show that the algorithm has learned several effective strategies, including the pincer movement, outflanking, and High-Yo-Yo strategies.
Pincer movement strategy: the red UAVs split into two formations and attack the blue UAVs in a pincer movement. As shown in Figure 14, the red UAVs quickly split into two formations and approach the blue UAVs from both sides, so that the blue UAVs lose their frontal advantage and are threatened from both sides simultaneously. In this case, the red UAVs can eliminate almost all the blue UAVs without loss;
Outflanking strategy: some of the red UAVs attack the blue UAVs head-on while the others flank them. As shown in Figure 15, one formation of red UAVs engages the blue UAVs from the front to draw their fire, while the other formation accelerates around the flank to the rear of the blue UAVs to complete the encirclement, so that the blue UAVs are attacked from front and rear and eliminated;
High-Yo-Yo strategy: the red UAVs reduce their closure rate by suddenly climbing, then make a quick turn to gain a favorable position. As shown in Figure 16, the red UAVs induce the blue UAVs to approach, and then two of the red UAVs climb rapidly, complete a turning maneuver, and circle inside the blue UAVs' turn. In this way, the red UAVs occupy a favorable position and can eliminate all the blue UAVs.
Figure 14. Pincer movement strategy: panels (a–c) show the red UAVs splitting into two formations and attacking the blue UAVs in a pincer movement.
Figure 15. Outflanking strategy: panels (a–c) show some of the red UAVs attacking the blue UAVs head-on while the others flank them.
Figure 16. High-Yo-Yo strategy: panels (a–c) show the red UAVs reducing their closure rate by suddenly climbing, then making a quick turn to gain a favorable position.
Furthermore, when the high-win-rate strategy model trained by MADSAC is assigned to both the red and blue UAVs, the two sides engage in an intense dogfight, as shown in Figure 17. According to the current situation, each side adjusts its turn performance by trading altitude and speed, continually maneuvering in an attempt to occupy a favorable position; as a result, the two sides trace spiral combat trajectories during the engagement. This shows that the MADSAC algorithm has learned to maneuver flexibly and to seize favorable attacking positions based on the rapidly changing battlefield situation.
5. Conclusions
This paper addresses the cooperative air combat decision-making of multi-UAVs. First, a simulation environment for multi-UAV cooperative decision-making is established to support the design and training of the algorithm. Then, the MADSAC algorithm is proposed, featuring multiple actors that share network parameters and double centralized critics based on the attention mechanism; the algorithm uses maximum entropy theory, target networks, and delayed policy updates to effectively mitigate overestimation bias and error accumulation in MARL while improving stability and performance. Finally, the algorithm is evaluated by training MADSAC and the comparison algorithms in an MCAC simulation scenario of red-blue confrontation. The experimental results show that both MADSAC and MAAC achieve satisfactory performance and outperform MASAC, SAC, MAPPO, and MADDPG. In particular, MADSAC converges faster than the other five algorithms and, compared with MAAC, exhibits better stability and convergence while achieving more victories with fewer UAVs lost.
However, this paper has several limitations. First, it only addresses the cooperative air combat decision-making of homogeneous UAVs; future work should explore MCAC decision-making methods for heterogeneous UAVs. Second, the MADSAC algorithm requires UAVs to communicate in real time, which may impose significant communication pressure. Future research will therefore focus on information compression mechanisms that relax the communication requirements among UAVs and on cooperative decision-making methods for situations in which communication is limited.