A Multi-UCAV Cooperative Decision-Making Method Based on an MAPPO Algorithm for Beyond-Visual-Range Air Combat

: To solve the problems of autonomous decision making and the cooperative operation of multiple unmanned combat aerial vehicles (UCAVs) in beyond-visual-range air combat, this paper proposes an air combat decision-making method that is based on a multi-agent proximal policy optimization (MAPPO) algorithm. Firstly, the model of the unmanned combat aircraft is established on the simulation platform, and the corresponding maneuver library is designed. In order to simulate the real beyond-visual-range air combat, the missile attack area model is established, and the probability of damage occurring is given according to both the enemy and us. Secondly, to overcome the sparse return problem of traditional reinforcement learning, according to the angle, speed, altitude, distance of the unmanned combat aircraft, and the damage of the missile attack area, this paper designs a comprehensive reward function. Finally, the idea of centralized training and distributed implementation is adopted to improve the decision-making ability of the unmanned combat aircraft and improve the training efﬁciency of the algorithm. The simulation results show that this algorithm can carry out a multi-aircraft air combat confrontation drill, form new tactical decisions in the drill process, and provide new ideas for multi-UCAV air combat.


Introduction
In actual air combat, pilots often have to make a series of tactical decisions according to the established combat tasks, air combat tactics, aircraft performance, and the current situation between themselves and the enemy.In this process, a series of decisions that are shown to be made by air combat pilots will become certain air combat tactics, and the more abundant that number of air combat tactics that they have, then the greater the possibility of wiping out the enemy is.With the development of air combat weapons, modern air combat has formed a pattern of beyond-visual-range air combat, which is supplemented by short-range air combat [1,2].At present, air combat is mainly based on multiple unmanned combat aerial vehicles (UCAVs) performing cooperative operations, wherein, UCAVs cooperate to complete combat tasks, including performing cooperative maneuvers, coordinated strikes, and providing fire cover [3].Compared with single UCAV decision making, multi-UCAV air combat decision making involves more air combat entities, a more complex air combat environment, and more space for air combat decision making [4,5].
At present, the research on air combat decision making mostly focuses on differential games [6][7][8], the expert system method [9,10], the Bayesian network method [11,12], the fuzzy reasoning method [13][14][15], and approximate dynamic programming [16].Although the application of these theories can solve many air combat decision-related problems, with the continuous demand for the improvement of air combat decision making and the deepening of the research, the relevant methods of game theory have also exposed many defects: (1) The first of these is the complexity of modeling the real air combat problem.The huge amount of information in a real air combat environment and the rapidly changing state of it brings problems to those that are conducting accurate modeling.(2) The second of these is the dimension explosion problem that is caused by the growth of the number of game individuals and the decision-making space.The decisionmaking problems that occur due to there being a large number of game participants bring huge decision-making space dimensions which will directly affect the efficiency and accuracy of the solution.(3) Finally, it is difficult to solve the problem with an optimal strategy.In the face of complex and dynamic air combat decision-making problems, it is impossible to obtain the analytical solution of the Nash equilibrium.
In contrast, deep reinforcement learning has a unique information perception ability and a strong learning ability [17][18][19], which can be employed to effectively solve highdimensional, sequential decision-making problems.Therefore, the application of deep reinforcement learning generates new ideas that can be applied to complex air combat decision-making problems.
Intelligent air combat, which combines deep reinforcement learning and air combat decision making, has been explored and studied by many scholars [20][21][22][23][24][25][26].The reinforcement learning algorithm controls the UCAV in order for it to perform actions and constantly interact with the environment to obtain corresponding returns which will be fed back to the UCAV to guide it to take the next action.At the same time, it constantly adjusts its own strategies according to the feedback of the environment, and finally generates a set of the most worrying strategies.For UCAV autonomous maneuver decision making, Yong-Feng Li proposed a UCAV autonomous maneuver decision-making model for short-range air combat that is based on an MS-DDQN algorithm [20].To improve the dogfight ability of the unmanned combat aerial vehicles and avoid the deficiencies of traditional methods, such as poor flexibility and a weak decision-making ability, a maneuver method that uses deep learning was proposed by [21].To improve the efficiency of the reinforcement learning algorithm for the exploration of strategy space, [22] Zhang proposed a heuristic Q-Network method that integrates expert experience and uses expert experience as a heuristic signal to guide the search process.Q. Yang proposed the use of the optimization algorithm to generate the air combat maneuver action value, and the addition of the optimization action as the initial sample of the DDPG replay buffer to filter out a large number of invalid action values and guarantee the correctness of the action value [23].
However, from the research results that have been published, current air combat decisions that have been made are based on deep reinforcement learning procedures and these are mainly aimed at their application for two-dimensional air combat environments, the air combat decisions of single aircraft confrontations, or the decision simulations of some typical air combat tactics.There are many problems in the cooperative over-thehorizon air combat of multiple unmanned aerial vehicles, such as there being many entity types, a large decision-making space, and a high-complexity and slow nature of the training and convergence speed.With the rise of multi-agent reinforcement learning, there are new solutions for multi-agent collaborative decision making [27,28].In order to solve these problems, the following methods are adopted in this paper: (1) To simulate the real air combat environment, this paper establishes a six-degreeof-freedom model of the unmanned combat aircraft, constructs the action library of the unmanned combat aircraft, and establishes the missile attack area and the corresponding probability of damage occurring according to both the enemy and us.(2) This paper designs a set of comprehensive reward functions according to the angle, height, speed, and distance of the unmanned combat aircraft, and the probability of damage occurring to the missile attack area to overcome the problem of sparse rewards in reinforcement learning and improve the training efficiency.(3) The multi-agent proximal policy optimization approach is designed to optimize the air combat decision-making algorithm, which adopts the architecture of the centralized training-distributed execution mechanism to improve the decision-making ability of the UCAVs.(4) A two-to-one battlefield environment is used to simulate and verify the algorithm.
The experimental results show that deep reinforcement learning can guide UCAVs to fight, find a way to quickly expand their superiority when they are superior, and find a suitable way to attack to win in an air combat scenario when it is at an absolute disadvantage.

Task Scene and Decision Structure Design
A UCAV that is engaged in autonomous air combat can obtain strategy instructions from an intelligent decision-making algorithm according to the current air combat situation that it is in, which can greatly improve the autonomous ability of the UCAV and the efficiency of the mission's execution.In the future air combat system, the air combat mission is set as shown in Figure 1.The blue is an enemy UCAV that is intruding into our airspace and scouting our base.When the initial position of the target UCAV is known, we send two UCAVs to intercept and pursue it.The battlefield environment is composed of three-dimensional space, and the two sides, which are composed of the enemy and us, will engage in a contest that is beyond the visual range in the vast space battlefield.In this paper, a three-dimensional battlefield map is constructed to perform as the task environment, and both of the red and blue UCAVs are the agents, and the action selection of the agents will be completed by the use of a discrete action library.The scene that is in this paper is a situation of an enemy UCAV intruding into our airspace; considering that the enemy UCAV can maneuver and evade and attack the ones on the red side, the task of the red ones is to intercept and drive the blue one away, and when the blue one is shot down, then the red ones complete the interception and pursuit task.If one of the red UCAVs is shot down, it is considered that the one on the blue side has successfully escaped the pursuit of the red ones and completed the reconnaissance mission of the red ones' base.
The experimental results show that deep reinforcement learnin to fight, find a way to quickly expand their superiority when th find a suitable way to attack to win in an air combat scenario whe disadvantage.

Task Scene and Decision Structure Design
A UCAV that is engaged in autonomous air combat can obtain s from an intelligent decision-making algorithm according to the curre tion that it is in, which can greatly improve the autonomous ability o efficiency of the mission's execution.In the future air combat system sion is set as shown in Figure 1.The blue is an enemy UCAV that i airspace and scouting our base.When the initial position of the targ we send two UCAVs to intercept and pursue it.The battlefield enviro of three-dimensional space, and the two sides, which are composed o will engage in a contest that is beyond the visual range in the vast spa paper, a three-dimensional battlefield map is constructed to perform ment, and both of the red and blue UCAVs are the agents, and the ac agents will be completed by the use of a discrete action library.The paper is a situation of an enemy UCAV intruding into our airspace; enemy UCAV can maneuver and evade and attack the ones on the red red ones is to intercept and drive the blue one away, and when the bl then the red ones complete the interception and pursuit task.If one shot down, it is considered that the one on the blue side has successfu suit of the red ones and completed the reconnaissance mission of the Based on the above task scenarios and considering that the pro zation algorithm does not rely on the value function, the strategy fu culate and has the advantages of having good stability and converg signs a multi-agent air combat decision-making framework that is ba proximal policy optimization algorithm (MAPPO).The framework of three parts: a deep reinforcement learning module, an environmen processing module.The specific structure is shown in Figure 2. The ing module is mainly divided into two parts, the central controller Based on the above task scenarios and considering that the proximal policy optimization algorithm does not rely on the value function, the strategy function is easy to calculate and has the advantages of having good stability and convergence.This paper designs a multi-agent air combat decision-making framework that is based on a multi-agent proximal policy optimization algorithm (MAPPO).The framework is mainly composed of three parts: a deep reinforcement learning module, an environment module, and a data processing module.The specific structure is shown in Figure 2. The reinforcement learning module is mainly divided into two parts, the central controller and the distributed execution.The central controller is primarily responsible for evaluating the actions that are performed by the UCAVs.The distributed execution mainly outputs the actions that the red UCAVs and the blue-side UCAV need to perform through the new actor network.The new actor network is updated according to the comparison of the old and new actor networks, and the advantage function is outputted by the critic network.The environment module is mainly composed of the UCAV model, the target allocation, the missile attack area, and the reward module.The target allocation module is responsible for the allocation of air combat targets, while the missile attack zone module will judge the attack result of the UCAV, and the reward module conducts a situational assessment according to the battlefield situation.The function of the data processing module is to process the information that is sent by the environment module, normalize and package the information, and send it to the experience base.Two red UCAVs and one blue UCAV choose their initial actions according to the initial parameters of their policy network, and ones that are on the red and blue sides will perform the corresponding action in the environment to get a new state and reward.At this time, the status, actions, rewards, and other data of both of the sides will be processed into a joint status, with joint actions and joint rewards, by the data processing module after it conducts the packaging, normalization, and formatting procedures.After the reinforcement learning module receives the information data, the critic networks of the red UCAVs and the blue UCAV sample the data, and send the sampled joint state, joint action, and joint reward to the central controller to guide the policy network to update the policy.Then, the red UCAVs and the blue UCAV take their private observations as the input of their updated policy networks, and the new actor networks output their corresponding action decisions.After receiving the action decision of the reinforcement learning module, the UCAVs execute corresponding actions to interact with the air combat environment.The new UCAV status information and the corresponding reward information that is obtained by executing the new actions are processed by the data processing module and sent to the experience base.When the neural network is trained, these data are sampled from the experience base and passed into the neural network training for their training to occur.
Aerospace 2022, 9, x FOR PEER REVIEW 4 of 20 execution.The central controller is primarily responsible for evaluating the actions that are performed by the UCAVs.The distributed execution mainly outputs the actions that the red UCAVs and the blue-side UCAV need to perform through the new actor network.
The new actor network is updated according to the comparison of the old and new actor networks, and the advantage function is outputted by the critic network.The environment module is mainly composed of the UCAV model, the target allocation, the missile attack area, and the reward module.The target allocation module is responsible for the allocation of air combat targets, while the missile attack zone module will judge the attack result of the UCAV, and the reward module conducts a situational assessment according to the battlefield situation.The function of the data processing module is to process the information that is sent by the environment module, normalize and package the information, and send it to the experience base.Two red UCAVs and one blue UCAV choose their initial actions according to the initial parameters of their policy network, and ones that are on the red and blue sides will perform the corresponding action in the environment to get a new state and reward.At this time, the status, actions, rewards, and other data of both of the sides will be processed into a joint status, with joint actions and joint rewards, by the data processing module after it conducts the packaging, normalization, and formatting procedures.After the reinforcement learning module receives the information data, the critic networks of the red UCAVs and the blue UCAV sample the data, and send the sampled joint state, joint action, and joint reward to the central controller to guide the policy network to update the policy.Then, the red UCAVs and the blue UCAV take their private observations as the input of their updated policy networks, and the new actor networks output their corresponding action decisions.After receiving the action decision of the reinforcement learning module, the UCAVs execute corresponding actions to interact with the air combat environment.The new UCAV status information and the corresponding reward information that is obtained by executing the new actions are processed by the data processing module and sent to the experience base.When the neural network is trained, these data are sampled from the experience base and passed into the neural network training for their training to occur.As can be seen from the decision structure that has been designed (above), as an agent of autonomous combat, a UCAV must have a decision-making ability.As a new strategy gradient reinforcement learning algorithm, the proximal policy optimization algorithm is especially suitable for autonomous decision making in air combat.According to the UCAV air combat environment, the construct of the UCAV and the missile attack zone model, the design of a comprehensive reward function for optimal decision making, and the construct of a multi-agent air combat decision-making algorithm complete the decisionmaking design of air combat.The following article will be discussed and designed from these aspects.

Motion Control Model of the UCAV
The three-dimensional kinematic equations of the UCAV in-ground inertial coordinate system are defined as follows: Each term in the equations represents the first-order differential equations of the UCAV's flight speed, yaw angle, pitch angle, and three-dimensional coordinate values, respectively, which together determine the calculation method of the UCAV's state update in the process of air combat.N x is the tangential overload of the UCAV, which indicates the ratio of the thrust to the gravity of the UCAV in the forward speed direction.By changing the size of this, the UCAV can be controlled to accelerate, decelerate, or fly at a uniform speed.N z is the normal overload of the UCAV, which means that the UCAV is subjected to the overload that is perpendicular to that of the fuselage and upward direction.The dive, pull-up, and smooth flight of the UCAV can be controlled by changing of the size and the vertical height of the UCAV.ϕ is the roll angle of the body, it represents the angle of the body around its speed direction, and the UCAV can be controlled to turn laterally by changing the size of this.
In this paper, the action coding is carried out by selecting the tangential overload, normal overload, and roll angle of the UCAV; this triad [N x , N z , ϕ] is used to represent the actions that are taken at each point in the simulation.The action library that can be taken by the UCAV is shown in Table 1.
Table 1.Instruction analysis of basic air combat maneuver.

Modeling of the Missile Attack Zone
As shown in Figure 3, the attack area of the UCAV is similar to a cone.To better analyze the attack area, the attack area is cross-sectioned, and the cross-section of the attack area is a sector.Combined with the performance of the missile and the maneuverability of the enemy UCAV, the probability of damage occurring to the target is analyzed.The sector attack is divided into five parts, and the attack probabilities of these five parts are also different.

Modeling of the Missile Attack Zone
As shown in Figure 3, the attack area of the UCAV is similar to a cone.To better analyze the attack area, the attack area is cross-sectioned, and the cross-section of the attack area is a sector.Combined with the performance of the missile and the maneuverability of the enemy UCAV, the probability of damage occurring to the target is analyzed.The sector attack is divided into five parts, and the attack probabilities of these five parts are also different.
where ATA is the deviation angle of the drone, (3) Among them, ( _ ) position aircraf aim is the area of the attack zone in which the tar- get is located.The details of a P and d P are as follows: The specific division of the attack zone is as follows: where ATA is the deviation angle of the drone, ϕ Mmax is the maximum off-axis launch angle of the missile, D Mmin is the minimum attack distance of the missile, D Mmax is the maximum range of missile attack, d is the distance between the drone and the target, D Mkmin is the minimum inescapable distance of a missile, D Mkmax is the maximum inescapable distance of a missile, and ϕ MKmax is the conical angle.The probability of the target being destroyed is also different in different attack areas; of course, the maneuvering action of the target UCAV will also affect the probability of the target being destroyed.

Design of a Comprehensive Reward Function
To solve the problem of sparse reward, this paper adopts the method of combining a continuous reward with a sparse reward, which includes the angle reward, height reward, speed reward, and distance reward, which guide the UCAV to carry out air combat and increase the effective reward.As a result, the learning speed is accelerated, and the sparse reward is used as an event reward, as a sign of the end of air combat, and as an evaluation of the UCAV's completion of its mission.The combination of a continuous reward and a sparse reward not only maintains the exploratory nature of the UCAV, but also increases the effective reward and improves the learning efficiency of the algorithm.
The angle reward, which combines the departure angle of the target and the deviation angle of the UCAV in the course of combat, is defined as: In this form, ϕ R max , ϕ Mmax , and ϕ Mkmax are the maximum search deflection angle of the radar, the maximum off-axis launch angle of the missile, and the maximum deviation angle of the target that cannot escape, respectively.There being a positive value for the angle reward means that the UCAV occupies the dominant angle, and therefore, the target is at a disadvantage; there being a negative value for the angle reward means that the target occupies the dominant angle, and therefore, the UCAV is at a disadvantage.
The height reward function is defined as follows: In this form, r_h represents a normalized height reward that is determined by the height difference.
When the UCAV has a certain speed advantage relative to the target, it can quickly avoid the threat of the target in the course of combat, and it can also enter its superior attack zone first because it has a faster flight speed that is uses to threaten the target.There is a relationship between speed reward and speed difference, which is defined as: In the process of air combat, the distance between the UCAV and the target directly affects the battle result.The UCAV needs to make the target fall into its attack zone, but not into the target's attack zone.When the attack angle of the UCAV to the target is appropriate, the UCAV only needs to shorten the distance between the two UCAVs.In a fixed-point environment, the return of this distance is designed as follows: In this form, D Rmax , D Mmax , D Mkmax , and D Mkmin are the maximum search range of the radar, the maximum attack range of the missile, the maximum distance of the missile's non-escape zone, and the minimum distance of missile's non-escape zone, respectively.
The return functions are shown in the Figure 4:   The comprehensive reward function is the key factor in whether the air combat strategy can converge with the optimal state, and the reward function must give the evaluation of the actions of the UCAV.The reward function R consists of a sparse reward function and a continuous reward function of the intermediate state.That is: The continuous return function R c consists of the angle reward, height reward, speed reward, and distance reward through a certain weight.
The sparse return function is related to iconic events such as drones being hit, the enemy being hit, and crashes that might occur.
Lock the enemy aircraft to launch missiles 5 Hit the enemy plane −2 Locked by enemy missiles −5 Be shot down by an enemy plane (12)

Algorithm Design of Air Combat Decisions That Are Based on MAPPO
According to the air combat mission and decision framework, and considering the characteristics of multi-UCAV cooperative combat, this paper designs a multi-agent proximal policy optimization algorithm for multi-UCAV air combat decision making that is based on a proximal policy optimization approach.The centralized training-distributed execution architecture is used to improve the flexibility of multi-agent, design action evaluation mechanism and corresponding network structure to improve the efficiency of the algorithm, and finally form a complete set of air combat decision-making algorithms.

Algorithm of Proximal Policy Optimization
Because air combat decision making considers the efficiency of real-time decision making and discrete instructions, there is almost no prior data or an algorithm model that can be used, however, considering the stability and convergence of the algorithm, this paper uses a near-end strategy to optimize the reinforcement learning algorithm to achieve satisfactory decision-making results.
The proximal policy optimization approach uses the method of comparing the preand post-strategy of the objective function, instead of modifying the learning rate like the general deep reinforcement learning algorithm does, which makes the strategy change more stable.The algorithm adopts an actor-critic (AC) framework, which includes two networks, namely, the actor network and the critic network.The update to the actor network defines the agent's objective function: where the superscript CPI represents the conservative strategy iteration.Êt is used to find the mean at [0, t].π θ represents the strategy that is adopted by the drones in the moment.π θold represents the strategy that is adopted by the drone in the previous moment.a t represents the actions that are taken by the UCAV in the moment of t. s t indicates the status of the UCAV in the moment of t.Ât represents the dominance function of the action a t in the state s t .To prevent the policy update from being too large when optimizing the agent objective function in the PPO algorithm, the agent objective function is clipped: In this form, ε is a super parameter.This method combines the gradient descent method with the trust region correction method and uses the adaptive KL penalty factor to judge whether the strategy update of the UCAV should be stopped or not to simplify the algorithm.
The critic network uses the TD-error method to train the neural network, and the update method is used to maximize the loss function.The specific formula is as follows: Among them, γ is the discount factor of the reward, r is the return of the state of the UCAV that is acting on the environment in the moment t , s t is the state in the moment t, V φ (s t ) is the evaluation of the state of the current moment, which is given by the critic network.

Centralized Training and Distribution Execution Strategy
In air combat, if the PPO algorithm is used to train multiple UCAVs, then the immediate reward for each UCAV is the same, so no matter what the UCAV does, the reward is the same, which leads to the phenomenon of laziness in some UCAVs.UCAVs that have made great contributions and drones that have made no contribution get the same immediate reward.The distribution of this immediate reward is unfair.It leads to the inefficiency of decision-making abilities and the algorithm training of UCAV individuals.To solve this problem, this paper forms a multi-agent, near-end policy optimization algorithm that is based on the PPO algorithm and the centralized training distributed execution architecture.
"Centralized training" refers to the training of the UCAVs by using the joint stateaction value function V φ (s r1 , s r2 , s b , a r1 , a r2 , a b ) of the two red UCAVs and the one blue UCAV in the process of neural network training.The input of the value network in the central controller is the joint state (s r1, s r2 , s b ) and joint action (a r1 , a r2 , a b ) of the UCAVs on the red and blue sides, thus, the joint state-action value function V φ (s r1 , s r2 , s b , a r1 , a r2 , a b ) is calculated, which not only takes into account the influence of the state and action of the UCAVs on the overall air combat environment, but it also takes into account the changes that are brought about by the choice of actions that are taken by the other UCAVs in different states, which solves the problem of laziness in some UCAVs in the decision making processes of multi-UCAVs cooperative air combat.This makes it possible for the ones on the red side to cooperate with the two drones to make them pursue the one on the blue side, and the one on the blue side can also confront the ones on the red side.
"Distributed execution" means that the UCAV only inputs its private observations s = (v, γ, φ, x, y, z) into the policy network to select its actions a = (a r1 , a r2 , a b ).When the strategy is being updated, the UCAV will get a new action selection strategy Π s, a θ Π , according to the joint dominance function R = (r r1 , r r2 , r b ) of the value network output.This method effectively solves the communication problem during the action selection of each UCAV, in which v, γ, and φ represent the speed, the track pitch angle, and the track roll angle of the UCAV, respectively.(x, y, z) represents the spatial three-axis position of the UCAV.a r1 , a r2 , and a b represent the action that is selected by the first red UCAV 1, the selected action of the second red UCAV 2, and the selected action of the blue UCAV, respectively.r r1 , r r2 , and r b indicate the reward that is obtained by the first red drone after acting, the reward that is obtained by the second red drone after acting, and the reward that is obtained by the blue drone after acting, respectively.
Compared with the traditional single-agent reinforcement learning framework (using local action-value function V φ (s i , a i ) training to input the local state s i and action a i of a single UCAV), the joint action-value function inputs the global state information s and the action information a = (a r1 , a r2 , a b ) of all of the entities, which is the real evaluation of the joint state strategy.The advantage of using this architecture to design an air combat decision algorithm is that all of the UCAVs share a set of network structures, and the cooperative relationship between the UCAVs can be taken into account when they are making decisions.Because the whole structure of the air combat and the generation of the return function of the UCAVs are related to the joint action, this solves the problem of the decision algorithm being difficult to converge when there are more decision subjects in the air combat.

Network Structure Design of MAPPO
Under the multi-agent reinforcement learning framework of centralized training and distributed execution, the centralized critic and distributed actor of the UCAV have different observation horizons, so the neural network structure that has shared parameters cannot be used.The calculation of the loss of the critic and the actor in the MAPPO approach should be carried out separately, and two independent networks should be used to realize the strategy and state estimations.The MAPPO approach uses a technique of centralized dominance estimation, which enables it to observe the observations and movements of all of the drones, taking full account of the state of all of the drones in the air combat environment to obtain a more accurate estimations.The TD-ERROR for the centralized dominance estimation is as follows: The error L VE t calculation method for the critic is as follows: where S is the batch size, and i indicates the number of the drone.The MAPPO approach can update the network and the target network to ensure that the parameters of the critic network do not diverge during the update.The input of the artificial neural network of the critic network is the state of the UCAV, including the three-axis position [x i , y i , z i ] and the speed v i of the state space of the UCAV, the three-axis position [x a , y a , z a ] and the speed v a of the detected target state space, and the three-axis position [x j , y j , z j ] and the speed v j of the state space of other the UCAVs which are obtained by information sharing, wherein, the total number of input dimensions are 12.The output of the network is the value of the current state.The number of nodes in the network is (1,64,64,128,128,256), and the optimizer that is used in the network's training is AdamOptimizer.
When updating the policy network, the MAPPO approach uses the tailored proxy objective function.Unlike PPO, MAPPO uses centralized dominance functions Âµ i X j , a j 1 , . . ., a j N to guide the policy updates.The policy update objective function of MAPPO is: In the process of training, the UCAV updates the strategy network using the decision sequence that is explored in the environment and updates the target strategy network π θ i,old regularly.
The input of the artificial neural network of the actor network is the state of the UCAV, including the three-axis position [x i , y i , z i ] and the speed v i of the state space of the UCAV, the three-axis position [x a , y a , z a ] and the speed v a of the detected target state space, and the three-axis position [x j , y j , z j ] and the speed v j of the state space of other the UCAVs which are obtained by information sharing, wherein, a total of 12 input dimensions are obtained.The output of the network is the selection probability of seven actions in the maneuver action database.The number of nodes in the network is (7,64,64,128,128,256), and the optimizer that is used in the network's training is AdamOptimizer.

Algorithm Implementation Flow
The MAPPO algorithm is used to train the decision-making tasks of the multi-UCAVs during air combat.The procedure flow is shown in Algorithm 1: Algorithm 1. Specific implementation steps of the MAPPO algorithm.
1. put UCAV model, aerial combat environment, and missile model 2. initialize hyper-parameters: Training episode E, Incentive discount rate, the Learning rate of actor net work, the Learning rate of critic network, batch_size 3.For each episode 1,2,3, . . ...E, do 4.For each UCAV, it is normalized according to current private observations s = (v, γ, φ, x, y, z) 5. Input the normalized s_nor to the policy network to selection action a i = π θ i (o i ), execute the action set (a r1 , a r2 , a b ) in the environment to get the state s − 6. Input the s i − of UCAV i and the s j_ of adversary UCAV j to the target selection module to obtain the primary target and secondary target 7. Input the s i − of UCAV i and the s j_ of adversary UCAV j to the missile module, calculate whether the enemy UCAV is in the attack area of its UCAV and the missile hit rate, and obtain the discrete return r g according to the judgment conditions such as missile hit rate, crash and stall.8. Obtain continuous rewards according to the main and secondary targets of the UCAV and the angle reward r a , distance reward r d , speed reward r v and high reward r h of the current state.Add discrete returns and continuous returns to obtain comprehensive returns r = r g + r c .9. When the data in the experience pool reaches the set min_ BATCH_ The normalized joint state, joint action, and joint reward of all agents are input into the value network.

For UCAV
11. calculate and update the advantage estimation of each UCAV Â1 , Â2 , . . ., Ân .12. Update the value network, and the objective function is: 13.The advantage estimation function Âi of each UCAV is input into the strategy network.Input the private state observation s i of each UCAV in the experience pool to the strategy network.14.Calculate the action selection strategy of the policy network.15.Select the old action strategy and update the new action strategy.16.The objective function of the policy network is:

Intelligent Air Combat Decision Simulation Experiment
The simulation environment of the algorithm was: the official version of Intel Xeon gold 6242R CPU was the processor, with the main frequency being 3.1 GHz.The running memory was 126 GB; the graphics card was the RTX3090 24GB.
As shown in Figure 5, when the distance difference in the XY of the UCAVs between the red ones and the blue one was still 50 km, the red UCAVs and blue UCAV were equal in height.At this time, the initial position of the red UCAV 1 was [−50,000 m, 0 m, 3000 m] and the speed was 100 m/s, and the initial position of the red UCAV 2 was [−50,000 m, 20,000 m, 3000 m] and the speed was 100 m/s.The initial position of the blue UCAV was [0 m, 10,000 m, 3000 m] and the speed was 100 m/s.

Training Process
The main super parameter settings of the algorithm are show

D
In the simulation, three identical UCAVs were used for train ing state of the algorithm and prevent the phenomenon of grad gradient explosion, this paper observed the loss of the actor neu of the critical neural network, as shown in Figure 6.

Training Process
The main super parameter settings of the algorithm are shown in Table 2.In the simulation, three identical UCAVs were used for training.To detect the training state of the algorithm and prevent the phenomenon of gradient disappearance and gradient explosion, this paper observed the loss of the actor neural network and the loss of the critical neural network, as shown in Figure 6.
As shown in Figure 6, it can be seen from the curves that the loss of the actor network (aloss) and the loss of the critic network (closs) fluctuated in the training process.With the progress of the training, the aloss and closs tended to be stable, and the artificial neural network gradually tended to be stable in the training process.
In the simulation, three identical UCAVs were used for training.To de ing state of the algorithm and prevent the phenomenon of gradient disapp gradient explosion, this paper observed the loss of the actor neural network of the critical neural network, as shown in Figure 6.

Simulation Verification
The combat situation after 1000 tests is shown in Table 3.This paper takes two cases for analysis: (1) Red 1 hits the blue UCAV.
The red and blue simulation track is shown in Figure 7.The red curve represents red UCAV 1, the red dotted line represents red UCAV 2, and the blue curve represents the blue UCAV.Red 1 and red 2 will be behind the blue UCAV.According to the situation analysis, red 1 and red 2 will be superior to the blue.According to the target distribution, red 1 is the leader and it will be responsible for the attack, and red 2 is the wingman and it will be responsible for reconnaissance and vigilance but it also will be able to launch attacks if it is necessary.According to the action sequence that is described above, red 1 first will choose to accelerate, shorten the distance between itself and the blue UCAV, and bring it into its missile attack zone, while the blue UCAV will carry out a series of actions such as acceleration, a continuous left turn, and climbing.Red 1 will also choose to accelerate and climb when it finds that the blue UCAV is maneuvering in an attempt to lock the blue one firmly to make it so that the blue UCAV is unable to escape the missile attack area of the red UCAVs.Finally, the missile of red UCAV 1 will successfully hit the blue UCAV.It can also be seen from the situation change chart in Figure 7b that although the blue one will make a series of maneuvers to gradually improve its situation when it is at a disadvantage, red 1 better will respond to the maneuvers that will be made by the blue, making it impossible for the blue UCAV to escape.(2) The blue UCAV hits the red 1.
The red and blue simulation track is shown in Figure 8.The initial situa (2) The blue UCAV hits the red 1.
The red and blue simulation track is shown in Figure 8.The initial situation of the red and blue sides will be the same as the first case.Red UCAV 1 and red UCAV 2 will cooperate to attack the blue UCAV.The blue will find itself in a disadvantageous situation and begin to maneuver to escape.Compared with the first situation, the blue one will choose to climb many times, but reduce the number of left turns that it will make, and the blue UCAV's situation will gradually become better.As red 1 blindly pursues it, the blue one will find the opportunity to attack red 1 and hit it in the process of it rising to avoid being attacked by the red UCAV.As shown in Figure 8b, the situation of the red UCAVs will be higher than that of the blue UCAV at first.With the maneuvering of the blue, the situation will turn from bad to excellent, and it will even complete the counter-kill against red 1.These simulation experiments show that deep reinforcement learning can guide the UCAVs to fight, find a way to quickly expand their advantages when they have advantages, and find an appropriate attack mode when they are at an absolute disadvantage to win the air battle.The combination of deep reinforcement learning and air combat decision making has good development prospects and guiding significance.

Comparison of Algorithms
The simulations of UCAV aerial combat were performed for training using the MAPPO and PPO algorithms.The algorithm environment settings were the same as in Section 5.2.Among them, the UCAVs adopted the MAPPO algorithm for training and to generate corresponding air combat maneuvers.The target adopted the PPO algorithm for training, and the action space was the same as that of the UCAVs.Figure 7a shows the combat trajectory of the UCAV. Figure 7b-d show the reward, speed, and altitude of the UCAVs and the target.
The blue UCAV adopted the MAPPO algorithm in Section 5.2, while the blue UCAV adopted the PPO algorithm in this section.Compared with Figures 7 and 9, the blue UCAV that used the MAPPO algorithm flew faster, climbed higher, and had a higher advantage of angle.The blue UCAV's winning rate was higher, as shown in Ta- These simulation experiments show that deep reinforcement learning can guide the UCAVs to fight, find a way to quickly expand their advantages when they have advantages, and find an appropriate attack mode when they are at an absolute disadvantage to win the air battle.The combination of deep reinforcement learning and air combat decision making has good development prospects and guiding significance.

Comparison of Algorithms
The simulations of UCAV aerial combat were performed for training using the MAPPO and PPO algorithms.The algorithm environment settings were the same as in Section 5.2.Among them, the UCAVs adopted the MAPPO algorithm for training and to generate corresponding air combat maneuvers.The target adopted the PPO algorithm for training, and the action space was the same as that of the UCAVs.Figure 7a shows the combat trajectory of the UCAV. Figure 7b-d show the reward, speed, and altitude of the UCAVs and the target.
The blue UCAV adopted the MAPPO algorithm in Section 5.2, while the blue UCAV adopted the PPO algorithm in this section.Compared with Figures 7 and 9, the blue UCAV that used the MAPPO algorithm flew faster, climbed higher, and had a higher advantage of angle.The blue UCAV's winning rate was higher, as shown in Table 4. Compared with the blue UCAV that used the MAPPO algorithm, the winning rate of the blue UCAV that used the PPO algorithm dropped from 29.8% to 14.3%.

Conclusions
With the background of multi-UCAV intelligent air combat, and considering the problems of many entity types, a large decision space, and a high complexity in the multi-UCAVs' air combat procedures, this paper designs an air combat decision algorithm that is based on a multi-agent proximal policy optimization approach, designs the UCAV and missile attack area model according to the combat scenario, constructs the deep reinforcement learning network and reward function, and forms a complete set of multi-UCAV intelligent air combat decision systems.
The simulation results show that the decision system that was designed and is reported in this paper can carry out multi-UCAV air combat confrontation drills, and new tactics can be formed in the process of the drill.The intelligent air combat decision algorithm that is based on the MAPPO approach that is proposed in this paper has the characteristics of a good input/output interface and modular rapid transplantation, a fast autonomous decision-making solution cycle, an independent of expert experience, and a decision-making mission profile that can meet the requirements of a multi-UCAV cooperative operation and a stable algorithm.

Conclusions
With the background of multi-UCAV intelligent air combat, and considering the problems of many entity types, a large decision space, and a high complexity in the multi-UCAVs' air combat procedures, this paper designs an air combat decision algorithm that is based on a multi-agent proximal policy optimization approach, designs the UCAV and missile attack area model according to the combat scenario, constructs the deep reinforcement learning network and reward function, and forms a complete set of multi-UCAV intelligent air combat decision systems.
The simulation results show that the decision system that was designed and is reported in this paper can carry out multi-UCAV air combat confrontation drills, and new tactics can be formed in the process of the drill.The intelligent air combat decision algorithm

Figure 2 .
Figure 2. Decision-making framework of two UCAVs and a single UCAV that are engaging in air combat.

Figure 2 .
Figure 2. Decision-making framework of two UCAVs and a single UCAV that are engaging in air combat.

Figure 3 .
Figure 3. Schematic diagram of the attack area.The specific division of the attack zone is as follows: max min min max max min max max max min max max max max max min

Figure 3 .
Figure 3. Schematic diagram of the attack area.

Figure 4 .
Figure 4.Return functions.The comprehensive reward function is the key factor in whether the air combat strategy can converge with the optimal state, and the reward function must give the evaluation of the actions of the UCAV.The reward function R consists of a sparse reward function and a continuous reward function of the intermediate state.That is:

Figure 9 .
Figure 9.Comparison of the MAPPO and PPO algorithms.

Figure 9 .
Figure 9.Comparison of the MAPPO and PPO algorithms.
d is the distance between the drone and the target, The probability of the target being destroyed is also different in different attack areas; of course, the maneuvering action of the target UCAV will also affect the probability of the target being destroyed.

Table 3 .
Statistics of training results.

Table 4 .
Statistics of results.