Air Combat Maneuver Decision Method Based on A3C Deep Reinforcement Learning

Abstract: To solve the maneuvering decision problem of unmanned combat aerial vehicles (UCAVs) in air combat, this paper proposes an autonomous maneuver decision method for a UCAV based on deep reinforcement learning. Firstly, the UCAV flight maneuver model and the maneuver libraries of both opposing sides are established. Then, considering that the state transition effects of the various actions differ when the pitch angles of the UCAVs differ, 10 state variables, including the pitch angle, are taken as the state space. Combined with the air combat situation threat assessment index model, a two-layer reward mechanism combining internal reward and sparse reward is designed as the evaluation basis for reinforcement learning. Next, a fully connected neural network model is built according to the Asynchronous Advantage Actor-Critic (A3C) algorithm. Through multi-threading, our UCAV interacts continuously with the environment to train the model, gradually learns the optimal air combat maneuver countermeasure strategy, and is guided in its action selection. The algorithm reduces the correlation between samples through multi-threaded asynchronous learning. Finally, the effectiveness and feasibility of the method are verified in three different air combat scenarios.


Introduction
With the development of information technology and the great progress of artificial intelligence, autonomous agent systems are becoming increasingly common, and the autonomous maneuvering decision-making of unmanned combat aerial vehicles (UCAVs) has become an important research topic. UCAVs are capable of autonomous flight to varying degrees: they can be controlled autonomously by a computer or remotely by a human operator [1]. Facing an increasingly complex air combat environment and battlefield situation, UCAVs will encounter increasingly complex tasks and challenging environments in various practical applications [2].
Scholars at home and abroad have made progress through extensive research on the air combat maneuvering decision problem. They have proposed many decision-making theories and methods, which can be roughly classified into three categories according to their solution approach: methods based on expert knowledge [3][4][5], methods based on game theory [6][7][8][9], and methods based on heuristic learning [10][11][12][13]. In [4], combining genetic methods and expert systems, a fast-response autonomous maneuver decision model is established based on the maneuver library. In [6], the idea of the differential game is applied to the air combat game confrontation problem to obtain the optimal air combat model. In [9], an autonomous decision-making method is proposed that combines matrix games and genetic algorithms: the improved matrix game algorithm determines the approximate range of the optimal UCAV strategy, and the genetic algorithm searches for the optimal strategy within that range. In [12], a neural network is trained on samples obtained from simulations, and the future situation is predicted from current information to select the optimal action; the algorithm can achieve victory in air combat with fewer actions. In [13], the genetic fuzzy tree method is applied to control the flight of a UCAV in an air combat mission. Among the above methods, the expert system method relies on prior knowledge and lacks flexible and effective adjustment mechanisms for complex and changeable battlefield environments. The matrix game method is not suitable for maneuvering decision-making in large action spaces. The deep neural network requires many training samples. The reinforcement learning method does not rely on prior knowledge, but faces the dimensional explosion of complex state spaces.
To solve the above problems, deep reinforcement learning has emerged and gradually become a hot topic in recent years [14][15][16]. Reinforcement learning does not require training samples and optimizes action strategies by interacting with the environment through trial and error. Deep neural networks effectively solve the problem of dimension explosion of complex state space in reinforcement learning. Since the proposal of deep reinforcement learning, many outstanding algorithms have emerged, which have been successively used in the field of maneuver decision-making of UCAVs, such as DQN [17,18], DDQN [19,20] and DDPG [21,22], etc.
This paper proposes an autonomous maneuver decision model based on the asynchronous advantage actor-critic (A3C) algorithm. The study [23] proposes an asynchronous variant of reinforcement learning algorithms, which can effectively break the input correlation caused by online reinforcement learning and achieves stable neural network training without consuming large amounts of storage. In this paper, the method is applied to the autonomous decision-making of UCAVs in air combat. The main contributions of this paper are as follows: (a) the A3C algorithm is innovatively applied to the air combat maneuver decision problem; compared with deep reinforcement learning methods based on an experience pool mechanism, the multi-threaded asynchronous mechanism breaks the input correlation, avoiding the resource waste caused by experience replay and accelerating training. (b) To address the sparse reward problem in reinforcement learning, a two-layer reward mechanism combining internal reward and sparse reward is established according to the air combat situation threat assessment index model, promoting faster convergence of the algorithm. (c) Simulation results show the performance of the model against three enemy maneuver strategies (straight flight, circling flight, and a trained DQN agent) and verify the feasibility and effectiveness of the model in autonomous maneuver decision-making.

UCAV Maneuver Model
To reduce the complexity of the air combat environment, several assumptions are made in the design of the UCAV flight maneuver model: the influences of air resistance and airflow velocity are not considered; the mass of the UCAVs is constant; the sideslip angle and angle of attack during flight are not considered; the acceleration of gravity does not change with flight altitude; and the effect of the earth's curvature is neglected.
Based on the above assumptions, the ground inertial coordinate system is established as shown in Figure 1, where the X-axis points true north, the Y-axis points true east, and the Z-axis points vertically upward. v_r is the velocity vector of the red UCAV; ψ, θ and ϕ are the yaw angle, pitch angle and roll angle of the UCAV, respectively; v_b is the velocity vector of the blue UCAV; d is the distance between the two UCAVs; ATA is the deviation angle, i.e., the angle between the velocity of the red UCAV and the distance vector between the two UCAVs; AA is the detachment angle, i.e., the angle between the velocity of the blue UCAV and the distance vector between the two UCAVs.

Kinematic Model
On the basis of the above ground inertial coordinate system, the kinematic equations of a UCAV in 3D space are established as follows [24]:

ẋ = v cos θ cos ψ,  ẏ = v cos θ sin ψ,  ż = v sin θ

where (x, y, z) is the position of the UCAV in the ground coordinate system and v is its speed.
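As a concrete illustration, the kinematic model above can be integrated with a simple Euler step. This is a minimal sketch: the function name, the Euler scheme and the sample values are illustrative choices, not taken from the paper.

```python
import math

def kinematics_step(x, y, z, v, theta, psi, dt):
    """One Euler-integration step of the 3-DOF point-mass kinematics.

    x: north position, y: east position, z: altitude (m)
    v: speed (m/s), theta: pitch angle (rad), psi: yaw angle (rad)
    """
    x += v * math.cos(theta) * math.cos(psi) * dt
    y += v * math.cos(theta) * math.sin(psi) * dt
    z += v * math.sin(theta) * dt
    return x, y, z
```

For level flight due north (θ = ψ = 0) at 250 m/s, one 0.5 s step moves the UCAV 125 m north while the east position and altitude stay unchanged.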

Kinetic Model
Because this study focuses on maneuver decision-making and trajectory generation, and does not involve the complex aerodynamic parameters and attitude changes of a six-DOF model, the dynamic model is established using the normal and tangential overloads as follows:

v̇ = g(N_x − sin θ),  θ̇ = (g/v)(N_z cos ϕ − cos θ),  ψ̇ = g N_z sin ϕ / (v cos θ)

where N_x is the tangential overload of the UCAV, directed along the flight velocity, which provides the power for the UCAV to move forward; N_z is the normal overload of the UCAV, directed perpendicular to the flight velocity, which provides the lift; and g is the gravitational acceleration.
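A minimal sketch of this overload-based point-mass dynamics follows, assuming the common sign conventions for such models; the paper's exact equations may differ in sign or angle conventions, so treat this as illustrative only.

```python
import math

G = 9.81  # gravitational acceleration, m/s^2

def dynamics_step(v, theta, psi, nx, nz, phi, dt):
    """One Euler step of an overload-based point-mass dynamics model.

    v: speed (m/s), theta: pitch (rad), psi: yaw (rad)
    nx: tangential overload, nz: normal overload, phi: roll angle (rad)
    """
    v_dot = G * (nx - math.sin(theta))                       # thrust minus gravity component
    theta_dot = G / v * (nz * math.cos(phi) - math.cos(theta))  # vertical turning rate
    psi_dot = G * nz * math.sin(phi) / (v * math.cos(theta))    # horizontal turning rate
    return v + v_dot * dt, theta + theta_dot * dt, psi + psi_dot * dt
```

With N_x = 0, N_z = 1 and ϕ = 0 in level flight, all derivatives vanish and the state is unchanged, matching the steady-straight maneuver.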

Air Combat Situation Assessment and Judgment of Victory and Defeat
In the game confrontation of air combat, the warring parties need to evaluate the current air combat situation to identify advantageous positions for attack and inferior areas where they are easily attacked by the enemy. This paper introduces the Antenna Train Angle (ATA) and Aspect Angle (AA) to describe the basic situation in air combat. According to the geometric relationship between the two aircraft, the formulas are as follows:

ATA = arccos( (v_r · d) / (|v_r||d|) ),  AA = arccos( (v_b · d) / (|v_b||d|) )

where d here denotes the distance vector between the two UCAVs. In an air battle, the side attacking from the rear is the dominant side, the side being tailed is the disadvantaged side, and two aircraft flying head-on or directly away from each other are in mutual equilibrium. Assuming that the red UCAV is our fighter, the air combat outcome R is stipulated as follows, where R = 1 means our fighter is victorious and R = −1 means our fighter is defeated:

R = 1,  if ATA < π/6 and AA < π/6 and d < 1000
R = −1, if ATA > 5π/6 and AA > 5π/6 and d < 1000
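The angle geometry and the win/loss rule can be sketched directly from these definitions. The vector convention below (line of sight taken from red to blue) and the function names are illustrative assumptions.

```python
import math

def angle_between(u, w):
    """Angle in [0, pi] between two 3-D vectors."""
    dot = sum(a * b for a, b in zip(u, w))
    nu = math.sqrt(sum(a * a for a in u))
    nw = math.sqrt(sum(a * a for a in w))
    return math.acos(max(-1.0, min(1.0, dot / (nu * nw))))

def judge(pos_r, vel_r, pos_b, vel_b):
    """Return +1 (red wins), -1 (red loses) or 0 (undecided)."""
    los = [b - r for r, b in zip(pos_r, pos_b)]   # red -> blue line of sight
    d = math.sqrt(sum(c * c for c in los))
    ata = angle_between(vel_r, los)               # antenna train angle
    aa = angle_between(vel_b, los)                # aspect angle
    if ata < math.pi / 6 and aa < math.pi / 6 and d < 1000:
        return 1
    if ata > 5 * math.pi / 6 and aa > 5 * math.pi / 6 and d < 1000:
        return -1
    return 0
```

A tail chase 500 m behind the enemy gives ATA = AA = 0 and a red win; the mirrored situation, with red being tailed, gives a red loss.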

Introduction to Deep Reinforcement Learning
As shown in Figure 2, by continuously interacting with the environment using a trial-and-error method, reinforcement learning can obtain the optimal strategy for a specific task, maximizing the expected cumulative payoff. Traditional reinforcement learning methods can solve simple prediction and decision-making tasks, but are ineffective for problems with high-dimensional state and action spaces. Deep learning has been a hot topic in recent years and is extensively used in classification and regression problems in various fields. Deep neural networks have good feature representation ability and can be used to solve the problem that high-dimensional states and value functions are difficult to represent in reinforcement learning.

Overall Framework
The A3C asynchronous framework [25] of the UCAV maneuvering decision model is shown in Figure 3. It is mainly composed of a global network and several workers, each of which occupies a thread corresponding to an independent UCAV and maintains its own network model for interacting with an independent environment. After each UCAV collects a certain amount of data by interacting with its environment, the gradient of the neural network's loss function is calculated in the thread and pushed to the global network for updating, and the thread's network parameters are then replaced with the global network's parameters. The final trained global network can output the optimal policy of the UCAV in the current air combat environment.

A3C Related Principles
The A3C deep reinforcement learning method adopts the actor-critic (AC) architecture combined with an asynchronous learning mechanism, so that multiple agents can interact with independent environments to learn strategies in parallel. Therefore, A3C is an optimization algorithm based on the actor-critic framework. Compared with offline learning algorithms, which use an experience replay mechanism to break sample correlation, the A3C algorithm collects sample data asynchronously through a multi-threading mechanism, avoiding the resource waste caused by experience replay.
The AC algorithm combines the value function method and policy gradient method, with actor and critic networks fitting the individual policy function and state value function, respectively. The actor selects the action performed by the agent according to the current state information, and the critic evaluates the quality of the action selected by the actor by calculating the value function, and guides the actor to select an action in the next stage, as shown in Figure 4.
In a real air combat environment, the state transition model of the environment is too complex to be modeled. Therefore, the state transition model must be reasonably simplified when reinforcement learning is applied to the UCAV autonomous maneuver decision-making problem. It is assumed that the probability of transitioning to the next state is related only to the current state and not to earlier states; that is, the environmental state transition is assumed to have the Markov property and to follow a Markov Decision Process (MDP). An MDP is normally represented by the six-tuple <S, A, P, R, γ, G>, where S is the state space of the environment, A is the action space of the agent, P is the state transition probability, R is the instantaneous return obtained by the agent after executing an action, γ ∈ [0, 1] is the reward decay factor, and G is the subsequent cumulative discounted reward from a certain point in the MDP. Under the Markov assumption, the strategy π of an individual is expressed as:

π(a|s) = P(A_t = a | S_t = s)

According to the Bellman equation, the state value function V_π is expressed as:

V_π(s) = E_π[ R_{t+1} + γ V_π(S_{t+1}) | S_t = s ]

Deep neural networks are used to approximate the above two functions. The policy function is approximated as:

π(a|s; θ) ≈ π(a|s)

The state value function is approximated as:

V(s; ω) ≈ V_π(s)

n-step sampling is used to accelerate convergence, and the n-step return is:

R_t^(n) = r_t + γ r_{t+1} + ... + γ^(n−1) r_{t+n−1} + γ^n V(s_{t+n}; ω)

Using the n-step return instead of the Q-value function, the advantage function is obtained as:

A(s_t; ω) = R_t^(n) − V(s_t; ω)

The cross-entropy policy function is used as the loss function of the policy network, and the parameter update formula of the policy network is:

θ ← θ + α [ ∇_θ log π(a_t|s_t; θ) A(s_t; ω) + c ∇_θ H(π(s_t, θ)) ]

where H(π(s_t, θ)) is the entropy of the strategy, used to ensure sufficient exploration and to encourage the agent to explore further so as to avoid falling into local optima, c is the influence factor of the entropy, and α is the learning rate of the actor network.
Taking the advantage function as the critic's evaluation criterion, the loss function of the value network is:

L(ω) = ( R_t^(n) − V(s_t; ω) )²

Therefore, the parameter update formula of the value network is:

ω ← ω − β ∇_ω ( R_t^(n) − V(s_t; ω) )²

where β is the learning rate of the critic network.
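To make the n-step return and advantage computation concrete, here is a minimal sketch; the function names are illustrative, and the gradient steps themselves would be applied by the optimizer.

```python
def n_step_returns(rewards, bootstrap_value, gamma):
    """Discounted n-step returns: R_t = r_t + gamma*r_{t+1} + ... + gamma^n * V(s_{t+n})."""
    returns = []
    g = bootstrap_value                  # V(s_{t+n}) from the critic
    for r in reversed(rewards):
        g = r + gamma * g                # fold rewards backwards through time
        returns.append(g)
    returns.reverse()
    return returns

def advantages(returns, values):
    """Advantage A(s_t) = R_t^(n) - V(s_t), shared by the actor and critic losses."""
    return [g - v for g, v in zip(returns, values)]
```

With rewards [0, 1], a bootstrap value of 0 and γ = 0.5, the returns are [0.5, 1.0]; subtracting critic estimates then yields the advantages that weight the policy gradient and define the squared critic loss.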

State Space
To adequately reflect the air combat situation of the two sides, state variables are set to describe the air combat situation and are used as the inputs of the two networks for learning. The more state variables are used, the more detailed the description of the situation, but the greater the computational cost of network learning. Considering that the state transition effects of the various actions are very different when the pitch angles of the UCAVs differ, the pitch angles of the two UCAVs are added to the state space as state variables. In this paper, the following 10 state variables are selected, and each variable is normalized as network input: ∆H is the height difference between the two UCAVs, ∆v is the speed difference between the two UCAVs, θ_r is the pitch angle of our UCAV, θ_b is the pitch angle of the enemy UCAV, z_r is the height of our UCAV, v_r is the speed of our UCAV, and β is the angle between the velocities of the two UCAVs, which can be expressed as:

β = arccos(cos ψ_r cos θ_r cos ψ_b cos θ_b + sin ψ_r cos θ_r sin ψ_b cos θ_b + sin θ_r sin θ_b)
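The β formula above is the arccosine of the dot product of the two unit velocity-direction vectors; a direct sketch (function name illustrative):

```python
import math

def speed_angle(psi_r, theta_r, psi_b, theta_b):
    """Angle beta between the two UCAVs' velocity directions.

    Computed as the dot product of the unit velocity vectors
    (cos(psi)cos(theta), sin(psi)cos(theta), sin(theta)) of each UCAV.
    """
    c = (math.cos(psi_r) * math.cos(theta_r) * math.cos(psi_b) * math.cos(theta_b)
         + math.sin(psi_r) * math.cos(theta_r) * math.sin(psi_b) * math.cos(theta_b)
         + math.sin(theta_r) * math.sin(theta_b))
    return math.acos(max(-1.0, min(1.0, c)))  # clamp against rounding error
```

Two UCAVs flying in the same direction give β = 0, and head-on flight gives β = π, matching the [0, π] range of the arccosine.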

Action Space
The UCAV maneuver library for air combat can be roughly divided into two levels. The first level is the typical tactical maneuver library, and the other is the basic maneuver library. The tactical maneuver library includes tumbling tactics, cobra tactics, high-speed dive tactics, etc. The traditional basic maneuver library includes steady straight flight, accelerated straight flight, decelerated straight flight, left turn, right turn, pull up, and dive down. Figure 5 shows these seven basic maneuvers. This set of basic maneuvers was proposed by the National Aeronautics and Space Administration (NASA); it includes the most commonly used basic maneuvers of UCAVs [26] and can be combined into new composite actions to form advanced tactical maneuvers. The A3C algorithm is applicable to both discrete and continuous action spaces. Since the focus of this paper is the autonomous maneuver decision of UCAVs, overly advanced tactical actions are unnecessary, and the seven maneuvers in the basic maneuver library meet the experimental requirements.
Assuming that the tangential overload, normal overload and roll angle are constant while a maneuver is executed, seven different combinations of [N_x, N_z, ϕ] are set according to the UCAV dynamics equations, and the seven basic maneuvers are encoded; that is, the triplet [N_x, N_z, ϕ] is used as the control variable of the seven discrete maneuvers in the simulation. Table 1 lists the triplet codes of the seven basic actions.
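Such an encoded maneuver library is naturally a lookup table from discrete action to control triplet. The numeric values below are placeholders for illustration, not the paper's Table 1.

```python
import math

# Illustrative control triplets (N_x, N_z, phi) for the seven basic maneuvers.
# The values are placeholders, not the paper's Table 1.
MANEUVERS = {
    "steady_straight":     (0.0, 1.0, 0.0),
    "accelerate_straight": (2.0, 1.0, 0.0),
    "decelerate_straight": (-1.0, 1.0, 0.0),
    "left_turn":           (0.0, 5.0, -math.radians(60)),
    "right_turn":          (0.0, 5.0, math.radians(60)),
    "pull_up":             (0.0, 5.0, 0.0),
    "dive_down":           (0.0, -5.0, 0.0),
}
```

The agent's discrete action index simply selects one of these seven triplets, which is then held constant for the 0.5 s execution step and fed into the dynamics equations.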

Reward Function
The setting of the reward function is a crucial link in reinforcement learning. The agent completes the learning process by constantly trying different actions, while the system evaluates each step of the agent and gives a reward or punishment. In this paper, the air combat reward mechanism of the UCAV is designed by combining an external reward and an internal reward. The external reward is the sparse reward, given at the end of each episode; that is, our UCAV is rewarded when it wins and penalized when it is defeated or the two sides draw, so the external reward function R_e can be defined as:

R_e = 20,  if ATA < π/6 and AA < π/6 and d < 1000
R_e = −20, if (ATA > 5π/6 and AA > 5π/6 and d < 1000) or step = STEP

where step is the current step of the episode's time series and STEP is the maximum number of steps in an episode. Relying only on the sparse reward, it is difficult for the agent to receive positive feedback and a correct strategy evaluation during random exploration. To solve this problem, an internal reward is set to stimulate exploration, and an air situation assessment model is established using the situation information of the UCAV in the air combat environment; that is, an internal reward function R_i is defined, composed of air combat situation information such as direction, speed, relative altitude and distance.
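The sparse external reward R_e translates directly into a terminal-check function. Note the +20 win magnitude below is an assumption chosen symmetric to the −20 penalty stated in the text.

```python
import math

def external_reward(ata, aa, d, step, max_step):
    """Sparse episode-end reward R_e.

    +20 on a win, -20 on a loss or timeout, 0 otherwise.
    The +20 magnitude is assumed symmetric to the -20 penalty.
    """
    if ata < math.pi / 6 and aa < math.pi / 6 and d < 1000:
        return 20
    if (ata > 5 * math.pi / 6 and aa > 5 * math.pi / 6 and d < 1000) or step == max_step:
        return -20
    return 0
```

Only the terminal step of an episode can produce a non-zero value; every intermediate step returns 0, which is exactly why the internal shaping reward below is needed.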
Negative rewards are given when the UCAV is in an unfavorable state. To avoid flying too high or too low, and to prevent overspeed or stalling during flight, a corresponding boundary reward function is set.

In air combat, the angular relationship between the two engaged UCAVs is a crucial factor in judging the situational advantage of one over the other. In particular, the deviation angle ATA and the detachment angle AA largely determine the result of air combat. The value ranges of ATA and AA are defined within [0, π]; the smaller ATA and AA are, the better our air combat situation is, and the angle reward R_a is defined accordingly.

A distance reward function R_d is set to maintain a certain safe distance between the UCAVs, where d_min is the nearest safe distance between the two UCAVs and d_max is the furthest expected distance between the two UCAVs.

Combined with the air combat situation assessment model, the height threat index and speed threat index of both sides are given, where ∆H_m is the optimal air combat altitude difference, v_r is our UCAV's speed, and v_b is the enemy UCAV's speed. When the air combat threat index is high, our UCAV is in an unfavorable situation and should be punished accordingly; the altitude reward function R_h and the speed reward function R_v are established on this basis. In order to make the UCAV learn to win in fewer steps, an extra step reward function R_s is also set. Combining the above air combat situation rewards, the internal reward function in the 3D environment is defined as:

R_i = ω_a R_a + ω_d R_d + ω_h R_h + ω_v R_v + R_s

where ω_a, ω_d, ω_h and ω_v are the weight coefficients of the corresponding air combat situation information.
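The internal reward is then just a weighted combination of the individual situation terms. The sketch below assumes the angle/distance/height/speed terms are computed elsewhere; the default weights are the 0.45/0.25/0.15/0.15 values given in the experimental settings.

```python
def internal_reward(r_angle, r_dist, r_height, r_speed, r_step=0.0,
                    w=(0.45, 0.25, 0.15, 0.15)):
    """Internal shaping reward R_i = w_a*R_a + w_d*R_d + w_h*R_h + w_v*R_v + R_s.

    The component rewards are assumed to be computed by the situation
    assessment model; the default weights follow the experiment settings.
    """
    wa, wd, wh, wv = w
    return wa * r_angle + wd * r_dist + wh * r_height + wv * r_speed + r_step
```

At each execution step the agent receives this dense shaping signal, while the sparse external reward is added only when the episode terminates.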

Experimental Parameter Setting
In the simulation, our UCAV is the red UCAV, guided by the A3C maneuver decision algorithm, and the enemy UCAV is the blue UCAV. At the beginning of each episode, the two UCAVs are initialized to given initial states, and both UCAVs make a decision simultaneously every 0.5 s, which is called an execution step. If no winner is decided within 200 steps, the episode is judged a tie. When evaluating the internal reward function after each maneuver selection, the influence of angle, distance, altitude and speed on the situation is considered comprehensively, so the weight parameters of the four situation information terms in the internal reward function are set to 0.45, 0.25, 0.15 and 0.15, respectively.
In this paper, three simulation conditions are designed to verify the effectiveness of the A3C maneuver decision model. In the first two groups, the enemy UCAV follows fixed strategies: steady straight flight and spiral climb, respectively. In the third group, a trained DQN algorithm serves as the maneuver strategy of the enemy UCAV. To speed up the calculation, the starting position of the red UCAV is set to (0, 0, 3000), the initial speed to 250 m/s, and the initial pitch angle to 0 degrees. The parameters of the two UCAVs' initial states are shown in Table 2. Through the processing of the Actor and Critic networks, the real-time state of the UCAVs is mapped to the selection probabilities of the different actions and the value of the current state. To fully exploit the hidden features of a given input state, an appropriate deep neural network structure must be designed. Since this experiment does not involve high-dimensional data such as screen images, a forward fully-connected network is used as the structure of both the Actor and the Critic. After repeated trials and comparisons of neural networks with different numbers of layers and neurons, the network parameters were finally set as shown in Table 3.

Algorithm Flow
In each group of simulation experiments, five UCAVs are set up as our reinforcement learning worker group, each learning independently against the enemy UCAV environment. The reward decay factor is set to 0.9. Each thread updates the global network every 20 steps, after which the thread's network parameters are synchronized with the global network's parameters. Because the A3C algorithm uses an asynchronous multi-threading mechanism, the algorithm flow lists the procedure of a single thread, as shown in Algorithm 1. A multi-core CPU is used to create the threads; each thread has a virtual UCAV that interacts with the environment independently. The action is selected according to the output strategy, the return is calculated, and local gradients are accumulated using the policy and value network parameter update formulas. These accumulated gradients do not update the local network parameters of the current thread, but are sent to the global network for its parameter update. Finally, the UCAV can perform the optimal action through the trained global network.
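The push-gradients-then-pull-parameters pattern can be sketched with Python threads. Everything here is a toy stand-in: `GlobalNet`, the scalar parameter and the constant "gradient" are placeholders for the paper's Actor-Critic networks and their accumulated gradients.

```python
import threading

class GlobalNet:
    """Toy stand-in for the shared global Actor-Critic parameters."""
    def __init__(self):
        self.params = [0.0]
        self.lock = threading.Lock()

    def apply_gradients(self, grads):
        with self.lock:  # asynchronous workers, but each update is atomic
            self.params = [p + g for p, g in zip(self.params, grads)]

def worker(global_net, n_updates):
    local = list(global_net.params)          # pull global parameters
    for _ in range(n_updates):
        grads = [0.01]                       # placeholder for accumulated d_theta, d_omega
        global_net.apply_gradients(grads)    # push gradients to the global network
        local = list(global_net.params)      # re-sync the local network

net = GlobalNet()
threads = [threading.Thread(target=worker, args=(net, 100)) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Five workers each pushing 100 increments of 0.01 leave the shared parameter at about 5.0 regardless of interleaving, illustrating why asynchronous updates can replace an experience replay buffer: each thread contributes decorrelated experience directly.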

Algorithm 1: Autonomous Maneuver Decision of UCAV Based on A3C
Input: the state feature dimension S; the action dimension A; the global network parameters θ and ω; the network parameters θ' and ω' of the current thread; the maximum number of iterations EPISODE; the maximum length STEP of one episode's time series within a thread; the reward decay factor γ; the number N of n-step rewards; the entropy coefficient c; the learning rates α and β.
Output: the updated global network parameters θ and ω.
1: for episode = 1 to EPISODE do
2:   Reset the gradient accumulators: dθ ← 0, dω ← 0
3:   Synchronize the thread parameters: θ' ← θ, ω' ← ω
4:   Obtain the initial state s_t, t ← 1
5:   repeat
6:     while t <= STEP and s_t is not the termination state and T mod N != 0 do
7:       Select action a_t based on the output strategy π(s_t, a_t; θ') of the Actor network
8:       Perform action a_t
9:       Receive reward r_t and next state s_{t+1}
10:      t ← t + 1, T ← T + 1
11:    end while
12:    Compute the n-step return and advantage, and accumulate the gradients dθ and dω
13:    Asynchronously update the global parameters θ and ω with dθ and dω, then resynchronize θ' ← θ, ω' ← ω
14:  until s_t is the termination state or t > STEP
15: end for

Analysis of Experimental Results
In this paper, the hardware environment is an Intel(R) Core(TM) i5-8400 CPU @ 2.80 GHz with 16 GB of RAM, and the software environment is the Python language and the TensorFlow open-source deep learning framework. Since A3C adopts an asynchronous learning mechanism, the exploration and learning effects of the individual UCAVs differ considerably when interacting with the environment. Therefore, every 100 episodes are taken as one training stage, and the win rate, average episode reward and average episode step number of our UCAV are counted. The algorithm iterates until the episode reward and episode step number converge.
The simulation results of the first group of experiments are given first. Figure 6 shows the reward convergence of our UCAV. At the beginning of training, the episode reward is very low and fluctuates randomly, which indicates that our UCAV is almost entirely in a random exploration state. Around the 300th training stage, the reward begins to converge and eventually settles close to 0.

Figure 7 shows the statistical trend of episode steps, and Figure 8 shows the change in our UCAV's win rate. Because the enemy UCAV flies steady and straight while our UCAV is still exploring in the early stage, both sides repeatedly draw, and the episode step number stays at the maximum. Not until the 50th training stage does our UCAV begin to learn the tactics to defeat the enemy UCAV. During this period it continues to explore, but the win rate gradually increases and the average episode step number gradually decreases. At about the 350th training stage, the episode step number stabilizes, converging to about 47 steps, and the win rate of our UCAV converges to about 0.95.

To show our UCAV's learning progress during training more visually, the trajectories of the two UCAVs in selected episodes from the middle and late training periods of the first experiment are shown in Figures 9 and 10. After a certain amount of training, the Actor-Critic network parameters have taken preliminary shape and can purposefully guide our UCAV to make favorable actions, such as climbing to gain a height advantage, circling behind the enemy UCAV to gain an angle advantage, and closing the distance to win the air battle. In the later training stage, the Actor-Critic network parameters continue to be updated and optimized, and our UCAV learns to win faster.
Figure 11 shows the changes in ATA, AA and distance d in the late training period. When the distance is less than 1000 m and the deviation angle and detachment angle are both less than π/6, our UCAV wins.

In the second experiment, the enemy UCAV performs an upward circling maneuver. Our reward and win rate converge more slowly than in the first experiment. As shown in Figures 12 and 13, the win rate of our UCAV starts to increase at the 120th training stage, and the reward and win rate converge at the 520th training stage. Figures 14 and 15 show the flight trajectories of the two UCAVs in the middle and late training episodes of the second experiment. Our UCAV learns the winning strategy in the middle training period: it can enter the circling range of the enemy UCAV, follow behind it while continuously closing the distance, and win the air battle. In the late training period, our UCAV shows higher maneuvering ability and has learned to intercept the enemy UCAV quickly with a smaller turning radius. Figure 16 shows the statistics of ATA, AA and d in the late training period.

In the third experiment, the enemy UCAV uses a DQN algorithm as its maneuver strategy. The DQN agent was trained for 100,000 episodes against a greedy algorithm, enabling the enemy UCAV to fight independently. The win rate of our UCAV during training is shown in Figure 17: as training progresses, the win rate gradually increases and finally converges to about 0.7. The trajectories of the two UCAVs in one late-training episode are shown in Figure 18. In this episode, our UCAV first narrows the distance to the enemy UCAV, while the enemy UCAV tries to climb to increase its altitude advantage.
Our UCAV then quickly executes a flick half-roll tactical maneuver to move from outside the enemy UCAV's track to behind it and win. Figure 19 shows the change of ATA, AA and d in this episode.

In the first two cases, the enemy UCAV adopts a fixed strategy, and after a period of training our A3C-guided UCAV can quickly catch up with the enemy UCAV and win rapidly, showing that the decision scheme proposed in this paper can effectively guide a UCAV to make maneuver decisions. In the third case, our UCAV equipped with the A3C algorithm fights the enemy UCAV equipped with the DQN algorithm. After training, our UCAV can suppress the enemy UCAV and converge to a higher win rate, which indicates that applying the A3C algorithm to the UCAV autonomous maneuver decision-making problem is effective.

Conclusions
In this paper, an asynchronous Actor-Critic framework based on the advantage function is built to study the autonomous maneuver decision of UCAVs. To compute state transitions and assess the situation in an air combat confrontation, a physical model of the UCAV is established, and a reward mechanism combining sparse reward and internal reward is designed. A continuous state space and a discrete, control-variable-based action space are set up. Then the neural network and multi-threading model are built to realize the asynchronous advantage AC framework, in which the UCAV of each thread learns from its environment independently and the parameters of the global AC network are updated regularly. Finally, the model is trained in three scenarios. When the enemy UCAV performs fixed maneuvers, our UCAV equipped with the A3C algorithm shows excellent autonomous maneuvering decision-making ability, and when the enemy UCAV adopts the DQN algorithm, our UCAV's reward still converges and it achieves a high win rate. Therefore, the effectiveness and feasibility of the UCAV autonomous maneuver decision-making method based on A3C deep reinforcement learning are verified by observing the rewards, win rates and confrontation trajectories.
Of course, this work still leaves room for improvement. For example, the fixed initial position could be replaced with a random initial position, which would allow more thorough training of the algorithm. In addition, the seven candidate actions could be further subdivided to make the UCAV's maneuvers more flexible and produce smoother adversarial maneuver trajectories. We will study these improvements in future work.