Autonomous Maneuver Decision Making of Dual-UAV Cooperative Air Combat Based on Deep Reinforcement Learning

Autonomous maneuver decision making is the core of intelligent warfare and has become the main research direction for enabling unmanned aerial vehicles (UAVs) to independently generate control commands and complete air combat tasks according to environmental situation information. In this paper, an autonomous maneuver decision making method is proposed for air combat by two cooperative UAVs, showcased using the typical olive formation strategy as a practical example. First, a UAV situation assessment model based on the relative situation is proposed, which uses real-time target and UAV location information to assess the current situation or threat. Second, the continuous air combat state space is discretized into a 13-dimensional space for dimension reduction and quantitative description, and 15 typical action commands, instead of a continuous control space, are designed to reduce the difficulty of UAV training. Third, a reward function is designed based on the situation assessment, which includes the real-time gain due to maneuvering and the final combat winning/losing gain. Fourth, an improved training data sampling strategy is proposed, which samples the data in the experience pool based on priority to accelerate training convergence. Fifth, a hybrid autonomous maneuver decision strategy for dual-UAV olive formation air combat is proposed, which realizes the UAV capabilities of obstacle avoidance, formation and confrontation. Finally, the air combat task of a dual-UAV olive formation is simulated and the results show that the proposed method can help the UAVs defeat the enemy effectively and outperforms the deep Q network (DQN) method without priority sampling in terms of convergence speed.


Introduction
With the evolution of combat modes and the expansion of combat scale, modern air combat has gradually extended from within-visual-range (WVR) air combat to beyond-visual-range (BVR) air combat [1]. Unmanned aerial vehicles (UAVs) are increasingly used in military tasks such as reconnaissance, surveillance and target attack [2] because of their low cost, strong mobility and high concealment. Due to the limitations of a single UAV's mission and combat capability, autonomous multi-UAV cooperative air combat has become a research hotspot in recent years [3].
Autonomous air combat maneuver decision making refers to the process of automatically generating the maneuver control decisions of UAVs based on mathematical optimization and artificial intelligence [4]. Deep reinforcement learning avoids the complex modeling, difficult sample labeling and cumbersome solution procedures of other methods, and can produce decision sequences that account for long-term effects through self-interactive training without manual intervention. It is therefore a feasible modeling method for autonomous air combat maneuver decision making [20,21]. The autonomous maneuver decision making problem of air combat based on deep reinforcement learning is studied in [22][23][24][25][26]. In [22,23], researchers verify the performance of their algorithms by building high-fidelity simulation combat platforms and obtain good experimental results. However, the reward function in [23] is sparse: the reward is 0 in most states of each round, which is not conducive to network training. The robust multi-agent reinforcement learning (MARL) framework is used in [24] to solve the convergence problem caused by the non-stationary environment during training. However, the simulation environment in [24] is a two-dimensional plane and the simulation initialization is fixed, which makes the method hard to apply in dynamic confrontation scenarios. Many aspects of UAV situation assessment are considered in [25], but the UAV uses absolute coordinates as the state input, which is highly dependent on spatial characteristics. In [26], researchers use Monte Carlo reinforcement learning; its biggest problem is that the agent needs to complete an entire air combat episode before the reward can be evaluated. Moreover, the above references consider the one-to-one air combat scenario, which has limited reference value for the research of multi-aircraft cooperative autonomous control. 
There are few studies on multi-agent confrontation using reinforcement learning. In [27], a novel autonomous aerial combat maneuver strategy generation algorithm based on the state-adversarial deep deterministic policy gradient (SA-DDPG) algorithm is proposed, which considers airborne sensor errors and uses a reward shaping method based on maximum entropy inverse reinforcement learning. However, its reliance on expert knowledge in the design of the reward function is not conducive to extension to more complex air combat environments. In [28], researchers propose an air combat decision-making model based on a reinforcement learning framework and use long short-term memory (LSTM) networks to generate displacement predictions. However, the simulation experiments in [28] rely on an off-the-shelf game environment, which limits the extensibility of the study, and the work addresses searching for an observation station in a non-threatening environment, which differs significantly from the air combat mission of this paper. Based on the MARL method, the simulation in [29] realizes multiple UAVs arriving at their destinations from arbitrary departure points in a large-scale complex environment. However, the modeling environment is planar, and the sparse reward function considers only a distance penalty. The method for maneuver decision making of multi-UAV formation air combat in [30] is robust; however, no simulation results are given and only three maneuver behaviors are available. The deep deterministic policy gradient (DDPG) algorithm is used in [31] to realize maneuver decisions under a dynamically changing number of UAVs during swarm air combat. The algorithm is robust and extensible, but a waypoint model is used, which cannot describe the maneuver characteristics of the UAV. 
There are other studies on air combat based on reinforcement learning: intelligent decision making technology for multi-UAV prevention and control is proposed in [21,32]; a control method for UAVs to autonomously avoid missile threats based on deep reinforcement learning is introduced in [33,34]; and deep reinforcement learning is used in [35] to build an intelligent command framework. These studies focus on the feasibility of reinforcement learning methods for particular air combat problems; they have little direct correlation with our autonomous maneuver decision making problem, but provide ideas for our research. In addition, uniform sampling is used in [21][22][23][24][25][26][29,30,32], which means that all experiences in the experience pool have the same probability of being extracted and utilized. This ignores the different importance of each experience, resulting in long training times and unstable training.
Generally speaking, current research on air combat maneuver decision making based on reinforcement learning mainly focuses on single-UAV confrontation tasks, while research on multi-UAV confrontation and cooperation is still at an initial exploration stage. These studies suffer from one or more of the following problems: Dimension explosion, sparse and delayed rewards, overly simple simulation environments, lack of a maneuver model, incomplete situation assessment, and slow training caused by random uniform sampling.
In this paper, an autonomous maneuver decision making method based on deep reinforcement learning is proposed for dual-UAV cooperative air combat. The main contributions are as follows. First, aiming at the problems of dimension explosion, sparse and delayed rewards and incomplete situation assessment, we discretize the continuous air combat state space into 13 dimensions for dimension reduction and quantitative description of air combat states, and establish a situation assessment model based on the relative location between the UAV and the target. Second, a reward function is designed according to the situation assessment results, which includes the real-time gain due to maneuvering and the final combat winning/losing gain. Such a design helps to solve the problem of sparse and delayed rewards in games that run for a long time before ending. Third, aiming at the slow convergence caused by random sampling in conventional DQN learning, an improved priority sampling strategy is proposed to accelerate the convergence of DQN network training. Fourth, we apply and adapt the designed autonomous maneuver decision making method to the typical task of dual-UAV olive formation air combat, which equips the UAVs with the capabilities of collision avoidance, formation and confrontation. Finally, the proposed method is validated by simulation using practical fixed-wing UAV models and compared with DQN learning without priority sampling. The simulation results show that our method enables the two UAVs to defeat the enemy effectively and improves the convergence speed.
The remainder of this paper is organized as follows: Section 2 presents the problem formulation. Section 3 describes the air combat confrontation algorithm based on deep reinforcement learning. Section 4 describes typical air combat scenarios and the design of the dual-UAV cooperative autonomous maneuver strategy. Section 5 conducts the simulation analysis. Section 6 concludes the paper.

Problem Formulation
This paper studies the autonomous maneuver control decisions of multi-UAV BVR cooperative tracking and close combat. The maneuver control model and situation assessment are prerequisites of UAV maneuver decision making. Therefore, in the following sections, we elaborate on UAV autonomous maneuver decision making from three aspects: The maneuver control model, situation assessment and maneuver control decision.

UAV Dynamic Model
As shown in Figure 1, in the ground coordinate system, ox, oy and oz point to the east, north and vertical directions, respectively. The motion model of the UAV in this coordinate system is given by

ẋ = v cos γ sin ψ,  ẏ = v cos γ cos ψ,  ż = v sin γ, (1)

where x, y and z represent the position of the UAV in the coordinate system, v represents the current speed of the UAV, and ẋ, ẏ and ż represent the velocity components along the three coordinate axes. Let v_h denote the projection of v on the xoy plane. γ is the angle between v and v_h, named the pitch angle; ψ is the angle between v_h and the oy axis, named the yaw angle. In the same coordinate system, the dynamic model of the UAV can be expressed as

v̇ = g(n_x − sin γ),  γ̇ = (g/v)(n_z cos µ − cos γ),  ψ̇ = g n_z sin µ / (v cos γ), (2)

where g is the gravitational acceleration, n_x ∈ R and n_z ∈ R represent the tangential and normal overloads, and µ ∈ [−π, π] represents the roll angle around v.
[n_x, n_z, µ] ∈ R³ are the feasible basic control parameters in the UAV maneuver control model; they jointly control the direction and magnitude of the UAV speed and are often used as the command for air combat maneuver decision making.
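To make the model concrete, the following is a minimal sketch of one integration step of the kinematic model (1) and dynamic model (2); the Euler scheme, the time step and SI units are our own assumptions and are not prescribed by the paper.

```python
import math

G = 9.81  # gravitational acceleration, m/s^2

def step(state, nx, nz, mu, dt=0.1):
    """One Euler step of the 3-DOF UAV model; state = (x, y, z, v, gamma, psi)."""
    x, y, z, v, gamma, psi = state
    # Kinematics (1): position rates from speed, pitch angle and yaw angle
    dx = v * math.cos(gamma) * math.sin(psi)
    dy = v * math.cos(gamma) * math.cos(psi)
    dz = v * math.sin(gamma)
    # Dynamics (2): tangential overload nx changes speed; normal overload nz
    # and roll angle mu steer the velocity vector
    dv = G * (nx - math.sin(gamma))
    dgamma = G / v * (nz * math.cos(mu) - math.cos(gamma))
    dpsi = G * nz * math.sin(mu) / (v * math.cos(gamma))
    return (x + dx * dt, y + dy * dt, z + dz * dt,
            v + dv * dt, gamma + dgamma * dt, psi + dpsi * dt)
```

For example, level northbound flight (γ = 0, µ = 0, n_z = 1, n_x = 0) keeps speed and attitude constant while the UAV advances north.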

Situation Assessment Model
As shown in Figure 2, denote by P_t^U and P_t^T the positions of the UAV and the target at time t. Denote by ϕ_t^U the angle between the vector P_t^T − P_t^U and v_t^U, named the lag angle, and similarly by ϕ_t^T the angle between P_t^T − P_t^U and v_t^T, named the lead angle. They are defined as

ϕ_t^U = arccos( ((P_t^T − P_t^U) · v_t^U) / (D_{U_t T_t} ‖v_t^U‖) ),  ϕ_t^T = arccos( ((P_t^T − P_t^U) · v_t^T) / (D_{U_t T_t} ‖v_t^T‖) ), (3)

where D_{U_t T_t} = ‖P_t^T − P_t^U‖ is the distance between the UAV and the target.
Based on the attack model [36] and evaluation function [31,37], the effective attack range of a UAV in air combat is a cone with its axis along v_t^U and half-angle ϕ_m, truncated by a ball of radius D_max as shown in Figure 2, where D_max represents the attack range of weapons. Similarly, a cone-shaped attack range can be defined for the target. The UAV should try to follow the target as closely as possible: the smaller ϕ_t^U is, the greater the probability that the UAV successfully attacks the target; conversely, the larger ϕ_t^T is, the greater the probability that the target successfully attacks the UAV. Therefore, we define the angle situation

η_t^A = ϕ_t^U + ϕ_t^T, (4)

to reflect the changes of the angle situation between the target and the UAV during air combat confrontation, and there is 0 ≤ η_t^A ≤ 2π. The smaller η_t^A is, the more the UAV is in a pursuit posture against the target. In addition to angle, distance is also an important factor in air combat. Denote by D_min the minimum distance that the UAV can safely reach to the target. When D_{U_t T_t} < D_min, the target is in the blind zone of the UAV's radar detection, and the UAV is at risk of collision [38]. Thus, the distance situation η_t^D of the UAV in air combat is defined as a decreasing function of D_{U_t T_t}, where D_{U_t T_t} can be as large as 50 km in BVR air combat. The larger η_t^D is, the closer the UAV is to the enemy. Outside D_min, the smaller D_{U_t T_t} and the lag angle are, the greater the probability that the UAV's attack on the target succeeds. Combining the angle and distance situations, we define the comprehensive situation η_t as a weighted combination of η_t^A and η_t^D to evaluate the real-time situation during UAV air combat, where w_1 and w_2 are scale factors representing the influence of the two situation factors on UAV situation assessment. Moreover, since the value ranges of η_t^A and η_t^D are very different, w_1 and w_2 must also balance their effects. 
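The situation assessment can be sketched in code as follows. The lag/lead angles follow (3) and the angle situation follows (4); the explicit exponential form of the distance situation, the default weights and the linear combination are illustrative assumptions, since the paper gives the exact formulas only implicitly.

```python
import math

def angles(p_u, v_u, p_t, v_t):
    """Lag angle phi_U, lead angle phi_T and distance D from positions/velocities."""
    los = [t - u for u, t in zip(p_u, p_t)]          # line of sight P_T - P_U
    d = math.sqrt(sum(c * c for c in los))
    def ang(v):
        dot = sum(a * b for a, b in zip(los, v))
        return math.acos(dot / (d * math.sqrt(sum(c * c for c in v))))
    return ang(v_u), ang(v_t), d

def situation(p_u, v_u, p_t, v_t, d_min=500.0, w1=0.6, w2=0.4):
    """Comprehensive situation; larger is better for the UAV (assumed form)."""
    phi_u, phi_t, d = angles(p_u, v_u, p_t, v_t)
    eta_a = phi_u + phi_t                  # angle situation (4), 0 <= eta_a <= 2*pi
    # assumed distance situation: grows as the UAV closes in on the target
    eta_d = math.exp(-max(d - d_min, 0.0) / 10000.0)
    return w2 * eta_d - w1 * eta_a         # assumed weighted combination
```

A tail-chase geometry (both velocities along the line of sight) yields ϕ_t^U = ϕ_t^T = 0, the best angle situation.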
Therefore, the air combat maneuver decision making problem of the UAV can be regarded as the optimization problem of maximizing the accumulated situation Σ_t η_t(n_t^x, n_t^z, µ_t) over the set Λ of UAV maneuver control commands, where η_t(n_t^x, n_t^z, µ_t) means that η_t is a function of n_t^x, n_t^z and µ_t, with [n_t^x, n_t^z, µ_t] defined as in (2). That is, (8) maximizes the sum of the UAV situation from the beginning to the end of air combat. It is difficult to obtain the optimal solution analytically because the objective function is a complex high-order nonlinear function. Next, we use deep reinforcement learning to solve this problem.

Reinforcement Learning
Reinforcement learning is a method for the agent to optimize its maneuver strategy. The air combat maneuver decision making problem discussed in this paper is a model-free reinforcement learning problem. The Markov decision process (MDP) is usually used as the theoretical framework of model-free reinforcement learning, and the final objective of reinforcement learning is to solve the MDP by deducing an optimal policy [39]. The MDP is described by the tuple [S, A, R, γ], where S represents the state space, A the action space, R the reward function and γ the discount factor. Reinforcement learning uses the state-action value function Q to evaluate the value of the action taken in the current state [18], defined as

Q_π(s, a) = E_π[ Σ_{k=0}^{∞} γ^k r_{t+k} | s_t = s, a_t = a ], (9)

where s ∈ S, a ∈ A and r ∈ R. In addition, to facilitate the calculation, the following simplified recursive form is usually used [40]:

Q_π(s_t, a_t) = E[ r_t + γ Q_π(s_{t+1}, a_{t+1}) ]. (10)

Reinforcement learning finds the optimal strategy π*(s) by finding the optimal action value function Q*(a|s), i.e.,

π*(s) = arg max_{a∈A} Q*(a|s). (11)

As long as the maximum action value function is found, the corresponding strategy π* is the solution of the reinforcement learning problem. To overcome the curse of dimensionality in (11), the deep reinforcement learning algorithm [41] transforms table-value learning into parameter fitting of a neural network, i.e.,

Q(s, a; θ) ≈ Q*(a|s), (12)

where θ denotes the parameters of the neural network and Q(s, a; θ) is called the online Q network. Therefore, the solution of the reinforcement learning problem can be expressed as

π*(s) = arg max_{a∈A} Q(s, a; θ). (13)

State Space
The state of the UAV quantitatively reflects the current air combat information. We design the state space of the UAV from the following three aspects: The first is the maneuver characteristics of the UAV. The second is the relative situation between the UAV and the target. The third is the dynamic information of the target, which supports real-time prediction of the combat situation.
In this paper, we use the following 13 variables to form the state space, covering the UAV's own motion, the relative geometry and the target's motion. Among them, γ_UT represents the angle between P_t^T − P_t^U and the oxy plane, and ψ_UT represents the angle between the projection of P_t^T − P_t^U on the oxy plane and the ox axis, as shown in Figure 2. In order to unify the range of each state variable and improve the efficiency of network learning, each state variable is normalized, as shown in Table 1.
v_max and v_min represent the maximum and minimum speeds of the UAV, respectively. z_max and z_min represent the maximum and minimum safe flight altitudes of the UAV, respectively. D_thres is the distance threshold, representing the starting distance of close combat. Therefore, the state space can be defined as s = [s_1, s_2, ..., s_13].
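The min-max normalization of the state variables can be sketched as follows; the bound values below are hypothetical placeholders for the thresholds listed in Table 1.

```python
def normalize(value, lo, hi):
    """Min-max normalize a state variable into [0, 1], clipping out-of-range values."""
    return (min(max(value, lo), hi) - lo) / (hi - lo)

# hypothetical bounds: speed in [v_min, v_max] m/s, altitude in [z_min, z_max] m
v_min, v_max = 50.0, 400.0
z_min, z_max = 500.0, 12000.0
state = [normalize(220.0, v_min, v_max), normalize(3000.0, z_min, z_max)]
```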

Action Space
As mentioned above, [n_t^x, n_t^z, µ_t] ∈ Λ constitutes the action space of the UAV. Different deep reinforcement learning algorithms, such as DDPG and DQN, design the action space differently. Because of the huge state space, using DDPG or similar algorithms to train continuous maneuvering strategies makes neural network convergence difficult, and the maneuver process of a UAV can be regarded as a combination of some basic actions [42]. Thus, this paper adds eight action commands to the basic air combat maneuvers [43] defined by NASA, and finally discretizes the UAV action space into 15 actions, as shown in Figure 3 and Table 2. This approach reduces the difficulty of UAV training; compared with the basic seven actions, it allows the UAV to perform constant-speed, acceleration and deceleration control in each direction, which is closer to the real flight mode of the UAV. The UAV selects an action a ∈ A according to the state s and outputs it to the environment; after format conversion, the UAV is guided to fly according to the command.
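One way such a discrete action table can be encoded is sketched below: five directional maneuvers crossed with three speed commands yield 15 actions. The control triples are illustrative placeholders, not the actual values of Table 2.

```python
# Hypothetical discretization: 5 directional maneuvers x 3 speed commands = 15 actions.
# Each action maps to a control triple (nx, nz, mu); values are illustrative only.
NX = {"hold": 0.0, "accel": 2.0, "decel": -2.0}      # tangential overload choices
DIRECTIONS = {
    "level": (1.0, 0.0),    # (nz, mu): straight and level
    "left":  (5.0, -1.2),   # hard left turn
    "right": (5.0, 1.2),    # hard right turn
    "climb": (5.0, 0.0),    # pull up
    "dive":  (-5.0, 0.0),   # push down
}
ACTIONS = [(nx, nz, mu)
           for nx in NX.values()
           for nz, mu in DIRECTIONS.values()]
```

An action index chosen by the Q network then maps directly to a control triple for the dynamic model (2).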

Reward Function
The reward function [44][45][46][47] is the feedback signal obtained by the agent in the process of interacting with the environment, and is used to evaluate the effect of executing a certain action strategy. A reasonably designed reward function can therefore effectively improve the convergence speed of the system [45]. The reward r_t in this paper consists of two parts, defined as

r_t = R_t^η + R_t. (16)
In (16), R_t is the evaluation of the final result of air combat, defined as

R_t = { C, if "UAV wins"; −C, if "Target wins"; 0, otherwise }, (17)

where C is a constant greater than 0. The UAV is considered to win, and the success reward C is given, under the following conditions. First, an optimal shooting distance threshold D_attack based on expert experience is defined, with D_attack ≤ D_max. Then, if the distance between the UAV and the target satisfies D_min < D_{U_t T_t} < D_attack and the angles satisfy ϕ_t^U ≤ ϕ_m and ϕ_t^T ≥ π − ϕ_m at the same time, the UAV has the best shooting conditions against the target and the "UAV wins" condition is established. Similarly, if D_min < D_{U_t T_t} < D_attack while ϕ_t^T ≤ ϕ_m and ϕ_t^U ≥ π − ϕ_m hold at the same time, the "Target wins" condition is established and the failure penalty −C is given. The term R_t^η in (16) is used for the real-time evaluation of maneuver decision making and is defined as

R_t^η = η_t − η_{t−1}. (18)

Equation (18) indicates the change of the UAV's situation during air combat. If R_t^η > 0, the situation at time t is better than that at time t − 1: the maneuver a_{t−1} adopted by the UAV from s_{t−1} to s_t is reasonable and a positive reward is given. Conversely, if R_t^η < 0, a negative penalty is given. w_1 and w_2 indirectly affect the maneuver decision making of the UAV by influencing r_t. Considering the different importance of η_t^A and η_t^D under different states s_t, an evaluation method based on piecewise adjustment of w_1 is proposed. In this paper, w_1 is set as the piecewise function

w_1 = { W_1, if D_{U_t T_t} > D_max; W_2, if D_{U_t T_t} ≤ D_max }, (19)

where W_1 and W_2 are constants with 0 < W_1 < W_2. When D_{U_t T_t} > D_max, the UAV should give priority to adjusting D_{U_t T_t} to quickly approach the target. When D_{U_t T_t} ≤ D_max, the UAV should consider adjusting ϕ_t^U and D_{U_t T_t} at the same time, and the smaller D_{U_t T_t} is, the larger w_1 should be. w_2 is a constant; there is no need to set a piecewise function for w_2 because its relative importance changes with w_1.
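A minimal sketch of the reward computation combining the situation change with the terminal win/lose reward, together with a piecewise weight; the constants C, W_1 and W_2 are assumed values for illustration.

```python
C = 10.0           # terminal win/lose reward magnitude (assumed value)
W1, W2 = 0.3, 0.7  # piecewise angle-weight constants, 0 < W1 < W2 (assumed values)

def w1(d, d_max):
    """Piecewise angle weight: emphasize closing the distance beyond weapon range."""
    return W1 if d > d_max else W2

def reward(eta_prev, eta_now, outcome):
    """r_t = (eta_t - eta_{t-1}) + terminal reward: dense shaping plus sparse outcome."""
    terminal = {"uav_wins": C, "target_wins": -C, "ongoing": 0.0}[outcome]
    return (eta_now - eta_prev) + terminal
```

The dense term rewards every step that improves the situation, so the agent receives informative feedback long before an episode ends.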

Remark 1.
If we change one of w_1 and w_2, the relative influence of the angle advantage and the distance advantage on the UAV maneuver decision changes. Therefore, this paper discusses the design of w_1. Three settings of w_1 can be considered: A fixed value, a piecewise function as in (19), or a continuous function such as w_1 ∝ D_{U_t T_t}. A fixed value means that the influence of η_t^A and η_t^D on the UAV maneuver decision is fixed. A piecewise function means that this influence changes across a limited number of distinct cases. A continuous function means that this influence changes in real time according to the current state of the UAV. In this paper, we set w_1 as a piecewise function for the following reasons. r_t is a comprehensive evaluation of η_t^A, η_t^D and R_t at time t, and these three components are independent of each other. If w_1 ∝ D_{U_t T_t}, then D_{U_t T_t} and ϕ_t^U become coupled: as D_{U_t T_t} changes, the change of w_1 represents a synchronous change of the importance of η_t^A and η_t^D to the UAV maneuver decision, which cannot reflect the different importance of the two advantage functions in different stages of air combat. In addition to this theoretical analysis, the experimental section compares the three settings, that is, w_1 set as a fixed value, a piecewise function and a continuous function proportional to distance, and analyzes the changes of the loss function in the three cases to further justify the piecewise choice.

Priority Sampling and Network Training
In this paper, the DQN algorithm is used to realize self-learning of the UAV maneuver control strategy [41,48], and an improved priority sampling strategy is proposed to accelerate training. Experience replay includes two key steps, "experience storage" and "sampling replay"; here we improve "sampling replay". The basic idea is to assign a priority to each sample in the experience pool and to prefer experiences with high priority when sampling. First, the data in the experience pool are labeled according to importance: the greater the value of a sample to network training, the more important it is and the higher its priority. Then the labeled data are sampled: the higher the priority, the greater the probability that the sample is extracted. Finally, the extracted samples are used for weighted training of the network. We now introduce priority sampling in detail from three aspects: Sample labeling, sampling and network training. Let p_i indicate the importance of the i-th sample (s_i, a_i, r_i, s_{i+1}). A reasonable approach is to assign p_i using the following TD error δ_i [49]:

δ_i = y_i − Q(s_i, a_i; θ), with y_i = r_i + γ max_{a'} Q(s_{i+1}, a'; θ'), (20)

where y_i is called the target Q value. Since |δ_i| ≥ 0, and the larger |δ_i| is, the more important sample i is [49], a larger |δ_i| should yield a higher sampling probability. To avoid accidental factors making some |δ_i| too large and driving the sampling probability of lower-priority samples close to 0, which would reduce sampling diversity, |δ_i| is limited to [0, 1] using the tanh function, i.e.,

p_i = tanh(|δ_i|) + σ, (21)

where σ is a small positive number, so that p_i = σ > 0 when δ_i = 0. Then the sampling probability of sample i is expressed as

P(i) = p_i^α / Σ_k p_k^α, (22)

where α is the priority factor used to adjust the priority of the sample, and the sum runs over the k samples in the experience pool. The larger α is, the more the sampling concentrates on high-priority samples; when α = 0, (22) degenerates into uniform sampling.
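A minimal sketch of a proportional prioritized replay buffer implementing the priority assignment and sampling probability above; the FIFO eviction policy and the values of α and σ are our own assumptions.

```python
import math
import random

class PrioritizedReplay:
    """Minimal proportional prioritized replay; alpha and sigma are assumed values."""
    def __init__(self, capacity, alpha=0.6, sigma=0.01):
        self.capacity, self.alpha, self.sigma = capacity, alpha, sigma
        self.data, self.priorities = [], []

    def add(self, transition, td_error=1.0):
        p = math.tanh(abs(td_error)) + self.sigma    # p_i = tanh|delta_i| + sigma
        if len(self.data) >= self.capacity:          # FIFO eviction (assumed)
            self.data.pop(0)
            self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append(p)

    def sample(self, batch_size):
        weights = [p ** self.alpha for p in self.priorities]
        total = sum(weights)
        probs = [w / total for w in weights]         # P(i) = p_i^alpha / sum_k p_k^alpha
        idx = random.choices(range(len(self.data)), weights=probs, k=batch_size)
        return idx, [self.data[i] for i in idx], [probs[i] for i in idx]
```

With α = 0 every weight becomes 1 and the buffer reduces to uniform sampling, matching the degenerate case noted above.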

Remark 2.
The definition of P(i) is not unique; two variants are proposed in [49]. The second variant is p_i = 1/rank(i), where rank(i) is the rank of sample i when sorted by |δ_i|. Considering the simplicity of code implementation, we use the proportional variant p_i = |δ_i| + σ.
Prioritized replay introduces bias because it changes the distribution of the sampled data in an uncontrolled fashion, and therefore changes the solution that Q(s, a; θ) converges to. This bias can be corrected with the importance-sampling (IS) weights

λ_i = (N · P(i))^{−β}, (23)

where N represents the experience pool capacity and β represents the compensation degree; if β = 1, the non-uniform sampling probability is fully compensated. The network parameters are then updated using λ_i δ_i instead of δ_i. For stability reasons [50], we normalize λ_i as

λ_i ← λ_i / max_j λ_j. (24)

After obtaining the sample data through priority sampling, we input them into the target network Q(s, a; θ') and the online network Q(s, a; θ) to update θ. DQN adjusts θ by gradient descent during training; the loss function with importance-sampling weights and its gradient are

L(θ) = (1/k) Σ_i λ_i (y_i − Q(s_i, a_i; θ))², (25)

∇_θ L(θ) = −(2/k) Σ_i λ_i (y_i − Q(s_i, a_i; θ)) ∇_θ Q(s_i, a_i; θ), (26)

and the parameters are updated as θ ← θ − lr · ∇_θ L(θ), where lr is the learning rate. (27)

Finally, in order to collect enough diverse samples for network training, the DQN algorithm uses an ε-greedy strategy [41,48] to select actions, i.e.,

a_t = { a random action in A, if num < ε; arg max_{a∈A} Q(s_t, a; θ), otherwise }, (28)

where num is a random number in [0, 1]. The above UAV BVR autonomous maneuver decision algorithm is summarized as pseudo code in Algorithm 1.
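The importance-sampling correction and the weighted TD loss can be sketched as follows; the value of β and the batch handling are assumptions.

```python
def is_weights(probs, buffer_size, beta=0.4):
    """lambda_i = (N * P(i))^-beta, normalized by the batch maximum for stability."""
    raw = [(buffer_size * p) ** -beta for p in probs]
    m = max(raw)
    return [w / m for w in raw]

def weighted_loss(td_errors, weights):
    """Weighted mean squared TD error: (1/k) * sum_i lambda_i * delta_i^2."""
    return sum(w * d * d for w, d in zip(weights, td_errors)) / len(td_errors)
```

Under uniform sampling (P(i) = 1/N) every weight is exactly 1, so the correction vanishes, as expected.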

Algorithm 1 DQN with proportional prioritization
1: Initialize online network Q with random parameters θ; initialize target network Q' with parameters θ' = θ; initialize replay buffer M; initialize hyper-parameters D_max, D_min, V_max, γ, ϕ_m, w_2, W_1, W_2, β, z_min, z_max, σ, D_BVR, V_min, D_WVR, α, a, b, C, k, K, D_attack
2: for episode = 1 to N do
3:   Initialize the initial state of air combat
4:   Receive initial observation state s_1
5:   for t = 1 to T do
6:     With probability ε select a random action a_t
7:     Otherwise select a_t = arg max_a Q(s_t, a; θ)
8:     The UAV executes action a_t, and the target executes an action according to its policy
9:     Receive reward r_t and observe new state s_{t+1}
10:    Store transition (s_t, a_t, r_t, s_{t+1}) in M
11:    Sample a mini-batch of k transitions (s_i, a_i, r_i, s_{i+1}) from M with priority P(i)
12:    Compute importance-sampling weights λ_i = (N · P(i))^{−β} / max_j λ_j
13:    Compute the target y_i = r_i + γ max_{a'} Q(s_{i+1}, a'; θ')
14:    Compute the TD error δ_i = y_i − Q(s_i, a_i; θ)
15:    Update the priorities of the sampled transitions according to |δ_i|
16:    Perform a gradient descent step on λ_i (y_i − Q(s_i, a_i; θ))² with respect to the network parameters θ
17:    Every K steps reset θ' = θ
18:  end for
19: end for

The current state of the UAV is s_t. The online Q network selects and executes the action a_t based on the ε-greedy strategy, transferring the UAV to the next state s_{t+1} and yielding the reward r_t. The transition (s_t, a_t, r_t, s_{t+1}) is saved to the experience pool, and these steps are repeated until the number of samples in the experience pool meets the requirement. Samples are then selected from the experience pool according to the priority P(i) to train the neural network, and the importance sampling weight λ_i of each selected sample is calculated. These samples are used to train the network parameters: first, s_i is input into the online Q network and s_{i+1} into the target Q network; second, the weighted mean square error is calculated according to (25), and (27) is used to update the online Q network's parameters. At the same time, the TD error δ_i = y_i − Q(s_i, a_i; θ) is obtained, and the priorities of the selected samples in the experience pool are updated according to δ_i. The target network parameters are updated after a certain number of training steps or rounds. During training, ε should be decayed slowly, so that the UAV increasingly chooses the optimal action according to the value function. When the error is close to 0 or no longer changes appreciably, training is stopped, and the trained neural network is saved to obtain the air combat maneuver strategy of the dual-aircraft formation

π(a|s)_U = arg max_{a∈A} Q(s, a, θ). (29)

Task Description
This paper takes a typical two-UAV olive formation combat scene as an example. As shown in Figure 4, two UAVs perform tracking, defense and attack tasks. The UAV air combat can be divided into three stages: The take-off stage, the BVR tracking stage and the WVR attack stage. During the take-off stage, the two UAVs take the enemy aircraft as the target and continuously accelerate toward it. In the BVR tracking stage, the nose of one UAV faces the target while the nose of the other points away from the target, and the two UAVs fly in formation to maintain the continuity of attack and defense. In this stage, the trajectory formed by a UAV as it turns from flying toward the target, to flying away from it, and back toward it again is a circle or ellipse, hence the name olive formation. The WVR combat stage is also called a dogfight: the two UAVs seek the best angle and distance to attack the enemy while avoiding entering the attack range of the target. We assume that the UAVs can accurately obtain any information they need, and then control the speed, yaw and roll through autonomous maneuver decision making, so as to track, defend against and attack the target.
Three problems must be considered for dual-aircraft autonomous maneuver control decisions: First, how to achieve inter-aircraft collision avoidance? Second, how to make the two UAVs form an olive formation to maintain the continuity of attack and defense? Third, how to make the two UAVs maneuver independently to realize BVR tracking and short-range attack? Next, we solve these three problems by designing a hybrid autonomous maneuver strategy combining obstacle avoidance, formation and confrontation.

Collision Avoidance and Formation Strategy
As described in Section 4.1, A = {U_1, U_2} is used to represent our UAV set. We use U to represent any UAV in A, and Ũ to represent U's friendly aircraft. Denote by D_{U_t Ũ_t} the distance between U and Ũ at time t. First, if D_{U_t Ũ_t} < D_min, the UAVs must avoid colliding with each other, regardless of whether the firing conditions are met. In this case, the fastest effective obstacle avoidance method between UAVs is to change the relative flight altitude between a UAV and its friendly aircraft. Therefore, the collision avoidance strategy is

a_t = { accelerated climb, if U_z ≥ Ũ_z; accelerated descent, if U_z < Ũ_z }, (30)

where U_z and Ũ_z represent the altitudes of U and Ũ, respectively: the higher UAV accelerates upward and the lower UAV accelerates downward. Second, to ensure that the UAVs can realize olive formation flight, and considering the flight characteristics and ease of control of the UAV, we use a continuous uniform left turn to realize circling flight, denoted as strategy (31). However, strategy (31) alone is not enough; the two UAVs must also know when to start flying in olive formation, who starts first and how to switch. It is assumed that U_1 is the leader and U_2 is the wingman. In the BVR tracking stage, the leader first uses strategy (31). From (2), the decision making time required by the UAV to complete a circle is

T_olive = 2πv cos γ / (g n_z sin µ). (32)
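As a small check of the circling period formula, it can be computed directly from the yaw-rate term of the dynamic model (2), since one circle takes 2π divided by the yaw rate; SI units are assumed.

```python
import math

G = 9.81  # gravitational acceleration, m/s^2

def olive_period(v, gamma, nz, mu):
    """Time for one full circle under a constant-bank turn:
    T_olive = 2*pi*v*cos(gamma) / (g*nz*sin(mu)), i.e. 2*pi / psi_dot from (2)."""
    return 2.0 * math.pi * v * math.cos(gamma) / (G * nz * math.sin(mu))
```

The period scales linearly with speed: a UAV flying twice as fast needs twice as long to close the same circle at the same overload and bank angle.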
Finally, the maneuver control in the take-off, BVR tracking and WVR attack stages is realized by π(a|s)_U. Therefore, the overall maneuver strategy Π(a|s)_U of the UAV at time t can be expressed as a stage-dependent combination of the collision avoidance strategy (30), the circling strategy (31) and the learned strategy (29), denoted as (33).
D_BVR represents the distance threshold of BVR air combat, and D_WVR represents the distance threshold of WVR air combat. If D_{U_t T_t} ≥ D_BVR, our UAV is in the take-off stage. If D_WVR ≤ D_{U_t T_t} < D_BVR, our UAV is in the BVR tracking stage. If D_{U_t T_t} < D_WVR, our UAV is in the WVR attack stage. The autonomous maneuver decision algorithm of the dual-UAV olive formation is sketched in Algorithm 2.
Algorithm 2 Maneuver strategy of two-UAV olive formation in air combat
1: Load trained neural network Q(s, a, θ)
2: Initialize the states of the leader, wingman and target (s_{U1,0}, s_{U2,0}, s_{T,0})
3: Initialize target maneuver strategy π(a|s)_T
4: for step = 1 to maxstep do
5:   for U in A do
6:     Calculate D_{UT}
7:     Calculate D_{UŨ}
8:     Execute a_t = Π(a|s)_U
9:     [∆v, ∆γ, ∆ψ] is obtained according to (2)
10:    [∆x, ∆y, ∆z] is obtained according to (1)
11:    Get the next state s_{U,t+1}
12:    s_{U,t} = s_{U,t+1}
13:  end for
14:  The target moves to the next state s_{T,t+1} according to the strategy π(a|s)_T
15:  if the UAVs satisfy the winning condition (17) then
16:    UAVs win
17:    break
18:  end if
19:  if the target satisfies the winning condition (17) then
20:    target wins
21:    break
22:  end if
23:  s_{T,t} = s_{T,t+1}
24: end for

First, the trained neural network Q(s, a, θ) and the maneuver strategy of the target π(a|s)_T are loaded. The states s_{U1,0}, s_{U2,0} and s_{T,0} of the UAVs and the target are initialized, where s_{U1,0} and s_{U2,0} are the initial states of our two UAVs and s_{T,0} is the initial state of the target. For each UAV, the distances D_{UT} and D_{UŨ} are calculated, and the maneuver strategy a_t is obtained according to (33). If the distance between the UAV and its friendly aircraft is less than D_min, the collision avoidance strategy (30) is implemented. If the UAV is in the takeoff stage, it selects its maneuver according to (29). If both UAVs enter the BVR tracking stage, the leader first executes strategy (31), with the number of execution steps obtained according to (32), while the wingman continues to select a_t according to (29). When the leader completes a circle according to (31) and (32), the flight strategies of the leader and the wingman are exchanged. In the BVR tracking stage, our two UAVs thus constantly alternate their maneuver strategies to maintain the continuity of attack and defense. In the WVR attack stage, both UAVs use (29) to complete the short-range combat. If either side satisfies (17), the air combat ends.
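The leader/wingman strategy exchange in the BVR tracking stage can be illustrated with a small generator (a simplified sketch; the action labels and interface are our own, not the paper's implementation):

```python
def role_schedule(total_steps, t_olive):
    """Yield (leader_action, wingman_action) per decision step.

    One UAV circles (strategy (31)) while the other tracks the target
    (strategy (29)); the roles swap every t_olive decision steps, i.e.
    whenever the circling UAV completes one full circle."""
    leader_circles = True
    for step in range(total_steps):
        if step > 0 and step % t_olive == 0:
            leader_circles = not leader_circles  # circle finished: exchange strategies
        yield ("circle", "track") if leader_circles else ("track", "circle")
```

Alternating the circling role in this way is what keeps one UAV pointed at the target at all times, maintaining the continuity of attack and defense.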

Simulation
This paper uses Python to establish the air combat environment model of dual-UAV olive formation tracking and attacking, and builds the DQN network model based on PyTorch.

Simulation Setup
The air combat environment parameters are set as shown in Table 3. The parameters of the DQN model are set as follows. According to the definitions of the state space and the maneuver library, the DQN has 13 input states and 15 output Q values. The online Q network and the target Q network are fully connected networks with three hidden layers of 512, 1024 and 512 units, respectively. The output layer has no activation function, and the hidden layers use the tanh activation. The learning rate is 0.001 and the discount factor is 0.9. The size of the experience pool is 5000, the batch size during training is 64, and the target network is updated every 300 steps. In the air combat simulation, the decision cycle T is set to 1 s, and one episode contains at most 100 decision steps. A round of training terminates if any of the following conditions is met: the height of a UAV is greater than z_max or less than z_min, or any UAV satisfies (17).
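Under these settings, the online and target Q networks could be constructed in PyTorch roughly as follows (a sketch consistent with the stated architecture, not the authors' code):

```python
import torch
import torch.nn as nn

def build_q_network():
    """Q network from the text: 13 state inputs, 15 action Q-values,
    hidden layers of 512/1024/512 units with tanh, linear output layer."""
    return nn.Sequential(
        nn.Linear(13, 512), nn.Tanh(),
        nn.Linear(512, 1024), nn.Tanh(),
        nn.Linear(1024, 512), nn.Tanh(),
        nn.Linear(512, 15),  # no activation on the output layer
    )

online_q = build_q_network()
target_q = build_q_network()
target_q.load_state_dict(online_q.state_dict())  # hard update every 300 steps
optimizer = torch.optim.Adam(online_q.parameters(), lr=0.001)
```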
In order to verify the effectiveness of the DQN algorithm based on priority sampling and dynamic adjustment of the scale factor proposed in this paper, we compare the loss values during training under four conditions with identical initial simulation conditions: w_1 set as a fixed value, as a piecewise function, and as a continuous function proportional to distance, respectively, and with priority sampling introduced. The loss value represents the difference between the online network and the target network: the larger the loss value, the larger the remaining network optimization space at this stage, and the greater the fluctuation of the loss value, the worse the convergence of the neural network. Figure 5 shows the results of the two-UAV formation air combat. Red and green are our two UAVs, and blue is the target. The initial positions of the three aircraft are fixed, namely (0, 0, 3000), (20,000, 20,000, 4000) and (500, 500, 3000), and the heading and pitch angles are initialized randomly. The target moves in a straight line at a constant speed. Our two UAVs take different maneuver control decisions in different air combat stages and cooperate to complete the tracking, defense and attack of the target. During the takeoff stage, our two UAVs track the target from a distance of about 30,000 m. In the BVR stage, the leader (red) first executes the circling strategy H(a|s)_U, while the wingman (green) continues to track the target. When the leader completes one circle, it switches back to tracking the target, and the wingman switches to the circling strategy H(a|s)_U. The two UAVs cooperate to maintain the continuity of attack and defense. When the distance between a UAV and the target is less than D_WVR, the two UAVs enter the close attack phase. Figure 5 shows that our two UAVs attack the target from the rear.
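A minimal sketch of priority sampling over the experience pool, assuming priorities proportional to the absolute TD error plus a small constant (the paper's exact prioritization scheme may differ):

```python
import random

def priority_sample(pool, td_errors, batch_size, eps=1e-3):
    """Sample transition indices with probability proportional to |TD error| + eps,
    so poorly predicted experiences are replayed more often than well-learned ones."""
    weights = [abs(e) + eps for e in td_errors]
    return random.choices(range(len(pool)), weights=weights, k=batch_size)
```

The small constant eps keeps every transition sampleable, so experiences with near-zero TD error are not starved entirely.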
Figure 6 shows the change of the distance between the two sides during the air combat, where red is the distance between the leader and the target, green is the distance between the wingman and the target, and black is the distance between our two UAVs. The ordinate is the distance in meters. It can be seen that the distance between a UAV and the target increases while it flies in circles, and continuously shortens during the BVR tracking and WVR air combat stages. Figure 7 shows the change of the lag angle of our two UAVs, where red is the lag angle between the leader and the target, and green is the lag angle between the wingman and the target. The ordinate is the lag angle in radians. The lag angle of a UAV varies from 0 to π while it flies in circles, and remains at a low level during the pursuit and close combat. Figure 8 shows the change of the reward function r_t of our two UAVs over the whole air combat, in which red is the leader and green is the wingman; the ordinate is r_t. It can be seen that r_t fluctuates obviously while our UAVs fly in circles, and remains positive during the pursuit and close combat. When the target enters the attack range of a UAV, the angle advantage function is added and our situation rises obviously. Figure 9 shows the change of the cumulative r_t, which also shows an overall upward trend. Figure 10 shows the loss values during training when w_1 is the fixed value w_1 = −20 and when w_1 is the piecewise function (19), in which red is w_1 = −20. Figure 11 shows the loss values during training when w_1 ∝ D_{U_t T_t} and when w_1 is the piecewise function (19), in which red is w_1 ∝ D_{U_t T_t}.
Figure 12 shows the loss comparison between introducing and not introducing priority sampling, with w_1 set as the piecewise function (19). The blue curve represents the DQN algorithm with priority sampling, and the red curve represents the DQN algorithm without priority sampling. In Figures 10-12, the abscissas represent training steps and the ordinates represent loss values, showing the changing trend of the loss over the whole training process. In addition, because the amount of data is too large to present clearly, and without affecting the loss trend, Figures 10-12 record the loss value every 300 training steps. It is easy to see from these figures that the fluctuation amplitude and range of the red curve are much larger than those of the blue curve. The network converges better when w_1 is a piecewise function, and priority sampling effectively improves the learning efficiency of the agents and accelerates the convergence of the neural network. In summary, by modeling and simulation, this paper addresses the following problems in the state of the art: (i) dimension explosion, solved by discretizing the air combat state space into a finite set and using a neural network to learn the decision making model; (ii) sparse and delayed reward, solved by designing a real-time reward function based on situation assessment; (iii) slow convergence, solved by an improved priority sampling strategy; and (iv) inadaptation to real air combat maneuver control, solved by incorporating the real UAV dynamic model and the comprehensive situation assessment model, verified in the classical two-agent olive formation scenario.

Conclusions
Based on reinforcement learning theory, an improved maneuver decision algorithm for UAV autonomous air combat is proposed in this paper. First, the UAV dynamic model and situation assessment model are established, and the UAV state space and action space are improved to solve the dimension explosion problem and make UAV maneuvering more flexible. Second, aiming at the problems of delayed reward and poor guidance ability, a reward function design method based on the relative situation and adaptive adjustment of the scale factor is proposed. Third, an improved priority sampling strategy is proposed to speed up learning. Fourth, based on the dual-UAV olive formation task, a hybrid maneuver strategy of collision avoidance, formation and confrontation is proposed to realize dual-UAV cooperative autonomous air combat decision making. The simulation results show that the improved method effectively improves the efficiency with which the UAV learns the confrontation maneuver strategy, and that the UAV air combat maneuver decision model based on deep reinforcement learning realizes self-learning of strategies. The improved deep reinforcement learning method has a faster training speed and a more stable effect.