UAV Autonomous Aerial Combat Maneuver Strategy Generation with Observation Error Based on State-Adversarial Deep Deterministic Policy Gradient and Inverse Reinforcement Learning

Abstract: With the development of unmanned aerial vehicle (UAV) and artificial intelligence (AI) technology, intelligent UAVs will be widely used in future autonomous aerial combat. Previous research on autonomous aerial combat within visual range (WVR) has limitations due to simplifying assumptions, limited robustness, and neglect of sensor errors. In this paper, in order to account for the errors of the aircraft sensors, we model WVR aerial combat as a state-adversarial Markov decision process (SA-MDP), which introduces small adversarial perturbations on state observations; these perturbations do not alter the environment directly, but can mislead the agent into making suboptimal decisions. We then propose a novel autonomous aerial combat maneuver strategy generation algorithm with high performance and high robustness based on the state-adversarial deep deterministic policy gradient algorithm (SA-DDPG), which adds a robustness regularizer, related to an upper bound on the performance loss, to the actor network. In addition, a reward shaping method based on the maximum entropy (MaxEnt) inverse reinforcement learning (IRL) algorithm is proposed to improve the efficiency of the aerial combat strategy generation algorithm. Finally, the efficiency of the strategy generation algorithm and the performance and robustness of the resulting aerial combat strategy are verified by simulation experiments. Our main contributions are three-fold. First, to introduce the observation errors of the UAV, we model air combat as an SA-MDP. Second, to make the air combat maneuver strategy network more robust in the presence of observation errors, we introduce a regularizer into the policy gradient. Third, to address the sparsity of the air combat reward function, we use MaxEnt IRL to design a shaping reward that accelerates the convergence of SA-DDPG.


Introduction
Compared with human-crewed aircraft, military UAVs have attracted much attention for their low cost, long flight time, and absence of risk to a pilot [1]. With the development of sensor, computer, and artificial intelligence technology, the operational performance of military UAVs has been significantly improved, and the range of tasks they can perform has continuously expanded. Although military UAVs can perform reconnaissance and ground attack missions, most control decisions are made by ground station controllers. Because it is difficult to adapt to the fast and changeable air battle scene, traditional ground station command makes it difficult to direct a UAV in aerial combat [2]. Another line of work considers weapon and sensor approximation errors and shows the robust learning convergence of an NN-based IMF-AVI. In Reference [22], a UAV control policy based on DDPG is proposed to address the combined problem of 3-D mobility of multiple UAVs and energy replenishment scheduling, which ensures energy-efficient and fair coverage of each user in a large region and maintains persistent service.
Further practical improvements are required for WVR autonomous aerial combat. We propose a novel autonomous aerial combat maneuver strategy generation algorithm with high performance and high robustness based on the SA-DDPG algorithm. In order to account for the errors of the aircraft sensors, we model WVR aerial combat as a state-adversarial Markov decision process (SA-MDP), which introduces small adversarial perturbations on state observations; these perturbations do not alter the environment directly, but can mislead the agent into making suboptimal decisions. SA-DDPG introduces a robustness regularizer, related to an upper bound on the performance loss, at the actor network to improve the robustness of the aerial combat strategy. In addition, a reward shaping method based on MaxEnt IRL is proposed to improve the efficiency of the aerial combat strategy generation algorithm. Finally, the efficiency of the strategy generation algorithm and the performance and robustness of the resulting aerial combat strategy are verified by simulation experiments. Our contribution in this paper is a novel autonomous aerial combat maneuver strategy generation algorithm with high performance and high robustness based on SA-DDPG. Unlike existing methods, the observation errors of the UAV are introduced into the air combat model, and a regularizer is introduced into the policy gradient to make the air combat maneuver strategy network more robust. Finally, to address the sparsity of the air combat reward function, we use MaxEnt IRL to design a shaping reward that accelerates the convergence of SA-DDPG.
The remainder of this paper is organized as follows. Section 2 explains and defines the aerial combat model based on the SA-MDP. Next, the specific theory and techniques for autonomous aerial combat maneuver strategy generation based on SA-DDPG are described in Section 3. A reward shaping method based on MaxEnt IRL is proposed in Section 4. Section 5 details the virtual combat environment and analyzes the performance of the proposed algorithm. The paper is concluded in Section 6. In this study, for simplicity, the UAV piloted by the proposed algorithm and its opponent are referred to as the attacker and the target, respectively.

State-Adversarial Markov Decision Process (SA-MDP)
Markov decision processes (MDPs) are widely used in robotics, economics, manufacturing, and automatic control [23]. An MDP is a mathematical framework for probabilistic modeling of the interactions between an agent and the environment [24]. The agent is the learner or decision-maker interacting with the environment: at each time step, it receives a reward and a representation of the environment's state, and imposes an action on the environment that may change its future state. This interaction between the agent and the environment is shown in Figure 1a.
A typical MDP is represented by a 6-tuple (S, A, T, γ, D, R), where S = {s_0, s_1, ..., s_n} is the set of possible states of the dynamic environment, A = {a_0, a_1, ..., a_n} is the set of actions the agent can select in a given state, and T is the state transition function, determined by the current state and the next action of the agent. The MDP assumes that the effect of performing an action in a given state depends only on the current state-action pair, not on the previous states and actions, that is,

P(s_{t+1} | s_t, a_t, s_{t−1}, a_{t−1}, ..., s_0, a_0) = P(s_{t+1} | s_t, a_t)    (1)

Equation (1) is called the Markov property [25]. The mapping relationship of T is shown in Equation (2).
T is the state transition probability matrix, which gives the probability of transitioning between every pair of states; γ ∈ [0, 1) is the discount rate that guarantees the convergence of total returns; D is the initial state distribution; and R is the reward function, which specifies the reward for taking an action in a state, with the mapping relationship shown in Equation (3).
The core objective of an MDP is to find a policy π for the agent, where the policy π : S → A specifies the action to take at the current state; π(s) = a denotes that action a is always executed in state s under policy π. The goal is to find the optimal policy π* that maximizes the cumulative discounted reward over an infinite horizon. The value function of policy π at state s is the expected discounted return of starting in state s and executing the policy. The Q-value function of policy π at state s with action a is the expected discounted return of starting in state s, taking action a, and thereafter executing the policy. The Bellman equations [26] for V^π(s) and Q^π(s, a) follow from these definitions, with the derivation for Q^π(s, a) analogous to that for V^π(s).

In SA-MDP [27], an adversary v(s) : S → S is introduced, as shown in Figure 1b. The role of the adversary is to perturb the observation of the agent, so that the action taken is π(v(s)); however, the environment still transitions from the true state s rather than from v(s). Since v(s) may differ from s, the action π(v(s)) may be suboptimal, and thus the adversary can reduce the reward earned by the agent. In practical control problems, the uncertainty of measurement or state estimation is often modeled as worst-case noise. To model the measurement uncertainty of the UAV, we assume that the error between the measured value v(s) and the real value s satisfies a small-disturbance constraint, so v(s) is restricted to values inside the l∞ ball containing s. Formally, v(s) ∈ B(s) is assumed, where B(s) is a small set of states with s ∈ B(s), consisting of task-specific "neighbouring" states of s.
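The interaction loop of the SA-MDP differs from a standard MDP only in where the perturbation enters: the policy sees v(s), but the transition uses the true s. A minimal sketch of this loop (all function names here are illustrative, not from the paper):

```python
def rollout(env_step, policy, adversary, s0, horizon, gamma=0.99):
    """SA-MDP rollout: the agent acts on the perturbed observation v(s),
    but the environment transitions from the true state s."""
    s, ret, discount = s0, 0.0, 1.0
    for _ in range(horizon):
        s_obs = adversary(s)   # v(s) in B(s): perturbed observation
        a = policy(s_obs)      # possibly suboptimal action pi(v(s))
        s, r = env_step(s, a)  # the transition uses the TRUE state s
        ret += discount * r
        discount *= gamma
    return ret
```

With an identity adversary this reduces to an ordinary MDP rollout; any v(s) ≠ s can only lower the return achievable by a policy tuned to exact observations.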

State Transitions
As for the state information s of the aircraft, this paper uses the ground coordinate system to describe the coordinates of the aircraft and to represent its path during aerial combat. The coordinate system in Figure 2b is used to describe the forces acting on the aircraft. The torque imbalance is ignored in the state-transition model, which simplifies the model, since this research focuses on the air maneuver decision algorithm. Therefore, a three-degree-of-freedom (3-DoF) particle motion model of the aircraft is selected to analyze its force characteristics. After coordinate transformation, a simplified aircraft dynamics equation can be obtained, as shown in Equation (13).
where v is the speed of the aircraft, g is the acceleration of gravity, n_x and n_z denote the tangential overload and the normal overload, and γ, Ψ, and φ denote the aircraft's pitch, yaw, and roll angles, respectively. n_x is used to change the speed of the aircraft; n_z and φ determine the rates of change of the pitch and yaw angles, which change the flight direction and altitude. According to the simplified dynamics equation, the change of the UAV coordinates over time can be obtained, yielding the differential equations of motion of the aircraft shown in Equation (14).
As can be seen from Equation (14), when the initial velocity v, initial pitch angle γ, and initial yaw angle Ψ of the aircraft are given, the coordinates of the aircraft in the ground coordinate system can be obtained by integration.
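Since Equations (13) and (14) are not reproduced here, the sketch below uses a common 3-DoF point-mass formulation consistent with the description (v, γ, and Ψ driven by n_x, n_z, and φ; positions obtained by integrating the velocity). Treat it as an assumed, simplified stand-in rather than the paper's exact equations:

```python
import math

G = 9.81  # gravitational acceleration, m/s^2

def step_3dof(state, nx, nz, phi, dt):
    """One Euler step of a common 3-DoF point-mass aircraft model.
    state = (x, y, z, v, gamma, psi): position, speed, pitch, yaw."""
    x, y, z, v, gamma, psi = state
    v_dot = G * (nx - math.sin(gamma))                       # speed change
    gamma_dot = (G / v) * (nz * math.cos(phi) - math.cos(gamma))
    psi_dot = G * nz * math.sin(phi) / (v * math.cos(gamma))
    x_dot = v * math.cos(gamma) * math.cos(psi)
    y_dot = v * math.cos(gamma) * math.sin(psi)
    z_dot = v * math.sin(gamma)
    return (x + x_dot * dt, y + y_dot * dt, z + z_dot * dt,
            v + v_dot * dt, gamma + gamma_dot * dt, psi + psi_dot * dt)
```

In this formulation, steady level flight (γ = 0, φ = 0, n_x = 0, n_z = 1) leaves speed, altitude, and heading unchanged, matching the qualitative behavior described above.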
First-order inertial links are introduced into the three control commands to make the control process of the aircraft model more consistent with a real aircraft. The responses to the commands (n_xcmd, n_zcmd, φ_cmd) are given by the actuator dynamics, where τ_x, τ_z, and τ_φ are the time constants of n_x, n_z, and φ, respectively.
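A first-order inertial link simply relaxes the actual value toward the commanded value with time constant τ. A one-line Euler discretization of this lag (illustrative, applied per command channel):

```python
def first_order_lag(current, command, tau, dt):
    """First-order inertial response: d(current)/dt = (command - current)/tau,
    discretized with a forward-Euler step of size dt."""
    return current + (command - current) * dt / tau
```

Applied at each simulation step to n_x, n_z, and φ with their respective time constants, the actual values exponentially approach the commanded values instead of jumping instantaneously.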

State Definition
The state of aerial combat can be fully described using the current motion parameters of the attacker and the target; that is, the inertial coordinate parameters, the velocity parameters, and the attitude angle parameters of the two aircraft. The position, velocity, and attitude angle parameters of the agent and the opponent can be represented as follows: x_a, y_a, z_a are the inertial coordinate parameters of the attacker in the ground coordinate system; x_o, y_o, z_o are the inertial coordinate parameters of the opponent in the ground coordinate system; v_a is the velocity of the attacker and v_o is the velocity of the target; γ_a is the pitch angle of the attacker and γ_o is the pitch angle of the target; φ_a is the roll angle of the attacker and φ_o is the roll angle of the target; Ψ_a is the yaw angle of the attacker and Ψ_o is the yaw angle of the target. In this paper, deep neural networks (DNNs) are used to approximate the critic network and actor network. In order to ensure good convergence of the deep neural networks, the above parameters are normalized to form the aerial combat state-space vector. The state-space vectors are displayed in Table 1.
Then, the errors of the position, height, speed, and attitude sensors of the attacker and of the target are modeled as a disturbance of the information around the real state, which can be random or adversarial. In this paper, a random disturbance is used as the error in the state, where B(s) is the bounded set of the state vector, an l∞ ball around the state vector s. The perturbation range of each state feature is determined individually, depending on the standard deviation of that feature.
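A random l∞ disturbance scaled per feature by its standard deviation might be sampled as follows (the scale factor is an assumed parameter, not a value from the paper):

```python
import numpy as np

def perturb_state(s, feature_std, scale, rng):
    """Random observation error: each feature is perturbed uniformly inside
    an l-infinity interval proportional to that feature's standard deviation."""
    eps = scale * feature_std            # per-feature half-width of B(s)
    return s + rng.uniform(-eps, eps, size=s.shape)
```

Because each feature gets its own interval, features with larger natural variation (e.g., position in meters) tolerate proportionally larger sensor noise than tightly bounded ones (e.g., normalized angles).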

Action Definition
In this paper, a continuous action space is used to control the aircraft. There are three continuous control commands referred to in Section 2.2.1 to control the aircraft. Therefore, the action space can be indicated by: (n xcmd , n zcmd , φ cmd )

Reward Function and Terminal Condition
In real WVR aerial combat scenes, both aircraft maneuver simultaneously, each attempting to position itself at the tail of the other. This position ensures a stable lock that the opponent finds challenging to escape, enabling an attack on the opposing aircraft. The combat geometry and parameters shown in Figure 2a are employed.
In Figure 2a, λ_a is the antenna train angle (ATA) of the attacker, which is the angle between the line-of-sight (LoS) vector and the attacker's velocity vector. The aspect angle (AA) of the attacker is the angle between the LoS vector and the target's velocity vector. Both angles can be obtained from the velocity vectors and the LoS vector ρ, where ρ denotes the LoS vector between the attacker and the target. The ATA and AA may take any value between ±180°, and can be obtained from Equations (19) and (20).
As can be seen from Figure 2a, a smaller |ATA| means that the attacker is aiming more accurately at the target, so |ATA| is inversely related to the shooting chance or offensive advantage. Similarly, a smaller |AA| indicates a lower probability of being hit by the target; therefore, |AA| is inversely related to the survival advantage. A termination condition for WVR aerial combat and the rewards can be designed from this quantitative analysis of the aerial combat objectives and related parameters; the reward function is designed using Equation (21). When an aircraft (assumed to be the attacker) satisfies the condition |ATA(s)| < 30° ∧ |AA(s)| < 60° and maintains it for a period of time, it is believed that the attacker can hold the position and complete the attack and destruction of the target. In this paper, since there is no kill simulation of the UAVs, the problem is simplified as follows: when the attacker continuously tracks the target for five decision periods (i.e., the attacker receives a reward of 1 in each of five consecutive periods), the attacker is considered to be stably tracking the target and wins. Similarly, when the target continuously tracks the attacker for five decision periods (i.e., the attacker receives a reward of −1 in each of five consecutive periods), the target is considered to be stably tracking the attacker, and the attacker is defeated.
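The ATA and AA follow from dot products with the LoS vector, and the advantage condition |ATA| < 30° ∧ |AA| < 60° can then be checked directly. A sketch (function names are illustrative):

```python
import math
import numpy as np

def angle_between(u, v):
    """Angle in degrees between two 3-D vectors."""
    c = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return math.degrees(math.acos(np.clip(c, -1.0, 1.0)))

def combat_angles(pos_a, vel_a, pos_t, vel_t):
    """ATA: angle(LoS, attacker velocity); AA: angle(LoS, target velocity)."""
    los = pos_t - pos_a                  # LoS vector rho, attacker -> target
    return angle_between(los, vel_a), angle_between(los, vel_t)

def attacker_advantage(ata, aa):
    """Advantage / termination condition from the text."""
    return abs(ata) < 30.0 and abs(aa) < 60.0
```

For example, an attacker directly behind a target with both flying in the same direction has ATA = AA = 0°, the ideal tail-chase geometry.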

Deep Deterministic Policy Gradient (DDPG)
DDPG is an algorithm that concurrently learns a Q-function and a policy: it uses off-policy data and the Bellman equation to learn the Q-function, and uses the Q-function to learn the policy. DDPG has been applied successfully to continuous, high-dimensional control problems. It is a model-free actor-critic algorithm based on the deterministic policy gradient [28]; in other words, the actor directly maps states to actions instead of outputting a probability distribution over a discrete action space. In DDPG, the policy is parameterized by θ, and the value function Q(s, a) is evaluated at the actions produced by this policy. The gradient of the objective J(θ) = E_{s∼μ}[G] can be expressed as shown in Equation (22). Since Equation (22) contains no expectation over actions, deterministic policies need less training data than stochastic ones and exhibit high efficiency, especially for action spaces of large dimension.
The policy and value functions of DDPG are approximated with deep neural networks. As in the DQN algorithm [29], experience replay [30] and dual network structures are introduced in DDPG to make the training process more stable and help the algorithm converge. DDPG is therefore an off-policy algorithm that samples trajectories from an experience replay buffer populated throughout training.
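The deterministic policy gradient of Equation (22) can be illustrated on a toy problem with a known critic Q(s, a) = −(a − 2s)² and a linear actor π_θ(s) = θs; ascending E[∇_a Q · ∇_θ π] drives θ toward the optimum θ = 2. This is a purely illustrative sketch, not the paper's networks:

```python
import numpy as np

rng = np.random.default_rng(0)

def q_grad_a(s, a):
    """Action-gradient of the known toy critic Q(s, a) = -(a - 2s)^2."""
    return -2.0 * (a - 2.0 * s)

theta = 0.0   # linear deterministic actor: pi_theta(s) = theta * s
lr = 0.05
for _ in range(500):
    s = rng.uniform(0.5, 1.5, size=16)       # minibatch of sampled states
    a = theta * s                            # deterministic actions
    # Deterministic policy gradient: grad_theta J = E[ dQ/da * dpi/dtheta ]
    theta += lr * np.mean(q_grad_a(s, a) * s)
```

Because the gradient involves no expectation over actions, a single deterministic rollout per state suffices, which is the sample-efficiency advantage noted above.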

State-Adversarial Deep Deterministic Policy Gradient (SA-DDPG)
In the SA-MDP, the optimal adversary v*_π(s) is the adversary that minimizes the total expected reward for a given π; the optimal adversarial value and action-value functions are defined accordingly, and the Bellman equation for the optimal adversary v*_π(s) can be written in the same form. According to Theorem 5 in [27], given a policy π for a standard MDP, an inequality relating the value functions of the standard MDP and the SA-MDP can be obtained, where D_TV(π(·|s), π(·|ŝ)) is the total variation distance between π(·|s) and π(·|ŝ), and the remaining term is a constant that does not depend on π. Equation (26) shows that as long as D_TV(π(·|s), π(·|ŝ)) is not too large for any ŝ close to s, the performance gap between V^π(s) and V^π_{v*}(s) is bounded. Equation (27) shows that as long as we penalize ‖π(·|s) − π(·|ŝ)‖_2, the total variation distance between the two smoothed distributions can be bounded. In DDPG, we parameterize the policy as a policy network π_θ and the critic as a critic network Q_θ. According to Equation (26), for each state we need to find max_{ŝ∈B(s)} D_TV(π(·|s), π(·|ŝ)), and we use max_{ŝ∈B(s)} ‖π(·|s) − π(·|ŝ)‖_2 as a surrogate. Note that the smoothing procedure can be done entirely at test time; during training, our goal is to keep max_{ŝ∈B(s)} ‖π(·|s) − π(·|ŝ)‖_2 small.
For the set B(s) = {ŝ : s_l ≤ ŝ ≤ s_u}, we can use the perturbation-analysis tool Interval Bound Propagation (IBP), described in Section 3.3, to obtain the upper and lower bounds of π_θ(ŝ), where the bounds l_π(s; θ) and u_π(s; θ) are differentiable functions of θ. From these, we obtain an upper bound on the norm of the difference. The SA-DDPG training procedure is shown in Algorithm 1. The main difference compared with regular DDPG is the additional loss term L_SA(θ), which provides an upper bound on max_{ŝ∈B(s)} ‖π_θ(s) − π_θ(ŝ)‖_2. If this term is small, the performance loss under an adversary is bounded.

Algorithm 1 State-Adversarial Deep Deterministic Policy Gradient (SA-DDPG).
1: Randomly initialize critic network Q(s, a; θ^Q) and actor π(s; θ^π) with weights θ^Q and θ^π
2: Initialize target networks Q′ and π′ with weights θ^{Q′} ← θ^Q, θ^{π′} ← θ^π
3: Initialize replay buffer R
4: for episode = 1 to M do
5:   Initialize a random process N for action exploration
6:   Receive initial observation state s_1
7:   for t = 1 to T do
8:     Select action a_t = π(s_t; θ^π) + N_t according to the current policy and exploration noise
9:     Execute action a_t and observe reward r_t and new state s_{t+1}
10:    Update critic by minimizing the loss:
14:    Obtain upper and lower bounds on π(s_i; θ^π) using IBP:
15:    Compute the upper bound on the l2 distance:
16:    Update the actor policy using the sampled policy gradient:
17:    Update the target networks:
18:  end for
19: end for

Interval Bound Propagation of Neural Network
The goal of IBP [31] is to find the lower and upper bounds in Equation (30). For clarity of presentation, we assume that the neural network is defined by a sequence of transformations h_k, one for each of its K layers; that is, the input z_0 is mapped through h_1, ..., h_K, and the output z_K ∈ R^N has N logits. The simplest approach is to bound the activation z_k of each layer by an axis-aligned bounding box (i.e., l_k(ε) ≤ z_k ≤ u_k(ε)) using interval arithmetic. For l∞ adversarial perturbations of size ε, bounds are obtained for each coordinate z_{k,i} of z_k, with the input bounds l_0(ε) = x_0 − ε and u_0(ε) = x_0 + ε. For fully connected layers of the form h_k(z_{k−1}) = W z_{k−1} + b, solving the optimization problems in Equation (36) can be done efficiently with two matrix multiplies, where |·| is the element-wise absolute-value operator. Propagating bounds through any element-wise monotonic activation function (e.g., ReLU, tanh, sigmoid) is trivial: concretely, if h_k is an element-wise increasing function, the bounds are obtained by applying h_k to the lower and upper bounds directly.
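The interval arithmetic of this section (two matrix multiplies per affine layer, direct propagation through monotonic activations) can be sketched for a two-layer tanh policy network, together with a regularizer built from the resulting bounds. Layer shapes and names are illustrative assumptions:

```python
import numpy as np

def ibp_bounds_policy(W1, b1, W2, b2, s, eps):
    """Propagate an l-infinity ball of radius eps around s through
    affine -> tanh -> affine, returning elementwise output bounds."""
    c, r = s.copy(), np.full_like(s, eps)      # input center and radius
    # Affine layer: center maps through W, radius through |W| (two matmuls).
    c1, r1 = W1 @ c + b1, np.abs(W1) @ r
    # Monotonic activation: apply tanh directly to the interval endpoints.
    lo1, hi1 = np.tanh(c1 - r1), np.tanh(c1 + r1)
    c2, r2 = (lo1 + hi1) / 2.0, (hi1 - lo1) / 2.0
    c3, r3 = W2 @ c2 + b2, np.abs(W2) @ r2
    return c3 - r3, c3 + r3                    # l_pi(s), u_pi(s)

def sa_regularizer(l_pi, u_pi):
    """L_SA surrogate: squared l2 norm of the bound width, which
    upper-bounds the worst-case policy change inside B(s)."""
    return float(np.sum((u_pi - l_pi) ** 2))
```

Any actual network output for a perturbed input inside the ball must lie within the returned bounds, so driving the bound width to zero also drives the true worst-case policy variation to zero.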

Maneuvering Strategy Generation Algorithm Outline
The algorithm framework is composed of the aerial combat environment model, SA-DDPG, policy generation of the target, and reward shaping modules. The overall framework of the maneuvering strategy generation algorithm is shown in Figure 3.
In Figure 3, the inputs of the module of the aerial combat environment model are the actions of both the UCAVs, and the output is the next state and the reward value. The module of SA-DDPG contains an actor and a critic.
The actor's input is the state vector described in Section 2.2.1, and the output is the action vector described in Section 2.2.3. In order to improve the stability of the learning process, two neural networks are used in the actor, one as the target network and the other as the evaluation network. The structures of the two networks are the same, but the update methods are different: the evaluation network is updated using the deterministic policy gradient, and the target network copies the evaluation network's parameters through soft updates.
The critic's input is the state vector and the action vector, and the output is the Q(s, a) value. As in the actor, there are two neural networks in the critic: one for the target network and the other for the evaluation network. The update method is also similar to that used for the actor.
The input of the reward-shaping module includes the state vector, the action vector, and the next state. The shaping reward is a designed reward supplementing the reward received from the aerial combat environment. The specific method of designing the shaping reward using MaxEnt IRL is explained in Section 5. Figure 3. Aerial combat maneuvering strategy generation algorithm using SA-DDPG.

Reward Shaping
The low learning efficiency of deep reinforcement learning has restricted its practical application. Therefore, shortening the training time has become an important issue.
Reward shaping is a common technique to improve learning performance in reinforcement learning tasks [32]. The concept of reward shaping is to provide complementary incentives to agents to encourage them to move to higher-reward states in the environment. These rewards and punishments, if applied arbitrarily, may divert agents from their intended objectives: agents then converge to a strategy that is optimal under the shaped reward but suboptimal for the primary task.
Reward shaping accelerates the convergence rate of reinforcement learning by introducing additional rewards, in an attempt to provide more informative rewards than those of the original environment. In general, the shaping reward function is expressed as follows: where R(s) denotes the binary reward function of the primary aerial combat task, and R_RS(s) denotes the shaping reward function.

Nonparameterized Features of Aerial Combat
The design of the shaping reward function is difficult because the behavior of pilots in aerial combat is difficult to describe. The shaping reward function also varies from pilot to pilot and may even change over time. A common way to design a shaping reward function is to represent it using manually chosen nonparameterized features. The nonparameterized features Φ(s) are functions of the states of aerial combat. We consider shaping reward functions as a linear combination of features of the following form: where w is the weight vector, and Φ(s) is the feature vector with each component representing a nonparameterized feature of aerial combat. In order to improve the efficiency of the SA-DDPG algorithm, the nonparameterized features of air combat selected in this paper are all continuous scalar functions sensitive to the air combat state. In this work, we define the nonparameterized features in Table 2. One can design the weight vector w to encourage or penalize certain features in the reward function, and then use reinforcement learning to learn the corresponding optimal policy by maximizing the total reward.
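Equation (40) is a dot product between a weight vector and a feature vector. Since Table 2 is not reproduced here, the two features below are placeholders (a normalized angle and a normalized range), not the paper's actual feature set:

```python
import numpy as np

def shaping_reward(s, w, features):
    """Linear shaping reward R_RS(s) = w . Phi(s), as in Equation (40)."""
    phi = np.array([f(s) for f in features])
    return float(w @ phi)

# Placeholder features (NOT the paper's Table 2); s = (angle_deg, range_m).
features = [lambda s: -abs(s[0]) / 180.0,  # normalized angle penalty
            lambda s: -s[1] / 1000.0]      # normalized range penalty
```

With this structure, tuning the reward reduces to choosing the weight vector w, which is exactly what the IRL procedure in the next subsection learns from expert demonstrations.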

Shaping Reward Modeling Using Maximum Entropy Inverse Reinforcement Learning
However, setting the weight vector of the features in Equation (40) by hand is hard, and a poorly chosen weight vector may affect the optimality of the aerial combat maneuvering strategy. Therefore, this paper uses another method: the weight vector w is learned from the demonstrations of aerial combat experts using an inverse reinforcement learning (IRL) algorithm [33]. IRL can be defined as the reverse of the RL problem; it assumes that the demonstrations from experts are generated by the optimal policy. In the IRL problem, one is given a number of time histories of the agent's behavior, consisting of past states and actions; these are often referred to as demonstrations. Ziebart [34] first applied the maximum entropy principle to the inverse reinforcement learning problem for cases where the reward function depends only on the current state and is represented as a linear combination of feature functions, as in Equation (40). Formally, IRL considers an MDP in which the reward function is unknown. Correspondingly, there is a demonstration set D = {ξ_1, ξ_2, ..., ξ_N} composed of expert demonstration trajectories, where each trajectory is a sequence of state-action pairs ξ_i = {s_0, a_0, s_1, a_1, ..., s_T, a_T}. Thus, we define an MDP without a reward function as the tuple (S, A, T, γ, D). Inverse reinforcement learning aims to learn the underlying reward function R.
Jaynes first proposed the principle of maximum entropy, and it has since been used in many areas of computer science and statistical learning [35]. In the basic maximum entropy formulation, we are given a set of samples of the target distribution and a set of constraints on the distribution, and we estimate the distribution as the maximum entropy distribution that satisfies these constraints. The probability of a demonstration ξ over all paths of duration T is calculated as in Equation (41), where the partition function Z(w) is a normalization constant; for a deterministic MDP, where the future state is uniquely determined by the action taken at the present state, this yields the corresponding solutions.
The goal of IRL is to find the optimal weight vector w* such that the likelihood of the observed demonstrations is maximal under the distribution in Equation (41). Maximizing the entropy of the path distribution under the constraint of matching the features of the observed data is equivalent to maximizing the likelihood of the observed data under the maximum entropy distribution derived above: w* = arg max_w L(w) = arg max_w log P(ξ|w). This function is convex for deterministic MDPs, and the optimal solution can be obtained by gradient-based optimization. The gradient is the difference between the empirical feature counts and the learner's expected feature counts, which can be expressed via the expected state visitation frequencies D_{s_i}, where Φ̄ := (1/N) ∑_{ξ∈D} Φ_ξ is the empirical feature count and Φ_ξ = ∑_{s∈ξ} Φ(s). From Equation (43), once the expected state frequencies D_{s_i} are given, the gradient can easily be computed for optimization. An algorithm to approximate the state frequencies is proposed in [34]. It approximates the state frequencies for the infinite time horizon using a large fixed time horizon: it recursively "backs up" from each possible terminal state and computes the probability mass associated with each branch along the way by computing the partition function of Equation (41) at each action and state. These branching values yield local action probabilities, from which the state frequencies at each timestep can be computed and summed to obtain the total state frequency counts. The procedure for approximating the expected empirical feature count is shown in Algorithm 2.
Algorithm 2 Approximate expected empirical feature count.
1: Z_{s_i,0} ← 1
2: for j = 1 to N do
3:   D_{s_i,t+1} = ∑_{a_{i,j}} ∑_k D_{s_k,t} P(a_{i,j} | s_i) P(s_k | a_{i,j}, s_i)
10: end for
11: D_{s_i} = ∑_t D_{s_i,t}

Since the state space and action space in the aerial combat model are continuous, Algorithm 2 cannot be used directly to estimate the expected empirical feature count. Therefore, in this paper, the state space and action space of the aerial combat model are discretized, with each parameter divided into multiple discrete values.
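Given the expected state visitation frequencies D_{s_i} from Algorithm 2 and a feature matrix over the discretized states, one gradient ascent step on the likelihood (the gradient of Equation (43)) can be sketched as follows (names are illustrative):

```python
import numpy as np

def maxent_irl_step(w, demo_feature_mean, state_freq, feature_matrix, lr):
    """One MaxEnt IRL update: the gradient is the empirical feature
    expectation minus the learner's expected features sum_s D_s * Phi(s)."""
    expected = feature_matrix.T @ state_freq   # learner's expected features
    return w + lr * (demo_feature_mean - expected)
```

The update has a natural fixed point: when the learner's expected feature counts match the demonstrations' empirical counts, the gradient vanishes and w stops changing.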

Aerial Combat Simulation Platform Construction
The aerial combat simulation platform was built using the HARFANG 3D framework [36], a software framework for developing modern multimedia applications and for modeling the shapes of aircraft. The HARFANG framework can manage and display complex 3D scenes, play sounds and music, and drive virtual reality (VR) devices. It is a new multimedia application development framework that is highly suitable for game development. Nowadays, ground controllers use VR equipment to control UAVs; hence, we adopted the HARFANG 3D framework for our proposed aerial combat simulation platform.
The platform can simulate aerial combat in a given airspace, use the dynamics equations referred to in Section 3.2 to simulate the flight characteristics of the aircraft, and set the performance parameters. The platform also exposes the aerial combat strategy interfaces of the two aircraft, through which the aircraft control signals are obtained from the external environment. The platform supports human control as well, with an interface that accepts input from a keyboard or a hands-on throttle-and-stick (HOTAS) system. The graphical user interface of the aerial combat simulation platform is shown in Figure 4. The main interface presents the real-time combat situation of the aircraft, and five positions on the screen display the health value, altitude and speed, radar image, attitude angles, and all the speed components.

Initial Setting for 1-vs-1 WVR Aerial Combat Engagement
As in Reference [37], Figure 5 shows the four different initial situations of the WVR aerial combat engagement, based on the initial relative positions and orientations of the UCAVs. The initial position of the attacker relative to the target in the horizontal direction is randomly chosen between 350 m and 1050 m in the offensive, defensive, and neutral cases. The two aircraft start at the same height.

Aircraft Performance Parameters Setting
For experimental comparison, two different sets of aircraft performance parameters were considered in this paper. Variation of the performance capabilities focuses on the maximum thrust and the maximum angular rates, as presented in Table 3. The dominant set of aircraft performance parameters is called the "advantage" parameters, and the non-dominant set the "disadvantage" parameters. The masses of both aircraft are the same, 10,000 kg each, and the speed range of both aircraft is [50, 250] meters per second. The most direct way to evaluate an aerial combat strategy is to conduct aerial combat confrontations against other strategies and determine the winner. We can also judge whether the flight track generated by an aerial combat strategy is reasonable by observing the flight tracks of both sides of the combat.
To quantitatively and accurately analyze the intelligence degree of an aerial combat strategy, this study proposes four metrics: intercept time T_I, defensive time T_D, offensive time T_O, and winning probability P_W. The intercept time is measured from the beginning of the aerial combat simulation until one aircraft reaches the position of advantage. The position of advantage is the termination condition of the aerial combat, expressed in Equation (21). The defensive time is the total time during which a UCAV has |AA| > 90°, and the offensive time is the total time during which a UCAV has |ATA| < 90°. The winning probability is the ratio of the number of aerial combat simulations won to the total number of aerial combat simulations.

Opponent Maneuvering Strategy
In this paper, the maneuver strategy of the target is designed using the minimax search tree (MST) method. The MST method is a model-based method that simplifies and models the combat engagement between two players as a chess-type game tree; see the schematic plot in Figure 6. The MST approach models each aircraft as flying one of a set of predefined candidate maneuvers. The candidate maneuvers comprise seven actions: maximum load factor left turn, maximum load factor right turn, maximum longitudinal acceleration, steady flight, maximum longitudinal deceleration, maximum load factor pull-up, and maximum load factor pull-over. At each decision point, both aircraft are assumed to take one of these seven elemental maneuvers. The end states of both aircraft are obtained by numerically integrating the equations of motion. Given the seven options available to each player, the terminal condition has 49 possibilities, represented by a seven-by-seven matrix. Based on the terminal orientation, range, velocity, and terrain, a scoring function is defined for each entry of the matrix. One player then takes the action that maximizes the terminal score, while the other minimizes it.
The scoring function is composed of an orientation score, a relative range score, and a velocity score. These contributions and the method of combining them were developed by incorporating various candidate functions into the program, performing numerous computer simulations, and studying the aircraft performance. The scoring function is defined in Equation (44),
where R_D is the expected attacking range of the weapons, and k is a coefficient adjusting the influence of Score(S) in the total score.
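The minimax selection described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the scoring matrix here is a random stand-in for the entries defined by Equation (44), and the maneuver labels are shorthand for the seven elemental maneuvers.

```python
import numpy as np

# Seven elemental maneuvers assumed at each decision point (shorthand labels).
MANEUVERS = [
    "max-load left turn", "max-load right turn", "max accel", "steady flight",
    "max decel", "max-load pull up", "max-load pull over",
]

def minimax_maneuver(score_matrix):
    """Return the maximizing player's best maneuver index.

    score_matrix[i, j] is the terminal score when the maximizer flies
    maneuver i and the minimizer flies maneuver j. The maximizer assumes
    the opponent replies with the worst case (min over j), then picks the
    row with the best guaranteed value.
    """
    worst_case = score_matrix.min(axis=1)   # opponent minimizes each row
    return int(np.argmax(worst_case))       # maximize the guaranteed score

# Toy example with a random 7x7 terminal-score matrix (illustrative only).
rng = np.random.default_rng(0)
scores = rng.uniform(-1.0, 1.0, size=(7, 7))
best = minimax_maneuver(scores)
print("chosen maneuver:", MANEUVERS[best])
```

A deeper look-ahead level, as used for the target in the experiments below, would apply this one-step selection recursively over the game tree.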

Experiment Settings
Before the SA-DDPG algorithm can be used to obtain the aerial combat maneuver strategy, the MaxEnt IRL algorithm must first be run to obtain the shaping reward that is provided to the SA-DDPG algorithm. To complete the experiment, we used the aerial combat simulation platform to manually collect 500 different aerial combat engagement trajectories. The initial states of these trajectories were generated randomly, the target used the MST method for the confrontation, and its look-ahead level was set to 4.
Each component of the aerial combat features defined in Table 2 is evenly divided into 20 bins, and each component of the action space is evenly divided into five bins. The gradient descent method is used to solve for w* in Equation (42), with a learning rate of 0.01. The initial weights are set to −0.25.
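The weight update driving the convergence in Figure 7 can be sketched as below. This is a minimal illustration of the standard MaxEnt IRL gradient (expert feature expectations minus the soft-optimal policy's feature expectations), not the paper's Equation (42) itself; the policy-expectation callback `mu_policy_fn` and the toy model at the bottom are assumptions introduced only to make the loop runnable.

```python
import numpy as np

def maxent_irl_fit(mu_expert, mu_policy_fn, w0, lr=0.01, steps=6000):
    """Gradient ascent on the MaxEnt IRL log-likelihood for r(s) = w . phi(s).

    mu_expert    : expert feature expectations (e.g., from the 500 demos)
    mu_policy_fn : callable w -> feature expectations of the soft-optimal
                   policy under reward w . phi (assumed given here)
    """
    w = np.array(w0, dtype=float)
    for _ in range(steps):
        grad = mu_expert - mu_policy_fn(w)  # MaxEnt likelihood gradient
        w += lr * grad
    return w

# Toy stand-in: policy expectations deviate linearly from the expert's as w
# moves away from a hypothetical optimum w_star, so the fit should recover it.
w_star = np.array([-0.8, -0.6, -0.3, -0.1])
toy_mu = lambda w: 0.5 * (w - w_star)
w_hat = maxent_irl_fit(np.zeros(4), toy_mu, w0=np.full(4, -0.25))
print(w_hat)
```

With the 0.01 learning rate and 6000 steps used above, the toy weights contract toward the optimum geometrically, mirroring the convergence trend reported for Figure 7.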

Experiment Result and Analysis
The curves of the weights w of the linear combination of features are shown in Figure 7. As can be seen from the figure, the curves show a trend of convergence, and the weights all converge to negative values after 6000 steps of learning. The optimal weights w* are shown in Table 4.
In Table 4, the weight components are all negative, which indicates that the principle used by the aerial combat experts is to reduce |ρ|, |AA|, |ATA|, and |v_a − v_o| as much as possible to complete the aerial combat, which is intuitively reasonable.
To quantitatively determine the effects of the four aerial combat features on the total shaping reward, the color map of the shaping reward over the |AA| and |ATA| features and the color map over the |ρ| and |v_a − v_o| features are drawn in Figure 8a and Figure 8b, respectively. Comparing the two maps, the influence of |AA| and |ATA| on the shaping reward is far greater than that of |ρ| and |v_a − v_o|. Therefore, the advantage of relative angle is more important than the advantages of relative distance and speed in aerial combat, especially in short-range aerial combat.
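The dominance of the angle features can be illustrated with the linear shaping reward itself. The weight values below are placeholders, not the converged values in Table 4, and the features are assumed normalized to [0, 1]; only the sign pattern (all negative) and the relative magnitudes (angle weights larger) follow the qualitative findings above.

```python
import numpy as np

# Hypothetical weights for [|rho|, |AA|, |ATA|, |v_a - v_o|]; all negative as
# in Table 4, with larger magnitudes on the two angle features as in Figure 8.
W = np.array([-0.2, -0.9, -0.9, -0.1])

def shaping_reward(features):
    """r_shape(s) = w . phi(s); worse angles/range/speed gap -> more negative."""
    return float(W @ np.asarray(features, dtype=float))

base = shaping_reward([1.0, 1.0, 1.0, 1.0])          # worst case everywhere
angle_better = shaping_reward([1.0, 0.5, 0.5, 1.0])  # halve |AA| and |ATA|
range_better = shaping_reward([0.5, 1.0, 1.0, 0.5])  # halve |rho| and |v_a - v_o|
print(angle_better - base, range_better - base)
```

Under these assumed weights, improving the two angle features raises the shaping reward far more than improving range and closing speed, consistent with the color maps in Figure 8.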

Experiment Settings
In this section, a comparative analysis of the training methods is performed. Five training algorithms are used to train the aerial combat maneuver strategy: the DDPG algorithm without reward shaping (DDPG), the DDPG algorithm with reward shaping (RS-DDPG), and the SA-DDPG algorithm with reward shaping and with the learning rate κ in Equation (33) set to 0.1, 1, and 10 (RS-SA-DDPG-0.1, RS-SA-DDPG-1, and RS-SA-DDPG-10, respectively).
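The role of κ can be made concrete with a sketch of the robustness penalty it weights. This is an illustration of the general idea behind a state-adversarial regularizer (penalizing how much the actor's output can change under a bounded observation perturbation), not the paper's Equation (33); the inner maximization is approximated here by random sampling, and the linear actor is a stand-in.

```python
import numpy as np

def robustness_penalty(actor, s, epsilon, n_samples=32, rng=None):
    """Approximate max over perturbed states of ||actor(s~) - actor(s)||^2.

    Perturbations are sampled uniformly within +/- epsilon on each state
    component (a sampling relaxation of the true inner maximization).
    """
    rng = rng or np.random.default_rng(0)
    a = actor(s)
    worst = 0.0
    for _ in range(n_samples):
        s_tilde = s + rng.uniform(-epsilon, epsilon, size=s.shape)
        worst = max(worst, float(np.sum((actor(s_tilde) - a) ** 2)))
    return worst

# The actor loss would then combine the usual DDPG objective with the
# penalty, schematically: loss = -Q(s, actor(s)) + kappa * penalty.
linear_actor = lambda s: 0.5 * s   # stand-in policy
p = robustness_penalty(linear_actor, np.zeros(4), epsilon=0.1)
print(p)
```

A larger κ places more weight on keeping the action stable under observation error, which is consistent with the slower but more robust convergence reported below for RS-SA-DDPG-10.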
When training the aerial combat strategies, the initial states of the two aircraft are randomly generated according to the four situations described in Section 5.1.2. At the same time, to prevent the two aircraft from falling into the "boring" Nash equilibrium of "biting the tail" during training, the performance parameters of the two aircraft are set to be different from one another: the attacker uses the "advantage" performance parameters, and the target uses the "disadvantage" performance parameters. The target uses the MST method for the confrontation, and its look-ahead level is set to 4.
The length of the actor network's input vector is 14, and that of its output vector is 3. The length of the critic network's input vector is 17, and that of its output vector is 1. The online actor network, target actor network, online critic network, and target critic network are all three-layer fully connected neural networks with 64 nodes in each hidden layer. The output layers of the actor and critic networks have no activation function, whereas the input and hidden layers all use ReLU activations. The learning rate of the networks is 0.01, the discount factor γ is 0.9, and the soft update factor of the target networks is 0.01.
Additionally, the weights of the neural networks are initialized using the Xavier initializer. The batch size for updating the networks is 1024, and the size of the experience replay memory is set to 10^6. The perturbation range for all normalized state components is 0.1.
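The network shapes above can be sketched in plain NumPy for illustration. This is a minimal stand-in for the actual training networks (no optimizer, target copies, or replay buffer): a 14→64→64→3 actor and a 17→64→64→1 critic with Xavier-uniform initialization, ReLU hidden layers, and linear output layers, as specified above.

```python
import numpy as np

def xavier(fan_in, fan_out, rng):
    """Xavier (Glorot) uniform initialization."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def make_mlp(sizes, rng):
    """Fully connected layers as (weight, bias) pairs."""
    return [(xavier(a, b, rng), np.zeros(b)) for a, b in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    # ReLU on all but the last layer; linear output (no activation), as above.
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x

rng = np.random.default_rng(0)
actor = make_mlp([14, 64, 64, 3], rng)   # state (14) -> control commands (3)
critic = make_mlp([17, 64, 64, 1], rng)  # state + action (17) -> Q value (1)

a = forward(actor, np.zeros(14))
q = forward(critic, np.zeros(17))
print(a.shape, q.shape)
```

The critic input length of 17 is consistent with concatenating the 14-dimensional state with the 3-dimensional action.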

Experiment Result and Analysis
The curves of the average reward of the aircraft are shown in Figure 9. Each curve is the average reward over all steps in every 500 episodes. Apart from DDPG, which has only the original reward signal, the rewards of the other four algorithms all contain the shaping reward signal. As can be seen from Figure 9, all algorithms except DDPG tend to converge. This indicates that introducing the shaping reward significantly improves the algorithm's efficiency in learning the aerial combat maneuver strategy. Comparing the RS-DDPG algorithm with the three RS-SA-DDPG algorithms shows that the training convergence speed of the RS-SA-DDPG algorithms is slightly better than that of RS-DDPG, and their final convergence results are also better. Comparing the convergence curves of the three RS-SA-DDPG algorithms with different κ values shows that as κ increases, convergence becomes slightly slower but reaches a better result, yielding a better aerial combat maneuver strategy.

Experiment Settings
In this section, the intelligence degree of the aerial combat maneuver strategy obtained with the RS-SA-DDPG-1 algorithm is evaluated against a target using a uniform linear motion maneuver strategy and a four-look-ahead MST maneuver strategy. The performance parameters of both aircraft are kept at "Advantage" to ensure that the performance of the two aircraft is the same, and the perturbation range is 0.1. The initial situation of the attacker is set to "Defensive" when the attacker fights the target with the uniform linear motion maneuver strategy, and to "Opposite" when the attacker fights the target with the MST maneuver strategy.

Experiment Result and Analysis
The results of the simulation scenario against the uniform linear motion maneuver strategy are presented in Figure 10. The trajectories are depicted at intervals of 1 s. The downward-pointing triangles on the trajectories represent the initial positions, the crosses represent the end positions of the aerial combat, and circular markers are placed every 10 s. To gain the situational advantage, the attacker performs a somersault maneuver so that the target quickly overshoots it at about 30 s. The attacker then speeds up to catch up with the opponent and finally slows down to prevent overshooting and steadily tracks the target. As a result, the target is successfully destroyed by gunfire at 50 s. It can also be seen from Figure 10 that the overall feature values of the attacker relative to the target converge from their initial positive values to zero, with some oscillations in the middle, but in the end they remain stably at zero, i.e., in an advantageous situation.

The results of the simulation scenario against the four-look-ahead MST maneuver strategy are presented in Figure 11. The attacker first accelerates to approach the target and quickly overshoots at about 20 s, with the target gaining the advantageous situation. Then, at 30 s, the attacker makes a left turn to bring the engagement into the opposite situation. After 50 s, the attacker takes the advantageous position. Although the target adopts a series of acceleration and turning maneuvers to try to escape, the agent aircraft remains in the advantageous situation and gradually stabilizes its tracking. From these two simulation results, it can be concluded that the aerial combat maneuver strategy trained with the RS-SA-DDPG-1 algorithm is effective.
In addition, 1000 Monte Carlo simulations with random initial situations were performed to verify the general validity of the performance in air combat between the strategy of the RS-SA-DDPG-1 algorithm and the four-look-ahead MST method. The results are presented in Figure 12. As can be seen from Figure 12, regardless of the initial aerial combat situation, the RS-SA-DDPG-1 strategy wins the air combat with high probability; the overall winning probability reaches 63.01%, and the average intercept time is 60.09 s.

Experiment Settings
In this section, two experiments are designed to test the robustness of the aerial combat strategy. First, we vary the perturbation range to evaluate its impact on the effectiveness of the aerial combat strategies obtained by the RS-SA-DDPG-1 and RS-DDPG algorithms. The perturbation range is set to {0, 0.1, 0.2, 0.3, 0.4}, and for each value, we conduct 1000 aerial combat simulations with randomly generated initial states and record the result of each engagement. Second, we vary the look-ahead level of the target's MST maneuver strategy to evaluate its impact on the effectiveness of the same two strategies. The look-ahead level is set from 1 to 6, and for each level, we again conduct 1000 aerial combat simulations with randomly generated initial states and record the result of each engagement.
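The evaluation loop for the perturbation-range sweep can be sketched as follows. The combat simulator is stubbed out: `run_engagement` is a hypothetical hook into the platform, replaced here by a random win/loss draw whose probabilities are placeholders (not the measured results in Figure 13), so that only the sweep structure is shown.

```python
import numpy as np

def run_engagement(policy, perturb_range, rng):
    # Placeholder: real code would roll out the SA-MDP with each normalized
    # state component perturbed uniformly within +/- perturb_range. Here the
    # robust policy's win probability is flat and the non-robust one degrades
    # with the perturbation (illustrative numbers only).
    degrade = 0.0 if policy == "RS-SA-DDPG-1" else 1.0
    win_prob = max(0.0, 0.8 - degrade * perturb_range)
    return rng.random() < win_prob

def win_rate(policy, perturb_range, n=1000, seed=0):
    """Monte Carlo win rate over n randomized engagements."""
    rng = np.random.default_rng(seed)
    wins = sum(run_engagement(policy, perturb_range, rng) for _ in range(n))
    return wins / n

for eps in (0.0, 0.1, 0.2, 0.3, 0.4):
    print(eps, win_rate("RS-SA-DDPG-1", eps), win_rate("RS-DDPG", eps))
```

The look-ahead-level sweep follows the same structure, with the target's MST depth varied from 1 to 6 instead of the perturbation range.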

Experiment Result and Analysis
The results of the experiments are shown in Figure 13. First, the robustness of the aerial combat strategy with respect to the perturbation range of the observation error is analyzed. In Figure 13a, when the perturbation range is 0, the winning rates of the aerial combat strategies obtained by the RS-SA-DDPG-1 and RS-DDPG algorithms are both relatively high, at about 0.8. This shows that introducing the robustness regularization does not degrade the performance of the aerial combat strategy network. As the perturbation range increases, the winning rate of the strategy obtained by RS-SA-DDPG-1 remains relatively stable, while the winning rate of the strategy obtained by RS-DDPG gradually declines to about 40%. This shows that the robustness of the policy network with regularization is better than that without regularization. Next, the robustness of the aerial combat strategy with respect to the target's intelligence level is analyzed. In Figure 13b, as the look-ahead level of the target increases, the winning rates of the strategies obtained by both algorithms gradually decrease, but the rate of decline of RS-SA-DDPG-1 is smaller than that of RS-DDPG. These two experiments show that the aerial combat strategy obtained by the RS-SA-DDPG-1 algorithm is more robust than that obtained by the RS-DDPG algorithm, whether the perturbation range or the intelligence degree of the target is varied.

Conclusions
To generate an intelligent and robust aerial combat maneuver strategy, this paper proposes an algorithm for generating an aircraft's autonomous maneuver strategy based on the SA-DDPG algorithm and inverse reinforcement learning. To model the measurement errors of the aircraft sensors in aerial combat, the aerial combat process is modeled as an SA-MDP. The experimental results show that the shaping reward obtained through inverse reinforcement learning can effectively improve the speed of aerial combat strategy learning. Meanwhile, a comparison of the aerial combat simulation results of the strategies obtained by the DDPG and SA-DDPG algorithms shows that the strategies obtained by the SA-DDPG algorithm are highly robust. This paper studies only the aircraft's maneuver decisions; the control strategies for weapons and sensors are not discussed and remain for future study.

Conflicts of Interest:
The authors declare no conflicts of interest.