1. Introduction
Compared with human-crewed aircraft, military UAVs have attracted much attention for their low cost, long endurance, and the absence of risk to a pilot's life [1]. With the development of sensor technology, computer technology, and artificial intelligence, the operational performance of military UAVs has improved significantly, and the range of tasks they can perform has expanded continuously. Although military UAVs can perform reconnaissance and ground-attack missions, most control decisions are still made by ground-station operators. Because ground-station command is difficult to adapt to fast-changing air battle scenes, it is ill-suited to directing a UAV in aerial combat [2]. Moreover, since UAVs carry fewer weapons and sensors than human-crewed aircraft, UAV engagements, unlike those of human-crewed aircraft, must be fought at close range, i.e., within visual range (WVR). Therefore, the autonomous close-range aerial combat of UAVs is an important research topic.
This study focuses on gun-based WVR aerial combat, commonly referred to as dogfighting. Although missiles have become the principal weapons for beyond-visual-range operations, particularly from second-generation fighters onward [3], their effectiveness and kill rate have been lower than expected. Therefore, for next-generation unmanned fighters, gun-based combat is considered the key to WVR engagements.
Since the early 1960s, a great deal of research has been devoted to autonomous aerial combat, and some remarkable results have been published. The aerial combat problem has been modeled as a pursuit-evasion game [4,5,6], and various theoretical and optimal control schemes provide solutions for autonomous aerial combat. Using differential game theory [7], aerial combat is modeled as a deterministic, complete-information pursuer-evader game. In Reference [8], approximate dynamic programming (ADP), a real-time autonomous one-to-one aerial combat method, was studied, and the results were tested in the real-time indoor autonomous vehicle test environment (RAVEN). ADP differs from classical dynamic programming in that it constructs a continuous function to approximate future returns, as sketched below; it therefore does not need to compute future rewards for each discrete state, which makes its real-time performance reliable. A more complex two-on-one engagement was treated as a differential game in three-dimensional space [9]. These theoretical approaches make a mathematical interpretation of the aerial combat problem possible. However, they may rely on oversimplified assumptions, such as fixed roles or two-dimensional motion, to reduce complexity and computation time.
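To make the distinction concrete, the following schematic (our notation, not taken from [8]) shows how ADP replaces the per-state table of classical dynamic programming with a parametric approximation of the value function:

```latex
% Classical DP stores J(x) for every discrete state x; ADP instead fits a
% continuous approximation \hat{J}(x;\beta), e.g. a weighted sum of k basis
% features \phi_i, trained to be self-consistent with the Bellman recursion.
\hat{J}(x;\beta) = \sum_{i=1}^{k} \beta_i\,\phi_i(x),
\qquad
\hat{J}(x;\beta) \approx \max_{u\in U}\Big[\, g(x,u) + \gamma\,\hat{J}\big(f(x,u);\beta\big) \Big],
```

where $g$ is the one-step reward, $f$ the state transition, $U$ the admissible control set, and $\gamma$ a discount factor. Because $\hat{J}$ can be evaluated cheaply at any continuous state, the lookahead can run in real time.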
Other approaches model autonomous aerial combat as rule-based heuristic systems that imitate the behavior of human pilots. In References [10,11,12], maneuver library design, control application, and maneuver identification based on basic fighter maneuver (BFM) expert systems are presented. Based on a combination of the BFM library, target prediction, and impact-point calculations, an autonomous aerial combat framework for two-on-two engagements was proposed in [9]. Meanwhile, the influence graph model has been used to model the pilot's decision-making process of choosing the right maneuver at each moment of aerial combat [13,14]. Rule-based systems produce realistic and reasonable simulation results. However, as the toy selector sketched below illustrates, it is difficult for them to cover all combat situations, to modify previously designed rules, or to add new rules.
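As a toy illustration of how such systems work, and of why they are brittle, consider a minimal rule-based BFM selector; the maneuver names, thresholds, and situation features below are hypothetical and are not taken from [10,11,12]:

```python
# Toy sketch of a rule-based BFM selector (illustrative only; the maneuver
# names and thresholds are hypothetical, not taken from the cited works).
from dataclasses import dataclass

@dataclass
class Situation:
    range_m: float        # slant range to the opponent
    aspect_deg: float     # angle off the opponent's tail
    closure_mps: float    # closure rate (positive = closing)

def select_bfm(s: Situation) -> str:
    """Return a basic fighter maneuver from hand-written if/then rules."""
    if s.range_m > 3000.0:
        return "pure_pursuit"            # close the distance first
    if s.aspect_deg < 30.0 and s.closure_mps > 50.0:
        return "lag_pursuit"             # avoid an overshoot
    if s.aspect_deg < 30.0:
        return "lead_pursuit"            # line up the gun solution
    return "high_yoyo"                   # trade speed for angles

print(select_bfm(Situation(range_m=1500.0, aspect_deg=20.0, closure_mps=80.0)))
```

Every situation not covered by an existing branch needs another hand-written rule, which is why extending or revising such rule bases is laborious.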
Artificial neural network and reinforcement learning methods have recently exhibited improved performance by generating effective new tactics in various simulation environments. Based on artificial neural networks, aerial combat maneuver decision-making can be learned from a large number of aerial combat samples with strong robustness [15,16,17,18]. However, such samples must contain multiple groups of time series together with the combat outcomes. Because aerial combat samples are difficult to obtain and must be labeled manually at each sampling time, the sample-acquisition problem limits the application of neural-network-based maneuver strategy generation. In Reference [19], an algorithm based on deep deterministic policy gradient (DDPG) and a new training method was proposed to reduce training time while obtaining sub-optimal but effective training results; a discrete action space was used, and the maneuvering decisions of an opponent aircraft were considered. However, current reinforcement learning methods all assume that the observed aerial combat state is accurate, which is inconsistent with real aerial combat. In addition, reinforcement learning has been applied to UAV flight control and multi-UAV cooperative flight control [20,21,22]. In Reference [20], deep Q-networks, policy gradient, and DDPG are used to design the control system of a 2-DOF flight attitude simulator, and the feasibility of model-free reinforcement learning is demonstrated experimentally. In Reference [21], output reference model tracking control of a nonlinear real-world two-input, two-output aerodynamic system is solved by iterative model-free approximate value iteration (IMF-AVI); theoretical analysis shows the convergence of IMF-AVI while accounting for approximation errors and explains the robust learning convergence of the NN-based IMF-AVI. In Reference [22], a DDPG-based UAV control policy addresses the combined problem of 3-D mobility of multiple UAVs and energy replenishment scheduling, ensuring energy-efficient and fair coverage of each user in a large region while maintaining persistent service.
Further practical improvements are required for WVR autonomous aerial combat. We propose a novel autonomous aerial combat maneuver strategy generation algorithm with high performance and high robustness based on the state-adversarial deep deterministic policy gradient (SA-DDPG) algorithm. To account for the errors of aircraft sensors, we model WVR aerial combat as a state-adversarial Markov decision process (SA-MDP), which introduces small adversarial perturbations on state observations; these perturbations do not alter the environment directly but can mislead the agent into making suboptimal decisions. SA-DDPG adds a robustness regularizer, related to an upper bound on the performance loss, to the actor network to improve the robustness of the aerial combat strategy (a schematic sketch is given below). At the same time, a reward shaping method based on maximum entropy inverse reinforcement learning (MaxEnt IRL) is proposed to improve the efficiency of the strategy generation algorithm. Finally, the efficiency of the strategy generation algorithm and the performance and robustness of the resulting aerial combat strategy are verified by simulation experiments. Our contribution in this paper is a novel autonomous aerial combat maneuver strategy generation algorithm with high performance and high robustness based on SA-DDPG. Unlike existing methods, the observation errors of the UAV are introduced into the air combat model, and a regularizer is introduced into the policy gradient to make the air combat maneuver strategy network more robust. Finally, to address the sparsity of the air combat reward function, we use MaxEnt IRL to design a shaping reward that accelerates the convergence of SA-DDPG.
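The following is a minimal PyTorch sketch of the idea behind a state-adversarial actor update, assuming an off-the-shelf DDPG setup: a regularizer penalizes how much the policy's action changes when the observed state is perturbed within a small budget. The network sizes, the perturbation radius eps, and the weight kappa are illustrative assumptions, and a single random perturbation stands in for the worst-case bound; this is a sketch of the principle, not the exact procedure of Section 3.

```python
# Sketch: DDPG actor step with a state-adversarial smoothness regularizer.
# Hyperparameters and architectures are illustrative, not the paper's.
import torch
import torch.nn as nn

state_dim, action_dim, eps, kappa = 8, 3, 0.05, 1.0

# Actor maps an observed state to a continuous action in [-1, 1]^action_dim.
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, action_dim), nn.Tanh())
# Critic scores state-action pairs; assumed trained by the usual DDPG TD loss.
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))
opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

s = torch.randn(32, state_dim)          # a batch of observed states

# Standard DDPG actor objective: maximize Q(s, pi(s)).
a = actor(s)
ddpg_loss = -critic(torch.cat([s, a], dim=1)).mean()

# Robustness regularizer: penalize the action shift under a bounded
# perturbation of the observation (a cheap surrogate for the upper bound
# on performance loss under state-adversarial perturbations).
s_adv = s + eps * torch.empty_like(s).uniform_(-1.0, 1.0)
reg = (actor(s_adv) - a).pow(2).sum(dim=1).mean()

loss = ddpg_loss + kappa * reg
opt.zero_grad()
loss.backward()
opt.step()
```

The weight kappa trades nominal performance against robustness: a larger value yields a policy whose actions vary less under observation noise, at some cost in attainable return.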
The remainder of this paper is organized as follows. Section 2 explains and defines the aerial combat model based on SA-MDP. Next, the specific theory and techniques for autonomous aerial combat maneuver strategy generation based on SA-DDPG are described in Section 3. A reward shaping method based on MaxEnt IRL is proposed in Section 4. Section 5 details the virtual combat environment and analyzes the performance of the proposed algorithm. The paper is concluded in Section 6. In this study, for simplicity, the UAV piloted by the proposed algorithm and its opponent are referred to as the attacker and the target, respectively.