Article

Anti-Interception Guidance for Hypersonic Glide Vehicle: A Deep Reinforcement Learning Approach

1
College of Astronautics, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
2
School of Computer Science and Engineering, Southeast University, Nanjing 210096, China
*
Author to whom correspondence should be addressed.
Aerospace 2022, 9(8), 424; https://doi.org/10.3390/aerospace9080424
Submission received: 8 April 2022 / Revised: 26 July 2022 / Accepted: 1 August 2022 / Published: 4 August 2022
(This article belongs to the Section Aeronautics)

Abstract

Anti-interception guidance can enhance the penetration capability of a hypersonic glide vehicle (HGV) confronted by multiple interceptors. In general, anti-interception guidance for aircraft can be divided into procedural guidance, fly-around guidance and active evading guidance. However, these guidance methods cannot be applied to an HGV facing an unknown, real-time adversarial process because of limited intelligence information and onboard computing capacity. In this paper, an anti-interception guidance approach based on deep reinforcement learning (DRL) is proposed. First, the penetration process is formulated as a generalized three-body adversarial optimization (GTAO) problem. The problem is then modelled as a Markov decision process (MDP), and a DRL scheme consisting of an actor-critic architecture is designed to solve it. We propose a new mechanism called repetitive batch training (RBT): reusing the same sample batch during training reduces serious estimation errors in the critic network (CN), which in turn provides better gradients to the immature actor network (AN). The training data and test results confirm that RBT improves on traditional DDPG-based methods.

1. Introduction

A hypersonic glide vehicle (HGV) combines the advantages of a ballistic missile and a lifting-body vehicle. It is efficient in that it does not require any additional complex anti-interception mechanisms (e.g., carrying a defender to destroy an interceptor [1] or relying on the range advantage of a high-powered onboard radar for early evasion before the interceptor locks on [2]) and can achieve penetration through anti-interception guidance alone. Over the past decade, studies have focused on utilizing HGV manoeuvrability to protect against interceptors.
In general, anti-interception guidance for aircraft has three categories: (1) procedural guidance [3,4,5,6], (2) fly-around guidance [7,8], and (3) active evading guidance [9,10].
Early research on anti-interception guidance focused on procedural guidance. In procedural guidance, the desired trajectory (such as sine manoeuvres [5], square-wave manoeuvres [3], or snake manoeuvres [6]) is planned prior to launch based on factors such as the target position, interceptor capability and manoeuvre strategy, and the vehicle is guided along this fixed trajectory after launch. Imado et al. [11] studied lateral procedural guidance in a horizontal plane. Zhang et al. [12] proposed a midcourse penetration strategy using an axial impulse manoeuvre and provided a detailed trajectory design method; this penetration strategy does not require lateral pulse motors. To eliminate the re-entry error caused by the midcourse manoeuvre, Wu et al. [13] used the remaining pulse motors to return to a preset ballistic trajectory. Procedural guidance only requires planning a trajectory and injecting it into the onboard computer before launch, which is easy to implement and does not occupy onboard computing resources. With the advancement of interception guidance, however, the procedural manoeuvres studied in [11,12] may be recognized by interceptors, and their effectiveness cannot be guaranteed against advanced interceptors.
In conjunction with advances in planning and information fusion, flying around the detection zone has emerged as a penetration strategy. The primary objective is to plan a trajectory that evades an enemy's detection zone, which requires solving a complex nonlinear programming problem with multiple constraints and stages. In terms of the detection zone, Zhang et al. [14] established infinitely high cylindrical and semi-ellipsoidal detection zone models under a flat-earth assumption and optimized a trajectory that satisfies the waypoint and detection zone constraints. Zhao et al. [7] proposed adjusting the interval density with curvature and error criteria based on the multi-interval pseudo-spectral method, constructed an adaptive pseudo-spectral method to solve the multi-interval problem, and allocated the number of points in each interval; a rapid trajectory optimization algorithm was also proposed for the whole course under multiple constraints and multiple detection zones. However, fly-around guidance has insufficient adaptability on the battlefield. Potential re-entry points have a wide distribution, and the circumvention methods studied in [7,14] cannot guarantee that a planned trajectory will meet energy constraints. In addition, there may not be a trajectory that can evade all detection zones in an enemy's key air-defence area. Moreover, it may be impossible to know an enemy's detection zones due to limited intelligence.
As onboard air-detection capabilities have advanced, active evading has gradually gained popularity in anti-interception guidance, and in recent years results have been obtained through differential game (DG) theory and numerical optimization algorithms. In DG, Hamilton functions are built from an adversarial model and then solved by numerical algorithms to find the optimal control using real-time flight states. Xian et al. [15] conducted research based on DG and derived a set of evasion manoeuvre strategies. Based on an accurate model of a penetrating spacecraft and an interceptor, Bardhan et al. [4] proposed guidance using a state-dependent Riccati equation (SDRE), which achieved superior combat effectiveness compared with traditional DG. However, DG requires many calculations and has poor real-time performance; model errors can be introduced during linearization [16], and onboard computers have difficulty achieving high-frequency corrections. The ADP algorithm was employed by Sun et al. [17,18] to address the horizontal-flight pursuit problem. In an attempt to reduce the computational complexity, neural networks were used to fit the Hamilton function online. However, it is unclear whether this idea can be adapted to an HGV with its giant flight envelope. Numerical optimization algorithms, for example, pseudo-spectral methods [19,20], have been used to discretize the complex nonlinear HGV differential equations and convert various types of constraints into algebraic constraints, after which optimization methods such as convex optimization or sequential quadratic programming are used to solve for the optimal trajectory. The main drawback of applying numerical optimization algorithms to HGV anti-interception guidance is that they occupy a considerable amount of onboard computing resources over a long period of time, and the required computation time increases exponentially with the number of aircraft. Due to these limitations, active evading guidance is unsuitable for engineering applications.
Reinforcement learning (RL) is a model-free approach to solving decision-making problems and has gained attention in the control field because it is entirely data-driven, does not require model knowledge, and can perform end-to-end self-learning. Due to the limitations of traditional RL, early research could not handle high-dimensional, continuous battlefield state information. In recent years, deep neural networks (DNNs) have demonstrated the ability to approximate arbitrary functions and have unparalleled advantages in feature extraction from high-dimensional data. Deep reinforcement learning (DRL) is a technique at the intersection of DNNs and RL, and its abilities can exceed those of human empirical cognition [21]. A wave of research into DRL has been sparked by successful applications such as Alpha-Zero [22] and Deep Q Networks (DQN) [23,24]. After training, a DNN can output a control command within milliseconds and generalizes well to unknown environments [25]. Therefore, DRL has promising applications in aircraft guidance. There has been some discussion regarding using DRL to train DNNs to intercept a penetrating aircraft [26]. Brian et al. [27] used reinforcement meta-learning to optimize an adaptive guidance system suitable for the approach phase of an HGV. However, no research has been conducted on the application of DRL to HGV anti-interception guidance, although some studies in other areas have examined similar questions. Wen et al. [28] proposed a collision avoidance method based on the deep deterministic policy gradient (DDPG) approach, in which a proper heading angle is obtained to guarantee conflict-free conditions for all aircraft. For micro-drones flying in orchards, Lin et al. [29] implemented DDPG to create a collision-free path. Guo et al. studied a similar problem, the difference being that DDPG was applied to an unmanned ship. Lin et al. [30] studied how to use DDPG to train a fully connected DNN to avoid collisions with four other vehicles by controlling the acceleration of the merging vehicle. For unmanned surface vehicles (USVs), Xu et al. [31] used DDPG to determine the switching time between path planning and dynamic collision avoidance. These studies led us to believe that DDPG-based methods have promising applications for solving the anti-interception guidance problem of HGVs. However, due to the differences in the objects of study, the anti-interception guidance problem for HGVs requires consideration of the following issues: (1) The performance (especially the available overload) of an HGV is time-varying with velocity and altitude, while the performance of the controlled object is fixed in the abovementioned studies; we therefore need to build a model in which the DDPG training process is not affected by time-varying performance. (2) The end state is the only concern in the anti-interception guidance problem, and only one instant reward is obtained in a training episode; sparsity and the delayed-reward effect are therefore more significant in this study than in the studies mentioned above. In this paper, we attempt to improve the existing DDPG-based methods to help DNNs gain intelligence faster.
The main contributions of this paper are as follows: (1) Anti-interception HGV guidance is described as an optimization problem, and a generalized three-body adversarial optimization (GTAO) model is developed. This model does not need to account for the severe constraints on the available overload (AO) and is suitable for DRL. To our knowledge, this is the first time that DRL has been applied to the anti-interception guidance of an HGV. (2) A DRL scheme is developed to solve the GTAO problem, and the RBT-DDPG algorithm is proposed. Compared with the traditional DDPG-based algorithm, the RBT-DDPG algorithm improves the learning of the critic network (CN), alleviates the exploration-utilization paradox, and achieves better performance. In addition, since the forward computation of a fully connected neural network is very simple, an intelligent network trained by DRL can quickly compute a command for anti-interception guidance that matches the high dynamic characteristics of the HGV. (3) A review of the HGV anti-interception strategies derived from the DRL approach is provided. To our knowledge, this is the first time these strategies have been summarized in semantic terms for HGV guidance, which may inspire the research community.
The remainder of this paper is organized as follows: Section 2 describes the problem of anti-interception guidance for HGVs as an optimization problem and translates the problem into solving a Markov decision process (MDP). In Section 3, the RBT-DDPG algorithm is given in detail. Moreover, we propose a specific design of the state space, action space, and reward functions that are necessary to solve the MDP using DRL. Section 4 examines the training and test data and the specific anti-interception strategy. Section 5 presents a conclusion to the paper and an outlook on intelligent guidance.

2. Problem Description

Figure 1 shows the path of an HGV conducting an anti-interception manoeuvre. The coordinates and velocities of the aircraft are known, as are the coordinates of the desired regression point (DRP). The HGV and the interceptors rely on aerodynamics to perform ballistic manoeuvres, and the aerodynamic forces are mainly determined by the angle of attack (AoA). As an HGV can glide for thousands of kilometres, this paper focuses on the guidance needed after the interceptors have locked onto the HGV, and the flight distance of this process is set to 200 km.

2.1. The Object of Anti-Interception Guidance

In the Earth-centered earth-fixed frame, the motion of the aircraft in the vertical plane is described as follows [32]:
$$
\begin{aligned}
\frac{\mathrm{d}v}{\mathrm{d}t} &= \frac{1}{m}\left(P\cos\alpha - C_X(v,\alpha)\,qS\right) - g\sin\theta\\
\frac{\mathrm{d}\theta}{\mathrm{d}t} &= \frac{1}{mv}\left(P\sin\alpha + C_Y(v,\alpha)\,qS\right) - \left(\frac{g}{v} - \frac{v}{R_0 + y}\right)\cos\theta\\
\frac{\mathrm{d}x}{\mathrm{d}t} &= \frac{R_0\,v\cos\theta}{R_0 + y}\\
\frac{\mathrm{d}y}{\mathrm{d}t} &= v\sin\theta
\end{aligned}
\tag{1}
$$
where x is the flight distance, y is the altitude, v is the flight velocity, θ is the ballistic inclination, g is the gravitational acceleration, $R_0$ is the mean radius of the Earth (the Earth's oblateness is ignored), $C_X(v,\alpha)$ and $C_Y(v,\alpha)$ are the drag and lift aerodynamic coefficients of the aircraft, respectively, α is the AoA, q is the dynamic pressure, S is the aerodynamic reference area, m is the mass of the vehicle, and P is the engine thrust.
Let subscripts H and I indicate variables subordinate to the HGV and the interceptor, respectively. Setting $x_H(t) = [x\ \ y\ \ v\ \ \theta]^T$ as the state of an axisymmetric HGV, Equation (1) can be rewritten as follows:
$$
\dot{x}_H(t) = f_H(x_H) + g_H(x_H)\,u(t) =
\begin{bmatrix}
\dfrac{R_0\,v\cos\theta}{R_0 + y}\\[2pt]
v\sin\theta\\[2pt]
-\dfrac{C_X\,qS}{m} - g\sin\theta\\[2pt]
-\left(\dfrac{g}{v} - \dfrac{v}{R_0 + y}\right)\cos\theta
\end{bmatrix}
+
\begin{bmatrix}
0\\ 0\\ 0\\ \delta_H(x_H)
\end{bmatrix} u(t)
\tag{2}
$$
where $\delta_H(x_H) = \max_\alpha \frac{1}{mv} C_Y(v,\alpha)\,qS$ is the maximum rate of inclination generated by aerodynamics in state $x_H$, and $u(t)\in[-1,1]$ is the guidance command.
Remark 1.
$\delta_H(x_H)$ is related to the altitude y, the velocity v and the available AoA, so its value varies with $x_H(t)$. This formulation confines $u(t)$ to a constant range, ensuring that the manoeuvring ability demanded by the guidance never exceeds the real-time AO over the giant flight envelope of the HGV.
Remark 2.
The reason for using $u(t)$ as the control variable, instead of directly using the AoA α, is that $\delta_H(x_H)$ can be calculated from an existing model and fed into the neural network (as shown in Section 3), thus sparing the network from having to learn the maximum available overload.
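As a concrete illustration, the following is a minimal sketch (ours, not the authors' implementation) of integrating the unpowered HGV point-mass dynamics of Equation (2) with an explicit Euler step. The drag acceleration and the maximum inclination rate δ_H are placeholders that must be supplied by an external aerodynamic/atmosphere model.

```python
import numpy as np

R0 = 6.371e6   # mean Earth radius, m
G0 = 9.81      # gravitational acceleration, m/s^2 (treated as constant here)

def hgv_dynamics(x_H, u, drag_accel, delta_H):
    """Right-hand side of Equation (2) for the unpowered HGV.

    x_H        : state [x, y, v, theta] (downrange m, altitude m, speed m/s, inclination rad)
    u          : guidance command in [-1, 1]
    drag_accel : C_X(v, alpha) * q * S / m, supplied by an external aero/atmosphere model
    delta_H    : maximum achievable inclination rate in the current state (the AO term)
    """
    x, y, v, theta = x_H
    dx = R0 * v * np.cos(theta) / (R0 + y)
    dy = v * np.sin(theta)
    dv = -drag_accel - G0 * np.sin(theta)
    dtheta = -(G0 / v - v / (R0 + y)) * np.cos(theta) + delta_H * u
    return np.array([dx, dy, dv, dtheta])

def step(x_H, u, drag_accel, delta_H, dt=0.1):
    """One explicit Euler step; a higher-order integrator would be used in practice."""
    return x_H + dt * hgv_dynamics(x_H, u, drag_accel, delta_H)

# Example: glide for one step with a full pull-up command (placeholder aero numbers).
x0 = np.array([0.0, 40_000.0, 3000.0, 0.0])
x1 = step(x0, u=1.0, drag_accel=2.0, delta_H=0.01)
```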
Since the long-range interceptor is in the target-lock state, its booster rocket is switched off and it flies unpowered, with state $x_I(t) = [x\ \ y\ \ v\ \ \theta]^T$. Its motion is described as follows:
$$
\dot{x}_I(t) = f_I(x_I, x_H) =
\begin{bmatrix}
\dfrac{R_0\,v\cos\theta}{R_0 + y}\\[2pt]
v\sin\theta\\[2pt]
-\dfrac{C_X\,qS}{m} - g\sin\theta\\[2pt]
G_I(x_I, x_H, t)
\end{bmatrix}
\tag{3}
$$
where $G_I(x_I, x_H, t)$ is the actual rate of inclination under the influence of the interceptor's guidance and control system.
Remark 3.
As the focus of this paper is on the centre-of-mass motion of the vehicles, it is assumed that the vehicles always follow the guidance commands under the effects of the attitude control system; errors in attitude control are therefore ignored in Equations (2) and (3), as are minor effects such as Coriolis forces and convected accelerations.
The HGV and N interceptors form a nonlinear system:
$$
\dot{x}_S(t) = f_S(x_S) + g_S(x_S)\,u(t)
\tag{4}
$$
where the system state is $x_S = \left[x_H^T\ \ x_{I_1}^T\ \ \cdots\ \ x_{I_N}^T\right]^T$, the nonlinear kinematics of the system are $f_S(x_S) = \left[f_H^T\ \ f_{I_1}^T\ \ \cdots\ \ f_{I_N}^T\right]^T$, and the nonlinear effect of the control is $g_S(x_S) = \left[g_H^T\ \ O_{1\times 4N}\right]^T$.
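The stacking in Equation (4) can be sketched as follows; `hgv_rhs` corresponds to Equation (2) (including the δ_H·u control term) and `interceptor_rhs` to Equation (3), both assumed to be provided elsewhere.

```python
import numpy as np

def system_dynamics(x_S, u, n_interceptors, hgv_rhs, interceptor_rhs):
    """f_S(x_S) + g_S(x_S) * u for Equation (4).

    x_S is the stacked state [x_H; x_I1; ...; x_IN], each block 4-dimensional.
    hgv_rhs(x_H, u)           -> 4-vector (includes the delta_H * u control term)
    interceptor_rhs(x_I, x_H) -> 4-vector (interceptor follows its own guidance)
    """
    x_H = x_S[:4]
    blocks = [hgv_rhs(x_H, u)]
    for i in range(n_interceptors):
        x_I = x_S[4 * (i + 1): 4 * (i + 2)]
        blocks.append(interceptor_rhs(x_I, x_H))
    return np.concatenate(blocks)
```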
The guidance system aims to control the system described in Equation (4) through $u(t)$ to achieve penetration. Therefore, the design of anti-interception guidance can be viewed as an optimal control problem. First, the HGV must successfully evade the interceptors during penetration and then reach the DRP $P_E = [x_E\ \ y_E]^T$ to conduct the follow-up mission. Let $t_f$ be the time at which the HGV arrives at $x_E$, and let $u(t)$ drive system (4) to state $x_S(t_f)$. The anti-interception guidance is designed to solve the following GTAO problem [33].
As mentioned in Equation (4), an HGV and its opponents form a system represented by x S . The initial value of x S is:
$$
x_S(t_0) = \left[x_{H,0}\ \ y_{H,0}\ \ v_{H,0}\ \ \theta_{H,0}\ \ \cdots\ \ x_{I_N,0}\ \ y_{I_N,0}\ \ v_{I_N,0}\ \ \theta_{I_N,0}\right]^T
\tag{5}
$$
The process constraint of penetration:
$$
\min_{i\in[1,N]} \left[\left(M_{x,i}\,x_S(t)\right)^2 + \left(M_{y,i}\,x_S(t)\right)^2\right] > R^2
\tag{6}
$$
where $M_{x,i} = \left[1\ \ O_{1\times 3}\ \ O_{1\times 4(i-1)}\ \ {-1}\ \ O_{1\times 3}\ \ O_{1\times 4(N-i)}\right]$, $M_{y,i} = \left[0\ \ 1\ \ O_{1\times 2}\ \ O_{1\times 4(i-1)}\ \ 0\ \ {-1}\ \ O_{1\times 2}\ \ O_{1\times 4(N-i)}\right]$, and R is the kill radius of the interceptor.
The process constraint of heat flux is:
$$
q_{S3D} \le q_U
\tag{7}
$$
where $q_{S3D}$ is the three-dimensional stagnation-point heat flux and $q_U$ is the upper limit of the heat flux. For an arbitrarily shaped, three-dimensional stagnation point with radii of curvature $R_1$ and $R_2$, the heat flux is expressed as:
$$
q_{S3D} = \sqrt{\frac{1+k}{2}}\;q_{S\mathrm{AXI}}
\tag{8}
$$
where $k = R_1/R_2$ and $q_{S\mathrm{AXI}}$ is the axisymmetric heat flux, which is related to the flight altitude and velocity [34].
The minimum velocity constraint:
$$
\left[O_{1\times 2}\ \ 1\ \ 0\ \ O_{1\times 4N}\right] x_S \ge V_{\min}
\tag{9}
$$
The control constraint:
$$
u(t) \in \left[-1,\ 1\right]
\tag{10}
$$
The objective function is a Mayer type function:
$$
J\left(x_S(t_0), u(t)\right) = Q\left(x_S(t_f)\right)
\tag{11}
$$
where $Q\left(x_S(t_f)\right) = -\left(x_S(t_f) - \tilde{P}_E\right)^T \mathbf{R}\left(x_S(t_f) - \tilde{P}_E\right)$, $\tilde{P}_E = \left[P_E^T\ \ V_{\min}\ \ O_{1\times(4N+1)}\right]^T$ and $\mathbf{R} = \mathrm{diag}\left(w_1,\ w_1,\ w_2,\ O_{(4N+1)\times(4N+1)}\right)$, in which $w_1, w_2 \in \mathbb{R}^+$ are weights.
The optimal performance:
$$
J^*\left(x_S(t_0)\right) = \max_{u(t),\ t\in\left[t_0,\, t_f\right]} J\left(x_S(t_0), u(t)\right)
\tag{12}
$$
From Equation (12), the optimal control $u^*(t)$ is determined by $x_S(t_0)$. Given complete model information for system (4), the optimal state trajectory could be found from $x_S(t_0)$ using static optimization methods (e.g., quadratic programming). Nevertheless, solving the problem with such optimization methods is challenging, especially since only limited information about the interceptors (aerodynamic parameters, available overload, guidance law, etc.) is available.

2.2. Markov Decision Process

The MDP can model a sequential decision problem and is well suited to the HGV anti-interception process. An MDP can be defined by a five-tuple $\langle S, A, T, R, \gamma\rangle$ [35]. S is a multidimensional continuous state space. A is the available action space. T is a state transition function, $T: S \times A \to S$; that is, after an action $a \in A$ is taken in state $s \in S$, the state changes from s to $s' \in S$. R is an instant reward function representing the instant reward obtained from the state transition. $\gamma \in [0, 1]$ is a constant discount factor used to balance the importance of instant and future rewards.
The cumulative reward obtained by the controller under the command sequence $\tau = \{a_0, \ldots, a_n\}$ is:
$$
G(s_0, \tau) = \sum_{t=0}^{n}\gamma^t r_t
\tag{13}
$$
The expected value function of the cumulative reward based on state $s_t$ and the expected value function of $(s_t, a_t)$ are introduced as shown below:
$$
V_\pi(s_t) = \mathbb{E}\left[\left.\sum_{k=0}^{\infty}\gamma^k r_{t+k}\ \right|\ s_t, \pi\right]
\tag{14}
$$
$$
Q_\pi(s_t, a_t) = \mathbb{E}\left[\left.\sum_{k=0}^{\infty}\gamma^k r_{t+k}\ \right|\ s_t, a_t, \pi\right]
\tag{15}
$$
where $V_\pi(s_t)$ indicates the expected cumulative reward that controller π can obtain in the current state $s_t$, and $Q_\pi(s_t, a_t)$ indicates the expected cumulative reward under controller π after executing $a_t$ in state $s_t$.
According to the Bellman optimality theorem [35], updating $\pi(s_t)$ through the iteration rule shown in the following equation can be used to approximate the maximum $V_\pi(s_t)$ value:
$$
\pi(s_t) = \arg\max_{a_t} Q_\pi(s_t, a_t)
\tag{16}
$$
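For illustration, the MDP above can be wrapped in a Gym-style environment; the class and method names below are assumptions made for this sketch, not the authors' code, and the termination test stands in for the process constraints of Section 2.1.

```python
import numpy as np

class GtaoEnv:
    """Minimal MDP wrapper: state s in S, action a in A = [-1, 1], reward only at t_f."""

    def __init__(self, dynamics, reward_fn, dt=0.1, horizon=2000):
        self.dynamics = dynamics      # x_S, u -> x_S_dot (Equation (4))
        self.reward_fn = reward_fn    # terminal reward (Equations (11)/(25))
        self.dt, self.horizon = dt, horizon

    def reset(self, x_S0):
        self.x_S, self.t = np.array(x_S0, dtype=float), 0
        return self.observe()

    def observe(self):
        # In the paper the observation also appends the current AO and the DRP position.
        return self.x_S.copy()

    def step(self, a):
        u = float(np.clip(a, -1.0, 1.0))
        self.x_S = self.x_S + self.dt * self.dynamics(self.x_S, u)
        self.t += 1
        done = self.t >= self.horizon                    # or any process constraint is violated
        r = self.reward_fn(self.x_S) if done else 0.0    # sparse, terminal-only reward
        return self.observe(), r, done, {}
```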

3. Proposed Method

Section 2 converts the anti-interception guidance of an HGV into an MDP, and the key to obtaining optimal guidance is the accurate estimation of $Q_\pi(s_t, a_t)$. Much progress has been made towards artificial intelligence using supervised learning systems trained to replicate human expert decisions; however, expert data are often expensive, unreliable, or unavailable [36], whereas DRL can still realize accurate $Q_\pi(s_t, a_t)$ estimation without them. The guidance system needs to process as much data as possible and then choose the next action. The input space of the guidance system is high-dimensional and continuous, and its action space is continuous. DDPG-based methods (mainly DDPG and TD3) have been shown to handle high-dimensional continuous information effectively and to output continuous actions, so they can be used to solve the MDP proposed in this paper. This section aims to achieve faster policy convergence and better performance by optimizing the training of DDPG-based methods. Since the TD3 algorithm differs from DDPG only in having one additional CN pair, this paper takes RBT-DDPG as an example to introduce how the RBT mechanism improves the training of the critic; RBT-TD3 is obtained simply by applying the same improvement to each pair of TD3 critic networks.

3.1. RBT-DDPG-Based Methods

In DDPG-based methods, the CN $Q(\cdot)$ is used only during reinforcement learning to train the AN $A(\cdot)$; at execution time, the actual action is determined directly by $A(\cdot)$.
For the CN, the DDPG-based optimization objective is to minimize the loss function $L_Q(\phi_Q)$. Its gradient is
$$
\nabla_{\phi_Q} L_Q(\phi_Q) = \nabla_{\phi_Q}\frac{1}{N_b}\sum_{j=1}^{N_b}\left[Q\left(s_{t,j}, a_{t,j}\,\middle|\,\phi_Q\right) - \hat{y}_{t,j}\right]^2 = \frac{2}{N_b}\sum_{j=1}^{N_b}\nabla_{\phi_Q} Q\left(s_{t,j}, a_{t,j}\,\middle|\,\phi_Q\right)\left[Q\left(s_{t,j}, a_{t,j}\,\middle|\,\phi_Q\right) - \hat{y}_{t,j}\right]
\tag{17}
$$
A single gradient descent step only guarantees that the parameters are updated in the correct direction; it does not guarantee the magnitude of the loss function after the step.
For the AN, the DDPG-based optimization objective is to minimize the loss function $L_A(\phi_A)$. Its gradient is
$$
\nabla_{\phi_A} L_A(\phi_A) = \frac{1}{N_b}\sum_{j=1}^{N_b}\nabla_{\tilde{a}_{t,j}} Q\left(s_{t,j}, \tilde{a}_{t,j}\,\middle|\,\phi_Q\right)\,\nabla_{\phi_A} A\left(s_{t,j}\,\middle|\,\phi_A\right)
\tag{18}
$$
The direction of the AN parameter update is therefore determined by the CN. The single gradient step used by DDPG-based methods does not guarantee an accurate CN estimate; when the estimate is grossly inaccurate, incorrect parameter updates are passed to the AN, which further degrades the sample data in the memory pool and reduces training efficiency.
Rather than allowing the CN to steer the AN in the wrong direction, this paper proposes a new mechanism for improving DDPG-based methods: repetitive batch training (RBT). The core idea of RBT, which mainly aims to improve the updating of $Q_\pi(s_t, a_t)$, is that when a sample batch is used to update the CN parameters, the CN is trained repetitively on that batch, with its loss function compared against a reference threshold (as in Equation (19)), thereby avoiding a serious misestimate by the CN. The reference threshold $L_{\mathrm{TH}}$ for the repeats should be set appropriately.
$$
\begin{cases}
\mathrm{Repeat}, & L_Q(\phi_Q) \ge L_{\mathrm{TH}}\\
\mathrm{Pass}, & L_Q(\phi_Q) < L_{\mathrm{TH}}
\end{cases}
\tag{19}
$$
$Q\left(s_{t,j}, \tilde{a}_{t,j}\,\middle|\,\phi_Q\right)$ can be split into two parts:
$$
Q\left(s_{t,j}, a_{t,j}\,\middle|\,\phi_Q\right) = Q^*\left(s_{t,j}, a_{t,j}\,\middle|\,\phi_Q^*\right) + D\left(s_{t,j}, a_{t,j}\,\middle|\,\phi_Q\right)
\tag{20}
$$
where $Q^*\left(s_{t,j}, a_{t,j}\,\middle|\,\phi_Q^*\right)$ is the real mapping of Q and $D\left(s_{t,j}, a_{t,j}\,\middle|\,\phi_Q\right)$ is the estimation error. Therefore, Equation (18) can be rewritten as:
$$
\nabla_{\phi_A} L_A(\phi_A) = \frac{1}{N_b}\sum_{j=1}^{N_b}\left[\nabla_{\tilde{a}_{t,j}} Q^*\left(s_{t,j}, \tilde{a}_{t,j}\,\middle|\,\phi_Q^*\right) + \nabla_{\tilde{a}_{t,j}} D\left(s_{t,j}, \tilde{a}_{t,j}\,\middle|\,\phi_Q\right)\right]\nabla_{\phi_A} A\left(s_{t,j}\,\middle|\,\phi_A\right)
\tag{21}
$$
According to the Taylor expansion:
$$
\nabla_{\tilde{a}_{t,j}} D\left(s_{t,j}, \tilde{a}_{t,j}\,\middle|\,\phi_Q\right) = \frac{1}{\Delta\tilde{a}}\left[D\left(s_{t,j}, \tilde{a}_{t,j}+\Delta\tilde{a}\,\middle|\,\phi_Q\right) - D\left(s_{t,j}, \tilde{a}_{t,j}\,\middle|\,\phi_Q\right) - R_2\right]
\tag{22}
$$
where $R_2$ is the higher-order residual term. Since $\left|D\left(s_{t,j}, \tilde{a}_{t,j}\,\middle|\,\phi_Q\right)\right| \in \left[0,\ L_{\mathrm{TH}}\right]$:
$$
\left|D\left(s_{t,j}, \tilde{a}_{t,j}+\Delta\tilde{a}\,\middle|\,\phi_Q\right) - D\left(s_{t,j}, \tilde{a}_{t,j}\,\middle|\,\phi_Q\right)\right| < 2L_{\mathrm{TH}}
\tag{23}
$$
$$
\left|\nabla_{\tilde{a}_{t,j}} D\left(s_{t,j}, \tilde{a}_{t,j}\,\middle|\,\phi_Q\right)\right| < \frac{2L_{\mathrm{TH}}}{\left|\Delta\tilde{a}\right|} + \frac{\left|R_2\right|}{\left|\Delta\tilde{a}\right|}
\tag{24}
$$
Naturally, setting $L_{\mathrm{TH}}$ to a small value reduces the misdirection passed from an unconverged CN to the AN. However, if $L_{\mathrm{TH}}$ is too large, the benefit of RBT-DDPG is weakened, and if it is too small, the CN will overfit the samples in a single batch. The RBT-DDPG shown in Figure 2 and Algorithm 1 is an example of how RBT can be combined with DDPG-based methods.
Algorithm 1 RBT-DDPG
1: Initialize parameters $\phi_A$, $\phi_{A'}$, $\phi_Q$, $\phi_{Q'}$.
2: for each iteration do
3:     for each environment step do
4:         $a = \left(1 - e^{-\theta t}\right)\mu + e^{-\theta t}\hat{a} + \sigma\sqrt{\dfrac{1 - e^{-2\theta t}}{2\theta}}\,\varepsilon$, $\varepsilon \sim N(0, 1)$, $\hat{a} = A(s)$.
5:     end for
6:     for each gradient step do
7:         Randomly sample $N_b$ samples.
8:         $\phi_Q \leftarrow \phi_Q - \alpha_C \nabla_{\phi_Q} L_Q(\phi_Q)$.
9:         if $L_Q(\phi_Q) > L_{\mathrm{TH}}$ then
10:             Go back to step 7.
11:         end if
12:         $\phi_A \leftarrow \phi_A + \alpha_A \nabla_{\phi_A} L_A(\phi_A)$.
13:         $\phi_{Q'} \leftarrow \left(1 - s_r\right)\phi_{Q'} + s_r\,\phi_Q$.
14:         $\phi_{A'} \leftarrow \left(1 - s_r\right)\phi_{A'} + s_r\,\phi_A$.
15:     end for
16: end for
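A condensed PyTorch sketch of the gradient-step loop of Algorithm 1 (steps 7–14) is given below, assuming pre-built actor/critic networks, their target copies, optimizers, and a sampled batch of tensors. It follows the same-batch reading of RBT described in Section 3.1 (step 10 of Algorithm 1 could also be read as re-sampling), and the repeat cap `max_repeats` is an added safeguard that is not part of the algorithm.

```python
import torch
import torch.nn.functional as F

def rbt_ddpg_update(actor, critic, actor_targ, critic_targ,
                    actor_opt, critic_opt, batch,
                    gamma=0.99, l_th=1e-3, soft_rate=0.005, max_repeats=20):
    s, a, r, s2, done = batch  # tensors sampled from the replay buffer

    # Critic: repeat the update on the same batch until the loss drops below L_TH.
    for _ in range(max_repeats):
        with torch.no_grad():
            y = r + gamma * (1.0 - done) * critic_targ(s2, actor_targ(s2))
        q_loss = F.mse_loss(critic(s, a), y)
        critic_opt.zero_grad()
        q_loss.backward()
        critic_opt.step()
        if q_loss.item() < l_th:   # Equation (19): pass once the estimate is accurate enough
            break

    # Actor: one deterministic policy-gradient step (ascend Q).
    a_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    a_loss.backward()
    actor_opt.step()

    # Soft updates of the target networks (steps 13-14).
    with torch.no_grad():
        for p, p_t in zip(critic.parameters(), critic_targ.parameters()):
            p_t.mul_(1.0 - soft_rate).add_(soft_rate * p)
        for p, p_t in zip(actor.parameters(), actor_targ.parameters()):
            p_t.mul_(1.0 - soft_rate).add_(soft_rate * p)
```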

3.2. Scheme of DRL

To approximate $Q_\pi(s_t, a_t)$ and the optimal AN through DRL, the state space, action space and instant reward function are designed as follows.

3.2.1. State Space

As mentioned in Section 2.1, in the vertical plane the HGV and the interceptors follow Equation (2) or (3), and both can be described by their two-dimensional coordinates, velocity, and ballistic inclination. Therefore, the network can predict the future flight state based on the current state. The AN state space is $S_{\mathrm{HI}} = S_H \times S_{I_1} \times \cdots \times S_{I_N}$, where $S_H \subset \mathbb{R}^4$ and $S_{I_i} \subset \mathbb{R}^4$ are the state spaces of the HGV and interceptor i, respectively. The AN also needs to know the AO of the current state, $S_\Omega \subset \mathbb{R}$, to evaluate the available manoeuvrability, and the position of the DRP, $S_D \subset \mathbb{R}^2$. As a result, the state space of the AN is designed as the $(7+4N)$-dimensional space $S = S_{\mathrm{HI}} \times S_\Omega \times S_D$, and each element has the form $\left[x_H, y_H, v_H, \theta_H, x_{I_1}, y_{I_1}, v_{I_1}, \theta_{I_1}, \ldots, x_{I_N}, y_{I_N}, v_{I_N}, \theta_{I_N}, \omega_{\max}, x_D, y_D\right]$. For the CN, an additional action bit is needed, so its input space is $(8+4N)$-dimensional.
Remark 4.
The definition of the state space used in this paper means that there is a vast input space for the neural network when an HGV is confronted with many interceptors, resulting in many duplicate elements and complicating the training process. An attention mechanism can alleviate this issue by extracting features from the original input space [37]. Typically, two interceptors are used to intercept one target in air defence operations [38]. Therefore, there is only one HGV and two interceptors in the virtual scenario, limiting the input space of the neural networks.
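A small sketch of assembling the (7+4N)-dimensional AN input (and the (8+4N)-dimensional CN input) in the element order given above; the function names are illustrative.

```python
import numpy as np

def build_actor_state(x_H, interceptor_states, omega_max, drp):
    """Assemble the (7 + 4N)-dimensional AN input.

    x_H                : [x_H, y_H, v_H, theta_H]
    interceptor_states : list of N arrays [x_Ii, y_Ii, v_Ii, theta_Ii]
    omega_max          : available overload (AO) of the current HGV state
    drp                : desired regression point [x_D, y_D]
    """
    parts = [np.asarray(x_H, dtype=float)]
    parts += [np.asarray(x_I, dtype=float) for x_I in interceptor_states]
    parts += [np.array([omega_max], dtype=float), np.asarray(drp, dtype=float)]
    return np.concatenate(parts)

def build_critic_input(actor_state, action):
    """The CN additionally receives the action bit, giving 8 + 4N inputs."""
    return np.concatenate([actor_state, np.array([action], dtype=float)])
```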

3.2.2. Action Space

As described in Equation (2), since $u(t)$ is limited to the interval $[-1, 1]$, the output (action) of the neural network can naturally be defined as $u(t)$. This output is multiplied by the AO derived from the model information and then applied to the HGV's flight control system, so that the guidance signal always meets the dynamical constraints of the current flight state. The neural network therefore only needs to pick a real number within $[-1, 1]$, which represents the choice of $\dot{\theta}$ as a guidance command subject to the AO constraint. From a training perspective, using the model information directly to bypass learning the AO is more straightforward than adopting the AoA as the action space; a sketch of this mapping is given below.
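A minimal sketch of this mapping, where `available_overload` stands for the model-derived AO δ_H(x_H); the interface is an assumption made for illustration, not the authors' code.

```python
import numpy as np

def action_to_command(network_output, available_overload):
    """Scale the AN output u in [-1, 1] by the real-time AO to obtain the
    inclination-rate command passed to the flight control system."""
    u = float(np.clip(network_output, -1.0, 1.0))
    return u * available_overload
```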

3.2.3. Instant Reward Function

As an essential part of the trial-and-error approach, the instant reward function guides the networks towards the optimal strategy. Different instant reward functions encode different behaviour tendencies, so the instant reward function affects the quality of the strategy learned by DRL. Following Equation (12), the instant reward function is designed with two aims: (1) to reward, through $r_E(\cdot)$, the distance between the HGV and the DRP at the terminal time $t_f$, and (2) to reward, through $r_D(\cdot)$, the velocity at $t_f$. The instant reward function is designed as follows:
$$
r\left(x_S(t)\right) =
\begin{cases}
r_E\left(x_S(t)\right) + r_D\left(x_S(t)\right), & t = t_f\\
0, & t < t_f
\end{cases}
\tag{25}
$$
It is necessary to convert the $w_1$-weighted distance term in Equation (11) into a function with a positive domain in order to encourage the HGV to evade the interceptors and reach $x_E$ during DRL:
$$
f(x) = \frac{1}{w_1 + x}
\tag{26}
$$
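A sketch of the terminal-only reward of Equations (25) and (26) is shown below. Only the distance transform f(x) = 1/(w_1 + x) is specified explicitly above, so the form of the velocity term r_D and the weight values are illustrative assumptions.

```python
import numpy as np

def terminal_reward(x_H_tf, drp, v_tf, w1=1.0, w2=1.0, v_min=400.0):
    """Instant reward issued only at t_f (zero at all earlier steps).

    r_E rewards proximity to the DRP via f(x) = 1 / (w1 + x);
    r_D rewards the terminal velocity margin above V_min (illustrative form).
    """
    miss = np.hypot(x_H_tf[0] - drp[0], x_H_tf[1] - drp[1])
    r_E = 1.0 / (w1 + miss)
    r_D = w2 * max(v_tf - v_min, 0.0)
    return r_E + r_D
```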
Remark 5.
If the flight state of the HGV does not meet the various constraints mentioned in Section 2, then the episode ends early, and the instant reward for the whole episode is 0, which introduces the common sparse reward problem in reinforcement learning. However, it is evident from Section 4 that RBT-DDPG-based methods can generate intelligence from sparse rewards.
Remark 6.
An intensive instant reward function similar to that designed by Jiang et al. [39] is not used, since there is no experience to draw on for the HGV anti-interception problem. Furthermore, rather than forming a Bolza problem, this instant reward function is fully equivalent to the optimization goal, that is, the Mayer-type problem defined in Equation (12). Moreover, the neural network is not influenced by human tendencies in strategy exploration, allowing an in-depth exploration of all strategic options that could approach the global optimum.
Remark 7.
A curriculum-based approach similar to that discussed by Li et al. [40] was also attempted, in which the HGV quickly learns to reduce the interceptors' energy using snake manoeuvres. However, because the policy solidifies in the approach phase, it is difficult to achieve global optimality within this framework, and the obtained performance is significantly lower than that obtained with Equation (25).

4. Training and Testing

Section 3 introduces RBT-DDPG-based methods to solve the GTAO problem discussed in Section 2. This section verifies the effectiveness of DRL in finding the optimal anti-interception guidance system.

4.1. Settings

4.1.1. Aircraft Settings

To simulate a random initial interceptor energy state in the virtual scenario, the interceptors' initial altitude and initial velocity follow the uniform distributions $U(25\,\mathrm{km}, 45\,\mathrm{km})$ and $U(1050\,\mathrm{m/s}, 1650\,\mathrm{m/s})$, respectively. Table 1 lists the remaining parameters.
The aerodynamics of an aircraft are usually approximated by a curve-fitted model (CFM). The CFM of the HGV used in this paper is taken from Wang et al. [41]: $C_L = 0.21 + 0.075M + (0.23 + 0.05M)\,\alpha$ and $C_D = 0.41 + 0.011M + (0.0081 + 0.0021M)\,\alpha + 0.0042\,\alpha^2$, where M is the Mach number of the HGV and α is the AoA (rad). The interceptors' coefficients are $C_L = (0.18 + 0.02M)\,\alpha$ and $C_D = 0.18 + 0.01M + 0.001M\,\alpha + 0.004\,\alpha^2$ [42]. The interceptors employ proportional guidance. To compensate for the fact that the virtual scenario does not represent the interceptors' vectoring capability, we increased their kill radius to 300 m.
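The curve-fitted coefficients above transcribe directly into code; the grouping of terms below follows the CFM form given in the text and should be checked against [41,42], since signs and groupings are easy to mistranscribe from the source.

```python
def hgv_coefficients(mach, alpha):
    """Curve-fitted lift/drag coefficients of the HGV (alpha in rad), per [41]."""
    c_l = (0.21 + 0.075 * mach) + (0.23 + 0.05 * mach) * alpha
    c_d = (0.41 + 0.011 * mach) + (0.0081 + 0.0021 * mach) * alpha + 0.0042 * alpha ** 2
    return c_l, c_d

def interceptor_coefficients(mach, alpha):
    """Curve-fitted coefficients of the interceptors, per [42]."""
    c_l = (0.18 + 0.02 * mach) * alpha
    c_d = 0.18 + 0.01 * mach + 0.001 * mach * alpha + 0.004 * alpha ** 2
    return c_l, c_d
```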

4.1.2. Hyperparameter Settings

Hyperparameters of training are shown in Table 2.
For the AN, the bulk of the computation occurs in the hidden layers. A neuron in a hidden layer reads in $n_i$ numbers ($n_i$ is the width of the previous layer) through a dropout layer (drop rate 0.2), multiplies them by the weights (totalling $2n_i + 0.8n_i$ FLOPs), adds up the values (totalling $n_i$ FLOPs), adds a bias term (1 FLOP) and passes the result through the activation function (LReLU, 2 FLOPs), so a single neuron consumes $3.8n_i + 3$ FLOPs per forward pass. The actor shown in Figure 3 therefore consumes approximately 87K FLOPs in a single execution. Assuming an onboard computer with $10^3$ MFLOPS of floating-point capability (most mainstream industrial FPGAs in 2021 exceed 1 GFLOPS), a single execution of the anti-interception guidance formed by the AN takes less than 0.1 ms.
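The per-neuron estimate of 3.8·n_i + 3 FLOPs can be turned into a quick budget check; the layer widths below are placeholders (the actual widths are those of Figure 3), so the total is only indicative of the quoted ~87K FLOPs.

```python
def neuron_flops(n_inputs):
    """FLOPs per neuron per forward pass using the estimate 3.8 * n_i + 3."""
    return 3.8 * n_inputs + 3

def network_flops(layer_widths):
    """Total FLOPs for a fully connected network given [input, hidden..., output] widths."""
    total = 0.0
    for n_in, n_out in zip(layer_widths, layer_widths[1:]):
        total += n_out * neuron_flops(n_in)
    return total

# Placeholder widths: 15 inputs (7 + 4N with N = 2), three hidden layers, one output.
print(network_flops([15, 100, 100, 100, 1]))  # on the order of the quoted 87K FLOPs
```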

4.2. Training Results and Analysis

The training platform uses an AMD Ryzen 5 CPU, and the RAM is 8 GB × 2 DDR4@3733 MHz. As the networks are straightforward and the computation is concentrated in the aircraft models, no GPU is used for training. The aircraft models in the virtual scenario are written in C++ and packaged as a dynamic link library (DLL); the networks and the training algorithm are implemented in Python and interact with the virtual scenario by calling the DLL (a sketch of such a call is given below). The training process is shown in Figure 4, Figure 5 and Figure 6.
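A minimal sketch of calling a C++ model DLL from Python with ctypes is shown below; the library name and the exported function signature are assumptions made for illustration, not the interface used by the authors.

```python
import ctypes
import numpy as np

# Hypothetical exported C function:
#   void step_models(const double* x_S, double u, double dt, double* x_S_next, int dim);
lib = ctypes.CDLL("./aircraft_models.dll")
lib.step_models.argtypes = [
    ctypes.POINTER(ctypes.c_double), ctypes.c_double, ctypes.c_double,
    ctypes.POINTER(ctypes.c_double), ctypes.c_int,
]
lib.step_models.restype = None

def dll_step(x_S, u, dt=0.1):
    """Advance the virtual scenario one step through the C++ model DLL."""
    dim = len(x_S)
    x_in = (ctypes.c_double * dim)(*x_S)
    x_out = (ctypes.c_double * dim)()
    lib.step_models(x_in, ctypes.c_double(u), ctypes.c_double(dt), x_out, ctypes.c_int(dim))
    return np.array(x_out[:])
```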
For episodes 0–3500, DDPG and RBT-DDPG are constructing their memory pools. The neural networks produce essentially random outputs in this process, and no training occurs. As a result, the cumulative reward is near 0 for most episodes, with a slim possibility of reaching six points through random exploration. The neural networks begin to be iteratively updated as soon as the memory pool is full. RBT-DDPG exceeds the maximum score of the random exploration process at approximately 3900 episodes, reaching close to seven points. RBT-DDPG then gradually improves at a rate of approximately 0.00145 points/episode, roughly three times the 0.00046 points/episode of DDPG. RBT-TD3 achieves steady growth from about episode 500. TD3, on the other hand, apparently fell into a local optimum before episode 3500, with reward values remaining around 0.8 points. RBT-TD3 learns faster than RBT-DDPG because the learning algorithm is more complex, but the strategies they learn are similar and eventually converge to almost the same reward.

4.3. Test Results and Analysis

A Monte Carlo test was performed to verify that the strategies learned by the neural network are universally adaptable. In addition, some cases were used to analyze the anti-interception strategies. As the final performance obtained by RBT-DDPG and RBT-TD3 is similar, due to space limitations, this section uses the AN learned by RBT-DDPG as the test object.

4.3.1. Monte Carlo Test

To reveal the specific strategy obtained from training, the AN from RBT-DDPG controls the HGV in virtual scenarios to perform anti-interception tests. To verify the adaptability of the AN to different initial states, a test was conducted using scenarios in which the initial altitude and velocity of the interceptors were randomly distributed (with the same distributions used during training). Since exploration is no longer needed, no OU process noise was added to the AN. A total of 1000 episodes were conducted.
The ANs from DDPG and RBT-DDPG were tested for 1000 episodes each. Table 3 and Figure 7 illustrate the results. If the measure of anti-interception success is whether both interceptors are eliminated, then the 91.48% success rate of the RBT-DDPG AN is better than the 79.74% rate of the DDPG AN, reflecting the greater adaptability of the RBT-DDPG AN to complex initial conditions. According to the average terminal miss distance $\bar{e}_{t_f}$, once the interceptors have been eliminated, both actors perform well in achieving DRP regression. However, in terms of the average terminal velocity $\bar{v}_{t_f}$, the HGV guided by the RBT-DDPG AN is faster. The probability density peaks at 7.4 points for RBT-DDPG and at 6.7 points for DDPG, indicating that RBT-DDPG performs well in more scenarios than DDPG.
To observe the specific strategies learned through RBT-DDPG, we traversed the initial conditions of the two interceptors and tested each combination individually. Figure 8 indicates the correspondence between the initial state of the interceptors and the test case serial number. The vertical motion of the HGV and interceptors in all cases is shown in Figure 9.
In Figure 9, the neural network adopts different strategies in response to interceptors with variable initial energies. In terms of behaviour, the strategies fall into two categories: S-curve (dive–leap–dive) and C-curve (leap–dive). There is also a specific pattern to the peaks. In general, the higher the initial energy of the interceptors faced by the HGV, the higher the peak.

4.3.2. Analysis of Anti-Interception Strategies

Using the data from the 5th and 42nd test cases mentioned in Section 4.3.1, we attempted to identify the strategies the neural network learned from DRL.
As shown by the solid purple trajectory in Figure 10 and Figure 11, we also used the differential game (DG) approach [33] as a comparison method. The DG approach uses the relative angle and velocity information as input to guide the flight and can successfully evade the interceptor. However, while the HGV evaded the interceptor under DG guidance, it lost significant kinetic energy due to its long residence in the dense atmosphere and its low ballistic dive point, and it fell below the minimum velocity approximately 55 km from the target. DG cannot account for atmospheric density and cannot optimize energy, which is where DRL holds a significant advantage.
The neural network chooses to dive before leaping in Case 5, whereas, in Case 42, it takes a direct leap. Furthermore, while the peak in Case 5 is 60 km, it only reaches 54 km in Case 42 before diving. In both cases, the HGV causes one of the interceptors to fall below the minimum flight speed (400 m/s) before entering the rendezvous phase. At the rendezvous phase, the minimum distances between the interceptor and the HGV are 389 m and 672 m, respectively, which indicates that the HGV passes very close to the interceptable area of the interceptors. The terminal velocity of the HGV in Case 5 is approximately 100 m/s lower than that in Case 42, due to the higher initial interceptor energy faced in Case 5, which resulted in a longer manoeuvre path and a more violent pull-up in the dense atmospheric region. Figure 12 illustrates that the HGV tends to perform a large overload manoeuvre in the approach phase, almost fully utilizing its manoeuvring ability. In the regress phase of Case 42, only very small manoeuvres were required to correct the ballistic inclination, demonstrating that the neural network can control the HGV and select the appropriate ballistic inclination for the dive after escaping the interceptors.
We used a rudimentary instant reward function with no prior knowledge of the strategy that should be adopted, resulting in a sparse-reward problem. Nevertheless, the CN trained by RBT-DDPG does not make significant Q estimation errors: the Q estimate is accurate at the beginning of an episode and remains accurate throughout the process in both Case 5 and Case 42 (Figure 13). This phenomenon is consistent with the idea expressed in Equation (12) that the flight states of both sides at the outset determine the optimal anti-interception strategy that the HGV should adopt.
Figure 10, Figure 11 and Figure 12 illustrate the anti-interception strategy learned by RBT-DDPG: (1) The HGV lures interceptors with high initial energy into the dense atmosphere by diving during the approach phase, then relies on pull-up manoeuvres in the denser atmosphere to drain much of the interceptors' energy. It is important to note that this dive manoeuvre also consumes the HGV's kinetic energy (e.g., Case 5). In contrast, when confronted with interceptors with low initial energy, the neural network does not dive first, even though this strategy is feasible, but instead leaps directly into the thin-atmosphere region (e.g., Case 42), reflecting the optimality of the strategy. (2) Through the approach-phase manoeuvre, the HGV reduces the interceptors' kinetic energy to a suitable level and draws them into the thin atmosphere. There, the interceptors' AO can no longer make the interceptable area cover the whole reachable area of the HGV, which provides the HGV with an available penetration path, as shown in Figure 14.

5. Conclusions

Traditionally, research on anti-interception guidance for aircraft has focused on differential game theory and optimization algorithms. Due to the large number of matrix calculations required, applying differential game theory online is computationally uneconomical. Even though the newly developed ADP algorithm employs a neural network that significantly reduces the computation associated with the Hamilton functions, it cannot be applied to aircraft with very large flight envelopes, such as HGVs. It is also challenging to implement convex programming, sequential quadratic programming, or other planning algorithms onboard an HGV due to their high computational complexity and insufficient real-time performance.
We conceptualize the penetration of an HGV as a GTAO problem from an optimal control perspective, revealing the significant impact that the initial conditions of both the attackers and the defenders have on the penetration strategy. The problem is then modelled as an MDP and solved using the DDPG algorithm. The RBT-DDPG algorithm was developed to improve the CN estimation during the training process. The data on the training process and the online simulation tests verify that RBT-DDPG can autonomously learn anti-interception guidance and adopt a rational strategy (S-curve or C-curve) when interceptors have differing initial energy conditions. Compared to the traditional DDPG algorithm, our proposed algorithm reduces the required training episodes by 48.48%. Since the AN deployed online is straightforward, it is suitable for onboard computers. To our knowledge, this is the first work to apply DRL to anti-interception guidance for HGVs.
This paper focuses on a scenario in which one HGV breaks through the defences of two interceptors; this is a traditional scenario but may not be fully adapted to future trends in group confrontation. In the future, we anticipate applying multi-agent DRL (MA-DRL) to multiple HGVs. The agents trained by MA-DRL can conduct guidance for each HGV in a distributed manner under limited-information constraints, so the anti-interception guidance strategy can adapt to the number of enemies and HGVs. This will greatly improve the generalizability of the learned intelligence to the battlefield. Additionally, RL is known to suffer from long training times. We anticipate using pseudo-spectral methods to create a collection of expert data and then combining expert advice [43,44] to accelerate training.

Author Contributions

Conceptualization, L.J. and Y.N.; methodology, L.J.; software, L.J.; validation, Y.Z., Z.L.; formal analysis, L.J.; investigation, Y.N.; resources, Y.N.; data curation, L.J.; writing—original draft preparation, L.J.; writing—review and editing, Y.Z.; visualization, L.J.; supervision, Y.N.; project administration, Y.N.; funding acquisition, Y.N. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Aviation Science Foundation of China under Grant 201929052002.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The authors would like to thank Cheng Yuehua from Nanjing University of Aeronautics and Astronautics for her invaluable support during the writing of this paper.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Guo, Y.; Gao, Q.; Xie, J.; Qiao, Y.; Hu, X. Hypersonic vehicles against a guided missile: A defender triangle interception approach. In Proceedings of the 2014 IEEE Chinese Guidance, Navigation and Control Conference, Yantai, China, 8–10 August 2014; pp. 2506–2509. [Google Scholar]
  2. Liu, K.F.; Meng, H.D.; Wang, C.J.; Li, J.; Chen, Y. Anti-Head-on Interception Penetration Guidance Law for Slide Vehicle. Mod. Def. Technol. 2008, 4, 39–45. [Google Scholar]
  3. Luo, C.; Huang, C.Q.; Ding, D.L.; Guo, H. Design of Weaving Penetration for Hypersonic Glide Vehicle. Electron. Opt. Control 2013, 7, 67–72. [Google Scholar]
  4. Zhu, Q.G.; Liu, G.; Xian, Y. Simulation of Reentry Maneuvering Trajectory of Tactical Ballistic Missile. Tactical Missile Technol. 2008, 1, 79–82. [Google Scholar]
  5. He, L.; Yan, X.D.; Tang, S. Guidance law design for spiral-diving maneuver penetration. Acta Aeronaut. Astronaut. Sin. 2019, 40, 188–202. [Google Scholar]
  6. Zhao, K.; Cao, D.Q.; Huang, W.H. Manoeuvre control of the hypersonic gliding vehicle with a scissored pair of control moment gyros. Sci. China Technol. 2018, 61, 1150–1160. [Google Scholar] [CrossRef]
  7. Zhao, X.; Qin, W.W.; Zhang, X.S.; He, B.; Yan, X. Rapid full-course trajectory optimization for multi-constraint and multi-step avoidance zones. J. Solid Rocket. Technol. 2019, 42, 245–252. [Google Scholar]
  8. Wang, P.; Yang, X.L.; Fu, W.X.; Qiang, L. An On-board Reentry Trajectory Planning Method with No-fly Zone Constraints. Missiles Space Vechicles 2016, 2, 1–7. [Google Scholar]
  9. Fang, X.L.; Liu, X.X.; Zhang, G.Y.; Wang, F. An analysis of foreign ballistic missile manoeuvre penetration strategies. Winged Missiles J. 2011, 12, 17–22. [Google Scholar]
  10. Sun, S.M.; Tang, G.J.; Zhou, Z.B. Research on Penetration Maneuver of Ballistic Missile Based on Differential Game. J. Proj. Rocket. Missiles Guid. 2010, 30, 65–68. [Google Scholar]
  11. Imado, F.; Miwa, S. Fighter evasive maneuvers against proportional navigation missile. J. Aircr. 1986, 23, 825–830. [Google Scholar] [CrossRef]
  12. Zhang, G.; Gao, P.; Tang, Q. The Method of the Impulse Trajectory Transfer in a Different Plane for the Ballistic Missile Penetrating Missile Defense System in the Passive Ballistic Curve. J. Astronaut. 2008, 29, 89–94. [Google Scholar]
  13. Wu, Q.X.; Zhang, W.H. Research on Midcourse Maneuver Penetration of Ballistic Missile. J. Astronaut. 2006, 27, 1243–1247. [Google Scholar]
  14. Zhang, K.N.; Zhou, H.; Chen, W.C. Trajectory Planning for Hypersonic Vehicle With Multiple Constraints and Multiple Manoeuvreing Penetration Strategies. J. Ballist. 2012, 24, 85–90. [Google Scholar]
  15. Xian, Y.; Tian, H.P.; Wang, J.; Shi, J.Q. Research on intelligent manoeuvre penetration of missile based on differential game theory. Flight Dyn. 2014, 32, 70–73. [Google Scholar]
  16. Sun, J.L.; Liu, C.S. An Overview on the Adaptive Dynamic Programming Based Missile Guidance Law. Acta Autom. Sin. 2017, 43, 1101–1113. [Google Scholar]
  17. Sun, J.L.; Liu, C.S. Distributed Fuzzy Adaptive Backstepping Optimal Control for Nonlinear Multimissile Guidance Systems with Input Saturation. IEEE Trans. Fuzzy Syst. 2019, 27, 447–461. [Google Scholar]
  18. Sun, J.L.; Liu, C.S. Backstepping-based adaptive dynamic programming for missile-target guidance systems with state and input constraints. J. Frankl. Inst. 2018, 355, 8412–8440. [Google Scholar] [CrossRef]
  19. Wang, F.; Cui, N.G. Optimal Control of Initiative Anti-interception Penetration Using Multistage Hp-Adaptive Radau Pseudospectral Method. In Proceedings of the 2015 2nd International Conference on Information Science and Control Engineering, Shanghai, China, 24–26 April 2015. [Google Scholar]
  20. Liu, Y.; Yang, Z.; Sun, M.; Chen, Z. Penetration design for the boost phase of near space aircraft. In Proceedings of the 2017 36th Chinese Control Conference, Dalian, China, 26–28 July 2017. [Google Scholar]
  21. Marcus, G. Innateness, alphazero, and artificial intelligence. arXiv 2018, arXiv:1801.05667. [Google Scholar]
  22. Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; et al. Mastering the game of go without human knowledge. Nature 2017, 550, 354–359. [Google Scholar] [CrossRef]
  23. Osband, I.; Blundell, C.; Pritzel, A.; Van Roy, B. Deep Exploration via Bootstrapped DQN. arXiv 2016, arXiv:1602.04621. [Google Scholar]
  24. Chen, J.W.; Cheng, Y.H.; Jiang, B. Mission-Constrained Spacecraft Attitude Control System On-Orbit Reconfiguration Algorithm. J. Astronaut. 2017, 38, 989–997. [Google Scholar]
  25. Dong, C.; Deng, Y.B.; Luo, C.C.; Tang, X. Compression Artifacts Reduction by a Deep Convolutional Network. In Proceedings of the 2015 IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015. [Google Scholar]
  26. Fu, X.W.; Wang, H.; Xu, Z. Research on Cooperative Pursuit Strategy for Multi-UAVs based on DE-MADDPG Algorithm. Acta Aeronaut. Astronaut. Sin. 2021, 42, 311–325. [Google Scholar]
  27. Brian, G.; Kris, D.; Roberto, F. Adaptive Approach Phase Guidance for a Hypersonic Glider via Reinforcement Meta Learning. In Proceedings of the AIAA SCITECH 2022 Forum, San Diego, CA, USA, 3–7 January 2022. [Google Scholar]
  28. Wen, H.; Li, H.; Wang, Z.; Hou, X.; He, K. Application of DDPG-based Collision Avoidance Algorithm in Air Traffic Control. In Proceedings of the ISCID 2019: IEEE 12th International Symposium on Computational Intelligence and Design, Hangzhou, China, 14 December 2020. [Google Scholar]
  29. Lin, G.; Zhu, L.; Li, J.; Zou, X.; Tang, Y. Collision-free path planning for a guava-harvesting robot based on recurrent deep reinforcement learning. Comput. Electron. Agric. 2021, 188, 106350. [Google Scholar] [CrossRef]
  30. Lin, Y.; Mcphee, J.; Azad, N.L. Anti-Jerk On-Ramp Merging Using Deep Reinforcement Learning. In Proceedings of the IVS 2020: IEEE Intelligent Vehicles Symposium, Las Vegas, NV, USA, 19 October–13 November 2020. [Google Scholar]
  31. Xu, X.L.; Cai, P.; Ahmed, Z.; Yellapu, V.S.; Zhang, W. Path planning and dynamic collision avoidance algorithm under COLREGs via deep reinforcement learning. Neurocomputing 2021, 468, 181–197. [Google Scholar] [CrossRef]
  32. Lei, H.M. Principles of Missile Guidance and Control. Control Technol. Tactical Missile 2007, 15, 162–164. [Google Scholar]
  33. Cheng, T.; Zhou, H.; Dong, X.F.; Cheng, W.C. Differential game guidance law for integration of penetration and strike of multiple flight vehicles. J. Beijing Univ. Aeronaut. Astronaut. 2022, 48, 898–909. [Google Scholar]
  34. Zhao, J.S.; Gu, L.X.; Ma, H.Z. A rapid approach to convective aeroheating prediction of hypersonic vehicles. Sci. China Technol. Sci. 2013, 56, 2010–2024. [Google Scholar] [CrossRef]
  35. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 1998. [Google Scholar]
  36. Liu, R.Z.; Wang, W.; Shen, Y.; Li, Z.; Yu, Y.; Lu, T. An Introduction of mini-AlphaStar. arXiv 2021, arXiv:2104.06890. [Google Scholar]
  37. Deka, A.; Luo, W.; Li, H.; Lewis, M.; Sycara, K. Hiding Leader’s Identity in Leader-Follower Navigation through Multi-Agent Reinforcement Learning. arXiv 2021, arXiv:2103.06359. [Google Scholar]
  38. Xiong, J.-H.; Tang, S.-J.; Guo, J.; Zhu, D.-L. Design of Variable Structure Guidance Law for Head-on Interception Based on Variable Coefficient Strategy. Acta Armamentarii 2014, 35, 134–139. [Google Scholar]
  39. Jiang, L.; Nan, Y.; Li, Z.H. Realizing Midcourse Penetration With Deep Reinforcement Learning. IEEE Access 2021, 9, 89812–89822. [Google Scholar] [CrossRef]
  40. Li, B.; Yang, Z.P.; Chen, D.Q.; Liang, S.Y.; Ma, H. Maneuvering target tracking of UAV based on MN-DDPG and transfer learning. Def. Technol. 2021, 17, 457–466. [Google Scholar] [CrossRef]
  41. Wang, J.; Zhang, R. Terminal guidance for a hypersonic vehicle with impact time control. J. Guid. Control Dyn. 2018, 41, 1790–1798. [Google Scholar] [CrossRef]
  42. Ge, L.Q. Cooperative Guidance for Intercepting Multiple Targets by Multiple Air-to-Air Missiles. Master’s Thesis, Nanjing University of Aeronautics and Astronautics, Nanjing, China, 2019. [Google Scholar]
  43. Cruz, F.; Parisi, G.I.; Twiefel, J.; Wermter, S. Multi-modal integration of dynamic audiovisual patterns for an interactive reinforcement learning scenario. In Proceedings of the RSJ 2016: IEEE International Conference on Intelligent Robots & Systems, Daejeon, Korea, 9–14 October 2016. [Google Scholar]
  44. Bignold, A.; Cruz, F.; Dazeley, R.; Vamplew, P.; Foale, C. Human engagement providing evaluative and informative advice for interactive reinforcement learning. Neural Comput. Appl. 2022. [Google Scholar] [CrossRef]
Figure 1. Illustration of an HGV anti-interception manoeuvre. The anti-interception of an HGV can be divided into three phases. ① Approach. The HGV manoeuvres according to anti-interception guidance, while the interceptor operates under its own guidance. Since the HGV is located a long distance from the interceptor and has high energy, various penetration strategies are available during this phase. ② Rendezvous. At this phase, the distance between the HGV and the interceptors is the shortest. This distance may be shorter than the kill radius of the interceptors, allowing the HGV to be intercepted, or greater than the kill radius, allowing the HGV to successfully avoid interception. ③ Egress. With its remaining energy, the HGV flies to the DRP after moving away from the interceptors. A successful mission requires the HGV to arrive at the DRP with the highest levels of energy and accuracy. From the above analysis, it can be seen that whether the HGV can evade the interceptors in phase ② depends on the manoeuvres adopted in phase ①. Phase ① also determines the difficulty of ballistic regression in phase ③.
Figure 2. Signal flow of the RBT-DDPG algorithm. When a sample batch is used to update the CN parameters, the CN is repetitively trained using the loss function as its reference threshold (as shown in Steps 4–7 of the Figure), thereby avoiding a serious misestimate of CN.
Figure 3. Structures of the neural networks ((Left) AN. (Right) CN) with the parameters of each layer.
Figure 4. Cumulative reward during training. With the help of the RBT mechanism, both DDPG and TD3 reached a faster training speed.
Figure 5. Loss function of AN and CN during training. Similar to the cumulative reward curve presented in Figure 4, the actor loss function of RBT-DDPG in Figure 5 decreases faster, indicating that RBT-DDPG ensures that AN learns faster. RBT-DDPG has a lower CN loss function almost throughout the episodes, reflecting that RBT can improve the CN estimation. The same phenomenon occurs in the comparison between RBT-TD3 and its original version.
Figure 6. The number of RBT iterations that occur during RBT-DDPG training. RBT is repeated several times at the beginning of training (near the first training iteration), and the CN is required to train to a tiny estimation error. RBT is barely executed before 60,000 training steps, as the CN can already provide accurate estimates for the samples in the current memory pool, and no additional training is needed. From steps 100,000 to 200,000, RBT is repeated several times, and many of the executions exceed 10 repetitions; due to the introduction of new strategies into the memory pool, the original CN does not accurately estimate the Q value, so additional training is performed. Between steps 200,000 and 350,000, RBT is occasionally executed, and most executions contain fewer than 10 repetitions, because fine-tuning the CN can accommodate the new strategies explored by the actor. RBT executions increase after 350,000 steps, as the CN must adapt to the multiple strategy samples brought into the memory pool. At the end of training, the average number of repetitions is approximately 2, which is an acceptable algorithmic complexity cost.
Figure 7. Probability distribution and density of the cumulative reward for each episode in the Monte Carlo test. About 10% of the RBT-DDPG rewards are less than 3, compared to about 24% for DDPG, reflecting the greater adaptability of the RBT-DDPG AN to complex initial conditions.
Figure 8. Correspondence between the initial states and the test-case serial numbers (the horizontal coordinate represents the initial state of the first interceptor, and the vertical coordinate that of the second interceptor; the letters H, M, and L in the first position denote initial altitudes of 44 km, 35 km, and 26 km, respectively, and in the second position initial velocities of 1500 m/s, 1250 m/s, and 1000 m/s, respectively).
Figure 9. Vertical motion of the HGV and interceptors in the test cases. The strategies fall into two categories: the S-curve (dive–leap–dive) and the C-curve (leap–dive). The peak altitudes also follow a specific pattern across cases.
Figure 10. Vertical motion comparison between RBT-DDPG (Cases 5 and 42) and the differential game (DG) approach. The DG approach uses the relative angle and velocity information as input to guide the flight and can successfully evade the interceptor. However, DG cannot account for atmospheric density or optimize energy, which is a significant advantage of the DRL approach. The AN learned by RBT-DDPG chooses to dive before leaping in Case 5, whereas in Case 42 it takes a direct leap. Furthermore, while the peak altitude in Case 5 is 60 km, it reaches only 54 km in Case 42 before diving. The AN can control the HGV to select the appropriate ballistic inclination for the dive after escaping the interceptors.
Figure 11. Velocity comparison between RBT-DDPG (Cases 5 and 42) and the differential game approach.
Figure 12. Overload comparison between Cases 5 and 42.
Figure 13. Q-value comparison between Cases 5 and 42 (the cases shown in Figure 12). The CN provides an accurate Q estimate at the beginning of each episode and maintains this accuracy throughout the process in both Case 5 and Case 42.
Figure 14. Illustration of the penetration strategy learned by RBT-DDPG during the rendezvous phase. The interceptor's AO is no longer sufficient for the interceptable area to cover the whole reachable area of the HGV, which leaves the HGV an available penetration path.
Table 1. Parameters of the HGV and interceptor in the virtual scenario.
Parameter | Interceptor | HGV
Mass/kg | 75 | 500
Reference area/m² | 0.3 | 0.579
Minimum velocity/(m/s) | 400 | 1000
Available AoA/° | −20∼20 | −10∼10
Time constant of attitude control system/s | 0.1 | 1
Initial coordinate x/km | 200 | 0
Initial coordinate y/km | Random | 35
Initial velocity/(m/s) | Random | 2000
Initial inclination/° | 0 | 0
Coordinate x of the DRP/km | – | 200
Coordinate y of the DRP/km | – | 35
Kill radius/m | 300 | –
Table 2. Hyperparameters of RBT-DDPG and RBT-TD3.
Parameter | Value
Δt/s | 10⁻²
T_c/s | 1
γ | 1
α_C | 10⁻⁴
α_A in RBT-DDPG | 10⁻⁴
α_A in RBT-TD3 | 5 × 10⁻⁵
s_r | 0.001
μ in RBT-DDPG | 0.05
σ in RBT-DDPG | 0.01
θ in RBT-DDPG | 5 × 10⁻⁵
σ in RBT-TD3 | 0.1
L_TH | 0.1
Weight initialization | N(0, 0.02)
Bias initialization | N(0, 0.02)
w_1 | 0.5
w_2 | 10⁻⁴
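For illustration only, the Table 2 values can be gathered into a single configuration object as sketched below. The field names, and the interpretations of s_r as the target-network soft-update rate, of μ, σ, θ as exploration-noise parameters, and of w_1, w_2 as reward weights, are our assumptions rather than definitions taken from the table.

```python
from dataclasses import dataclass

@dataclass
class RBTConfig:
    """Table 2 hyperparameters gathered into one structure (field names are ours)."""
    dt: float = 1e-2            # integration step Δt, s
    t_c: float = 1.0            # guidance period T_c, s
    gamma: float = 1.0          # discount factor γ
    lr_critic: float = 1e-4     # α_C
    lr_actor: float = 1e-4      # α_A (5e-5 in RBT-TD3)
    s_r: float = 1e-3           # presumably the target-network soft-update rate
    ou_mu: float = 0.05         # μ in RBT-DDPG (presumably exploration noise)
    ou_sigma: float = 0.01      # σ in RBT-DDPG
    ou_theta: float = 5e-5      # θ in RBT-DDPG
    td3_sigma: float = 0.1      # σ in RBT-TD3
    loss_threshold: float = 0.1 # L_TH used by the RBT loop
    init_std: float = 0.02      # N(0, 0.02) weight/bias initialization
    w_1: float = 0.5            # presumably reward weights
    w_2: float = 1e-4

config = RBTConfig()  # e.g., passed to the training loop and to rbt_critic_update
```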
Table 3. Statistical results.
Algorithm | Anti-Interception Success Rate | ē_tf /m | v̄_tf /(m/s)
DDPG | 79.74% | 1425.62 | 1377.17
RBT-DDPG | 91.48% | 1514.44 | 1453.81