Article

An Intelligent Bait Delivery Control Method for Flight Vehicle Evasion Based on Reinforcement Learning

by Shuai Xue 1, Zhaolei Wang 2, Hongyang Bai 1,*, Chunmei Yu 2, Tianyu Deng 1 and Ruisheng Sun 1

1 School of Energy and Power Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
2 National Key Laboratory of Science and Technology on Aerospace Intelligence Control, Beijing Aerospace Automatic Control Institute, Beijing 100854, China
* Author to whom correspondence should be addressed.
Aerospace 2024, 11(8), 653; https://doi.org/10.3390/aerospace11080653
Submission received: 13 June 2024 / Revised: 5 August 2024 / Accepted: 7 August 2024 / Published: 11 August 2024
(This article belongs to the Section Aeronautics)

Abstract

During aerial combat, when an aircraft is facing an infrared air-to-air missile strike, infrared bait technology is an important means of penetration, and an effective infrared bait delivery strategy is critical. To address this issue, this study proposes an intelligent bait-dropping control method based on an improved deep deterministic policy gradient (DDPG) algorithm. Firstly, by modeling the relative motion between the aircraft, the bait, and the incoming missile, the Markov decision process of the aircraft-bait-missile infrared engagement was constructed, with the line-of-sight distance and line-of-sight angles as states. Then, the DDPG algorithm was improved by means of pre-training and classification sampling. The infrared bait-dropping decision network was trained through interaction with the environment and iterative learning, which led to the development of the bait-dropping strategy. Finally, the corresponding environment was transferred to the Nvidia Jetson TX2 embedded platform for comparative testing. The simulation results showed that the convergence speed of this method was 46.3% faster than that of the traditional DDPG algorithm. More importantly, it was able to generate an effective bait-dropping strategy, enabling the aircraft to successfully evade the attack of the incoming missile. The strategy instruction generation time was only about 2.5 ms, giving the method the ability to make online decisions.

1. Introduction

With the rapid development of modern military technology, infrared-guided missiles have become an important weapon in modern warfare. According to recent studies of aircraft damaged in wars abroad, approximately 90% of the aircraft were damaged by infrared-guided missiles [1,2]. Therefore, studying infrared defense strategies is crucial for air combat.
Infrared baits are the most commonly used interference measure against infrared-guided missiles. The deployment of infrared baits can interfere with the normal tracking of infrared-guided missiles and improve the survival rate of our aircraft. However, the current deployment of baits requires pilots to manually configure and control the bait deployment device based on the air situation, thus relying completely on the pilot’s experience. Improper deployment not only wastes valuable infrared bait resources but may also seriously endanger the survival and safety of the aircraft and pilots [3]. Therefore, making reasonable decisions in a battlefield environment, such as determining the launch speed, number of launches, interval between launches, and launch distance of infrared baits, has become a challenge that must be studied. Shen et al. [4] modeled and simulated the interference of infrared surface-source baits but did not analyze the interference strategy of infrared baits. Yang et al. [5] examined the infrared radiation characteristics of targets, infrared baits, and backgrounds and proposed a method for recognizing infrared baits; however, they did not study the interference characteristics of infrared baits. Chen et al. [6] provided a reference for the deployment strategy of infrared baits by studying the infrared radiation characteristics of an aircraft; however, they only obtained the deployment and use strategies of infrared baits through theoretical analyses and did not conduct simulation verification. Zhang et al. [7] examined adversarial strategies from two perspectives: barrel roll maneuvers and unpowered point-source baits. However, because they mainly focused on the target’s maneuver method and the bait’s deployment method, they were unable to generate the infrared bait projectile’s deployment plan through autonomous decision-making. In summary, most of the current research focuses only on the theoretical analysis of infrared modeling and bait deployment strategies.
With the rapid development of artificial intelligence technology, countries are committed to researching more intelligent weapon systems to replace existing designs [8]. In recent years, deep reinforcement learning has become a research hotspot, attracting much attention from scholars [9]. Deep reinforcement learning is considered one of the most promising routes toward general artificial intelligence and has strong generality. Deep reinforcement learning algorithms do not rely on accurate modeling and instead use model-free methods to control unknown dynamic environments [10], allowing them to handle the dynamics of unstable systems [11]. Shen et al. [12] proposed a data-driven method that uses Markov parameter sequences to replace the dynamics modeling of complex tensegrity systems. At present, there is extensive and in-depth research on the use of deep reinforcement learning in fields such as guidance, control, and path planning. For example, Yang et al. [13] studied the impact-time-control guidance of missiles under time-varying velocities caused by gravity and air resistance and proposed an impact-time-control guidance algorithm based on TD3. Aslan et al. [14] used deep reinforcement learning algorithms to solve the balance control problem of robots under external forces. Similarly, Lee et al. [15] used reinforcement learning to enable unmanned aerial vehicles to perform real-time autonomous path planning in unknown environments. Fan et al. [16] applied the DDPG algorithm to missile evasion decision training and simulated the effectiveness of escape strategies under four typical initial situations. Furthermore, Deng et al. [17] proposed a missile terminal guidance law based on an improved deep deterministic policy gradient algorithm; however, it enables an attacker to intercept maneuvering targets rather than scattering infrared baits so that the aircraft can successfully escape. Finally, Qiu et al. [18] proposed a recorded recurrent twin-delayed deep deterministic (RRTD3) policy gradient algorithm for intercepting maneuvering missiles in the atmosphere, but this work did not study targets carrying baits.
In this study, an intelligent control method for aircraft penetration bait delivery based on an improved DDPG algorithm is proposed. A dynamic model of the incoming missile, infrared bait, and aircraft in three-dimensional space is constructed, and the state space, action space, and reward function are designed in the process. The trained decision network is then used to output the bait deployment strategy. To address the slow convergence speed of the algorithm, improvements were made through pre-training and classification sampling mechanisms. Finally, the algorithm was transferred to an embedded platform for comparative testing to verify the online capability of the proposed method.
This paper is organized as follows: Section 2 describes the modeling of bait delivery strategies. Section 3 introduces the relationship between modules in aircraft penetration systems and the design of the Markov decision process. In Section 4, the deep reinforcement learning algorithm is introduced, and the algorithm is improved by pre-training and sampling mechanisms. Section 5 provides the algorithm training results and simulation verification examples of bait delivery in different scenarios. Section 6 summarizes the conclusions of this study.

2. Relative Motion Model of Aircraft-Bait-Incoming Missile

To simplify matters, assume that the incoming missile, bait, and aircraft are in a vacuum and within a uniform gravitational field. We only consider the motion of the center of mass in three dimensions. In addition, assume that the incoming missile and bait are rigid bodies with constant mass. The space coordinate systems [19] are defined as follows:
Inertial coordinate system $S_g(O_g X_g Y_g Z_g)$: The origin $O_g$ is taken at a point in inertial space. The $O_g X_g$ axis and $O_g Z_g$ axis lie in the horizontal plane, with the $O_g X_g$ axis pointing due north and the $O_g Z_g$ axis pointing due east. The $O_g Y_g$ axis points vertically upward so that the system satisfies the right-hand rule.
Ballistic coordinate system $S_t(O_t X_t Y_t Z_t)$: The origin $O_t$ is the instantaneous center of mass of the missile. The $O_t X_t$ axis coincides with the missile velocity vector. The $O_t Y_t$ axis lies in the vertical plane containing the velocity vector, is perpendicular to the $O_t X_t$ axis, and points upward (positive). The $O_t Z_t$ axis forms a right-hand coordinate system with the $O_t X_t$ and $O_t Y_t$ axes.
Based on the above conditions, the dynamic models [20] were constructed as shown below.

2.1. Motion Modeling of Aircraft, Incoming Missile, and Bait

The dynamic equation of the aircraft is as follows:
$$\left\{ \begin{aligned} \dot{x}_m &= v_{xm} \\ \dot{y}_m &= v_{ym} \\ \dot{z}_m &= v_{zm} \\ \dot{v}_{xm} &= \frac{P_x}{m} - \frac{G M x_m}{r_m^3} \\ \dot{v}_{ym} &= \frac{P_y}{m} - \frac{G M \left( y_m + R \right)}{r_m^3} \\ \dot{v}_{zm} &= \frac{P_z}{m} - \frac{G M z_m}{r_m^3} \\ \dot{m} &= -\frac{\left| P \right|}{I_{sp}} \\ r_m &= \sqrt{x_m^2 + \left( y_m + R \right)^2 + z_m^2} \end{aligned} \right. \qquad (1)$$
where $x_m$, $y_m$, and $z_m$ are the position components of the aircraft; $v_{xm}$, $v_{ym}$, and $v_{zm}$ are the velocity components of the aircraft; $P_x$, $P_y$, and $P_z$ are the components of the thrust generated by the aircraft’s maneuvering device; $m$ is the mass of the aircraft; $M$ is the mass of the Earth; $G$ is the universal gravitational constant; $R$ is the radius of the Earth; $r_m$ is the distance of the aircraft from the Earth’s center; and $I_{sp}$ stands for the fuel specific impulse.
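To make the use of Equation (1) concrete, the following Python sketch evaluates its right-hand side for a given aircraft state. This is an illustrative reconstruction rather than the authors' code: the state ordering, the function name, and the treatment of $I_{sp}$ as an effective exhaust velocity are assumptions.

```python
import numpy as np

G = 6.674e-11       # universal gravitational constant (m^3 kg^-1 s^-2)
M_EARTH = 5.972e24  # mass of the Earth (kg)
R_EARTH = 6.371e6   # radius of the Earth (m)

def aircraft_dynamics(state, thrust, isp):
    """Right-hand side of Equation (1).

    state  = [x, y, z, vx, vy, vz, m] in the inertial frame
    thrust = [Px, Py, Pz] commanded by the maneuvering device (N)
    isp    = fuel specific impulse, taken here as an effective exhaust velocity (m/s)
    """
    x, y, z, vx, vy, vz, m = state
    rm = np.sqrt(x**2 + (y + R_EARTH)**2 + z**2)   # distance from the Earth's center
    mu = G * M_EARTH
    ax = thrust[0] / m - mu * x / rm**3
    ay = thrust[1] / m - mu * (y + R_EARTH) / rm**3
    az = thrust[2] / m - mu * z / rm**3
    mdot = -np.linalg.norm(thrust) / isp           # propellant consumption rate
    return np.array([vx, vy, vz, ax, ay, az, mdot])
```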
The dynamic equation of the incoming missile is:
$$\left\{ \begin{aligned} \dot{x}_t &= v_t \cos\theta_t \cos\phi_t \\ \dot{y}_t &= v_t \sin\theta_t \\ \dot{z}_t &= -v_t \cos\theta_t \sin\phi_t \\ \dot{v}_t &= g\left( n_{tx} - \sin\theta_t \right) \\ \dot{\theta}_t &= \frac{g}{v_t}\left( n_{ty} - \cos\theta_t \right) \\ \dot{\phi}_t &= -\frac{g\, n_{tz}}{v_t \cos\theta_t} \end{aligned} \right. \qquad (2)$$
where $x_t$, $y_t$, and $z_t$ represent the coordinates of the incoming missile in the inertial coordinate system; $v_t$, $\theta_t$, and $\phi_t$ represent the velocity, trajectory pitch angle, and trajectory deviation angle of the incoming missile, respectively; $n_{tx}$ represents the longitudinal control overload of the incoming missile; and $n_{ty}$ and $n_{tz}$ represent the turning control overloads of the incoming missile in the pitch and yaw directions, respectively.
Assuming that the released infrared bait is unaffected by air resistance and moves only under the uniform acceleration of gravity, the dynamic equation of the bait is written as:
$$\left\{ \begin{aligned} \dot{x}_d &= v_d \cos\theta_d \cos\phi_d \\ \dot{y}_d &= v_d \sin\theta_d \\ \dot{z}_d &= -v_d \cos\theta_d \sin\phi_d \\ \dot{v}_d &= -g \sin\theta_d \\ \dot{\theta}_d &= -\frac{g}{v_d} \cos\theta_d \\ \dot{\phi}_d &= 0 \end{aligned} \right. \qquad (3)$$
where $x_d$, $y_d$, and $z_d$ represent the coordinates of the bait in the inertial coordinate system, and $v_d$, $\theta_d$, and $\phi_d$ represent the velocity, trajectory pitch angle, and trajectory deflection angle of the bait, respectively.
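Under the same assumptions, Equations (2) and (3) can be evaluated as in the sketch below. The helper functions are hypothetical, and the sign conventions follow the reconstructed equations, so they may differ from the authors' implementation.

```python
import numpy as np

g = 9.81  # gravitational acceleration (m/s^2)

def missile_dynamics(state, n_tx, n_ty, n_tz):
    """Right-hand side of Equation (2) for the incoming missile.

    state = [x, y, z, v, theta, phi] (trajectory pitch/deviation angles in rad)
    n_tx, n_ty, n_tz = longitudinal and turning control overloads (in g)
    """
    x, y, z, v, theta, phi = state
    dx = v * np.cos(theta) * np.cos(phi)
    dy = v * np.sin(theta)
    dz = -v * np.cos(theta) * np.sin(phi)
    dv = g * (n_tx - np.sin(theta))
    dtheta = g / v * (n_ty - np.cos(theta))
    dphi = -g * n_tz / (v * np.cos(theta))
    return np.array([dx, dy, dz, dv, dtheta, dphi])

def bait_dynamics(state):
    """Right-hand side of Equation (3): ballistic motion of a released bait."""
    x, y, z, v, theta, phi = state
    dx = v * np.cos(theta) * np.cos(phi)
    dy = v * np.sin(theta)
    dz = -v * np.cos(theta) * np.sin(phi)
    dv = -g * np.sin(theta)
    dtheta = -g / v * np.cos(theta)
    dphi = 0.0
    return np.array([dx, dy, dz, dv, dtheta, dphi])
```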

2.2. Relative Motion Model between Incoming Missile and Target

Assuming the position vector of the target relative to the incoming missile is $\mathbf{r}$ and the velocity vector is $\mathbf{V}$, they can be represented by $(r, V, q_\varepsilon, q_\beta)$ in the inertial coordinate system as follows:
$$\left\{ \begin{aligned} r &= \sqrt{r_x^2 + r_y^2 + r_z^2} \\ V &= \sqrt{V_x^2 + V_y^2 + V_z^2} \\ q_\beta &= \arctan\left( r_z / r_x \right) \\ q_\varepsilon &= \arcsin\left( r_y \Big/ \sqrt{r_x^2 + r_z^2} \right) \end{aligned} \right. \qquad (4)$$
where $r_x$, $r_y$, and $r_z$ are the components of the position of the target relative to the incoming missile in the three directions, with $r_x = x_t - x_m$, $r_y = y_t - y_m$, and $r_z = z_t - z_m$; $V_x$, $V_y$, and $V_z$ are the components of the velocity of the target relative to the incoming missile in the three directions, with $V_x = v_{xt} - v_{xm}$, $V_y = v_{yt} - v_{ym}$, and $V_z = v_{zt} - v_{zm}$; $q_\beta$ is the line-of-sight azimuth angle; and $q_\varepsilon$ is the line-of-sight elevation angle.
Taking the derivative of Equation (4) with respect to time yields:
$$\left\{ \begin{aligned} \dot{r} &= \frac{r_x V_x + r_y V_y + r_z V_z}{r} \\ \dot{q}_\beta &= \frac{V_x r_z - r_x V_z}{r_x^2 + r_z^2} \\ \dot{q}_\varepsilon &= \frac{V_y \left( r_x^2 + r_z^2 \right) - r_y \left( r_x V_x + r_z V_z \right)}{r^2 \sqrt{r_x^2 + r_z^2}} \end{aligned} \right. \qquad (5)$$
where $\dot{r}$ represents the relative range rate, and $\dot{q}_\beta$ and $\dot{q}_\varepsilon$ represent the line-of-sight azimuth angle rate and line-of-sight elevation angle rate, respectively.
The infrared-guided missile in this article adopts a proportional navigation guidance law, and its guidance commands are as follows:
$$\left\{ \begin{aligned} n_{ty} &= k_y \dot{r} \dot{q}_\varepsilon \\ n_{tz} &= k_z \dot{r} \dot{q}_\beta \end{aligned} \right. \qquad (6)$$
where $k_y$ and $k_z$ are the proportional guidance coefficients.
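A compact sketch of the relative-motion and guidance computations of Equations (4)–(6) is given below. It is illustrative only: the elevation angle is computed with atan2 for numerical robustness rather than the arcsin form of Equation (4), the sign and scaling conventions follow the reconstructed equations, and the default coefficients $k_y = k_z = 3$ are taken from the simulation settings in Section 5.

```python
import numpy as np

def los_and_guidance(pos_t, vel_t, pos_m, vel_m, k_y=3.0, k_z=3.0):
    """Line-of-sight geometry, LOS rates, and proportional-navigation overloads.

    pos_t/vel_t: target (aircraft or bait) position/velocity (np.array of 3)
    pos_m/vel_m: incoming missile position/velocity (np.array of 3)
    """
    rx, ry, rz = pos_t - pos_m          # relative position components
    vx, vy, vz = vel_t - vel_m          # relative velocity components
    r = np.sqrt(rx**2 + ry**2 + rz**2)
    r_xz = np.sqrt(rx**2 + rz**2)

    q_beta = np.arctan2(rz, rx)         # LOS azimuth angle
    q_eps = np.arctan2(ry, r_xz)        # LOS elevation angle (atan2 form)

    # LOS rates, Equation (5)
    r_dot = (rx * vx + ry * vy + rz * vz) / r
    q_beta_dot = (vx * rz - rx * vz) / (rx**2 + rz**2)
    q_eps_dot = (vy * r_xz**2 - ry * (rx * vx + rz * vz)) / (r**2 * r_xz)

    # turning overload commands, Equation (6) as reconstructed
    n_ty = k_y * r_dot * q_eps_dot
    n_tz = k_z * r_dot * q_beta_dot
    return r, q_eps, q_beta, n_ty, n_tz
```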

3. Design of Adversarial Defense Framework for Incoming Missile

3.1. Overall Framework Design

A defensive framework for incoming missiles was constructed to study the deployment strategy of using bait for interference defense in confrontation scenarios between our aircraft and incoming missiles. The entire framework consists of three modules: the bait autonomous deployment module (based on deep reinforcement learning), the target and scene real-time dynamic generation module, and the target detection and recognition module. The modules are connected and communicate via Ethernet. The general relationship between each module is shown in Figure 1.
The bait autonomous delivery module (based on deep reinforcement learning) sends the simulated position, velocity, and angle information of the incoming missile, aircraft, and baits to the target and scene real-time dynamic generation module. Based on that information, this module generates real-time infrared images and sends them to the target detection and recognition module, which then returns the detection probability to the bait autonomous delivery module, where it serves as the input that simulates the recognition probability of the incoming missile's seeker with respect to the target.
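The paper does not specify the Ethernet protocol used between the modules; as a purely hypothetical illustration, the state information could be packaged and sent to the scene-generation module roughly as follows (the address, port, and JSON packet layout are assumptions):

```python
import json
import socket

# Hypothetical address of the target and scene real-time dynamic generation module.
SCENE_GENERATOR_ADDR = ("192.168.1.20", 9000)

def send_state(sock, missile_state, aircraft_state, bait_states):
    """Send position/velocity/angle information over UDP as a JSON packet."""
    packet = json.dumps({
        "missile": missile_state,     # e.g. [x, y, z, v, theta, phi]
        "aircraft": aircraft_state,
        "baits": bait_states,         # list of bait states
    }).encode("utf-8")
    sock.sendto(packet, SCENE_GENERATOR_ADDR)

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
```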

3.2. Bait Delivery Model Based on Deep Reinforcement Learning

The Markov decision process (MDP) [21] model of autonomous bait delivery can be described by a quintuple $(s, a, p, r, \gamma)$, where $s$ is the state space, $a$ is the action space, $p$ is the state transition probability, $r$ is the reward function, and $\gamma$ is a discount factor [22].

3.2.1. State Space and Action Space Design

When formulating the MDP for autonomous bait launching, the state space should cover all states of the confrontation process. The state of the environment should be able to describe the dynamic characteristics of the environment, accurately simulate real-world conditions, and facilitate observation. Here we use the relative distance between the incoming missile and the aircraft, the elevation angle of the line of sight, and the azimuth angle of the line of sight to represent the state space of the environment as follows:
$$s = \left[ r, q_\varepsilon, q_\beta \right] \qquad (7)$$
where $r$ is the relative distance between the incoming missile and the aircraft, $q_\varepsilon$ is the elevation angle of the line of sight, and $q_\beta$ is the azimuth angle of the line of sight.
The action space is:
$$a = \left[ v_{dx}, v_{dy}, v_{dz} \right] \qquad (8)$$
where $v_{dx}$, $v_{dy}$, and $v_{dz}$ represent the velocity components of the bait projectile in the $x$, $y$, and $z$ directions, respectively.
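Since the agent environment is built with Gym (Section 5.2), the three-dimensional state and action spaces defined above could be declared as in the following sketch; the bounds are illustrative assumptions, not values taken from the paper.

```python
import numpy as np
from gym import spaces

# State: [relative distance r (m), LOS elevation q_eps (rad), LOS azimuth q_beta (rad)].
# The bounds below are illustrative assumptions only.
observation_space = spaces.Box(
    low=np.array([0.0, -np.pi / 2, -np.pi], dtype=np.float32),
    high=np.array([3.0e4, np.pi / 2, np.pi], dtype=np.float32),
    dtype=np.float32,
)

# Action: bait ejection velocity components [v_dx, v_dy, v_dz] (m/s), assumed bounds.
action_space = spaces.Box(low=-50.0, high=50.0, shape=(3,), dtype=np.float32)
```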

3.2.2. Probability of State Transition

The state transition probability represents the probability of reaching the state $s_{t+1}$ at time $t+1$ when the incoming missile, aircraft, and infrared bait are in state $s_t$ and take action $a_t$ at the current time $t$. Since the environment dynamics are deterministic, the state transition probability of the counter-defense MDP against the incoming missile is:
$$p\left( s_{t+1} \mid s_t, a_t \right) = 1 \qquad (9)$$

3.2.3. Reward Function Design

The reasonable design of the reward function is key to ensuring the convergence of the algorithm. The reward function here combines sparse (terminal) rewards with process (shaping) rewards. The specific reward function design is as follows:
(1) Reward for successfully evading incoming missile attacks
The main purpose of the bait dropping strategy of our aircraft when it encounters the attack of the incoming infrared guided air-to-air missile is to successfully evade the attack of the incoming missile while deceiving the seeker. Therefore, its reward function was designed as follows:
$$r_1 = \begin{cases} 100, & \text{successfully evade the missile} \\ 0, & \text{failed to evade the missile} \end{cases} \qquad (10)$$
where the condition for successful evasion is that the distance between the incoming missile and the target satisfies $r > 50\ \mathrm{m}$.
(2) Reward for the number of infrared baits consumed
The number of baits carried by the aircraft on a mission is limited. In order to improve the continuous defense capability of our aircraft against infrared-guided missiles in subsequent operations, the number of infrared baits deployed should be reduced as much as possible, on the premise of ensuring the safety of our aircraft. The reward function is therefore designed as follows:
$$r_2 = 2 \times \left( n_1 - n_2 \right) \qquad (11)$$
where $n_1$ is the initial number of infrared baits and $n_2$ is the number of infrared baits deployed.
(3) Line-of-sight angle reward
During the interference of the infrared bait with the incoming missile, the greater the line-of-sight angle, the greater the probability that our aircraft can successfully avoid the incoming missile attack; thus, the reward function is designed as:
$$r_3 = 100\, e^{\frac{q_\varepsilon + q_\beta}{180}} \qquad (12)$$
(4) Position reward of bait shot
In order to ensure that the aircraft can successfully evade an incoming missile under the action of the bait, a reward is given for its relative position during the evasion process (i.e., the minimum distance between the incoming missile and the aircraft). When the distance between the incoming missile and the aircraft is greater than 50 m, it indicates that the bait successfully caused the aircraft to escape, and a positive reward is provided. When the distance between the incoming missile and the aircraft is less than 50 m, it indicates that the aircraft failed to escape, and a negative reward is provided as punishment. Therefore, the reward function was designed as follows:
$$r_4 = \begin{cases} 0.01\, r_m, & \text{successfully evade the missile} \\ -10, & \text{failed to evade the missile} \end{cases} \qquad (13)$$
where $r_m$ is the distance to the target at the time when the bait is deployed.
Considering the above four reward models, the total reward function was designed as follows:
$$r_{total} = r_1 + r_2 + r_3 + r_4 \qquad (14)$$
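A minimal sketch of the total reward computation, assembled directly from the reconstructed reward terms above, is shown below. The angle units and the exact sign conventions of $r_3$ and $r_4$ follow the reconstruction and may differ from the authors' implementation.

```python
import numpy as np

def total_reward(evaded, miss_distance, n_initial, n_deployed, q_eps_deg, q_beta_deg):
    """Total reward r_total = r1 + r2 + r3 + r4.

    evaded        : True if the missile-aircraft distance stayed above 50 m
    miss_distance : distance to the target when the bait is deployed (m)
    q_eps_deg, q_beta_deg : line-of-sight angles (degrees, assumed unit)
    """
    r1 = 100.0 if evaded else 0.0                         # terminal evasion reward
    r2 = 2.0 * (n_initial - n_deployed)                   # reward for remaining baits
    r3 = 100.0 * np.exp((q_eps_deg + q_beta_deg) / 180.0) # line-of-sight angle reward
    r4 = 0.01 * miss_distance if evaded else -10.0        # bait-release position reward
    return r1 + r2 + r3 + r4
```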

3.3. Identification Model of Incoming Missile Seeker

The infrared seeker is the directional control device of the guided weapon; it automatically tracks its target through the continuous identification, detection, and tracking of the target's radiated energy by the infrared detector. Like other types of seekers, the infrared seeker not only continuously detects the infrared radiation of the target and identifies, detects, and tracks it, but it can also eliminate the deviation angle error of the incoming missile, automatically track the locked target, and output information in real time [23].
In order to put the seeker in the loop, a simulated seeker model of the incoming flight vehicle was designed. The relative motion model of the incoming flight vehicle, bait, and aircraft provides their respective particle motion trajectories; with the help of the infrared target simulator, the infrared target characteristics of the bait and aircraft are simulated, and the tracking process of the incoming flight vehicle's seeker against the infrared bait and aircraft is reproduced.
The infrared target simulator in this study uses the real-time dynamic generation system for targets and scenes used in image seekers proposed in reference [24] and is able to generate real-time infrared target images using mathematical formulas. In addition, the target detection and recognition module uses the YOLOv5 object detection algorithm to calculate the real-time recognition probability from the target's infrared image.

4. Deep Reinforcement Learning Algorithms

Deep reinforcement learning [25] is a decision-making approach based on deep learning models. Deep learning focuses on perception, whereas reinforcement learning focuses on decision-making. Deep reinforcement learning combines the perceptual ability of deep learning with the sequential decision-making ability of reinforcement learning, enabling intelligent agents to solve decision-making problems in complex state spaces, and it has strong generality.

4.1. DDPG Algorithm

DDPG [26] is an important deep reinforcement learning algorithm in the field of continuous control.
A task-independent model is obtained by combining the deterministic policy gradient algorithm with the actor-critic framework, which can solve numerous continuous control problems with different tasks using the same parameters. Figure 2 shows the basic structure of the actor-critic part of the DDPG algorithm.
The DDPG algorithm includes a current network and a target network for both the actor and the critic. The critic value network estimates the Q value and updates the critic's current network parameters $\omega$ by minimizing the loss function $Loss$, which is expressed as follows:
$$Loss = \frac{1}{m} \sum_i \left[ r_i + \gamma Q'\left( s_{i+1}, a_{i+1} \mid \omega' \right) - Q\left( s_i, a_i \mid \omega \right) \right]^2 \qquad (15)$$
where $m$ is the number of sampled experience data; $r_i$ represents the reward at time $i$; $\gamma$ is the reward discount factor; $Q'\left( s_{i+1}, a_{i+1} \mid \omega' \right)$ represents the target network Q value for environment state $s_{i+1}$ at time $i+1$; and $Q\left( s_i, a_i \mid \omega \right)$ represents the current network Q value obtained by inputting the current state $s_i$ and current action $a_i$ at time $i$.
The actor policy network adopts the deterministic policy gradient approach, is updated using the policy gradient method, and outputs a deterministic action $a = \mu\left( s_i \mid \theta \right)$. The actor's current network parameters $\theta$ are updated as follows:
$$\nabla_\theta J(\theta) = \frac{1}{m} \sum_i \nabla_{a_i} Q\left( s_i, a_i \mid \omega \right) \nabla_\theta \mu\left( s_i \mid \theta \right) \qquad (16)$$
where $\nabla_\theta J(\theta)$ is the gradient of the performance objective $J(\theta)$ with respect to the actor's online network parameters $\theta$, and $\nabla_\theta \mu\left( s_i \mid \theta \right)$ is the gradient of the deterministic policy $\mu$ with respect to $\theta$.
During the training process, the parameters $\theta'$ and $\omega'$ of the actor and critic target networks are soft-updated from their corresponding current networks at regular intervals. The specific updating method is as follows:
$$\left\{ \begin{aligned} \theta' &\leftarrow \tau \theta + (1 - \tau)\theta' \\ \omega' &\leftarrow \tau \omega + (1 - \tau)\omega' \end{aligned} \right. \qquad (17)$$
where $\tau$ is used to control the update speed of the target network parameters $\theta'$ and $\omega'$.
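For reference, one DDPG update step corresponding to Equations (15)–(17) can be sketched in PyTorch as follows. The network interfaces (a critic called as critic(s, a) and an actor called as actor(s)) and the value of $\tau$ are assumptions for illustration; $\gamma = 0.99$ matches Table 3.

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, actor_target, critic_target,
                actor_opt, critic_opt, gamma=0.99, tau=0.005):
    """One DDPG update step following Equations (15)-(17).

    batch = (s, a, r, s_next) tensors sampled from the replay buffer;
    tau is an assumed value (the paper does not list it).
    """
    s, a, r, s_next = batch

    # Critic update: minimize the TD error of Equation (15).
    with torch.no_grad():
        a_next = actor_target(s_next)
        y = r + gamma * critic_target(s_next, a_next)
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: deterministic policy gradient of Equation (16).
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft update of the target networks, Equation (17).
    for target, online in ((actor_target, actor), (critic_target, critic)):
        for p_t, p in zip(target.parameters(), online.parameters()):
            p_t.data.copy_(tau * p.data + (1.0 - tau) * p_t.data)
```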

4.2. Improved DDPG Algorithm

The traditional DDPG algorithm has strong randomness in the initial stage of policy exploration, and the gradient of the policy network declines slowly. Therefore, this study provides the initial network weight parameters through pre-training, starting from random action selection, which decreases invalid searches during the early exploration stage and increases the algorithm's convergence speed. Specifically, in pre-training, an initial policy is obtained by briefly training the agent with a reward that only considers whether the aircraft successfully avoids the incoming flight vehicle attack; that is, only the miss distance is taken into consideration. The resulting network weights are then saved and used as the initial values for the full training run, as sketched below.
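A minimal sketch of how the pre-trained weights might be used to seed the networks is given below; the file names and the helper function are hypothetical.

```python
import torch

def load_pretrained_weights(actor, critic, actor_target, critic_target,
                            actor_path="actor_pretrained.pth",
                            critic_path="critic_pretrained.pth"):
    """Seed the DDPG networks with weights saved after the pre-training phase.

    The pre-training phase is assumed to have used only the miss-distance
    reward, as described above; file names are illustrative.
    """
    actor.load_state_dict(torch.load(actor_path))
    critic.load_state_dict(torch.load(critic_path))
    actor_target.load_state_dict(actor.state_dict())    # theta' <- theta
    critic_target.load_state_dict(critic.state_dict())  # omega' <- omega
```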
In addition, the training effect of the traditional DDPG algorithm is diminished by the fact that it updates network parameters by random sampling during experience replay, so the extracted samples may contain a large amount of low-value data. To solve this issue, this study classifies the samples in the experience pool during training: data from episodes in which the aircraft successfully evades the incoming missile are stored in experience pool I, and data from episodes in which the aircraft is hit by the incoming missile are stored in experience pool II. During training, the goal is to sample the higher-value experience data from experience pool I as much as possible. A method of proportional sampling from the two experience pools was designed with these factors in mind and is shown in Equation (18). In order to preserve randomness in exploration and avoid falling into local optima, random sampling is still adopted within each of experience pools I and II.
$$\left\{ \begin{aligned} N_1 &= n - N_2 \\ N_2 &= \lambda n \end{aligned} \right. \qquad (18)$$
where $N_1$ and $N_2$ are the numbers of samples drawn from experience pools I and II, respectively; $n$ is the total number of samples; and $\lambda \in (0, 1)$ is the sampling rate from experience pool II.
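The classified experience replay described above can be sketched as follows; the class and method names are illustrative, and the default sampling rate $\lambda = 0.05$ is taken from Table 3.

```python
import random

class ClassifiedReplayBuffer:
    """Two experience pools: pool I holds transitions from successful evasions,
    pool II holds transitions from failed ones. Sampling follows Equation (18):
    N2 = lambda * n from pool II and N1 = n - N2 from pool I."""

    def __init__(self, lam=0.05):
        self.pool_I, self.pool_II = [], []
        self.lam = lam

    def store(self, transition, evaded):
        # transition = (s_t, a_t, r_t, s_{t+1})
        (self.pool_I if evaded else self.pool_II).append(transition)

    def sample(self, n):
        n2 = min(int(self.lam * n), len(self.pool_II))   # N2 = lambda * n
        n1 = min(n - n2, len(self.pool_I))               # N1 = n - N2
        return random.sample(self.pool_I, n1) + random.sample(self.pool_II, n2)
```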
Figure 3 shows the structure of the infrared bait autonomous delivery framework designed with the improved DDPG algorithm in this paper.
The algorithm flow corresponding to the structure diagram in Figure 3 is shown in Table 1.

5. Simulation Verification and Analysis

5.1. Simulation Scenario Setting and Platform Construction

In the simulation scenario, the initial positions, initial speeds, and initial angles of the aircraft and the incoming missile are set as shown in Table 2. The mass of the aircraft is set to 600 kg, and its available thrust is set to 20,000 N. The mass of the incoming missile is 20 kg, its maximum usable overload is 4 g, and the miss-distance threshold is 50 m. The incoming missile attacks the aircraft using proportional navigation guidance with a proportional guidance coefficient of 3. The fourth-order Runge-Kutta method was used for integration, with an integration step of 0.005 s and a maximum integration time of 20 s.
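For completeness, a generic fourth-order Runge-Kutta step matching the stated integration settings is sketched below; the function signature is an assumption, and the dynamics function f stands for any of the right-hand sides defined in Section 2, wrapped as a closure over the control inputs (thrust or overloads) at the current step.

```python
def rk4_step(f, state, dt=0.005):
    """One fourth-order Runge-Kutta step for a dynamics function f(state)."""
    k1 = f(state)
    k2 = f(state + 0.5 * dt * k1)
    k3 = f(state + 0.5 * dt * k2)
    k4 = f(state + dt * k3)
    return state + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
```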
Figure 4 shows the physical simulation system developed for this study, including the simulation interfaces of the three modules in the simulation platform, which exchange data through the Ethernet connection built for this study.

5.2. Setting Simulation Parameters

In this paper, the algorithm is programmed in Python, the agent environment is built with Gym, and the deep reinforcement learning training environment is built on PyTorch. The test host runs the Windows 10 operating system, and the hardware configuration is an Intel Core i7-9700K CPU, an RTX 3080 Ti GPU, and 16 GB of RAM. The algorithm hyperparameter settings are shown in Table 3. In addition, the environment was ported to the Nvidia Jetson TX2 embedded development board, and the test was carried out on the embedded platform with all other parameters unchanged. The detailed parameters of the Nvidia Jetson TX2 are shown in Table 4.

5.3. Analysis of Simulation Results

In this paper, we first use the traditional DDPG algorithm to train the autonomous delivery strategy network for the infrared bait. Then, the improved DDPG algorithm, with pre-training and classification sampling, is trained on the same task. Finally, the reward convergence of the two algorithms is compared. The average reward over 100 episodes with random initial conditions was taken as the result of each of the multiple training runs, as shown in Figure 5.
As can be seen in Figure 5, under the same simulation environment, the convergence steps of the traditional DDPG algorithm and the improved DDPG algorithm are 177 and 95, respectively, and the maximum rewards are 526 and 477, respectively. Thus, the convergence speed of the improved DDPG algorithm is 46.3% faster, and a comparable maximum reward can still be explored. Therefore, the improved DDPG algorithm developed for this study enhances the exploration ability of the agent, and the convergence of the algorithm is significantly improved. In addition, Gaussian noise was added to the training process of the improved DDPG algorithm, and the training result is shown as the blue curve in Figure 5. Figure 5 also shows that the algorithm is robust: it still reaches the convergence state in the end, and the curve fluctuates little compared with the conditions before the noise interference was introduced.
Following the training phase of the improved DDPG algorithm, the parameters of the strategy network were fixed. The interaction with the environment of the incoming missile, the aircraft, and the infrared bait was simulated and verified when the initial position and velocity parameters of the incoming missile and the aircraft differed by 5%. In addition, in order to verify the real-time performance and online capability of the algorithm on the embedded platform, the simulation environment built for this study was transplanted to the Nvidia Jetson TX2 embedded platform.
Figure 6 shows the three-dimensional motion trajectory diagram of the incoming missile when the aircraft only maneuvers without adopting the autonomous delivery strategy of infrared bait in this paper. Figure 6a shows the flight curve when the aircraft and the incoming missile are in the same direction (i.e., simulation situation I). Figure 6b shows the flight curve when the aircraft and the incoming missile are in relative motion (i.e., simulation situation II). Figure 6c shows the motion curve of the incoming missile attacking the aircraft from a high point at a certain height relative to the aircraft (i.e., simulation situation III).
Figure 7 shows the change in the relative distance between the incoming missile and the aircraft. Among them, Figure 7a shows the relative distance change curve during the simulation situation I. Figure 7b shows the relative distance change curve in simulation situation II. Figure 7c shows the relative distance change curve between the incoming missile and the aircraft in simulation situation III. In all three situations, the aircraft failed to evade the attack of the incoming missile.
Figure 8 shows the three-dimensional motion trajectory diagram of the infrared bait deployment by the delivery strategy network while the aircraft is maneuvering during an encounter with the incoming missile attack. Figure 8a shows the flight curve of the bait when the aircraft and the incoming missile are in the same direction (i.e., simulation situation I). Figure 8b shows the flight curve of the bait when the aircraft and the incoming missile are in relative motion (i.e., simulation situation II). Figure 8c shows the flight curve of the bait when the incoming missile is approaching from a high point over the aircraft (i.e., simulated situation III).
Figure 9 shows the variation curve of the relative distance between the incoming missile and the maneuvering aircraft under the interference of the bait. Figure 9a shows the relative distance change curve in simulation situation I. Figure 9b shows the relative distance change curve in simulation situation II. Figure 9c shows the relative distance change curve in simulation situation Ⅲ. In all three situations, the incoming missile is lured by the infrared baits, causing it to deviate from the target, so that the aircraft can successfully evade the attack and escape.
According to the curves in Figure 6, Figure 7, Figure 8 and Figure 9, the simulation results and computation speeds in the above cases were compared, as shown in Table 5.
Each simulation situation was repeated 2000 times. As Table 5 shows, in the case of flight situation I, when the aircraft uses the bait autonomous delivery strategy network, dropping five point-source infrared baits, the miss distance of the incoming missile is 670.2 m, and the successful escape rate of the aircraft was 96.8%. Conversely, when the bait strategy is not implemented, the miss distance is 0.5 m, and the successful escape rate is 0.3%. Similarly, in flight situation II, when the bait autonomous delivery strategy network is used and the aircraft adopts the strategy to deploy four point-source infrared baits, the miss distance of the incoming missile is 1000.0 m, and the successful escape rate of the aircraft is 97.2%. When the bait strategy is not used, the miss distance is 0.2 m, and the successful escape rate is 0.24%. In the case of flight situation Ⅲ, when the bait autonomous delivery strategy network is used and the aircraft deploys four point-source infrared baits, the average miss distance of the incoming missile is 750.5 m, and the successful escape rate of the aircraft is 96.0%. In contrast, when the bait strategy is not implemented, the average miss distance of the aircraft is 0.6 m, and the successful escape rate of the aircraft is 0.4%. Therefore, the bait autonomous delivery strategy in this study can improve the success rate of aircraft in evading an incoming missile. In addition, during the decision output, the time needed to generate control instructions is 2.5 ms, while the test time of the decision instruction output on the Nvidia Jetson TX2 embedded platform is 12.5 ms, which meets the real-time requirements. These results verify that the autonomous delivery method developed in this study can be deployed on the embedded platform and has online decision-making capability.

6. Conclusions

With the aim of improving the success rate of infrared bait launching, this paper designed an intelligent control method for autonomous bait launching based on an improved DDPG algorithm. The proposed algorithm applies deep reinforcement learning to air combat operations to provide continuous, deterministic action decisions for infrared bait delivery. To further optimize the algorithm, this study also introduced pre-training and classification sampling. The simulation results show that the intelligent control method markedly improves the convergence of the traditional DDPG algorithm, increasing the convergence speed by about 46.3%. The decision network can also output an effective bait-delivery strategy. Meanwhile, the decision control command is generated quickly, in about 2.5 ms, which meets real-time requirements. The control command output on the embedded platform also meets the required speed, providing online control capability and rapid autonomous decision-making. Our results show that the probability of a successful escape is about 97%. The proposed method therefore greatly improves the autonomous survivability of aircraft in the face of infrared-guided missile attacks.

Author Contributions

Conceptualization, S.X.; methodology, S.X. and H.B.; software, S.X., Z.W. and T.D.; validation, Z.W. and C.Y.; formal analysis, R.S.; investigation, H.B.; resources, H.B.; data curation, H.B.; writing—original draft preparation, S.X.; writing—review and editing, S.X. and Z.W.; visualization, S.X.; supervision, H.B.; project administration, S.X. and H.B.; funding acquisition, H.B. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the National Natural Science Foundation of China through Grant No. U21B2028.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy.

Acknowledgments

Shuai Xue thanks Hongyang Bai for their helpful guidance.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Nowak, J.; Achimowicz, J.; Ogonowski, K.; Biernacki, R. Protection of air transport against acts of unlawful interference: What’s next. Saf. Def. 2020, 2, 75–88. [Google Scholar]
  2. Wang, R.F.; Wu, W.D.; Zhang, Y.P. Summarization of defense measures for the IR guided flight vehicle. Laser Infrared 2006, 12, 1103–1105. [Google Scholar]
  3. Hu, Z.H.; Chen, K.; Yan, J. Research on control parameters of infrared bait delivery device. Infrared Laser Eng. 2008, 37, 396–399. [Google Scholar] [CrossRef]
  4. Shen, T.; Li, F.; Song, M.M. Modeling and simulation of infrared surface source interference based on SE-Work-Bench. Aerosp. Electron. Work. 2019, 39, 6–10. [Google Scholar]
  5. Yang, S.; Wang, B.; Yi, X.; Yu, H.; Li, J.; Zhou, H. Infrared baits recognition method based on dual-band information fusion. Infrared Phys. Technol. 2014, 67, 542–546. [Google Scholar] [CrossRef]
  6. Chen, S.H.; Zhu, N.Y.; Chen, N.; Ma, X.J.; Liu, J. Infrared radiation characteristics test of aircraft and research on infrared bait delivery. Infrared Technol. 2021, 43, 949–953. [Google Scholar]
  7. Zhang, N.; Chen, C.S.; Sun, J.G.; Liang, X.C. Research on infrared air-to-air flight vehicle based on barrel roll maneuver and bait projection. Infrared Technol. 2022, 44, 236–248. [Google Scholar]
  8. Huang, S.C.; Li, W.M.; Li, W. Two-sided optimal decision model used for ballistic flight vehicle attack-defense. J. Airf. Eng. Univ. Nat. Sci. Ed. 2007, 8, 23–25. [Google Scholar]
  9. Sutton, R.S.; Barto, A.G. Reinforcement learning: An introduction. IEEE Trans. Neural Netw. 1998, 9, 1054. [Google Scholar] [CrossRef]
  10. Shen, Y.; Chen, M.; Skelton, R.E. Markov data-based reference tracking control to tensegrity morphing airfoils. Eng. Struct. 2023, 291, 116430. [Google Scholar] [CrossRef]
  11. Shen, Y.; Chen, M.; Majji, M.; Skelton, R.E. Q-Markov covariance equivalent realizations for unstable and marginally stable systems. Mech. Syst. Signal Process. 2023, 196, 110343. [Google Scholar] [CrossRef]
  12. Shen, Y.; Chen, M.; Skelton, R.E. A Markov data-based approach to system identification and output error covariance analysis for tensegrity structures. Nonlinear Dyn. 2024, 112, 7215–7231. [Google Scholar] [CrossRef]
  13. Yang, Z.; Liu, X.; Liu, H. Impact time control guidance law with time-varying velocity based on deep reinforcement learning. Aerosp. Sci. Technol. 2023, 142, 108603. [Google Scholar] [CrossRef]
  14. Aslan, E.; Arserim, M.A.; Uçar, A. Development of push-recovery control system for humanoid robots using deep reinforcement learning. Ain Shams Eng. J. 2023, 14, 102167. [Google Scholar] [CrossRef]
  15. Lee, G.; Kim, K.; Jang, J. Real-time path planning of controllable UAV by subgoals using goal-conditioned reinforcement learning. Appl. Soft Comput. 2023, 146, 110660. [Google Scholar] [CrossRef]
  16. Fan, X.L.; Li, D.; Zhang, W.; Wang, J.Z.; Guo, J.W. Flight vehicle evasion decision training based on deep reinforcement learning. Electron. Opt. Control 2021, 28, 81–85. [Google Scholar]
  17. Deng, T.B.; Huang, H.; Fang, Y.W.; Yan, J.; Cheng, H.Y. Reinforcement learning-based flight vehicle terminal guidance of maneuvering targets with baits. Chin. J. Aeronaut. 2023, 36, 309–324. [Google Scholar] [CrossRef]
  18. Qiu, X.Q.; Lai, P.; Gao, C.S.; Jing, W.X. Recorded recurrent deep reinforcement learning guidance laws for intercepting endoatmospheric maneuvering flight vehicles. Def. Technol. 2024, 31, 457–470. [Google Scholar] [CrossRef]
  19. Qian, X.F.; Lin, R.X.; Zhao, Y.N. Aircraft Flight Mechanics; Beijing Institute of Technology Press: Beijing, China, 2012; pp. 102–110. [Google Scholar]
  20. Tang, S.J.; Yang, B.E.; Xu, L.F.; Zhang, Y. Research on anti-interference technology based on target and infrared bait projectile motion mode. Aerosp. Shanghai 2017, 34, 44–49. [Google Scholar]
  21. Sigaud, O.; Buffet, O. Markov decision processes in artificial intelligence. In Markov Processes & Controlled Markov Chains; Springer: Berlin/Heidelberg, Germany, 2010. [Google Scholar] [CrossRef]
  22. Howard, M. Multi-Agent Machine Learning: A Reinforcement Approach; China Machine Press: Beijing, China, 2017. [Google Scholar]
  23. Ma, X.P.; Zhao, L.Y. Overview of research status of key technologies of infrared seeker at home and abroad. Aviat. Weapons 2018, 3, 3–10. [Google Scholar]
  24. Bai, H.Y.; Zhou, Y.X.; Zheng, P.; Guo, H.W.; Li, Z.M.; Hu, K. A Real-Time Dynamic Generation System and Method for Target and Scene Used in Image Seeker. CN202010103846.3, 28 July 2020. [Google Scholar]
  25. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef] [PubMed]
  26. Vosoogh, A.; Sorkherizi, M.S.; Zaman, A.U.; Yang, J.; Kishk, A.A. An integrated Ka-Band diplexer-antenna array module based on gap waveguide technology with simple mechanical assembly and no electrical contact requirements. IEEE Trans. Microw. Theory Tech. 2018, 66, 962–972. [Google Scholar] [CrossRef]
Figure 1. Overall structural framework diagram.
Figure 2. Basic structure block diagram of actor-critic.
Figure 3. Structure diagram of an infrared bait autonomous delivery framework with improved DDPG algorithm.
Figure 4. Physical simulation system.
Figure 5. Algorithm reward convergence diagram.
Figure 6. Trajectory of incoming missile and aircraft.
Figure 7. Relative distance between incoming missile and mobile aircraft.
Figure 8. Trajectory diagram of incoming missile-bait-aircraft.
Figure 9. Relative distance between incoming missile and maneuvering aircraft under the bait effect.
Table 1. Improved DDPG algorithm flow.

Improved DDPG Algorithm
1: Obtain initial network weight parameters ω and θ through neural network pre-training
2: Initialize the critic current network and actor current network with the pre-trained parameters ω and θ
3: Initialize the critic target network and actor target network with parameters ω′ ← ω and θ′ ← θ
4: Initialize experience pools I and II
5: for episode = 1, M do
6:   Initialize state s1
7:   for t = 1, T do
8:     Select an action at from the current state st according to the parameters of the pre-trained model
9:     Perform action at to obtain the reward value rt and the next state st+1
10:    Store the experience (st, at, rt, st+1) in the experience pool: if our aircraft successfully evades the attack of the incoming flight vehicle under the interference of the infrared bait, store the data in experience pool I; otherwise, store it in experience pool II
11:    Sample m experiences, with the numbers of random samples from experience pools I and II given by Equation (18)
12:    Update the critic current network according to Equation (15)
13:    Update the actor current network according to Equation (16)
14:    Update the actor target network and critic target network according to Equation (17)
15:  end for
16: end for
Table 2. Initial motion parameters of the incoming missile and the aircraft.

Simulation Situation I
  Initial position of aircraft (m): (x_m, y_m, z_m) = (1000, 1500, 20,000)
  Initial position of missile (m): (x_t, y_t, z_t) = (0, 0, 0)
  Initial speed of aircraft (m·s−1): v_m = 400
  Initial velocity of missile (m·s−1): v_t = 500
  Initial angles of missile (rad): (θ_t, ϕ_t) = (0.2, 0)
Simulation Situation II
  Initial position of aircraft (m): (x_m, y_m, z_m) = (1000, 0, 10,000)
  Initial position of missile (m): (x_t, y_t, z_t) = (0, 0, 0)
  Initial speed of aircraft (m·s−1): v_m = 210
  Initial velocity of missile (m·s−1): v_t = 500
  Initial angles of missile (rad): (θ_t, ϕ_t) = (0, 0)
Simulation Situation III
  Initial position of aircraft (m): (x_m, y_m, z_m) = (1000, 0, 12,000)
  Initial position of missile (m): (x_t, y_t, z_t) = (0, 3000, 0)
  Initial speed of aircraft (m·s−1): v_m = 210
  Initial velocity of missile (m·s−1): v_t = 400
  Initial angles of missile (rad): (θ_t, ϕ_t) = (0, 0)
Table 3. Algorithm hyperparameter settings.

  Actor and critic learning rate: 0.0003
  Discount factor: 0.99
  Experience pool size: 1 × 10^7
  Sample rate of experience pool II: 0.05
  Number of batch samples: 256
  Maximum steps: 400
  Optimizer: Adam
Table 4. Detailed parameters of the Nvidia Jetson TX2.

  GPU: NVIDIA Pascal architecture with 256 NVIDIA CUDA cores
  CPU: Dual-core Denver 2 64-bit CPU and quad-core ARM A57 complex
  Memory: 8 GB 128-bit LPDDR4
  Storage: 32 GB eMMC 5.1
  Connectivity: Gigabit Ethernet
  Size: 87 mm × 50 mm
Table 5. Comparison of results.

Simulation Situation I
  Release without bait: average miss distance 0.5 m; escape success rate 0.3%.
  Bait release (point-source bait, 5 baits dropped): average miss distance 670.2 m; escape success rate 96.8%; decision time 2.5 ms (PC) / 12.5 ms (Nvidia Jetson TX2).
  Bait drop positions (x_d, y_d, z_d)/m and angles (θ_d, ϕ_d)/rad:
    (1025.1, 1500.0, 21,234.9), (0.8, −0.02)
    (1081.3, 1499.9, 24,013.3), (1.1, −0.02)
    (1106.5, 1500.0, 25,248.2), (1.2, −0.01)
    (1162.7, 1499.9, 28,026.7), (1.4, −0.01)
    (1206.6, 1500.0, 30,187.8), (1.6, −0.01)
Simulation Situation II
  Release without bait: average miss distance 0.2 m; escape success rate 0.24%.
  Bait release (point-source bait, 4 baits dropped): average miss distance 1000.0 m; escape success rate 97.2%; decision time 2.5 ms (PC) / 12.5 ms (Nvidia Jetson TX2).
  Bait drop positions (x_d, y_d, z_d)/m and angles (θ_d, ϕ_d)/rad:
    (993.9, −0.1, 9969.1), (−0.7, −0.02)
    (890.1, 0.0, 9573.2), (−1.1, −0.02)
    (656.5, 0.0, 8988.7), (−1.6, −0.01)
    (302.1, 0.0, 8394.1), (−2.1, −0.01)
Simulation Situation III
  Release without bait: average miss distance 0.6 m; escape success rate 0.4%.
  Bait release (point-source bait, 4 baits dropped): average miss distance 750.5 m; escape success rate 96.0%; decision time 2.5 ms (PC) / 12.5 ms (Nvidia Jetson TX2).
  Bait drop positions (x_d, y_d, z_d)/m and angles (θ_d, ϕ_d)/rad:
    (719.6, −0.01, 10,491.0), (−1.3, −0.01)
    (481.0, −0.08, 9977.2), (−1.7, −0.02)
    (−281.2, −0.1, 8979.7), (−2.6, −0.02)
    (301.8, −0.06, 9680.7), (−2.0, −0.02)
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

