Imitation-Reinforcement Learning Penetration Strategy for Hypersonic Vehicle in Gliding Phase
Abstract
1. Introduction
- A novel imitation-reinforcement learning method is developed, which unifies the optimization objectives of imitation learning and reinforcement learning through the truncated horizon. By adjusting the truncated horizon length, this method progressively increases the difficulty of learning tasks, providing a new way to enhance the efficiency of reinforcement learning in solving complex problems.
- A result-oriented composite reward function based on the Zero-Effort Miss (ZEM) is designed, which incorporates both terminal rewards and process rewards. The ZEM term of the reward function establishes a direct link between the current penetration situation and the final penetration outcome, enabling a more accurate evaluation of the effectiveness of penetration actions and effectively reducing the difficulty of learning the penetration strategy for hypersonic vehicles (a standard linearized ZEM expression is sketched after this list for reference).
- An intelligent penetration strategy for hypersonic vehicles during the gliding phase, utilizing the THIL-SAC algorithm, is proposed. By employing reward shaping and truncated experience sampling, this approach integrates imitation learning with reinforcement learning, thereby effectively enhancing the convergence rate of penetration strategy learning based on reinforcement learning. Furthermore, this method avoids any need for model simplifications or assumptions, thereby delivering improved penetration performance and greater adaptability.
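For reference, a commonly used linearized form of the zero-effort miss is sketched below. The notation is an assumption for illustration and may differ from the paper's exact formulation: y denotes the relative displacement normal to the line of sight between the hypersonic vehicle and the interceptor, and the time-to-go is estimated from the relative range and closing speed.

```latex
% Standard linearized zero-effort miss (ZEM); notation assumed, not the paper's.
\begin{equation}
  \mathrm{ZEM}(t) = y(t) + \dot{y}(t)\, t_{\mathrm{go}},
  \qquad
  t_{\mathrm{go}} = \frac{\lVert \mathbf{r}_{\mathrm{rel}}(t) \rVert}{\lVert \mathbf{v}_{\mathrm{rel}}(t) \rVert}.
\end{equation}
```

Under this reading, the ZEM is the miss distance that would result if neither vehicle applied further control effort, which is why it links the current engagement state to the final penetration outcome.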
2. Materials and Methods
2.1. Penetration Model of Hypersonic Vehicle in Gliding Phase
2.1.1. Mathematical Model of Hypersonic Vehicle
2.1.2. Mathematical Model of Interceptor
2.2. Design of MDP Model for Hypersonic Vehicle Penetration in Gliding Phase
2.2.1. State Space
2.2.2. Action Space
2.2.3. Reward Function
- Terminal reward term: The objective of hypersonic vehicle penetration is to maximize the line-of-sight normal displacement (ZEM) between the vehicle and the interceptor at the time of miss. Therefore, the reward function includes a terminal reward term granted at the time of miss.
- Process reward term: Since the interceptor uses a near-optimal inverse trajectory for interception, the miss distance (ZEM) is initially at its maximum. As the distance between the hypersonic vehicle and the interceptor decreases, the miss distance typically decreases as well [23,24]. However, if the rate of decrease slows or the miss distance begins to increase, the hypersonic vehicle may eventually succeed in penetrating. A process reward term is therefore introduced to account for the rate of change of the miss distance. If only this rate-of-change term were used as the process reward, the reward obtained from the agent–environment interactions would almost always be negative, which is detrimental to the diversity of training samples. To address this, a “dense penalty + sparse reward” reward function format is adopted, and an additional sparse reward term is introduced. In summary, the reward function for hypersonic vehicle penetration during the gliding phase combines the terminal reward, the process (ZEM-rate) penalty, and the sparse reward term (a minimal sketch of this structure follows this list).
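As a minimal sketch of this “dense penalty + sparse reward” structure, the following Python fragment composes the three terms. The function name, coefficients, and threshold are illustrative assumptions, not the paper's actual values.

```python
def composite_reward(zem, zem_prev, dt, is_miss_time,
                     c_rate=1.0, c_terminal=1.0,
                     sparse_bonus=10.0, rate_threshold=0.0):
    """Illustrative ZEM-based composite reward (all coefficients are assumptions).

    zem          : current zero-effort miss (m)
    zem_prev     : ZEM at the previous step (m)
    dt           : step length (s)
    is_miss_time : True at the time of miss (closest approach)
    """
    zem_rate = (zem - zem_prev) / dt          # rate of change of the ZEM

    # Dense process penalty: negative while the ZEM keeps shrinking.
    r_process = c_rate * zem_rate

    # Sparse reward: granted when the ZEM stops shrinking (or grows),
    # indicating that the evasive maneuver is taking effect.
    r_sparse = sparse_bonus if zem_rate > rate_threshold else 0.0

    # Terminal reward: proportional to the ZEM at the time of miss.
    r_terminal = c_terminal * zem if is_miss_time else 0.0

    return r_process + r_sparse + r_terminal
```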
3. Design of Penetration Strategy Based on Imitation-Reinforcement Learning
3.1. Truncated Horizon Imitation-Reinforcement Learning Model
3.2. THIL-SAC-Based Penetration Strategy
3.2.1. SAC Algorithm
- Policy evaluation: The purpose of policy evaluation is to improve the evaluation accuracy of the Critic network. To fully utilize the available data, increase sampling efficiency, and improve learning stability, the SAC algorithm constructs four identically structured networks: two main Q-networks and two target Q-networks, each with its own set of parameters. The main Q-networks update their parameters by minimizing the soft Bellman error; the main-network parameters are updated by gradient descent on this loss, and the target-network parameters are then soft-updated toward the main-network parameters (the standard SAC forms of these updates are sketched after this list).
- Policy improvement: The purpose of policy improvement is to enhance the agent's policy by updating the policy network parameters. The policy network is updated by minimizing the Kullback–Leibler (KL) divergence between the current policy and the target policy. In this loss, the temperature coefficient is dynamically adjusted based on the state to enhance the randomness of policy exploration, and the policy network parameters are then updated by gradient descent (the standard forms of these updates are also sketched after this list).
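For reference, the standard soft actor-critic forms of these updates are sketched below. The notation is assumed and may differ from the paper's: θ_i are the main Q-network parameters, θ̄_i the target parameters, φ the policy network parameters, α the temperature, H̄ the target entropy, τ the soft-update coefficient, and λ the learning rates.

```latex
% Standard soft actor-critic updates; notation assumed, not the paper's.
\begin{align}
  % Critic (policy evaluation): soft Bellman residual for each main Q-network
  J_Q(\theta_i) &= \mathbb{E}_{(s_t,a_t,r_t,s_{t+1})\sim\mathcal{D}}
      \left[ \tfrac{1}{2}\bigl( Q_{\theta_i}(s_t,a_t) - y_t \bigr)^2 \right], \\
  y_t &= r_t + \gamma \Bigl( \min_{j=1,2} Q_{\bar{\theta}_j}(s_{t+1},a_{t+1})
      - \alpha \log \pi_{\phi}(a_{t+1}\mid s_{t+1}) \Bigr),
      \quad a_{t+1}\sim\pi_{\phi}(\cdot\mid s_{t+1}), \\
  \theta_i &\leftarrow \theta_i - \lambda_Q \nabla_{\theta_i} J_Q(\theta_i),
      \qquad
      \bar{\theta}_i \leftarrow \tau\,\theta_i + (1-\tau)\,\bar{\theta}_i, \\
  % Actor (policy improvement) and temperature adjustment
  J_{\pi}(\phi) &= \mathbb{E}_{s_t\sim\mathcal{D},\,a_t\sim\pi_{\phi}}
      \left[ \alpha \log \pi_{\phi}(a_t\mid s_t) - \min_{j=1,2} Q_{\theta_j}(s_t,a_t) \right], \\
  J(\alpha) &= \mathbb{E}_{a_t\sim\pi_{\phi}}
      \left[ -\alpha \log \pi_{\phi}(a_t\mid s_t) - \alpha\,\bar{\mathcal{H}} \right],
\end{align}
```

with φ and α updated by gradient descent on J_π(φ) and J(α), respectively.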
3.2.2. Expert Experience Imitation Based on Reward Shaping and Sampling
3.2.3. Optimization of the Training Process Based on TH
- Truncated horizon learning framework: Based on the unified imitation-reinforcement learning framework with the truncated horizon established in Section 3.1, the truncated length k can be introduced into the soft state-value and soft action-value functions, so that learning with the truncated value functions is equivalent to learning within a horizon of length k. In the actual training process, the truncated value functions are computed from data sampled from the experience replay buffer, so the time-step range t of the sampled data determines the learning scope. Therefore, compared with the traditional SAC algorithm, the experience replay buffer of the THIL-SAC algorithm stores each transition together with its time step t. A truncated experience replay buffer is then constructed from the original buffer so that training is restricted to transitions within the truncated length k, and the loss functions of the main and target Q-networks are evaluated over this truncated buffer (a minimal sketch of such a buffer follows this list).
- Adaptive adjustment of the truncated length k: The truncated length k determines the learning range and difficulty during penetration strategy training. To achieve low-difficulty, rapid learning based on expert experience in the initial training phase, and to enable exploration of the optimal penetration strategy using reinforcement learning in the later stages, the truncated length k is adjusted as follows. In the THIL-SAC-based penetration strategy network training process, the truncated length is set to a small value at the start of training so that the penetration strategy is learned mainly by imitating the expert. As training progresses, k is gradually increased in each episode until it reaches the length of a single training episode, N. This gradual increase transitions the learning process from imitation learning to reinforcement learning, enabling exploration and optimization of the penetration strategy (a simple schedule of this kind is included in the sketch after this list).
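The following Python sketch illustrates the two mechanisms just described: a replay buffer that stores the time step of each transition and samples only within the truncated horizon, and a simple schedule that grows the truncated length k each episode. The class, function, and parameter names (TruncatedReplayBuffer, truncated_length, k_init, growth) are assumptions for illustration, not the paper's identifiers or values.

```python
import random
from collections import deque

class TruncatedReplayBuffer:
    """Replay buffer storing the episode time step t with each transition,
    so that sampling can be restricted to the truncated horizon k."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done, t):
        # t: time step of this transition within its episode
        self.buffer.append((state, action, reward, next_state, done, t))

    def sample_truncated(self, batch_size, k):
        # Keep only transitions whose time step lies inside the truncated horizon.
        eligible = [tr for tr in self.buffer if tr[5] < k]
        if not eligible:
            raise ValueError("no transitions within the truncated horizon yet")
        batch = random.sample(eligible, min(batch_size, len(eligible)))
        states, actions, rewards, next_states, dones, ts = zip(*batch)
        return states, actions, rewards, next_states, dones, ts

def truncated_length(episode, k_init=5, growth=1, n_max=1000):
    """Adaptive truncated length: small at first (imitation-dominated learning),
    growing each episode until it covers the full episode length N."""
    return min(k_init + growth * episode, n_max)
```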
3.3. Theoretical Analysis of TH and Expert Imitation in Reducing Learning Difficulty
3.4. Training Process of THIL-SAC-Based Penetration Strategy
Algorithm 1: Penetration algorithm based on THIL-SAC.
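As a high-level illustration of how the components described above (SAC updates, expert-based reward shaping, truncated experience sampling, and the adaptive truncated length) could fit together, a training-loop sketch follows. It is not the paper's exact Algorithm 1: the interfaces env, agent, expert, and buffer, as well as the shaping term that penalizes deviation from the expert action, are assumptions.

```python
def train_thil_sac(env, agent, expert, buffer,
                   episodes=1000, n_max=1000, batch_size=256, beta=0.1):
    """Sketch of a THIL-SAC-style training loop (assumed interfaces, not the
    paper's Algorithm 1). `agent` wraps the SAC actor/critics, `expert` is the
    expert penetration strategy, `buffer` is a TruncatedReplayBuffer, and
    `beta` is an assumed weight for the imitation shaping term."""
    for episode in range(episodes):
        k = truncated_length(episode, n_max=n_max)   # adaptive truncated length
        state, done, t = env.reset(), False, 0
        while not done and t < n_max:
            action = agent.act(state)
            next_state, reward, done = env.step(action)

            # Assumed reward-shaping term: penalize deviation from the expert
            # action so that early (short-horizon) training imitates the expert.
            reward -= beta * agent.action_distance(action, expert.act(state))

            buffer.push(state, action, reward, next_state, done, t)
            state, t = next_state, t + 1

            # SAC updates restricted to the truncated horizon k.
            if len(buffer.buffer) >= batch_size:
                batch = buffer.sample_truncated(batch_size, k)
                agent.update_critics(batch)
                agent.update_actor_and_temperature(batch)
                agent.soft_update_targets()
```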
4. Simulation Experiments and Analysis
- The Learning Efficiency of the Penetration Strategy: The SAC algorithm and the proposed algorithm were used to learn penetration strategies under identical conditions, and the learning speeds were compared.
- The Effectiveness of the Penetration Strategy: The trained penetration strategies were applied to a realistic penetration scenario, where the performance of the expert strategy, the SAC-based penetration strategy, and the proposed method were evaluated and compared to demonstrate their effectiveness.
4.1. Simulation Conditions
4.2. Simulation Results and Analysis
4.2.1. Different Reward Functions Comparison
4.2.2. Results of Penetration Strategy Training
4.2.3. Testing Results of the Penetration Strategy in Typical Scenarios
4.2.4. Analysis of the Generalization of the Penetration Strategy Network
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Appendix A
 | 1 | 5 | 10 | 20 |
---|---|---|---|---|
1 | 280 | 410 | 771 | 1000+ |
5 | 200 | 450 | 1000+ | 1000+ |
10 | 350 | 485 | 1000+ | 1000+ |
20 | 400 | 495 | 1000+ | 1000+ |
References
- Lv, C.; Lan, Z.; Ma, T.; Chang, J.; Yu, D. Hypersonic vehicle terminal velocity improvement considering ramjet safety boundary constraint. Aerosp. Sci. Technol. 2024, 144, 108804. [Google Scholar] [CrossRef]
- Zhengxin, T.; Shifeng, Z. A Novel Strategy for Hypersonic Vehicle With Complex Distributed No-Fly Zone Constraints. Int. J. Aerosp. Eng. 2024, 2024, 9004308. [Google Scholar] [CrossRef]
- Fan, L.; Jiajun, X.; Xuhui, L.; Hongkui, B.; Xiansi, T. Hypersonic vehicle trajectory prediction algorithm based on hough transform. Chin. J. Electron. 2021, 30, 918–930. [Google Scholar] [CrossRef]
- Nichols, R.K.; Carter, C.M.; Drew II, J.V.; Farcot, M.; Hood, C.J.P.; Jackson, M.J.; Johnson, P.D.; Joseph, S.; Kahn, S.; Lonstein, W.D.; et al. Progress in Hypersonics Missiles and Space Defense [Slofer]. In Cyber-Human Systems, Space Technologies, and Threats; New Prairie Press: Manhattan, KS, USA, 2023. [Google Scholar]
- Rim, J.W.; Koh, I.S. Survivability simulation of airborne platform with expendable active decoy countering RF missile. IEEE Trans. Aerosp. Electron. Syst. 2019, 56, 196–207. [Google Scholar] [CrossRef]
- Kedarisetty, S.; Shima, T. Sinusoidal Guidance. J. Guid. Control Dyn. 2024, 47, 417–432. [Google Scholar] [CrossRef]
- Zarchan, P. Proportional navigation and weaving targets. J. Guid. Control Dyn. 1995, 18, 969–974. [Google Scholar] [CrossRef]
- Lee, H.I.; Shin, H.S.; Tsourdos, A. Weaving guidance for missile observability enhancement. IFAC-PapersOnLine 2017, 50, 15197–15202. [Google Scholar] [CrossRef]
- Rusnak, I.; Peled-Eitan, L. Guidance law against spiraling target. J. Guid. Control Dyn. 2016, 39, 1694–1696. [Google Scholar] [CrossRef]
- Ma, L. The modeling and simulation of antiship missile terminal maneuver penetration ability. In Proceedings of the 2017 36th Chinese Control Conference (CCC), Dalian, China, 26–28 July 2017; IEEE: New York, NY, USA, 2017; pp. 2622–2626. [Google Scholar]
- Shinar, J.; Steinberg, D. Analysis of optimal evasive maneuvers based on a linearized two-dimensional kinematic model. J. Aircr. 1977, 14, 795–802. [Google Scholar] [CrossRef]
- Ben-Asher, J.; Cliff, E.M.; Kelley, H.J. Optimal evasion with a path-angle constraint and against two pursuers. J. Guid. Control Dyn. 1988, 11, 300–304. [Google Scholar] [CrossRef]
- Shinar, J.; Tabak, R. New results in optimal missile avoidance analysis. J. Guid. Control Dyn. 1994, 17, 897–902. [Google Scholar] [CrossRef]
- Shaferman, V. Near optimal evasion from acceleration estimating pursuers. In Proceedings of the AIAA Guidance, Navigation, and Control Conference, Grapevine, TX, USA, 2017; p. 1014. [Google Scholar]
- Shaferman, V. Near-optimal evasion from pursuers employing modern linear guidance laws. J. Guid. Control Dyn. 2021, 44, 1823–1835. [Google Scholar] [CrossRef]
- Shinar, J. Solution techniques for realistic pursuit-evasion games. In Control and Dynamic Systems; Elsevier: Amsterdam, The Netherlands, 1981; Volume 17, pp. 63–124. [Google Scholar]
- Gutman, S. On optimal guidance for homing missiles. J. Guid. Control 1979, 2, 296–300. [Google Scholar] [CrossRef]
- Segal, A.; Miloh, T. Novel three-dimensional differential game and capture criteria for a bank-to-turn missile. J. Guid. Control Dyn. 1994, 17, 1068–1074. [Google Scholar] [CrossRef]
- Zhang, X.; Guo, H.; Yan, T.; Wang, X.; Sun, W.; Fu, W.; Yan, J. Penetration Strategy for High-Speed Unmanned Aerial Vehicles: A Memory-Based Deep Reinforcement Learning Approach. Drones 2024, 8, 275. [Google Scholar] [CrossRef]
- Li, K.; Wang, Y.; Zhuang, X.; Yin, H.; Liu, X.; Li, H. A penetration method for uav based on distributed reinforcement learning and demonstrations. Drones 2023, 7, 232. [Google Scholar] [CrossRef]
- Li, Y.; Han, W.; Wang, Y. Deep reinforcement learning with application to air confrontation intelligent decision-making of manned/unmanned aerial vehicle cooperative system. IEEE Access 2020, 8, 67887–67898. [Google Scholar] [CrossRef]
- Zhuang, X.; Li, D.; Wang, Y.; Liu, X.; Li, H. Optimization of high-speed fixed-wing UAV penetration strategy based on deep reinforcement learning. Aerosp. Sci. Technol. 2024, 148, 109089. [Google Scholar] [CrossRef]
- Yan, T.; Liu, C.; Gao, M.; Jiang, Z.; Li, T. A Deep Reinforcement Learning-Based Intelligent Maneuvering Strategy for the High-Speed UAV Pursuit-Evasion Game. Drones 2024, 8, 309. [Google Scholar] [CrossRef]
- Guo, Y.; Jiang, Z.; Huang, H.; Fan, H.; Weng, W. Intelligent maneuver strategy for a hypersonic pursuit-evasion game based on deep reinforcement learning. Aerospace 2023, 10, 783. [Google Scholar] [CrossRef]
- Zeming, H.; Zhang, R.; Huifeng, L. Parameterized evasion strategy for hypersonic glide vehicles against two missiles based on reinforcement learning. Chin. J. Aeronaut. 2024, 38, 103173. [Google Scholar]
- Zhao, S.; Zhu, J.; Bao, W.; Li, X.; Sun, H. A Multi-Constraint Guidance and Maneuvering Penetration Strategy via Meta Deep Reinforcement Learning. Drones 2023, 7, 626. [Google Scholar] [CrossRef]
- Fu, X.; Zhu, J.; Wei, Z.; Wang, H.; Li, S. A UAV Pursuit-Evasion Strategy Based on DDPG and Imitation Learning. Int. J. Aerosp. Eng. 2022, 2022, 3139610. [Google Scholar] [CrossRef]
- He, L.; Aouf, N.; Whidborne, J.F.; Song, B. Deep reinforcement learning based local planner for UAV obstacle avoidance using demonstration data. arXiv 2020, arXiv:2008.02521. [Google Scholar]
- Jiang, S.; Ge, Y.; Yang, X.; Yang, W.; Cui, H. UAV Control Method Combining Reptile Meta-Reinforcement Learning and Generative Adversarial Imitation Learning. Future Internet 2024, 16, 105. [Google Scholar] [CrossRef]
- Wang, X.; Gu, K. A Penetration Strategy Combining Deep Reinforcement Learning and Imitation Learning. J. Astronaut. 2023, 44, 914. [Google Scholar]
- Wu, T.; Wang, H.; Liu, Y.; Li, T.; Yu, Y. Learning-based interfered fluid avoidance guidance for hypersonic reentry vehicles with multiple constraints. ISA Trans. 2023, 139, 291–307. [Google Scholar] [CrossRef]
- Guo, H. Penetration Game Strategy for Hypersonic Vehicles. Ph.D. Thesis, Northwestern Polytechnical University, Xi’an, China, 2018. [Google Scholar]
- Ross, S.; Gordon, G.; Bagnell, D. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, Ft. Lauderdale, FL, USA, 11–13 April 2011; JMLR Workshop and Conference Proceedings. pp. 627–635. [Google Scholar]
- Ng, A.Y.; Harada, D.; Russell, S. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the International Conference on Machine Learning (ICML); Morgan Kaufmann: San Francisco, CA, USA, 1999; Volume 99, pp. 278–287. [Google Scholar]
- Sun, W.; Bagnell, J.A.; Boots, B. Truncated horizon policy search: Combining reinforcement learning & imitation learning. arXiv 2018, arXiv:1805.11240. [Google Scholar]
- Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; PMLR: Cambridge, MA, USA, 2018; pp. 1861–1870. [Google Scholar]
- Jiang, N.; Agarwal, A. Open problem: The dependence of sample complexity lower bounds on planning horizon. In Proceedings of the Conference on Learning Theory, Stockholm, Sweden, 6–9 July 2018; PMLR: Cambridge, MA, USA, 2018; pp. 3395–3398. [Google Scholar]
- Wang, R.; Du, S.S.; Yang, L.F.; Kakade, S.M. Is long horizon reinforcement learning more difficult than short horizon reinforcement learning? arXiv 2020, arXiv:2005.00527. [Google Scholar]
- Kakade, S.M. On the Sample Complexity of Reinforcement Learning; University of London, University College London: London, UK, 2003. [Google Scholar]
- Zipser, D. Subgrouping reduces complexity and speeds up learning in recurrent networks. Adv. Neural Inf. Process. Syst. 1989, 2, 638–641. [Google Scholar] [CrossRef]
- Schulman, J.; Moritz, P.; Levine, S.; Jordan, M.; Abbeel, P. High-dimensional continuous control using generalized advantage estimation. arXiv 2015, arXiv:1506.02438. [Google Scholar]
- Nair, A.; McGrew, B.; Andrychowicz, M.; Zaremba, W.; Abbeel, P. Overcoming exploration in reinforcement learning with demonstrations. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 21–25 May 2018; IEEE: New York, NY, USA, 2018; pp. 6292–6299. [Google Scholar]
- Fujimoto, S.; Meger, D.; Precup, D. Off-policy deep reinforcement learning without exploration. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 10–15 June 2019; PMLR: Cambridge, MA, USA, 2019; pp. 2052–2062. [Google Scholar]
- Wang, C.; Ross, K. Boosting soft actor-critic: Emphasizing recent experience without forgetting the past. arXiv 2019, arXiv:1906.04009. [Google Scholar]
Parameters | Hypersonic Vehicle | Interceptor |
---|---|---|
Longitude (°) | 148.95 | 149.30 |
Latitude (°) | 23.72 | 23.99 |
Altitude (m) | 26997.33 | 29967.40 |
X-direction speed (m/s) | −2165.21 | 1404.25 |
Y-direction speed (m/s) | 42.11 | −153.45 |
Z-direction speed (m/s) | −366.06 | −48.79 |
Parameters | Value |
---|---|
Policy network learning rate | 0.0003 |
Q-network learning rate | 0.0003 |
Replay buffer size | 1e5 |
Batch size | 256 |
Discount factor | 0.99 |
Temperature coefficient | 0.12 |
Soft update coefficient | 0.01 |
Parameters of reward function |
Parameters of truncated length |
Strategy | ZEM at the Miss Time |
---|---|
Expert strategy | 763.76 m |
SAC-based strategy | 972.34 m |
THIL-SAC-based strategy | 1008.76 m |
Parameters | Standard Deviation |
---|---|
Initial position deviation along the X-axis | 2500 m |
Initial position deviation along the Y-axis | 2500 m |
Initial position deviation along the Z-axis | 2500 m |
Initial velocity deviation along the X-axis | 25 m/s |
Initial velocity deviation along the Y-axis | 25 m/s |
Initial velocity deviation along the Z-axis | 25 m/s |