Penetration Strategy for High-Speed Unmanned Aerial Vehicles: A Memory-Based Deep Reinforcement Learning Approach

Abstract: With the development and strengthening of interception measures, the traditional penetration methods of high-speed unmanned aerial vehicles (UAVs) can no longer meet the penetration requirements in diversified and complex combat scenarios. Owing to the advancement of Artificial Intelligence technology in recent years, intelligent penetration methods have gradually become promising solutions. In this paper, a penetration strategy for high-speed UAVs based on improved Deep Reinforcement Learning (DRL) is proposed, in which Long Short-Term Memory (LSTM) networks are incorporated into a classical Soft Actor-Critic (SAC) algorithm. A three-dimensional (3D) engagement scenario of a high-speed UAV facing two interceptors with strong maneuverability is constructed. In the proposed LSTM-SAC approach, the reward function is designed based on the criteria for successful penetration, taking into account energy and flight range constraints. Then, an intelligent penetration strategy is obtained by extensive training, which uses the motion states of both sides to make decisions and generate the penetration overload commands for the high-speed UAV. The simulation results show that, compared with the classical SAC algorithm, the proposed algorithm improves training efficiency, reducing the number of training episodes by 75.56%. Meanwhile, the LSTM-SAC approach achieves a successful penetration rate of more than 90% in hypothetical complex scenarios, a 40% average increase compared with the conventional programmed penetration methods.


Introduction
With the great development of hypersonic-related technologies, High Supersonic Unmanned Vehicles (HSUAVs) have become a major threat in future wars. This will stimulate the continuous development and innovation of defense systems against HSUAVs. Correspondingly, the interception scenarios against HSUAVs are becoming more complicated, and the capabilities of interceptors are improving dramatically. The traditional penetration strategies of HSUAVs are becoming increasingly inadequate for penetration tasks in complex interception scenarios. Consequently, the penetration strategy of high-speed aircraft is gradually being regarded as a significant and challenging research topic. Encouragingly, Artificial-Intelligence-based penetration methods have become promising candidate solutions.
In early research, penetration strategies with maneuvers programmed prelaunch were designed and extensively applied in engineering, such as the sinusoidal maneuver [1], the spiral maneuver [2,3], the jump-dive maneuver [4], the S maneuver [5], and the weaving maneuver [6,7], but their penetration efficiency was very limited. To improve penetration effectiveness, penetration guidance laws based on modern control theory have been widely studied. The authors in [8] presented optimal-control-based evasion and pursuit strategies, which can effectively improve survival probability under maneuverability constraints in one-on-one scenarios. The differential game penetration guidance strategy was derived in [9], and neural networks were introduced in [10] to optimize the strategy for the pursuit-evasion problem. Improved differential game guidance laws were derived in [11,12], and the desired simulation results were obtained. In [13,14], a guidance law for a more sophisticated target-missile-defender (TMD) scenario was derived with control saturation, which could ensure that the missile escaped from the defender. These solutions are mainly based on linearized models, resulting in a loss of accuracy.
The inadequacy of existing strategies and the significant advantages of Artificial Intelligence motivate us to solve the penetration problem in complex scenarios with intelligent algorithms. To the best of the authors' knowledge, in most of the research on penetration strategies based on modern control theory, the interception scenario was simplified to a two-dimensional (2D) plane in which only one interceptor was considered. However, in practical situations, the opponent usually launches at least two interceptors to deal with high-speed targets. Therefore, the performance of these penetration strategies would degrade markedly in practical applications.
In recent years, intelligent algorithms have been widely developed due to their excellent adaptability and learning ability [15]. Deep Reinforcement Learning (DRL) was proposed by DeepMind in 2013 [16]. DRL combines the advantages of Deep Learning (DL) and Reinforcement Learning (RL), with powerful decision-making and situational awareness capabilities [17]. Guidance law designs based on DRL were proposed in [18-20], and the new methods were shown to be superior to the traditional proportional navigation guidance (PNG). DRL has been applied in many studies, such as collision avoidance methods for Urban Air Mobility (UAM) vehicles [21] and air-to-air combat [22]. In the field of high-speed aircraft penetration or interception, a pursuit-evasion game algorithm was designed using PID control and the Deep Deterministic Policy Gradient (DDPG) algorithm in [23], and it achieved better results than PID control alone. Intelligent maneuver strategies using DRL algorithms were proposed in [24,25] to solve the problem of one-to-one midcourse penetration of an aircraft, achieving higher penetration win rates than traditional methods. In [26,27], the traditional deep Q-network (DQN) was improved into a dueling double deep Q-network (D3Q) and a double deep Q-network (DDQN), respectively, to solve the problem of attack-defense games between aircraft. In [28], the GAIL-PPO method, which combines Proximal Policy Optimization (PPO) and imitation learning, was proposed. Compared with classic DRL, the improved algorithms provided new approaches for penetration methods.
Based on a deep understanding of the penetration process, high-speed UAVs can only make correct decisions by predicting the intention of the interceptors. Therefore, algorithms are required to process state information over a period of time, which current DRL cannot do because of the sampling method of the replay buffer. To address this problem, a Recurrent Neural Network (RNN), which specializes in processing time-series data, is incorporated into the classic DRL. Long Short-Term Memory (LSTM) networks [29] are most often used as memory modules because of their stable and efficient performance and excellent memory capabilities. The classic DRL algorithms were extended with LSTM networks in [30,31], which demonstrated better performance than classic DRL on Partially Observable MDPs (POMDPs) and showed the clear advantages of memory modules in POMDPs. The LSTM-DDPG approach was proposed in [32] to solve the problem of sensory data collection by UAVs, and the numerical results show that LSTM-DDPG could reduce packet loss more than classic DDPG.
The above research demonstrates the advantages of DRL algorithms with memory modules, but such algorithms have not yet been applied to the penetration strategy of high-speed UAVs. The Soft Actor-Critic (SAC) algorithm [33] was proposed in 2018 and has been widely applied owing to its strong exploration capability. Therefore, LSTM networks are incorporated into SAC to explore the application of LSTM-SAC in the penetration scenario of a high-speed UAV escaping from two interceptors. The main contributions of this paper can be summarized as follows:
(a) A more complex 3D space engagement scenario is constructed, where the UAV faces two interceptors that surround it, which dramatically increases the penetration difficulty.
(b) A penetration strategy based on DRL is proposed, in which a reward function is designed to enable high-speed aircraft to evade interceptors with low energy consumption and minimum deviation, resulting in a more stable effect than conventional strategies.
(c) A novel memory-based DRL approach, LSTM-SAC, is developed by combining the LSTM network with the SAC algorithm, which can effectively make optimal decisions on input temporal data and significantly improve training efficiency compared with classic SAC.
The rest of the paper is organized as follows: The engagement scenario and basic assumptions are described in Chapter 2. The framework of LSTM-SAC is introduced in Chapter 3, and the state space, action space, and reward function are designed based on the MDP and the engagement scenario. In Chapter 4, the algorithm is trained and validated. Finally, the simulation results are analyzed and summarized.

Engagement Scenario
The combat scenario modeling and the kinematic and dynamic analyses are formulated in this chapter. To simplify the scenario, some basic assumptions are made as follows:

Assumption 1: Both the high-speed aircraft and the interceptors are described by point-mass models. The 3-DOF particle model is established in the ground coordinate system, where $x, y, z$ are the coordinates of the vehicle and $n_x, n_y, n_z$ denote the projections of the overload vector on each axis of the velocity direction. Specifically, overload is the ratio of the force exerted on the aircraft to its own weight; it is a dimensionless variable used to describe the acceleration state and the maneuverability of the aircraft:

$$n_x = \frac{N_x}{G}, \quad n_y = \frac{N_y}{G}, \quad n_z = \frac{N_z}{G}$$

where $N_i$ represents the projection of the external force on the $i$-axis of the velocity direction, and $G$ represents the gravity of the aircraft. $n_x$ is the tangential overload, which indicates the ability of the aircraft to change the magnitude of its velocity, and $n_y, n_z$ are the normal overloads, which represent the ability of the aircraft to change its flight direction in the plumb plane and the horizontal plane, respectively.

Assumption 2: The high-speed UAV cruises at a constant velocity toward the ultimate attack target when encountering two interceptors. The enemy is able to detect the high-speed UAV at a range of 400 km and launches two interceptors from two different launch positions against one target. The onboard radar of the aircraft starts to work at a range of 50 km relative to the interceptor and then detects the position information of the interceptor missile.

Assumption 3: The enemy is capable of recognizing the ultimate attack target of the high-speed aircraft. The guidance law of the interceptors does not switch during the entire interception process. The interceptors adopt the proportional navigation guidance (PNG) law with varying navigation gain, where $a_I^C$ denotes the acceleration command of the interceptor, and $r_{HI}$ and $q_{HI}$ denote the relative distance and the line-of-sight (LOS) angle between the interceptor and the high-speed aircraft, respectively.
The above assumptions are widely used in the design of maneuvering strategies of aircraft, which can simplify the calculation process while providing accurate approximations.
In 3D space, it is difficult to directly obtain the analytical relationship between the LOS and the flight-path angle. The motion of an aircraft in three-dimensional space can be simplified as a combination of motions in the horizontal and plumb planes, which makes the motion characteristics of the aircraft easier to analyze and understand. Therefore, for the convenience of engineering applications, the 3D motion is projected onto two 2D planes (horizontal and plumb). Because the two interceptors surround the high-speed aircraft in the horizontal plane rather than the plumb plane, it is more difficult to escape in the horizontal plane, so maneuvers in the horizontal plane should be considered first. In addition, because of the limits of engine technology, maneuvering is primarily conducted in the horizontal plane.
Taking the horizontal plane as an example, the engagement between two interceptors and one high-speed aircraft is considered. As in Figure 1, X-O-Z is a Cartesian inertial reference frame. The notations $I_1, I_2$ and $H$ denote the two interceptors and the aircraft, respectively, and the variables with subscripts $I_1, I_2$ and $H$ represent the variables of the two interceptors and the high-speed aircraft. The velocity and acceleration are denoted by $V$ and $a$, respectively. The notations $q$, $\phi$, and $\varphi$ are the LOS angle, flight-path azimuth angle, and lead angle. The notation $r$ is the relative distance between the interceptor and the high-speed aircraft. The relative motion equations are written for interceptor I and the high-speed aircraft as an example.
To make the simulation model more realistic, the characteristics of an onboard autopilot for both the interceptors and the high-speed aircraft are incorporated. A first-order lateral maneuver dynamic is assumed, where $\tau_T$ is the time constant of the target dynamics, and $n^C$ denotes the overload command.
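The first-order autopilot dynamic mentioned above can be illustrated with a short sketch. It assumes the common first-order lag form, in which the achieved overload follows the command with rate (command minus current value) divided by the time constant; this form, the function names, and the numerical values are assumptions for illustration, not the paper's exact equation or settings.

```python
def lag_step(n, n_cmd, tau, dt):
    """Advance the assumed first-order lag dn/dt = (n_cmd - n) / tau by one Euler step."""
    return n + dt * (n_cmd - n) / tau


def simulate_lag(n_cmd, tau, dt, steps, n0=0.0):
    """Return the overload history produced by a constant overload command."""
    history = [n0]
    for _ in range(steps):
        history.append(lag_step(history[-1], n_cmd, tau, dt))
    return history


# Hypothetical values: 3 g command, 0.2 s time constant, 0.01 s step, 2 s horizon.
response = simulate_lag(n_cmd=3.0, tau=0.2, dt=0.01, steps=200)
```

With these hypothetical values the achieved overload rises smoothly toward the command and reaches roughly two thirds of it after one time constant, which is the delay an agent or guidance law has to anticipate.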

Model of MDP
The optimal decision model is obtained with the DRL algorithm by interacting with an environment without data labels. As in Figure 2, the agent obtains the state information from the environment at each step, computes the action, and outputs it back to the environment. After obtaining the data, the agent updates its strategy by learning to maximize the total reward value. The Markov Decision Process (MDP), which is a memoryless random process, formalizes the environment of RL [34].

An MDP consists of five elements $[S, A, P, R, \gamma]$, where $S$ is a finite set of states, $A$ is a finite set of actions, $P$ is a state transition probability matrix, $R$ is a reward function used to score the decisions of the agent, and $\gamma \in [0, 1]$ is a discount factor. The total return over the complete MDP process is obtained by weighted accumulation of $r$ with $\gamma$. The training of DRL is the process that maximizes the total return:

$$G_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}$$

The learning process of the MDP is to maximize the rewards by learning strategies during the interaction with the environment.
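The weighted accumulation of rewards with the discount factor can be made concrete with a minimal sketch. The function name and the sample values are illustrative only.

```python
def discounted_return(rewards, gamma):
    """Compute sum_k gamma**k * rewards[k] by accumulating from the last reward backward."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Three unit rewards with gamma = 0.5: 1 + 0.5 + 0.25
example_return = discounted_return([1.0, 1.0, 1.0], 0.5)
```

A smaller discount factor makes the agent value immediate rewards more, while a value close to 1 weights the terminal outcome of an episode almost as heavily as the stepwise rewards.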

Design of State Space and Action Space
During the process of penetration, the state and action of the high-speed aircraft are continuous. The construction of the state space is based on a correct analysis of the combat scenarios and missions, and the state space should provide all the information required during the penetration process. The high-speed aircraft's own state information and the relative motion information are selected as the state variables. Among them, ConsumeFuel denotes the aircraft's fuel usage, which is included to constrain the energy consumption of the aircraft during the penetration process. During training, it is represented by the overload $n_H$ of the aircraft at each step.
The relative position components between the high-speed aircraft and the $i$-th interceptor along the $x, y, z$ axes are computed from the coordinates of the two vehicles and are used to indicate the motion state of the offense and defense. $\theta$ and $\phi$ denote the trajectory angle and inclination angle of the aircraft, respectively, which describe the flight status of the aircraft. $\Delta r$ denotes the distance between the aircraft and the target. The above state variables have different dimensions and scales, and using them directly can make the learning process unstable or even prevent training from converging. The features with different units were therefore manually scaled into dimensionless quantities before entering the DRL network. The processed state variables all lie between −1 and 1, so that each state feature is treated fairly by the algorithm. Normalization helps to stabilize gradients, improve training efficiency, and improve the generalization ability of the algorithm. The processing method is shown in Table 1.
In Table 1, $n_{\max}$ denotes the maximum overload of the aircraft, and the initial-moment quantities, including $\Delta r_0$, denote the relative positions and relative distance at the initial moment, respectively.
The overload commands of the high-speed aircraft are selected as the action space of the agent. In this case, the intelligent agent makes maneuvering decisions completely autonomously throughout the entire process. We design the action space as follows:

$$A = [n_y^C, n_z^C]$$

where $n_y^C$ and $n_z^C$ are the longitudinal overload and normal overload commands generated by the agent, which control the flight direction of the aircraft in the plumb and horizontal planes, respectively. Since the UAV is assumed to fly at a constant velocity during the penetration process, $n_x$, which controls the magnitude of the velocity, is not included in the action space. It is worth noting that the action directly output by the network lies between −1 and 1, so it needs to be multiplied by the maximum overload $n_{\max}$ before being output to the environment.
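The scaling steps described above can be sketched as follows. The sketch assumes that each state variable is divided by a reference magnitude (for example, its initial or maximum value) and clipped to [−1, 1]; the actual processing rules are those of Table 1, so the function names and the clipping are assumptions for illustration.

```python
def normalize(value, scale):
    """Scale a raw state variable by a reference magnitude and clip it to [-1, 1]."""
    return max(-1.0, min(1.0, value / scale))


def scale_action(raw_action, n_max):
    """Map network outputs in [-1, 1] to overload commands in [-n_max, n_max]."""
    return [n_max * a for a in raw_action]


# Hypothetical values: a 6 g overload limit and a network output of (0.5, -1.0).
commands = scale_action([0.5, -1.0], n_max=6.0)
```

Keeping the network output in [−1, 1] and applying the physical limit outside the network means the same policy architecture can be reused if the overload limit changes.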

Design of Reward Function
The reward function maps the state information into reinforcement signals, reflecting the understanding of the task logic. It is used to evaluate the quality of actions and determines whether the agent can learn the required strategy. The reward function is designed as follows:

$$reward = reward\_stage + reward\_end$$

where $reward\_stage$ is used to evaluate the effectiveness of the actions taken by the agent in each simulation step, and $reward\_end$ is used to evaluate whether the aircraft has evaded all interceptors. The sum of the two constitutes the reward function.
$reward\_stage$ consists of two terms. $reward_1$ represents the penalty for ConsumeFuel, which is used to constrain the energy consumption during the penetration process. $reward_2$ is used to guide the agent to keep flying toward the target, which helps to keep the agent from deviating from the preset trajectory after successful penetration.
$reward\_end$ depends on the miss distance, where $r$ represents the miss distance of the aircraft, and $r_d$ represents the killing radius of the interceptor. When the miss distance is larger than the killing radius, the penetration task is successful.
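The structure of the reward described above can be sketched in code. Only the overall structure (a stepwise term penalizing overload and deviation, plus a terminal term decided by the miss distance and the killing radius) follows the text; the functional forms, the weights `k1` and `k2`, the `heading_error` input, and the bonus and penalty values are hypothetical choices for illustration.

```python
def stage_reward(n_h, n_max, heading_error, k1, k2):
    """Stepwise reward: penalize overload usage (fuel proxy) and deviation from the target direction."""
    fuel_penalty = -k1 * abs(n_h) / n_max          # assumed form of reward_1
    heading_penalty = -k2 * abs(heading_error)     # assumed form of reward_2
    return fuel_penalty + heading_penalty


def end_reward(miss_distance, kill_radius, bonus, penalty):
    """Terminal reward: success if the miss distance exceeds the killing radius."""
    return bonus if miss_distance > kill_radius else -penalty


def total_reward(stage, end):
    """The overall reward is the sum of the stage and terminal rewards."""
    return stage + end
```

Making the fuel weight much larger than the range weight, as the paper reports for its constants, pushes the agent toward short, economical maneuvers, which is also why a balance with the terminal term is needed.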

Model of SAC
The SAC algorithm is based on the Actor-Critic (AC) framework. The AC framework contains two Deep Neural Networks (DNNs), which are used to fit the Actor network and the Critic network. The AC framework, shown in Figure 3, separates the process of evaluating the action value from that of updating the strategy. The Actor network is a policy-based method responsible for making and updating decisions. The Critic network is a value-based method responsible for estimating the value function $V(s)$.
SAC is an off-policy DRL algorithm based on the maximum entropy reinforcement learning framework. Entropy is a measure of uncertainty in probability distributions:

$$H(P) = \mathbb{E}_{x \sim P}\left[-\log P(x)\right]$$

The learning process of the SAC algorithm is to obtain a policy that maximizes both its cumulative reward and the entropy of each action:

$$\pi^{*} = \arg\max_{\pi} \mathbb{E}_{\pi}\left[\sum_{t} r(s_t, a_t) + \alpha H\big(\pi(\cdot \mid s_t)\big)\right]$$

where $\alpha$ is the entropy temperature coefficient, which sets the degree of randomness in the strategy. A larger $\alpha$ makes the strategy more random; that is, the probabilities of the output actions are as dispersed as possible.
The action-value function $Q^{\pi}$ containing $\alpha$ and, similarly, the state-value function $V^{\pi}$ are defined under this objective. The relationship between $V^{\pi}$ and $Q^{\pi}$ can be written as

$$V^{\pi}(s) = \mathbb{E}_{a \sim \pi}\left[Q^{\pi}(s, a) - \alpha \log \pi(a \mid s)\right]$$

The SAC algorithm is applicable to both continuous and discrete action spaces, so it is widely used in complex missions because of its excellent exploration ability and robustness. The complete summary of the SAC algorithm is shown in Algorithm 1.
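The relationship between the soft state value and the soft action value can be checked numerically on a small discrete example. The sketch below uses the standard maximum-entropy relation, in which the state value equals the policy-weighted action value plus the temperature times the policy entropy; the discrete setting, the function names, and the numbers are illustrative assumptions.

```python
import math


def entropy(probs):
    """Shannon entropy -sum p * log p of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def soft_state_value(q_values, probs, alpha):
    """V(s) = sum_a pi(a|s) * (Q(s, a) - alpha * log pi(a|s))."""
    return sum(p * (q - alpha * math.log(p))
               for q, p in zip(q_values, probs) if p > 0.0)


# Hypothetical two-action example with a uniform policy.
q = [1.0, 3.0]
pi = [0.5, 0.5]
v = soft_state_value(q, pi, alpha=0.2)
expected_q = sum(p * qa for p, qa in zip(pi, q))
```

In this example `v` equals `expected_q + 0.2 * entropy(pi)`, which shows how the temperature turns policy randomness into additional value.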

Algorithm 1. SAC algorithm.
1: Initialize the value function and target value function with parameter vectors $\psi, \bar{\psi}$
2: Initialize the soft Q-function and policy network with parameter vectors $\theta, \phi$
3: Initialize the experience buffer $D$
4: for each iteration do
5: for each environment step do
6: Generate the action from the actor network based on the current state
7: ...
17: Update the soft Q-function parameters $\theta_i$ for $i \in \{1, 2\}$
18: Update the policy network
19: Adjust the temperature
20: Update the target value critic network
end for

The traditional SAC algorithm is based on the sampling method of the replay buffer and on fully connected layers, which can only fit the policy and value function based on the inputs at the current moment. However, the states are closely related in time during the penetration process, and the decision commands output by the algorithm also depend on the states at the preceding and following moments. Therefore, the traditional SAC algorithm is improved here so that it can make accurate decisions on time-series data.

Architecture of LSTM-SAC
The LSTM network is an improvement on the RNN that excels at handling time-series data with long-term dependencies. LSTM can selectively store information through its ability to remember and forget, as shown in Figure 4. Each LSTM unit consists of an input gate, an output gate, and a forget gate. In the LSTM calculation, $\sigma$ and $\tanh$ represent the sigmoid function and the hyperbolic tangent function, respectively, and $f_t$ represents the output of the forget gate, which measures how much of the previous information is forgotten.
The traditional SAC algorithm is improved by incorporating the LSTM network. An AC framework with an LSTM layer and a fully connected layer is created, in which the representation layers of the LSTM learn useful information from the previous states and adjust the weights and biases to assist decision making based on the SAC algorithm. In this way, LSTM networks can predict future state information based on the time-series information before a certain time $t$. The architecture of the LSTM-SAC is shown in Figure 5.
Because of the temporal processing characteristics of the LSTM network, the input of the algorithm is no longer the state values at a single simulation step but a sequence of time-continuous state data. Historical data of any length can be processed. In this paper, $n$ consecutive states are selected as a set of inputs for LSTM-SAC after comparative experiments, which requires the replay buffer to implement random sampling of continuous batches. Similarly, the actor network obtains actions based on $n$ consecutive states. The LSTM extracts effective features, makes predictions based on the continuous input states, and finally passes them to the SAC. In this way, the agent gains the ability to memorize, and the efficiency of the algorithm training is improved.
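The random continuous batch sampling required of the replay buffer can be sketched as below. This is a minimal illustration, not the paper's implementation: the class name, the transition layout `(state, action, reward, next_state, done)`, and the rule that a window must not cross an episode boundary are assumptions.

```python
import random


class SequenceReplayBuffer:
    """Stores transitions in time order and samples windows of consecutive steps."""

    def __init__(self, capacity, seq_len):
        self.capacity = capacity
        self.seq_len = seq_len
        self.storage = []  # each item: (state, action, reward, next_state, done)

    def add(self, transition):
        self.storage.append(transition)
        if len(self.storage) > self.capacity:
            self.storage.pop(0)  # drop the oldest transition

    def _valid_start(self, start):
        # Assumed rule: only the last step of a window may be terminal,
        # so a window never mixes steps from two different episodes.
        window = self.storage[start:start + self.seq_len]
        return not any(step[4] for step in window[:-1])

    def sample(self, batch_size, rng=random):
        last_start = len(self.storage) - self.seq_len
        starts = [s for s in range(last_start + 1) if self._valid_start(s)]
        if not starts:
            raise ValueError("not enough consecutive transitions to sample from")
        return [self.storage[s:s + self.seq_len]
                for s in rng.choices(starts, k=batch_size)]
```

Each sampled window would then be stacked into a sequence tensor and passed through the LSTM layer before the fully connected layers; a window length of three corresponds to the LSTM-SAC(3) setting reported later.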

Simulation Parameter Settings
In this section, the LSTM-SAC agent is used to generate penetration strategy commands for a high-speed UAV. Only head-on engagement scenarios are considered because of the speed limitation of the interceptors.
The experiments were implemented in a Python 3.11, PyTorch 2.0.1, and CUDA 11.7 environment with the simulation step size set to 0.01 s, and the fourth-order Runge-Kutta (RK4) method was used for ballistic calculations in the engagement scenario. The enemy's launch sites are distributed on both sides of the flight trajectory of the high-speed aircraft. Therefore, after the enemy detects the aircraft, two interceptors are launched and form a perimeter to intercept it. The maneuverability of the interceptors is higher than that of the aircraft, which makes the escape of the aircraft more difficult.
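The fourth-order Runge-Kutta integration used for the ballistic calculations can be sketched as a generic single-step routine. The function below is a textbook RK4 step for a state given as a list of floats; the name `rk4_step` and the calling convention are illustrative, and the paper's own equations of motion would be supplied as `f`.

```python
def rk4_step(f, t, y, dt):
    """Advance dy/dt = f(t, y) by one classical fourth-order Runge-Kutta step.

    f takes (t, y) and returns the derivative as a list the same length as y.
    """
    k1 = f(t, y)
    k2 = f(t + dt / 2.0, [yi + dt / 2.0 * ki for yi, ki in zip(y, k1)])
    k3 = f(t + dt / 2.0, [yi + dt / 2.0 * ki for yi, ki in zip(y, k2)])
    k4 = f(t + dt, [yi + dt * ki for yi, ki in zip(y, k3)])
    return [yi + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
            for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]


# Usage with the paper's 0.01 s step on a simple test equation dy/dt = -y.
state, time = [1.0], 0.0
for _ in range(100):
    state = rk4_step(lambda t, y: [-y[0]], time, state, 0.01)
    time += 0.01
```

After integrating the test equation over one second, `state[0]` agrees with exp(−1) to high precision, which is a convenient sanity check before plugging in the vehicle dynamics.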
To conform to real engagement scenarios and prevent overfitting, the specific launch locations of the interceptors are randomly generated in a circle with a radius of 3 km, centered on the enemy launch center site. That is, the initial states of the interceptors are different in each episode of training. The simulation parameter settings are shown in Table 2.

Table 2. Simulation parameter settings.
Parameter | Value
Enemy launch position center site (x, y, z)/km | (0, 0, 30), (0, 0, −30)
High-speed aircraft initial location (x, y, z)/km | …

The Tanh function is chosen as the activation function for the output layer of the actor network, which limits the output of the actor network to between −1 and 1. The Tanh function is defined as

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

The constant parameters used to set the reward function in the simulation scenario of LSTM-SAC training are shown in Table 4. Within $reward\_stage$, the energy limitation has the highest proportion, which aims to minimize the energy consumption during the penetration process without affecting the subsequent strike mission. In addition, under the energy limitation, the degree of deviation of the vehicle from the predetermined route during the escape process is already reduced. Therefore, the range constraint $k_2$ has a much lower share of $reward\_stage$.
The termination condition for a single training process is based on the relative distance: the penetration process is considered to have ended when the relative distance begins to increase. The miss distance is used to determine the result of a single training process. If the miss distance is larger than the killing radius of the interceptor, the penetration is successful.
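The termination rule and the success criterion described above can be sketched on a recorded history of relative distances. The sketch assumes the miss distance is taken as the last distance before the range starts to grow; the function names and the sample values are illustrative, and the 10 m radius is the interception threshold quoted in the verification results.

```python
def closest_approach(distances):
    """Return (step index, miss distance) at the first step after which the range increases.

    `distances` is a non-empty list of relative distances sampled at each simulation step.
    """
    for k in range(1, len(distances)):
        if distances[k] > distances[k - 1]:
            return k - 1, distances[k - 1]
    return len(distances) - 1, distances[-1]


def penetration_successful(miss_distance, kill_radius):
    """The penetration succeeds when the miss distance exceeds the killing radius."""
    return miss_distance > kill_radius


# Hypothetical range history in meters.
step, miss = closest_approach([100.0, 50.0, 12.0, 30.0])
success = penetration_successful(miss, kill_radius=10.0)
```

With two interceptors, the same check would be applied to each interceptor separately, and the episode counts as a success only if both miss distances exceed the killing radius.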
The selection of hyperparameters is important for the DRL training process. Through continuous adjustment during training, hyperparameter settings suitable for LSTM-SAC were obtained. The hyperparameter settings are shown in Table 5.

Training Results
The penetration process of the high-speed aircraft is trained with DRL algorithms. To verify the effectiveness of the LSTM network, the classical SAC algorithm is compared with the LSTM-SAC approach. The fully connected network layers of the classic SAC algorithm are the same as those of the LSTM-SAC, and the activation function is the ReLU function. States with different historical lengths ($l = 1$ to $5$) are input into the LSTM-SAC to verify the ability of the LSTM networks to handle temporal dependencies. Meanwhile, the advanced Proximal Policy Optimization (PPO) algorithm is compared with the LSTM-SAC and SAC. The simulation parameters and hyperparameter settings are the same for all three.
The PPO algorithm was proposed in 2017 [35]. Based on the policy gradient method and the AC framework, the PPO algorithm can be efficiently applied to continuous state and action spaces. A strategy update mechanism called 'Clipping' is introduced in PPO, which limits the magnitude of policy updates and thus avoids the training instability caused by large updates. The objective function of the PPO algorithm is

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}\big(r_t(\theta), 1-\varepsilon, 1+\varepsilon\big)\hat{A}_t\right)\right]$$

where $r_t(\theta)$ is the ratio of the new strategy to the old strategy, and $\hat{A}_t$ is the advantage estimation function. $\mathrm{clip}$ and $\varepsilon$ are the truncation function and truncation constant, respectively, which limit the ratio of the new and old strategies to between $1-\varepsilon$ and $1+\varepsilon$ to ensure that the gap between the two strategies is not too large.
The variation in the average rewards is shown in Figure 6. As all algorithms have converged, only the average rewards for the first 600 episodes are shown for clarity. The value of the reward function reflects the degree to which the agent has learned the ideal strategy, and its growth indicates that the agent is able to continuously learn and improve its strategy during training. It can be seen that the PPO algorithm performs well in the first 200 episodes of training, but the reward shows a gradual downward trend after convergence. This is because the algorithm eventually converges to an aggressive strategy: the agent attempts to escape with minimal energy consumption. The highest rewards are received when the escape is successful, but this strategy makes the vehicle easy to intercept. Therefore, the SAC and the LSTM-SAC perform better in this scenario.
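The clipping mechanism of the PPO objective can be illustrated with a small numerical sketch. It evaluates the clipped surrogate for given probability ratios and advantage estimates; the function name and the sample numbers are illustrative.

```python
def clipped_surrogate(ratios, advantages, eps):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over a batch."""
    total = 0.0
    for r, a in zip(ratios, advantages):
        clipped_ratio = max(1.0 - eps, min(1.0 + eps, r))
        total += min(r * a, clipped_ratio * a)
    return total / len(ratios)


# A ratio of 1.5 with a positive advantage is capped at 1 + eps = 1.2.
capped = clipped_surrogate([1.5], [1.0], eps=0.2)
# A ratio of 0.5 with a negative advantage is held at 1 - eps = 0.8.
held = clipped_surrogate([0.5], [-1.0], eps=0.2)
```

Because the objective stops rewarding changes beyond the clipping range, each update moves the policy only a limited distance, which is consistent with the smoother overload commands observed for the PPO maneuver.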
The values in brackets in the legend represent the length of the input historical states, which is set with reference to [31]. It can be seen that both the traditional SAC and the LSTM-SAC converge quickly and make effective decisions for aircraft penetration strategies. In this engagement scenario, LSTM-SAC with three historical states showed the best performance, demonstrating the advantage of the memory component in complex scenarios. LSTM-SAC(4) has higher training efficiency in the first 100 episodes but lower average rewards in the later stages, because the penetration success rate of the agent is lower after training convergence. LSTM-SAC(5) takes second place. A likely reason is that, because of the short simulation step, a window of three states carries information similar to a window of five states, while the five-state input makes convergence harder and increases the computational cost. LSTM-SAC with longer history states should work better in more complex engagement scenarios or in Partially Observable Markov Decision Processes (POMDPs).
It is worth noting that LSTM-SAC(1), with only the current state as input, and LSTM-SAC(2), with only one historical state, perform worse than the classical SAC with the same inputs. This indicates that if the input data do not have a time-dependent relationship, the LSTM networks cannot extract effective information from them, which degrades the performance of the memory-based DRL.
In summary, LSTM-SAC(3) with an effective time series explores the optimal strategy faster and converges to a higher reward value than the classic SAC. The training efficiency of LSTM-SAC(3) is improved by about 75.56% over the SAC, indicating that the improved algorithm significantly reduces the number of training episodes and saves resources and time. To simplify the description, LSTM-SAC in the following text refers to LSTM-SAC(3) with three historical state inputs.
For the LSTM-SAC approach, the average rewards in the first 100 episodes are low and unstable, which means the agent has not yet learned the ideal penetration strategy and the algorithm has not converged to the optimum. In this phase, the maneuvering decisions of the high-speed aircraft are highly stochastic, resulting in easy interception and high energy consumption. As training progressed, the agent gradually learned the expected penetration strategy. In the later training episodes, the reward value was maintained at a high level. The agent learned that the ideal penetration strategy is to avoid the interceptors with lower energy consumption. However, low energy consumption makes the vehicle easier to intercept, so the agent needs to maintain a balance between energy consumption and win rate. Because the agent is still exploring the optimal strategy and the randomly generated initial states are sometimes unfavorable, it is extremely difficult to achieve 100% evasion of the interceptors.

Verification Results
In order to verify the effectiveness of the memory-based DRL, locally generated files of intelligent decision networks based on LSTM-SAC, SAC, and PPO are saved after training convergence.Sinusoidal maneuvers with fixed or random parameters, square maneuvers, and no maneuver scenarios are used to compare with the DRL maneuver.The same simulation scenario and parameters are used in the verification.
The overload commands of the sinusoidal maneuver are written in the body coordinate system in terms of $A$, $T$, and $\omega$, which are the amplitude, period, and phase of the sinusoidal maneuver command function, and $K$, which denotes the bias and appears as a translation of the function.
The overload commands of the square maneuver are written in the body coordinate system in terms of $A_3, A_4$, which are the amplitude constants, and $T_3, T_4$, which are the periods of the square maneuver.
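The two programmed maneuvers can be sketched as simple command generators. The sinusoid below is built from an amplitude, period, phase, and bias as named in the text, and the square wave from an amplitude and period; the exact functional forms, the assignment of parameters to axes, and the sample values are assumptions for illustration.

```python
import math


def sinusoidal_command(t, amplitude, period, phase, bias):
    """Assumed form: amplitude * sin(2*pi*t/period + phase) + bias."""
    return amplitude * math.sin(2.0 * math.pi * t / period + phase) + bias


def square_command(t, amplitude, period):
    """Assumed form: +amplitude in the first half of each period, -amplitude in the second."""
    return amplitude if (t % period) < period / 2.0 else -amplitude


# Hypothetical parameters: 2 g amplitude, 4 s period, zero phase, 1 g bias.
quarter_period_value = sinusoidal_command(1.0, 2.0, 4.0, 0.0, 1.0)
```

A sinusoidal maneuver with random parameters would redraw the amplitude, period, and phase from fixed ranges at fixed intervals and keep calling the same function.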
Different parameters of the sinusoidal maneuver have different effects on the penetration results.Therefore, two forms of sinusoidal maneuver instructions with fixed and random parameters are considered.The sinusoidal maneuver with random parameters is instructed to randomly generate parameters in a fixed range at fixed intervals.After simulation verification, the optimal parameter settings of sinusoidal maneuvers and square maneuvers are shown in Table 6.The above maneuver methods are tested with the same simulation scenario 1000 times, and the test results are recorded.The launch positions of the interceptors are randomly generated in launch sites with a radius of 5 km to verify the robustness of the penetration strategies.
The win rate statistics are shown in Table 7.Excluding the case of no maneuver, the overload response obtained from the 10th test and the corresponding trajectory graph are recorded as shown in Figures 7 and 8, respectively.The corresponding miss distance values are recorded in Table 7.If the miss distance is less than 10 m, it indicates interception.When not maneuvering, the aircraft still has the opportunity to evade with its speed advantage.The addition of maneuver strategies all results in higher escape success rates.For the DRL maneuver strategy, LSTM-SAC and SAC maneuvers have the highest escaping probabilities and are not affected by the different initial flight states of the interceptors.The LSTM-SAC maneuver is considered to possess strong robustness.However, the PPO maneuver has a relatively low win rate, which is related to the aggressive strategy of the PPO algorithm.Among procedural maneuvers, the square maneuver has the highest win rate, followed by the sinusoidal maneuver.According to the overload command response situation, it can be seen that the DRL maneuver commands change in real time based on the state information.The overload responses generated by the LSTM-SAC and SAC maneuvers are roughly the same, indicating that both LSTM-SAC and SAC can converge to the optimal solution.Compared with LSTM-SAC, the overload command of SAC changed more frequently and had a larger amplitude, making it more difficult for the onboard control system to track the command.Due to the 'Clipping' strategy update mechanism, the PPO maneuver has a smoother and smaller overload variation.This saves fuel for higher average rewards but can easily lead to penetration failure.
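The win rate over repeated tests follows directly from the recorded miss distances. The sketch below counts a run as a win when the miss distance exceeds the 10 m threshold quoted above; the function name and the sample list are illustrative and do not reproduce any value from Table 7.

```python
def win_rate(miss_distances, kill_radius=10.0):
    """Fraction of runs whose miss distance exceeds the killing radius."""
    wins = sum(1 for d in miss_distances if d > kill_radius)
    return wins / len(miss_distances)


# Hypothetical miss distances in meters from four test runs.
rate = win_rate([5.0, 12.0, 30.0, 9.9])
```

Applied to the 1000 recorded tests of each maneuver method, the same function would give the win rates compared in the table.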
The commands for programmed maneuvers are set before launch and cannot be changed during flight to suit the actual situation. The sinusoidal maneuver with fixed parameters has the most regular change in overload, making it easy for enemies to predict its trajectory. Therefore, the win rate of this method is the lowest. By comparison, sinusoidal maneuvers with random parameters are harder to predict and make better use of energy than the fixed-parameter version.
The trajectory plots show that the PPO maneuver deviates the least from the original trajectory during the penetration process. The square maneuver consumes the most energy, which leads to a higher win rate but makes it more difficult to return to the predetermined trajectory after the penetration process. The LSTM-SAC maneuver is able to maintain the win rate while deviating from the original trajectory and consuming the least amount of energy. The penetration strategy of LSTM-SAC can therefore be considered to have achieved optimal results in this engagement scenario.
After the penetration process, the high-speed aircraft switches back to horizontal flight and prepares to be guided back to the preset trajectory.

Conclusions
In this paper, the penetration strategy of a high-speed UAV is formulated as a decision-making problem based on a Markov decision process (MDP). A memory-based SAC algorithm is applied as the decision algorithm, in which LSTM networks replace the fully connected networks in the SAC framework. The LSTM networks can learn from previous states, which enables the agent to deal with decision-making problems in complex scenarios. The architecture of the LSTM-SAC approach is described, and the effectiveness of the memory components is verified by mathematical simulations.
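The way a memory-based agent receives previous states can be illustrated with a sliding window over the last n states, which would then be passed to the LSTM layers as one input sequence. This is a minimal sketch, not the authors' implementation. The class name, the list-based state representation, and the choice to pad the window with copies of the initial state at the start of an episode are all assumptions.

```python
from collections import deque


class StateHistory:
    """Sliding window holding the n most recent states, oldest first."""

    def __init__(self, n, state_dim):
        self.n = n
        self.state_dim = state_dim
        self.buffer = deque(maxlen=n)

    def _check(self, state):
        if len(state) != self.state_dim:
            raise ValueError("state has the wrong dimension")

    def reset(self, initial_state):
        """Start an episode; pad the window with the initial state (assumption)."""
        self._check(initial_state)
        self.buffer.clear()
        for _ in range(self.n):
            self.buffer.append(list(initial_state))

    def push(self, state):
        """Append the newest state; the oldest one is dropped automatically."""
        self._check(state)
        self.buffer.append(list(state))

    def sequence(self):
        """Return the window as a list of n states, ready to feed to an LSTM."""
        return [list(s) for s in self.buffer]
```

With n = 3, the window always holds the current state and the two states before it, so the agent decides on a short motion history rather than on a single snapshot.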
The reward function, based on the motion states of both sides, is designed to encourage the aircraft to intelligently escape from the interceptors under energy and range constraints. With the LSTM-SAC approach, the agent is able to continuously learn and improve its strategy. The simulation results demonstrate that the converged LSTM-SAC has similar penetration performance to SAC, but the LSTM-SAC agent with three historical states requires 75.56% fewer training episodes than the classical SAC algorithm. In engagement scenarios with two highly maneuverable interceptors whose initial flight states are random within a certain range, the probability of successful evasion of the high-speed aircraft is higher than 90%. Compared with the conventional programmed maneuver strategies represented by the sinusoidal maneuver and the square maneuver, the LSTM-SAC maneuver strategy achieves a win rate of over 90% and more robust performance.
In the future, noise and unobservable measurements will be introduced into the engagement scenarios, and the effect of LSTM-SAC will be evaluated in more realistic engagement scenarios. Additionally, the 3-DOF dynamic model is based on simplified aerodynamic effects. Because of the complex aerodynamic effects in the real battlefield, there is no guarantee that strategies trained on the point-mass model will perform as well in real scenarios as they do in the simulation environment. In subsequent research, a 6-DOF dynamics model, which can better simulate complex maneuvers, will be considered for penetration strategy research. In summary, LSTM-SAC can effectively improve the training efficiency of classic DRL algorithms, providing a promising new method for the penetration strategy of high-speed UAVs in complex scenarios.

2
k are the constants that shape the reward function. Their values are set based on the actual penetration mission and adjusted during the training process to obtain the optimal results.

π
of the Actor network.
b , respectively, represent the bias of the forget gate, input gate, and output gate.
settings for LSTM-SAC are shown in Table 3, with StateDim representing the dimension of the state space and ActionDim representing the dimension of the action space. The actor network and the critic network are each represented by a two-layer LSTM network and a fully connected (FC) layer. The parameter n represents the number of historical states fed into the algorithm. The rectified linear unit (ReLU) function is chosen as the activation function of the FC layers because it helps mitigate overfitting and speeds up training. The ReLU function is defined as ReLU(x) = max(0, x).

Figure 6. Average reward in the training process.

Figure 8. Verification results for all maneuver strategies.

Table 4. Constant parameters in reward function.

Table 6. Parameter settings in maneuver commands.

Table 7. Win rate and miss distance statistics.