Article

Reentry Trajectory Online Planning and Guidance Method Based on TD3

School of Mechanics and Aerospace Engineering, Dalian University of Technology, Dalian 116081, China
* Author to whom correspondence should be addressed.
Aerospace 2025, 12(8), 747; https://doi.org/10.3390/aerospace12080747
Submission received: 12 July 2025 / Revised: 16 August 2025 / Accepted: 19 August 2025 / Published: 21 August 2025
(This article belongs to the Special Issue Flight Guidance and Control)

Abstract

To address the limited autonomy and poor real-time performance of reentry trajectory planning for the Reusable Launch Vehicle (RLV), an online reentry trajectory planning and guidance method based on the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm is proposed. Because drag acceleration can be measured rapidly by airborne inertial navigation equipment, the reference profile is designed as a drag acceleration–velocity profile within the reentry corridor. To prevent jumps in the trajectory angle caused by unsmooth junctions between profile segments, the profile is composed of four piecewise polynomial segments that connect smoothly at their turning points. Taking advantage of the twin-Critic structure and delayed-update mechanism of TD3, which suppress policy overestimation, multiple strategy networks are trained offline within the TD3 framework to output the profile parameters. Finally, considering the reentry uncertainties and the guidance error caused by the limited bank angle reversal rate during lateral guidance, the networks are invoked online multiple times to solve the profile parameters in real time and update the profile periodically, ensuring the rapidity and autonomy of guidance command generation. Because the TD3 strategy networks are trained offline and invoked online repeatedly, the error accumulated during the previous guidance period is eliminated each time the algorithm is invoked again, and the reentry trajectory is generated and updated rapidly online, which effectively improves landing-point accuracy and computational efficiency.

1. Introduction

The Reusable Launch Vehicle (RLV) [1,2] uses aerodynamic lift during its return and reentry. Compared with ballistic reentry, it has significant advantages such as strong lateral maneuverability, high landing-point accuracy, and controllable overload and peak heat flux, and it is a key technical approach to high-precision pinpoint return and flexible space utilization. However, the reentry process is subject to severe process and terminal constraints, and the flight environment changes dynamically with significant uncertainties, which poses a great challenge to the online generation of feasible trajectories and to guidance. Reentry trajectory planning and guidance algorithms have therefore received extensive attention.
RLV reentry trajectory planning and guidance algorithms [3] mainly include nominal trajectory tracking guidance, predictor–corrector guidance [4,5], and intelligent guidance. Because the flight range can be computed quickly from the drag acceleration, and the drag acceleration itself can be measured quickly by airborne inertial navigation equipment, nominal trajectory tracking guidance that uses a drag acceleration profile as the nominal trajectory has certain advantages. To ensure that the trajectory satisfies all flight constraints, the process constraints are transformed into a reentry corridor, and the drag acceleration–velocity profile [6,7] is planned within that corridor. For time-controllable reentry flight, a fast trajectory planning method based on the drag acceleration–energy profile has also been proposed, which makes the flight time widely adjustable and enables rapid prediction of the flight capability boundary. To obtain a drag acceleration–energy profile that meets the flight-range requirement, References [8,9] parameterized the drag acceleration profile as a multi-segment linear function and then solved the parameters with a prediction–correction method to obtain a standard trajectory. However, the turning points of a piecewise linear function are not smooth, which is unfavorable for tracking guidance. In Reference [10], the inverse drag acceleration profile was described by a cubic spline function, and a smooth trajectory was generated quickly. Traditional standard trajectory guidance relies on the nominal trajectory and was successfully applied to the Space Shuttle, but its flexibility is limited, and it is difficult to meet the autonomy requirements of future vehicles. Numerical predictor–corrector guidance relies on iterative numerical integration, which has low computational efficiency and struggles to meet real-time requirements. It is therefore urgent to develop online trajectory planning and guidance algorithms.
With the development of artificial intelligence, intelligent guidance offers new solutions to the above problems. The core of intelligent guidance is to use artificial intelligence (AI) technology, especially machine learning (ML), deep learning (DL), reinforcement learning (RL), and intelligent optimization algorithms, to give the reentry guidance system stronger perception, learning, decision-making, and adaptation capabilities. References [11,12] proposed a reentry guidance method based on reinforcement learning that adopts a decoupled longitudinal/lateral guidance design and uses reinforcement learning to output the bank angle magnitude, achieving considerable computational efficiency together with higher terminal accuracy and robustness. To shorten the long generation time of predictor–corrector guidance commands, Reference [13] introduced the Deep Deterministic Policy Gradient (DDPG) combined with a flight profile generated from quasi-equilibrium glide conditions; through the design of the reinforcement learning reward function and the adjustment of the guidance period, the non-convergence caused by sparse rewards during training was resolved, so that the vehicle can complete flight tasks under both process and terminal constraints. To improve the flight potential of the vehicle, Reference [14] used the Proximal Policy Optimization (PPO) and TD3 algorithms to train reinforcement learning agents that output angle of attack and bank angle commands, improving the adaptability of the vehicle. Considering the influence of no-fly zones, References [15,16] proposed a step-by-step intelligent guidance framework, progressing from predictor–corrector guidance, to a bank angle guidance model pre-trained with supervised learning, to a bank angle guidance model further upgraded with reinforcement learning, which frees the bank angle solution space from the restrictions of the traditional predictor–corrector method and is expected to produce better fly-around strategies.
In this paper, the pre-reentry trajectory is obtained by numerical integration of the dynamic equations with a constant angle of attack and bank angle. With the angle of attack profile preset, the process constraints are converted into a reentry corridor, and a four-segment piecewise polynomial drag acceleration reference profile is designed for the glide trajectory; the profile parameters are simplified according to the requirements of profile continuity and smoothness. Then, multiple TD3 strategy networks are trained offline with a hierarchical reward designed to meet the flight-range requirement, and the parameters of the different profile nodes are solved separately. Meanwhile, the bank angle command is obtained through longitudinal guidance that tracks the drag acceleration profile, and lateral guidance is performed through bank angle reversals that limit the heading angle deviation. According to the current state and the flight range at each drag acceleration profile node, the TD3 strategy networks are invoked multiple times online to plan the trajectory and update the profile periodically. Combined with tracking guidance, the error accumulated in the previous guidance period is eliminated each time the algorithm is invoked again, which effectively improves landing-point accuracy and computational efficiency and realizes the rapid online generation and update of the reentry trajectory.

2. Reentry Modeling and Constraints

Ignoring the Earth’s oblateness, considering the Earth’s rotation, and taking time as the independent variable, the 3-DOF reentry dynamic model [6] of RLV is as follows:
\dot{r} = V \sin\gamma
\dot{\lambda} = \frac{V \cos\gamma \cos\phi}{r}
\dot{l} = \frac{V \cos\gamma \sin\phi}{r \cos\lambda}
\dot{V} = -\frac{D}{m} - g \sin\gamma + \tilde{C}_V
\dot{\gamma} = \frac{1}{V}\left( \frac{L \cos\sigma}{m} - g \cos\gamma + \frac{V^2 \cos\gamma}{r} \right) + \tilde{C}_\gamma
\dot{\phi} = \frac{1}{V}\left( \frac{L \sin\sigma}{m \cos\gamma} + \frac{V^2 \cos\gamma \tan\lambda \sin\phi}{r} \right) + \tilde{C}_\phi
\tilde{C}_V = \omega_e^2 r \cos\lambda \left( \cos\lambda \sin\gamma - \sin\lambda \cos\phi \cos\gamma \right)
\tilde{C}_\gamma = 2\omega_e \cos\lambda \sin\phi + \frac{\omega_e^2 r \cos\lambda \left( \cos\lambda \cos\gamma + \sin\lambda \cos\phi \sin\gamma \right)}{V}
\tilde{C}_\phi = \frac{\omega_e^2 r \cos\lambda \sin\lambda \sin\phi}{V \cos\gamma} - \frac{2\omega_e \left( \cos\lambda \cos\phi \sin\gamma - \sin\lambda \cos\gamma \right)}{\cos\gamma}
where r is the geocentric distance, λ is the latitude, l is the longitude, V is the velocity of the aircraft relative to the Earth, m is the mass of the aircraft, γ is the flight-path (track) angle, ϕ is the track azimuth (heading) angle, measured from north and positive toward the west, g is the acceleration of gravity, σ is the bank angle, and ω_e is the angular velocity of the Earth's rotation. C̃_V, C̃_γ, and C̃_ϕ are the apparent accelerations generated by the rotation of the Earth, and D and L are the drag and lift acting on the aircraft.
Assuming that the Earth is an ideal sphere, the calculation formula of the flight height of the aircraft is as follows:
h = r - R_e
where R e is the average radius of the Earth.
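As a minimal illustrative sketch (not the authors' code), the right-hand side of Equations (1) and (2) can be evaluated as follows; the inverse-square gravity model, the constants, and the lift and drag accelerations passed in as L/m and D/m are simplifying assumptions.

```python
import numpy as np

R_E = 6.371e6       # mean Earth radius [m]
OMEGA_E = 7.292e-5  # Earth rotation rate [rad/s]
G0 = 9.81           # sea-level gravity [m/s^2]

def reentry_rhs(state, sigma, L_over_m, D_over_m):
    """Right-hand side of the rotating-Earth 3-DOF model, Equations (1)-(2).
    state = [r, lam (latitude), l (longitude), V, gamma, phi]; angles in rad."""
    r, lam, l, V, gam, phi = state
    g = G0 * (R_E / r) ** 2      # inverse-square gravity (assumed model)

    # Earth-rotation terms of Equation (2)
    C_V = OMEGA_E**2 * r * np.cos(lam) * (np.cos(lam) * np.sin(gam)
          - np.sin(lam) * np.cos(phi) * np.cos(gam))
    C_g = 2.0 * OMEGA_E * np.cos(lam) * np.sin(phi) \
          + OMEGA_E**2 * r * np.cos(lam) * (np.cos(lam) * np.cos(gam)
          + np.sin(lam) * np.cos(phi) * np.sin(gam)) / V
    C_p = OMEGA_E**2 * r * np.cos(lam) * np.sin(lam) * np.sin(phi) / (V * np.cos(gam)) \
          - 2.0 * OMEGA_E * (np.cos(lam) * np.cos(phi) * np.sin(gam)
          - np.sin(lam) * np.cos(gam)) / np.cos(gam)

    return np.array([
        V * np.sin(gam),                                                    # r_dot
        V * np.cos(gam) * np.cos(phi) / r,                                  # lambda_dot
        V * np.cos(gam) * np.sin(phi) / (r * np.cos(lam)),                  # l_dot
        -D_over_m - g * np.sin(gam) + C_V,                                  # V_dot
        (L_over_m * np.cos(sigma) - g * np.cos(gam)
         + V**2 * np.cos(gam) / r) / V + C_g,                               # gamma_dot
        (L_over_m * np.sin(sigma) / np.cos(gam)
         + V**2 * np.cos(gam) * np.tan(lam) * np.sin(phi) / r) / V + C_p,   # phi_dot
    ])
```

A simple Euler or Runge–Kutta step over this function is one way to perform the kind of numerical integration used later for the pre-reentry section.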
In order to ensure that the aircraft successfully completes the reentry mission, the core constraints that the aircraft needs to meet during the reentry process mainly include process constraints, control constraints, and terminal constraints.
The process constraints [7] mainly comprise the heat flux constraint, the overload constraint, the dynamic pressure constraint, and the quasi-equilibrium glide constraint (QEGC) evaluated at zero bank angle. The formulas are
\dot{Q} = K_Q \sqrt{\rho}\, V^{3.15} \le \dot{Q}_{\max}
n_z = (L\cos\alpha + D\sin\alpha)/(mg) \le n_{z\max}
\bar{q} = \rho V^2 / 2 \le \bar{q}_{\max}
g - V^2/r - L/m \le 0
where α is the angle of attack, Q ˙ is the heat flux rate, K Q is the heat transfer coefficient, n z is aerodynamic overload, and q ¯ is dynamic pressure.
The control variables of the aircraft are the angle of attack and the bank angle. In the trajectory optimization, their amplitudes need to be constrained. The formula is
|\alpha| \le \alpha_{\max}, \quad |\sigma| \le \sigma_{\max}
where α max and σ max are the maximum values of the angle of attack and the bank angle.
The terminal constraints of an aircraft generally include altitude, velocity, latitude, and longitude constraints, which can be expressed as
V(t_f) = V_f, \quad H(t_f) = H_f, \quad \lambda(t_f) = \lambda_f, \quad l(t_f) = l_f
where t f is the terminal moment, V f is the terminal velocity, H f is the terminal height, λ f is the terminal latitude, and l f is the terminal longitude.

3. Rapid Planning of Reentry Trajectory in Drag Acceleration–Velocity Profile

3.1. Angle of Attack Profile and Reentry Corridor

1. Angle of attack profile
The angle of attack profile (α–V) expresses the angle of attack as a function of velocity. At the beginning of reentry, the RLV needs to fly at a high angle of attack to increase lift and drag, so as to avoid the excessive heat flux caused by an excessive altitude drop. As the speed and altitude decrease, the angle of attack should be reduced gradually to lower the lift and drag, which avoids trajectory skips caused by excessive lift and also reduces drag and increases range. Toward the end of the reentry phase, to ensure that the RLV can perform guidance tracking in the energy management phase, the angle of attack must transition to the vicinity of the value for the maximum lift-to-drag ratio [7].
According to the above design principles, this paper refers to the angle of attack profile planning method of the space shuttle. At the beginning of reentry, it keeps flying at a high angle of attack. When the velocity is less than a certain value, the angle of attack begins to decrease with the velocity until it reaches the maximum lift-to-drag ratio at the end of reentry. In order to ensure the smooth connection of the subsequent track angle curve, the designed angle of attack–velocity profile formula is as follows:
\alpha(V) = \begin{cases} 45^\circ, & V \ge 5000 \\ a_\alpha V^2 + b_\alpha V + c_\alpha, & \text{else} \end{cases}
a_\alpha \cdot 5000^2 + b_\alpha \cdot 5000 + c_\alpha = 45, \quad a_\alpha V_f^2 + b_\alpha V_f + c_\alpha = 15, \quad -b_\alpha / (2 a_\alpha) = 5000
where a_α, b_α, and c_α are design parameters.
The angle of attack profile designed in this paper is shown in Figure 1.
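As an illustration, the quadratic coefficients of the profile above can be obtained by solving a small linear system built from its three conditions. The sketch below is ours; the terminal velocity of 760 m/s is taken from Table 5 and the 15° end value from the conditions above.

```python
import numpy as np

def alpha_profile_coeffs(V_switch=5000.0, V_f=760.0, alpha_hi=45.0, alpha_f=15.0):
    """Solve a_alpha, b_alpha, c_alpha from the three conditions: value alpha_hi at
    V_switch, value alpha_f at V_f, and vertex (zero slope) at V_switch."""
    A = np.array([[V_switch**2,    V_switch, 1.0],
                  [V_f**2,         V_f,      1.0],
                  [2.0 * V_switch, 1.0,      0.0]])   # -b/(2a) = V_switch
    b = np.array([alpha_hi, alpha_f, 0.0])
    return np.linalg.solve(A, b)                      # [a_alpha, b_alpha, c_alpha]

def alpha_of_V(V, coeffs, V_switch=5000.0, alpha_hi=45.0):
    """Piecewise angle-of-attack profile, in degrees."""
    a, b, c = coeffs
    return alpha_hi if V >= V_switch else a * V**2 + b * V + c
```

The zero-slope condition at 5000 m/s is what guarantees the smooth junction between the constant and quadratic branches.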
2. Reentry corridor
The calculation method of the reentry corridor is to convert the multiple process constraints into the constraints of a single physical quantity through a series of mathematical transformations [17]. The three process constraints of heat flow, overload, and dynamic pressure form the hard constraints of the reentry corridor, and QEGC forms the soft constraints of the reentry corridor, which together form the upper and lower boundaries of the reentry corridor.
Transforming Equations (4)–(7) into the drag acceleration–velocity corridor gives
D_{a,\dot{Q}} = \left( \frac{\dot{Q}_{\max}}{K_Q V^{2.15}} \right)^2 \frac{C_D S}{2m}, \quad D_{a,n_z} = \frac{n_{z\max}\, g}{(C_L/C_D)\cos\alpha + \sin\alpha}, \quad D_{a,\bar{q}} = \frac{\bar{q}_{\max} C_D S}{m}, \quad D_{a,eg} = \left( g - \frac{V^2}{r} \right) \frac{C_D}{C_L}
D_{a,eg} \le D_a(V) \le \min\left( D_{a,\dot{Q}}, D_{a,n_z}, D_{a,\bar{q}} \right)
where D a Q ˙ , D a n z , D a q ¯ , and D a e g are the drag acceleration boundaries corresponding to Q ˙ max , n z max , q ¯ max , and QEGC constraints, respectively. D a ( V ) is the drag acceleration reference profile.
It is assumed that the overall constraints of the RLV are as follows: the maximum heat flux rate is Q̇_max = 3000 kW/m²; the maximum overload is n_z,max = 3; and the maximum dynamic pressure is q̄_max = 50 kPa. The relationship between drag acceleration and relative velocity is calculated by Equation (12), and the reentry corridor in the drag acceleration–velocity plane is obtained, as shown in Figure 2. D_a,Q̇, D_a,nz, and D_a,q̄ together constitute the upper boundary of the drag acceleration–velocity corridor, and D_a,eg forms its lower boundary.
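The conversion of Equations (12) and (13) into corridor bounds can be sketched as follows; the vehicle mass, reference area, heat-transfer coefficient, and the aerodynamic coefficients supplied by the caller are placeholder assumptions, not data from the paper.

```python
import numpy as np

# Placeholder vehicle and constraint data -- illustrative only, not from the paper.
M, S_REF = 1.0e4, 250.0                            # mass [kg], reference area [m^2]
K_Q = 1.1e-4                                       # heat-transfer coefficient (assumed units)
QDOT_MAX, NZ_MAX, QBAR_MAX = 3000e3, 3.0, 50e3     # W/m^2, -, Pa
R_E, G0 = 6.371e6, 9.81

def corridor_bounds(V, h, C_L, C_D, alpha):
    """Drag-acceleration corridor of Equations (12)-(13) at one (V, h) point;
    alpha in radians."""
    r = R_E + h
    g = G0 * (R_E / r) ** 2
    D_heat = (QDOT_MAX / (K_Q * V**2.15))**2 * C_D * S_REF / (2.0 * M)
    D_load = NZ_MAX * g / ((C_L / C_D) * np.cos(alpha) + np.sin(alpha))
    D_qbar = QBAR_MAX * C_D * S_REF / M
    D_qegc = (g - V**2 / r) * C_D / C_L            # soft (QEGC) bound
    return D_qegc, min(D_heat, D_load, D_qbar)     # (lower, upper) of Equation (13)
```

Sweeping this function over the velocity range reproduces the kind of boundary curves plotted in Figure 2.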

3.2. Reentry Drag Acceleration Reference Trajectory Designing

The reentry trajectory of an unpowered RLV is usually divided into a pre-reentry section and a glide section [18]. The pre-reentry section typically lies at 60–120 km above sea level. In this stage the atmosphere is very thin and aerodynamic effects are weak, so it is difficult for the aircraft to adjust its trajectory aerodynamically; it is therefore usually flown with a constant, large angle of attack and bank angle. When the aircraft enters the glide phase, the aerodynamic forces grow as the altitude decreases and the aircraft gradually acquires guidance capability. Aerodynamic force can then be used to adjust the reentry trajectory, so this phase is the key part of reentry trajectory planning.
1. Pre-reentry section
When the spacecraft is flying in the rarefied atmosphere, the aerodynamic force is weak, the aircraft has no guidance ability, and the control quantity has little effect on the descent trajectory. It can be considered that the spacecraft makes free-fall motion in this stage. Therefore, the angle of attack and bank angle in the pre-reentry section can be taken as constant α 0 and σ 0 . The reentry trajectory in the pre-reentry section is obtained by numerical integration of the dynamic Equation (1) [18].
The conditions for the end of the pre-reentry section are set as follows:
n_z \ge \delta
where δ is a small quantity pre-set according to the accuracy requirement.
Equation (14) is used as the end condition of the pre-reentry section because, as the RLV descends from the thin atmosphere into the dense atmosphere, once the overload reaches a certain value the aircraft can rely on aerodynamic force to change its flight trajectory, i.e., it gradually acquires the ability to shape the trajectory.
2. Glide section
The glide section requires a segmented trajectory design and is generally composed of four sections [7]: the temperature control section, the equilibrium glide section, the constant drag section, and the transition section. To ensure a smooth connection between sections, first- or second-order functions are selected for the profile design. The drag acceleration–velocity expression of each stage is given below.
  • Temperature control section: Ensure that the aircraft passes through the high heat flux area safely, starting from the end of the pre-reentry section until the maximum heat flux peak appears.
D_a(V) = b_1 V + c_1, \quad \text{if } V_1 < V \le V_{yz}
  • Equilibrium glide section: Establish a larger overload as soon as possible.
D_a(V) = a_2 V^2 + b_2 V + c_2, \quad \text{else if } V_2 < V \le V_1
  • Constant drag section: Maintain a large overload so that the aircraft has better maneuverability.
D_a(V) = c_3, \quad \text{else if } V_3 < V \le V_2
  • Transition section: By adjusting the velocity corresponding to the start of the transition section, the whole range can be adjusted more accurately.
D_a(V) = a_4 V^2 + b_4 V + c_4, \quad \text{else } V_f < V \le V_3
where b 1 , c 1 , a 2 , b 2 , c 2 , a 4 , b 4 , and c 4 are trajectory design parameters. V y z , V 1 , V 2 , V 3 , and V f are the terminal velocities of each section of the pre-reentry section and the gliding section.
According to Equations (15)–(18), the reference profile has nine design parameters. Designing them directly is complicated and unintuitive; if the endpoints of each section are instead taken as the parameters to be designed, the design is greatly simplified.
After the angle of attack–velocity profile is determined, the process constraints can be directly converted into the boundary of the reentry corridor. As long as the aircraft flies within the reentry corridor, all constraints are satisfied.
According to the principle of smooth connection between segments, the pre-reentry terminal state (V_yz, D_a,yz), its slope dD_a/dV at V_yz, and the reentry terminal state (V_f, D_a,f) are known:
  • If the terminal state (V_1, D_a1) of the temperature control section is specified, then, to keep the trajectory angle from jumping between the pre-reentry section and the glide section, b_1 = dD_a/dV at V_yz and c_1 can be calculated.
  • If the terminal velocity V_2 of the equilibrium glide section is specified, then, given the terminal state (V_1, D_a1) and the slope dD_a/dV at V_1 of the temperature control section, together with the zero terminal slope dD_a/dV = 0 at V_2 of the equilibrium glide section, a_2, b_2, c_2 and the terminal drag acceleration D_a2 of the equilibrium glide section can be calculated.
  • Given the terminal drag acceleration D_a2 of the equilibrium glide section, c_3 and the terminal drag acceleration D_a3 of the constant drag section can be calculated.
  • If the terminal velocity V_3 of the constant drag section is specified, then, given D_a3 and the reentry terminal state (V_f, D_a,f), a_4, b_4, and c_4 can be calculated.
Let the parameters to be designed be
X = [ V 1 , V 2 , V 3 ]
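A sketch of how the nine coefficients of Equations (15)–(18) follow from X = [V_1, V_2, V_3] and the continuity/smoothness conditions listed above is given below; the zero-slope condition imposed at V_3 for the transition segment is our assumption for closing the system, and the function names are ours.

```python
import numpy as np

def glide_profile(X, V_yz, Da_yz, slope_yz, V_f, Da_f):
    """Solve the coefficients of Equations (15)-(18) from X = [V1, V2, V3] using the
    continuity/smoothness conditions of Section 3.2 (sketch)."""
    V1, V2, V3 = X
    # Temperature-control segment: linear, matching value and slope at V_yz
    b1 = slope_yz
    c1 = Da_yz - b1 * V_yz
    Da1 = b1 * V1 + c1
    # Equilibrium-glide segment: Da(V1) = Da1, Da'(V1) = b1, Da'(V2) = 0
    a2, b2, c2 = np.linalg.solve(
        np.array([[V1**2, V1, 1.0], [2.0 * V1, 1.0, 0.0], [2.0 * V2, 1.0, 0.0]]),
        np.array([Da1, b1, 0.0]))
    c3 = a2 * V2**2 + b2 * V2 + c2            # constant-drag level: Da2 = Da3
    # Transition segment: Da(V3) = c3, Da'(V3) = 0 (assumed), Da(V_f) = Da_f
    a4, b4, c4 = np.linalg.solve(
        np.array([[V3**2, V3, 1.0], [2.0 * V3, 1.0, 0.0], [V_f**2, V_f, 1.0]]),
        np.array([c3, 0.0, Da_f]))

    def Da(V):
        """Piecewise reference profile D_a(V) of Equations (15)-(18)."""
        if V > V1:
            return b1 * V + c1
        if V > V2:
            return a2 * V**2 + b2 * V + c2
        if V > V3:
            return c3
        return a4 * V**2 + b4 * V + c4

    return Da
```

Because every segment shares its value (and, where imposed, its slope) with its neighbor, the resulting profile avoids the trajectory-angle jumps mentioned earlier.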
According to the flight mission and characteristics of each segment, the basic principles of design parameter adjustment are as follows:
  • The endpoints, and the drag acceleration profile they determine, must lie within the reentry corridor;
  • The segments should connect as smoothly as possible at their endpoints;
  • Control quantities such as the bank angle must not exceed the control constraints;
  • The terminal velocity of the temperature control section should be lower than the velocity at which the maximum heat flux occurs;
  • Considering the uncertainty of the aerodynamic parameters, the drag acceleration profile should keep a certain margin from the reentry corridor boundaries.
According to Equations (12) and (13), the upper and lower boundaries of the reentry corridor can be written as functions of V:
D_{a,up}(V) = f_{up}(V)
D_{a,down}(V) = f_{down}(V)
The reference drag acceleration profile parameters are uniquely determined by X and can be described as a function of X and V .
D_a(V) = f(X, V)
Considering the above principles and the characteristics of the reentry process, the online planning problem for the reentry drag acceleration reference profile is finally posed as follows: given the initial conditions, and subject to the process, control, and terminal constraints, the trajectory parameters X are optimized online with the remaining flight distance as the objective, so as to plan a feasible trajectory. The expression is
L_{togo} = \sqrt{ (\lambda - \lambda_f)^2 + (l - l_f)^2 }
\min J(X) = L_{togo} \quad \text{s.t. } D_{a,down} < D_a < D_{a,up}, \ V_{Qm} < V_1 < V_{Qn}, \ V_{alpha} < V_2 < V_1 - \delta, \ V_z + \varepsilon < V_3 < V_2
where L_togo is the distance to be flown; λ and l are the latitude and longitude of the current aircraft position; J(X) is the objective function; V_Qm and V_Qn are the velocities at which the heat flux has decreased by m% and n% from its maximum, respectively; V_alpha is the velocity at which the angle of attack begins to be reduced; and δ and ε are thresholds.

4. Online Trajectory Planning and Guidance Based on TD3

As shown in Figure 3, the TD3-based online reentry trajectory planning and guidance algorithm consists of two parts: ① offline training, in which multiple strategy networks are trained within the TD3 framework by dispersing the initial states and performing tracking guidance offline; and ② online planning and guidance, in which the TD3 strategy networks are invoked online to plan and periodically update the drag acceleration profile, and the tracking guidance law is computed online.

4.1. Offline Network Training Based on TD3

4.1.1. Twin Delayed Deep Deterministic Policy Gradient (TD3)

Reinforcement learning [19] is a machine learning paradigm for sequential decision-making. As shown in Figure 4, its core mechanism is the multi-step interaction between an agent and the environment, imitating the 'trial-and-error' learning of living organisms: the agent learns to take a sequence of actions (Action) that change the state of the environment (State) so as to maximize the cumulative reward (Reward), gradually approaching the optimal behavior policy.
The TD3 algorithm [20] is a deep reinforcement learning algorithm based on the Actor–Critic framework that aims to solve the Q-value overestimation problem of the Deep Deterministic Policy Gradient (DDPG) method. Its core design includes three key techniques: Clipped Double Q-Learning, Target Policy Smoothing, and Delayed Policy Updates.
1. Clipped Double Q-Learning
To suppress the overestimation bias of the Q function, TD3 maintains two independent Critic networks (Q_{θ1}, Q_{θ2}) together with their target networks (Q_{θ1'}, Q_{θ2'}). The target Q value is generated from the minimum of the two:
y = r_a + \gamma_a \min_{i=1,2} Q_{\theta_i'}(s', a')
where r_a is the reward produced by the action, γ_a is the discount factor, and (s', a') are the state and action at the next moment.
2. Target Policy Smoothing
When TD3 computes the target Q value, a small clipped Gaussian noise is added to the target action, with the disturbance limited to the range (−c, c), so that the Q function is locally smooth in the action space and the policy does not overfit sharp peaks of the Q function.
a' = \pi_{\phi'}(s') + \chi, \quad \chi \sim \mathrm{clip}\left( \mathcal{N}(0, \tilde{\sigma}), -c, c \right)
where χ is normally distributed random noise clipped to the range (−c, c).
3. Delayed Policy Updates
TD3 introduces a delay effect in the update mechanism of the Actor network; that is, the update of the Actor network will lag behind the multiple iterations of the Critic network, ensuring that the Actor can receive more stable and accurate Q value information.
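For reference, a compact PyTorch-style sketch of one TD3 update combining the three mechanisms above is shown below; the network classes, the replay batch format, and the hyperparameter values are assumptions for illustration, not the training code used in this paper.

```python
import torch
import torch.nn as nn

def td3_update(actor, critic1, critic2, targ_actor, targ_c1, targ_c2,
               opt_actor, opt_c1, opt_c2, batch, step,
               gamma=0.99, tau=0.005, noise_std=0.2, noise_clip=0.5, policy_delay=2):
    """One TD3 step: clipped double-Q target, target-policy smoothing, delayed actor
    update, and soft target synchronization. Critics are assumed to be nn.Modules
    called as critic(state, action); batch = (s, a, r, s2, done) tensors."""
    s, a, r, s2, done = batch
    with torch.no_grad():
        noise = (torch.randn_like(a) * noise_std).clamp(-noise_clip, noise_clip)
        a2 = (targ_actor(s2) + noise).clamp(0.0, 1.0)          # actions normalized to [0, 1]
        q_targ = torch.min(targ_c1(s2, a2), targ_c2(s2, a2))   # clipped double-Q target
        y = r + gamma * (1.0 - done) * q_targ
    for critic, opt in ((critic1, opt_c1), (critic2, opt_c2)):
        loss = nn.functional.mse_loss(critic(s, a), y)
        opt.zero_grad(); loss.backward(); opt.step()
    if step % policy_delay == 0:                               # delayed policy update
        actor_loss = -critic1(s, actor(s)).mean()
        opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()
        for net, targ in ((actor, targ_actor), (critic1, targ_c1), (critic2, targ_c2)):
            for p, tp in zip(net.parameters(), targ.parameters()):
                tp.data.mul_(1.0 - tau).add_(tau * p.data)     # soft target update
```

Clamping the perturbed target action to [0, 1] mirrors the normalized action space defined in Section 4.1.2.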

4.1.2. TD3 Strategy Network Designing

Considering the environmental uncertainty and the bank angle reversal strategy adopted by the lateral guidance, the bank angle reversal rate is limited and a reversal cannot be completed instantaneously; the time required for each reversal during actual flight therefore degrades the accuracy of the final landing point. To mitigate this, the core idea of the TD3 online trajectory planning algorithm is to train three different TD3 reinforcement learning strategy networks N_i (i = 1, 2, 3): ① Strategy network N_1: the initial reentry states are dispersed, the terminal-state data set of the pre-reentry section is collected offline, and N_1 is trained to generate the drag acceleration profile of the entire glide section at the beginning of the glide section. ② Strategy network N_2: tracking guidance is carried out offline on the drag acceleration profile of the temperature control section, the terminal-state data set of the temperature control section is collected offline, and N_2 is trained to generate the drag acceleration profile of the equilibrium glide section, the constant drag section, and the transition section at the beginning of the equilibrium glide section. ③ Strategy network N_3: tracking guidance is carried out offline on the drag acceleration profile of the equilibrium glide section, the terminal-state data set of the equilibrium glide section is collected offline, and N_3 is trained to generate the drag acceleration profile of the constant drag section and the transition section at the beginning of the constant drag section. By invoking the TD3 strategy networks N_i multiple times online at the nodes of the glide section and combining them with tracking guidance, the error accumulated in the previous guidance period is eliminated each time the algorithm is invoked again, and the landing-point accuracy of the aircraft is improved.
TD3 strategy network training needs to set up three parts, namely state space (State), action space (Action), and reward (Reward).
1. State space
The state vector S ∈ R^13 comprehensively characterizes the initial motion state and the task target of the aircraft.
S = \left[ \bar{h}_0, \lambda_0, l_0, \bar{V}_0, \gamma_0, \phi_0, \sigma_0, \bar{V}_f, D_{af}, \lambda_f, l_f, \bar{L}_{togo}, \bar{t}_0 \right]
where h̄_0, V̄_0, V̄_f, L̄_togo, and t̄_0 are the normalized initial altitude, initial velocity, terminal velocity, distance between the aircraft and the target point, and elapsed time, respectively.
The first seven states describe the initial motion state of the aircraft, the eighth to eleventh states give the task target information, and the last two give the remaining flight distance and the simulation time. To eliminate the influence of dimensions on training, some state variables are normalized as follows:
\bar{x} = x / k
where x ¯ is the normalized variable and k is the normalized coefficient, see Table 1.
2. Action space
The action space determines the drag acceleration profile parameters and hence the size and shape of the profile. Because the three strategy networks N_i generate different parts of the trajectory, their action spaces A differ. Each action output by the agent is normalized to a_i ∈ [0, 1], and the specific designs are as follows:
  • Strategy network N 1 :
A = [ a 1 , a 2 , a 3 ]
V_1 = V_{Qm} + a_1 (V_{Qn} - V_{Qm})
V_2 = V_{alpha} + a_2 (V_1 - \delta - V_{alpha})
V_3 = V_z + \varepsilon + a_3 (V_2 - V_z - \varepsilon)
  • Strategy network N 2 :
A = [ a 2 , a 3 ]
V_2 = V_{alpha} + a_2 (V_1 - \delta - V_{alpha})
V_3 = V_z + \varepsilon + a_3 (V_2 - V_z - \varepsilon)
  • Strategy network N 3 :
A = a 3
V_3 = V_z + \varepsilon + a_3 (V_2 - V_z - \varepsilon)
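A sketch of the mapping from the normalized actions above to the profile node velocities is given below; the threshold values δ and ε are placeholders and the function signature is ours.

```python
def actions_to_nodes(actions, V_Qm, V_Qn, V_alpha, V_z,
                     V1=None, V2=None, delta=50.0, eps=50.0):
    """Map normalized actions a_i in [0, 1] to the profile node velocities, following
    the action-space design above. N1 passes three actions; N2 passes two actions plus
    the already fixed V1; N3 passes one action plus V1 and V2."""
    a = list(actions)
    if V1 is None:                                   # network N1 chooses V1
        V1 = V_Qm + a.pop(0) * (V_Qn - V_Qm)
    if V2 is None:                                   # N1 or N2 chooses V2
        V2 = V_alpha + a.pop(0) * (V1 - delta - V_alpha)
    V3 = V_z + eps + a.pop(0) * (V2 - V_z - eps)     # every network chooses V3
    return V1, V2, V3
```

Because each node is bounded by the previous one, any action in [0, 1] yields an ordered, feasible set of node velocities.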
3. Reward
The reward function in reinforcement learning is the core of driving agent learning, and its design quality directly determines the success or failure of reinforcement learning training. In this paper, a hierarchical reward design method is adopted to meet the requirements of the flight range under the condition of satisfying the process constraints so that the aircraft can successfully reach the target point.
  • Flight distance reward R L
The goal of this reward is to guide the aircraft to approach the target point, so the reward is negatively correlated with the terminal flight range. A piecewise linear reward is therefore designed such that the smaller the terminal flight range, the larger the reward.
R_L = \begin{cases} r_1 (L_1 - L_{togo}), & L_{togo} < L_{goal1} \\ r_2 (L_2 - L_{togo}) + r_1 (L_1 - L_{goal1}), & L_{goal1} \le L_{togo} < L_{goal2} \\ r_2 (L_2 - L_{goal2}), & L_{togo} \ge L_{goal2} \end{cases}
where r 1 , r 2 , L 1 , L 2 , L g o a l 1 , and L g o a l 2 are all positive coefficients.
  • Process constraint reward R c
If the drag acceleration profile leaves the reentry corridor, a negative reward is given; otherwise, no reward is given.
R_c = \begin{cases} a_c (D_a - D_{a,down}), & D_a \le D_{a,down} \\ a_c (D_{a,up} - D_a), & D_a \ge D_{a,up} \\ 0, & D_{a,down} < D_a < D_{a,up} \end{cases}
where a c is a positive coefficient.
  • Bank angle reversal reward
To reduce the number of bank angle reversals, a negative reward is given when the total number of reversals exceeds a set value.
R_\sigma = \begin{cases} -a_\sigma, & T(\sigma) > T(\sigma)_{\max} \\ 0, & \text{else} \end{cases}
where a_σ is a positive coefficient, T(σ) is the number of bank angle reversals during the whole reentry, and T(σ)_max is the limit on the number of reversals.
In summary, the total reward given to the aircraft at the end of an episode is
R = R_L + R_c + R_\sigma, \quad \mathrm{IsDone}
where IsDone, defined by (V - V_f)^2 \le \Delta V, indicates the end of a training episode and ΔV is the terminal velocity threshold.
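A sketch of the hierarchical reward above (R_L, R_c, R_σ and their terminal sum) is shown below, with the coefficients of Table 3 as defaults; summing the corridor penalty over sampled profile points is our interpretation, and the helper names are ours.

```python
def flight_range_reward(L_togo, r1=24.0, r2=12.67, L1=5.0, L2=3.0,
                        L_goal1=5.0, L_goal2=13.0):
    """Piecewise flight-range reward R_L (coefficients from Table 3)."""
    if L_togo < L_goal1:
        return r1 * (L1 - L_togo)
    if L_togo < L_goal2:
        return r2 * (L2 - L_togo) + r1 * (L1 - L_goal1)
    return r2 * (L2 - L_goal2)

def corridor_reward(Da, Da_down, Da_up, a_c=20.0):
    """Process-constraint reward R_c: negative when the profile leaves the corridor."""
    if Da <= Da_down:
        return a_c * (Da - Da_down)
    if Da >= Da_up:
        return a_c * (Da_up - Da)
    return 0.0

def reversal_reward(n_reversals, a_sigma=10.0, n_max=5):
    """Bank-angle reversal reward R_sigma."""
    return -a_sigma if n_reversals > n_max else 0.0

def episode_reward(L_togo, profile_points, n_reversals):
    """Total reward R = R_L + R_c + R_sigma, granted when the episode ends (IsDone);
    profile_points is an iterable of (Da, Da_down, Da_up) samples along the profile."""
    R = flight_range_reward(L_togo) + reversal_reward(n_reversals)
    R += sum(corridor_reward(Da, lo, up) for Da, lo, up in profile_points)
    return R
```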
The design logic of the TD3 strategy network is as follows: the aircraft state is taken as the environment state, and the agent optimizes cooperatively with a dual-Critic network and a delayed policy update. The two Critic networks are updated in parallel and the minimum of their outputs is used to construct the target Q value, suppressing the overestimation bias; the Actor network is optimized at a lower update frequency than the Critics to keep the value evaluation stable. At the same time, clipped Gaussian noise is added to the output action of the target Actor to smooth the Q-value estimate, the target network parameters are synchronized gradually through a soft update mechanism, and finally the action that maximizes the reward is output, as shown in Figure 5.

4.2. The Tracking Guidance

4.2.1. Longitudinal Guidance

In the TD3 online trajectory planning algorithm, the drag acceleration–velocity profile produced by the strategy network can be converted into a height–velocity profile, and, combined with the 3-DOF reentry dynamic model, the magnitude of the bank angle can be solved inversely.
Through the transformation of Equation (1), it can be obtained that
\frac{dH}{dV} = \frac{V \sin\gamma}{ -D/m - g \sin\gamma + \tilde{C}_V }
When the height–velocity profile is known, the height can be expressed as a function of velocity and the derivative dH/dV can be obtained by finite differences of the velocity. The track angle then follows, and applying the difference method again gives dγ/dV; by the chain rule, the rate of change of the track angle is
\frac{d\gamma}{dt} = \frac{d\gamma}{dV}\frac{dV}{dt} = \frac{d\gamma}{dV}\left( -\frac{D}{m} - g \sin\gamma + \tilde{C}_V \right)
With the rate of change of the track angle known, the magnitude of the bank angle can be expressed as
\sigma = \arccos\left( \frac{ V\left( \dot{\gamma} - \tilde{C}_\gamma \right) - \left( V^2/r - g \right)\cos\gamma }{ L/m } \right)
In longitudinal guidance, the bank angle obtained in this way is applied directly, without further correction.
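A sketch of the longitudinal guidance relations above is shown below; the finite-difference slope dγ/dV is assumed to be supplied by the profile-tracking routine, and the clipping of the arccos argument is purely a numerical safeguard.

```python
import numpy as np

def bank_angle_magnitude(V, r, gamma, dgamma_dV, D_over_m, L_over_m, C_V, C_gam, g):
    """Longitudinal guidance sketch: recover the bank-angle magnitude from the slope
    dgamma/dV implied by the tracked height-velocity profile (chain rule, then
    inversion of the gamma-dynamics)."""
    V_dot = -D_over_m - g * np.sin(gamma) + C_V          # dV/dt from the V-dynamics
    gamma_dot = dgamma_dV * V_dot                        # chain rule: dgamma/dt
    cos_sigma = (V * (gamma_dot - C_gam)
                 - (V**2 / r - g) * np.cos(gamma)) / L_over_m
    return np.arccos(np.clip(cos_sigma, -1.0, 1.0))      # clipping: numerical safeguard only
```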

4.2.2. Lateral Guidance

Because the angle of attack profile is preset, the lateral guidance only needs to determine the sign of the bank angle, and the change process of the remaining state quantity can be obtained by numerical integration of the dynamic equations of longitude, latitude, and heading angle.
In this paper, the sign of the bank angle is obtained with a bank angle reversal control strategy. The basic idea is as follows: a heading angle deviation corridor is specified, and whenever the velocity heading angle of the aircraft exceeds the maximum range allowed by the corridor, the bank angle is immediately reversed, i.e., its sign is flipped, so that the velocity heading angle returns to the prescribed range and the aircraft keeps flying toward the target point. This is expressed by the following formula:
\mathrm{sgn}\,\sigma(t) = \begin{cases} -1, & \Delta\phi \ge \Delta\phi_{up} \\ \mathrm{sgn}\,\sigma(t-1), & \Delta\phi_{down} < \Delta\phi < \Delta\phi_{up} \\ 1, & \Delta\phi \le \Delta\phi_{down} \end{cases}
where Δ ϕ up and Δ ϕ down are the designed corridor boundaries, and
\Delta\phi_{up} = \begin{cases} \phi_1, & V \ge V_m \\ \phi_1 + \dfrac{(V_m - V)(\phi_2 - \phi_1)}{V_m - V_n}, & V_n < V < V_m \\ \phi_2, & V \le V_n \end{cases}, \quad \Delta\phi_{down} = -\Delta\phi_{up}
where V m , V n , ϕ 1 , and ϕ 2 are the velocity heading angle deviation corridor parameters designed according to experience.
When the velocity of the aircraft is high, the aircraft is far from the target point and there is more time to correct errors, so a wider heading angle error corridor is set to reduce the number of bank angle reversals. When approaching the target point, the heading angle error corridor is narrowed to improve terminal accuracy, as shown in Figure 6.
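A sketch of the lateral reversal logic and corridor above is given below; the linear transition of the corridor width between V_m and V_n is our reading of the corridor formula, and the function names are ours.

```python
def corridor_width(V, V_m, V_n, phi1, phi2):
    """Heading-deviation corridor: wide (phi1) at high speed, narrow (phi2) near the
    target, with a linear transition between V_m and V_n (assumed form)."""
    if V >= V_m:
        return phi1
    if V > V_n:
        return phi1 + (phi2 - phi1) * (V_m - V) / (V_m - V_n)
    return phi2

def bank_sign(delta_phi, prev_sign, V, V_m, V_n, phi1, phi2):
    """Reversal logic: flip the bank-angle sign only when the heading deviation
    leaves the corridor, otherwise keep the previous sign."""
    width = corridor_width(V, V_m, V_n, phi1, phi2)
    if delta_phi >= width:
        return -1
    if delta_phi <= -width:
        return 1
    return prev_sign
```

Keeping the previous sign inside the corridor is what limits the total number of reversals penalized by R_σ.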

4.3. Online Trajectory Planning and Guidance

Combined with the set of TD3 strategy networks trained offline, online trajectory generation and guidance can be realized, as shown in Figure 3. From its given initial state, the aircraft maintains a constant bank angle and angle of attack to complete the pre-reentry flight. The TD3 strategy network N_1 is invoked at the end node of the pre-reentry section to generate the drag acceleration profile of the entire glide section online for tracking guidance. The TD3 strategy network N_2 is invoked at the end node of the temperature control section to regenerate the drag acceleration profile of the equilibrium glide section, the constant drag section, and the transition section online. The TD3 strategy network N_3 is invoked at the end node of the equilibrium glide section to regenerate the drag acceleration profile of the constant drag section and the transition section online, thereby completing the flight task. Because the TD3 strategy networks are trained offline and invoked multiple times online, the error accumulated in the previous guidance period is eliminated each time the algorithm is invoked again, and the reentry trajectory is generated and updated rapidly online, which effectively improves landing-point accuracy and computational efficiency. Owing to the black-box nature of the trained neural networks, it is difficult to guarantee the reliability of the algorithm absolutely. In actual execution, to avoid safety risks, the complete trajectory or a stage trajectory can be integrated quickly to assess its reliability. Moreover, the TD3 guidance strategy does not exclude the involvement of other algorithms; for example, a traditional method such as predictor–corrector guidance can take over for a short time.
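The overall online flow described above can be summarized structurally as follows; all helper callables and names are hypothetical placeholders supplied by the caller, and this is a sketch of the sequencing only, not the authors' implementation.

```python
def online_guidance(state, policies, replan, at_node, track, step_dynamics, dt=1.0):
    """Structural sketch of the online planning-and-guidance loop. policies maps
    'N1'/'N2'/'N3' to trained TD3 actors; replan, at_node, track and step_dynamics
    are caller-supplied helpers."""
    X = None
    for policy, stop in ((policies["N1"], "temp_ctrl_end"),
                         (policies["N2"], "eq_glide_end"),
                         (policies["N3"], "terminal")):
        # Replanning at each node discards the tracking error accumulated during
        # the previous guidance period (limited-rate bank reversals, uncertainties).
        X = replan(policy, state, X)            # TD3 policy -> node velocities [V1, V2, V3]
        while not at_node(state, stop):
            sigma_cmd = track(state, X)         # longitudinal magnitude + lateral sign
            state = step_dynamics(state, sigma_cmd, dt)
    return state
```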

5. Simulation and Discussion

1. TD3 strategy network offline training
The training agents are trained according to the above settings. The initial training states are set as shown in Table 2, with random deviations applied to the initial state.
The TD3 training results are compared with DDPG training results to verify that the TD3 online trajectory planning algorithm has strong adaptability and robustness. First, the network parameters of TD3 and DDPG are set. ① TD3: the Actor and Critic networks each have two hidden layers containing 1024 and 512 neurons, respectively; the learning rates of the Actor and Critic networks are both 0.0001; the replay buffer size is 10^6; the sampling batch size is 256; the discount factor is 0.99; the soft update coefficient of the target network is 0.005; and the target-policy Gaussian noise is (0.5, 0.5). ② DDPG: the Actor and Critic networks each have two hidden layers containing 1024 and 512 neurons, respectively; the learning rates of the Actor and Critic networks are both 0.001; the replay buffer size is 10^6; the sampling batch size is 256; the discount factor is 0.99; the soft update coefficient of the target network is 0.005; and the regression coefficient and variance of the OU noise are 1 and 0.1, respectively. Second, the reward parameters of the training process are set. If the process-constraint reward coefficient is too large, landing-point accuracy suffers; if it is too small, the process constraints are not satisfied. For the bank angle reversal reward, if the limit on the number of reversals is too large or too small the reward is ineffective; if the reward coefficient is too large it degrades landing-point accuracy, while too small a value has no effect. Multiple rounds of training and tuning are therefore needed to obtain suitable values, as listed in Table 3.
Figure 7 shows the average reward curves of the training process. As shown in Figure 7 and Table 4, throughout training the average reward of TD3 is considerably better than that of DDPG, indicating that the TD3-based online trajectory generation and guidance algorithm has strong applicability and robustness and can adapt to different initial states and significant model parameter deviations. The convergence speed of TD3 is also generally faster than that of DDPG. The average rewards of the three TD3 strategy networks increase gradually; with the number of training episodes set to 1000, the rewards converge after about 600 episodes, indicating that the basic training of the TD3 strategy networks is complete.
As shown in Table 4, for both TD3 and DDPG the converged average rewards of the three strategy networks satisfy N_3 > N_2 > N_1, which shows that invoking the corresponding strategy network during reentry improves the landing accuracy of the aircraft.
2. Online trajectory planning and guidance
The initial state of the aircraft is chosen randomly, and aerodynamic uncertainty deviations are applied. The specific guidance tasks and aerodynamic uncertainties are listed in Table 5 and Table 6. Trajectory generation and guidance are carried out online for each of the three tasks; the simulation results are shown in Figure 8, Figure 9 and Figure 10.
Figure 8a,b, Figure 9a,b and Figure 10a,b show that in every task the TD3-based online trajectory generation and guidance algorithm completes the flight; the landing accuracy obtained by invoking the TD3 policy networks multiple times is better than that obtained by invoking a single policy network only at the end of the pre-reentry section, and both are better than predictor–corrector guidance. Figure 8c, Figure 9c and Figure 10c show the drag acceleration profiles of the three tasks, demonstrating that the algorithm adapts to the requirements of all three tasks. Figure 8d–f, Figure 9d–f and Figure 10d–f show that the four-segment piecewise polynomial drag acceleration profile prevents jumps in the trajectory angle, so the aircraft can track the nominal trajectory during guidance and reach the terminal altitude and velocity.
Table 7 shows the simulation times of the three methods. The single-command generation efficiency of the TD3 guidance method is improved by about two orders of magnitude, and the average generation time of a single command is less than 0.01 s, which enables real-time online guidance.
In summary, the most significant advantages of TD3-based reentry trajectory online planning and guidance compared to traditional guidance methods are as follows:
  • Improved landing-point accuracy: although the traditional guidance method can also reach the target point, the TD3 guidance method achieves higher landing-point accuracy, strong tolerance to random errors, and good flight adaptability, improving the accuracy of the preset task by 84%.
  • Improved computational efficiency: compared with the traditional guidance method, the single-command generation efficiency of the TD3 guidance method is improved by about two orders of magnitude, enabling online guidance.
  • Simplified longitudinal guidance strategy: because the profile in traditional drag acceleration–velocity profile guidance is not smooth, an offline longitudinal guidance law (such as PID guidance) is needed to correct the bank angle. In TD3 guidance, the angle of attack profile and the reference profile are smooth, so the computed bank angle can be used directly for longitudinal guidance, ensuring that the aircraft tracks the nominal profile and simplifying the guidance strategy.

6. Conclusions

Aiming at the limited autonomy and poor real-time performance of reentry trajectory planning for the RLV, an online reentry trajectory planning and guidance method based on TD3 reinforcement learning has been proposed. The reference profile is a four-segment piecewise polynomial drag acceleration–velocity profile designed within the reentry corridor to guarantee smooth connection of the turning points. The TD3 strategy networks are trained offline to solve the profile parameters that meet the flight-range requirement. According to the current state and the flight range of each segment of the drag acceleration profile, the TD3 strategy networks are invoked multiple times online to plan the trajectory and update the profile periodically. Offline training and repeated online invocation of the TD3 strategy networks eliminate the error accumulated in the previous guidance cycle each time the algorithm is invoked again, and the reentry trajectory is generated and updated rapidly online. This effectively improves landing-point accuracy and command generation efficiency and provides an effective approach to real-time online trajectory planning and autonomous guidance for aerospace vehicles in complex environments.

Author Contributions

H.W. wrote the main manuscript text and prepared the figures and tables in this manuscript. S.A. investigated the current research status of the paper. J.L. jointly supervised this work and modified the whole manuscript. G.W. provided a feasible way to solve the problem. K.L. supervised and guided the completion of the paper, and jointly supervised this work and modified the whole manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Fan, W.; Wang, L.; Long, X.D. Review of Foreign Reusable Launch Vehicle Development in 2022. China Aerosp. 2023, 25–30.
  2. Zhang, C.; Liu, D.Q.; Wang, J.W. Review of Foreign Hypersonic Technology Development in 2020. Aerosp. Missiles 2021, 12–16.
  3. Mu, L.X.; Wang, X.M.; Xie, R. Review of Hypersonic Vehicles and Their Guidance and Control Technologies. J. Harbin Inst. Technol. 2019, 51, 1–14.
  4. Pan, L.; Peng, S.; Xie, Y. 3D Guidance for Hypersonic Reentry Gliders Based on Analytical Prediction. Acta Astronaut. 2020, 167, 42–51.
  5. Liu, Z.; He, Y. Study of Reentry Guidance Based on Analytical Predictor-Corrector for Aerospace Vehicle. In Proceedings of the 36th Chinese Control Conference, Dalian, China, 26–28 July 2017; pp. 5827–5832.
  6. Yu, B. Research on Guidance Technology for Initial Reentry Phase of Reusable Launch Vehicle. Master's Thesis, Nanjing University of Aeronautics and Astronautics, Nanjing, China, 2015.
  7. Yu, L. Research on Guidance Technology for Reentry Phase of Reusable Launch Vehicle. Master's Thesis, Nanjing University of Aeronautics and Astronautics, Nanjing, China, 2018.
  8. Wang, P.; Yan, X.; Nan, W.; Li, X. A Rapid and Near Analytic Planning Method for Gliding Trajectory under Time Constraints. Acta Armamentarii 2024, 45, 2294–2305.
  9. Wang, T.; Zhang, H.B.; Zhu, R.Y. Reentry Predictor-Corrector Guidance Algorithm Considering Drag Acceleration. J. Astronaut. 2017, 38, 143–151.
  10. Huang, H.B.; Liang, L.Y.; Yang, Y. Reentry Trajectory Planning and Guidance Method Based on Inverse Drag Acceleration Profile. Acta Aeronaut. 2018, 39, 338–348.
  11. Yan, X.L.; Wang, K.; Zhang, Z.J. Reentry Guidance Method Based on LSTM-DDPG. Syst. Eng. Electron. 2025, 47, 268–279.
  12. Jiang, Q.; Wang, X.; Li, Y. Intelligent Reentry Guidance with Dynamic No-fly Zones Based On Deep Reinforcement Learning. In Proceedings of the 29th International Conference on Computational and Experimental Engineering and Science, Shenzhen, China, 3–6 August 2024; pp. 291–313.
  13. Huang, X.Y.; Dai, J.; Liu, G. Predictor-Corrector Guidance Algorithm for Launch Vehicle Based on DDPG. Aerosp. Technol. 2024, 89–102.
  14. Das, P.P.; Pei, W.; Niu, C. Reentry Trajectory Design of A Hypersonic Vehicle Based On Reinforcement Learning. In Proceedings of the 2023 Asian Aerospace and Astronautics Conference, Wuhan, China, 15–17 September 2023; p. 012005.
  15. Hui, J.P.; Wang, R.; Guo, J.F. Intelligent Guidance Technology for No-Fly Zone Avoidance Based on Reinforcement Learning. Acta Aeronaut. 2023, 44, 240–252.
  16. Hui, J.P.; Wang, R.; Yu, Q.D. Online Generation Technology of 'New Quality' Corridor for Reentry Vehicle Based on Reinforcement Learning. Acta Aeronaut. 2022, 43, 623–635.
  17. Wang, Z.H.; Li, T.; Yan, H.D.; Hu, Y.; Gao, C. Improved Drag Acceleration Profile Design for Hypersonic Vehicle. Mod. Def. Technol. 2024, 52, 115–123.
  18. Nie, Z.T. Reentry Trajectory Planning and Tracking Control Based on Altitude-Velocity Profile. Master's Thesis, Dalian University of Technology, Dalian, China, 2020.
  19. Barto, A.G.; Sutton, R.S. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 1998.
  20. Shi, L.A. Research on Deep Reinforcement Learning Methods and Applications. Master's Thesis, Xidian University, Xi'an, China, 2023.
Figure 1. Angle of attack–velocity profile.
Figure 2. Drag acceleration–velocity corridor: (a) the reentry corridor boundary corresponding to each constraint; (b) reentry corridor composed of soft and hard constraints.
Figure 3. Algorithm structure diagram.
Figure 4. Reinforcement learning diagram.
Figure 5. Framework of designing TD3 strategy networks.
Figure 6. Heading angle deviation corridor.
Figure 7. Reward curves of training: (a) the reward convergences of TD3 and DDPG strategy network N_1; (b) the reward convergences of TD3 and DDPG strategy network N_2; (c) the reward convergences of TD3 and DDPG strategy network N_3.
Figure 8. Simulation of task 1: (a) trajectory map of ground; (b) position error of impact point; (c) the variation in drag acceleration with velocity; (d) the variation in height with time; (e) the variation in the bank angle with time; (f) the variation in trajectory angle with time.
Figure 9. Simulation of task 2: (a) trajectory map of ground; (b) position error of impact point; (c) the variation in drag acceleration with velocity; (d) the variation in height with time; (e) the variation in the bank angle with time; (f) the variation in trajectory angle with time.
Figure 10. Simulation of task 3: (a) trajectory map of ground; (b) position error of impact point; (c) the variation in drag acceleration with velocity; (d) the variation in height with time; (e) the variation in the bank angle with time; (f) the variation in trajectory angle with time.
Table 1. Normalized coefficient.
Variable | Unit | Coefficient
h | m | R_e
V | m·s−1 | √(g_0 R_e)
L_togo | m | R_e
t | s | 10
Table 2. Initial state of training.
Parameter | Unit | Deviation Range
H | m | ±2000
V | m·s−1 | ±10
γ | ° | ±0.5
λ | ° | ±1
l | ° | ±1
ϕ | ° | ±3
Table 3. Training process reward parameters.
Parameter | Value
r_1 | 24
r_2 | 12.67
L_1 | 5
L_2 | 3
L_goal1 | 5
L_goal2 | 13
a_c | 20
a_σ | 10
T(σ)_max | 5
Table 4. Average reward convergence of strategy networks.
Network | DDPG | TD3
N_1 | 23 | 71
N_2 | 61 | 77
N_3 | 66 | 85
Table 5. Guidance tasks.
Parameter | Unit | Task 1 | Task 2 | Task 3
H | km | 119.947 | 119.878 | 118.609
V | m·s−1 | 7837.578 | 7834.569 | 7829.437
γ | ° | 0.678 | 0.795 | 0.834
λ | ° | −5 | −5.972 | −4.068
l | ° | −4.068 | −5.316 | −5.891
ϕ | ° | 53.453 | 54.754 | 53.167
H_f | km | 18
V_f | m·s−1 | 760
λ_f | ° | 37
l_f | ° | 92
Table 6. Aerodynamic uncertainty of aircraft.
Section | C_L | C_D
Equilibrium glide section | +5% | +5%
Table 7. Simulation time.
Method | Unit | Average Total Time | Average Single Instruction Generation Time
Predictor–corrector | s | 811.9202 | 0.5635
TD3 (strategy network N_1) | s | 3.5115 | 0.0039
TD3 (strategy networks N_1,2,3) | s | 8.2866 | 0.0037