1. Introduction
The traditional proportional navigation guidance (PNG) law is simple in form, easy to implement, and has been widely used in engineering [1,2]. However, with the development of modern warfare, targets have become increasingly defensive and maneuverable, requiring missiles not only to strike targets precisely but also to satisfy a specified terminal impact angle [3,4]. This is crucial in scenarios such as intercepting missiles at minimal impact angles for direct collision, or striking vulnerable sections of targets such as ground tanks and aircraft wings at optimal angles.
Current research on guidance laws with impact angle constraints focuses mainly on biased proportional navigation, optimal control theory, and sliding-mode variable structure control [5,6,7], among other methods. The basic idea of biased proportional navigation [8,9,10] is to introduce an additional bias term so that the resulting guidance law satisfies both the miss-distance and impact angle constraints. Guidance laws designed in this way are simple and easy to implement, but they mainly address stationary or slowly moving targets and lack efficacy against fast, highly maneuverable targets [11]. The optimal control method is the most widely studied approach; the idea is to obtain a guidance law with a terminal impact angle constraint by minimizing a performance index. In [12], a guidance law with an impact angle constraint that can hit maneuvering targets with constant acceleration is proposed on the basis of linear quadratic optimization theory. Reference [13] proposes a modeling method that takes the desired line of sight as the control objective and designs an optimal trajectory-shaping guidance law by constraining the terminal relative motion direction and the trajectory overload. These optimal control methods achieve satisfactory control accuracy, but they all require an estimate of the time-to-go, which can be challenging to ascertain in real-world scenarios.
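For reference, the simplest and most widely used estimate assumes a near-collision-course engagement and approximates the time-to-go as range over closing speed,
\[
t_{\mathrm{go}} \approx \frac{R}{V_c} = -\frac{R}{\dot{R}},
\]
where R is the missile-target range and V_c is the closing velocity; the approximation degrades for large heading errors or strongly maneuvering targets, which is one reason accurate time-to-go estimation is difficult in practice.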
Sliding-mode variable structure control offers high precision and strong robustness, so it is widely used in the design of constrained guidance laws [14,15]. In [16,17,18], the authors use a terminal sliding-mode (TSM) control method to design the guidance law, which guarantees that the impact angle converges to the expected value and the line-of-sight (LOS) angle rate converges to zero in finite time, and they apply finite-time control theory to compute the specific convergence time. However, existing TSM and fast terminal sliding-mode (FTSM) controllers share a common shortcoming: a singularity easily arises on the terminal sliding-mode surface as the error approaches zero. The non-singular terminal sliding mode (NTSM) was proposed to solve this problem; it achieves finite-time convergence while avoiding the singularity, as verified in [19,20,21,22].
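To make the singularity issue concrete, consider the standard formulations for a tracking error e (our generic notation, shown only for illustration). A conventional TSM surface is
\[
s_{\mathrm{TSM}} = \dot{e} + \beta\,|e|^{q/p}\operatorname{sign}(e), \qquad 0 < q/p < 1,
\]
and differentiating it produces a term proportional to |e|^{q/p-1} \dot{e}, which grows unbounded as e approaches zero while \dot{e} is nonzero; this is the singularity in the equivalent control. The non-singular surface
\[
s_{\mathrm{NTSM}} = e + \frac{1}{\beta}\,|\dot{e}|^{p/q}\operatorname{sign}(\dot{e}), \qquad 1 < p/q < 2,
\]
places the fractional power on \dot{e} instead, so the resulting control law contains only non-negative powers of the error and remains bounded.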
It should be pointed out that in [20,22], a special function replaces the sign function in order to eliminate chattering, which introduces steady-state errors, and the guidance laws above all address non-maneuvering targets. For maneuvering targets, the authors of [23] achieve impact angle convergence by selecting the missile's lateral acceleration to enforce the terminal sliding mode on a switching surface designed from nonlinear engagement dynamics, but they do not account for the singularity that the proposed design may exhibit in implementation. A guidance law based on NTSM is proposed in [19], but it requires an estimate of the upper bound of the target maneuver and suffers from chattering. To suppress high-frequency chattering, the boundary-layer method, the reaching-law method, and observer-based methods are all effective solutions. The authors in [24] used a continuous saturation function to approximate the sign function, which reduced chattering, but the robustness of the system decreased as the boundary layer widened. To address this, reference [25] designed a continuous second-order sliding-mode guidance law based on a high-order sliding-mode algorithm: the super-twisting algorithm was adopted as the reaching law of the sliding-mode control to eliminate the discontinuous term in the guidance law and thus attenuate chattering. However, the traditional super-twisting algorithm converges slowly when the system state is far from the equilibrium point and cannot make full use of the missile's overload capacity [26].
Nowadays, owing to their powerful nonlinear approximation and data representation capabilities, data-driven reinforcement learning (RL) methods have attracted considerable attention in guidance law design, and model-free RL has been applied in several algorithms for finding optimal solutions, making complex decisions, and building self-learning systems that improve their behavior from previous experience [27]. Traditional reinforcement learning methods are suited to discrete environments with small action and sample spaces. However, real-world tasks such as guidance and control involve large state spaces and continuous action domains with complex, nonlinear dynamics, which conventional methods struggle to handle. Deep reinforcement learning (DRL) addresses these limitations by integrating deep neural networks with reinforcement learning, enabling a more robust approach to continuous, nonlinear systems. In recent years, many authors [28,29,30,31,32] have used Q-learning, the deep Q-learning network (DQN), the deep deterministic policy gradient (DDPG), the twin-delayed deep deterministic policy gradient (TD3), and other RL algorithms to train neural network models. Quantities such as the missile's LOS rate and position vector are used as state variables of a Markov decision process and fed to the network as inputs, and the trained neural network then maps them directly to the guidance command, which provides a new approach to guidance law design under multi-constraint conditions. Gaudet et al. [33] improved the Proximal Policy Optimization (PPO) algorithm and proposed a three-dimensional guidance law design framework based on DRL for interceptors trained with PPO; a comparative analysis showed that the resulting guidance law outperformed the extended Zero-Effort-Miss (ZEM) policy in both performance and efficiency. Similarly, Gaudet et al. [34] developed a new adaptive guidance system using reinforcement meta-learning with a recurrent policy and value function approximator to land an agent safely on an asteroid with unknown environmental dynamics. This is also a good approach to continuous control, but PPO explores with a stochastic policy, and estimating an accurate gradient in this way requires a large number of random actions; such random exploration therefore slows the algorithm's convergence. In [35], an adaptive neuro-fuzzy inference sliding-mode control (ANFSMC) guidance law with an impact angle constraint is proposed by combining sliding-mode control (SMC) with an adaptive neuro-fuzzy inference system (ANFIS); the ANFIS adaptively updates the additional control command and reduces the high-frequency chattering of the SMC, which enhances robustness, and numerical simulations verify the effectiveness of the proposed guidance law. In [36], the author proposes a three-dimensional (3D) intelligent impact-time control guidance (ITCG) law based on the nonlinear relative motion relationship with the field of view (FOV) strictly constrained. By feeding back the FOV error, a modified time bias term that includes a guidance gain to accelerate the convergence rate is incorporated into 3D PNG; the guidance gain is then obtained by DDPG within the RL framework, which allows the proposed guidance law to achieve a smaller time error with less energy loss. However, the authors of [37] pointed out that DDPG suffers from overestimation of the Q value and may therefore not yield the optimal solution. Moreover, DDPG uses a replay buffer to remove the correlations in the input experience (i.e., the experience replay mechanism) [38,39] and uses target networks to stabilize training. As an essential part of DDPG, experience replay significantly affects the performance and speed of learning through the choice of experience used to train the neural network.
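As an illustration of how prioritized sampling differs from the uniform replay used in standard DDPG/TD3, the following Python sketch implements proportional prioritized experience replay in the style of [41]; the buffer capacity, hyperparameter names, and the simple list-based storage are illustrative assumptions rather than the implementation used in this paper.

```python
import numpy as np

class PrioritizedReplayBuffer:
    """Proportional prioritized experience replay (illustrative sketch)."""

    def __init__(self, capacity=100000, alpha=0.6, beta=0.4, eps=1e-6):
        self.capacity = capacity
        self.alpha = alpha      # how strongly priorities shape the sampling distribution
        self.beta = beta        # importance-sampling correction exponent
        self.eps = eps          # keeps every priority strictly positive
        self.data, self.priorities = [], []

    def add(self, transition):
        # New transitions get the current maximum priority so they are seen at least once.
        p_max = max(self.priorities, default=1.0)
        if len(self.data) >= self.capacity:
            self.data.pop(0)
            self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append(p_max)

    def sample(self, batch_size):
        probs = np.array(self.priorities) ** self.alpha
        probs /= probs.sum()
        idx = np.random.choice(len(self.data), batch_size, p=probs)
        # Importance-sampling weights compensate for the non-uniform sampling.
        weights = (len(self.data) * probs[idx]) ** (-self.beta)
        weights /= weights.max()
        return [self.data[i] for i in idx], idx, weights

    def update_priorities(self, idx, td_errors):
        # Larger TD errors lead to higher priority on the next draw.
        for i, err in zip(idx, td_errors):
            self.priorities[i] = abs(err) + self.eps
```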
To address the above problems, this paper designs a novel adaptive impact angle-constrained guidance law based on DRL and the super-twisting algorithm. A tunable parameter is introduced to replace the fixed fractional power of one half in the conventional super-twisting algorithm, and adaptively adjusting this parameter improves the guidance performance and the convergence rate of the system state. Drawing on the insights from [37,40], we apply the prioritized experience replay mechanism [41] to enhance the TD3 algorithm and then obtain the adaptively optimal parameters using the improved DRL algorithm. Compared with the sliding-mode guidance law based on the traditional super-twisting algorithm, our algorithm reduces chattering more effectively, achieves better guidance accuracy, and enforces the terminal impact angle constraint more tightly. Moreover, the trained neural network exhibits strong generalization in unfamiliar scenarios. Finally, we design a GSTESO; unlike the modified ESO in [42], which uses only nonlinear terms, the proposed GSTESO provides a faster convergence rate when the estimation error is far from the equilibrium and does not depend on an upper bound of the uncertainty. Furthermore, in contrast to the modified ESOs with both linear and nonlinear terms discussed in [43], the proposed GSTESO introduces no additional parameters that require tuning.
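For context, the conventional super-twisting reaching law for a sliding variable s can be written (in our notation, which may differ from the symbols used later in the paper) as
\[
\dot{s} = -k_1 |s|^{1/2}\operatorname{sign}(s) + w, \qquad \dot{w} = -k_2 \operatorname{sign}(s),
\]
with gains k_1, k_2 > 0; the fixed exponent 1/2 is the fractional power referred to above. The modification studied in this paper replaces this fixed exponent with a tunable parameter whose value is adjusted online by the PER-enhanced TD3 agent, with the exact parameterization and gain conditions given in the guidance law derivation.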
4. Simulation and Analysis
In this section, we considered two target maneuver modes, a sinusoidal maneuver and a constant maneuver, and evaluated the guidance laws in terms of flight trajectory, normal acceleration, and LOS angle rate. Finally, the performance of the three guidance laws was analyzed in terms of miss distance.
First, with the target performing a sinusoidal maneuver, the PERTD3 and TD3 algorithms were both trained 10,000 times using the parameters in Table 1 and Table 2. Both the actor network and the critic network were implemented as three-layer fully connected neural networks, and Critic1 and Critic2 used the same network structure, shown in Figure 3. Except for the output layer of the critic, the neurons in all layers were activated by the ReLU function, i.e., ReLU(x) = max(0, x). The output layer of the critic network was activated by the tanh function, defined as tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x)).
The policy and value functions were updated periodically during optimization, after trajectory rollouts had accumulated to the replay buffer size.
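A minimal PyTorch sketch of the network structure described above is given below; the layer widths, the state and action dimensions, and the instantiation are illustrative assumptions, not the values listed in Table 1 and Table 2.

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, HIDDEN = 6, 1, 128   # illustrative sizes, not the paper's

class Actor(nn.Module):
    """Three fully connected layers; all layers use ReLU, per the text."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, ACTION_DIM), nn.ReLU(),
        )
    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """Q(s, a); hidden layers use ReLU, the output layer uses tanh, per the text."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, 1), nn.Tanh(),
        )
    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

# Critic1 and Critic2 (the twin critics in TD3/PERTD3) share this same structure.
critic1, critic2, actor = Critic(), Critic(), Actor()
```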
As shown in Figure 4, our improved deep reinforcement learning algorithm outperformed the two other existing deep reinforcement learning algorithms in the dynamic environment we implemented. The guidance algorithms based on PERTD3 and TD3 both essentially converged at around 800 steps, but the TD3 algorithm with prioritized experience replay achieved higher scores and greater stability, while the DDPG algorithm had not converged by the end of training.
To verify the robustness of our algorithm, we considered the following two maneuvers together with the required impact angle constraints; case 1 was not used during training and was used to test the generalization of our reinforcement learning algorithm.
Case 1: The target performed a constant maneuver , i.e., the impact angle constraint was for every .
Case 2: The target performed a constant maneuver , i.e., the impact angle constraint was for every .
The simulation results for case 1 are shown in Figure 5, and those for case 2 in Figure 6. From Figure 5a,c,d, Figure 6a,c,d, Table 3, and Table 4, it can be seen that the missile accurately hit the target with the desired terminal LOS angle in finite time: the LOS angle rate converged to 0°/s and the LOS angle converged to the expected value for each desired LOS angle, which verifies the strong control authority of the guidance law over both the miss distance and the impact angle. Figure 5b and Figure 6b show that the missile acceleration command was smooth. Since the control law is continuous with respect to the sliding variable, the sliding variable converged to zero continuously and smoothly in finite time, as seen in Figure 5e and Figure 6e, and no chattering appeared in the control channel. Figure 5f and Figure 6f show that the disturbance observer achieved an accurate estimate of the system uncertainty. As Figure 6 shows overall, when facing untrained scenarios, our improved reinforcement learning algorithm could still hit the target and the LOS angle still converged, which demonstrates the generalization ability and practical significance of the algorithm.
In Figure 7, Figure 7a shows the parameter variation for case 1 and Figure 7b shows that for case 2. It can be seen that the parameter finally converged to a value between two and three and that the variation was smooth, which is of practical engineering significance.
To demonstrate the advantages of our guidance law, we compared it with the ST-NTSM and NTSM guidance laws under the requirement that the miss distance be less than 0.1 m; we refer to our algorithm as the RLST-NTSM guidance law. The parameters of Equation (20) were borrowed from the literature [58], and the remaining values were chosen after the comparison. The corresponding parameter in the ST-NTSM guidance law was the same as before, and the traditional super-twisting non-singular terminal sliding-mode guidance law (ST-NTSM [51]) was formulated with the disturbance term taken as the output of the finite-time convergence disturbance observer (FTDO). According to reference [58], the specific form of the FTDO [60] introduced in that work has its parameters chosen as in [58]. The parameter L is used to adjust the transient performance of the estimation process and can be set to 200, which makes the observer output converge to the true disturbance in finite time.
In addition, another extended state observer based on the traditional super-twisting extended state observer (STESO) [42] was also added for comparison; it was formulated with its gains chosen the same as those in the GSTESO.
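For readers unfamiliar with this class of observer, a textbook super-twisting ESO for a disturbed second-order channel takes a form such as
\[
\dot{\hat{z}}_1 = \hat{z}_2 + \lambda_1 |z_1 - \hat{z}_1|^{1/2}\operatorname{sign}(z_1 - \hat{z}_1) + b\,u, \qquad
\dot{\hat{z}}_2 = \lambda_2 \operatorname{sign}(z_1 - \hat{z}_1),
\]
where z_1 is the measured state, \hat{z}_2 estimates the lumped disturbance, u is the control input with gain b, and \lambda_1, \lambda_2 > 0 are observer gains. This is our generic notation for illustration only; the exact structures of the STESO in [42] and of the proposed GSTESO are given in the observer design section.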
The NTSM guidance law took a form in which the parameter values were the same as before and the switching gain K was set to 400 according to reference [58].
As shown in Table 5 and Figure 8, we simulated ST-NTSM with different parameter values for case 2. It can be seen from Table 5 that, no matter which parameter value was selected, the accuracy of ST-NTSM was not as high as that of our algorithm; only the values 2.3 and 2.6 satisfied the requirement that the miss distance be less than 0.1 m. Figure 8 also shows that ST-NTSM still exhibited some chattering, and as the parameter increased, the convergence rate of the guidance law slowed, whereas our guidance law converged faster and exhibited no chattering.
The simulation results for case 2 with the selected parameters are shown in Figure 9 and Table 6. The RLST-NTSM proposed in this paper outperformed the other two guidance laws in terms of accuracy and convergence of the final LOS angle error. Figure 9b shows that the missile guided by NTSM did not hit the target, while the performance of ST-NTSM was comparatively good. However, according to Figure 9a,d, a certain amount of chattering remained, and the proposed algorithm further reduced the system chattering. It can also be seen from Figure 9c that RLST-NTSM reached the equilibrium state of the system earlier than ST-NTSM. Figure 9d,e show that the sliding-mode variable and LOS angle rate of ST-NTSM and NTSM did not converge well, which would greatly limit their practical engineering application. Finally, it should be noted that although the three observers could be directly added to any of the guidance laws, Figure 9f–h show that the GSTESO performed better than the other two observers, as its initial uncertainty estimation error was the smallest.
As overload saturation emerged in the simulation, we decided to account for the autopilot constraints in a simplified manner; combined with Equation (14), we studied an augmented system in which the modeling error and the damping ratio and natural frequency of the autopilot appear as parameters. This paper employed an autopilot approach similar to the one described in reference [61].
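As a point of reference, a commonly used simplified autopilot model with damping ratio \xi and natural frequency \omega_n is the second-order lag
\[
\ddot{a}_M = -2\xi\omega_n\,\dot{a}_M - \omega_n^2\,a_M + \omega_n^2\,a_c,
\]
where a_c is the commanded acceleration produced by the guidance law and a_M is the achieved missile acceleration. This is our generic form; the exact augmented dynamics and modeling-error term follow Equation (14) and reference [61].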
We selected the virtual control signal as in Equation (29), and its first and second derivatives were estimated with a second-order command filter characterized by its damping ratio and natural frequency; the filter states provide the estimate of the virtual control and the corresponding derivative, and the remaining filter constants are positive. The system was then observed using an ESO defined in terms of the estimation errors, positive observer gains to be designed, and a small positive constant.
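A standard second-order command filter of the kind referred to above, as used in dynamic surface and backstepping designs, has the form
\[
\dot{x}_1 = x_2, \qquad \dot{x}_2 = -2\zeta_f\omega_f\,x_2 - \omega_f^2\,(x_1 - v),
\]
where v is the virtual control fed into the filter, x_1 tracks v, x_2 provides an estimate of its first derivative, and a second-derivative estimate can be read from the right-hand side of the \dot{x}_2 equation. The notation \zeta_f, \omega_f is ours; the specific filter and ESO used in this paper are the ones referenced above.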
The stability of the autopilot component can be established by consulting reference [61], so a detailed analysis was not conducted here. The autopilot parameters, the design parameters for implementing guidance law (61), and the parameters of the ESOs were set accordingly for the simulations.
As illustrated in Figure 10 and Table 7, this paper focused on simulating the scenario that produced the maximum normal overload in case 2; the results showed that considering the autopilot lag characteristics did not degrade the performance of the proposed guidance law.