4.1. Observation Space and Action Space
For the guidance and control problem, the observation vector must fully describe the relative state of the vehicle and the target. Therefore, the observation vector was designed as follows:
where the observation comprises the LOS unit vector, the estimated value of the time-to-go, and the estimated arrival time error. Moreover, observation noise was considered; that is, the observation was perturbed by n-dimensional independent Gaussian distributed random variables with mean 1.0 and a prescribed standard deviation, where n is the dimension of the observation.
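As a minimal illustration of this noise model (the multiplicative form and the helper name below are assumptions consistent with the mean-1.0 description above), the noisy observation can be generated as follows:

```python
import numpy as np

def add_observation_noise(obs, sigma, rng=np.random.default_rng()):
    # Each component is scaled by an independent Gaussian sample with mean 1.0
    # and standard deviation sigma; n is simply the length of the observation.
    noise = rng.normal(loc=1.0, scale=sigma, size=np.shape(obs))
    return np.asarray(obs, dtype=float) * noise
```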
The majority of existing time-to-go estimation methods rely on constant-velocity and PNG assumptions. However, the velocity is not constant during the terminal flight phase, and the guidance method employed does not necessarily adhere to the principles of PNG. Therefore, existing time-to-go estimation methods cannot be applied directly to vehicles with time-varying velocity. For this reason, we propose a straightforward but effective iterative approach to estimating the time-to-go. First, the initial guess for the remaining flight time can be expressed as:
Here, Assumption 3 was utilized. Based on the further assumption that the derivative of the vehicle's velocity with respect to time is constant, the predicted value of the terminal velocity can be calculated as follows:
It is important to note that, since the deviations in aerodynamic parameters and atmospheric density cannot be determined in advance, it is not feasible to calculate the rate of change of the velocity using Equation (7d) in practice. However, in real-world applications, it can be directly obtained from the output of an accelerometer.
Then, the average velocity over the entire flight process can be obtained using the following equation:
Therefore, the current velocity in Equation (23) can be replaced with the average velocity, thereby yielding a new estimate of the time-to-go, expressed as
In general, the new estimate is not exactly equal to the previous one. However, after a few iterations of the above process, the estimate of the time-to-go converges quickly.
It is worth noting that in the above analysis, we implicitly used two assumptions: the rate of change of the velocity is constant, and the leading angle is also constant. These assumptions may introduce errors into the final estimate. However, the estimation error of the time-to-go gradually converges to zero as the distance between the UAV and the target decreases. The time-to-go estimation approach is summarized in Algorithm 1.
Algorithm 1 Improved time-to-go estimation algorithm
Input: Error tolerance, maximum number of iterations, distance, leading angle, velocity, and the rate of change of velocity
Output: Estimated value of the time-to-go
1: Compute the initial guess of the time-to-go
2: while the maximum number of iterations has not been reached do
3:  Let the previous estimate equal the current time-to-go estimate
4:  Estimate the terminal velocity of the vehicle
5:  Obtain the mean velocity
6:  Obtain the new estimated value of the time-to-go
7:  Update the iteration counter
8:  if the change between successive estimates is within the error tolerance then
9:   Break
10:  end if
11: end while
12: return the estimated value of the time-to-go
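A compact Python sketch of Algorithm 1 is given below. The exact form of the initial guess (range divided by the velocity component along the LOS) is an assumption reconstructed from the algorithm's inputs, and all variable names are illustrative.

```python
import numpy as np

def estimate_time_to_go(r, lead_angle, v, v_dot, tol=1e-3, max_iter=10):
    t_go = r / (v * np.cos(lead_angle))          # initial guess (constant-velocity assumption)
    for _ in range(max_iter):
        t_prev = t_go
        v_f = v + v_dot * t_prev                 # predicted terminal velocity (constant v_dot)
        v_avg = 0.5 * (v + v_f)                  # mean velocity over the remaining flight
        t_go = r / (v_avg * np.cos(lead_angle))  # refined time-to-go estimate
        if abs(t_go - t_prev) < tol:             # converged: successive estimates agree
            break
    return t_go
```

In practice, v_dot would be taken from the accelerometer output, as noted above.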
The proposed time-to-go estimation algorithm has the following advantages: First, it does not rely on any specific guidance law, making it applicable to a broader range of scenarios. In contrast, traditional analytical prediction formulas [8,9,10] can only predict the remaining flight time for trajectories derived from PNG. Second, compared with numerical integration methods, the proposed method benefits from a simple iterative process, offering superior real-time performance.
To avoid excessive overload of the vehicle during flight, the action space was formed as an acceleration command whose components are the required accelerations of the vehicle in the horizontal and vertical planes, respectively.
There are two principal reasons for adopting the acceleration command as the action. First, the overload of the vehicle can easily be made to satisfy the constraints by applying a clip operation to the output of the policy. Second, although taking the commanded angular rates as the action would be an end-to-end solution and make the problem formulation more concise, it may prove challenging in terms of achieving convergence of the agent's policy.
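As an example of the clip operation mentioned above (the bound a_max and the function name are placeholders), the overload constraint can be enforced directly on the policy output:

```python
import numpy as np

def clip_accel_command(policy_output, a_max):
    # Hard-limit the commanded horizontal/vertical accelerations so the
    # resulting overload never exceeds the assumed bound a_max.
    return np.clip(np.asarray(policy_output, dtype=float), -a_max, a_max)
```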
Although a method for calculating the angle of attack and bank angle commands was proposed in ref. [35], that method implies that when the sign of the required lift changes, the sign of the bank angle must also change accordingly. When such a sign change occurs at large bank angle magnitudes, undesired transient processes arise. To circumvent this problem, we adopted the unlimited bank angle scheme employed in [12] (as illustrated in Figure 4), which yields a simple and effective computational approach.
Based on the action of the agent, the commanded angle of attack can be obtained by solving the following equation, where the solution is constrained between the minimum and maximum values of the vehicle's angle of attack and the required lift force is given by the following expressions:
Equation (28) is essentially a single-variable root-finding problem. In this paper, the Newton–Raphson method was employed to solve Equation (28). However, alternative methods such as the secant method or Brent's method can also achieve rapid solutions for Equation (28).
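The sketch below illustrates a Newton–Raphson iteration for this kind of single-variable root-finding problem. The lift-coefficient model, reference quantities, and finite-difference derivative are placeholder assumptions, not the vehicle model used in the paper.

```python
import numpy as np

def solve_alpha_cmd(lift_required, q_dyn, s_ref, cl_model,
                    alpha_min, alpha_max, tol=1e-6, max_iter=20):
    # Find alpha such that q_dyn * s_ref * cl_model(alpha) = lift_required,
    # keeping the iterate inside the angle-of-attack limits.
    alpha = 0.5 * (alpha_min + alpha_max)        # initial guess at mid-range
    eps = 1e-5                                   # step for the finite-difference derivative
    for _ in range(max_iter):
        f = q_dyn * s_ref * cl_model(alpha) - lift_required
        df = q_dyn * s_ref * (cl_model(alpha + eps) - cl_model(alpha - eps)) / (2 * eps)
        if abs(df) < 1e-12:
            break
        alpha_new = np.clip(alpha - f / df, alpha_min, alpha_max)
        if abs(alpha_new - alpha) < tol:
            alpha = alpha_new
            break
        alpha = alpha_new
    return alpha
```

For testing the solver, a simple thin-airfoil style model such as `cl_model = lambda a: 2.0 * np.pi * a` could stand in for tabulated aerodynamic data.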
Then, the commanded bank angle can be obtained from the following equation under the unlimited bank angle control scheme:
Finally, the commanded angular rates can be obtained from the following equations, expressed in terms of the guidance period:
The adoption of an unlimited bank angle control scheme is motivated by two primary reasons. First, to satisfy the requirement for dive attacks, the high-speed UAV must generate a negative lift force. If the bank angle is restricted to a limited range, negative lift can only be achieved through negative angles of attack. However, for non-axisymmetric high-speed UAVs, the aerodynamic characteristics at negative and positive angles of attack often differ significantly, which increases control complexity. Second, when the high-speed UAV approaches the target, oscillations in the guidance commands inevitably occur. When the sign of the required lateral acceleration changes, the unlimited bank angle control scheme allows the bank angle to adjust by only a small amount. In contrast, if the bank angle is restricted, a sign change in the required lateral acceleration would demand large variations in the bank angle, severely compromising the flight stability of high-speed UAVs.
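To make the idea concrete, the sketch below shows one way an unlimited bank angle command could be formed. It only illustrates the unwrapping concept, not the formulation of [12], and all names and the acceleration-direction convention are assumptions.

```python
import numpy as np

def bank_angle_command(a_lat_req, a_vert_req, bank_current):
    # Desired bank angle taken from the direction of the required acceleration.
    bank_des = np.arctan2(a_lat_req, a_vert_req)
    # Shift by whole turns so the command stays within pi of the current bank
    # angle, avoiding large transients when the required lift direction changes.
    k = np.round((bank_current - bank_des) / (2.0 * np.pi))
    return bank_des + 2.0 * np.pi * k
```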
4.2. Reward Function and Termination Conditions
The most significant challenge in using DRL to address the guidance problem of a UAV with time-varying velocity is developing an effective reward function in a sparse reward environment. If the reward signal is given only when the vehicle arrives at the target and satisfies the process and terminal constraints, it is probable that, within a limited number of episodes, the agent's policy will have difficulty identifying and exploiting positive samples.
The potential-based reward shaping (PBRS) method was proposed in [36] to address the challenge of sparse reward environments. Inspired by PBRS, a compound reward function was designed that provides cues to the agent at each time step, thereby motivating the vehicle to reach the target. The shaping reward was designed to keep the LOS angular rate of the vehicle as small as possible, that is,
where the weight is a positive constant and the LOS angular rate is normalized by its scaling factor.
Then, the arrival time error also needs to be kept within tolerance. The reward signal for the arrival time error is
where the weight is a positive constant and the arrival time error is normalized by its scaling factor.
Meanwhile, in order to achieve a dive attack, an overly shallow flight path angle is undesirable. Therefore, a penalty signal is applied to the dive angle error, that is,
where the weight is a positive constant and the dive angle error is normalized by its scaling factor.
From Equations (28) and (29), it can be derived that the required lift and the commanded acceleration are positively correlated. This implies that as the commanded acceleration increases, the required lift also increases, resulting in a higher commanded angle of attack. Consequently, the drag coefficient increases, leading to greater energy loss. Therefore, to minimize the energy consumption of the guidance policy, a negative reward signal was introduced to penalize excessive commanded acceleration, that is,
where the penalized acceleration command is the output of the policy.
Finally, for terminal conditions that satisfy the constraints, it is necessary to provide an appropriate bonus signal:
where the bonus weight is a positive constant and the impact velocity is normalized by its scaling factor.
Combining Equations (32), (33), (35), and (36), the reward signal for each time step is obtained as follows:
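Since the combined signal weighs several terms, a rough sketch of how such a composite per-step reward might be assembled is given below. The functional forms, weights, and scaling factors are placeholder assumptions rather than the values used in the paper.

```python
import numpy as np

def step_reward(los_rate, time_err, dive_err, accel_cmd, arrived, v_impact,
                k=(1.0, 1.0, 1.0, 0.01, 10.0), s=(0.1, 5.0, 0.2, 100.0)):
    k1, k2, k3, k4, k5 = k                          # positive weights (assumed values)
    s_los, s_t, s_dive, s_v = s                     # scaling factors (assumed values)
    r = 0.0
    r -= k1 * (los_rate / s_los) ** 2               # shaping term: keep the LOS rate small
    r -= k2 * (time_err / s_t) ** 2                 # arrival time error term
    r -= k3 * (dive_err / s_dive) ** 2              # dive angle error penalty
    r -= k4 * float(np.sum(np.square(accel_cmd)))   # penalize large acceleration commands
    if arrived:                                     # terminal bonus when constraints are met
        r += k5 * (v_impact / s_v)
    return r
```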
Equation (7a–h) was solved using the fourth-order Adams predictor–corrector numerical integration method with a fixed time step, and the initial four steps were integrated using the Runge–Kutta method. In addition, to ensure sufficient accuracy, the time step was reduced by a factor of 100 when the distance to the target fell below a prescribed threshold.
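For reference, a generic fourth-order Adams predictor–corrector integrator with a Runge–Kutta start-up is sketched below. The dynamics function, step size, and step-refinement logic of the actual simulation are not reproduced, so this is only an illustration of the scheme.

```python
import numpy as np

def rk4_step(f, t, y, h):
    # Classical fourth-order Runge-Kutta step, used to start the multistep method.
    k1 = f(t, y)
    k2 = f(t + 0.5 * h, y + 0.5 * h * k1)
    k3 = f(t + 0.5 * h, y + 0.5 * h * k2)
    k4 = f(t + h, y + h * k3)
    return y + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

def abm4_integrate(f, t0, y0, h, n_steps):
    # Fourth-order Adams predictor-corrector: Adams-Bashforth predictor,
    # Adams-Moulton corrector, started with RK4 for the first steps.
    t = [t0]
    y = [np.asarray(y0, dtype=float)]
    for _ in range(min(3, n_steps)):              # RK4 start-up
        y.append(rk4_step(f, t[-1], y[-1], h))
        t.append(t[-1] + h)
    fs = [f(ti, yi) for ti, yi in zip(t, y)]
    for _ in range(3, n_steps):
        # Predictor (Adams-Bashforth, four steps)
        yp = y[-1] + (h / 24.0) * (55 * fs[-1] - 59 * fs[-2] + 37 * fs[-3] - 9 * fs[-4])
        # Corrector (Adams-Moulton, using the predicted derivative)
        fp = f(t[-1] + h, yp)
        yc = y[-1] + (h / 24.0) * (9 * fp + 19 * fs[-1] - 5 * fs[-2] + fs[-3])
        t.append(t[-1] + h)
        y.append(yc)
        fs.append(f(t[-1], yc))
    return np.array(t), np.array(y)
```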
Given the mission profile and operational constraints, the terminal criteria were formally established as follows:
The height of the vehicle is less than zero;
The FOV constraints or the overload constraints are violated;
The vehicle is flying in a direction that is away from the target;
The vehicle has arrived at the target successfully.
It is important to note that the current episode is terminated if the constraints are violated; in that case, the agent does not receive the terminal reward. This incentive is sufficient to encourage the agent to learn to satisfy the constraints, so no additional negative reward is required when a constraint is violated.
4.3. Policy Optimization
In light of the satisfactory performance of GRUs in retaining long-term historical information, a GRU layer was introduced into both the actor network and the critic network to enhance the generalization ability of the policy under different task conditions. This integration enables the policy to adapt to varying tasks through updates to the hidden states of the GRUs.
It should be noted that when GRU layers are introduced into the policy and value networks, the manner in which the hidden state of the GRU layer is updated during rollout collection has a significant impact on training. Inspired by [33,37], the hidden state was reset before the start of each rollout; however, the hidden state of each episode within the same rollout was inherited from the previous episode. The interaction protocol between the learning agent and its operational environment is illustrated in Figure 5. Note that the length of each episode may vary, although this is not explicitly shown in Figure 5. Through this design, the agent is encouraged to update the parameters of the GRU layer so that updates to its hidden state form an embedded representation of the current environment or task. This provides the agent with multi-level input features for rapid adaptation to different environments and tasks, thereby enhancing the adaptability and generalization performance of the final network.
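A minimal sketch of this hidden-state handling during rollout collection is shown below; the environment API, the actor interface, and the tensor shapes are assumptions.

```python
import torch

def collect_rollout(env, actor, episodes_per_rollout):
    h = None                                   # hidden state is reset once per rollout
    buffer = []
    for _ in range(episodes_per_rollout):
        obs, done = env.reset(), False         # note: h is NOT reset between episodes,
        while not done:                        # so it is inherited from the previous one
            obs_t = torch.as_tensor(obs, dtype=torch.float32).view(1, 1, -1)
            with torch.no_grad():
                action, h = actor(obs_t, h)    # actor contains the GRU layer
            next_obs, reward, done, info = env.step(action.squeeze().numpy())
            buffer.append((obs, action, reward, done))
            obs = next_obs
    return buffer
```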
The PPO algorithm has two versions of the objective function. The first is the clipped surrogate objective, whose expression is given in Equation (2). The second incorporates the KL divergence into the objective function as a penalty, with the penalty coefficient adaptively adjusted so that the KL divergence reaches a specified target value during each policy update. In general, convergence is easier to achieve with the clipped surrogate objective, but this approach inevitably requires a longer training process. In contrast, achieving convergence with the adaptive KL penalty coefficient is challenging. Therefore, to strike a balance between the speed and stability of training, the two methods were combined in a novel manner, whereby the clip factor and the learning rate of the policy network are adaptively adjusted to reach the target value of the KL divergence, that is,
In the training phase, the experience data were fully utilized by performing E update epochs. The training process involves a range of generic operations, including ratio calculation, the computation of the various losses, and gradient clipping. The adaptive PPO (APPO) algorithm is shown in Algorithm 2.
Algorithm 2 Adaptive PPO (APPO) algorithm
Input: Target value of the KL divergence, number of epochs M
Output: The optimized parameters of the actor and the critic w
1: Initialize the network parameters, including the actor and the critic w
2: for epoch = 1 to M do
3:  Initialize the replay buffer R and reset the hidden states of the RNNs
4:  for each episode in the rollout do
5:   Reset the environment
6:   while not done do
7:    Obtain the current observation
8:    Feed the observation to the actor to obtain the action
9:    Execute one step in the environment
10:    Store the transition into R
11:   end while
12:  end for
13:  Calculate the discounted returns G and the advantages
14:  for each optimization epoch do
15:   Calculate the policy network loss, the value network loss, and the KL divergence
16:   Update the network parameters
17:   Adjust the clip factor and the learning rate
18:  end for
19: end for
20: return the optimized parameters
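As a minimal sketch of the adaptive adjustment step in Algorithm 2 (the thresholds, scaling factor, and bounds below are assumptions, not the values used in the paper):

```python
def adapt_clip_and_lr(kl, kl_target, clip_eps, lr,
                      scale=1.5, ratio_hi=2.0, ratio_lo=0.5,
                      clip_bounds=(0.05, 0.3), lr_bounds=(1e-5, 1e-3)):
    # If the measured KL divergence overshoots its target, the policy changed
    # too much: tighten the clip factor and reduce the learning rate. If it
    # undershoots, loosen both so the policy can change faster.
    if kl > ratio_hi * kl_target:
        clip_eps = max(clip_eps / scale, clip_bounds[0])
        lr = max(lr / scale, lr_bounds[0])
    elif kl < ratio_lo * kl_target:
        clip_eps = min(clip_eps * scale, clip_bounds[1])
        lr = min(lr * scale, lr_bounds[1])
    return clip_eps, lr
```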