Article

Integrated Guidance and Control for Strap-Down Flight Vehicle: A Deep Reinforcement Learning Approach

Institute of Precision Guidance and Control, Northwestern Polytechnical University, Xi’an 710072, China
* Author to whom correspondence should be addressed.
Aerospace 2025, 12(5), 400; https://doi.org/10.3390/aerospace12050400
Submission received: 21 February 2025 / Revised: 28 April 2025 / Accepted: 28 April 2025 / Published: 1 May 2025
(This article belongs to the Special Issue Integrated Guidance and Control for Aerospace Vehicles)

Abstract

This paper proposes a three-dimensional (3D) deep reinforcement learning-based integrated guidance and control (DRLIGC) method for flight vehicles subject to the narrow field-of-view (FOV) constraint of the strap-down seeker. By leveraging the data-driven nature of the deep reinforcement learning (DRL) algorithm, the method reduces the model dependence that complicates integrated guidance and control (IGC) design, thereby addressing the inherent complexity of the IGC model. Firstly, the pitch and yaw channels of the six-degree-of-freedom (6-DOF) IGC model are formulated as Markov decision processes (MDPs) with their own states and actions. Secondly, a channel-by-channel progressive training method based on the twin delayed deep deterministic policy gradient (TD3) algorithm is proposed. The agents of the pitch and yaw channels are trained independently with the TD3 algorithm, which substantially alleviates the complexity of the training process, while the roll channel is stabilized through the back-stepping method. Thirdly, a comprehensive reward function is designed to simultaneously enforce the narrow FOV constraint and enhance the target engagement capability; this function also mitigates the issue of sparse rewards to some extent. Monte Carlo (MC) and comparative simulations show that the proposed DRLIGC method can effectively approach the target while maintaining the narrow FOV constraint and also exhibits good robustness.

1. Introduction

With the progress of miniaturization technology, the inertial-stabilized gimbaled seeker is increasingly being supplanted by the strap-down seeker. This substitution is mainly attributed to the complex structure and high cost of the inertial-stabilized gimballed seeker [1]. The strap-down seeker, rigidly mounted on the flight vehicle, offers significant advantages for various small flight vehicle platforms, including cost effectiveness, enhanced reliability, and improved stability. However, the fixed connection between the seeker and the flight vehicle body results in a complex coupling of the seeker’s measurement data with the vehicle’s attitude, which creates pronounced non-linearities and poses significant challenges for model decoupling, as described in [2]. Furthermore, the inherently limited FOV characteristic of a strap-down seeker significantly increases the likelihood of exceeding the FOV during attitude adjustments for target tracking [3]. Addressing the design issues of the guidance and control system under the constraints of intense coupling and FOV of the strap-down seeker is of paramount importance and persists as an urgent challenge in current research.
The first category of solutions to the aforementioned problems is to design guidance laws with FOV constraints. In this regard, the previous studies can be mainly divided into three categories: proportional navigation (PN)-based methods [4,5], sliding mode control (SMC)-based guidance laws [6,7], and optimal control-based guidance laws [8]. For non-maneuvering targets, a biased PN algorithm that takes into account the FOV constraint was devised in [4]. In reference [5], a piece-wise design approach was implemented, and special switching logic was elaborately designed to ensure that the PN algorithm adheres to the FOV constraint. The SMC technique tackles the constraint issue by redefining both the sliding mode surface and the reaching law. In [6], the sigmoid function was incorporated into the design of the sliding mode surface, effectively resolving the FOV constraint problem. Regarding the reaching law design, as presented in [7], the FOV constraint problem is regarded as a state-constraint control problem, and an Integration Barrier Lyapunov Function (IBLF) is used to solve it effectively. In [8], the optimal control theory was employed. By optimizing the positive weighting coefficient related to the constraint range, the performance index was minimized. Concurrently, both the impact angle and the FOV constraint were successfully regulated. Nonetheless, all the guidance laws mentioned above are formulated for two-dimensional (2D) scenarios and generally only find applicability to stationary or weakly maneuvering targets. These guidance laws fail to take into account the coupling relationships among diverse motion channels (e.g., pitch, yaw, and roll) in 3D dynamics, nor do they consider the impact of autopilot dynamics. Consequently, in 3D scenarios, the meticulous design of guidance laws subject to FOV constraint warrants further in-depth attention.
To design an FOV-constrained guidance law in a 3D scenario, a back-stepping technique based on non-linear mapping was employed in [9] to develop a 3D guidance law considering both impact angle and FOV constraints; however, it is still applicable only to stationary targets. In [10], a 3D guidance law with impact time and FOV constraints was proposed, based on the Barrier Lyapunov Function (BLF). It uses the time-to-go information of a flight vehicle to meet the FOV constraint. However, it demands additional information on the target, hindering practical implementation. In [11], a time-varying function related to the line-of-sight (LOS) replaced the FOV constraint, converting it into an output constraint solved by a time-varying asymmetric BLF. Similarly, Ref. [12] presented a 3D optimal guidance law based on biased proportional navigation, utilizing a non-linear function for FOV constraint satisfaction and impact time control. However, both methods apply only to stationary or weakly maneuvering targets. On the other hand, most of the above-mentioned studies assume a small, idealized angle of attack, which allows the FOV angle constraint problem to be converted into a velocity lead angle constraint problem for preliminary analysis. However, the rigid attachment of the strap-down seeker to the flight vehicle causes strong coupling between the measurement information of the seeker and the flight vehicle's attitude. This leads to a complex interaction between the guidance loop and the rotational dynamics. Given this, finding effective methods to address the coupling between the guidance and the attitude dynamics has now emerged as a crucial and urgent task in the field [13].
The IGC design can alleviate the coupling problem between the guidance system and the attitude dynamics to a certain extent and has received extensive attention. In [14], the small-gain theory was employed, while in [15], the prescribed performance method was utilized to address the IGC problem. In [16,17,18,19], the sliding mode control theory was adopted to design IGC algorithm for a flight vehicle. In [20], an adaptive block dynamic surface control algorithm was used to tackle the time-varying non-linear system problem of IGC design. However, these methods failed to consider the narrow FOV constraint problem inherent in the strap-down seeker. To achieve constraints on the impact angle, input, and system states, an IGC method integrating the back-stepping technique and IBLF was designed in [21]. In [22], an IGC method combining dynamic surface control with the IBLF was proposed for the Skid-to-Turn (STT) flight vehicles, thereby realizing the narrow FOV constraint. In [23], an adaptive disturbance observer was designed using the adaptive gain super-twisting algorithm to estimate the uncertainties in the IGC system. The IBLF and time-varying sliding mode variables were combined to achieve the FOV constraint. In [24], an integrated cooperative guidance and control method capable of achieving FOV constraints was derived by means of the BLF and dynamic surface control. In [25], an adaptive IGC method under side-window constraint was proposed based on the asymmetric BLF, finite-time stability theory, and the modified dynamic surface technique. In [13], an IGC method based on the IBLF and the adaptive algorithm was put forward, and the FOV constraint was achieved through state constraints. However, due to the intense guidance-control coupling in the IGC model, the above-mentioned methods present substantial challenges in terms of complex design and parameter tuning.
In contrast to traditional design methods, the DRL algorithm, a data-driven intelligent approach, substantially lessens the reliance on models and obviates the necessity for intricate model processing. By integrating the perception capacity of deep learning with the decision-making and optimization capabilities of reinforcement learning, it can extract features from vast amounts of data. This enables the agent to optimize strategies in accordance with the reward signals during environmental interactions [26]. In [27,28,29,30], the DRL method was tailored to the guidance law design, thereby validating the feasibility of this concept. In [31], the Deep Q-Network algorithm was applied to address the guidance problem. Nevertheless, as it can only generate a discrete action space, the TD3 algorithm was adopted in [32] for enhancement. The TD3 algorithm can generate a continuous action space, rendering it more suitable for handling guidance and control issues. In [33,34,35], the DRL algorithm was applied to the design of 3D guidance laws with constraints on impact angle, FOV, and approach time. It was shown that this method creates a direct mapping from states to actions via offline training, eliminating model dependence and effectively resolving the encountered constraint problems. In [36], the DRL algorithm was applied to the design of the 2D IGC method. In [37], the 3D IGC model was simplified, and an IGC method was formulated in the pitch and yaw directions, but no constraints were taken into account.
In the present technological context, research that employs DRL algorithms to solve the 3D IGC problem under the constraint of a narrow FOV remains relatively scarce. To devise an IGC method for STT flight vehicles under the narrow FOV constraint in a three-dimensional scenario, this paper puts forward a DRLIGC approach founded on the TD3 algorithm. Specifically, the 6-DOF IGC model is transformed into Markov decision process (MDP) models channel by channel. Elaborate reward functions, which take into account both the FOV constraint and the hitting capability, are designed separately for the pitch and yaw channels.
Leveraging the data-driven characteristic of the DRL algorithm, a direct mapping from states to rudder deflection angles is established through training. This mapping can significantly mitigate the challenges associated with designing the guidance and control system, which are typically caused by the complex couplings that come with the strap-down seeker. The contributions of this paper are as follows:
(1)
In contrast to existing methods, this paper introduces a DRLIGC algorithm capable of addressing the issue of accurately approaching maneuvering targets in 3D scenarios while adhering to the narrow FOV constraint of the strap-down seeker. Utilizing the DRL algorithm, this approach significantly reduces the complexity associated with the design of IGC methods and mitigates the challenges of parameter adjustment.
(2)
In comparison to the three-channel coupling model training strategy, the DRLIGC algorithm proposed in this paper employs a channel-by-channel progressive training approach. This not only markedly decreases the training complexity arising from the coupling between the guidance and the attitude dynamics but also improves the training efficiency.
(3)
Comprehensive reward functions are separately devised for the pitch and yaw channels. These functions effectively strike a balance between the FOV constraints and the hitting capability of the respective channels, thus alleviating the problem of sparse rewards.
The structure of this paper is organized as follows: In Section 2, the 6-DOF IGC design model is presented, and the specific problems that this paper aims to resolve are delineated. In Section 3, the training procedure and specific algorithm of the DRLIGC method are detailed. In Section 4, Monte Carlo simulations and comparative simulations are conducted to verify the effectiveness and superiority of the method proposed in this paper. Conclusions are drawn in Section 5.

2. Problem Formulation

In this section, a 3D IGC model for the STT flight vehicle considering FOV constraint of the strap-down seeker is established. Based on this model, the design objective of this paper and the required preliminary concepts are presented.
Due to the data-driven characteristic of the DRL algorithm, which substantially lessens the reliance on the mathematical model, there is no need for intricate transformations of the entire model. Instead, training can be carried out directly using the 6-DOF model of the flight vehicle in the context of IGC design.
As shown in Figure 1, the inertial coordinate system $ox_1y_1z_1$, the body coordinate system $ox_by_bz_b$, the trajectory coordinate system $ox_iy_iz_i$, and the velocity coordinate system $ox_vy_vz_v$ are presented, respectively; $\vartheta$ and $\psi$ represent the pitch angle and the yaw angle of the flight vehicle, respectively; $\alpha$ and $\beta$ are the angle of attack and the sideslip angle of the flight vehicle, respectively; and $\gamma_V$ represents the velocity roll angle of the flight vehicle [13].
In Figure 2, the flight vehicle and a moving target are represented by M and T, respectively; $ox_4y_4z_4$ and $ox_2y_2z_2$ represent the LOS coordinate system and the body-LOS (BLOS) coordinate system, respectively; $q$ and $\eta$ represent the LOS elevation angle and the LOS azimuth angle, respectively. To simplify the problem, the BLOS elevation angle $q_{BL}$ and the BLOS azimuth angle $\eta_{BL}$ are introduced to represent the FOV angles in the longitudinal plane and the lateral plane, respectively. Consequently, the seeker's FOV constraint can be formulated in terms of the BLOS angles.
The kinematic equations of the flight vehicle can be expressed as [13]:
$$
\begin{aligned}
\dot{\vartheta} &= \omega_y \sin\gamma + \omega_z \cos\gamma \\
\dot{\psi} &= \omega_y \cos\gamma \sec\vartheta - \omega_z \sin\gamma \sec\vartheta \\
\dot{\gamma} &= \omega_x - \omega_y \cos\gamma \tan\vartheta + \omega_z \sin\gamma \tan\vartheta \\
\dot{\alpha} &= \big\{\cos\theta\,[\sin\vartheta \cos\gamma \cos(\psi-\psi_V) - \sin\gamma \sin(\psi-\psi_V)] - \sin\theta \cos\vartheta \cos\gamma\big\}/\cos\beta \\
\dot{\beta} &= \cos\theta\,[\cos\gamma \sin(\psi-\psi_V) + \sin\vartheta \sin\gamma \cos(\psi-\psi_V)] - \sin\theta \cos\vartheta \sin\gamma \\
\dot{\gamma}_V &= (\cos\alpha \sin\beta \sin\vartheta - \sin\alpha \sin\beta \cos\gamma \cos\vartheta + \cos\beta \sin\gamma \cos\vartheta)/\cos\theta \\
\dot{x} &= V \cos\theta \cos\psi_V \\
\dot{y} &= V \sin\theta \\
\dot{z} &= -V \cos\theta \sin\psi_V
\end{aligned}
\tag{1}
$$
where $\gamma$ denotes the roll angle of the flight vehicle; $x$, $y$, $z$ represent the position of the flight vehicle; $V$ is the magnitude of the velocity of the center of mass of the flight vehicle; $\theta$ is the angle between the velocity vector of the flight vehicle and the horizontal plane of the $ox_1y_1z_1$ coordinate system, and $\psi_V$ is the angle between the projection of the velocity vector onto the horizontal plane and the $ox_1$ axis. They are called the flight path angle and the heading angle, respectively.
The dynamics equations are depicted as [13]:
$$
\begin{aligned}
\dot{\omega}_x &= \frac{J_y - J_z}{J_x}\,\omega_y \omega_z + \frac{M_x}{J_x} \\
\dot{\omega}_y &= \frac{J_x - J_z}{J_y}\,\omega_x \omega_z + \frac{M_y}{J_y} \\
\dot{\omega}_z &= \frac{J_y - J_x}{J_z}\,\omega_y \omega_x + \frac{M_z}{J_z} \\
M\frac{dV}{dt} &= P\cos\alpha\cos\beta - X - Mg\sin\theta \\
MV\frac{d\theta}{dt} &= P(\sin\alpha\cos\gamma_V + \cos\alpha\sin\beta\sin\gamma_V) + Y\cos\gamma_V - Z\sin\gamma_V - Mg\cos\theta \\
-MV\cos\theta\frac{d\psi_V}{dt} &= P(\sin\alpha\sin\gamma_V - \cos\alpha\sin\beta\cos\gamma_V) + Y\sin\gamma_V + Z\cos\gamma_V
\end{aligned}
\tag{2}
$$
where $P$ represents the thrust; $X$, $Y$, and $Z$ are the components of the aerodynamic force along the axes of the velocity coordinate system, called the drag force, the lift force, and the side force, respectively; $M_x$, $M_y$, and $M_z$ are the components of the aerodynamic moment along the three axes of the body coordinate system, called the rolling moment, the yawing moment, and the pitching moment, respectively. The specific calculations are as follows [22]:
$$
\begin{aligned}
X &= QS\big(c_{x0} + c_x^{\alpha}|\alpha| + c_x^{\beta}\beta + c_x^{\alpha\beta}|\alpha\beta| + c_x^{\delta_x}|\delta_x| + c_x^{\delta_y}|\delta_y| + c_x^{\delta_z}|\delta_z|\big) \\
Y &= QS\big(c_y^{\alpha}\alpha + c_y^{\beta}\beta + c_y^{\delta_z}|\delta_z|\big) \\
Z &= QS\big(c_z^{\alpha}\alpha + c_z^{\beta}\beta + c_z^{\delta_y}|\delta_y|\big) \\
M_x &= QSL\big(m_x^{\alpha}\alpha + m_x^{\beta}\beta + m_x^{\delta_x}\delta_x\big) \\
M_y &= QSL\big(m_y^{\beta}\beta + m_y^{\delta_y}\delta_y\big) \\
M_z &= QSL\big(m_z^{\alpha}\alpha + m_z^{\delta_z}\delta_z\big)
\end{aligned}
\tag{3}
$$
where $Q$, $S$, and $L$ denote the dynamic pressure, reference area, and reference length, respectively; $M$ is the mass of the flight vehicle; $\omega_x$, $\omega_y$, and $\omega_z$ are the components of the rotational angular velocity $\omega$ of the body coordinate system relative to the inertial coordinate system along the axes of the body coordinate system; $J_x$, $J_y$, and $J_z$ are the moments of inertia about the roll, yaw, and pitch axes, respectively; $c_{x0}$, $c_i^{j}$ ($i = x, y, z$; $j = \alpha, \beta$), $c_x^{\alpha\beta}$, $c_x^{n}$ ($n = \delta_x, \delta_y, \delta_z$), $c_y^{\delta_z}$, and $c_z^{\delta_y}$ are the aerodynamic force coefficients; $m_x^{\alpha}$, $m_x^{\beta}$, $m_x^{\delta_x}$, $m_y^{\beta}$, $m_y^{\delta_y}$, $m_z^{\alpha}$, and $m_z^{\delta_z}$ are the aerodynamic moment coefficients; and $\delta_x$, $\delta_y$, and $\delta_z$ are the aileron, rudder, and elevator deflections, respectively.
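To make the aerodynamic model of Equation (3) concrete, the following minimal Python sketch evaluates the force and moment expressions for a given flight condition. The function name and the dictionary-based coefficient interface are illustrative assumptions; the coefficient values actually used in this paper are those listed in Table 5.

```python
import numpy as np

def aero_forces_moments(Q, S, L, alpha, beta, delta_x, delta_y, delta_z, c, m):
    """Evaluate the aerodynamic force/moment model of Equation (3).

    c and m are dictionaries of force and moment coefficients; angles and
    deflections are in radians. All interface choices here are illustrative."""
    X = Q * S * (c['x0'] + c['xa'] * abs(alpha) + c['xb'] * beta
                 + c['xab'] * abs(alpha * beta)
                 + c['xdx'] * abs(delta_x) + c['xdy'] * abs(delta_y)
                 + c['xdz'] * abs(delta_z))
    Y = Q * S * (c['ya'] * alpha + c['yb'] * beta + c['ydz'] * abs(delta_z))
    Z = Q * S * (c['za'] * alpha + c['zb'] * beta + c['zdy'] * abs(delta_y))
    Mx = Q * S * L * (m['xa'] * alpha + m['xb'] * beta + m['xdx'] * delta_x)
    My = Q * S * L * (m['yb'] * beta + m['ydy'] * delta_y)
    Mz = Q * S * L * (m['za'] * alpha + m['zdz'] * delta_z)
    return np.array([X, Y, Z]), np.array([Mx, My, Mz])
```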
Since this paper focuses on STT flight vehicles, the roll angle is typically kept close to zero. Additionally, for a strap-down flight vehicle engaging a maneuvering target, it is of utmost importance to restrict the seeker's BLOS angles to a very small region. Under these circumstances, the relationship between the angles $\eta$, $\psi$, $q$, $\vartheta$ and the BLOS angles $\eta_{BL}$, $q_{BL}$ is given by [22]:
$$
\eta_{BL} = \eta - \psi, \qquad q_{BL} = q - \vartheta
\tag{4}
$$
Therefore, the seeker's FOV constraint can be transformed into $|\eta_{BL}| \leq \sigma_{\max}$ and $|q_{BL}| \leq \sigma_{\max}$, where $\sigma_{\max}$ is the maximum FOV angle of the strap-down seeker.
Taking the above-mentioned factors into account, the design objective of this paper is to achieve precise guidance for moving targets while maintaining the FOV constraint and keeping the roll angle stabilized at approximately zero.
Considering the inevitable signal transmission and processing delays in practical application scenarios, a first-order lag is used to represent this time-delay characteristic, ensuring that the simulation environment is closer to real-world operating conditions. The delay model is given in Equation (5), where $\delta_{ic}$ represents the signal output by the agent and $\delta_i$ represents the signal actually applied to the model; $\tau_t$ is the corresponding time constant.
$$
\dot{\delta}_i = -\frac{1}{\tau_t}\delta_i + \frac{1}{\tau_t}\delta_{ic}, \qquad i = x, y, z
\tag{5}
$$
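In a discrete-time simulation, Equation (5) can be propagated with a simple forward-Euler step at the 0.01 s simulation step size, as in the sketch below; the function and argument names are illustrative only.

```python
def actuator_lag_step(delta, delta_cmd, tau_t, dt=0.01):
    """One Euler step of the first-order lag in Equation (5):
    d(delta)/dt = (-delta + delta_cmd) / tau_t."""
    return delta + dt * (delta_cmd - delta) / tau_t
```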
Based on the above mathematical models, the simulation process of the 6-DOF IGC model in this paper is shown in Figure 3:
Remark 1.
It should be noted that when the distance $R$ between the flight vehicle and the target is less than a certain value $R_d$, the seeker enters its working blind area. During this period, the output information of the seeker becomes invalid, and the flight vehicle is guided by its inertial guidance system. Inertial guidance here means that the acceleration obtained from onboard hardware such as accelerometers is propagated directly through kinematic calculations, and no target information is required.
Remark 2.
All simulation training in this paper is conducted entirely based on the 6-DOF IGC design model.

3. The DRLIGC Method Design with FOV Constraint

In this section, the IGC algorithms for the pitch, yaw, and roll channels of the STT flight vehicle are designed independently. A DRLIGC method is put forward to control the pitch and yaw channels of the flight vehicle while adhering to the FOV constraint of the seeker. Meanwhile, the back-stepping method is employed to maintain the roll angle at approximately zero.

3.1. Twin Delayed Deep Deterministic Policy Gradient Algorithm

In this paper, the TD3 algorithm, a specific type of DRL algorithm, is harnessed for the design of the IGC method. Characterized as a data-driven intelligent algorithm, DRL refines its strategies via iterative trial-and-error processes. It is effectively applied to resolve a succession of issues within the MDP model. To leverage the capabilities of the DRL algorithm, it is imperative to cast the IGC issue into the form of an MDP model.
The MDP model includes a state space $S$ that can describe the characteristics of the model, a suitable action space $A$, and a comprehensive reward function $R$ that takes into account the objective of the problem. At each time step $t$ in the TD3 algorithm, the agent selects an action $a_t \in A$ based on the current environmental state $s_t \in S$. The environment transitions to $s_{t+1}$ according to the chosen action and provides the reward $r_t$ according to the reward function $R$. Then, the agent continuously adjusts its own strategy based on the magnitude of the reward. The basic architecture of the TD3 algorithm, including its network components, is shown in Figure 4, where $\mu_{\phi}$, $\mu_{\phi'}$, $Q_{\theta_i}$, and $Q_{\theta_i'}$ ($i = 1, 2$) are the actor, target actor, critic, and target critic networks, respectively, and $\phi$, $\phi'$, $\theta_i$, and $\theta_i'$ ($i = 1, 2$) are the corresponding network parameters.
As depicted in Figure 4, the TD3 algorithm is a classic Actor-Critic algorithm, as noted in [34]. By incorporating two Q-networks and a target network update mechanism, it efficiently alleviates the value overestimation problem that is intrinsic to deterministic policy gradient methods. This, in turn, significantly improves both the stability and the convergence rate of the algorithm. The delayed update mechanism ensures more stable parameter updates, and target policy smoothing reduces variance during the learning process. This architecture enables the TD3 algorithm to adapt rapidly and make efficient decisions in complex environments.
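For reference, the sketch below shows the core of a TD3 critic update: the clipped double-Q target with target-policy smoothing. The paper's agents are trained with the MATLAB Reinforcement Learning Toolbox, so this Python fragment is only an illustrative sketch, and the callables standing in for the target networks are hypothetical placeholders.

```python
import numpy as np

def td3_target(r, s_next, done, actor_target, critic1_target, critic2_target,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, a_low=-1.0, a_high=1.0):
    """Clipped double-Q TD3 target with target-policy smoothing.

    actor_target(s) -> action and critic*_target(s, a) -> Q-value are callables
    standing in for the target networks (hypothetical placeholders)."""
    a_next = actor_target(s_next)
    noise = np.clip(np.random.normal(0.0, noise_std, size=np.shape(a_next)),
                    -noise_clip, noise_clip)
    a_next = np.clip(a_next + noise, a_low, a_high)
    q_next = min(critic1_target(s_next, a_next), critic2_target(s_next, a_next))
    return r + gamma * (1.0 - done) * q_next
```

Both critics are regressed toward this target at every step, while the actor and the target networks are updated at a lower (delayed) frequency.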
To address the IGC design challenge for STT flight vehicles with FOV constraints, the model ingeniously converts the FOV constraint into constraints on the BLOS angles.
This paper presents a high-precision approach IGC method that utilizes the TD3 algorithm. This method has the ability to enforce FOV constraints in the pitch and yaw channels. Through channel decomposition, the method significantly reduces the training complexity of the agent. Its objective is to directly establish a mapping from the state space to the action space of the flight vehicle. In this way, it effectively alleviates the formidable multi-channel coupling problem inherent in the IGC issue. Figure 5 depicts the information interaction between the flight vehicle and the environment.

3.2. MDP Model Design in Pitch and Yaw Channels

In this subsection, the IGC design of the pitch channel and the yaw channel is carried out separately. Leveraging the model-agnostic nature of the DRL algorithm, our objective is to address the strong coupling relationship that exists between the pitch and yaw channels within the IGC design model.
According to Equations (2) and (3), it can be seen that the pitch channel involves a more complex balance among lift, drag, and gravity, and the uncertainty is much higher than that of the yaw channel. Training the pitch channel first represents a more effective curriculum learning strategy. By addressing the more complex pitch control problem initially, a more stable training environment is created for the subsequent yaw training. Therefore, in this paper, the training of the intelligent agent for the pitch channel is conducted first.
When devising the DRLIGC method for the pitch channel, the corresponding IGC design model is structured as an MDP model. The entire simulation environment is grounded in the 6-DOF IGC design model introduced in Section 2. Relying on this model, the subsequent MDP model is developed.
The strap-down seeker can obtain the BLOS elevation angle q B L and the BLOS azimuth angle η B L . Through further calculation, the LOS elevation angle q and the LOS azimuth angle η can be acquired. By utilizing the observable information obtained by the strap-down seeker and the information derivable from the observable data, a suitable state space is established. A well-chosen state space has the potential to mitigate the complexity of DRL training and expedite its convergence rate.
In light of the FOV angle constraint imposed on $q_{BL}$ that the pitch channel must satisfy, state variables that can comprehensively represent this channel are carefully selected to formulate the state space, which is defined as follows:
$$
s_t = (q,\ \dot{q},\ q_{BL})
$$
In order to further improve the convergence efficiency, the normalization method is employed to process the state space. Considering that the angle range within the model is $[-\pi, \pi]$, the normalized state space is defined as follows:
$$
s_t = \left(\frac{q}{\pi},\ \frac{\dot{q}}{\pi},\ \frac{q_{BL}}{\pi}\right)
$$
The action space is selected as the elevator deflection $\delta_z$ corresponding to the pitch channel. This choice can make full use of the data-driven characteristics of the DRL algorithm to form a direct mapping from the state space to $\delta_z$.
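A minimal sketch of how the normalized pitch-channel observation and the elevator action could be assembled is given below; the deflection limit used to scale the agent's output is an assumed illustrative value, not one specified in the paper.

```python
import numpy as np

DELTA_Z_MAX = np.deg2rad(30.0)  # assumed elevator deflection limit (illustrative)

def pitch_observation(q, q_dot, q_bl):
    """Normalized pitch-channel state s_t = (q/pi, q_dot/pi, q_BL/pi)."""
    return np.array([q, q_dot, q_bl]) / np.pi

def pitch_action_to_deflection(a):
    """Map the agent's action in [-1, 1] to an elevator deflection command."""
    return float(np.clip(a, -1.0, 1.0)) * DELTA_Z_MAX
```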
Finally, the construction of the MDP model can be finalized through careful selection of the most crucial reward function. The reward function represents the linchpin of the DRL algorithm. A well-designed reward function is capable of substantially accelerating the training of the agent and continuously fine-tuning it to conform to the specifications of the problem under consideration. This enables the agent to acquire a relatively optimal strategy.
In light of the constraints imposed on the BLOS angle within the pitch channel, as well as the ultimate necessity of attaining a high-precision target approach, and referring to the idea of achieving a quasi-parallel approach by suppressing the LOS angular rate in the classical PN guidance law, this paper determines the reward function of the pitch channel as:
$$
r_y = r_{yt} + r_{\sigma 1} + r_{yend}
$$
where $r_{yt}$ represents the immediate reward function in the pitch channel; $r_{\sigma 1}$ denotes the penalty term generated when $q_{BL}$ exceeds the constraint; and $r_{yend}$ is the terminal reward in the pitch channel. The specific calculation of $r_{yt}$ is formulated as:
$$
r_{yt} = b_1 \exp\!\left(-\dot{q}^2/\sigma_1\right) + b_2 \exp\!\left(-q_{BL}^2/\sigma_2\right)
$$
where $b_1$ and $b_2$ are the proportionality coefficients of each term, satisfying the condition $b_1 + b_2 = 1$; $\sigma_1$ and $\sigma_2$ are the corresponding scaling coefficients. The specific calculation of $r_{\sigma 1}$ is defined as:
$$
r_{\sigma 1} =
\begin{cases}
-C_1, & \text{if } |q_{BL}| \geq \sigma_{\max} \\
0, & \text{otherwise}
\end{cases}
$$
where $C_1$ is a positive constant chosen for the penalty.
The terminal reward $r_{yend}$ is designed as:
$$
r_{yend} =
\begin{cases}
C_2, & \text{if } R \leq R_{stop} \\
0, & \text{otherwise}
\end{cases}
$$
where $C_2$ is a selected positive constant representing the reward granted upon a successful hit.
When devising the reward function for the pitch channel, its content is essentially partitioned into three components. The underlying design concepts are elaborated on in the following manner:
(1)
The immediate reward function $r_{yt}$ yields the corresponding reward value at every time step. Its design intent is to minimize the LOS angular velocity $\dot{q}$. A non-linear function of the type $\exp(\cdot)$ is employed to obtain a reasonable distribution of the reward values. This approach endeavors to keep the LOS angular velocity $\dot{q}$ within a relatively narrow range to the greatest extent possible, thereby emulating the effect of the PN guidance law. Furthermore, the term $b_2\exp(-q_{BL}^2/\sigma_2)$ is crafted to maintain the BLOS angle $q_{BL}$ within a relatively small range. This serves to enhance the detection performance of the seeker. Simultaneously, this component also contributes to accelerating the convergence speed.
(2)
The FOV constraint penalty function $r_{\sigma 1}$ takes effect when the BLOS elevation angle $q_{BL}$ exceeds the constraint. It imposes a relatively substantial penalty value, compelling the agent to exert its utmost efforts to prevent the emergence of such a state.
(3)
The terminal reward function r y e n d is activated only upon the successful hit of the target. It will confer a relatively significant reward, informing the agent that this particular state yields greater rewards. This enables the agent to learn and gravitate towards a more optimal strategy.
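The three reward components defined above can be combined as in the following sketch; all numeric constants shown are placeholders rather than the tuned values listed in Table 3.

```python
import numpy as np

def pitch_reward(q_dot, q_bl, R, sigma_max, R_stop,
                 b1=0.5, b2=0.5, sig1=1e-4, sig2=1e-3, C1=10.0, C2=50.0):
    """Comprehensive pitch-channel reward r_y = r_yt + r_sigma1 + r_yend.
    The coefficient values are illustrative placeholders."""
    r_yt = b1 * np.exp(-q_dot ** 2 / sig1) + b2 * np.exp(-q_bl ** 2 / sig2)
    r_sigma1 = -C1 if abs(q_bl) >= sigma_max else 0.0   # FOV-violation penalty
    r_yend = C2 if R <= R_stop else 0.0                 # terminal hit reward
    return r_yt + r_sigma1 + r_yend
```

The yaw-channel reward $r_z$ introduced below has exactly the same structure, with $(\dot{\eta}, \eta_{BL}, b_3, b_4, \sigma_3, \sigma_4, C_3, C_4)$ in place of the pitch-channel quantities.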
Through the seamless integration of the state variables, action space, and the reward function with the established simulation model, a highly comprehensive and robust model for the pitch channel can be successfully constructed. This model can be utilized for agent training to derive the corresponding control strategy for the pitch channel, thereby forming an agent capable of directly obtaining the elevator deflection angle from the state values.
Analogously, in order to effectively tackle the FOV constraint issue within the yaw channel by means of the DRL algorithm, it is imperative to formulate the yaw channel model as an MDP model. Initially, it is crucial to establish that the FOV constraint in the yaw direction can be equivalently transformed into a constraint imposed on the BLOS azimuth angle $\eta_{BL}$, i.e., $|\eta_{BL}| \leq \sigma_{\max}$. Based on this condition, the MDP model for the yaw channel is established.
The normalized state space constructed for the yaw channel is:
$$
s_t = \left(\frac{\eta}{\pi},\ \frac{\dot{\eta}}{\pi},\ \frac{\eta_{BL}}{\pi}\right)
$$
Based on the characteristics of the 6-DOF model of the flight vehicle, the rudder deflection $\delta_y$ is chosen as the action space for the yaw channel. The reward function is crafted according to the principle of the PN guidance law. The specific form of the comprehensive reward function is designed as follows:
$$
r_z = r_{zt} + r_{\sigma 2} + r_{zend}
$$
where $r_{zt}$ represents the immediate reward function in the yaw channel; $r_{\sigma 2}$ denotes the penalty term generated when $\eta_{BL}$ exceeds the constraint; and $r_{zend}$ stands for the terminal reward in the yaw channel. The specific calculation of $r_{zt}$ is formulated as:
$$
r_{zt} = b_3 \exp\!\left(-\dot{\eta}^2/\sigma_3\right) + b_4 \exp\!\left(-\eta_{BL}^2/\sigma_4\right)
$$
where $b_3$ and $b_4$ are the proportionality coefficients, which meet the condition $b_3 + b_4 = 1$; $\sigma_3$ and $\sigma_4$ are the corresponding scaling coefficients. The specific calculation of $r_{\sigma 2}$ is defined as:
$$
r_{\sigma 2} =
\begin{cases}
-C_3, & \text{if } |\eta_{BL}| \geq \sigma_{\max} \\
0, & \text{otherwise}
\end{cases}
$$
where $C_3$ is a positive constant chosen for penalizing behaviors that exceed the constraint.
The terminal reward $r_{zend}$ is selected as:
$$
r_{zend} =
\begin{cases}
C_4, & \text{if } R \leq R_{stop} \\
0, & \text{otherwise}
\end{cases}
$$
where $C_4$ is a selected positive constant.
Similar to the design idea of $r_y$, the comprehensive reward function $r_z$ for the yaw channel is divided into three parts: the immediate reward function $r_{zt}$, the penalty function $r_{\sigma 2}$ for exceeding the FOV constraint, and the terminal reward function $r_{zend}$ for target hitting. In the immediate reward function $r_{zt}$, $b_3\exp(-\dot{\eta}^2/\sigma_3)$ aims to achieve a smaller LOS angular rate, while $b_4\exp(-\eta_{BL}^2/\sigma_4)$ is employed to reduce the BLOS azimuth angle. This design makes the behaviors to be constrained more explicit and accelerates the convergence rate.
Remark 3.
The training episode terminates when any of the following conditions are met: the FOV penalty reward or the terminal reward is triggered, or the relative distance between the flight vehicle and the target starts to increase. Once the episode ends, a final comprehensive reward value is obtained.
Remark 4.
Corresponding proportionality coefficients are used to strike a balance between target engagement and constraint satisfaction. As a result, the sum of the proportionality coefficients in the corresponding reward must equal one.
Remark 5.
The scaling coefficients are utilized to adjust the magnitudes of the corresponding elements. They have a pronounced impact on the convergence of DRL training. Moreover, within the function exp ( · ) , the scaling coefficients contribute to determining the magnitudes of the optimal values of the corresponding variables.
Remark 6.
Differences in training step sizes lead to varying frequencies of obtaining immediate rewards. This variation directly impacts the influence exerted by the remaining components of the comprehensive reward function on the strategy formulation. As a result, different hyperparameters are required.

3.3. DRLIGC Method

In the DRLIGC method utilizing the TD3 algorithm, the training process commences with the independent training of the pitch channel. During this initial stage, the acceleration command of the yaw channel is directly determined according to the PN guidance law. Meanwhile, the roll angle and angular velocity of the roll channel are both set to zero. Under these conditions, the yaw channel achieves relatively high approach accuracy, which provides a stable environment for the separate training of the pitch channel. After the training phase, a highly effective engagement agent for the pitch channel is successfully obtained. The PN guidance law applied to the yaw channel is presented as follows:
$$
a_z = NV\dot{\eta}
$$
where N is the proportionality coefficient.
Once the pitch-channel agent is acquired, the yaw channel is switched to normal computation mode, and the yaw-channel agent is introduced into the training process. At this stage, the training effort is concentrated exclusively on the yaw-channel agent. Through this targeted training, a yaw-channel agent that disregards the roll factor can be obtained. By integrating this yaw-channel agent with the previously obtained pitch-channel agent, a DRLIGC method that does not take the roll factors into account can be developed. This approach streamlines the system design and potentially enhances the computational efficiency while maintaining an acceptable level of performance for the engagement task. The specific training process of the DRLIGC method is illustrated in Figure 6, with the algorithm for the pitch channel shown in Figure 7.
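The channel-by-channel progressive procedure of Figure 6 can be summarized by the following sketch. The environment and agent objects are hypothetical stand-ins for the 6-DOF simulation and the TD3 agents, and the navigation gain value is a placeholder; only the overall two-stage structure is taken from the paper.

```python
def train_progressively(env, pitch_agent, yaw_agent,
                        episodes_stage1, episodes_stage2, N=4.0):
    """Channel-by-channel progressive training (hypothetical interfaces).

    Stage 1: only the pitch agent learns; the yaw channel follows the PN law
    a_z = N * V * eta_dot and the roll channel is locked at zero.
    Stage 2: the pitch agent is frozen and only the yaw agent learns, with the
    roll channel stabilized by the back-stepping law inside the environment."""
    for _ in range(episodes_stage1):
        obs, done = env.reset(), False
        while not done:
            delta_z = pitch_agent.act(obs['pitch'])        # elevator from pitch agent
            a_z = N * obs['V'] * obs['eta_dot']            # PN command for yaw channel
            obs, reward, done = env.step(pitch_cmd=delta_z, yaw_cmd=a_z, lock_roll=True)
            pitch_agent.learn(reward['pitch'])

    for _ in range(episodes_stage2):
        obs, done = env.reset(), False
        while not done:
            delta_z = pitch_agent.act(obs['pitch'])        # frozen pitch policy
            delta_y = yaw_agent.act(obs['yaw'])            # rudder from yaw agent
            obs, reward, done = env.step(pitch_cmd=delta_z, yaw_cmd=delta_y, lock_roll=False)
            yaw_agent.learn(reward['yaw'])
```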
The DRLIGC procedure for the yaw channel is similar to that of the pitch channel, so it will not be repeated here. Through the process described above, the DRLIGC method that excludes the consideration of the roll channel is obtained. Regarding the roll channel, this paper adopts an adaptive control law based on the back-stepping method to carry out roll angle control. The model of the roll channel is as follows [13]:
$$
\begin{aligned}
\dot{\gamma} &= -\omega_y \cos\gamma \tan\vartheta + \omega_z \sin\gamma \tan\vartheta + \omega_x \\
\dot{\omega}_x &= \frac{J_y - J_z}{J_x}\,\omega_y \omega_z + \frac{QSL\,m_x^{\delta_x}}{J_x}\,\delta_x + d_{\omega_x}
\end{aligned}
$$
where $d_{\omega_x}$ denotes the disturbance term caused by aerodynamic moments.
Selecting the state variables as $x_1 = \gamma$, $x_2 = \omega_x$ and the control input as $u = \delta_x$, the roll channel model can be transformed into:
$$
\begin{aligned}
\dot{x}_1 &= f_1(\vartheta, \omega_y, \omega_z) + x_2 \\
\dot{x}_2 &= f_2(\omega_y, \omega_z) + g_1 u + d_2
\end{aligned}
$$
where
$$
\begin{aligned}
f_1(\vartheta, \omega_y, \omega_z) &= -\omega_y \cos\gamma \tan\vartheta + \omega_z \sin\gamma \tan\vartheta \\
f_2(\omega_y, \omega_z) &= (J_y - J_z)\,\omega_y \omega_z / J_x \\
g_1 &= QSL\,m_x^{\delta_x} / J_x \\
d_2 &= d_{\omega_x}
\end{aligned}
$$
The roll channel IGC method presented in [13] is employed in this study and is defined as:
$$
\begin{aligned}
s_1 &= x_1 \\
x_{2d} &= -k_1 s_1 - f_1(\vartheta, \omega_y, \omega_z) \\
\tau\,\dot{x}_{2c} + x_{2c} &= x_{2d}, \qquad x_{2c}(0) = x_{2d}(0) \\
s_2 &= x_2 - x_{2c} \\
u &= g_1^{-1}\big(-f_2 - \hat{\rho} s_2 + \dot{x}_{2d} - s_1 - k_2 s_2\big) \\
\dot{\hat{\rho}} &= \lambda\big(s_2^2 - \sigma\hat{\rho}\big)
\end{aligned}
$$
where $k_1$, $k_2$, $\lambda$, and $\sigma$ are the design parameters; $\hat{\rho}$ is the estimate of $\rho$, which denotes the unknown upper bound of $d_2$ satisfying $|d_2| \leq \rho$. The proof of the stability of the roll IGC method can be found in [13]. By utilizing the above-mentioned roll IGC method, the objective of keeping the roll angle controlled around zero can be realized.
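A minimal discrete-time sketch of the roll-channel law given above is shown below; the gain values, the forward-Euler filter update, and the numerical differentiation of $x_{2d}$ are illustrative implementation choices not prescribed by the paper.

```python
def roll_backstepping_step(x1, x2, f1, f2, g1, state,
                           k1=5.0, k2=10.0, lam=0.1, sigma=0.01, tau=0.05, dt=0.01):
    """One update of the roll-channel back-stepping law.

    x1 = roll angle, x2 = roll rate; f1, f2, g1 as defined for the roll model;
    state is a dict holding the filter output x2c, the previous x2d, and the
    adaptive bound estimate rho_hat. All gains are illustrative placeholders."""
    s1 = x1
    x2d = -k1 * s1 - f1
    state['x2c'] += dt * (x2d - state['x2c']) / tau          # first-order filter
    x2d_dot = (x2d - state.get('x2d_prev', x2d)) / dt        # numerical derivative of x2d
    state['x2d_prev'] = x2d
    s2 = x2 - state['x2c']
    u = (1.0 / g1) * (-f2 - state['rho_hat'] * s2 + x2d_dot - s1 - k2 * s2)
    state['rho_hat'] += dt * lam * (s2 ** 2 - sigma * state['rho_hat'])  # adaptive law
    return u
```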
During the simulation process, when $R \geq R_d$, the flight vehicle is guided using the DRLIGC algorithm proposed in this paper. When $R < R_d$, it switches to inertial guidance, and this final segment can be regarded as an unguided section. The acceleration and its direction are provided by the acceleration measurement components inside the flight vehicle, and the subsequent motion trajectory is calculated through kinematic integration.

4. Numerical Simulation Analysis

In this section, a comprehensive account of the training scenarios, hyperparameters, and network parameters associated with the DRLIGC method is provided. Moreover, numerical simulations are carried out to verify the viability and effectiveness of this method.

4.1. DRLIGC Training Settings

To cultivate robust agents for the pitch and yaw channels within the DRLIGC method, variables are randomly initialized across a multitude of scenarios. The specific adjustments are comprehensively detailed in Table 1, from which the corresponding parameters are selected.
The flight vehicle is initially positioned at (0, 200, 0) m, with an initial velocity of 400 m/s. The initial values of both its attitude angles and angular velocities are set to 0. The target starts at the position ($X_{T0}$, 0, 0) m, with an initial velocity of ($V_{xT}$, $V_{yT}$, $V_{zT}$). The chosen maneuver acceleration is (4, 0, 2) m/s². The FOV limit $\sigma_{\max}$ of the flight vehicle's seeker is set to 7°. Taking into account the detection frequency limitation of the seeker, the training step size is set to 0.05 s, while the simulation step size is fixed at 0.01 s. The time constant $\tau_t$ is randomly selected between 0.01 and 0.2 s. Given the simulation step size of 0.01 s and considering the velocities of the flight vehicle and the target, the value of $R_{stop}$ is set to 30 m.
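The initialization described above can be collected into a small scenario-sampling sketch; the randomization bounds for the target's initial position and velocity stand in for the ranges of Table 1 and are therefore only hypothetical placeholders.

```python
import numpy as np

def sample_engagement_scenario(rng=np.random.default_rng()):
    """Randomly initialize one training episode (placeholder ranges; the actual
    ranges are those given in Table 1)."""
    return {
        'vehicle_pos': np.array([0.0, 200.0, 0.0]),                        # m
        'vehicle_speed': 400.0,                                            # m/s
        'target_pos': np.array([rng.uniform(3000.0, 5000.0), 0.0, 0.0]),   # X_T0 placeholder
        'target_vel': rng.uniform([-20.0, 0.0, -20.0], [20.0, 0.0, 20.0]), # placeholder
        'target_acc': np.array([4.0, 0.0, 2.0]),                           # m/s^2
        'fov_limit': np.deg2rad(7.0),                                      # sigma_max
        'tau_t': rng.uniform(0.01, 0.2),                                   # actuator time constant, s
        'dt_sim': 0.01,                                                    # simulation step, s
        'dt_agent': 0.05,                                                  # training step, s
        'R_stop': 30.0,                                                    # m
    }
```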
The network architectures of the pitch channel and the yaw channel are identical. Table 2 details the specific number of layers and nodes in the Actor and Critic networks. For the Actor network, the activation layer employs the tanh function. Moreover, the target network shares the same architecture as the corresponding network.
The settings of the relevant hyperparameters in the reward function and the DRL algorithm are elaborated in Table 3 and Table 4, respectively.
Table 5 provides the aerodynamic characteristics of the flight vehicle, which are derived from the data reported in [22]. Meanwhile, Table 6 provides a detailed breakdown of the relevant parameters in Equation (21).

4.2. DRLIGC Method Training

Based on the scenario settings and relevant parameter configurations in Section 4.1, the DRLIGC method is applied to carry out DRL training for the pitch channel and the yaw channel separately. The training results of the pitch channel and yaw channel are presented in Figure 8 and Figure 9, respectively.
In Figure 8 and Figure 9, the average reward is calculated as the mean of the episode rewards over a sliding window of 30 adjacent episodes. This metric serves as an indicator of the stability of the current agent.
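This averaged-reward metric corresponds to a simple sliding-window mean over the recorded episode rewards, as sketched below.

```python
import numpy as np

def moving_average(episode_rewards, window=30):
    """Mean of the episode rewards over a sliding window of `window` adjacent episodes."""
    rewards = np.asarray(episode_rewards, dtype=float)
    if rewards.size < window:
        return rewards.copy()
    kernel = np.ones(window) / window
    return np.convolve(rewards, kernel, mode='valid')
```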
We conducted an in-depth analysis of the curve in Figure 8. The entire training process for the pitch channel spans 1500 episodes. Approximately at the 200th episode, the reward reaches its peak. This suggests that the agent has discovered a strategy that can successfully approach the target within the constraints of the FOV. Around the 700th episode, the reward value begins to stabilize. By the 1300th episode, convergence is attained, signifying that a relatively superior strategy has been formulated. It can be seen from Figure 9 that the reward value hits its peak multiple consecutive times around the 400th episode for the yaw channel. This implies that, through continuous optimization, the strategy enables successful approaches in both channels. Convergence is achieved after the 1300th episode. When running on an Intel(R) Core(TM) i7-1065G7 CPU using the MATLAB R2023b Reinforcement Learning Toolbox, the total off-line training time for the pitch and yaw channels amounts to 7.3 h and 6.8 h, respectively.
An attempt was made to integrate the pitch and yaw channels into a single agent. This unified agent was designed with six inputs and two outputs, aiming to concurrently determine the rudder and elevator deflections of the flight vehicle. However, it was found that the coupling between the pitch and yaw channels was excessively severe. This high-level coupling made it extremely difficult for the agent to be trained and to converge. As a result, it was a great challenge to realize a successful approach. In contrast, the DRLIGC method proposed in this paper showcases certain advantages. It can effectively achieve target engagement, providing a more feasible and efficient solution for the engagement task.

4.3. Monte Carlo Simulation

To validate the feasibility and robustness of the DRLIGC method put forward in this paper, a Monte Carlo simulation experiment encompassing 500 episodes was carried out. In this MC simulation experiment, scenarios were randomly chosen from those elaborated in Table 1. The specific simulation results are presented in Figure 10.
In Figure 10a, the flight trajectories of the flight vehicle and the target are depicted. The solid line denotes the flight vehicle’s trajectory, while the dotted line represents that of the target. Evidently, the flight vehicle’s flight path appears normal. Figure 10b,c, respectively, exhibit the curves of the flight vehicle’s BLOS elevation angle q B L and azimuth angle η B L . When combined with Figure 11, it can be clearly observed that throughout the entire guidance process, both angles are within the FOV constraint range. Figure 10d presents the roll angle curve of the flight vehicle. It is apparent that the roll angle remains close to zero with a relatively small amplitude, thus satisfying the design requirements. Figure 10e–g, respectively, display the aileron, rudder, and elevator deflections of the flight vehicle. Figure 11c represents the miss distance. Through statistical analysis, it is found that the mean value of the miss distance amounts to 0.6731 m, and the variance is 0.0699.
To study the impact of noise on the performance of the DRLIGC method proposed in this paper, a set of simulations is conducted while taking into account the influence of noise on the observed values. Random Gaussian noise with a mean value of 0° and a variance of 0.2 was added to ( q , q B L , η , η B L ) . The simulation results are shown in Figure 12. After adding the noise, the FOV constraint can still be satisfied and the target can be successfully hit, with the miss distance being 0.70 m. The above simulation experiments further verified the feasibility of the DRLIGC method.

4.4. Comparison Study

To further substantiate the superiority of the proposed DRLIGC method, a comparative analysis is carried out in this subsection. The comparative algorithm employs the 3D IBLF-IGC algorithm presented in [22]. The specific details of the IBLF-IGC algorithm are presented as follows:
$$
\begin{aligned}
s_1 &= x_1 \\
x_{2d} &= -k_Q\big(\hat{D}_1\lambda_1 + 1\big)s_1 - k_1 s_1 \\
\tau_2\dot{x}_{2c} + x_{2c} &= x_{2d} \\
s_2 &= x_2 - x_{2c} \\
x_{3d}^{\#} &= g_2^{-1}(t)\big(-\hat{D}_2\lambda_2 s_2 - k_2 s_2 - f_2(x_2, x_3^{\#}) + \dot{x}_{2c}\big) \\
\tau_3\dot{x}_{3c}^{\#} + x_{3c}^{\#} &= x_{3d}^{\#} \\
s_3^{\#} &= x_3^{\#} - x_{3c}^{\#}, \qquad s_3 = x_3 - \big[0,\ (x_{3c}^{\#})^{T}\big]^{T} \\
x_{4d} &= g_3^{-1}(\vartheta, x_3)\big(-\hat{D}_3\lambda_3 s_3 - k_3 s_3 - f_3(x_3) + \big[0,\ (\dot{x}_{3c}^{\#})^{T}\big]^{T}\big) \\
\tau_4\dot{x}_{4c} + x_{4c} &= x_{4d} \\
s_4 &= x_4 - x_{4c} \\
u &= g_4^{-1}(t)\big(-\hat{D}_4\lambda_4 s_4 - k_4 s_4 - f_4(x_3^{\#}, x_4) + \dot{x}_{4c}\big)
\end{aligned}
$$
where $x_1 = [\eta_{BL},\ q_{BL}]^T$, $x_2 = [\dot{\eta},\ \dot{q}]^T$, $x_3^{\#} = [\beta,\ \alpha]^T$, $x_3 = [\gamma,\ \beta,\ \alpha]^T$, $x_4 = [\omega_x,\ \omega_y,\ \omega_z]^T$, $u = [\delta_x,\ \delta_y,\ \delta_z]^T$, $r = R\sin q$, and
$$
\begin{aligned}
f_2(x_2, x_3^{\#}) &= \begin{bmatrix} -\dfrac{2\dot{R}\dot{\eta}\cos q}{r} - \dfrac{2R\dot{\eta}\dot{q}\sin q}{r} + \dfrac{P\cos\alpha\sin\beta}{mr} \\[2mm] -\dfrac{2\dot{R}\dot{q}}{R} + \dfrac{\dot{\eta}^2 r\sin q}{R} + \dfrac{P\sin\alpha}{mR} \end{bmatrix}, \qquad
g_2(t) = \begin{bmatrix} \dfrac{QSc_z^{\beta}}{mr} & 0 \\ 0 & \dfrac{QSc_y^{\alpha}}{mR} \end{bmatrix}, \\
f_3(x_3^{\#}) &= \begin{bmatrix} 0 \\ \dfrac{1}{mV}\big(QSc_z^{\beta}\beta - P\cos\alpha\sin\beta\big) \\ -\dfrac{1}{mV\cos\beta}\big(P\sin\alpha + QSc_y^{\alpha}\alpha\big) \end{bmatrix}, \qquad
f_4(x_3^{\#}, x_4) = \begin{bmatrix} \dfrac{J_z - J_y}{J_x}\,\omega_y\omega_z \\ \dfrac{1}{J_y}QSL\,m_y^{\beta}\beta + \dfrac{J_x - J_z}{J_y}\,\omega_x\omega_z \\ \dfrac{1}{J_z}QSL\,m_z^{\alpha}\alpha + \dfrac{J_y - J_x}{J_z}\,\omega_x\omega_y \end{bmatrix}, \\
g_3(\vartheta, x_3) &= \begin{bmatrix} 1 & -\tan\vartheta\cos\gamma & \tan\vartheta\sin\gamma \\ \sin\alpha & \cos\alpha & 0 \\ -\tan\beta\cos\alpha & \sin\alpha\tan\beta & 1 \end{bmatrix}, \qquad
k_Q = \begin{bmatrix} \dfrac{\sigma_{\max}^2}{\sigma_{\max}^2 - x_{11}^2} & 0 \\ 0 & \dfrac{\sigma_{\max}^2}{\sigma_{\max}^2 - x_{12}^2} \end{bmatrix}, \\
g_4(t) &= \begin{bmatrix} \dfrac{1}{J_x}QSL\,m_x^{\delta_x} & 0 & 0 \\ 0 & \dfrac{1}{J_y}QSL\,m_y^{\delta_y} & 0 \\ 0 & 0 & \dfrac{1}{J_z}QSL\,m_z^{\delta_z} \end{bmatrix}, \\
\dot{\hat{D}}_1 &= \lambda_1 s_1^{T} k_Q^2 s_1 - \mu_1\hat{D}_1, \quad
\dot{\hat{D}}_2 = \lambda_2 s_2^{T} s_2 - \mu_2\hat{D}_2, \quad
\dot{\hat{D}}_3 = \lambda_3 s_3^{T} s_3 - \mu_3\hat{D}_3, \quad
\dot{\hat{D}}_4 = \lambda_4 s_4^{T} s_4 - \mu_4\hat{D}_4.
\end{aligned}
$$
The design parameters of the IBLF-IGC algorithm are $k_1 = \mathrm{diag}(0.1, 0.1)$, $k_2 = \mathrm{diag}(3, 3)$, $k_3 = \mathrm{diag}(60, 3, 3)$, $k_4 = \mathrm{diag}(60, 20, 20)$, $\tau_2 = \mathrm{diag}(0.1, 0.1)$, $\tau_3 = \mathrm{diag}(0.1, 0.1)$, $\tau_4 = \mathrm{diag}(0.05, 0.05, 0.05)$, and $\lambda_i = \mu_i = 0.1$ ($i = 1, 2, 3, 4$).
The QLF-IGC algorithm, as elaborated on in [22], is employed to illustrate the indispensability of FOV constraints within the present scenario. This algorithm bears resemblance to the IBLF-IGC algorithm yet incorporates the following modifications:
$$
\begin{aligned}
x_{2d} &= -\hat{D}_1\lambda_1 s_1 - k_1 s_1 \\
\dot{\hat{D}}_1 &= \lambda_1 s_1^{T} s_1 - k_{12}\hat{D}_1
\end{aligned}
$$
In order to further validate the robustness of the algorithm put forward in this paper, within the comparative simulation scenarios, the target maneuver is intensified to a certain degree. The specific scenarios are presented in Table 7.
The results of the comparative simulation are shown in Figure 13.
As illustrated in Figure 13, the results of the comparative simulation reveal that the DRLIGC method, along with the IBLF-IGC and QLF-IGC algorithms, all accomplished successful target approach. Specifically, the miss distances were measured at 0.76 m for the DRLIGC method, 1.18 m for the IBLF-IGC algorithm, and 0.97 m for the QLF-IGC algorithm. Significantly, both the DRLIGC and IBLF-IGC algorithms strictly complied with the FOV constraint conditions. However, the QLF-IGC algorithm failed to meet these constraints, which powerfully emphasizes the indispensability of integrating FOV constraints within the current scenario. Furthermore, during the approach phase, the IBLF-IGC algorithm prematurely encountered divergence problems. Such issues have the potential to severely undermine the guidance effectiveness in particular circumstances, thus highlighting its relatively lower stability. In marked contrast, the DRLIGC method demonstrated outstanding stability, standing out as a more reliable and robust solution in comparison to the other algorithms under study.
Within the identical scenario described in Section 4.3, 500 iterations of MC simulations are carried out for the IBLF-IGC algorithm. The obtained miss distance yielded an average value of 0.9715 m and a variance of 0.1273. This clearly indicates that the DRLIGC algorithm proposed in this study not only showcases higher approach accuracy but also exhibits superior overall performance, highlighting its effectiveness and reliability in comparison to other approaches in the context of target engagement tasks.
Remark 7.
As the approach phase nears its end, divergence becomes evident in several simulation curves. This is because as the flight vehicle approaches and passes the target, it causes the angular velocity of the line of sight to diverge. This divergence leads to a sharp increase in some physical quantities. At this moment, the relative distance between the flight vehicle and the target has become close enough, enabling a successful approach.

5. Conclusions

In addressing the IGC design challenge for STT flight vehicles in 3D scenarios, this paper proposes a novel DRLIGC method, which takes into account both the FOV constraint of the strap-down seeker and approach accuracy.
We employ a channel-separated approach to decompose the design model into three distinct channels: pitch, yaw, and roll. Specifically, the roll channel is stabilized using the back-stepping method, ensuring its stability and smooth operation, while the pitch and yaw channels are designed with the DRLIGC method based on the TD3 algorithm.
A comprehensive reward function considering the FOV constraint conditions and approach accuracy is designed. This function effectively capitalizes on the model-free advantage of the DRL algorithm. As a result, it can directly map the designed state space to the deflection angle of the corresponding channel, enabling more accurate and efficient control. Simulation results verify the effectiveness and superiority of the proposed method and reflect the broad potential of the DRL algorithm in the field of IGC design.

Author Contributions

Conceptualization, Q.Z. and B.Z.; methodology, Q.Z. and B.Z.; software, Q.Z. and B.Z.; validation, Q.Z. and Y.J.; formal analysis, Q.Z. and B.Z.; investigation, Y.J., J.Z. (Jingyan Zhang), and J.Z. (Jiale Zhang); resources, B.Z. and J.Z. (Jingyan Zhang); data curation, Q.Z.; writing—original draft preparation, Q.Z.; writing—review and editing, Q.Z., B.Z., Y.J., and J.Z. (Jiale Zhang); visualization, Q.Z. and J.Z. (Jingyan Zhang); supervision, B.Z.; project administration, B.Z.; funding acquisition, B.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (62373307).

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
2D: Two-dimensional
3D: Three-dimensional
6-DOF: Six-degree-of-freedom
BLF: Barrier Lyapunov Function
BLOS: Body-LOS
DRL: Deep reinforcement learning
DRLIGC: Deep reinforcement learning-based IGC
FOV: Field-of-view
IBLF: Integration BLF
IGC: Integrated guidance and control
LOS: Line-of-sight
MC: Monte Carlo
MDP: Markov decision process
PN: Proportional navigation
STT: Skid-to-Turn
TD3: Twin delayed deep deterministic policy gradient

References

  1. Xiao, Y.; Zhou, J.; Zhao, B. Attitude dynamics aiding for three-dimensional passive target tracking of strap-down seeker based on instrumental variable Kalman filter. Trans. Inst. Meas. Control 2020, 42, 2645–2659. [Google Scholar] [CrossRef]
  2. Ch, K.; Tahk, M. Biased PNG for target observability enhancement against nonmaneuvering targets. IEEE Trans. Aerosp. Electron. Syst. 2015, 51, 2–17. [Google Scholar]
  3. Lee, C.H.; Hyun, C.; Lee, J.G.; Choi, J.Y.; Sung, S. A hybrid guidance law for a strapdown seeker to maintain lock-on conditions against high speed targets. J. Electr. Eng. Technol. 2013, 8, 190–196. [Google Scholar] [CrossRef]
  4. Park, B.G.; Kim, T.H.; Tahk, M.J. Biased PNG with terminal-angle constraint for intercepting nonmaneuvering targets under physical constraints. IEEE Trans. Aerosp. Electron. Syst. 2017, 53, 1562–1572. [Google Scholar] [CrossRef]
  5. He, S.M.; Lin, D.F. A robust impact angle constraint guidance law with seeker’s field-of-view limit. Trans. Inst. Meas. Control 2015, 37, 317–328. [Google Scholar] [CrossRef]
  6. Kim, H.G.; Kim, H.J. Field-of-view constrained guidance law for a maneuvering target with impact angle control. IEEE Trans. Aerosp. Electron. Syst. 2020, 56, 4974–4983. [Google Scholar] [CrossRef]
  7. Wang, X.L.; Zhang, Y.A.; Wu, H.L. Sliding mode control based impact angle control guidance considering the seeker’s field-of-view constraint. ISA Trans. 2016, 61, 49–59. [Google Scholar] [CrossRef]
  8. Park, B.G.; Kim, T.H.; Tahk, M.J. Range-to-go weighted optimal guidance with impact angle constraint and seeker’s look angle limits. IEEE Trans. Aerosp. Electron. Syst. 2016, 52, 1241–1256. [Google Scholar] [CrossRef]
  9. Liu, B.J.; Hou, M.; Yu, Y.; Wu, Z.H. Three-dimensional impact angle control guidance with field-of-view constraint. Aerosp. Sci. Technol. 2020, 105, 106014. [Google Scholar] [CrossRef]
  10. Kumar, S.R.; Mukherjee, D. Three-dimensional nonsingular impact time guidance with limited field-of-view. IEEE Trans. Control Syst. Technol. 2021, 30, 1448–1459. [Google Scholar] [CrossRef]
  11. Wang, N.Y.; Wang, X.G.; Bin, E.; Li, Y. Three-dimensional adaptive dynamic surface guidance law for missile with terminal angle and field-of-view constraints. ISA Trans. 2024, 154, 113–131. [Google Scholar] [CrossRef] [PubMed]
  12. He, S.M.; Lee, C.H.; Shin, H.S.; Antonios, T. Optimal three-dimensional impact time guidance with seeker’s field-of-view constraint. Chin. J. Aeronaut. 2021, 34, 240–251. [Google Scholar] [CrossRef]
  13. Zhang, D.; Ma, P.; Du, Y.; Chao, T. Integral barrier Lyapunov function-based three-dimensional low-order integrated guidance and control design with seeker’s field-of-view constraint. Aerosp. Sci. Technol. 2021, 116, 106886. [Google Scholar] [CrossRef]
  14. Yan, H.; Tan, S.; He, Y. A small-gain method for integrated guidance and control in terminal phase of reentry. Acta Astronaut. 2017, 132, 282–292. [Google Scholar] [CrossRef]
  15. Song, H.; Zhang, T.; Zhang, G.; Lu, C. Integrated interceptor guidance and control with prescribed performance. Int. J. Robust Nonlinear Control 2015, 25, 3179–3194. [Google Scholar] [CrossRef]
  16. Hou, M.; Duan, G. Integrated guidance and control of homing missiles against ground fixed targets. Chin. J. Aeronaut. 2008, 21, 162–168. [Google Scholar]
  17. Wang, J.; Cheng, L.; Cai, Y.; Tang, G. Low-order diving integrated guidance and control for hypersonic vehicles. Aerosp. Sci. Technol. 2019, 91, 96–109. [Google Scholar] [CrossRef]
  18. Wang, J.; Cheng, L.; Cai, Y.; Tang, G. A novel reduced-order guidance and control scheme for hypersonic gliding vehicles. Aerosp. Sci. Technol. 2020, 106, 106115. [Google Scholar] [CrossRef]
  19. Ming, C.; Wang, X.; Sun, R. A novel non-singular terminal sliding mode control-based integrated missile guidance and control with impact angle constraint. Aerosp. Sci. Technol. 2019, 94, 105368. [Google Scholar] [CrossRef]
  20. Hou, M.; Liang, X.; Duan, G. Adaptive block dynamic surface control for integrated missile guidance and autopilot. Chin. J. Aeronaut. 2013, 26, 741–750. [Google Scholar] [CrossRef]
  21. Liang, L.; Zhao, B.; Zhou, J.; Zhang, Z. Impact angle controlled integrated guidance and control with input and state constraints. Int. J. Control 2024, 97, 796–810. [Google Scholar] [CrossRef]
  22. Guo, J.; Zhou, J.; Zhao, B. Three-dimensional integrated guidance and control for strap-down missiles considering seeker’s field-of-view angle constraint. Trans. Inst. Meas. Control 2020, 42, 1097–1109. [Google Scholar] [CrossRef]
  23. Zhao, B.; Feng, Z.; Guo, J. Integral barrier Lyapunov functions-based integrated guidance and control design for strap-down missile with field-of-view constraint. Trans. Inst. Meas. Control 2021, 43, 1464–1477. [Google Scholar] [CrossRef]
  24. Huang, X.; Zhao, B.; Zhou, J.; Xie, J.; Zhao, W. Integrated cooperative guidance and control of strapdown missiles. Proc. Inst. Mech. Eng. Part G J. Aerosp. Eng. 2024, 239, 528–556. [Google Scholar] [CrossRef]
  25. Chao, T.; Quan, S.; Ma, P.; Zhang, D. Three-dimensional low-order finite-time integrated guidance and control design with side-window constraint. Aerosp. Sci. Technol. 2022, 121, 107355. [Google Scholar] [CrossRef]
  26. Shalumov, V. Cooperative online guide-launch-guide policy in a target-missile-defender engagement using deep reinforcement learning. Aerosp. Sci. Technol. 2020, 104, 105996. [Google Scholar] [CrossRef]
  27. Hong, D.; Kim, M.; Park, S. Study on reinforcement learning-based missile guidance law. Appl. Sci. 2020, 10, 6567. [Google Scholar] [CrossRef]
  28. Jiang, L.; Nan, Y.; Zhang, Y.; Li, Z. Anti-interception guidance for hypersonic glide vehicle: A deep reinforcement learning approach. Aerosp. 2022, 9, 424. [Google Scholar] [CrossRef]
  29. He, S.M.; Shin, H.S.; Tsourdos, A. Computational missile guidance: A deep reinforcement learning approach. J. Aerosp. Inf. Syst. 2021, 18, 571–582. [Google Scholar] [CrossRef]
  30. Hu, Z.; Xiao, L.; Guan, J.; Yi, W.J.; Yin, H.Q. Intercept Guidance of Maneuvering Targets with Deep Reinforcement Learning. Int. J. Aerosp. Eng. 2023, 2023, 1–16. [Google Scholar] [CrossRef]
  31. Wu, M.Y.; He, X.J.; Qiu, Z.M.; Chen, Z.H. Guidance law of interceptors against a high-speed maneuvering target based on deep Q-Network. Trans. Inst. Meas. Control 2022, 44, 1373–1387. [Google Scholar] [CrossRef]
  32. Yang, Z.Q.; Liu, X.D.; Liu, H.K. Impact time control guidance law with time-varying velocity based on deep reinforcement learning. Aerosp. Sci. Technol. 2023, 142, 108603. [Google Scholar] [CrossRef]
  33. Wang, N.Y.; Wang, X.G.; Cui, N.G.; Li, Y.; Liu, B.B. Deep reinforcement learning-based impact time control guidance law with constraints on the field-of-view. Aerosp. Sci. Technol. 2022, 128, 107765. [Google Scholar] [CrossRef]
  34. Lee, S.M.; Lee, Y.J.; Kim, Y.D.; Han, Y.S.; Kwon, H.; Hong, D. Impact Angle Control Guidance Considering Seeker’s Field-of-View Limit Based on Reinforcement Learning. J. Guid. Control Dyn. 2023, 46, 2168–2182. [Google Scholar] [CrossRef]
  35. Qiu, X.; Gao, C. Three-dimensional cooperative guidance with impact angle constraints via value-policy decomposed multi-agent reinforcement learning. Aerosp. Sci. Technol. 2024, 155, 109616. [Google Scholar] [CrossRef]
  36. Fan, J.; Dou, D.; Ji, Y. Impact-Angle Constraint Guidance and Control Strategies Based on Deep Reinforcement Learning. Aerospace 2023, 10, 954. [Google Scholar] [CrossRef]
  37. Wang, W.; Wu, M.; Chen, Z.; Liu, X. Integrated Guidance-and-Control Design for Three-Dimensional Interception Based on Deep-Reinforcement Learning. Aerospace 2023, 10, 167. [Google Scholar] [CrossRef]
Figure 1. The coordinate systems related to the flight vehicle.
Figure 2. The 3D engagement geometry of the flight vehicle and target.
Figure 3. The simulation process of the 6-DOF IGC model.
Figure 4. Architecture of the TD3 algorithm.
Figure 5. The information interaction between flight vehicle and environment.
Figure 6. Training process of the pitch and yaw channels.
Figure 7. DRLIGC algorithm for pitch channel.
Figure 8. The training reward curve of the pitch channel.
Figure 9. The training reward curve of the yaw channel.
Figure 10. Results of the Monte Carlo simulation: (a) engagement trajectory; (b) the BLOS elevation angle; (c) the BLOS azimuth angle; (d) the roll angle; (e) the aileron deflection; (f) the rudder deflection; (g) the elevator deflection.
Figure 11. Monte Carlo simulation results’ statistics: (a) the maximum q_BL; (b) the maximum η_BL; (c) miss distance.
Figure 12. Simulation curves under noise: (a) engagement trajectory; (b) the BLOS elevation angle; (c) the BLOS azimuth angle; (d) the roll angle; (e) the aileron deflection; (f) the rudder deflection; (g) the elevator deflection.
Figure 13. Comparative simulation curves: (a) engagement trajectory; (b) the BLOS elevation angle; (c) the BLOS azimuth angle; (d) the roll angle; (e) the aileron deflection; (f) the rudder deflection; (g) the elevator deflection.
Table 1. Initialization range for random scenarios.
Parameter | Min  | Max  | Parameter | Min | Max
X_T0      | 2800 | 3000 | V_xT      | 30  | 40
V_yT      | 30   | 40   | V_zT      | 15  | 25
σ_max     | 7    | –    | R_stop    | 30  | –
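
As a reading aid, the following is a minimal Python sketch (assuming NumPy) of how one episode’s initial conditions could be drawn from the ranges in Table 1. The function name, dictionary keys, and the treatment of σ_max and R_stop as fixed single values are illustrative assumptions, not the authors’ code.

    import numpy as np

    # Minimal sketch: sample one random training scenario from the ranges in Table 1.
    # Names and units are illustrative assumptions, not the paper's implementation.
    def sample_scenario(rng: np.random.Generator) -> dict:
        return {
            "X_T0": rng.uniform(2800.0, 3000.0),   # initial target position
            "V_xT": rng.uniform(30.0, 40.0),       # target velocity components
            "V_yT": rng.uniform(30.0, 40.0),
            "V_zT": rng.uniform(15.0, 25.0),
            "sigma_max": 7.0,                      # FOV limit (single value in Table 1)
            "R_stop": 30.0,                        # termination range (single value in Table 1)
        }

    rng = np.random.default_rng(seed=0)
    episode_init = sample_scenario(rng)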
Table 2. Network layer size.
Layer          | Actor Network            | Critic Network
Input layer    | 3 (the dimension of s)   | 4 (the dimension of s + the dimension of a)
Hidden layer 1 | 64                       | 64
Hidden layer 2 | 128                      | 128
Hidden layer 3 | 128                      | 128
Hidden layer 4 | 64                       | 64
Output layer   | 1 (the dimension of a)   | 1 (the scalar Q-value)
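
To make the layer sizes in Table 2 concrete, the following is a minimal PyTorch-style sketch of actor and critic networks with the listed widths (64–128–128–64). The class names, activation functions (ReLU and tanh), and output scaling are assumptions, since the paper does not specify them in this table.

    import torch
    import torch.nn as nn

    class Actor(nn.Module):
        # Maps the 3-dimensional channel state s to a single normalized action.
        def __init__(self, state_dim: int = 3, action_dim: int = 1):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, 64), nn.ReLU(),
                nn.Linear(64, 128), nn.ReLU(),
                nn.Linear(128, 128), nn.ReLU(),
                nn.Linear(128, 64), nn.ReLU(),
                nn.Linear(64, action_dim), nn.Tanh(),  # normalized action in [-1, 1]
            )

        def forward(self, s: torch.Tensor) -> torch.Tensor:
            return self.net(s)

    class Critic(nn.Module):
        # Takes the concatenated state-action pair (dimension 3 + 1 = 4) and returns Q(s, a).
        def __init__(self, state_dim: int = 3, action_dim: int = 1):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                nn.Linear(64, 128), nn.ReLU(),
                nn.Linear(128, 128), nn.ReLU(),
                nn.Linear(128, 64), nn.ReLU(),
                nn.Linear(64, 1),  # scalar Q-value estimate
            )

        def forward(self, s: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
            return self.net(torch.cat([s, a], dim=-1))

With a state dimension of 3 and an action dimension of 1, the critic’s input layer receives the concatenated state–action vector of size 4, matching the table.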
Table 3. Hyper-parameters in shaping the reward function.
b_1 | b_2 | b_3 | b_4 | σ_1  | σ_2  | σ_3  | σ_4  | C_1 | C_2 | C_3 | C_4
0.8 | 0.2 | 0.8 | 0.2 | 0.04 | 0.04 | 0.04 | 0.04 | 50  | 50  | 50  | 50
Table 4. Algorithm hyperparameter settings.
Parameter               | Value     | Parameter                      | Value
Discount factor         | 0.99      | L2 regularization constant     | 5 × 10⁻⁴
Experience pool size    | 10⁶       | Upper bound of the gradient    | 1
Sample size             | 256       | Average attraction constant    | 0.7
Actor learning rate     | 1 × 10⁻⁴  | Policy frequency               | 2
Critic learning rate    | 1 × 10⁻⁴  | Soft update rate κ             | 5 × 10⁻³
Policy noise variance   | 0.1       | Ornstein–Uhlenbeck (OU) noise  | 0.2
Maximum noise value     | 0.1       | Type of optimization algorithm | Adam
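
For orientation only, the snippet below collects the entries of Table 4 into a single configuration object and shows the conventional way a soft update rate κ is applied to target-network parameters in TD3. All key and function names are illustrative assumptions rather than a specific library’s API.

    # Illustrative collection of the TD3 hyper-parameters from Table 4 (key names assumed).
    td3_config = {
        "discount_factor": 0.99,
        "l2_regularization": 5e-4,
        "replay_buffer_size": int(1e6),
        "gradient_clip": 1.0,
        "batch_size": 256,
        "average_attraction_constant": 0.7,
        "actor_learning_rate": 1e-4,
        "policy_update_frequency": 2,       # delayed actor/target updates
        "critic_learning_rate": 1e-4,
        "soft_update_rate_kappa": 5e-3,
        "policy_noise_variance": 0.1,
        "ou_exploration_noise": 0.2,        # Ornstein-Uhlenbeck noise scale
        "max_noise_value": 0.1,             # clipping bound on smoothing noise
        "optimizer": "Adam",
    }

    def soft_update(target_params, online_params, kappa=td3_config["soft_update_rate_kappa"]):
        # Conventional Polyak update, assuming PyTorch-style parameters:
        # theta_target <- kappa * theta + (1 - kappa) * theta_target
        for tp, p in zip(target_params, online_params):
            tp.data.copy_(kappa * p.data + (1.0 - kappa) * tp.data)

With κ = 5 × 10⁻³ and a policy frequency of 2, the actor and target networks are updated less frequently and more gently than the critics, which is the delayed-update mechanism characteristic of TD3.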
Table 5. The aerodynamic characteristics of the flight vehicle.
Name    | Value  | Name    | Value  | Name    | Value
M       | 1200   | m_y^β   | −27.31 | c_x^δx  | 0.05
P       | 0      | m_y^δy  | −26.57 | c_x^δy  | 0.07
J_x     | 100    | m_x^δx  | 2.12   | c_x^δz  | 0.06
J_y     | 5700   | m_x^α   | 0.46   | c_y^α   | 57.16
J_z     | 5600   | m_x^β   | −0.37  | c_y^δz  | 5.74
S       | 0.42   | c_x0    | 0.32   | c_y^β   | −0.08
L       | 0.68   | c_x^α   | 0.21   | c_z^β   | −56.31
m_z^α   | −28.16 | c_x^β   | 0.19   | c_z^δy  | −5.62
m_z^δz  | −27.92 | c_x^αβ  | 25.38  | c_z^α   | 0.09
Table 6. The parameters of the roll channel.
k_1 | k_2 | τ    | λ   | σ
60  | 60  | 0.01 | 0.1 | 0.1
Table 7. Simulation scenario of the comparative study.
X_T0 | V_xT | V_yT | V_zT | σ_max | a_xT | a_yT | a_zT | τ
3000 | 40   | 30   | 20   | 7     | 4    | 2    | 3    | 0.1