Article

A Deep Reinforcement Learning-Based Cooperative Guidance Strategy Under Uncontrollable Velocity Conditions

School of Astronautics, Northwestern Polytechnical University, Xi’an 710072, China
* Author to whom correspondence should be addressed.
Aerospace 2025, 12(5), 411; https://doi.org/10.3390/aerospace12050411
Submission received: 10 February 2025 / Revised: 21 April 2025 / Accepted: 23 April 2025 / Published: 6 May 2025
(This article belongs to the Section Aeronautics)

Abstract

We present a novel approach to generating a cooperative guidance strategy using deep reinforcement learning to address the challenge of cooperative multi-missile strikes under uncontrollable velocity conditions. This method employs the multi-agent proximal policy optimization (MAPPO) algorithm to construct a continuous action space framework for intelligent cooperative guidance. A heuristically reshaped reward function is designed to enhance cooperative guidance among agents, enabling effective target engagement while mitigating the low learning efficiency caused by sparse reward signals in the guidance environment. Additionally, a multi-stage curriculum learning approach is introduced to smooth agent actions, effectively reducing action oscillations arising from independent sampling in reinforcement learning. Simulation results demonstrate that the proposed deep reinforcement learning-based guidance law can successfully achieve cooperative attacks across a range of randomized initial conditions.

1. Introduction

Recent improvements in missile fleet operations have been driven by better data exchange and information sharing, enabling missions that would be impossible for a single missile to carry out [1,2,3]. In both current and future aerial combat, targets often exhibit varying levels of intelligence and autonomous decision-making, which presents new challenges for traditional air-to-air missile guidance systems. To address these challenges, research on guidance strategies based on intelligent algorithms is both academically and practically important, and the use of artificial and collective intelligence technologies has the potential to significantly enhance collaborative warfare capabilities.
Significant progress has been made in the field of collaborative warfare, particularly in the context of cooperative attack missions. Generally speaking, missile guidance laws for cooperative attacks are typically divided into static cooperative guidance laws and dynamic cooperative guidance laws. Static cooperative guidance laws, such as Impact Time Control Guidance (ITCG) [4], are time control strategies designed for individual missiles. They ensure simultaneous target engagement by setting a common impact time for all missiles. Since these laws do not rely on missile-to-missile communication, they are susceptible to external interference and variations in the missile state. The ITCG designed in Ref. [5] used proportional navigation guidance (PNG) and impact time error feedback to enable cooperative attacks. Later advancements in static cooperative guidance incorporated optimal control theory, sliding mode control [6,7], and trajectory shaping [8]. Refs. [9,10] enhanced ITCG by refining residual time estimation and using nonlinear time-varying guidance gains. Ref. [11] avoided residual time estimation by modeling the missile-to-target distance as a polynomial function. To improve the missile lethality, angle constraints were introduced, requiring strikes at both a specific time and angle. For instance, Refs. [12,13] proposed a two-stage strategy for stationary targets, while [14] designed a law to meet impact time, angle, and overload constraints under varying missile speeds. Further studies in Refs. [15,16,17] focused on shaping strategies to control both the time and angle. However, the current static guidance laws rely on fixed models and predefined parameters, limiting their adaptability in dynamic environments like electronic warfare. Additionally, missile speed constraints and flight performance limits may render certain impact times infeasible, potentially causing mission failures. As an open-loop approach, ITCG is less adaptable to these dynamic scenarios [18].
On the other hand, dynamic cooperative guidance methods synchronize missile impact times through network communication. Researchers have proposed various distributed strategies, especially for stationary targets. For example, Ref. [19] proposed a two-dimensional impact time control cooperative guidance law for constant velocities and a 3D version for time-varying velocities. Ref. [20] proposed optimal cooperative guidance strategies for aircraft defense with constraints on the impact angle. Ref. [21] introduced a distributed law for cooperative attacks under fixed and changing communication topologies, while Ref. [22] presented a fixed-time consensus strategy that uses sliding mode control to transform the attack problem into a consensus issue on directed cyclic graphs, ensuring convergence within a fixed time using heterogeneous gains. Ref. [23] designed an optimal cooperative guidance law intended to protect a target from a guided missile. In Ref. [24], a distributed guidance law for multi-missile cooperative attacks on stationary targets was developed, using two forms of residual flight time estimation to achieve flight time consensus within a limited time, reducing the communication burdens. Ref. [25] used the desired LOS angles for cooperative attacks under directed communication topologies, ensuring consistent impact times and LOS angle convergence. Ref. [26] combined PNG with consensus theory to synchronize missile flight times, enhancing the robustness against input delays and topology changes. Inspired by multi-agent systems, Ref. [27] introduced cooperative geometric principles for simultaneous target interception. Ref. [28] presented a novel cooperative and predictive guidance law to intercept high-speed, highly maneuverable targets using less capable interceptors.
The rapid advancements in artificial intelligence have significantly accelerated the development of intelligent guidance systems, particularly through the integration of advanced technologies such as reinforcement learning. These innovations have enhanced the adaptability and decision-making capabilities of missiles in complex battlefield environments, offering substantial advantages and considerable research potential [29]. In Refs. [30,31,32,33], the combination of traditional guidance laws with reinforcement learning was employed to improve the guidance performance, enabling agents to plan strategies in real time and effectively navigate the complexities and uncertainties of the environment. However, these methods remain somewhat reliant on the framework of conventional guidance laws. In the context of multi-missile and multi-target engagement, Ref. [34] proposed an integrated engagement framework that includes an evaluation model, intelligent assignment, and cooperative interception. The intelligent assignment strategy uses reinforcement learning for optimal target–missile pairing, while cooperative interception techniques ensure the effectiveness and precision of the interception mission. Nonetheless, the reinforcement learning approach in this case does not incorporate the design process of missile guidance laws. For intelligent cooperative active defense scenarios, Ref. [35] introduces an intelligent defense guidance strategy that effectively coordinates the maneuvers of both the defender and the target with minimal prior knowledge and observational noise, leading to successful evasion. Furthermore, Ref. [36] presents a decentralized multi-agent deep reinforcement learning method for cooperative tracking strategies in UAV swarms, employing maximum mutual reward learning and utilizing the multi-agent actor–critic algorithm to develop a collaborative sharing strategy among homogeneous UAVs. In Ref. [37], a novel intelligent differential game guidance law is proposed in the continuous action domain, based on deep reinforcement learning, to enable the intelligent interception of various types of maneuvering evaders. Although these studies demonstrate the feasibility of applying reinforcement learning to intelligent guidance law design, the use of reinforcement learning in addressing multi-missile cooperative guidance problems remains relatively rare. Given the highly nonlinear and dynamically evolving nature of such systems, existing technologies may require further innovation and significant breakthroughs to fully realize their potential.
Reinforcement learning technology has demonstrated exceptional performance in addressing complex, dynamic, and uncertain tasks. In this paper, we propose a cooperative guidance strategy for maneuvering targets based on deep reinforcement learning, referred to as Reinforcement Learning Cooperative Guidance (RLCG). This approach employs the multi-agent proximal policy optimization (MAPPO) algorithm to construct guidance policies within a continuous action space, with the goal of enhancing the accuracy of cooperative attacks against targets. To mitigate the issue of sparse reward signals during the guidance process, a heuristically reshaped reward function is introduced. Additionally, by adopting a multi-stage curriculum training framework, the training efficiency and stability of the RLCG algorithm are significantly improved, while also effectively addressing the action jitter problem caused by independent sampling in reinforcement learning.

2. Problem Statement

The tracking problem between the flying vehicle and the target can be simplified by decoupling the cross-coupling between the two orthogonal components, thereby decomposing the line-of-sight rate vector into two cross-range planes of the missile [38]. Consequently, the primary focus of this paper is the design and validation of cooperative guidance laws within a two-dimensional plane. This approach, however, can be extended to multi-agent cooperative guidance scenarios in a three-dimensional space. The target is assumed to be a dynamic multirotor UAV with a size ranging from 3 to 5 m. The motion dynamics of multi-agent cooperative guidance are illustrated in Figure 1.
In Figure 1, $v_M$ and $v_T$ represent the velocities of the flying vehicle and the target, respectively. $\lambda$ denotes the line-of-sight (LOS) angle between the vehicle and the target, and $\dot{\lambda}$ is its rate of change. $\theta_M$ and $\theta_T$ correspond to the flight-path angles of the flying vehicle and the target, respectively. $\eta_M$ and $\eta_T$ indicate the lead angles of the vehicle and the target. $a_M$ and $a_T$ represent the normal accelerations of the vehicle and the target, respectively. $r$ is the distance between the vehicle and the target.
It is assumed that the flying vehicle’s velocity magnitude remains constant. In other words, the direction of its velocity is controllable, while its axial velocity is uncontrollable. Based on the relevant principles of kinematics, the planar motion equations for both the vehicle and the target are given as follows:
$$\dot{x} = v\cos\theta, \qquad \dot{y} = v\sin\theta, \qquad \dot{\theta} = a/v$$
Since the missile and the target follow the same kinematic model, a one-to-one vehicle–target guidance model is developed. The relative kinematic equations for the vehicle and target are as follows:
$$r = \sqrt{(x_T - x_M)^2 + (y_T - y_M)^2}, \qquad \lambda = \arctan\!\left(\frac{y_T - y_M}{x_T - x_M}\right)$$
$$\dot{r} = v_T\cos(\lambda - \theta_T) - v_M\cos(\lambda - \theta_M), \qquad r\dot{\lambda} = v_M\sin(\lambda - \theta_M) - v_T\sin(\lambda - \theta_T)$$
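As an illustration of Equations (1) and (2), the following Python sketch advances a single vehicle with a simple Euler step and evaluates the relative kinematics. It is an illustrative implementation only: the function and variable names are our own, and the 0.05 s default step size simply matches the simulation step quoted in Section 4.

```python
import numpy as np

def step_planar(state, accel, v, dt=0.05):
    """Advance one planar vehicle (missile or target) by one Euler step.

    state : (x, y, theta) position [m] and flight-path angle [rad]
    accel : normal acceleration a [m/s^2], as in Equation (1)
    v     : constant speed [m/s] (the velocity magnitude is uncontrollable)
    """
    x, y, theta = state
    x += v * np.cos(theta) * dt
    y += v * np.sin(theta) * dt
    theta += (accel / v) * dt          # theta_dot = a / v
    return np.array([x, y, theta])

def relative_kinematics(missile, target, v_m, v_t):
    """Relative range, LOS angle, and their rates from Equation (2)."""
    xm, ym, th_m = missile
    xt, yt, th_t = target
    r = np.hypot(xt - xm, yt - ym)
    lam = np.arctan2(yt - ym, xt - xm)
    r_dot = v_t * np.cos(lam - th_t) - v_m * np.cos(lam - th_m)
    lam_dot = (v_m * np.sin(lam - th_m) - v_t * np.sin(lam - th_t)) / r
    return r, lam, r_dot, lam_dot
```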

3. Design Procedure for Planar Guidance Strategies

3.1. Multi-Agent Proximal Policy Optimization

Reinforcement learning is a machine learning approach that trains an agent to make decisions through interaction with its environment. Among the various reinforcement learning algorithms, the MAPPO algorithm stands out as a sophisticated method specifically designed for multi-agent systems. MAPPO is an extension of the proximal policy optimization (PPO) algorithm and is particularly effective in handling continuous action spaces. Like PPO, MAPPO uses an actor–critic architecture, where the policy network generates actions based on current observations, and the value network estimates the state value to guide policy updates. The key distinction is that, in MAPPO, the value network learns a centralized value function, meaning that it has access to global information, including data about other agents and the environment. The central idea of the MAPPO algorithm is to enhance the policy using proximal policy optimization, which promotes more stable and efficient learning [39]. The action space in MAPPO can be either continuous or discrete, with this paper adopting a continuous action space. The objective of the MAPPO algorithm is to maximize the expected reward, using a proximal policy optimization objective function that takes into account the magnitude of policy changes during updates to prevent excessive alterations, thereby contributing to the stability of the learning process [40].
During the network update training process, the MAPPO algorithm samples historical experience data from the experience replay pool, uses these data to compute the loss function, and then optimizes and updates the network parameters accordingly. The weight parameters for the action network and the critic network are denoted as θ and φ , respectively. The action network is represented as π θ , and the loss functions for both the action network and the critic network are expressed as follows:
$$L(\theta) = \left[\frac{1}{Bn}\sum_{i=1}^{B}\sum_{k=1}^{n}\min\!\left(r_{\theta,i}^{(k)}A_i^{(k)},\ \mathrm{clip}\!\left(r_{\theta,i}^{(k)},\,1-\varepsilon,\,1+\varepsilon\right)A_i^{(k)}\right)\right] + \sigma\,\frac{1}{Bn}\sum_{i=1}^{B}\sum_{k=1}^{n}S\!\left[\pi_\theta\!\left(o_i^{(k)}\right)\right]$$
$$L(\varphi) = \frac{1}{Bn}\sum_{i=1}^{B}\sum_{k=1}^{n}\left(V_\varphi\!\left(s_i^{(k)}\right) - \hat{R}_i\right)^2$$
In the formula, $r_{\theta,i}^{(k)} = \pi_\theta(a_i^{(k)} \mid o_i^{(k)}) / \pi_{\theta_{\mathrm{old}}}(a_i^{(k)} \mid o_i^{(k)})$ is the probability ratio between the current and the old policy, $A_i^{(k)}$ is calculated using the generalized advantage estimation (GAE) method, $S$ represents the policy entropy, $\sigma$ is the entropy coefficient hyperparameter, $B$ is the sample batch size, $n$ is the number of agents, and $\hat{R}_i$ is the discounted return. The purpose of $\mathrm{clip}(\cdot)$ is to optimize the policy while limiting the magnitude of policy updates, preventing large updates that could cause drastic changes in the policy; this stabilizes the algorithm and helps it converge to a better policy.
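As a minimal sketch of how the losses in Equations (3) and (4) can be evaluated, the PyTorch fragment below computes the clipped surrogate objective and the centralized value loss over a flattened batch of agent samples. The function name, the clipping range of 0.2, and the entropy coefficient of 0.01 are illustrative assumptions rather than values reported in this paper.

```python
import torch

def mappo_losses(logp_new, logp_old, adv, values, returns,
                 entropy, eps=0.2, ent_coef=0.01):
    """Clipped surrogate (Eq. (3)) and value loss (Eq. (4)) for a batch.

    logp_new / logp_old : log pi_theta(a|o) under the current / old policy
    adv                 : GAE advantages A
    values, returns     : centralized critic outputs V_phi(s) and discounted returns R_hat
    entropy             : policy entropy S[pi_theta(o)]
    All tensors are flattened over the B samples and n agents.
    """
    ratio = torch.exp(logp_new - logp_old)               # r_theta
    surr1 = ratio * adv
    surr2 = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    # Negate the objective of Equation (3) so that gradient descent maximizes it.
    actor_loss = -(torch.min(surr1, surr2).mean()
                   + ent_coef * entropy.mean())
    critic_loss = ((values - returns) ** 2).mean()        # Equation (4)
    return actor_loss, critic_loss
```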

3.2. Training Procedure of MAPPO Cooperative Guidance Strategy

In multi-agent cooperative confrontation scenarios, challenges such as sparse rewards, complex state spaces, large action spaces, and instability due to independent sampling at each step arise. Directly training cooperative strategies not only makes it difficult for agents to choose appropriate actions from a vast action space based on sparse rewards but also requires considerable time. To address these issues of sparse rewards and complex action spaces in multi-agent cooperative confrontations, we propose a stable convergence method for intelligent cooperative guidance strategies, which leverages multi-agent reward function reshaping and multi-stage curriculum learning techniques.

3.2.1. Distance-Based Heuristic Reward Reshaping Method

The reward function provides immediate feedback to an agent following a specific action, and its design is crucial for the success of reinforcement learning algorithms. In this paper, we construct a reward function grounded in information theory principles. The distance-based heuristic reward function dynamically adjusts the reward signal based on the current vehicle–target distance. This relative position indicator is vital in assessing the effectiveness of cooperative attack strategies. The core idea of this method is to guide the agent to select better strategies based on its environmental observations [41]. By utilizing the relative distances between the target and each agent at time t , we shape a continuously evolving reward function that adapts to the state. The continuous reward function for the cooperative confrontation process is expressed as follows:
$$R_{dist} = \begin{cases} \displaystyle\sum_{i,j}\left(k_1 - d_i(t)/k_2\right), & t = 0 \\[2ex] \displaystyle\sum_{i,j}\left[\left(k_1 - d_i(t)/k_2\right) - \left(k_1 - d_i(t-1)/k_2\right)\right], & t > 0 \end{cases}$$
where $d_i(t)$ and $d_j(t)$ are the relative distances between the target and flying vehicles $i$ and $j$, respectively, and the summation is taken over the two vehicles.
The terminal discrete reward function is defined as follows:
$$R_{ter} = \begin{cases} k_3, & d_{i,j}(t) \le d_z \\ 0, & \text{otherwise} \end{cases}$$
where $d_z$ represents the maximum allowable miss distance, and $k_3$ is a constant. That is, if at time $t$ both agents hit the target within the miss distance range, each agent receives a terminal reward of $k_3$, indicating that the two agents have coordinated with a time error of 0 s; otherwise, the terminal reward is 0. Before a successful strike, this term therefore incentivizes the agents to hit the target simultaneously in pursuit of the higher reward.
The overload constraint reward is defined as follows:
$$R_{act} = -\sum_{i,j}\left[k_4\left(a(t) - a(t-1)\right)^2 + k_5\,a(t)^2\right]$$
where $k_4$ and $k_5$ are constants, $a(t)$ represents the current overload, and $a(t-1)$ denotes the overload at the previous moment. The overload constraint reward consists of two components: the term $k_4\left(a(t)-a(t-1)\right)^2$ penalizes sudden changes in action, reducing agent jitter by limiting the difference between the current and previous actions, while $k_5\,a(t)^2$ penalizes the agent's energy consumption during flight. Through these two terms, the mechanism ensures that the agent acts more smoothly during the cooperative attack and improves the overall operational efficiency.
Incorporating the three aforementioned rewards, the overall reward function is as follows:
$$r = R_{dist} + R_{ter} + R_{act}$$
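A minimal sketch of the composite reward in Equations (5)-(8) is given below, assuming the sign conventions reconstructed above (in particular, that the overload term of Equation (7) enters as a penalty). The k-coefficients default to the Stage 3 values of Table 1, and all names are illustrative.

```python
import numpy as np

def composite_reward(d, d_prev, a, a_prev, hit, t,
                     k1=5.0, k2=100.0, k3=5.0, k4=0.1, k5=0.1):
    """Composite reward r = R_dist + R_ter + R_act (Equations (5)-(8)).

    d, d_prev : current / previous missile-target distances for the two agents
    a, a_prev : current / previous normal accelerations of the two agents
    hit       : True if both miss distances are within d_z at time t
    """
    d, d_prev = np.asarray(d), np.asarray(d_prev)
    a, a_prev = np.asarray(a), np.asarray(a_prev)

    if t == 0:                                    # Equation (5), first step
        r_dist = np.sum(k1 - d / k2)
    else:                                         # shaped difference for t > 0
        r_dist = np.sum((k1 - d / k2) - (k1 - d_prev / k2))

    r_ter = k3 if hit else 0.0                    # Equation (6), terminal reward

    # Equation (7): penalize action jitter and control effort
    r_act = -np.sum(k4 * (a - a_prev) ** 2 + k5 * a ** 2)

    return r_dist + r_ter + r_act                 # Equation (8)
```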

3.2.2. Multi-Stage Curriculum Training Framework

To ensure the stable and rapid convergence of the algorithm, while addressing the issue of action jitter caused by independent sampling at each step in traditional reinforcement learning, we introduce a multi-stage curriculum training method. This framework is based on the core principles of curriculum learning, gradually increasing the task difficulty to help the agent to improve its performance progressively until the desired outcome is achieved.
Direct training may make it challenging for the algorithm to converge while ensuring that agents balance both overload and energy consumption during cooperative operations. To resolve this, we propose a three-stage training process, progressing from simple to complex.
  • Stage 1: The focus is on obtaining a cooperative confrontation strategy for the agents, without considering energy consumption or action jitter.
  • Stage 2: An overload constraint reward mechanism is introduced to ensure smooth agent actions during the cooperative attack, preventing jitter.
  • Stage 3: This stage incorporates the agent energy consumption during the cooperative confrontation, preventing excessive energy use and optimizing the overall operational efficiency.
The key to effective reinforcement learning training lies in the design of the reward function, since the magnitude of the reward determines whether the current task is judged to be successfully completed. By adjusting the reward function parameters for each stage, the three training stages are realized systematically, as outlined in Table 1. This multi-stage curriculum learning method provides an effective approach to guiding the agent toward appropriate action choices in a complex multi-agent cooperative confrontation environment.
To evaluate the agent’s training progress, we use a metric σ , and the calculation formula for σ is as follows:
$$\sigma = \frac{1}{N}\sum_{i=1}^{N}\left|R_i - \mu\right|$$
where $R_i$ denotes the cumulative reward of the $i$th episode, $N$ is the sampling length (with $N = 10$), and $\mu = \frac{1}{N}\sum_{i=1}^{N} R_i$ is the mean cumulative reward of the samples.
When $\sigma \le \varepsilon$, the training of the multi-agent cooperative guidance strategy progresses to the next stage, where $\varepsilon$ is a small positive threshold ($\varepsilon \ll 1$).
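The stage-advance test can be implemented as in the sketch below, which reads Equation (9) as the mean absolute deviation of the last N episode rewards; the threshold passed as eps is illustrative, since the text only requires it to be small.

```python
import numpy as np

def stage_converged(episode_rewards, window=10, eps=0.5):
    """Check the stage-advance criterion sigma <= eps (Equation (9)).

    episode_rewards : list of cumulative rewards per episode
    window          : sampling length N (the paper uses N = 10)
    eps             : small threshold; the value 0.5 here is illustrative
    Returns True when the mean absolute deviation of the last N episode
    rewards falls below eps, i.e. training has stabilized in this stage.
    """
    if len(episode_rewards) < window:
        return False
    recent = np.asarray(episode_rewards[-window:])
    mu = recent.mean()
    sigma = np.abs(recent - mu).mean()
    return sigma <= eps
```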

3.2.3. Planar Multi-Agent Cooperative Guidance Law Design

This section discusses the design of a planar cooperative guidance law (RL-CG) for maneuvering targets based on the MAPPO algorithm. Since the missile’s speed is uncontrollable, it can only adjust its flight trajectory to control the flight time. Therefore, the primary goal of the guidance law is to ensure a minimal miss distance when striking the target, making the relative state change between the missile and target particularly important.
In a multi-missile cooperative combat scenario where the speed is uncontrollable, the agent needs a larger exploration space to identify the optimal cooperative guidance strategy. To facilitate this, the missile’s normal acceleration is directly controlled, and the action space is defined as continuous. The MAPPO algorithm, which is well suited for continuous action spaces in multi-agent systems, enables the search for the optimal solution within a broader range. With multiple agents involved, the action space becomes a continuous multidimensional space, allowing for the design of the cooperative guidance law as follows:
$$a_M = [a_i, a_j], \quad i \ne j \ \ \text{and} \ \ a_i, a_j \in [-20g,\ 20g]$$
In a planar multi-missile environment, it is essential for the missiles to fully perceive the entire state of the environment during the guidance process. When defining the state space, the relative states of both the missile and the target must be considered. As such, the missile’s observation is defined as follows:
$$o = [x_s, y_s, v_x, v_y]$$
where $x_s = x_T - x_M$ represents the relative position between the missile and the target along the x-axis, $y_s = y_T - y_M$ represents the relative position along the y-axis, and $v_x$ and $v_y$ represent the components of the relative velocity between the missile and target along the x- and y-axes, respectively. The global state space is constructed by concatenating the observations of both missiles, which is expressed as
$$S = [o_i, o_j], \quad i \ne j$$
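The sketch below assembles the continuous action, per-missile observation, and global state of Equations (10)-(12); the dictionary fields pos and vel used to carry missile and target states are illustrative assumptions.

```python
import numpy as np

G = 9.81  # gravitational acceleration [m/s^2]

def missile_observation(missile, target):
    """Per-missile observation o = [x_s, y_s, v_x, v_y] (Equation (11))."""
    xs = target["pos"][0] - missile["pos"][0]     # x_s = x_T - x_M
    ys = target["pos"][1] - missile["pos"][1]     # y_s = y_T - y_M
    vx = target["vel"][0] - missile["vel"][0]     # relative velocity components
    vy = target["vel"][1] - missile["vel"][1]
    return np.array([xs, ys, vx, vy])

def global_state(obs_i, obs_j):
    """Global state S = [o_i, o_j] for the centralized critic (Equation (12))."""
    return np.concatenate([obs_i, obs_j])

def clip_action(a):
    """Keep commanded normal accelerations within [-20g, 20g] (Equation (10))."""
    return np.clip(a, -20.0 * G, 20.0 * G)
```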
During each round of cooperative engagement, the agent model first receives the current environmental state observation $o_t$. Using this information, the agent generates actions $a_t$ based on its policy function $\pi_\theta(a_t \mid o_t)$. In parallel, the environment provides an immediate reward $r_t$ based on the agent's actions, and the system transitions to a new state according to the transition probability $P(s_{t+1} \mid s_t, a_t)$. After receiving the new state $s_{t+1}$ and reward $r_{t+1}$, the policy model updates to select actions that will yield higher rewards in future rounds. This iterative process forms a dynamic optimization loop, where the agent continuously refines its policy based on cumulative feedback. The process continues until a predefined termination condition is met, signaling the end of the agent's interaction with the environment. The MAPPO algorithm, as an effective policy optimization approach, is visually represented by the flowchart in Figure 2.

4. Numerical Results and Analysis

The simulation experiments are conducted on a shared computing platform in Xi’an equipped with an Intel(R) Xeon(R) Gold 5118 CPU, 64 GB of RAM, and an NVIDIA GeForce RTX 3090 GPU. The simulation environment is developed in Python and run in PyCharm Community Edition 2024.2.4, with Anaconda3 used for package management.
To evaluate the effectiveness of the reinforcement learning-based cooperative guidance law, a missile cooperative engagement scenario is designed, as shown in Figure 3. In this scenario, two missiles are tasked with intercepting a maneuvering target in the same plane. As shown in Table 2, the missiles are launched from random positions within a specified range, while the target’s initial coordinates are also set at random positions, with an initial pitch angle of 0°. The simulation time step is 0.05 s. The target’s acceleration is assumed to be perpendicular to its velocity, with the magnitude defined as
$$a_y = 0.05\,v_T, \qquad a_x = 0$$
where v T = 210 m/s is the initial velocity of the target. Additional parameters of the target can be found in Table 2.
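Under the reading of Equation (13) adopted above, the target applies a constant acceleration of 0.05 v_T perpendicular to its velocity and none along it, as in this small illustrative helper.

```python
def target_acceleration(v_t0=210.0):
    """Target maneuver of Equation (13): constant acceleration 0.05 * v_T
    perpendicular to the velocity, with no along-velocity component."""
    a_normal = 0.05 * v_t0      # about 10.5 m/s^2 for v_T = 210 m/s
    a_tangential = 0.0
    return a_normal, a_tangential
```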

4.1. Verification of the Multi-Stage Curriculum Training Method

The multi-stage curriculum learning training setup is outlined in Table 2, with a maximum of 500 steps per episode. The training batch size is 64, and the mini-batch size is 2. The learning rate for both the actor and critic networks is set to 3 × 10⁻⁴, while the discount factor is γ = 0.99.
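For reference, the hyperparameters quoted above can be collected into a configuration dictionary such as the following; the key names mimic common MAPPO implementations, and the clipping range and GAE parameter are typical defaults that are not stated in this paper.

```python
# Illustrative MAPPO training configuration mirroring the values quoted above.
# clip_eps and gae_lambda are common defaults and are NOT reported in the paper.
train_config = {
    "max_episode_steps": 500,
    "batch_size": 64,
    "mini_batch": 2,        # quoted as the mini-batch size in the text
    "actor_lr": 3e-4,
    "critic_lr": 3e-4,
    "gamma": 0.99,          # discount factor
    "clip_eps": 0.2,        # assumed PPO clipping range
    "gae_lambda": 0.95,     # assumed GAE parameter
}
```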
The reward convergence curve obtained using the multi-stage curriculum training approach is presented in Figure 4. The success rates of cooperative attacks in the three stages are shown in Figure 5. In Stage 1, the MAPPO algorithm is trained for up to 7 × 10⁶ steps, with the reward curve during this stage shown in Figure 4a. Initially, the reward fluctuates considerably as the agents explore the environment up to 6 × 10⁶ steps. After this point, the action strategy networks for the two missiles begin to converge, and the reward curve smooths out. In Stage 2, which involves a maximum of 4 × 10⁶ training steps, the reward curve is as shown in Figure 4b. Since this stage builds upon Stage 1, reward convergence occurs more quickly, with the action strategy network stabilizing by 1.5 × 10⁶ steps and the reward curve becoming stable. In Stage 3, which benefits from the progress made in Stage 2, the success rate of cooperative attacks is approximately 100% at the beginning of the training, as shown in Figure 5. The reward curve during this stage, illustrated in Figure 4c, shows a faster convergence rate. The main function of this stage is to avoid excessive energy consumption by the missiles, thereby optimizing the overall operational efficiency of the cooperative guidance missions.
To assess the effectiveness of the multi-stage curriculum training approach, two training schemes are developed to compare their influence on Reinforcement Learning Cooperative Guidance (RLCG). Scheme 1 employs the multi-stage curriculum method, gradually progressing through the three stages of training. In contrast, Scheme 2 directly uses the reward function parameters from Stage 3 for training. The performance of the algorithm is evaluated primarily through the cumulative reward curve and the cooperative success rate curve.
Figure 5 shows the training performance of the agent under Scheme 1. From the reward convergence curve, it is clear that, when the multi-stage curriculum training method (Scheme 1) is used, the agent’s reward steadily converges throughout the training process, reaching convergence at around 6 × 10⁶ steps. In contrast, while the direct training method (Scheme 2) shows an upward trend in the reward value, it does not achieve convergence. Specifically, as shown in Figure 6, when the cooperative guidance network is trained directly, the agent’s performance significantly lags behind the mission requirements, with the success rate of cooperative attacks being nearly zero (as shown in Figure 7), highlighting the difficulties that the agent encounters with the direct training approach. These results confirm that the multi-stage curriculum training framework not only accelerates the agent’s learning process but also ensures more stable and effective training outcomes.
For the RLCG guidance strategy obtained using the multi-stage curriculum training framework, a typical scenario is analyzed. For the simulated environment, Figure 8 illustrates an example of the missile–target engagement trajectory for the cooperative attack on the target. In Figure 8, the initial position of the target is set to (4000 m, 1400 m), where the initial positions of the two missiles are (0 m, 900 m) and (200 m, 1600 m). Throughout the flight, the missiles maintain a constant speed, compensating for the time constraint by adjusting the flight distance. This is reflected in the trajectory’s curvature—the greater the curvature, the longer the required flight time. The two flying vehicles successfully carry out a cooperative attack on the target, with miss distances of 4.82 m and 3.26 m, respectively, and no coordination time error (0 s). The changes in the overloads of the two missiles are shown in Figure 9. As shown in Figure 9b, the missiles exhibit minimal maneuvering in the first 15 s to ensure a direct hit. After approximately 15 s, they start making significant adjustments to their flight paths in order to synchronize their attacks. This approach not only ensures strike accuracy but also improves the success rate of cooperative attack. More examples of missile–target engagement trajectories for cooperative attacks with random initial positions are shown in Figure 10, and their parameters are listed in Table 3, where Figure 10a–d show the missiles and target trajectories corresponding to the initial positions in rows 1–4 of Table 3, respectively. As shown in Figure 10a–d, the missiles successfully achieve simultaneous hits on the target under all four initial conditions.
It should be noted that sliding mode behavior exists in the first half of Figure 9a. This phenomenon arises from two key factors during the early training phase. Firstly, in Stage 1, the primary objective is to learn a cooperative strategy without constraints on energy consumption or action smoothness. The reinforcement learning agent initially explores a wide range of acceleration commands to discover effective policies. This exploration phase naturally leads to abrupt changes in the acceleration signals, resembling sliding mode oscillations. Secondly, the overload constraint reward mechanism, as shown in Equation (7), is inactive in Stage 1 ( k 4 = 0 ,   k 5 = 0 ), allowing the agent to freely explore extreme acceleration values within the [−20g, 20g] bounds. Frequent saturation at these limits amplifies the oscillatory behavior. Furthermore, in Stage 2, the utilization of the overload constraint reward ( k 4 = 0.1 ) penalizes abrupt changes in acceleration, forcing the agent to smooth its actions. Stage 3 further optimizes the energy efficiency ( k 5 = 0.1 ), refining the policy to minimize unnecessary maneuvers. As shown in Figure 9b, these curriculum learning stages progressively suppress oscillations, resulting in smoother and more optimal control signals.
Furthermore, by comparing the overload curve from the Stage 1-trained strategy (Figure 9a) with that from Stage 3, it is evident that the latter successfully reduces the energy consumption and produces a smoother overload curve. These simulation results highlight that the RLCG, trained using the multi-stage curriculum framework, achieves not only cooperative attacks but also addresses action jitter caused by independent sampling in reinforcement learning.
To further evaluate the performance advantages of the RLCG algorithm, this study conducts a series of comparative experiments designed to quantitatively assess its effectiveness. The algorithm used for the comparison is the proportional cooperative guidance law (PCGL) proposed in Ref. [42], which extends the traditional proportional navigation by incorporating a time-error bias term, as expressed by
$$a_{mi} = \bar{N}_i V_{mi} \dot{\theta}_{Li}, \qquad \bar{N}_i = N_i\left(1 - k_{1i}\,\mathrm{sgn}(\xi_i) - k_{2i}\,\xi_i\right), \qquad \xi_i = \sum_{j=1}^{n} c_{ij}\left(t_{go_j} - t_{go_i}\right)$$
According to Ref. [42], the parameters are set as follows: $N_i = 5$, $k_{11} = 2.8$, $k_{12} = 2.5$, $k_{21} = k_{22} = 1.2$, and $c_{ij} = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$.
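A sketch of the PCGL baseline, following the reconstruction of Equation (14) above, is given below; the sign conventions inside the modified navigation gain should be checked against Ref. [42], and all function and argument names are illustrative.

```python
import numpy as np

def pcgl_acceleration(theta_dot_los, v_m, t_go, idx, c,
                      N=5.0, k1=(2.8, 2.5), k2=(1.2, 1.2)):
    """Proportional cooperative guidance law (PCGL) of Ref. [42], Equation (14).

    theta_dot_los : LOS rate of missile idx [rad/s]
    v_m           : speed of missile idx [m/s]
    t_go          : estimated times-to-go of all missiles [s]
    idx           : index of the missile being commanded
    c             : communication adjacency matrix c_ij (list of lists)
    """
    # Time-to-go consensus error xi_i = sum_j c_ij (t_go_j - t_go_i)
    xi = sum(c[idx][j] * (t_go[j] - t_go[idx]) for j in range(len(t_go)))
    # Modified navigation ratio with the time-error bias term
    N_bar = N * (1.0 - k1[idx] * np.sign(xi) - k2[idx] * xi)
    return N_bar * v_m * theta_dot_los

# Communication topology used in the two-missile comparison
C = [[0, 1], [1, 0]]
```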
The initial states of the missiles are randomly sampled within the same parameter ranges employed to train the cooperative guidance network, as outlined in Table 3. The target’s motion is defined according to the dynamic model presented in Equation (13). A maximum allowable miss distance of 10 m is assumed for all simulation scenarios. A total of 1000 Monte Carlo simulations are conducted for both the PCGL and RLCG algorithms. The results show that, for maneuvering targets whose accelerations are governed by Equation (13), the RLCG-based cooperative guidance consistently achieves simultaneous interceptions, with a time-to-impact error of zero across all test cases. The trajectories of the missiles and the target over the course of the 1000 engagements are illustrated in Figure 11. In comparison, PCGL achieves a success rate of 99.4%, where success is defined as both missiles striking the target within the prescribed 10 m miss distance. However, PCGL exhibits a significant average time-to-impact error of 0.388225 s, underscoring the superior synchronization and accuracy of the RLCG algorithm.
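The Monte Carlo comparison can be organized as in the following sketch, where simulate_engagement is a hypothetical callable standing in for one randomized engagement of the simulation environment described above and returning the per-missile miss distances and impact times.

```python
import numpy as np

def monte_carlo_eval(simulate_engagement, n_runs=1000, miss_limit=10.0):
    """Evaluate a guidance law over randomized engagements.

    simulate_engagement : hypothetical callable returning (miss_distances [m],
                          impact_times [s]) for one random initial condition.
    Returns the success rate (both missiles within miss_limit) and the mean
    absolute time-to-impact error between the two missiles.
    """
    successes, time_errors = 0, []
    for _ in range(n_runs):
        miss, t_hit = simulate_engagement()
        if max(miss) <= miss_limit:              # both missiles within 10 m
            successes += 1
        time_errors.append(abs(t_hit[0] - t_hit[1]))
    return successes / n_runs, float(np.mean(time_errors))
```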

4.2. Generalization Analysis of RLCG

To evaluate the generalization capabilities of the model trained through reinforcement learning, simulation experiments are conducted from two perspectives: (1) increasing the missile count and (2) extending the missile launch range.
First, the performance of the RLCG guidance model is evaluated in combat scenarios of varying scales. To test the model’s adaptability to more complex situations, the basic two-on-one scenario (two missiles targeting one target) is expanded to a four-on-one scenario. This expansion increases both the number of missiles and the complexity of the cooperative combat strategies. An example of the four-on-one scenario is illustrated in Figure 12. In this scenario, two missiles and the target start at the same positions as in Figure 8, while the other two missiles are positioned at (100 m, 1100 m) and (400 m, 1800 m). As shown in Figure 12a, the missiles successfully achieve simultaneous hits on the target. With the cooperative attack time set to zero, the resulting miss distances are 4.82 m, 3.26 m, 3.97 m, and 2.53 m, respectively. The overloads of the four missiles are shown in Figure 12b. The missile overload changes smoothly without exhibiting significant oscillations.
Additionally, experiments are conducted to assess the impact of extending the missile launch range. Using the initial conditions from the network training phase, the missile launch range is increased, and 5000 engagement trials are performed using the Monte Carlo method. The initial missile states for this simulation are outlined in Table 4. Notably, the space for the initial position of each participant (missile or target) in the engagement is 2.25 times larger than the space described in Table 2.
The simulation results show that the success rate of two missiles hitting the target within a miss distance range of 10 m is 88.12%. The miss distance histogram in Figure 13 reveals that approximately 52.34% of the missile miss distances fall within the 4–5 m range, while 16.52% and 17.22% fall within the 3–4 m and 2–3 m ranges, respectively, indicating that the guidance strategy delivers high accuracy.
In conclusion, the results from both scaling up the missile count and extending the missile launch range demonstrate that the RLCG guidance model performs effectively in both small- and large-scale combat scenarios, maintaining high levels of engagement accuracy and cooperative guidance efficiency.

5. Extension to Three-Dimensional Engagements

The primary objective of this paper is to demonstrate the design methodology for a cooperative guidance strategy based on deep reinforcement learning. The effectiveness of the proposed design has been validated for planar (two-dimensional) engagements in Section 4. In this section, the two-dimensional RLCG guidance strategy is further extended to address three-dimensional engagement scenarios. The motion dynamics of both the missile and the target are formulated as follows:
$$\dot{x} = v\cos\theta\cos\varphi, \quad \dot{y} = v\sin\theta, \quad \dot{z} = v\cos\theta\sin\varphi, \quad \dot{\theta} = \frac{a_y}{v}, \quad \dot{\varphi} = \frac{a_z}{v\cos\theta}$$
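A short Euler-integration sketch of the three-dimensional kinematics in Equation (15), using the form reconstructed above (including the cos φ factor in the x-equation), is given below; the names and step size are illustrative.

```python
import numpy as np

def step_3d(state, a_y, a_z, v, dt=0.05):
    """One Euler step of the three-dimensional kinematics (Equation (15)).

    state : (x, y, z, theta, phi) position [m], pitch and heading angles [rad]
    a_y   : pitch-plane acceleration [m/s^2]
    a_z   : yaw-plane acceleration [m/s^2]
    v     : constant speed [m/s]
    """
    x, y, z, theta, phi = state
    # Derivatives evaluated at the current state
    dx = v * np.cos(theta) * np.cos(phi)
    dy = v * np.sin(theta)
    dz = v * np.cos(theta) * np.sin(phi)
    dtheta = a_y / v
    dphi = a_z / (v * np.cos(theta))
    return np.array([x + dx * dt, y + dy * dt, z + dz * dt,
                     theta + dtheta * dt, phi + dphi * dt])
```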
The reward functions used in the three-dimensional scenario are consistent with those employed in the two-dimensional case (Equations (5)–(8)). The action space is defined as follows:
$$a_M = [a_{iy}, a_{iz}, a_{jy}, a_{jz}], \quad i \ne j \ \ \text{and} \ \ a_{iy}, a_{iz}, a_{jy}, a_{jz} \in [-20g,\ 20g]$$
where a i y and a j y represent the accelerations of missiles i and j along the y-axis, and a i z and a j z represent the accelerations of missiles i and j along the z-axis.
The missile’s observation is defined as follows:
$$o = [x_s, y_s, z_s, v_x, v_y, v_z]$$
where $x_s = x_T - x_M$, $y_s = y_T - y_M$, and $z_s = z_T - z_M$ represent the relative positions between the missile and the target along the x-, y-, and z-axes, respectively, and $v_x$, $v_y$, and $v_z$ represent the components of the relative velocity between the missile and target along the x-, y-, and z-axes, respectively. The global state space is constructed by concatenating the observations of both missiles, which is expressed as
$$S = [o_i, o_j], \quad i \ne j$$
The multi-stage curriculum learning setup is detailed in Table 5, with all other parameters remaining consistent with those used in the two-dimensional training process.
The reward convergence curve achieved through the multi-stage curriculum training approach is presented in Figure 14, and the success rates of cooperative attacks across the three stages are shown in Figure 15. Due to the increased dimensionality of the action space in the three-dimensional environment, the convergence process becomes more complex. As a result, a longer training horizon is employed to support the learning process. In Stage 1, the MAPPO algorithm is trained for up to 3 × 10⁷ steps, with the reward curve during this stage shown in Figure 14a. Initially, the reward fluctuates considerably as the agents explore the environment up to 2.2 × 10⁷ steps. Following this exploration phase, the action strategy networks for both missiles begin to converge, leading to the smoothing of the reward curve. Stage 2, which involves a maximum of 8 × 10⁶ training steps, is shown in Figure 14b. Since this stage builds upon Stage 1, the convergence of the reward occurs more rapidly, with the action strategy network stabilizing by 6.8 × 10⁶ steps and the reward curve becoming stable. In Stage 3, which benefits from the progress made in Stage 2, the success rate of cooperative attacks approaches 100% early in the training, as depicted in Figure 15. The reward curve for this stage, shown in Figure 14c, exhibits a faster convergence rate. The primary objective of Stage 3 is to minimize the energy consumption of the missiles, thereby enhancing the overall operational efficiency of the cooperative guidance mission.
Based on the RLCG guidance strategy developed above, cooperative guidance laws are designed. A total of 1000 simulations are conducted, with the initial states of the missile and the target randomly selected from the conditions specified in Table 5. Among these simulations, 997 engagements result in successful simultaneous attacks, as shown in Figure 16. For the RLCG guidance strategy derived through the multi-stage curriculum training framework, a typical scenario is analyzed. Figure 17 illustrates an example of the missile–target engagement trajectory during the cooperative attack. In this scenario, the target’s initial position is set at (4000 m, 1400 m, 0 m), while the initial positions of the two missiles are (0 m, 500 m, 0 m) and (200 m, 1600 m, 0 m). The two missiles successfully execute a cooperative attack on the target, with miss distances of 9.46 m and 6.98 m, respectively, and no coordination time error (0 s). The variations in the overloads of both missiles are shown in Figure 18, with Figure 18a depicting the overloads of Missile 1 and Figure 18b illustrating those of Missile 2. As demonstrated in Figure 18, the missiles exhibit minimal maneuvering during the first 17 s to ensure a direct hit. After approximately 17 s, they begin making substantial adjustments to their flight paths in order to synchronize their attacks. This strategy not only ensures strike accuracy but also enhances the success rate of the cooperative attack.

6. Conclusions

This paper presents a multi-agent cooperative guidance law based on deep reinforcement learning (DRL), designed for scenarios with uncontrollable velocity conditions. Unlike traditional guidance methods, the proposed approach enables each agent to dynamically adjust its acceleration based on real-time environmental feedback, thereby optimizing cooperative engagement strategies. To overcome the challenge of sparse rewards in the guidance environment, a dedicated reinforcement learning framework is developed. This framework integrates a heuristically reshaped reward function along with a task-specific composite reward to effectively guide the learning process. In addition, a multi-stage curriculum learning strategy is introduced to enhance the training stability, reduce action oscillations, and ensure the reliable convergence of the algorithm. The simulation results demonstrate that the proposed guidance law enables cooperative attacks with high accuracy and minimal timing errors, validating its effectiveness.
For the sake of modeling simplicity, this study assumes a constant vehicle velocity. Future work will focus on incorporating autopilot dynamics and aerodynamic drag into the model to improve its realism. Additionally, since few existing studies have addressed cooperative guidance laws that account for both time constraints and angle limitations, future research will explore strategies for guidance under angle constraints, especially in scenarios involving maneuvering targets.

Author Contributions

Conceptualization, H.C. and M.T.; methodology, M.T.; software, M.T.; validation, K.Z. and J.W.; formal analysis, J.W.; investigation, K.Z.; resources, M.T.; data curation, H.C.; writing—original draft preparation, M.T.; writing—review and editing, M.T.; visualization, K.Z.; supervision, J.W.; project administration, H.C.; funding acquisition, M.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Aeronautical Science Fund, grant numbers 202400010530002 and 20230001053004.

Data Availability Statement

The data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhang, Y.; Ma, G.; Liu, A. Guidance law with impact time and impact angle constraints. Chin. J. Aeronaut. 2013, 26, 4960–4966. [Google Scholar] [CrossRef]
  2. Zhao, J.; Zhou, R.; Jin, X. Progress in reentry trajectory planning for hypersonic vehicle. J. Syst. Eng. Electron. 2014, 25, 627–639. [Google Scholar] [CrossRef]
  3. Wang, J.Q.; Li, F.; Zhao, J.H.; Wang, C.M. Summary of guidance law based on cooperative attack of multi-missile method. Flight Dyn. 2011, 29, 6–10. [Google Scholar]
  4. Jeon, I.-S.; Lee, J.-I.; Tahk, M.-J. Impact-time-control guidance law for anti-ship missiles. IEEE Trans. Control Syst. Technol. 2006, 14, 260–266. [Google Scholar] [CrossRef]
  5. Guo, Y.; Li, X.; Zhang, H.; Cai, M.; He, F. Data-driven method for impact time control based on proportional navigation guidance. J. Guid. Control Dyn. 2020, 43, 955–966. [Google Scholar] [CrossRef]
  6. Hu, Q.; Han, T.; Xin, M. New impact time and angle guidance strategy via virtual target approach. J. Guid. Control Dyn. 2018, 41, 1755–1765. [Google Scholar] [CrossRef]
  7. Sinha, A.; Kumar, S.R.; Mukherjee, D. Three-dimensional guidance with terminal time constraints for wide launch envelops. J. Guid. Control Dyn. 2021, 44, 343–359. [Google Scholar] [CrossRef]
  8. Tekin, R.; Erer, K.S.; Holzapfel, F. Polynomial shaping of the look angle for impact-time control. J. Guid. Control Dyn. 2017, 40, 2668–2673. [Google Scholar] [CrossRef]
  9. Dong, W.; Wang, C.; Wang, J.; Xin, M. Varying-gain proportional navigation guidance for precise impact time control. J. Guid. Control Dyn. 2023, 46, 535–552. [Google Scholar] [CrossRef]
  10. Cho, N.; Kim, Y. Modified pure proportional navigation guidance law for impact time control. J. Guid. Control Dyn. 2016, 39, 852–872. [Google Scholar] [CrossRef]
  11. Tekin, R.; Erer, K.S.; Holzapfel, F. Impact time control with generalized-polynomial range formulation. J. Guid. Control Dyn. 2018, 41, 1190–1195. [Google Scholar] [CrossRef]
  12. Zhang, Z.; Ma, K.; Zhang, G.; Yan, L. Virtual target approach-based optimal guidance law with both impact time and terminal angle constraints. Nonlinear Dyn. 2022, 107, 3521–3541. [Google Scholar] [CrossRef]
  13. Chen, X.; Wang, J. Two-stage guidance law with impact time and angle constraints. Nonlinear Dyn. 2019, 95, 2575–2590. [Google Scholar] [CrossRef]
  14. Zhang, W.; Chen, W.; Li, J.; Yu, W. Guidance algorithm for impact time, angle, and acceleration control under varying velocity condition. Aerosp. Sci. Technol. 2022, 123, 107462. [Google Scholar] [CrossRef]
  15. Wang, P.; Guo, Y.; Ma, G.; Lee, C.-H.; Wie, B. New look-angle tracking guidance strategy for impact time and angle control. J. Guid. Control Dyn. 2022, 45, 545–557. [Google Scholar] [CrossRef]
  16. Surve, P.; Maity, A.; Kumar, S.R. Polynomial based impact time and impact angle constrained guidance. IFAC-PapersOnLine 2022, 55, 486–491. [Google Scholar] [CrossRef]
  17. Chen, Y.; Shan, J.; Liu, J.; Wang, J.; Xin, M. Impact time and angle constrained guidance via range-based line-of-sight shaping. Int. J. Robust. Nonlinear Control 2022, 32, 3606–3624. [Google Scholar] [CrossRef]
  18. Shiyu, Z.; Rui, Z. Cooperative guidance for multimissile salvo attack. Chin. J. Aeronaut. 2008, 21, 533–539. [Google Scholar] [CrossRef]
  19. Jiang, Z.; Ge, J.; Xu, Q.; Yang, T. Impact time control cooperative guidance law design based on modified proportional navigation. Aerospace 2021, 8, 231. [Google Scholar] [CrossRef]
  20. Li, Q.; Yan, T.; Gao, M.; Fan, Y.; Yan, J. Optimal Cooperative Guidance Strategies for Aircraft Defense with Impact Angle Constraints. Aerospace 2022, 9, 710. [Google Scholar] [CrossRef]
  21. Qilun, Z.; Xiwang, D.; Liang, Z.; Chen, B.; Jian, C.; Zhang, R. Distributed cooperative guidance for multiple missiles with fixed and switching communication topologies. Chin. J. Aeronaut. 2017, 30, 1570–1581. [Google Scholar]
  22. Kumar, S.R.; Mukherjee, D. Cooperative salvo guidance using finite-time consensus over directed cycles. IEEE Trans. Aerosp. Electron. Syst. 2019, 56, 1504–1514. [Google Scholar] [CrossRef]
  23. Li, C.; Wang, J.; Huang, P. Optimal cooperative line-of-sight guidance for defending a guided missile. Aerospace 2022, 9, 232. [Google Scholar] [CrossRef]
  24. Zhou, J.; Yang, J. Distributed guidance law design for cooperative simultaneous attacks with multiple missiles. J. Guid. Control Dyn. 2016, 39, 2439–2447. [Google Scholar] [CrossRef]
  25. Lyu, T.; Guo, Y.; Li, C.; Ma, G.; Zhang, H. Multiple missiles cooperative guidance with simultaneous attack requirement under directed topologies. Aerosp. Sci. Technol. 2019, 89, 100–110. [Google Scholar] [CrossRef]
  26. Yu, H.; Dai, K.; Li, H.; Zou, Y.; Ma, X.; Ma, S.; Zhang, H. Distributed cooperative guidance law for multiple missiles with input delay and topology switching. J. Frankl. Inst. 2021, 358, 9061–9085. [Google Scholar] [CrossRef]
  27. Zadka, B.; Tripathy, T.; Tsalik, R.; Shima, T. Consensus-based cooperative geometrical rules for simultaneous target interception. J. Guid. Control Dyn. 2020, 43, 2425–2432. [Google Scholar] [CrossRef]
  28. Cevher, F.Y.; Leblebicioğlu, M.K. Cooperative Guidance Law for High-Speed and High-Maneuverability Air Targets. Aerospace 2023, 10, 155. [Google Scholar] [CrossRef]
  29. Shakya, A.K.; Pillai, G.; Chakrabarty, S. Reinforcement learning algorithms: A brief survey. Expert. Syst. Appl. 2023, 231, 23541. [Google Scholar] [CrossRef]
  30. Zhang, R.; Zong, Q.; Zhang, X.; Dou, L.; Tian, B. Game of drones: Multi-UAV pursuit-evasion game with online motion planning by deep reinforcement learning. IEEE Trans. Neural Netw. Learn. Syst. 2022, 34, 7900–7909. [Google Scholar] [CrossRef]
  31. Wu, M.-Y.; He, X.-J.; Qiu, Z.-M.; Chen, Z.-H. Guidance law of interceptors against a high-speed maneuvering target based on deep Q-Network. Trans. Inst. Meas. Control 2022, 44, 1373–1387. [Google Scholar] [CrossRef]
  32. Li, W.; Zhu, Y.; Zhao, D. Missile guidance with assisted deep reinforcement learning for head-on interception of maneuvering target. Complex. Intell. Syst. 2022, 8, 1205–1216. [Google Scholar] [CrossRef]
  33. Qinhao, Z.; Baiqiang, A.; Qinxue, Z. Reinforcement learning guidance law of Q-learning. Syst. Eng. Electron. 2020, 42, 414–419. [Google Scholar]
  34. Guo, J.; Hu, G.; Guo, Z.; Zhou, M. Evaluation Model, Intelligent Assignment, and Cooperative Interception in multimissile and multitarget engagement. IEEE Trans. Aerosp. Electron. Syst. 2022, 58, 3104–3115. [Google Scholar] [CrossRef]
  35. Ni, W.; Liu, J.; Li, Z.; Liu, P.; Liang, H. Cooperative guidance strategy for active spacecraft protection from a homing interceptor via deep reinforcement learning. Mathematics 2023, 11, 4211. [Google Scholar] [CrossRef]
  36. Zhou, W.; Li, J.; Liu, Z.; Shen, L. Improving multi-target cooperative tracking guidance for UAV swarms using multi-agent reinforcement learning. Chin. J. Aeronaut. 2022, 35, 100–112. [Google Scholar] [CrossRef]
  37. Xi, A.; Cai, Y. Deep Reinforcement Learning-Based Differential Game Guidance Law against Maneuvering Evaders. Aerospace 2024, 11. [Google Scholar] [CrossRef]
  38. Ha, I.-J.; Hur, J.-S.; Ko, M.-S.; Song, T.-L. Performance analysis of PNG laws for randomly maneuvering targets. IEEE Trans. Aerosp. Electron. Syst. 1990, 26, 713–721. [Google Scholar]
  39. Yu, C.; Velu, A.; Vinitsky, E.; Gao, J.; Wang, Y.; Bayen, A.; Wu, Y. The surprising effectiveness of ppo in cooperative multi-agent games. Adv. Neural Inf. Process. Syst. 2022, 35, 24611–24624. [Google Scholar]
  40. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar]
  41. Li, X.; Vasile, C.I.; Belta, C. Reinforcement learning with temporal logic rewards. In Proceedings of the International Conference on Intelligent Robots and Systems, Vancouver, BC, Canada, 24–28 September 2017; pp. 3834–3839. [Google Scholar]
  42. Ma, M.; Song, S. Multi-missile cooperative guidance law for intercepting maneuvering target. Aero Weapon. 2021, 28, 19–27. [Google Scholar]
Figure 1. Missile interceptor scenario schematic.
Figure 2. MAPPO algorithmic framework.
Figure 3. Missile and target departure diagram.
Figure 4. Reward convergence curve of the planar multi-stage curriculum training method. (a) Reward convergence curve for Stage 1 training; (b) reward convergence curve for Stage 2 training; (c) reward convergence curve for Stage 3 training.
Figure 5. Success rates of the planar cooperative guidance in the three stages.
Figure 6. Reward convergence curve of direct training.
Figure 7. Success rate curve per episode of direct training.
Figure 8. Planar missile and target trajectories.
Figure 9. Overload curves for different stages in the planar engagements: (a) overload curve in Stage 1; (b) overload curve in Stage 3.
Figure 10. Four missile–target engagement trajectories for cooperative attacks with random initial positions with the attack time (a) 21.05 s; (b) 23.00 s; (c) 21.25 s; (d) 21.60 s.
Figure 11. Planar trajectories of missiles and target with different initial states.
Figure 12. (a) Trajectories of 4 missiles in cooperative attacks on the target; (b) overload curves of the 4 missiles.
Figure 13. Success rate of cooperative attacks of 2 missiles versus the miss distance.
Figure 14. Reward convergence curve of the three-dimensional multi-stage curriculum training method. (a) Reward convergence curve for Stage 1 training; (b) reward convergence curve for Stage 2 training; (c) reward convergence curve for Stage 3 training.
Figure 15. Success rate of the three-dimensional cooperative guidance in the three stages.
Figure 16. Three-dimensional trajectories of missiles and targets with different initial states.
Figure 17. Three-dimensional missile and target trajectories.
Figure 18. Overload curves of missiles in the three-dimensional engagements: (a) overload curve of Missile 1; (b) overload curve of Missile 2.
Table 1. Multi-stage curriculum learning design.
Stage | Reward Function Parameters
Stage 1 | k_1 = 5, k_2 = 100, k_3 = 5, k_4 = 0, k_5 = 0
Stage 2 | k_1 = 5, k_2 = 100, k_3 = 5, k_4 = 0.1, k_5 = 0
Stage 3 | k_1 = 5, k_2 = 100, k_3 = 5, k_4 = 0.1, k_5 = 0.1
Table 2. Parameters in the planar guidance environment.
Parameter | Missile 1 | Missile 2 | Target
Initial x-axis coordinate (m) | 0~400 | 0~400 | 3800~4200
Initial y-axis coordinate (m) | 800~1200 | 1600~2000 | 1200~1600
Initial velocity (m/s) | 400 | 400 | 210
Initial pitch angle (°) | 0 | 0 | /
Table 3. Parameters of the missile–target engagement trajectories in Figure 10.
Figure | Initial Position of Missile 1 | Initial Position of Missile 2 | Initial Position of Target | Attack Time (s)
Figure 10a | (82.68, 1167.44) | (195.36, 1844.69) | (3888.79, 1548.29) | 21.05
Figure 10b | (118.72, 875.08) | (32.29, 1895.37) | (4106.36, 1407.36) | 23.00
Figure 10c | (351.97, 909.63) | (165.69, 1718.43) | (3976.52, 1263.32) | 21.25
Figure 10d | (239.97, 906.32) | (113.87, 1701.43) | (4051.51, 1431.93) | 21.60
Table 4. Initial states of the missiles and target with expanded positional spaces.
Parameter | Missile 1 | Missile 2 | Target
Initial x-axis coordinate (m) | 0~600 | 0~600 | 3700~4300
Initial y-axis coordinate (m) | 700~1300 | 1500~2100 | 1100~1700
Table 5. Parameters in the three-dimensional guidance environment.
Parameter | Missile 1 | Missile 2 | Target
Initial x-axis coordinate (m) | 0~400 | 0~400 | 3800~4200
Initial y-axis coordinate (m) | 800~1200 | 1600~2000 | 1200~1600
Initial z-axis coordinate (m) | 0 | 0 | 1000
Initial velocity (m/s) | 400 | 400 | 210
Initial pitch angle (°) | 0 | 0 | /
Initial yaw angle (°) | 0 | 0 | /
