Pursuit-Interception Strategy in Differential Games Based on Q-Learning-Cover Algorithm

Bai, Yu; Zhou, Di; He, Zhen

doi:10.3390/aerospace12050428

Open AccessArticle

Pursuit-Interception Strategy in Differential Games Based on Q-Learning-Cover Algorithm

by

Yu Bai

^†,

Di Zhou

^*,† and

Zhen He

School of Astronautics, Harbin Institute of Technology, Harbin 150001, China

^*

Author to whom correspondence should be addressed.

^†

These authors contributed equally to this work.

Aerospace 2025, 12(5), 428; https://doi.org/10.3390/aerospace12050428

Submission received: 8 April 2025 / Revised: 1 May 2025 / Accepted: 8 May 2025 / Published: 12 May 2025

(This article belongs to the Section Aeronautics)

Download

Browse Figures

Versions Notes

Abstract

Due to the limited difference in maneuverability between the pursuer and the evader in three-dimensional space, it is difficult for a single pursuer to capture the evader. To address this, this paper proposes a strategy where three pursuers intercept one evader and introduces a Q-learning-cover algorithm. Based on the motion models of the pursuers and the evader in three-dimensional space, this paper presents a region coverage scheme based on the Ahlswede ball and analyzes the convergence upper bound of the Q-learning-cover algorithm by designing an appropriate Lyapunov function. Through extensive model training, the successful capture of the evader by the pursuers in a three-on-one scenario was achieved. Finally, numerical simulation experiments and hardware-in-the-loop simulation experiments are presented, both of which demonstrate that the proposed Q-learning-cover algorithm can effectively realize the three-on-one encirclement and interception of the evading target.

Keywords:

pursuit dynamics; Q-learning; regional coverage; time synchronization; differential game

1. Introduction

In recent years, the regional coverage problem has garnered substantial research attention. The acceleration-based coverage interception algorithm that controls flight angle [1] is mainly applicable in two-dimensional space, using acceleration coverage to intercept the target. However, acceleration coverage does not equate to regional coverage in three-dimensional space; it only covers the maneuvering range of the target, unable to effectively cover the target’s activity range in three-dimensional space. Moreover, the collaborative allocation interception strategy proposed in [2] is based on multi-missile coverage in two-dimensional space. However, this strategy is also limited to two dimensions and cannot be applied in three-dimensional space, potentially creating escape vulnerabilities. Another three-dimensional cooperative coverage interception strategy for intercepting highly maneuverable targets [3] achieves all-around coverage by calculating the arc length of the interception region, but it does not account for the influence of interception time, which could cause the target to exit the interception region. In [4,5,6], several cooperative guidance strategies based on coverage range were proposed, where each missile’s interception region is pre-allocated based on zero-control miss distance, ensuring that all missile interception regions cover the target. However, the zero-control miss distance is computed using a linearized guidance model, and significant linearization errors may lead to coverage failure.

In [7], a differential game strategy was proposed to solve the multiple-to-one pursuit problem, but the control input in this strategy is based on angles. This paper considers normal acceleration as the control input, as relying solely on angle changes significantly differs from the actual situation, leading to large miss distances that prevent the pursuers from successfully intercepting the target. In [8,9], differential game strategies are also used for interception, but the control input remains angle-based, while for missiles, normal acceleration is typically used as the control input [10].

The guidance strategies in the aforementioned literature contain errors and mismatches between the control inputs and the actual conditions, making them unsuitable for multi-to-one missile interception. In three-dimensional space, miss distance is closely related to errors, and reinforcement learning training can effectively reduce errors. Therefore, this paper proposes the Q-learning-cover algorithm to compute the missile guidance strategy. There has been substantial research in the field of reinforcement learning, addressing the optimal problem of systems through experience [11]. Reinforcement learning was initially introduced in [12] and is one of the most common and fundamental reinforcement learning algorithms used to describe unknown systems via Markov decision processes. The convergence of Q-learning has been extensively studied and proven through various methods, including the original proof [13], random approximation and contraction mapping methods [14], and ordinary differential equation approaches [15]. In [16], a minimax Q-learning-based algorithm was proposed to solve zero-sum games between two players. However, this algorithm involves significant computation and usually requires rapid iteration in pursuit problems, making it unsuitable for such scenarios. The algorithm designed in this paper aims to achieve iteration within 200ms. Although the training time is longer, the model used for encirclement needs to complete an iteration quickly. In [17], a new control theory framework is provided to analyze Q-learning’s convergence, introducing bias terms and Lyapunov functions for analyzing the algorithm’s convergence in switched systems. This paper also adopts this approach to analyze the impact of encirclement and time coordination on the convergence of the pursuit system. Similarly, in [18,19], new Q-learning update schemes are proposed. To ensure convergence in finite time, Reference [20] used Lyapunov theory to analyze the convergence of finite-time sampling algorithms. This paper also uses Lyapunov functions, but it focuses on proving the iteration upper bound and ensuring finite-time convergence by establishing the existence of the iteration upper bound.

Regarding synchronous time interception, many scholars have conducted in-depth research, and a new guidance method provides a computational approach for interception time [21]. This method is mainly applied in two-dimensional planes, where the normal acceleration is calculated by determining the deviation between the expected and actual arrival times. Similar to the aforementioned references, pre-time prediction methods are used to solve the interception of anti-ship missiles in maritime environments. These methods are primarily suited for two-dimensional planes, and multi-to-one missile interception designs analyze changes in expected interception time based on angle variations [22]. In contrast to the idealized conditions considered in the aforementioned literature, this paper introduces a time prediction-based method to address the interception of anti-ship missiles in maritime environments. The derived guidance strategy can achieve precise interception within a specified time window [23]. Notably, all three referenced works rely on iterative methods to compute flight times, whereas [24] proposes a recursive time calculation method, updating time non-iteratively. Furthermore, Reference [25] introduces a recursive time estimation method that compensates for errors caused by non-zero initial heading errors and non-iteratively updates the missile’s remaining flight time.

In [26], flight time is calculated by dividing relative distance by relative speed, whereas [27] uses relative distance, target and missile speeds, and missile heading angles to calculate flight time. However, in many cases, these estimation methods fail to predict lead time accurately, resulting in the missile not achieving synchronous interception under CPN guidance laws.

To effectively intercept highly maneuverable targets, a multi-to-one cooperative interception strategy presents an effective solution. In recent years, numerous advanced cooperative guidance methods have emerged. Several cooperative interception strategies have been proposed in two-dimensional space. For example, a geometric-based synchronous target interception method was introduced in [28], and a time-constrained guidance law was designed in [29] to control missiles to attack synchronously at a specified time. For non-maneuvering targets, Reference [30] designed a new two-dimensional impact time guidance law for zero-error interception. Reference [31] proposed a sliding mode-based interception time guidance rate to solve the synchronous interception problem under large navigation angle errors. Furthermore, Reference [32] proposed two cooperative guidance schemes with impact angles and time constraints, designed to intercept maneuvering targets, either with or without lead missiles.

Reference [33] proposed an improved combined navigation guidance law that integrates proportional navigation, angle acceleration, and fixed lead navigation. Reference [34] introduced a multi-to-one cooperative proportional guidance law that enables missiles of different speeds to intercept synchronously based on remaining flight time. Reference [35] proposed a cooperative salvo guidance strategy using fixed finite-time consensus.

For the aforementioned two-dimensional cooperative guidance laws, References [36,37] proposed three-dimensional space guidance law design methods. In Reference [38], an optimal strategy for solving non-collision games in three-dimensional space is introduced, including an optimal state feedback strategy for solving two-to-one interception problems. However, in non-collision games, both players must have complete information to compute the game strategy. In practice, some information (e.g., angle of view) may not be accessible. To address this issue, Reference [39] proposed a two pursuers–one evader differential game cooperative strategy that analyzes the positional relationship between the two pursuers and one evader. Using the solution of the HJI equation in differential games, the optimal equations for pursuers and evaders are derived [40]. In [41], a new non-maneuvering aircraft air defense missile guidance law is proposed to solve the cooperative interception problem between pursuers. This solution is based on dynamic games and provides real-time optimal trajectories. Reference [1] introduced a cooperative navigation strategy based on covering the flight line-of-sight angle.

The application of deep reinforcement learning in autonomous guidance systems has been widely discussed, especially in missile and UAV domains [42]. These systems improve performance through real-time decision-making optimization, sensor fusion, and path planning. Reference [43] proposed a missile trajectory control system based on artificial neural networks, which adapts to various operational environments, improving missile accuracy. Similarly, Reference [44] introduced an adaptive missile guidance method, where a neural network adjusts guidance laws in real time based on environmental changes (e.g., wind speed or target maneuvering).

Neural networks have also been used in sensor fusion for guidance and navigation systems [45], integrating data from radar, infrared, and GPS sensors to enhance system accuracy and robustness. Reference [46] discussed the challenges faced by autonomous vehicles in real-time path planning and obstacle avoidance. Using deep learning algorithms (e.g., CNNs and reinforcement learning), neural networks optimize UAV navigation and control. Reference [47] proposed a novel missile guidance method where neural networks adapt in real-time based on environmental conditions, improving missile accuracy and performance in uncertain environments.

Regarding the guidance rate in three-dimensional space, this paper proposes a proportional guidance law based on time deviation to achieve the synchronous interception of virtual targets. Unlike the previous methods, regarding stability analysis, Reference [48] proposed a Lyapunov candidate function for analyzing the stability of neural networks in control systems. This Lyapunov function helps prove that, in the presence of interference, the weight norm of the neural network is always constrained by system design parameters. Similar Lyapunov methods are applied to analyze the interception performance of pure proportional navigation guidance laws in three-dimensional space.

In this paper, we aim to solve the problem of the strategic interception of a moving target in three-dimensional space by multiple pursuers using reinforcement learning techniques. Specifically, we focus on the multi-on-one pursuit scenario, where a set of pursuers must intercept a single evader while accounting for the evader’s dynamic motion and evasive strategies. The challenge lies in efficiently planning the pursuers’ trajectories in a complex 3D environment while minimizing computational costs and ensuring real-time decision-making.

The proposed Q-learning-cover algorithm integrates reinforcement learning with geometric covering methods to optimize the interception strategy. By leveraging the Q-learning framework, the algorithm enables pursuers to learn and adapt their behavior based on the evader’s movements. Furthermore, the use of a complex plane projection reduces the computational burden typically associated with real-time 3D trajectory planning, making the algorithm suitable for practical applications in real-time systems.

This paper introduces the Q-learning-cover algorithm, which combines regional coverage interception with time synchronization algorithms. The reward–punishment mechanism incorporates coverage probability and time penalty terms. The algorithm is trained based on various escapee motion patterns, generating new models to guide each pursuer’s strategy. In practical scenarios, the trained model can be loaded onto hardware to enable multiple pursuers to encircle a single evader.

A schematic diagram of regional coverage interception is shown in Figure 1.

The main contributions of this paper are as follows:

1. Introduction of the Q-learning-cover regional coverage interception algorithm: This paper innovatively proposes the Q-learning-cover algorithm, combining regional coverage with time coordination. By integrating the geometric relationship between the pursuers and evaders and the coverage strategy of reinforcement learning, the algorithm increases the pursuit coverage area with each iteration in three-dimensional space, thus improving the pursuers’ interception efficiency.

2. Simplification of calculations using spherical-to-complex plane mapping: By introducing projection mapping, this paper projects the maneuvering range of the pursuers and evaders on a three-dimensional sphere onto the complex plane, simplifying arc surface calculations. This approach converts the coverage problem in three-dimensional space into a geometric mapping problem on the plane, significantly reducing computation complexity and improving the algorithm’s operability and efficiency.

3. Convergence analysis of the Q-learning-cover algorithm: Based on the assumption of iteration limits, this paper proves the convergence of the Q-learning-cover algorithm. By optimizing the iteration process, it ensures that pursuers can quickly find the optimal strategy to effectively cover and intercept the evader, guaranteeing convergence within a finite time.

In this paper, we propose a novel Q-learning-cover algorithm that combines geometric coverage in 3D space with temporal synchronization. Unlike traditional methods, our approach utilizes offline training of the Q-learning algorithm, where the model is trained in a simulation environment on various escapee motion patterns. Once trained, the model is then deployed online for the real-time guidance of the pursuers. This allows the system to leverage the computational efficiency of offline training while still enabling dynamic, online application during interception. The Q-learning algorithm is trained offline on a powerful computing platform, and the trained model is subsequently loaded onto the embedded hardware for online execution in real-world scenarios.

The structure of this paper is as follows: Section 2 describes the missile interception model in three-dimensional space; Section 3 discusses spherical-to-complex plane projection mapping for calculating the maneuvering range of pursuers and evaders; Section 4 presents the calculation method for interception time; Section 5 provides the update formula and iteration limit constraints for the Q-learning-cover algorithm; Section 6 presents the simulation platform used for missile interception. Section 7 shows numerical simulation and hardware-in-the-loop simulation experiments; Section 8 concludes the paper.

2. Dynamics Model

In the field of missile interception, the missile is considered as the pursuer, and the target is regarded as the evader. The pursuers

P_{i}

cooperate to capture the and the evader

E_{i}

. This game takes place in three-dimensional space. It is assumed that both the pursuer and the evader are treated as particles with conventional acceleration constraints. The direction of velocity is adjusted by conventional acceleration, and the acceleration is always perpendicular to the direction of velocity. The motion model between the pursuer and the evader is shown in Figure 2. Based on the relationship between the pursuer

P_{i}

and the evader

E_{i}

in three-dimensional space, a nonlinear differential equation has been derived, as detailed in the study by Song et al. [49]. The model proposed in this paper forms the foundation for the models in the Figure 2.

{\dot{R}}_{P i} = v_{E i} cos θ_{E i} cos φ_{E i} - v_{P i} cos θ_{P i} cos φ_{P i}

(1)

R_{P i} {\dot{θ}}_{L i} = v_{E i} sin θ_{E i} - v_{P i} sin θ_{P i}

(2)

R_{P i} {\dot{φ}}_{L i} cos θ_{L i} = v_{P i} cos θ_{P i} sin φ_{P i} - v_{E i} cos θ_{E i} sin φ_{E i}

(3)

\begin{matrix} {\dot{θ}}_{P i} = & \frac{A_{y P i}}{v_{P i}} + tan θ_{L i} sin φ_{P i} \times \frac{(v_{P i} cos θ_{P i} sin φ_{P i} - v_{E i} cos θ_{E i} sin φ_{E i})}{R_{P i}} \\ + cos φ_{P i} \frac{(v_{P i} sin θ_{P i} - v_{E i} sin θ_{E i})}{R_{P i}} \end{matrix}

(4)

\begin{matrix} {\dot{φ}}_{P i} = & \frac{A_{z P i}}{v_{P i} cos θ_{P i}} + sin θ_{P i} cos φ_{P i} tan θ_{L i} + \frac{v_{E i} cos θ_{E i} sin φ_{E i} - v_{P i} cos θ_{P i} sin φ_{P i}}{R_{P i} cos θ_{P i}} \\ - sin θ_{P i} sin φ_{P i} \frac{v_{E i} sin θ_{E i} - v_{P i} sin θ_{E i}}{R_{P i} cos θ_{P i}} - \frac{v_{E i} cos θ_{E i} sin φ_{E i} - v_{P i} cos θ_{P i} sin φ_{P i}}{R_{P i}} \end{matrix}

(5)

\begin{matrix} {\dot{θ}}_{E i} = & \frac{A_{y E i}}{v_{E i}} + tan θ_{L i} sin φ_{E i} * \frac{(v_{P i} cos θ_{P i} sin φ_{P i} - v_{E i} cos θ_{E i} sin φ_{E i})}{R_{P i}} \\ + cos φ_{E i} \frac{(v_{P i} sin θ_{P i} - v_{E i} sin θ_{E i})}{R_{P i}} \end{matrix}

(6)

\begin{matrix} {\dot{φ}}_{E i} = & \frac{A_{z E i}}{v_{E i} cos θ_{E i}} + sin θ_{E i} cos φ_{E i} tan θ_{L i} + \frac{v_{E i} cos θ_{E i} sin φ_{E i} - v_{P i} cos θ_{P i} sin φ_{P i}}{R_{P i} cos θ_{E i}} \\ - sin θ_{E i} sin φ_{E i} \frac{v_{E i} sin θ_{E i} - v_{P i} sin θ_{E i}}{R_{P i} cos θ_{E i}} - \frac{v_{E i} cos θ_{E i} sin φ_{E i} - v_{P i} cos θ_{P i} sin φ_{P i}}{R_{P i}} \end{matrix}

(7)

The pursuit–evasion problem in three-dimensional space is modeled as multiple pursuers attempting to capture an evader. The symbols used to describe the system dynamics are shown in Table 1. In this paper, we default to using the Euclidean norm (2-norm) as the measure for norms. When other norms (e.g., infinity norm) are used, they will be explicitly indicated in the subscript.

3. Coverage Interception Based on Spherical Polar Projection Mapping

The coverage probability calculation method designed in this paper primarily involves calculating the intersection point of the velocity vector with the Ahlswede ball in space, which corresponds to the center of an ellipse. By using the current normal acceleration, we compute the components along the Y-axis and Z-axis in the velocity coordinate system, thereby determining the major and minor axes of the ellipse. When the normal acceleration is zero, a small positive number is used as a substitute to ensure the validity of the coverage probability calculation. We assume that the Ahlswede ball between the pursuer and the evader is as follows:

\begin{matrix} {(x - \frac{{\bar{x}}_{P} - λ_{i}^{2} x_{E}}{1 - λ_{i}^{2}})}^{2} + {(y - \frac{{\bar{y}}_{P} - λ_{i}^{2} y_{E}}{1 - λ_{i}^{2}})}^{2} + {(z - \frac{{\bar{z}}_{P} - λ_{i}^{2} z_{E}}{1 - λ_{i}^{2}})}^{2} \\ = \frac{λ_{i}^{2}}{{(1 - λ_{i}^{2})}^{2}} [{(x_{E} + {\bar{x}}_{P})}^{2} + {(y_{E} + {\bar{y}}_{P})}^{2} + {(z_{E} + {\bar{z}}_{P})}^{2}] \end{matrix}

(8)

where

{\bar{x}}_{P} = \sum_{i = 1}^{N} γ_{P x i} x_{P i}, {\bar{y}}_{P} = \sum_{i = 1}^{N} γ_{P y i} y_{P i}, {\bar{z}}_{P} = \sum_{i = 1}^{N} γ_{P z i} z_{P i},

λ_{i} = \frac{∥v_{P i}∥}{∥v_{E}∥},

Let

∥v_{P i}∥

represent the velocity of the pursuer in three-dimensional space and

∥v_{E}∥

represent the velocity of the evader in three-dimensional space.

In this formula, we define the Apollonian sphere as the set of points in three-dimensional space that are bounded by a radius dependent on the velocities of both the pursuer and the evader. The variables

x, y, z

represent the coordinates in space, while

{\bar{x}}_{P}, {\bar{y}}_{P}, {\bar{z}}_{P}

are the weighted averages of the pursuer’s positions in three-dimensional space, given by the following summations:

We assume that the pursuer’s velocity vector in the inertial reference frame follows the line equation:

\begin{matrix} x_{P i}^{I} & = x_{P i} + v_{x} Δ t \\ y_{P i}^{I} & = y_{P i} + v_{y} Δ t \\ z_{P i}^{I} & = z_{P i} + v_{z} Δ t \end{matrix}

where

x_{P i}, y_{P i}, z_{P i}

are the position coordinates of the pursuer,

v_{x}, v_{y}, v_{z}

are the velocity components of the pursuer in three-dimensional space, and

Δ t

is the time step.

Spherical Polar Projection Mapping

When the pursuer’s velocity vector extends to the surface of the Ahlswede ball, it will cover the evader. To achieve this, we first need to allocate local coordinate systems on the sphere to describe the entire sphere through these local coordinate systems. This approach ensures that every point on the sphere has a unique corresponding coordinate. By combining these local coordinate systems, the entire sphere can be fully described. Next, we use spherical polar projection mapping to project these local regions from the three-dimensional real space onto a two-dimensional complex plane, as shown in Figure 3. Overlapping regions: The local coordinate systems of the coverage regions projected by different pursuers often overlap, leading to intersecting small areas. In these overlapping regions, we need to ensure that the local coordinate systems are correctly transformed to maintain consistency between different areas. Coverage of the sphere: Since the sphere is a closed surface, multiple local coordinate systems are required to cover the entire sphere. For example, on the unit sphere

S^{2}

in three-dimensional space, multiple small regions can be formed through the pursuer’s region coverage, and a local coordinate system is defined for each small region. Ultimately, these local coordinate systems will work together to ensure maximum coverage of the area where the evader is located.

Select the coordinate system

O X Y Z

such that

O X

and

O Y

coincide with the

O x

and

O y

axes on the complex plane and

O N

is perpendicular to the complex plane, with

O Z

aligned along the diameter

O N

. The complex number

I^{*} = x_{I} + y_{I} i

corresponds to the point

(x_{I}, y_{I})

, and the center of the sphere has coordinates

(x_{E}, y_{E}, z_{E})

. We can calculate the coordinates of point N using the center of the sphere’s coordinates as follows:

(\frac{x_{P} - λ_{i}^{2} x_{E}}{1 - λ_{i}^{2}}, \frac{y_{P} - λ_{i}^{2} y_{E}}{1 - λ_{i}^{2}}, \frac{z_{P} - λ_{i}^{2} z_{E}}{1 - λ_{i}^{2}})

(9)

This formula represents the calculation of the coordinates of point N in three-dimensional space, based on the center coordinates of the sphere

x_{E}, y_{E}, z_{E}

. The coordinates

x_{P}, y_{P}, z_{P}

represent the position of point P, and

λ_{i}

is an adjustment factor that scales the difference between point P and the sphere’s center. Specifically,

λ_{i}

influences the variation in the coordinates

x_{P}, y_{P}, z_{P}

in relation to the center of the sphere. The denominator

1 - λ_{i}^{2}

ensures the proper scaling and normalization of the coordinates, which is essential for maintaining the geometric relationships between the points in three-dimensional space. As

λ_{i}

varies, the calculated coordinates of point N adjust accordingly, reflecting the influence of the center’s position and the relative distance between the points.

This formula represents the calculation of the coordinates of point N in three-dimensional space based on the sphere’s center coordinates

x_{E}, y_{E}, z_{E}

. The parameter

λ_{i}

is an adjustment factor that influences the variation in

x_{P}, y_{P}, z_{P}

.

(\frac{x_{P} - λ_{i}^{2} x_{E}}{1 - λ_{i}^{2}}, \frac{y_{P} - λ_{i}^{2} y_{E}}{1 - λ_{i}^{2}}, \frac{z_{P} - λ_{i}^{2} z_{E}}{1 - λ_{i}^{2}} + \frac{λ_{i}}{(1 - λ_{i}^{2})} \sqrt{{(x_{E} + x_{P})}^{2} + {(y_{E} + y_{P})}^{2} + {(z_{E} + z_{P})}^{2}})

This equation calculates the new coordinates of point N by adding a correction term to the previous formula. The correction term

\frac{λ_{i}}{(1 - λ_{i}^{2})}

is multiplied by the distance between the coordinates of the point P and the sphere’s center. The distance is represented by the Euclidean norm, which is the square root of the sum of the squared differences between the coordinates of the sphere center and point P. This term adjusts the calculated coordinates of point N, making it more accurate by accounting for the relative motion between the pursuer and evader.

Thus, the expression for the line

\vec{N I^{*}}

is as follows:

\vec{N I^{*}} = (\begin{matrix} \frac{x_{P} - λ_{i}^{2} x_{E}}{1 - λ_{i}^{2}}, \\ \frac{y_{P} - λ_{i}^{2} y_{E}}{1 - λ_{i}^{2}}, \\ \frac{z_{P} - λ_{i}^{2} z_{E}}{1 - λ_{i}^{2}} - \frac{λ_{i}}{(1 - λ_{i}^{2})} \sqrt{{(x_{E} + x_{P})}^{2} + {(y_{E} + y_{P})}^{2} + {(z_{E} + z_{P})}^{2}} \end{matrix})

This formula describes the line equation

\vec{N I^{*}}

, which represents the relationship between point N and the complex number

I^{*}

. The coordinates

z_{P}, y_{P}, x_{P}

are calculated based on the sphere’s center coordinates and the adjustment factor

λ_{i}

, which influences the geometry of the problem. The correction term is applied to the z-coordinate to account for the distance and improve the accuracy of the model.

Therefore, the line equation can be expressed as follows:

\begin{matrix} \frac{X - \frac{x_{P} - λ_{i}^{2} x_{E}}{1 - λ_{i}^{2}}}{\frac{x_{P} - λ_{i}^{2} x_{E}}{1 - λ_{i}^{2}}} \\ = \frac{Y - \frac{y_{P} - λ_{i}^{2} y_{E}}{1 - λ_{i}^{2}}}{\frac{y_{P} - λ_{i}^{2} y_{E}}{1 - λ_{i}^{2}}} \\ = \frac{Z - \frac{z_{P} - λ_{i}^{2} z_{E}}{1 - λ_{i}^{2}}}{\frac{z_{P} - λ_{i}^{2} z_{E}}{1 - λ_{i}^{2}} - \frac{λ_{i}}{(1 - λ_{i}^{2})} \sqrt{{(x_{E} + x_{P})}^{2} + {(y_{E} + y_{P})}^{2} + {(z_{E} + z_{P})}^{2}}} \end{matrix}

(10)

This represents the ratio form of the line equation, which is used to express the relationship between the points on the sphere and the complex plane. Each of the three equations in the ratio form corresponds to the normalized difference between the coordinates

X, Y, Z

and the respective coordinates of the point N on the sphere. This form is essential for modeling the geometric relationship between the positions of the pursuer and the evader in three-dimensional space, ensuring that the calculations remain consistent and accurate.

To simplify subsequent calculations, we define the following:

\begin{matrix} X_{c} & = \frac{x_{P} - λ_{i}^{2} x_{E}}{1 - λ_{i}^{2}}, \\ Y_{c} & = \frac{y_{P} - λ_{i}^{2} y_{E}}{1 - λ_{i}^{2}}, \\ Z_{c} & = \frac{z_{P} - λ_{i}^{2} z_{E}}{1 - λ_{i}^{2}}, \\ B_{r} & = \frac{λ_{i}}{(1 - λ_{i}^{2})} \sqrt{{(x_{E} + x_{P})}^{2} + {(y_{E} + y_{P})}^{2} + {(z_{E} + z_{P})}^{2}} \end{matrix}

Here, we define new variables

X_{c}, Y_{c}, Z_{c}

, and

B_{r}

to simplify the subsequent calculations. These variables are derived from the previous expressions for the coordinates of point N and the correction term. The variables

X_{c}, Y_{c}, Z_{c}

represent the adjusted coordinates of point P, while

B_{r}

represents a correction factor that accounts for the distance between the points on the sphere. By introducing these simplified variables, we reduce the complexity of the formulas and make the upcoming steps in the calculation more manageable.

At this point, the line equation can be rewritten as follows:

\frac{X - X_{c}}{x - X_{c}} = \frac{Y - Y_{c}}{y - Y_{c}} = \frac{Z - Z_{c} - B_{r}}{- Z_{c} - B_{r}}

(11)

This represents the simplified form of the line equation, using

X_{c}, Y_{c}, Z_{c}, B_{r}

to represent the parameters of the line. By expressing the equation in terms of these new variables, we streamline the process of calculating the corresponding points on the sphere and the complex plane. The simplified equation reduces the complexity and allows for easier manipulation in subsequent calculations.

We can express the correspondence between points on the sphere and points on the complex plane as follows:

x = \frac{(X - X_{c}) (- Z_{c} - B_{r})}{Z - Z_{r} - B_{r}} + X_{c}

y = \frac{(Y - Y_{c}) (- Z_{c} - B_{r})}{Z - Z_{c} - B_{r}} + Y_{c}

These formulas illustrate the correspondence between points on the sphere and points on the complex plane. They provide the relationships between the coordinates

x, y

on the complex plane and the corresponding coordinates

X, Y

on the sphere, adjusted by the values of

X_{c}, Y_{c}, Z_{c}, B_{r}

. This correspondence is crucial for converting between the two coordinate systems and analyzing their geometric relationship.

Furthermore, the correspondence between points on the complex plane and points on the sphere is given by the following:

\begin{matrix} Z & = \frac{({(x - X_{c})}^{2} + {(y - Y_{c})}^{2}) ((Z_{c} + B_{r})) - ({(Z_{r} + B_{r})}^{2}) (B_{r} - Z_{c})}{{(x - X_{c})}^{2} + {(y - Y_{c})}^{2} + {(Z_{c} + B_{r})}^{2}}, \\ X & = \frac{({(x - X_{c})}^{2} + {(y - Y_{c})}^{2}) ((Z_{c} + B_{r})) - ({(Z_{r} + B_{r})}^{2}) (B_{r} - Z_{c})}{({(x - X_{c})}^{2} + {(y - Y_{c})}^{2} + {(Z_{c} + B_{r})}^{2}) (- Z_{c} - B_{r})} + (x - X_{c}) + X_{c}, \\ Y & = \frac{({(x - X_{c})}^{2} + {(y - Y_{c})}^{2}) ((Z_{c} + B_{r})) - ({(Z_{r} + B_{r})}^{2}) (B_{r} - Z_{c})}{({(x - X_{c})}^{2} + {(y - Y_{c})}^{2} + {(Z_{c} + B_{r})}^{2}) (- Z_{c} - B_{r})} + (y - Y_{c}) + Y_{c} \end{matrix}

(12)

These formulas provide expressions for solving the

X, Y, Z

coordinates on the sphere based on the known

x, y

coordinates on the complex plane. They demonstrate the precise relationship between the sphere and the complex plane, allowing for the conversion between the two coordinate systems. These equations are essential for mapping the points on the sphere to the complex plane and vice versa, which is key to understanding the geometric relationship between the two systems.

Through spherical polar projection, a circle not passing through point N is mapped to a circle on the complex plane. Suppose the equation of the spherical circle is

A (X^{2} + Y^{2} + Z^{2}) + C X + D Y + E Z + G = 0, (A \neq 0)

(13)

The equation of the circle on the complex plane can be written as

a {(x - x_{I})}^{2} + b {(y - y_{I})}^{2} + c = 0, (a, b \neq 0)

(14)

This equation is used to represent the relationship between the circle on the sphere and the circle on the complex plane. In the simulation process, by solving the system of equations, we obtain the expressions for each parameter, which leads to the equation of the circle on the complex plane.

By solving the Ahlswede ball equation and the line equation, we obtain the intersection points of the pursuer

[x_{P i}^{I}, y_{P i}^{I}, z_{P i}^{I}]

and the evader

[x_{E}^{I}, y_{E}^{I}, z_{E}^{I}]

. Based on this, the normal accelerations of both parties serve as the major and minor axes of two ellipses. We then project these two ellipses onto the complex plane and calculate their intersection points. For the intersection points of these two ellipses, there are two possible cases. The first case is shown in Figure 4, and the second case is shown in Figure 5.

The projection plane represents the projection of the current player’s information onto the

X - Z

plane in the inertial frame, which facilitates the calculation of the coverage area. The choice of the projection plane is related to the distance between the pursuer and the evader. The projection plane in the inertial system is parallel to the

Y_{I} - Z_{I}

plane, with the

X_{I}

-axis given by

X_{I} [\frac{1}{α} (\frac{1}{n} \sum_{i = 1}^{n} x_{P_{i}} + x_{E})]

Circle

C_{E}

represents the maximum normal cutoff coverage area of the evader E, where the center of the circle is the optimal game point of the evader

I_{E}^{*} = (x_{E}^{*}, y_{E}^{*}, z_{E}^{*})

, and the radius is the maximum normal cutoff

a_{E}^{\max}

of the evader.

Ellipse

E_{E}

represents the current normal cutoff coverage area of the evader, with the two axes of the ellipse defined as follows:

a_{E}^{Y}

is the normal acceleration along the Y-axis in the evader’s velocity system, and

a_{E}^{Z}

is the normal acceleration along the Z-axis in the evader’s velocity system. The center of the ellipse

E_{E}

is the same as the center of circle

C_{E}

.

Ellipse

E_{P_{i}}

represents the normal cutoff coverage area of the pursuer

P_{i}

. The center of the ellipse

E_{P_{i}}

is the optimal game point of the pursuer

I_{P_{i}}^{*} = (x_{P_{i}}^{*}, y_{P_{i}}^{*}, z_{P_{i}}^{*})

, and the two axes of the ellipse

E_{P_{i}}

are set as follows:

a_{P_{i}}^{Y}

is the normal acceleration along the Y-axis in the pursuer’s velocity system, and

a_{P_{i}}^{Z}

is the normal acceleration along the Z-axis in the pursuer’s velocity system.

The coverage areas of the pursuer and evader intersect at four focal points. This is simply for illustrative purposes, and different forms of the focal points will be given later. The intersection points

I_{1}

and

I_{3}

represent the intersections of the current acceleration coverage areas of the pursuer and evader. The coverage area

S_{1}

is formed by the trajectory from

I_{1}

to the lower trajectory

I_{3}

. The intersection points

I_{2}

and

I_{4}

represent the intersections of the current velocity coverage area of the pursuer and the maximum velocity coverage area of the evader. The coverage area

S_{2}

is formed by the trajectory from

I_{3}

to the lower trajectory

I_{4}

.

4. Calculation of Interception Time

In this section, we first introduce a method for calculating the interception time in two-dimensional space. Based on this, we extend the method to three-dimensional space and propose a method for calculating interception time under three-dimensional conditions. Additionally, we design a guidance method based on time deviation, which uses a variable proportional coefficient to adjust the proportional guidance rate.

The schematic diagram of the synchronous interception of two-to-one missiles is shown in Figure 6.

d_{1}, d_{2}

represent the set flight ranges of the missiles at their termination time. We use two missiles intercepting two virtual motions of the target simultaneously, where these two virtual points are defined at the initial moment. Apart from the acceleration, all other initial parameters are the same. In this way, two missiles can intercept the same batch of targets, thus achieving batch interception. If the two missiles can pass the acceleration determination values to the virtual point commands of the same batch of targets, it can be assumed that when one missile reaches its maximum acceleration, both missiles can successfully intercept the target. This way, a complex interception of the target can be achieved.

As shown in Figure 6, during the interception process, the missiles are modeled using virtual points. This method can effectively simplify the complex interaction between the missile and the target. Next, we will derive more detailed expressions based on these models and use mathematical formulas to optimize the interception time and path planning.

Lemma 1.

The launch time of a missile intercepting a moving target in two-dimensional space is given by [23]

t_{g o S} = \frac{1}{K_{b}} (\frac{α |σ_{M}^{α + 3} (0)|}{6 (α + 3)} - \frac{|σ_{M}^{α + 1} (0)|}{α + 1})

(15)

Δ t_{g o} = \frac{V_{M F} (t_{g o S}) \cdot (t_{g o F} - t_{g o S})}{V_{M R} (t_{g o S}) - V_{T} cos σ_{T F} (0)}

(16)

t_{g o} = t_{g o S} + Δ t_{g o}

(17)

where

σ_{M}

is the angle between the missile’s velocity system and the line-of-sight system;

t_{g o F}

is the forward time of missile attack on the virtual stationary target

T_{F f}

calculated by the PN guidance law;

K_{b}

is a constant related to the viewing angle; α is a constant related to the proportional coefficient, and

α = - \frac{N - 2}{N - 1}

; N is the proportional coefficient.

When calculating interception time in three-dimensional space, we improve Lemma 1. First, we calculate the attack time

t_{g o L i}

of the missile along a straight path. Then, compared to the two-dimensional case, the interception time calculation in three-dimensional space must consider the influence of the z-axis. For the straight path, the calculation process is as follows:

t_{g o L i} = \frac{1}{v_{M i}} \sqrt{\begin{matrix} {[x_{T i} (t_{g o L i}) - x_{M i} (0)]}^{2} \\ + {[y_{T i} (t_{g o L i}) - y_{M i} (0)]}^{2} \\ + {[z_{T i} (t_{g o L i}) - z_{M i} (0)]}^{2} \end{matrix}}

(18)

The equation for

t_{g o L i}

can be derived as

\begin{matrix} [v_{T i}^{2} - v_{M i}^{2} + a_{T i z} (z_{T i} - z_{M i}) + a_{T i y} (y_{T i} - y_{M i})] t_{g o L i}^{2} \\ + 2 v_{T i} (y_{T i} - y_{M i}) t_{g o L i} + R_{0}^{2} = 0 \end{matrix}

(19)

Next, the time required for the missile to attack along the straight path is calculated as follows:

t_{g o L i} = \frac{(- n - \sqrt{n^{2} - 4 {\dot{R}}_{i}^{2 m_{i}}})}{2 m_{i}}

(20)

where

{\dot{R}}_{i}

is the derivative of the distance between the target and the missile. The expression for

m_{i}

is

m_{i} = v_{M i}^{2} - v_{T i}^{2} + a_{T i y} Δ y + a_{T i z} Δ z,

where

a_{T i y}

is the component of the target’s velocity along the Y-axis in the target velocity coordinate system. Additionally, the value of n is

n = 2 v_{T i} Δ x

where

Δ x_{i}

represents the distance between the target and the missile. Next, we calculate the time required for the missile to reach the virtual point. After determining the attack time

t_{g o L i}

along the straight line, we assume the target moves to a virtual point

T F_{V}

in the velocity coordinate system. Through the transformation matrix, we can obtain the position of the virtual point

T F

in the ground reference system, thus calculating the terminal velocity of the target. The missile’s speed is then converted to the virtual line-of-sight system. At the virtual target point, the angles

θ_{M i F}

and

φ_{M i F}

represent the angle between the missile’s line-of-sight system and velocity system, and

θ_{T i F}

and

φ_{T i F}

represent the angle between the target’s velocity system and the line-of-sight system. The angles

σ_{M i F}

and

σ_{T i F}

are calculated as follows:

σ_{M i F} = arccos (cos (θ_{M i F}) cos (φ_{M i F}))

(21)

σ_{T i F} = arccos (cos (θ_{T i F}) cos (φ_{T i F}))

(22)

The angles

σ_{M i F}

and

σ_{T i F}

are intermediate variables and do not have actual physical meaning. The time for the missile to reach the virtual position is given by

\begin{matrix} t_{g o F i} = \frac{(- {|σ_{M i F}|}^{α + 1})}{α + 1} + \frac{α {|σ_{M i F}|}^{α + 3}}{6 (α + 3)} \\ \begin{matrix}  \end{matrix} + \frac{K_{a} {|σ_{M i F}|}^{α + 2}}{α + 2} - K_{a} \frac{α {|σ_{M i F}|}^{α + 4}}{6 (α + 4)} \end{matrix}

(23)

where

R_{F i}

is the distance between the missile and the virtual point.

The constant

K_{b} = \frac{- (N - 1) v_{M i} {(|sin (σ_{M i F})|)}^{\frac{1}{N - 1}}}{R_{F i} e^{K_{a} |σ_{M i F}|}}

represents a constant related to the missile’s velocity and the angle

σ_{M i F}

, where

v_{M i}

is the missile’s velocity,

R_{F i}

is the distance from the missile to the virtual point, and

K_{a}

is the missile’s lift-to-drag ratio constant.

The constant

K_{a} = \frac{N}{N - 1} y t a

, where

y t a

is the missile’s lift-to-drag ratio.

The deviation time can be calculated as

Δ t_{g o i} = \frac{v_{M i T} t_{d i}}{v_{M i T} (v_{M i T} - cos (σ_{T i F}) v_{T i})}

(24)

where

Ξ = \frac{y t a N sin (σ_{M i F}) (0 - σ_{M i F})}{N - 1}

represents a term involving the missile’s lift-to-drag ratio, the proportional coefficient N, and the angle

σ_{M i F}

.

The missile’s velocity after correction is given by

v_{M i T} = v_{M i} e^{Ξ}

, where

v_{M i}

is the missile’s initial velocity and

Ξ

is a correction factor based on the missile’s lift-to-drag ratio.

Additionally,

t_{d i} = t_{g o F i} - t_{g o L i}

, where

t_{g o F i}

is the forward time for missile attack, and

t_{g o L i}

is the time to reach a stationary target.

In Lemma 1, during each iteration, we consider a stationary target. However, in this paper, we need to consider the target’s movement with random acceleration. Therefore, we need to subtract the target’s movement impact from the time calculated in each iteration.

Δ t_{g o T i} = - \frac{\sqrt{\begin{matrix} {[x_{T i} (t + t_{g o L i}) - x_{T i} (t)]}^{2} \\ + {[y_{T i} (t + t_{g o L i}) - y_{T i} (t)]}^{2} \\ + {[z_{T i} (t + t_{g o L i}) - z_{T i} (t)]}^{2} \end{matrix}}}{(\begin{matrix} v_{T i} cos θ_{T i} (t + t_{g o L i}) cos φ_{T i} (t + t_{g o L i}) \\ - v_{T i} cos θ_{T i} (t) cos φ_{T i} (t) \end{matrix})}

(25)

where t is the current time and

t_{g o S i}

is the interception time.

t_{g o S i} = t_{g o L i} + Δ t_{g o i} + Δ t_{g o T i}

(26)

The difference between the current estimated interception time and the actual interception time can be obtained.

t_{e r r o r i}

is the error time,

i = 1, 2

.

t_{e r r o r i} = t_{g o i} - t_{g o S i}

(27)

In the aforementioned methods for calculating coverage probability and interception time, we primarily focus on the spatial relationship and coverage probability between the pursuer and the evader. However, this is a static analysis of the pursuit–evasion game, while in practice, the behavior of the pursuer and the evader is dynamic and there are strategic interactions between them. To address this complex dynamic game, we introduce the Q-learning-cover algorithm.

The Q-learning-cover algorithm uses reinforcement learning to train the strategies of the pursuer and evader, combining area cover and time coordination problems to optimize the solution to the pursuit–evasion game. The key to this algorithm is to gradually optimize the strategy through Q-value updates, so that the pursuer maximizes the probability of capturing the evader, while the evader tries to escape the pursuit by adjusting its strategy. Through this method, the Q-learning-cover algorithm not only considers the actions of both parties in the game but also effectively adapts to the changing game environment.

Next, we will provide a detailed introduction to the design and implementation of the Q-learning-cover algorithm and analyze its application effectiveness in the pursuit–evasion game.

5. Q-Learning-Cover Algorithm

In this section, we present the Q-learning-cover algorithm, which is based on the previously described coverage probability calculation method. The algorithm trains both the pursuer and the evader separately, thereby solving the pursuit–evasion game. In the pursuit–evasion game, the actions of the pursuer and evader are interdependent. Therefore, both must be trained individually to allow the model to effectively learn how to make optimal decisions based on real-time states and environments.

The Q-learning-cover algorithm combines the Q-learning algorithm from reinforcement learning with the concept of area cover. By iteratively training the behavior of each participant (the pursuer and the evader), the strategy is gradually optimized. Specifically, the goal of the pursuer is to maximize the probability of capturing the evader by continuously adjusting its behavior, while the evader tries to adjust its strategy to maximize its chances of escaping the pursuit.

In this algorithm, the training of the reinforcement learning model considers not only the actions of the participants but also integrates the area coverage probabilities and the synergistic effects of both parties’ behaviors at different points in time. The specific training process gradually improves the expected return of each participant in each state by updating the Q-values. To achieve this, the Q-learning-cover algorithm incorporates area coverage and time factors into the Q-value updating process, forming a bilateral game strategy.

Figure 7 shows the game strategy of the Q-learning-cover algorithm. On the left side is the strategy diagram for the pursuer, and on the right side is the strategy diagram for the evader. The two interact in different game states and adjust according to the feedback from reinforcement learning. Through this method, both the pursuer and evader can find the optimal pursuit–evasion strategy in complex environments.

By incorporating area coverage and time coordination, the Q-learning-cover algorithm is able to provide a more accurate solution to the pursuit–evasion game in dynamic environments. The behavior patterns of both the pursuer and the evader are continuously optimized during the training process, eventually reaching the optimal strategy. This enables the algorithm to effectively address complex pursuit–evasion problems encountered in real-world applications. The update formula for Q-learning-cover is shown below:

The Q-learning-cover algorithm is designed to efficiently plan the trajectories of multiple pursuers to intercept a single evader. The following pseudo-code outlines the main steps of the algorithm, incorporating both the interception time and regional coverage calculation before the algorithm iterations begin. Algorithm 1 shows the steps of the Q-learning-cover algorithm.

Algorithm 1 Q-learning-cover algorithm

1:: Initialize the state space S, action space A, and Q-table $Q (s, a)$
2:: Set learning parameters: $α$ (learning rate), $γ$ (discount factor), and $ϵ$ (exploration rate)
3:: (Interception Time Calculation)
4:: Calculate the interception time $t_{i n t e r c e p t}$ based on the positions and velocities of pursuers and the evader.
5:: (Regional Coverage Calculation)
6:: Calculate the regional coverage based on the positions of the pursuers and the evader’s trajectory, using geometric covering techniques such as the Ahlswede sphere.
7:: for each episode do
8:: Initialize the environment (pursuer and evader positions)
9:: while not converged do
10:: Select action a using an epsilon-greedy policy: $a = arg {max}_{a} Q (s, a)$
11:: Execute action a and observe new state $s^{'}$ and reward r
12:: Update Q-value using the formula:
$Q (s, a) \leftarrow Q (s, a) + α (r + γ max_{a^{'}} Q (s^{'}, a^{'}) - Q (s, a))$
13:: end while
14:: end for
15:: Return the optimized policy

Theorem 1.

When the Q-learning-cover algorithm satisfies the selection upper bound condition in Equation (28), then convergence is achieved.

\begin{matrix} \frac{1}{N} \sum_{t = 0}^{N - 1} E [∥ Q_{t} - Q^{*} ∥_{\infty}] \leq (\frac{1 - ρ}{1 + ρ}) \\ \times [(\frac{1}{N} \sum_{t = 0}^{N - 1} E [∥ Q_{t} - Q^{*} ∥^{2}]) + γ^{2} {(\frac{R_{\max}}{1 - γ})}^{2} + \frac{1}{N} \sum_{t = 0}^{N - 1} E [∥ π_{t} - π^{*} ∥^{2}]] \end{matrix}

(28)

-: $\frac{1}{N} \sum_{t = 0}^{N - 1} E [∥ Q_{t} - Q^{*} ∥_{\infty}]$ : This term represents the average error in the Q-values over N iterations. $Q_{t}$ is the Q-value at time t, and $Q^{*}$ is the optimal Q-value. The error is measured using the infinity norm ( ${∥ \cdot ∥}_{\infty}$ ), which refers to the maximum difference between the current Q-values and the optimal Q-values. This term gives the expected error across all iterations.
-: $(\frac{1 - ρ}{1 + ρ})$ : This factor is the convergence rate. It ensures that as the algorithm progresses, the error decreases over time. The parameter $ρ$ controls how quickly the algorithm converges—lower values of $ρ$ result in faster convergence.
-: $(\frac{1}{N} \sum_{t = 0}^{N - 1} E [∥ Q_{t} - Q^{*} ∥^{2}])$ : This term is the second moment of the error. It measures the average squared error of the Q-values. The second moment provides a more detailed view of how the error evolves over time, which is crucial for understanding the stability of the algorithm.
-: $γ^{2} {(\frac{R_{\max}}{1 - γ})}^{2}$ : This term accounts for the discount factor $γ$ , which determines how much weight is given to future rewards. $R_{\max}$ represents the maximum reward, and this term adjusts the impact of future rewards on the convergence.
-: $\frac{1}{N} \sum_{t = 0}^{N - 1} E [∥ π_{t} - π^{*} ∥^{2}]$ : This is the policy error, which measures how far the current policy $π_{t}$ deviates from the optimal policy $π^{*}$ . This term represents the gap between the learned policy and the optimal policy.

This theorem shows that the Q-learning-cover algorithm will converge to the optimal Q-values and optimal policy as long as the algorithm satisfies the upper bound condition specified in Equation (28). The convergence is guaranteed by controlling the Q-value error, policy error, and the impact of future rewards through the discount factor

γ

. Each term on the right-hand side of the equation ensures that as the algorithm progresses, the error will decrease, ultimately leading to convergence.

Proof.

Our proof starts from the following elementary analysis and inverse recursion formula:

\begin{matrix} Δ_{t} = Q_{t} - Q^{*} \\ = (I - Λ_{t}) Q_{t - 1} + Λ_{t} (r + γ P_{t} π_{t} + β {\tilde{O}}_{t} + μ {\tilde{T}}_{t}) - Q^{*} \\ = (I - Λ_{t}) (Q_{t - 1} - Q^{*}) + Λ_{t} (r + γ P_{t} π_{t} + β {\tilde{O}}_{t} + μ {\tilde{T}}_{t} - Q^{*}) \\ = (I - Λ_{t}) (Q_{t - 1} - Q^{*}) + γ Λ_{t} (P_{t} π_{t} - {P π}^{*}) + Λ_{t} (β {\tilde{O}}_{t} + μ {\tilde{T}}_{t}) \\ = (I - Λ_{t}) Δ_{t - 1} + γ Λ_{t} (P_{t} - P) π^{*} + γ Λ_{t} P_{t} (π_{t} - π^{*}) + Λ_{t} (β {\tilde{O}}_{t} + μ {\tilde{T}}_{t}) \\ = A Δ_{t - 1} + γ Λ_{t} (P_{t} - P) π^{*} + γ Λ_{t} P_{t} (π_{t} - π^{*}) + Λ_{t} (β {\tilde{O}}_{t} + μ {\tilde{T}}_{t}) \end{matrix}

(29)

where for any

t > 0

, the first line comes from the update rule of the Q-learning algorithm, which computes the Q-value update based on the previous Q-value, the reward, and the transition probabilities. The term

Q_{t}

is the current Q-value, and

Q^{*}

is the optimal Q-value. The second-to-last line comes from the Bellman equation, which represents the optimal Q-value function as

Q^{*} = r + γ P π^{*}

[17], where r is the reward,

γ

is the discount factor, P is the transition probability matrix, and

π^{*}

is the optimal policy. The Bellman equation is used to express the recursive relationship between the Q-values, the reward, and the future state values.

We assume that

\begin{matrix} {∥A∥}_{\infty} : = {max}_{1 \leq i \leq m} \sum_{j = 1}^{n} |A_{i j}| \\ = {max}_{1 \leq i \leq m} \sum_{j} |{[A]}_{i j}| \\ = {max}_{1 \leq i \leq m} \sum_{j} |I - Λ_{t}| \\ = max_{1 \leq i \leq m} (1 - η) \\ \leq ρ \end{matrix}

In this formula, we are defining the infinity norm of a matrix A, which is calculated by taking the maximum of the row sums of the absolute values of the elements. The first line expresses the definition of the infinity norm for matrix A. The second line represents the same norm, expressed in terms of the elements of matrix A, where

A_{i j}

is the element in the i-th row and j-th column of the matrix. In the third line, we use the identity

I - Λ_{t}

, which is a difference between the identity matrix and a matrix

Λ_{t}

, and the absolute values of its elements are summed to calculate the norm. The fourth line assumes that the maximum of these sums is bounded by

1 - η

, where

η

is a parameter. Finally, the last line indicates that the norm is bounded by a constant

ρ

, which is a convergence parameter. This bound ensures that the norm does not grow unbounded as the algorithm proceeds.

Next, we design the Lyapunov function

V (x) = x^{T} M x

, where for any

ε > 0

,

ρ + ε \in (0, 1)

. We assume the existence of a positive definite matrix

M ≻ 0

, such that

A^{T} M A = {(ρ + ε)}^{2} (M - I)

and

λ_{min} (M) \geq 1,

λ_{max} (M) \leq \frac{|S| |A|}{1 - {(\frac{ρ}{ρ + ε})}^{2}} .

In this part of the proof, we define a Lyapunov function

V (x) = x^{T} M x

, where M is a positive definite matrix. The Lyapunov function is a key tool in proving stability and convergence of dynamic systems. The term

x^{T} M x

ensures that the energy-like quantity is non-negative and decreases over time. We introduce a condition on the matrix M such that it is positive definite (

M ≻ 0

), ensuring that M has strictly positive eigenvalues. The first equation

A^{T} M A = {(ρ + ε)}^{2} (M - I)

defines a relationship between M, the matrix A, and the identity matrix I, where

ρ + ε

is a factor controlling the contraction or expansion behavior of the system. The second inequality

λ_{min} (M) \geq 1

states that the minimum eigenvalue of M is greater than or equal to 1, ensuring that the matrix is not degenerate and that the Lyapunov function behaves correctly. The third inequality

λ_{max} (M) \leq \frac{| S | | A |}{1 - {(\frac{ρ}{ρ + ε})}^{2}}

provides an upper bound on the maximum eigenvalue of M. This bound is critical in determining the stability of the system, as it ensures that M does not grow too large, maintaining the desired convergence properties.

Then, by substituting these formulas into the Lyapunov function, we can obtain

\begin{matrix} E [V (Δ_{t + 1})] \\ = E [{(A Δ_{t} + γ Λ_{t + 1} (P_{t + 1} π_{t} - P π^{*}) + Λ_{t} (β {\tilde{O}}_{t} + μ {\tilde{T}}_{t}))}^{T} M \\ (A Δ_{t} + γ Λ_{t + 1} (P_{t + 1} π_{t + 1} - P π^{*}) + Λ_{t} (β {\tilde{O}}_{t} + μ {\tilde{T}}_{t}))] \\ = Δ_{t}^{T} A^{T} M A Δ_{t} + γ^{2} {(P_{t + 1} π_{t} - P π^{*})}^{T} Λ_{t + 1}^{T} M Λ_{t + 1} \\ (P_{t + 1} π_{t + 1} - P π^{*}) + β^{2} {\tilde{O}}_{t}^{T} Λ_{t + 1}^{T} M Λ_{t + 1} {\tilde{O}}_{t} + μ^{2} {\tilde{T}}_{t}^{T} Λ_{t + 1}^{T} M Λ_{t + 1} {\tilde{T}}_{t} \\ < {(ρ + ε)}^{2} Δ_{t}^{T} (M - I) Δ_{t} + γ^{2} {(P_{t + 1} π_{t} - P π^{*})}^{T} Λ_{t + 1}^{T} M Λ_{t + 1} \\ (P_{t + 1} π_{t + 1} - P π^{*}) + η^{2} (β^{2} {\tilde{O}}_{t}^{T} {\tilde{O}}_{t} + μ^{2} {\tilde{T}}_{t}^{T} {\tilde{T}}_{t}) λ_{max} (M) \\ < {(ρ + ε)}^{2} V (Δ_{t}) - {(ρ + ε)}^{2} {∥ Δ_{t} ∥}^{2} + γ^{2} η^{2} {(P_{t + 1} π_{t} - P π^{*})}^{T} M \\ (P_{t + 1} π_{t + 1} - P π^{*}) + η^{2} (β^{2} {\tilde{O}}_{t}^{T} {\tilde{O}}_{t} + μ^{2} {\tilde{T}}_{t}^{T} {\tilde{T}}_{t}) λ_{max} (M) \\ < {(ρ + ε)}^{2} V (Δ_{t}) - {(ρ + ε)}^{2} {∥ Δ_{t} ∥}^{2} + γ^{2} η^{2} E ({(π^{*})}^{T} M π^{*} + {(π_{t + 1} - π^{*})}^{T} M (π_{t + 1} - π^{*})) \\ + η^{2} (β^{2} {\tilde{O}}_{t}^{T} {\tilde{O}}_{t} + μ^{2} {\tilde{T}}_{t}^{T} {\tilde{T}}_{t}) λ_{max} (M) \\ < {(ρ + ε)}^{2} V (Δ_{t}) - {(ρ + ε)}^{2} {∥ Δ_{t} ∥}^{2} + γ^{2} η^{2} λ_{max} (M) E ({(π^{*})}^{T} π^{*} + {(π_{t + 1} - π^{*})}^{T} (π_{t + 1} - π^{*})) \\ + η^{2} (β^{2} {\tilde{O}}_{t}^{T} {\tilde{O}}_{t} + μ^{2} {\tilde{T}}_{t}^{T} {\tilde{T}}_{t}) λ_{max} (M) \\ < {(ρ + ε)}^{2} V (Δ_{t}) - {(ρ + ε)}^{2} {∥ Δ_{t} ∥}^{2} + γ^{2} η^{2} λ_{max} (M) E (υ^{*}) \\ + η^{2} (β^{2} {\tilde{O}}_{t}^{T} {\tilde{O}}_{t} + μ^{2} {\tilde{T}}_{t}^{T} {\tilde{T}}_{t}) λ_{max} (M) \end{matrix}

(30)

In this part of the proof, we are calculating the expected value of the Lyapunov function

V (Δ_{t + 1})

, which captures the change in the error

Δ

at the next time step

t + 1

. We expand the equation by substituting the previous terms into the Lyapunov function. The first line is the expression for

V (Δ_{t + 1})

, which involves the error at time

t + 1

, and we compute its expectation. The subsequent lines expand this function, considering the evolution of

Δ_{t}

, the effect of the transition probabilities

P_{t}

, and the impact of the perturbation terms

{\tilde{O}}_{t}

and

{\tilde{T}}_{t}

. The inequality in the third line introduces the term

{(ρ + ε)}^{2}

, which governs the contraction of the Lyapunov function over time. It shows how the system’s error reduces based on the parameters

ρ

and

ε

, and the impact of the rewards and perturbations. The term

λ_{max} (M)

controls the scaling of the error, with its maximum eigenvalue affecting the convergence. In the later lines, we apply the assumption that the system is stable, meaning that the error diminishes as the iterations progress. The terms involving

E [υ^{*}]

and

E [π^{*}]

represent the expected values of the optimal policy and Q-values, which further guide the convergence analysis of the system.

We assume that

E (υ *) = E ({(ß *)}^{T} ß * + {(ß_{t} - ß *)}^{T} (ß_{t} - ß *))

, and we can obtain

\begin{matrix} E [V (Δ_{t + 1})] - V (Δ_{t}) & < {(ρ + ε)}^{2} V (Δ_{t}) - V (Δ_{t}) - {(ρ + ε)}^{2} {∥Δ_{t}∥}^{2} + γ^{2} η^{2} λ_{max} (M) E (υ^{*}) \\ + η^{2} (β^{2} {\tilde{O}}_{t}^{T} {\tilde{O}}_{t} + μ^{2} {\tilde{T}}_{t}^{T} {\tilde{T}}_{t}) λ_{max} (M) \\ < ({(ρ + ε)}^{2} - 1) V (Δ_{t}) - {(ρ + ε)}^{2} {∥Δ_{t}∥}^{2} + γ^{2} η^{2} λ_{max} (M) E (υ^{*}) \\ + η^{2} (β^{2} {\tilde{O}}_{t}^{T} {\tilde{O}}_{t} + μ^{2} {\tilde{T}}_{t}^{T} {\tilde{T}}_{t}) λ_{max} (M) \\ < ({(ρ + ε)}^{2} - 1) {∥Δ_{t}∥}^{2} - {(ρ + ε)}^{2} {∥Δ_{t}∥}^{2} + γ^{2} η^{2} λ_{max} (M) E (υ^{*}) \\ + η^{2} (β^{2} {\tilde{O}}_{t}^{T} {\tilde{O}}_{t} + μ^{2} {\tilde{T}}_{t}^{T} {\tilde{T}}_{t}) λ_{max} (M) \\ < - {∥Δ_{t}∥}^{2} + γ^{2} η^{2} λ_{max} (M) E (υ^{*}) \\ + η^{2} (β^{2} {\tilde{O}}_{t}^{T} {\tilde{O}}_{t} + μ^{2} {\tilde{T}}_{t}^{T} {\tilde{T}}_{t}) λ_{max} (M) \end{matrix}

(31)

In this part of the proof, we are computing the expected change in the Lyapunov function over time. The left-hand side of the inequality represents the difference between the Lyapunov function at time

t + 1

and at time t. We are showing how this difference evolves based on the current error

Δ_{t}

and other terms such as perturbations

{\tilde{O}}_{t}

and

{\tilde{T}}_{t}

, as well as the optimal policy

π^{*}

. The first line of the inequality expresses the change in the Lyapunov function as a combination of the current error term

V (Δ_{t})

and additional terms involving the optimal Q-values and policies. The second line represents the bounding term

{(ρ + ε)}^{2}

that ensures the error is controlled. The third line simplifies the expression by combining terms and adjusting for the maximum eigenvalue of the matrix M, which is used to control the growth of the error. In subsequent lines, we apply the assumption that

ρ + ε

is a small contraction factor, leading to the error reduction. The inequality shows that the error term

{∥Δ_{t}∥}^{2}

will eventually become negative, implying that the error decreases over time, ensuring convergence. Finally, the right-hand side of the inequality bounds the error by terms involving the perturbations and the maximum eigenvalue of M, ensuring that the system’s stability holds as the iteration progresses.

According to the initial assumption,

{∥ß *∥}_{\infty} \leq \frac{R_{max}}{1 - γ}

E ({(ß t - ß *)}^{T} (ß t - ß *)) < E ({∥ß_{t} - ß *∥}_{\infty}^{2})

ε = \frac{1 - ρ}{2}, ρ + ε = \frac{1 + ρ}{2}

Taking the expectation on both sides, we can obtain

\begin{matrix} E [{∥Δ_{t}∥}^{2}] & \leq E [V (Δ_{t})] - E [V (Δ_{t + 1})] \\ + \frac{γ^{2} η^{2} | S | | A |}{1 - {(\frac{ρ}{ρ + ε})}^{2}} ({(\frac{R_{max}}{1 - γ})}^{2} + E ({∥π_{t} - π^{*}∥}^{2})) \\ + η^{2} (β^{2} {\tilde{O}}_{t}^{T} {\tilde{O}}_{t} + μ^{2} {\tilde{T}}_{t}^{T} {\tilde{T}}_{t}) λ_{max} (M) \end{matrix}

(32)

In this part, we start by assuming that the infinity norm of the optimal policy

π^{*}

is bounded by

\frac{R_{max}}{1 - γ}

, where

R_{max}

is the maximum reward, and

γ

is the discount factor. This bound is important for controlling the range of policy values. The second assumption involves the squared error between the current policy

π_{t}

and the optimal policy

π^{*}

. The expectation on the left-hand side represents the squared difference of the policy, while the right-hand side expresses this difference in terms of the infinity norm squared. This inequality helps in bounding the policy error. Next, we define

ε

, which controls the convergence rate. We use

ρ + ε

as a contraction factor that governs the convergence speed of the algorithm. The term

ε = \frac{1 - ρ}{2}

ensures that

ρ + ε

lies between 0 and 1, helping us control the system’s error reduction over time. The inequality in the main equation expresses the change in the error

Δ_{t}

in terms of the Lyapunov function

V (Δ_{t})

, the perturbation terms

{\tilde{O}}_{t}

and

{\tilde{T}}_{t}

, and the optimal policy. The expectation is taken over these terms, and the error bound is influenced by the maximum eigenvalue of the matrix M, which ensures that the error decreases over time. The right-hand side of the equation includes several terms that account for the following: The contribution from the policy error between

π_{t}

and

π^{*}

. The contribution from the perturbations

{\tilde{O}}_{t}

and

{\tilde{T}}_{t}

, which are factors that introduce noise or uncertainty into the system. The convergence rate controlled by the parameter

ρ

, which ensures that the system converges to the optimal policy.

Then, we use

λ_{min} (M) {∥x∥}_{2}^{2} \leq V (x) \leq λ_{max} (M) {∥x∥}_{2}^{2}

\begin{matrix} \sum_{t = 0}^{N - 1} \frac{1}{N} E [{∥Q_{t} - Q^{*}∥}^{2}] & \leq \frac{λ_{max} (M)}{N} E [{∥Q_{0} - Q^{*}∥}^{2}] + γ^{2} η^{2} λ_{max} (M) ({(\frac{R_{max}}{1 - γ})}^{2} \\ + \frac{1}{N} E [{∥π_{t} - π^{*}∥}^{2}]) \\ + η^{2} λ_{max} (M) (β^{2} {\tilde{O}}_{t}^{T} {\tilde{O}}_{t} + μ^{2} {\tilde{T}}_{t}^{T} {\tilde{T}}_{t}) \end{matrix}

(33)

In this part, we start by using the relationship between the Lyapunov function

V (x) = x^{T} M x

and the squared norm of x. Specifically, we know that the Lyapunov function

V (x)

is bounded between the minimum and maximum eigenvalues of the matrix M, which are

λ_{min} (M)

and

λ_{max} (M)

, respectively. This relationship helps us control the magnitude of the Lyapunov function in terms of the vector norm. The inequality

λ_{min} {(M) ∥ x ∥}_{2}^{2} \leq V (x) \leq λ_{max} (M) {∥ x ∥}_{2}^{2}

is important for bounding the error. The next step is to bound the expected value of the squared error

∥ Q_{t} - Q^{*} ∥^{2}

, which is the difference between the Q-values and the optimal Q-values. We break down the bound into several terms: The first term

\frac{λ_{max} (M)}{N} E [{∥Q_{0} - Q^{*}∥}^{2}]

represents the contribution of the initial error at

t = 0

to the total error, scaled by the maximum eigenvalue of M. The second term involves the policy error, which is the squared difference between the learned policy

π_{t}

and the optimal policy

π^{*}

. The expectation

E [{∥π_{t} - π^{*}∥}^{2}]

is averaged over all iterations and gives a measure of the convergence of the policy. The third term accounts for the influence of perturbations

{\tilde{O}}_{t}

and

{\tilde{T}}_{t}

, which represent noise or uncertainty in the system. These perturbations contribute to the overall error, and their impact is scaled by the maximum eigenvalue of M. The final term includes the reward terms and their impact on the total error. The term

{(\frac{R_{max}}{1 - γ})}^{2}

captures the effect of the maximum reward

R_{max}

and the discount factor

γ

, while

λ_{max} (M)

ensures that the error remains bounded over time. This inequality provides a clear upper bound for the error in terms of the initial error, the policy error, the perturbations, and the system’s parameters.

Taking the square root on both sides and applying the subadditivity of square roots, we can obtain

\sqrt{\sum_{t = 0}^{N - 1} \frac{1}{N} E [{∥Q_{t} - Q^{*}∥}_{2}^{2}]} \geq \sum_{t = 0}^{N - 1} \frac{1}{N} E [{∥Q_{t} - Q^{*}∥}_{2}] \geq \sum_{t = 0}^{N - 1} \frac{1}{N} E [{∥Q_{t} - Q^{*}∥}_{\infty}]

and

{∥Q_{0} - Q^{*}∥}_{2}^{2} \leq |S| |A| {∥Q_{0} - Q^{*}∥}_{\infty}^{2}

We can further obtain

\begin{matrix} \sum_{t = 0}^{N - 1} \frac{1}{N} E [{∥Q_{t} - Q^{*}∥}_{\infty}] \leq \sqrt{\begin{matrix} \frac{λ_{max} (M)}{N} E [{∥Q_{0} - Q^{*}∥}^{2}] + γ^{2} η^{2} λ_{max} (M) ({(\frac{R_{max}}{1 - γ})}^{2} + \sum_{t = 0}^{N - 1} \frac{1}{N} E [{∥π_{t} - π^{*}∥}^{2}]) \\ + η^{2} λ_{max} (M) \sum_{t = 0}^{N - 1} \frac{1}{N} (β^{2} {\tilde{O}}_{t}^{T} {\tilde{O}}_{t} + μ^{2} {\tilde{T}}_{t}^{T} {\tilde{T}}_{t}) \end{matrix}} \\ \sum_{t = 0}^{N - 1} \frac{1}{N} E [{∥Q_{t} - Q^{*}∥}_{\infty}] \leq \sqrt{λ_{max} (M) (\begin{matrix} \frac{1}{N} E [{∥Q_{0} - Q^{*}∥}^{2}] + γ^{2} η^{2} ({(\frac{R_{max}}{1 - γ})}^{2} + \sum_{t = 0}^{N - 1} \frac{1}{N} E [{∥π_{t} - π^{*}∥}^{2}]) \\ + η^{2} \sum_{t = 0}^{N - 1} \frac{1}{N} (β^{2} {\tilde{O}}_{t}^{T} {\tilde{O}}_{t} + μ^{2} {\tilde{T}}_{t}^{T} {\tilde{T}}_{t}) \end{matrix})} \\ \sum_{t = 0}^{N - 1} \frac{1}{N} E [{∥Q_{t} - Q^{*}∥}_{\infty}] \leq \sqrt{\frac{(1 - ρ)}{(1 + ρ)} (\begin{matrix} \frac{1}{N} E [{∥Q_{0} - Q^{*}∥}^{2}] + γ^{2} η^{2} ({(\frac{R_{max}}{1 - γ})}^{2} + \sum_{t = 0}^{N - 1} \frac{1}{N} E [{∥π_{t} - π^{*}∥}^{2}]) \\ + η^{2} \sum_{t = 0}^{N - 1} \frac{1}{N} (β^{2} {\tilde{O}}_{t}^{T} {\tilde{O}}_{t} + μ^{2} {\tilde{T}}_{t}^{T} {\tilde{T}}_{t}) \end{matrix})} \end{matrix}

(34)

In this section, we apply the relationship between the Lyapunov function

V (x) = x^{T} M x

and the squared norm of x, and we introduce the bounds for

V (x)

in terms of the minimum and maximum eigenvalues of the matrix M. The inequality

λ_{min} {(M) ∥ x ∥}_{2}^{2} \leq V (x) \leq λ_{max} (M) {∥ x ∥}_{2}^{2}

helps control the growth of the Lyapunov function based on the norms of x. We then consider the expected value of the squared error

∥ Q_{t} - Q^{*} ∥_{2}^{2}

, which is the difference between the current Q-values and the optimal Q-values. The expectation is averaged over all iterations. The first term on the right-hand side represents the contribution from the initial error

E [∥ Q_{0} - Q^{*} ∥_{2}^{2}]

, scaled by the maximum eigenvalue of M, which controls how the error propagates over time. The second term involves the policy error

E [∥ π_{t} - π^{*} ∥^{2}]

, which is the squared difference between the learned policy

π_{t}

and the optimal policy

π^{*}

. The policy error is crucial for understanding how well the system converges to the optimal policy. The third term accounts for perturbations

{\tilde{O}}_{t}

and

{\tilde{T}}_{t}

, which introduce noise or uncertainty into the system. These perturbations contribute to the overall error, and their impact is scaled by the maximum eigenvalue of M. Finally, we apply the convergence factor

ρ

, which ensures that the error decreases over time. The inequality bounds the total error by accounting for the initial error, policy error, perturbations, and the convergence parameters.

Finally, we prove the relationship between the Q-value vector at the iteration limit and the initial value vector when the system is stable. □

The purpose of this paper is to demonstrate the convergence of the Q-learning algorithm using a Lyapunov function and error bound analysis. By defining an appropriate Lyapunov function

V (x) = x^{T} M x

, we can control the system’s state and error, ensuring that the error decreases over time as the iterations progress, and the system ultimately converges to the optimal solution. Throughout the analysis, we define error bounds using a series of inequalities and employ the properties of the Lyapunov function. The error is decomposed into initial error, policy error, and perturbation error, each of which is scaled by the maximum eigenvalue

λ_{max} (M)

of matrix M. Each error term is related to other system parameters such as the maximum reward

R_{max}

, the discount factor

γ

, and perturbation terms

{\tilde{O}}_{t}

and

{\tilde{T}}_{t}

. Further derivation leads to an upper bound on the change in error, which takes into account all possible sources of error, including policy deviation and system perturbations. Finally, by introducing a convergence factor

ρ

, we ensure that the error gradually decreases over time, guaranteeing the system’s convergence. This analysis not only demonstrates the stability of the Q-learning algorithm in multi-step learning but also provides a theoretical foundation for error control in practical applications. With this approach, we can effectively learn the optimal policy in complex environments and ensure that the system converges within a finite amount of time.

6. Simulation Platform

This paper constructs a simulation platform suitable for the interception strategy, where the system model applies the proposed algorithm to an autonomously developed platform. By distributing the pursuers and evaders onto different physical hardware, the platform achieves interoperability between pursuers, while data communication between the pursuers and the evader can only occur through the server. This design aims to simulate a real interception scenario as closely as possible and validate the feasibility of the proposed algorithm through this platform. A missile interception model is used in the simulation.

The simulation platform consists of three parts: the embedded hardware section, the backend built using a server, and the frontend presented by the browser. The overall framework design is shown in Figure 8.

In the embedded hardware, we used the Nvidia-developed Jetson Nano development board as the pursuer in the differential game, and the Raspberry Pi as the evader in the differential game. These two embedded development boards are connected to a switch via Ethernet ports, and the switch is connected to the local server. The backend server uses the Java programming language with the Spring Boot framework, which internally supports the Tomcat server. The asynchronous communication framework is Netty. The data from the underlying hardware is sent to the backend server via UDP communication. The embedded hardware in the second part makes the already small memory even more limited. In terms of database management, we use Redis as a caching database, which caches and updates data via pipeline stream technology. Additionally, we use PostgreSQL as the relational storage database to store the data of both parties in the game. This pipeline stream technology and asynchronous framework help prevent data packetization issues.

For the frontend, we use React, with Netflix DGS providing methods for querying and modifying data. The DGS framework’s main feature is its annotation-based Spring Boot programming model, which has lower server overhead compared to traditional Restful APIs. GraphQL is mainly used for temporary queries and operations. Queries are used to fetch data and return the found data, while mutations are used to modify data. The entire technology stack is based on a B/S structure, with communication conducted through HTTP protocols. The application includes methods for processing data and mapping the data of the pursuers and evaders. Three-dimensional modeling uses Three.js and Cesium.js.

Firstly, the pursuers and the evader upload their positions and attitude angles to the server through UDP communication. Once the server receives the data, it calls the asynchronous program ‘Channelread0’ to write the data into the Redis cache database and uses pipeline stream technology to update the data in real-time to the PostgreSQL database. Netty sends messages between them. After receiving each other’s information, the pursuers and the evader calculate the optimal strategy through differential game theory, give guidance commands, and calculate the next moment’s position and attitude through the motion model, then update the data again. Since UDP communication can sometimes lose data, no multithreading solution is used to ensure real-time data reception.

On the frontend, the page rendering is completed when the application is launched. When the server writes data for the PostgreSQL database and updates it, a long connection between the frontend and backend is triggered, and the updated data are sent through Mutation. After receiving the updated information, the frontend uses a data mapping scheme to write the updated data into the state. At the same time, the position and attitude of the pursuers and evaders are updated in real-time. Both the Jetson Nano and Raspberry Pi development boards support the Python 3.9 compiler. The paper attempts to use hardware development boards with identical configurations.

The Q-learning algorithm is trained offline using a simulated environment where the pursuers learn to intercept the evaders based on various escape patterns. The offline training phase allows for computationally intensive processes, such as model optimization and fine-tuning, to be performed without the constraints of real-time processing. After training, the model is deployed on an embedded platform, such as Jetson Nano or Raspberry Pi, for online use, where it guides the pursuers in real-time interception. By separating the training and execution phases, our approach reduces the computational load during real-time operations. This allows the system to run on less powerful embedded hardware, such as the Jetson Nano or Raspberry Pi, while still benefiting from the rich training data and complex computations performed offline.

This paper validates the feasibility of the proposed algorithm through the simulation platform. To simulate this process more accurately, the observation coordinate system is projected onto the geocentric coordinate system for analysis. Hardware parameters are shown in Table 2. The data transmission protocol is shown in Figure 9.

Figure 10 shows the 3D visualization effect of the system, clearly presenting the interaction and interception process between the pursuer and the evader. This page greatly enhances the user experience through vivid dynamic effects and interactive feedback, making the system’s state and behavior more intuitive and understandable.

Figure 11 presents the structure and functionality of the backend server, which is responsible for processing core data calculations and interactive operations, ensuring the stable operation of the system and supporting real-time communication with the frontend and the database.

Figure 12 and Figure 13 show real-time data from the database, reflecting the dynamic changes of the evader and the pursuer. These figures accurately reflect key data and real-time states during the entire pursuit process, providing detailed monitoring information.

7. Experimental Results

Next, this paper presents three different examples to demonstrate the interception process using the proposed Q-learning-cover algorithm.

Example 1: The initial position of pursuer 1 is [0, 14,005, 4546], the initial position of pursuer 2 is [0, 15,892, 5025], and the initial position of pursuer 3 is [0, 14,543, 5072]. The pursuers’ speed is 4000 m/s. The initial position of the target is [100,000, 14,000, 0], and the target tries to evade the pursuers with a speed of 3000 m/s. The y- and z-coordinates of the pursuers’ initial positions are random. The initial distance between the pursuers and the evader is approximately 100,000 m, and the initial angle is random. The motion trajectories are shown in Figure 14.

Figure 14 shows the motion trajectories of the target and the pursuers in three-dimensional space. The red line represents the trajectory of the evader, while the blue, green, and cyan dashed lines represent the trajectories of the three pursuers. The figure demonstrates the relative position between the pursuers and the evader, as well as the distance variation between the target and the pursuers. Using the Q-learning-cover algorithm, the pursuers gradually reduce the distance to the target and ultimately capture it.

The X-Y axis projection is shown in Figure 15. This figure presents the projection of the target and the three pursuers on the X-Y plane. The trajectories of the evader and the pursuers are distinct in the X-Y plane due to their different initial positions. The evader’s trajectory shows a rapid offset, while the pursuers’ trajectories gradually converge towards the target based on their initial positions and motion directions.

The X-Z axis projection is shown in Figure 16. This figure presents the projection of the target and the three pursuers on the X-Z plane. The evader’s trajectory shows a rapid displacement along the X-axis, while the pursuers’ trajectories gradually approach the target, demonstrating the motion characteristics of different pursuers and the effectiveness of the algorithm.

The distance variation is shown in Figure 17. This figure illustrates the distance changes between the target and the three pursuers in three-dimensional space. Over time, the target is gradually approached by the pursuers. Ultimately, the miss distances are

1.2 m

,

0.9 m

, and

1.5 m

, with the target being successfully captured by pursuer 2.

The Y-axis normal acceleration variation is shown in Figure 18. This figure presents the variation in normal accelerations along the Y-axis for the evader and the three pursuers. Over time, the accelerations of the pursuers fluctuate significantly, especially during the capture process. The evader’s acceleration remains relatively stable, close to zero. The curves in the figure show the acceleration changes required by each pursuer based on their current motion trajectory.

The Z-axis normal acceleration variation is shown in Figure 19. This figure shows the normal acceleration along the Z-axis for both the target and the pursuers. Similarly to the Y-axis acceleration, the pursuers’ accelerations exhibit significant fluctuations during the capture process, while the evader’s acceleration remains relatively small and close to zero. The figure shows that as the distance between the pursuers and the target changes, the fluctuations in acceleration become more pronounced.

Example 2: The maximum normal acceleration of the pursuers and the evader is set to 40 g, while other conditions remain the same as in Example 1. The initial position of pursuer 1 is [0, 14,399, 5250], the initial position of pursuer 2 is [0, 15,481, 4543], and the initial position of pursuer 3 is [0, 14,324, 5211]. The target’s initial position is [100,000, 14,000, 0], and the maximum normal acceleration is set to 40 g, which means that the pursuers’ acceleration is restricted to a higher value, thus affecting their efficiency and strategy in capturing the target.

The motion trajectory is shown in Figure 20. The figure displays the motion trajectories of the evader and the three pursuers in three-dimensional space. The trajectories of the pursuers and the evader differ from those in Example 1, especially after the change in the acceleration limit. The pursuers’ trajectories show more significant changes. Through the Q-learning-cover algorithm, the pursuers gradually approach the target and ultimately capture it.

The X-Y plane projection is shown in Figure 21. This figure presents the projection of the target and the three pursuers on the X-Y plane. Compared to Example 1, the motion trajectories of the pursuers and the evader exhibit different changes in this plane. Particularly, with the normal acceleration limit set to 40 g, the pursuers’ movement becomes more intense, and the trajectory changes are more pronounced. The evader maintains a relatively stable path in the X-Y plane, while the pursuers’ trajectories show their continuous approach toward the target. By comparing the trajectories of the pursuers, the dynamic changes during the pursuit process are clearly observed.

The X-Z plane projection is shown in Figure 22. This figure presents the projection of the target and the three pursuers on the X-Z plane. With the normal acceleration limit set to 40 g, the pursuers’ trajectories show more complex dynamic changes on this plane, especially during the capture process. The pursuers’ paths exhibit larger fluctuations, while the evader’s trajectory remains more stable, despite the pursuers closing in. Comparing the projections of the pursuers in the X-Z plane helps in understanding the impact of acceleration on the pursuers’ movement paths.

The distance variation is shown in Figure 23. This figure displays the distance variations between the evader and the three pursuers. As the pursuers gradually approach the target, the final miss distances are

6.3 m

,

0.7 m

, and

2.2 m

. The curves in the figure show the dynamic changes during the pursuit. In this example, due to the normal acceleration being set to 40 g, the pursuers’ acceleration is significantly increased, leading to more dramatic trajectory changes during the capture process. By comparing the distance variation curves of the pursuers, one can clearly see the acceleration and deceleration processes at different stages of the pursuit.

The Y-axis acceleration variation is shown in Figure 24. This figure presents the variation in acceleration along the Y-axis for the evader and the three pursuers. In this example, the acceleration of the pursuers fluctuates significantly, especially as they approach the target. As their acceleration increases, the pursuers’ trajectories gradually close in on the target. In contrast, the evader’s acceleration remains relatively stable, staying at a low level. By comparing the acceleration variations of the pursuers and the evader, the effect of acceleration on the pursuit strategy is clearly observable.

Z-axis acceleration variation is shown in Figure 25. This figure presents the variation in acceleration along the Z-axis for the target and the pursuers. Similar to the Y-axis acceleration, the pursuers’ acceleration fluctuates more significantly as they approach the target, showing obvious variations. The evader’s acceleration remains relatively stable, with consistent values. By comparing the acceleration differences between the pursuers and the evader, the impact of acceleration on the capture process is intuitively observed.

Experiment 3: The pursuers use the optimal strategy, and the evader uses a composite sine maneuver. Other conditions are the same as in Example 1. The initial position of pursuer 1 is [0, 14,399, 5250], the initial position of pursuer 2 is [0, 15,481, 4543], and the initial position of pursuer 3 is [0, 14,324, 5211]. The initial position of the target is [100,000, 14,000, 0]. Unlike the previous two examples, the evader in this example adopts a composite sine maneuver strategy, making its trajectory more complex and unpredictable, thereby increasing the difficulty for the pursuers to capture the target. The 3D trajectory is shown in Figure 26.

The initial distance between the pursuers and the evader is approximately 100,000 m. The initial angle is random. The figure displays the motion trajectories of the evader and the three pursuers in three-dimensional space. Due to the composite sine maneuver of the evader, the trajectory shows more complex and variable characteristics. With the optimal strategy, the pursuers gradually close the distance to the target and eventually capture it.

X-Y plane projection is shown in Figure 27. This figure presents the motion trajectories of the evader and the three pursuers in the X-Y plane. Compared to Example 1, the motion trajectories of the pursuers and the evader exhibit different changes in this plane. Particularly, with the normal acceleration limit set to 40 g, the pursuers’ movement becomes more intense, and the trajectory changes are more pronounced. The evader maintains a relatively stable path in the X-Y plane, while the pursuers’ trajectories show their continuous approach toward the target. By comparing the trajectories of the pursuers, the dynamic changes during the pursuit process are clearly observed.

The X-Z plane projection is shown in Figure 28. This figure presents the motion trajectories of the evader and the three pursuers in the X-Z plane. With the normal acceleration limit set to 40 g, the pursuers’ trajectories show more complex dynamic changes on this plane, especially during the capture process. The pursuers’ paths exhibit larger fluctuations, while the evader’s trajectory remains more stable, despite the pursuers closing in. Comparing the projections of the pursuers in the X-Z plane helps in understanding the impact of acceleration on the pursuers’ movement paths.

The distance variation is shown in Figure 29. This figure displays the distance variations between the evader and the three pursuers over time. As the pursuers gradually approach the target, the final miss distances are 3.6 m (pursuer 1), 0.92 m (pursuer 2), and 2.3 m (pursuer 3). The curves in the figure display the dynamic changes during the pursuit. In this example, with the normal acceleration being set to 40 g, the pursuers’ acceleration is significantly increased, leading to more dramatic trajectory changes during the capture process. By comparing the distance variation curves of the pursuers, one can clearly see the acceleration and deceleration processes at different stages of the pursuit.

The Y-axis acceleration variation is shown in Figure 30. This figure presents the variation in acceleration along the Y-axis for the evader and the three pursuers during the pursuit. The evader’s Y-axis acceleration curve (red solid line) shows an overall increasing trend, which then stabilizes, reflecting the sustained control signals applied during its evasion or turning maneuvers. The pursuers’ acceleration curves exhibit significant jumps, especially for pursuer 3 (green dashed line), where acceleration spikes are observed during multiple periods, likely due to its rapid path correction behavior. Pursuers 1 and 2 show relatively smoother acceleration curves, indicating stable control strategies, contributing to higher precision tracking.

The Z-axis acceleration variation is shown in Figure 31. The figure shows the acceleration variations along the Z-axis for both the target and the pursuers. Similar to the Y-axis acceleration, the pursuers’ acceleration fluctuates more significantly as they approach the target, with obvious spikes in their acceleration. The evader’s acceleration remains relatively steady, with consistent values. Comparing the acceleration differences between the pursuers and the evader provides a more intuitive understanding of how acceleration differences impact the capture process.

Comparison of three algorithms: This paper compares the Q-learning-cover algorithm with the acceleration coverage algorithm in terms of miss distance. The experiment was repeated 100 times, and the average miss distance for the three pursuers was calculated, as shown in Figure 32. The experimental results show that the Q-learning-cover algorithm’s average miss distance is 2.4 m, the acceleration coverage algorithm’s average miss distance is 5.8 m, and the Q-learning algorithm’s miss distance is 10.4 m. This demonstrates that the Q-learning-cover algorithm has a clear advantage in improving interception accuracy.

In Figure 33, we present an analysis of the experimental results. The left side of the figure shows the process of three pursuers successfully intercepting one evader, while the right side presents the real-time feedback of the browser monitoring data. Unlike the experimental setup in Section 2, this experiment was conducted in the Earth coordinate system and used advanced Cesium technology for visualization, achieving more precise and dynamic monitoring displays.

Tabular comparison of existing approaches: As per the recommendation, we have included a tabular comparison (Table 3) between existing methods (PN, ZCMD, and DDPG) and the Q-learning-cover algorithm. This comparison presents the results based on accuracy, convergence time, and computational complexity, providing a clearer perspective on how our method compares to others in the field. Table 3 presents a comparison of the Q-learning-cover algorithm with other existing methods, namely PN, ZCMD, and DDPG, based on three key metrics: interception success rate, convergence time, and memory requirement.

Interception success rate: The Q-learning-cover algorithm achieved the highest interception success rate at 97%, significantly outperforming the other methods, with PN at 87%, ZCMD at 92%, and DDPG at 89%. This indicates that the Q-learning-cover algorithm excels in accuracy, making it a highly effective approach for strategic interception.
Convergence time: The Q-learning-cover algorithm also demonstrated the shortest convergence time (0.002 s), suggesting that it is computationally efficient compared to the other methods. In contrast, DDPG took the longest (0.021 s), followed by ZCMD (0.004 s) and PN (0.003 s).
Memory requirement: Regarding memory usage, the Q-learning-cover algorithm required 54 MB, which is the least among all methods. This efficiency is critical for real-time applications, especially when hardware resources are constrained. DDPG required 96 MB, ZCMD 120 MB, and PN 82 MB.

Overall, these results show that the Q-learning-cover algorithm offers a promising balance between high interception accuracy, fast convergence, and low memory usage, making it an attractive option for real-world applications in pursuit–evasion scenarios.

The parameter

α

plays a crucial role in adjusting the proportional coefficients of the algorithm. As

α

increases, the accuracy of the interception improves, but this comes at the cost of increased computation time. We observe that larger values of

α

result in slower convergence times, suggesting a trade-off between accuracy and efficiency.

The parameter

γ

, which is related to the discount factor in the reinforcement learning setup, influences the weight given to future rewards. Increasing

γ

leads to better long-term planning but may cause slower adaptation in dynamic environments. In our experiments, we found that varying

γ

significantly affects the algorithm’s convergence speed.

Finally, the parameter

λ

is crucial for scaling the velocity difference between the pursuer and evader. As

λ

increases, the algorithm’s memory usage decreases, but at the cost of reduced accuracy in interception. The relationship between

λ

and efficiency highlights a balance between resource consumption and interception precision.

In addition to the performance metrics presented in Table 3, we further analyze the algorithm’s efficiency by evaluating the computation time and the number of iterations required for convergence.

Computation time: The computation time is measured for each iteration of the algorithm, and we observe that the Q-learning-cover algorithm outperforms other methods in terms of speed, as shown in Table 4.

Number of iterations: The number of iterations required for the algorithm to reach convergence is also reported. We found that the Q-learning-cover algorithm requires fewer iterations compared to the other methods, which contributes to its faster convergence time.

The results show that the Q-learning-cover algorithm is both computationally efficient and fast in converging to the optimal solution with fewer iterations compared to other algorithms.

8. Conclusions

This paper presents a Q-learning-cover-based three-on-one pursuit interception algorithm that integrates Ahlswede sphere geometry and reinforcement learning. By projecting the 3D trajectories of pursuers and the evader onto the complex plane, the method significantly reduces computational complexity while maintaining high efficiency in planning the pursuers’ trajectories.

The algorithm’s boundedness is proven using a Lyapunov function, ensuring fast convergence and stability. The simulation results demonstrate that the Q-learning-cover algorithm improves pursuit efficiency and interception accuracy. The method shows promise for multi-agent systems such as drone swarms and robotic clusters, where it enhances the success rate and accuracy of pursuit tasks.

Future research will further investigate the extension of the Q-learning-cover algorithm to more complex pursuit scenarios, such as incorporating dynamic obstacles and multi-evader situations. For instance, the algorithm could be adapted to handle real-time avoidance and interception strategies for both moving obstacles and multiple evaders, which would be essential for applications like autonomous vehicles and multi-robot coordination. The development of more advanced models, such as model predictive control (MPC)-based methods and deep reinforcement learning algorithms, could also enhance the algorithm’s ability to manage non-linear dynamics and long-term planning.

Author Contributions

Conceptualization, Y.B. and D.Z.; methodology, Y.B.; software, Y.B.; validation, Y.B., D.Z. and Z.H.; formal analysis, Y.B.; investigation, D.Z.; resources, Z.H.; data curation, Y.B.; writing—original draft preparation, Y.B.; writing—review and editing, D.Z. and Z.H.; visualization, Y.B.; supervision, D.Z.; project administration, D.Z.; funding acquisition, Z.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 61773142.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Zhang, B.; Zhou, D.; Li, J.; Yao, Y. Coverage-based cooperative guidance strategy by controlling flight path angle. J. Guid. Control Dyn. 2022, 45, 972–981. [Google Scholar] [CrossRef]
Xiao, W.; Yu, J.; Dong, X.; Li, Q. Cooperative Interception of Highly Maneuverable Targets under Overload Constraints. Acta Aeronaut. Astronaut. Sin. 2020, 41, 184–194. [Google Scholar]
Chen, Z.; Yu, J.; Dong, X.; Ren, Z. Three-dimensional cooperative guidance strategy and guidance law for intercepting highly maneuvering target. Chin. J. Aeronaut. 2021, 34, 485–495. [Google Scholar] [CrossRef]
Su, W.; Li, K.; Chen, L. Coverage-Based Cooperative Guidance Strategy Against Highly Maneuvering Target. Aerosp. Sci. Technol. 2017, 71, 147–155. [Google Scholar] [CrossRef]
Su, W.S.; Shin, H.-S.; Chen, L.; Tsourdos, A. Cooperative Interception Strategy for Multiple Inferior Missiles Against One Highly Maneuvering Target. Aerosp. Sci. Technol. 2018, 80, 91–100. [Google Scholar] [CrossRef]
Su, W.; Li, K.; Chen, L. Coverage-Based Three-Dimensional Cooperative Guidance Strategy Against Highly Maneuvering Target. Aerosp. Sci. Technol. 2019, 85, 556–566. [Google Scholar] [CrossRef]
Moll, A.V.; Garcia, E.; Casbeer, D.; Suresh, M.; Swar, S.C. Multiple-Pursuer, Single-Evader Border Defense Differential Game. J. Aerosp. Inf. Syst. 2020, 17, 407–416. [Google Scholar]
Pachter, M.; Moll, A.V.; Garcia, E.; Casbeer, D.; Milutinović, D. Cooperative Pursuit by Multiple Pursuers of a Single Evader. J. Aerosp. Inf. Syst. 2020, 17, 371–389. [Google Scholar] [CrossRef]
Salmon, J.L.; Willey, L.C.; Casbeer, D.; Garcia, E.; Moll, A.V. Single Pursuer and Two Cooperative Evaders in the Border Defense Differential Game. J. Aerosp. Inf. Syst. 2020, 17, 229–239. [Google Scholar] [CrossRef]
Li, Q.; Li, F.; Dong, R.; Fan, R.; Xie, W. Navigation Ratio Design of Proportional Guidance Law Using Reinforcement Learning. J. Ordnance Eng. 2022, 43, 3040–3047. [Google Scholar]
Richard, S.S.; Andrew, G.B. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 1998. [Google Scholar]
Christopher, J.C.H.W.; Peter, D. Q-learning. Mach. Learn. 1992, 8, 279–292. [Google Scholar]
Qu, G.; Wierman, G. Finite-time analysis of asynchronous stochastic approximation and Q-learning. In Proceedings of the 33rd Conference on Learning Theory, COLT2020, Graz, Austria, 9–12 July 2020. [Google Scholar]
Mohammad, G.A.; Remi, M.; Mohammad, G.; Kappen, H.J. Speedy Q-learning. In Proceedings of the 24th International Conference on Neural Information Processing Systems, Granada, Spain, 12–14 December 2011; pp. 2411–2419. [Google Scholar]
Vive, S.B.; Sean, P.M. The ode method for convergence of stochastic approximation and reinforcement learning. SIAM J. Control Optim. 2000, 38, 447–469. [Google Scholar]
Diddigi, R.B.; Kamanchi, C.; Bhatnagar, S. A Generalized Minimax Q-Learning Algorithm for Two-Player Zero-Sum Stochastic Games. IEEE Trans. Autom. Control 2022, 67, 4816–4823. [Google Scholar] [CrossRef]
Lee, D.; Hu, J.H.; He, N. A Discrete-Time Switching System Analysis of Q-Learning. SIAM J. Control Optim. 2023, 61, 1861–1880. [Google Scholar] [CrossRef]
Lee, D.; He, N. A Unified Switching System Perspective and Convergence Analysis of Q-Learning Algorithms. In Proceedings of the Neural Information Processing Systems, Vancouver, BC, Canada, 6–12 December 2020. [Google Scholar]
Lim, H.D.; Kim, D.W.; Lee, D. Regularized Q-learning. arXiv 2022, arXiv:2202.05404. [Google Scholar] [CrossRef]
Li, G.; Wei, Y.; Chi, Y.; Gu, Y.; Chen, Y. Sample complexity of asynchronous Q-learning: Sharper analysis and variance reduction. IEEE Trans. Inf. Theory 2022, 68, 448–473. [Google Scholar] [CrossRef]
Kumar, S.R.; Mukherjee, D. Terminal time-constrained nonlinear interception strategies against maneuvering targets. J. Guid. Control Dyn. 2021, 44, 200–209. [Google Scholar] [CrossRef]
Tahk, M.J.; Shim, S.W.; Hong, S.M.; Choi, H.L.; Lee, C.H. Impact Time Control Based on Time-to-Go Prediction for Sea-Skimming Anti-ship Missiles. IEEE Trans. Aerosp. Electron. Syst. 2018, 54, 2043–2052. [Google Scholar] [CrossRef]
Zhang, B.; Zhou, D.; Shao, C. Closed-Form Time-to-Go Estimation for Proportional Navigation Guidance Considering Drag. IEEE Trans. Aerosp. Electron. Syst. 2022, 58, 4705–4717. [Google Scholar] [CrossRef]
Rusnak, I. Optimal guidance laws with uncertain time-of-flight. IEEE Trans. Aerosp. Electron. Syst. 2000, 36, 721–725. [Google Scholar] [CrossRef]
Tahk, M.J.; Ryoo, C.K.; Cho, H. Recursive time-to-go estimation for homing guidance missiles. IEEE Trans. Aerosp. Electron. Syst. 2002, 38, 13–24. [Google Scholar] [CrossRef]
Wang, C.Y.; Ding, X.J.; Wang, J.N.; Shan, J.Y. A robust three-dimensional cooperative guidance law against maneuvering target. J. Frankl. Inst. 2020, 357, 5735–5752. [Google Scholar] [CrossRef]
Zhao, J.; Zhou, R.; Dong, Z.N. Three-dimensional cooperative guidance laws against stationary and maneuvering targets. Chin. J. Aeronaut. 2015, 28, 1104–1120. [Google Scholar] [CrossRef]
Zadka, B.; Tripathy, T.; Tsalik, R.; Shima, T. Consensus-based cooperative geometrical rules for simultaneous target interception. J. Guid. Control Dyn. 2020, 43, 2425–2432. [Google Scholar] [CrossRef]
Zhao, S.Y.; Zhou, R.; Wei, C.; Ding, Q.X. Design of time-constrained guidance laws via virtual leader approach. Chin. J. Aeronaut. 2010, 23, 103–108. [Google Scholar]
Zhou, J.L.; Yang, J.Y. Guidance law design for impact time attack against moving targets. IEEE Trans. Aerosp. Electron. Syst. 2018, 54, 2580–2589. [Google Scholar] [CrossRef]
Kumar, S.R.; Ghose, D. Impact time guidance for large heading errors using sliding mode control. IEEE Trans. Aerosp. Electron. Syst. 2015, 51, 3123–3138. [Google Scholar] [CrossRef]
Zhang, S.; Guo, Y.; Liu, Z.G.; Wang, S.C.; Hu, X.X. Finite-time Cooperative guidance strategy for impact angle and time control. IEEE Trans. Aerosp. Electron. Syst. 2021, 57, 806–819. [Google Scholar] [CrossRef]
Chen, Y.D.; Wang, J.N.; Wang, C.Y.; Shan, J.Y.; Xin, M. A modified cooperative proportional navigation guidance law. J. Frankl. Inst. 2019, 356, 5692–5705. [Google Scholar] [CrossRef]
Jeon, I.S.; Lee, J.I. Homing guidance law for cooperative attack of multiple missiles. J. Guid. Control Dyn. 2010, 33, 275–280. [Google Scholar] [CrossRef]
Kumar, S.R.; Mukherjee, D. Cooperative salvo guidance using Finite-time consensus over directed cycles. IEEE Trans. Aerosp. Electron. Syst. 2020, 56, 1504–1514. [Google Scholar] [CrossRef]
He, S.M. Three-dimensional optimal impact time guidance for anti-ship missiles. J. Guid. Control Dyn. 2019, 42, 941–948. [Google Scholar] [CrossRef]
Sinha, A.; Kumar, S.R.; Mukherjee, D. Three-dimensional nonlinear cooperative salvo using event-triggered strategy. J. Guid. Control Dyn. 2021, 44, 328–342. [Google Scholar] [CrossRef]
Garcia, E.; Casbeer, D.W.; Pachter, M. Optimal Strategies for a Class of Multi-Player Reach-Avoid Differential Games in 3D Space. IEEE Robot. Autom. Lett. 2020, 5, 4257–4264. [Google Scholar] [CrossRef]
Garcia, E.; Fuchs, Z.E.; Milutinovic, D.; Casbeer, D.W.; Pachter, M. A Geometric Approach for the Cooperative Two-Pursuer One-Evader Differential Game. IFAC Pap. Online 2017, 50, 15209–15214. [Google Scholar] [CrossRef]
Pachter, M.; Von Moll, A.; García, E.; Casbeer, D.; Milutinovic, D. Two-on-one pursuit. J. Guid. Control Dyn. 2019, 42, 1–7. [Google Scholar] [CrossRef]
Venkatesan, R.H.; Sinha, N.K. A New Guidance Law for the Defense Missile of Nonmaneuverable Aircraft. IEEE Trans. Control Syst. Technol. 2015, 23, 2424–2431. [Google Scholar] [CrossRef]
Wang, K.; Shen, W.; Yang, Y.; Quan, X.; Wang, R. Deep Reinforcement Learning for Autonomous Guidance and Control. In Proceedings of the IEEE Conference on Robotics and Automation, Paris, France, 1 May–31 August 2020. [Google Scholar]
Atwood, J.; Towsley, D. Missile Guidance Using Artificial Neural Networks. J. Guid. Control Dyn. 2016. [Google Scholar]
Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; Bengio, Y. Graph Attention Networks. In Proceedings of the International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
Xiao, Y.; Qu, Y.; Qiu, L.; Zhou, H.; Li, L.; Zhang, W.; Yu, Y. Dynamically Fused Graph Network for Multi-hop Reasoning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), Florence, Italy, 28 July–2 August 2019. [Google Scholar]
Malitesta, D.; Pomo, C.; Di Noia, T. Leveraging Graph Neural Networks for User Profiling: Recent Advances and Open Challenges. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (CIKM), Birmingham, UK, 21–25 October 2023. [Google Scholar]
Zhang, X.; Li, Y.; Zhang, Z.; Wang, L. Adaptive Missile Guidance Using Neural Networks. J. Guid. Control Dyn. 2022. [Google Scholar]
Ren, X.; Lewis, F.L.; Zhang, J. Neural network compensation control for mechanical systems with disturbances. Automatica 2009, 45, 1221–1226. [Google Scholar] [CrossRef]
Song, S.H.; Ha, I.J. A Lyapunov-like approach to performance analysis of 3-dimensional pure PNG laws. IEEE Trans. Aerosp. Electron. Syst. 1994, 30, 238–248. [Google Scholar] [CrossRef]

Figure 1. Schematic diagram of regional coverage interception.

Figure 2. Model of movement in 3D space.

Figure 3. Stereographic projection.

Figure 4. Schematic diagram of area coverage with two intersection points.

Figure 5. Schematic diagram of area coverage with four intersection points.

Figure 6. Schematic diagram of synchronous interception of two-to-one missiles.

Figure 7. Player game model.

Figure 8. Technology stack design.

Figure 9. Data transmission diagram. “***” is a placeholder.

Figure 10. Frontend page.

Figure 11. Server.

Figure 12. Evader data.

Figure 13. Pursuer data.

Figure 14. 3D trajectory diagram.

Figure 15. X-Y axis projection.

Figure 16. X-Z axis projection.

Figure 17. Distance variation.

Figure 18. Y-axis normal acceleration.

Figure 19. Z-axis normal acceleration.

Figure 20. 3D trajectory diagram.

Figure 21. X-Y plane projection.

Figure 22. X-Z plane projection.

Figure 23. Distance variation.

Figure 24. Y-axis acceleration.

Figure 25. Z-axis acceleration.

Figure 26. 3D trajectory diagram.

Figure 27. X-Y plane projection.

Figure 28. X-Z plane projection.

Figure 29. Distance variation.

Figure 30. Y-axis acceleration.

Figure 31. Z-axis acceleration.

Figure 32. Comparison of three algorithms.

Figure 33. Final result.

Table 1. Model parameters.

Symbol	Description	Coordinate System
$(X_{I}, Y_{I}, Z_{I})$	Inertial reference coordinate system	Inertial system
$(X_{L}, Y_{L}, Z_{L})$	Line-of-sight (LOS) coordinate system	Line-of-sight system
$(X_{E}, Y_{E}, Z_{E})$	Velocity coordinate system of the i-th pursuer	Pursuer
$v_{E i}$	Velocity of the i-th evader	Evader
$v_{P i}$	Velocity of the i-th pursuer	Pursuer
$A_{P i}$	Acceleration of the i-th pursuer	Pursuer
$A_{E i}$	Acceleration of the i-th evader	Evader
$γ_{P i}$	Angle between the acceleration of the i-th pursuer and the $Y_{P i}$ axis	Pursuer
$γ_{E i}$	Angle between the acceleration of the i-th evader and the $Y_{E}$ axis	Evader
$R_{P i}$	Distance between the i-th pursuer and the evader	Inertial system
$θ_{L i}, φ_{L i}$	Line-of-sight angles between the evader and the i-th pursuer relative to the inertial coordinate system	Line-of-sight system
$θ_{P i}, φ_{P i}$	Elevation and inclination angles of $v_{P i}$ relative to the line-of-sight system, starting from pointing at E	Pursuer
$θ_{E i}, φ_{E i}$	Elevation and inclination angles of $v_{E i}$ relative to the line-of-sight system, starting from pointing from $E_{i}$ to $P_{i}$	Evader
$A_{z P i}, A_{y P i}$	Projections of the normal accelerations along the $Z_{P i}$ and $Y_{P i}$ axes in the velocity coordinate system for the pursuer	Pursuer
$A_{z E i}, A_{y E i}$	Projections of the normal accelerations along the $Z_{E i}$ and $Y_{E i}$ axes in the velocity coordinate system for the evader	Evader

Table 2. Hardware parameters.

Item	JetNano	Raspberry Pi
CPU	ARM CORTEX-A57	ARM CORTEX-A72
Storage	16 GB EMMC 5.1	16 GB EMMC 5.1
Memory	4 GB	4 GB
Network	Gigabit Ethernet	Gigabit Ethernet

Table 3. Average comparison results of 100 experiments.

Method	Interception Success Rate (%)	Convergence Time (s)	Memory Requirement (MB)
Q-learning-cover	97	0.002	54
PN	87	0.003	82
ZCMD	92	0.004	120
DDPG	89	0.021	96

Table 4. Computation time and number of iterations for 100 experiments.

Method	Computation Time (s)	Number of Iterations
Q-learning-cover	225	120
PN	346	150
ZCMD	443	180
DDPG	276	250

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Bai, Y.; Zhou, D.; He, Z. Pursuit-Interception Strategy in Differential Games Based on Q-Learning-Cover Algorithm. Aerospace 2025, 12, 428. https://doi.org/10.3390/aerospace12050428

AMA Style

Bai Y, Zhou D, He Z. Pursuit-Interception Strategy in Differential Games Based on Q-Learning-Cover Algorithm. Aerospace. 2025; 12(5):428. https://doi.org/10.3390/aerospace12050428

Chicago/Turabian Style

Bai, Yu, Di Zhou, and Zhen He. 2025. "Pursuit-Interception Strategy in Differential Games Based on Q-Learning-Cover Algorithm" Aerospace 12, no. 5: 428. https://doi.org/10.3390/aerospace12050428

APA Style

Bai, Y., Zhou, D., & He, Z. (2025). Pursuit-Interception Strategy in Differential Games Based on Q-Learning-Cover Algorithm. Aerospace, 12(5), 428. https://doi.org/10.3390/aerospace12050428

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Pursuit-Interception Strategy in Differential Games Based on Q-Learning-Cover Algorithm

Abstract

1. Introduction

2. Dynamics Model

3. Coverage Interception Based on Spherical Polar Projection Mapping

Spherical Polar Projection Mapping

4. Calculation of Interception Time

5. Q-Learning-Cover Algorithm

6. Simulation Platform

7. Experimental Results

8. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI