1. Introduction
With the innovation of technologies and hardware related to unmanned systems, unmanned aircraft systems (UASs) play an important role in a variety of task scenarios. Swarms are one example of UAS applications that have become a major focus of the UAS community. The advantages of a swarm over an individual UAS in terms of task efficiency are obvious [1], although other aspects must also be taken into account; for example, the logistical footprint and planning complexity of a swarm may be larger than those of an individual UAS. In the monitoring of natural disasters, collaborative swarm monitoring covers a larger area with better timeliness [2], and in geographic mapping, a swarm can complete terrain detection more quickly using photography [3]. With the rapid development of artificial intelligence in recent years, autonomous maneuvering decision making for UASs has become an active research field aimed at enabling UASs to accomplish trajectory planning more quickly under multiple optimization conditions.
Swarm path planning in areas with radar, electromagnetic interference, or other sources of interference is a challenging task because the swarm is at risk of detection, tracking, and failure, potentially preventing it from completing the task. Many countries are developing stealth UASs with minimized radar cross sections for radar-contested environments, together with methods for penetrating such environments [4]. This approach can decrease the chances of radar detection and improve UAS survivability [5]. For successful rapid traverse tasks in real-world scenarios, swarms must overcome various obstacles and uncertainties such as terrain obstacles, radar detection, and swarm failure. Therefore, efficient path planning strategies for UAS swarms in complex 3D environments have become an important research topic.
Traditional path planning methods are widely used for rapid traverse tasks and for path planning in complex environments, but they still have shortcomings. An improved A* algorithm is proposed in [6] that meets efficiency and accuracy requirements in complex three-dimensional environments, producing routes with higher safety and lower cost. To consider both terrain and UAS constraints, Ref. [7] introduces a particle swarm optimization algorithm that improves performance through adaptive speed tuning, chaotic initialization, and an improved logistic chaotic map, finding paths with higher safety and a smaller cost function. An improved complete particle swarm optimization-based 3D path planning algorithm is proposed in [8], which considers terrain threats, radar detection, and infiltration time. In [9], a framework for optimizing the trajectories of multiple UASs in dynamic rapid traverse task planning is presented, considering diverse obstacles and perception constraints. In the field of UAS swarm path planning, several studies have proposed effective methods to address specific challenges such as radar detection and terrain constraints. For example, Ref. [10] proposed genetic algorithms for threat avoidance path planning with good feasible solutions, while [11] proposed a method that can successfully plan paths in complex obstacle environments. However, many of these methods rely on optimizing a specific objective function, which can lead to suboptimal results and slow optimization times.
Reinforcement learning (RL) based methods for trajectory planning are gaining popularity [12,13,14]. Like traditional methods, RL can be used to solve optimal control problems. In RL, an agent achieves an optimal solution by continuously interacting with the environment and receiving rewards or penalties based on its behavior so as to maximize the accumulated reward. RL is effective at finding optimal control for complex problems because the agent improves itself over time through repeated interaction with the environment, maximizing the reward value and iteratively learning from experience; this robustness is a key factor in the success of RL-based methods. In [15], a deep reinforcement learning approach is proposed for radar route planning that handles the problem of sparse rewards and improves the performance of the learning agent. To improve convergence speed, a Relevant Experience Learning-DDPG approach is proposed in [16], which uses expert knowledge to find the experience most similar to the current state for learning. Considering the threat of radar, a situational assessment model is constructed in [17], and the path planning problem of UASs in a multi-threat environment is solved using a dual deep Q-network algorithm. A two-aircraft cooperative penetration strategy based on DRL is proposed in [18] for the continuous action space, where a proximal policy optimization algorithm is used to achieve a two-aircraft cooperative reconnaissance mission. Existing reinforcement learning methods can solve UAS penetration path planning in some complex environments, but most of them only plan paths in two-dimensional planes and do not consider the impact of terrain changes on the path. Moreover, studies that have considered terrain threats in 3D environments did not explore the threat and failure assessment of UASs due to other factors such as radar, but only performed avoidance training.
Despite numerous works on UAS rapid traverse path planning, several key issues remain unresolved. In many cases, radar is treated as an impassable obstacle, but because a path that completely avoids radar is often impossible to find, the planned path of the UAS must pass through radar coverage. Few works consider path planning for UAS swarm rapid traverse tasks, usually because a large number of UASs would require substantial computing resources and increase overall system complexity, and because the radar detection probability of a UAS swarm is difficult to estimate. Generally, a task is considered complete when a UAS reaches its target position; previous works have paid little attention to the normal operation rate of the swarm, focusing instead on the path cost at individual waypoints and on swarm collision avoidance. In [19], although radar detection and tracking are taken into account during path planning, the possibility of UAS failure while traveling is not addressed. Other swarm planning processes are only concerned with energy consumption and collision avoidance [20,21]. However, when the swarm rapidly traverses diverse threat areas, UASs in the swarm can fail, so using the normal operation rate (normal rate) as a new indicator can better optimize the paths. In the subsequent discussion, we correlate the probability of UAS failure due to electromagnetic interference or other sources of interference in the threat area with the more complex probability of radar detection.
In this paper, a framework for UAS swarm path planning based on reinforcement learning and formation control algorithms is proposed to address the aforementioned issues while ensuring the safety, efficiency, and robustness of the planned path. Our method outperforms existing approaches by planning high-quality paths in complex environments in a shorter time, and by increasing the success and normal rates of the swarm for completing the task. We summarize our contributions as follows:
- 1.
An effective equivalent model method for establishing the radar detection probability of a UAS swarm in a network radar system is proposed. By considering the number of UASs in a radar resolution cell and the radar cross section of each UAS, the detection probability of a UAS swarm can be approximated as that of a single UAS. This approximation allows for the simple and rapid calculation and evaluation of the swarm’s detection probability.
- 2.
A novel path planning method based on reinforcement learning is proposed, which balances the instantaneous rewards and terminal rewards. This method considers normal rate as a key indicator and takes into account the threat of failure to the swarm along the path, thereby forming an optimized path.
- 3.
A formation optimization strategy is presented that can reduce the probability of detection and mitigate the threat of failure. By dynamically adjusting the formation geometry of the swarm, we optimize the number of UASs in each radar resolution cell to ensure that the radar detection probability of the UAS swarm does not exceed a predetermined threshold.
- 4.
We present extensive simulations and experiments of the proposed method, and these results show that our method outperforms existing methods in terms of combined path quality, task success rate, and swarm normal rate.
The remainder of this paper is organized as follows. Section 2 introduces the problem formulation. In Section 3, we present the swarm rapid traverse algorithm based on reinforcement learning. Section 4 discusses the control of the movement of all UASs in the swarm, based on the results obtained from the reinforcement learning component. The simulation results of the algorithm are analyzed in Section 5. Finally, we summarize the paper in Section 6.
3. Rapid Traverse Task Planning Based on Deep Reinforcement Learning
In UAS swarm path planning, the normal rate is an important indicator, but it is difficult to accurately evaluate the final normal rate during the planning process. In addition, the objective function we propose has complex high-order nonlinearity, making it difficult to obtain the optimal solution. To better account for threat and normal rate in path planning, we adopt a deep reinforcement learning approach. Specifically, we incorporate the threat into the instantaneous rewards and the normal rate into the terminal rewards, to ensure that the swarm can complete the task with a high normal rate in the face of threats. With this approach, we can compute an optimal path that balances threat and normal rate and ensures the UAS swarm can efficiently execute the rapid traverse task in complex environments.
To be precise, we utilize proximal policy optimization (PPO). PPO is a benchmark, model-free, on-policy, policy gradient reinforcement learning algorithm designed by OpenAI [25]. It improves upon the classical policy gradient (PG) and actor-critic (AC) algorithms [26]. PPO not only exhibits excellent performance and lower complexity, but also possesses superior tunability, striking a good balance between ease of implementation, batch sampling efficiency, and ease of tuning. For the implementation, we first formulate our problem as a Markov decision process (MDP); the key assumption of an MDP is that the probability of transitioning from one state to another depends only on the current state and the action taken, not on any previous states or actions. The MDP is defined by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$, where $\mathcal{S}$ denotes the state space, $\mathcal{A}$ denotes the action space, $\mathcal{P}$ denotes the state transition probabilities, $\mathcal{R}$ denotes the reward function, and $\gamma \in [0, 1]$ denotes the discount factor; several of these components are discussed in the following subsections.
3.1. State Space
The state of the virtual leader l of the UAS swarm reflects the state of the swarm in the environment. We design the state space of the virtual leader l from three aspects: the first is the state of l itself, the second is the relative relationship between l and the target position, and the third is the relative distance between l and the radars. The variables contained in the state space are listed in Table 1: the distance between l and the target position, the angle between the line connecting them and the direction of the X-axis, and the distance between l and each radar i. Overall, the state comprises these three groups of variables.
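To make the state construction concrete, the following is a minimal sketch of how the virtual leader's observation could be assembled from these three groups of variables. The array layout, the planar angle computation, and the function name are illustrative assumptions rather than the exact entries of Table 1.

```python
import numpy as np

def build_state(leader_pos, target_pos, radar_positions):
    """Assemble the VL state: distance and bearing to the target,
    plus the distance to each radar (illustrative layout)."""
    leader_pos = np.asarray(leader_pos, dtype=float)
    target_pos = np.asarray(target_pos, dtype=float)

    delta = target_pos - leader_pos
    dist_to_target = np.linalg.norm(delta)
    # Angle between the leader-to-target line and the X-axis (XY plane).
    angle_to_target = np.arctan2(delta[1], delta[0])

    radar_dists = [np.linalg.norm(np.asarray(r, dtype=float) - leader_pos)
                   for r in radar_positions]
    return np.concatenate(([dist_to_target, angle_to_target], radar_dists))

# Example: leader near the origin, one target, two radars (km).
s = build_state([0.0, 0.0, 0.5], [100.0, 50.0, 0.5],
                [[30.0, 10.0, 0.0], [70.0, 40.0, 0.0]])
```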
3.2. Action Space
In our problem setting, the action a is determined by the control input at each moment, and since the control input is continuous, the action space is also continuous. The action space contains all feasible control inputs, whose components are the tangential overload, the normal overload, and the roll angle of the UAS, each bounded within its admissible range. Ultimately, the action space of the VL can be regulated within a range, as shown in Figure 4. At each moment, the VL selects an action according to its current state, i.e., it selects a control input and uses this control input to drive itself in the environment.
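As an illustration of how a continuous policy output can be mapped onto this bounded action space, the sketch below rescales a normalized action in [-1, 1] to control-input limits for tangential overload, normal overload, and roll angle. The numerical bounds are placeholders, not the paper's actual limits.

```python
import numpy as np

# Placeholder bounds: (tangential overload, normal overload, roll angle [rad]).
ACTION_LOW  = np.array([-1.0, 0.0, -np.pi / 3])
ACTION_HIGH = np.array([ 2.0, 5.0,  np.pi / 3])

def to_control_input(raw_action):
    """Map a policy output in [-1, 1]^3 to a feasible control input."""
    raw = np.clip(np.asarray(raw_action, dtype=float), -1.0, 1.0)
    # Affine rescaling from [-1, 1] to [low, high] per component.
    return ACTION_LOW + 0.5 * (raw + 1.0) * (ACTION_HIGH - ACTION_LOW)

u = to_control_input([0.2, -0.8, 0.5])  # mild thrust, low lift, roll right
```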
3.3. Reward Function
A well-designed reward function can effectively improve the convergence speed of the algorithm and optimize the final result; thus, how to design the reward function is a crucial part of reinforcement learning. The design of the reward function is based on the task requirements mentioned in
Section 2.5. For the swarm corresponding to the VL, we design its reward function, as follows:
The element of approaching the target position. To prevent the sparse reward problem [27] and improve sampling efficiency, this paper uses a distance-based guidance reward. As shown in Equation (18), the virtual leader l is rewarded each time it moves closer to the target while exploring the environment. In this reward, constant weights scale the VL-to-target distance and the proximity to the target, and the guidance term is the difference between the current and previous distances to the target position.
The element of altitude control. The altitude of the virtual leader during flight should tend toward a desired altitude. When the virtual leader deviates from this altitude, the system receives a negative reward, scaled by an influence factor that determines how strongly the altitude deviation of the VL affects the system reward.
The element of radar detection. The VL needs to reduce radar detection as it approaches the target position: the closer the leader is to the center of a radar, the greater the penalty. The penalty is scaled by factors penalizing the swarm for approaching the radar; a preset danger-range parameter specifies the distance below which the probability of being detected increases, and the effective detection distance of the radar bounds the region in which the penalty applies.
The element of swarm failure. A penalty term in the reward function accounts for the degree of failure. It depends on the number of normal UASs in the swarm, the number of failed UASs, and a constant greater than zero that regulates how strongly failures affect the swarm.
The element of reaching the target. A terminal reward is given on arrival at the target position, determined by the normal rate of the UAS swarm after completing the task. When the swarm arrives at the target position and satisfies the normal-rate requirement, the system receives a large reward, which helps the system converge more easily in subsequent training.
In summary, the evaluation reward function for maneuvers in the system is defined as the weighted combination of the above elements. The weights indirectly determine the tendency of the system during path planning: when the weight of the radar detection element is larger, the swarm prefers paths that are less likely to be detected by radar, i.e., safer paths.
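To illustrate how these elements could be combined into a single reward signal, the sketch below mirrors the structure described above: a guidance term, an altitude term, a radar-proximity penalty, a failure penalty, and a terminal bonus tied to the normal rate. All weights, thresholds, and functional forms are illustrative assumptions, not the paper's exact equations.

```python
def step_reward(dist_prev, dist_now, altitude, desired_alt,
                radar_dists, danger_range,
                n_normal, n_failed,
                reached_target=False, normal_rate=0.0,
                w_approach=1.0, w_alt=0.01, w_radar=0.5, w_fail=0.2,
                terminal_bonus=100.0, normal_rate_req=0.5):
    """Illustrative composite reward: instantaneous shaping terms plus a
    terminal bonus conditioned on the swarm normal rate."""
    # 1. Distance-based guidance: reward moving closer to the target.
    r_approach = w_approach * (dist_prev - dist_now)
    # 2. Altitude control: penalize deviation from the desired altitude.
    r_alt = -w_alt * abs(altitude - desired_alt)
    # 3. Radar detection: penalize proximity inside each radar's danger range.
    r_radar = -w_radar * sum(max(0.0, (danger_range - d) / danger_range)
                             for d in radar_dists)
    # 4. Swarm failure: penalize the fraction of failed UASs.
    r_fail = -w_fail * n_failed / max(1, n_normal + n_failed)
    # 5. Terminal reward: large bonus on reaching the target with a
    #    sufficient normal rate.
    r_goal = terminal_bonus if (reached_target and
                                normal_rate >= normal_rate_req) else 0.0
    return r_approach + r_alt + r_radar + r_fail + r_goal
```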
3.4. Proximal Policy Optimization
The PPO algorithm is well suited to continuous control tasks, and its objective is the clipped surrogate

$L^{CLIP}(\theta) = \mathbb{E}_t\left[ \min\left( r_t(\theta)\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t \right) \right]$

with

$r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}$

where $\theta_{\mathrm{old}}$ and $\theta$ represent the network weights before and after the update, respectively. The effect of an update is observed through the ratio $r_t(\theta)$ of the action probability under the current policy to that under the previous policy [25]: if the current policy is more appropriate, $r_t(\theta)$ is greater than 1; otherwise, it lies between 0 and 1. PPO improves the stability of the training agent's behavior by restricting policy updates to a small range. The clip function is a truncation that restricts the ratio of the new and old policies to the interval $[1-\epsilon, 1+\epsilon]$. In brief, the purpose of this trick is to prevent the distributions of $\pi_\theta$ and $\pi_{\theta_{\mathrm{old}}}$ from differing too much, while avoiding difficulties in network convergence. When the estimated advantage $\hat{A}_t$ [28] is positive, the current action has a positive impact on the optimization goal; the advantage thus helps the system measure whether an action improves on the default behavior of the policy. As illustrated in Algorithm 1, a maximum number of iterations per training run is set; if the algorithm outputs a feasible solution within this limit, the solution satisfies the task constraints at every time t.
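For illustration, the clipped surrogate objective described above can be written compactly in code. The PyTorch sketch below shows only the ratio-and-clip loss computation, not the paper's full Algorithm 1, and the clip parameter value of 0.2 is an assumption.

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Clipped surrogate loss: -E[min(r*A, clip(r, 1-eps, 1+eps)*A)]."""
    ratio = torch.exp(log_probs_new - log_probs_old)            # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Negative sign because optimizers minimize while PPO maximizes the surrogate.
    return -torch.min(unclipped, clipped).mean()
```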
Algorithm 1: RL-Based Path Planning for Swarm Virtual Leader l
4. Formation Optimization Strategy
Based on the previous section, we can obtain a reference path for the entire UAS swarm. Although the swarm radar detection probability of each waypoint is considered when planning this reference path, a strong constraint is not imposed on this parameter, so there may be some waypoints with higher radar detection probability on the path. These waypoints will increase the threat of swarm failure and undermine the effectiveness of the final rapid traverse task. Therefore, this section presents the strategy to optimize the rapid traverse task completion effect by changing the formation geometry of the UAS swarm.
4.1. Formation Geometry Change
According to Equation (9), the geometry of the formation affects the number of UASs in each radar resolution cell, which in turn affects the swarm detection probability. The swarm changes the formation geometry by splitting and merging groups, thus reducing the failure threat to the swarm. Splitting and merging are achieved by changing the deviation vector of each group. According to Equation (5), the desired position of each group can be obtained from this deviation vector; we next discuss how the deviation vector is determined.
If the separation distance between two groups j and k is expanded beyond the size of a radar resolution cell, we consider them to be in different resolution cells. For ease of description, we define a distance set whose elements are the distance intervals between adjacent groups.
The instruction for the formation geometry change is defined as shown in Figure 5.
At time t, the instruction indicates whether a split operation or a merge operation is performed. It should be noted that the maximum number of times e that a UAS swarm consisting of n groups can be split is bounded by the number of groups n.
When no geometry change is commanded, the groups keep the initial position deviation. When a split or merge is commanded, we define the subset of the distance set that needs to be updated, set its elements to the resolution-cell separation distance, and keep the remaining elements of the set unchanged. Based on Equation (28) and the reference waypoint, we can then obtain the deviation vector that satisfies Equation (27). Finally, according to Equation (5), the desired position of the virtual leader of each group can be computed.
The flowchart of this strategy is shown in Figure 6, in which the swarm detection probability is compared against a preset detection probability threshold.
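The decision logic of Figure 6 amounts to comparing the swarm detection probability at the upcoming waypoint with the threshold and issuing a split, merge, or keep instruction. A minimal sketch follows; the +1/-1/0 encoding of the instruction and the explicit split budget are illustrative assumptions.

```python
def formation_instruction(detect_prob, threshold, n_splits_done, max_splits):
    """Return +1 (split), -1 (merge), or 0 (keep formation), based on the
    swarm detection probability and the remaining split budget (illustrative)."""
    if detect_prob > threshold and n_splits_done < max_splits:
        return +1   # split: push groups into separate radar resolution cells
    if detect_prob <= threshold and n_splits_done > 0:
        return -1   # merge: converge back toward the initial geometry
    return 0        # keep the current formation geometry
```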
4.2. Control Law
For the control law, we use only a potential field method. To begin with, it should be noted that if the swarm needs to split, the desired position of the front group moves forward along the reference path, so the attractive force drives that group to split forward; similarly, if the swarm needs to merge, the rear group is driven to merge forward. An artificial potential field is used to drive each group toward its desired position: the attractive force of each group is directed toward its desired position and scaled by a positive attraction parameter.
To avoid collisions between groups, a repulsive potential field and the corresponding repulsive force are also defined. The repulsive potential field generated by group j and its neighboring groups depends on the distance between their virtual leaders, a safe distance below which any two groups would collide, and a proportional coefficient. From Equation (30), the total potential field is the sum of the fields generated by the virtual leaders of all groups, and the repulsive force is the partial derivative of this total repulsive potential field. According to Equation (32), the repulsive force on a virtual leader decreases from positive infinity to zero as the inter-leader distance d grows from the safe distance to the field's effective range. Clearly, the closer the virtual leaders of two adjacent groups are, the greater the repulsive force, so their distance never falls below the safe distance. Each virtual leader is therefore subjected to the combined attractive and repulsive force. This combined force can be resolved into two components along the longitudinal and vertical axes, from which the tangential and normal control inputs are calculated. Based on this, we obtain the control input of the virtual leader for each group and use it to control the motion of the whole group.
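As a concrete illustration of the attractive/repulsive construction, the sketch below computes the combined planar force acting on a group's virtual leader. The gain values, the specific repulsive profile (unbounded at the safe distance, vanishing at an influence range), and the 2D simplification are assumptions in the spirit of standard artificial potential fields, not the paper's exact Equations (30) and (32).

```python
import numpy as np

def combined_force(pos, desired_pos, neighbor_positions,
                   k_att=1.0, k_rep=50.0, d_safe=0.5, d_influence=3.0):
    """Attraction toward the desired position plus repulsion from neighboring
    virtual leaders that are closer than d_influence (illustrative APF)."""
    pos = np.asarray(pos, dtype=float)
    force = k_att * (np.asarray(desired_pos, dtype=float) - pos)   # attraction

    for q in neighbor_positions:
        diff = pos - np.asarray(q, dtype=float)
        d = np.linalg.norm(diff)
        if d_safe < d < d_influence:
            # Grows toward infinity as d approaches d_safe, zero at d_influence.
            magnitude = k_rep * (1.0 / (d - d_safe)
                                 - 1.0 / (d_influence - d_safe))
            force += magnitude * diff / d
    return force

F = combined_force([0.0, 0.0], [5.0, 0.0], [[1.0, 0.2]])
```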
5. Experiments
In this section, we first introduce the experimental setup, then present the simulation results of the UAS swarm performing the rapid traverse task under different conditions, analyze these results, and report the number of normal UASs when the swarm finally reaches the target position.
5.1. Environment and Training Setup
The experiments are conducted in a simulated mission area. The planning scenario is a rectangular region of 230 km × 160 km × 1.6 km, corresponding to the x, y, and z coordinate axes, respectively. As shown in Figure 7, we set two initial positions and one ending position in the environment and deploy five radars around the line connecting the starting points and the ending point; the coordinate information is shown in Table 2, where z denotes the terrain altitude corresponding to x, y.
The parameters mentioned in the previous section and the parameters of the algorithmic model are set in
Table 3.
In the simulation of the swarm's rapid traverse, the neural networks are all fully connected; the output layer uses a different activation function from the other layers. The learning rate is 0.000125, the discount factor is 0.9, and the batch size during training is 64. The decision cycle of the system is 1 s, and a maximum of 550 decisions are made per episode. A training episode ends if any of the following conditions is met: the episode exceeds the max_episode_length, the swarm collides with the terrain, or the swarm reaches the target.
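The stated hyperparameters can be gathered into a single configuration for reference. In the sketch below, the learning rate, discount factor, batch size, decision cycle, and episode cap follow the values given above, while the hidden-layer sizes and activation names are placeholders because the extracted text does not preserve them.

```python
TRAINING_CONFIG = {
    "learning_rate": 1.25e-4,     # as stated in the text
    "discount_factor": 0.9,       # gamma
    "batch_size": 64,
    "decision_cycle_s": 1.0,      # one decision per second
    "max_episode_length": 550,    # maximum decisions per episode
    # Placeholders: not preserved in the extracted text.
    "hidden_layers": [128, 128],
    "hidden_activation": "relu",
    "output_activation": "tanh",
}
```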
5.2. Simulation Setup and Evaluation Indicators
The number of failed UASs is determined by a random draw against the failure probability of each UAS. To further validate the effectiveness of the proposed method, we perform a numerical simulation of the swarm rapid traverse system using the Monte Carlo method. Four indicators are established to evaluate the simulation results:
Average path length (APL): The average length of the planned path, which should be as short as possible to reduce energy consumption, time cost, and the risk of being detected.
Average rapid traverse success rate (APSR): Given the total time of a simulation, the number of simulations in which the UAS swarm completed the rapid traverse task as a percentage of the total number of simulations. This index evaluates the learning efficiency of the environment and reward settings used in the algorithm.
Average swarm normal rate (ASSR): The average UAS normal rate among all simulations that completed the task. A higher normal rate indicates better resistance and reliability, which is essential for the successful deployment of a UAS swarm. This indicator is crucial for assessing the operational effectiveness of the swarms in carrying out their tasks.
Average algorithm computation time (AACT): The computation time of the algorithm, which should be as short as possible to ensure real-time, efficient performance.
These indicators are defined over the Monte Carlo runs as follows: M is the total number of Monte Carlo simulations; each simulation is indexed individually; a success flag for each simulation indicates whether the single planning is successful, taking the value 1 if successful and 0 otherwise; and the ASSR, APL, AACT, and APSR are obtained by averaging the corresponding quantities over the simulations.
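These averages are straightforward to compute from per-run records. The sketch below assumes each Monte Carlo run reports a success flag, path length, computation time, and final normal rate (the field names are illustrative), and that APL and ASSR are averaged over successful runs only.

```python
def evaluate_runs(runs):
    """Compute APSR, APL, AACT, and ASSR from per-run dicts with keys
    'success' (0/1), 'path_len', 'comp_time', and 'normal_rate'."""
    m = len(runs)
    successes = [r for r in runs if r["success"] == 1]
    n_ok = max(1, len(successes))                       # avoid division by zero

    apsr = len(successes) / m                           # rapid traverse success rate
    apl  = sum(r["path_len"] for r in successes) / n_ok
    aact = sum(r["comp_time"] for r in runs) / m        # computation time over all runs
    assr = sum(r["normal_rate"] for r in successes) / n_ok
    return {"APSR": apsr, "APL": apl, "AACT": aact, "ASSR": assr}
```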
To examine the ability of the swarm to perform tasks in the constructed simulation environment, tests are performed by varying the number of radars and the reward weight parameters, and corresponding experiments and analyses are conducted for different ratios of these weights. The proposed algorithm is then compared with the classical A* and RRT* algorithms to demonstrate its scalability, adaptability, and robustness in terms of the abovementioned indicators.
5.3. Results and Discussions
Multiple radars are present along the rapid traverse path, enabling the tracking and constraint of the swarm. The overall goal of swarm rapid traverse is to plan a path that avoids collisions while balancing range and failure. Based on this, five radars are deployed in the environment and the relevant parameters are set. The UAS swarm starts the rapid traverse task from two initial positions, passes through the radar detection areas, and approaches the target position. The formation geometry is changed in the radar areas to reduce the probability of detection and to guarantee the normal rate of the UASs on final arrival at the target position.
Using the constructed swarm rapid traverse model to solve the problem, the resulting path of the swarm is shown in Figure 8, and the 3D path is shown in Figure 9. In Figure 8a, the swarm passes through one radar area, while in Figure 8b, it passes through two radar areas. The formation geometry change instruction for the swarm in the radar areas along these two paths is shown in Figure 10a, where the X-axis represents the time series after the swarm enters the radar area. The green and red lines correspond to the paths in Figure 8a,b, respectively. That is, since the virtual leader crosses two radar areas in Figure 8b, the first two subplots in Figure 10a represent the instruction changes for that case.
According to Section 2.4, we can obtain the change in the number of UASs in the radar areas, as shown in Figure 10b. The results show that the number of UASs that finally completed the rapid traverse task is 27 and 21, and the normal rate of the swarm is 67.5% and 52.5%, respectively. To show the formation geometry changes of the UAS swarm during the task more clearly, we take the path in Figure 8b as an example. According to Equation (12), we can evaluate whether each UAS has failed and obtain the final normal state of the swarm, as shown in Figure 11. The UAS swarm first maintains the initial formation geometry. After entering the radar area, it gradually splits, increasing the distance between groups, and the formation geometry changes the most when the swarm is closest to the radar. While flying away from the radar region, the swarm merges and converges back to the initial formation geometry.
To demonstrate the capability of the proposed algorithm to handle rapid traverse tasks in unknown environments and to validate its generalization and robustness, we randomly generate positions for swarms and vary the number of radars in the environment. The proposed algorithm is then evaluated in these randomly generated environments, and the results are shown in
Figure 12.
We then combine different numbers of radars with different ratios of the reward weights, and the results of the evaluation indicators are shown in Figure 13. Based on the results, it is evident that the ASSR decreases as the number of radars increases. This is because an increase in the number of radars increases the probability of the swarm being inside a radar area, making it more prone to failure. The impact of the parameters on the indicators is also clear: a smaller weight ratio means that the swarm pays more attention to the negative impact of the radar and thus tends to avoid it, whereas a larger ratio means the swarm tends to move more directly toward the target position. Therefore, the total path length is inversely proportional to this ratio.
The simulation environment for the comparison of the proposed algorithm with the improved A* algorithm and the RRT* algorithm is shown in Figure 7, and the results of the comparison are shown in Table 4, where the cost function settings of the comparison algorithms are consistent with those of the proposed algorithm.
From the experimental results in
Table 4, it can be observed that:
- 1.
The proposed algorithm achieves a higher normal rate than the other two algorithms, showing that it can obtain flight paths with a lower detection probability and improve the normal rate of the UASs.
- 2.
The paths produced by the proposed algorithm are slightly longer than those of the A* algorithm, but the proposed algorithm outperforms A* in solution speed, and its path length is shorter than that of the RRT* algorithm. Compared with the traditional algorithms, the proposed algorithm can therefore obtain a short flight path that enables the UAS swarm to cross the detection threat area quickly.
- 3.
The proposed algorithm has a shorter running time than the other two algorithms, with a significant reduction in computation time relative to both. In complex environments, the proposed algorithm is more adaptable, i.e., it is more capable of handling unknown environments and abnormal situations, because it can adjust its strategy according to changes in the environment and thus find the optimal path quickly.
- 4.
The proposed algorithm achieves a higher rapid traverse success rate than the other two algorithms, which indicates that it has better robustness and stability.
6. Conclusions
In this paper, a novel path planning method for the UAS swarm rapid traverse task is proposed, which enables the swarm to improve its normal rate while avoiding collisions. First, the dynamics model of the UAS and the radar detection model are established, and the task requirements of the flight are clarified, laying a good foundation for the subsequent training. Second, the principle of deep reinforcement learning is introduced, and a reasonable state space, action space, and reward function are designed to avoid the disadvantage of reward sparsity in long-horizon systems and make the network converge effectively. Then, the reinforcement learning algorithm is combined with the formation control algorithm, and the output of the network is applied to the formation keeping and reconfiguration algorithm to control the movement of the UASs in the swarm, avoiding direct processing of high-dimensional swarm information. Finally, the simulation results demonstrate that the proposed algorithm can significantly enhance the normal rate of the swarm after the rapid traverse task.
In future work, we will study end-to-end UAS decision making and planning methods to accomplish tasks such as the real-time dynamic obstacle avoidance of UAS swarms by adding sensor information.