UAV-Cooperative Penetration Dynamic-Tracking Interceptor Method Based on DDPG

Abstract: The multi-UAV system has stronger robustness and better stability in combat. Therefore, the collaborative penetration of UAVs has been extensively studied in recent years. Compared with general static combat scenes, penetration against dynamic tracking and interception equipment is more difficult to achieve. To realize the coordinated penetration of the dynamic-tracking interceptor by the multi-UAV system, an intelligent UAV model is established using the deep deterministic policy-gradient (DDPG) algorithm, and the reward function is constructed using the cooperative parameters of multiple UAVs to guide the UAVs toward collaborative penetration. The simulation experiment showed that the UAVs evaded the dynamic-tracking interceptors and that multiple UAVs reached the target at the same time, realizing the time coordination of the multi-UAV system.


Introduction
Compared with traditional manned aerial vehicles, unmanned aerial vehicles (UAVs) can be controlled autonomously or remotely, place low requirements on the combat environment, offer strong battlefield survivability, and can be used to perform a variety of complex tasks [1,2]. Therefore, UAVs have been widely studied and applied [3]. With the continued application of UAVs in the military field, systems composed of a single UAV have gradually revealed problems of poor flexibility and low stability [4,5]. Cooperative combat using a multi-UAV system has become a major new research direction [6,7]. Under the conditions of the modernized and networked battlefield, an air cluster composed of multiple UAVs has the air power to continuously launch the required strikes, forcing the enemy to spend more resources and deal with more fighters, thereby enhancing overall capability and performance in military confrontation.
Multi-UAV-cooperative penetration is one of the key issues in achieving multi-UAV-cooperative combat. Multiple UAVs start from the same location or from different locations and finally arrive at the same place. At present, UAV penetration-trajectory-planning methods mainly include the A* algorithm [8], the artificial potential field method [9,10], and the RRT algorithm [11]. Most application scenarios of these methods are environments with static no-fly zones, and they rarely consider dynamic threats. The A* algorithm is a typical grid method. This type of method rasterizes the map for planning, but the grid size has a large impact on the result and is difficult to determine. The artificial potential field method easily falls into a local optimum, which can leave the target unreachable. When there is a dynamic-tracking interceptor in the environment, the environment information becomes complicated, and the UAVs must plan in real time. Therefore, traditional algorithms cannot meet the requirements.
Multi-UAV-cooperative penetration generally requires that multiple UAVs achieve time coordination, penetrating defenses along different trajectories and finally reaching the target area at the same time or in a certain time sequence. When the UAVs depart from different locations, collaboration becomes much more difficult. A multi-UAV-coordination algorithm can be obtained by adapting a traditional single-UAV-penetration algorithm to the multi-UAV environment. Chen uses the optimal-control method to improve the artificial potential-field method to achieve multi-UAV coordination [11,12], and Kothari improves the RRT algorithm to achieve multi-UAV coordination [13]. There are also a large number of methods for cooperative control of UAVs based on graph theory [3]. Li proposed a multi-UAV-collaboration method based on graph theory [14]. Ruan proposed a multi-UAV-coordination method based on multi-objective optimization [15]. The above methods all realize the coordination of multiple UAVs, but they do not address dynamic environments and cannot adapt to the complex and dynamic battlefield environment. Aimed at environments with dynamic threats, this paper proposes a method based on deep reinforcement learning to achieve multi-UAV-cooperative penetration.
Reinforcement learning is an important branch of machine learning. Its main feature is that it evaluates the action policy of the agent based on the final rewards, through interaction and trial and error between the agent and the environment. Reinforcement learning is a far-sighted machine-learning method that considers long-term rewards [16,17]. It is often used to solve sequential decision-making problems. Reinforcement learning can be applied not only in a static environment but also in a dynamic environment whose parameters are constantly changing [16,17,18]. Research on reinforcement learning is mostly concentrated on single agents, but there is also a large body of research on reinforcement-learning algorithms for multiagent systems. There is also much research on applying reinforcement learning to UAVs. Pham successfully applied deep reinforcement learning to UAVs to realize autonomous navigation [19]. Wang also studied the autonomous navigation of UAVs in large, unknown, and complex environments based on deep reinforcement learning [20]. Wang applied reinforcement learning to the target search and tracking of UAVs [21]. Based on deep reinforcement learning, Yang studied the task scheduling problem of UAV clusters and solved the application problem of reinforcement learning in multi-UAV systems [22]. The relevant literature indicates that applying deep reinforcement learning to multi-UAV systems is a feasible approach that can be used to achieve complex multi-UAV-system tasks and has great research potential.
This paper takes multi-UAV-cooperative penetration of a dynamic-tracking interceptor as its scenario. Based on the deep reinforcement-learning DDPG algorithm, we establish an intelligent UAV model and realize multi-UAV-cooperative penetration of the dynamic-tracking interceptor by designing a reward function related to coordination and penetration. The simulation results show that the trained multi-UAV system can achieve cooperative attack tasks from different initial locations, which demonstrates the application potential of artificial intelligence methods such as reinforcement learning for coordinated tasks in UAV clusters.

Motion Scene
This paper solves the problem of multi-UAV-cooperative penetration of dynamic interceptors. We assume the UAVs are numbered $L = \{1, 2, \ldots, n\}$. The scene is set as a two-dimensional engagement plane. Figure 1 is the schematic diagram of the motion scene of the multi-UAVs. Based on reinforcement learning algorithms, the following are the requirements:
• The UAV is not intercepted by interceptors when it is moving;
• Multi-UAVs finally reach the target area at the same time.

UAV Movement Model
First, a two-dimensional movement model of the UAV is established. Figure 2 is the schematic diagram of the movement of the UAV. It is assumed that the linear velocity of the UAV is constant and that the angular velocity, $\omega$, is a continuously variable value. The angle between the movement direction of the UAV and the x-axis is the azimuth angle, $\theta$.
The movement of the UAV is divided into x and y directions. First, the current azimuth angle is obtained by integrating the angular velocity of the UAV; then the velocity is decomposed along the coordinate axes using the azimuth angle, $\theta$; finally, the position of the UAV is obtained through integration. The mathematical model is given in (1).
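The three steps described for the movement model (integrate the angular velocity, decompose the constant speed along the axes, integrate the velocity) can be illustrated with a minimal discrete-time sketch. The forward-Euler scheme, the step size `dt`, and the names `PlanarState` and `step` are assumptions made for illustration; they are not taken from the paper.

```python
import math
from dataclasses import dataclass


@dataclass
class PlanarState:
    x: float      # position along the x-axis (m)
    y: float      # position along the y-axis (m)
    theta: float  # azimuth angle measured from the x-axis (rad)


def step(state: PlanarState, v: float, omega: float, dt: float) -> PlanarState:
    """Advance the constant-speed planar model by one forward-Euler step.

    v     : constant linear speed (m/s)
    omega : commanded angular velocity (rad/s), the control input
    dt    : integration step (s), an assumed simulation parameter
    """
    theta = state.theta + omega * dt   # integrate angular velocity -> azimuth
    vx = v * math.cos(theta)           # decompose speed along x
    vy = v * math.sin(theta)           # decompose speed along y
    return PlanarState(state.x + vx * dt, state.y + vy * dt, theta)
```

With zero angular velocity the vehicle flies straight along its current azimuth; a nonzero `omega` bends the trajectory, which is the only degree of freedom the policy controls in this model.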

Dynamic-Interceptor Design
Compared with the static environment, the position of the dynamic interceptor changes in real time, so the UAV must be able to plan in real time. In this paper, the dynamic interceptor is defined as a tracking interceptor that follows the proportional-guidance pursuit law. Compared with most common dynamic interceptors with simple motion rules, such as interceptors that cycle in uniform linear motion or circular motion, the tracking interceptor has stronger uncertainty and is difficult to evade by predicting its motion, so the movement requirements on the multi-UAV system are much higher. In this paper, it is assumed that the linear velocity of the tracking interceptor is constant and that its angular velocity is calculated by proportional guidance. The schematic diagram of its movement is shown in Figure 3.
The basic principle of the proportional-guidance method is to make the interceptor's rotational angular velocity proportional to the line-of-sight angular velocity. Next, the proportional-guidance mathematical model of the interceptor is introduced.
Assume that the position of the interceptor is $(x_b, y_b)$ and its velocity is $(v_{x_b}, v_{y_b})$, and that the position of the UAV is $(x, y)$ and its velocity is $(v_x, v_y)$. The relative position and velocity of the interceptor with respect to the UAV are given in (2), and the interceptor's line-of-sight angle to the UAV is given in (3). Differentiating (3) with respect to time gives the line-of-sight angular velocity (4).
The rotational angular velocity of the dynamic-tracking interceptor is then obtained from (5), where $K$ is the proportional-guidance coefficient, taken as $K = 2$.
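A form of the proportional-guidance relations that is consistent with the description above is sketched below. The symbols $x_r$, $y_r$, $v_{x_r}$, $v_{y_r}$, $q$, and $\omega_b$, and the sign convention (UAV minus interceptor), are assumptions introduced here for illustration; only $K = 2$ and the position and velocity symbols come from the text.

```latex
\begin{aligned}
x_r &= x - x_b, & y_r &= y - y_b, \\
v_{x_r} &= v_x - v_{x_b}, & v_{y_r} &= v_y - v_{y_b}, \\
q &= \arctan\frac{y_r}{x_r}, \\
\dot{q} &= \frac{x_r\, v_{y_r} - y_r\, v_{x_r}}{x_r^{2} + y_r^{2}}, \\
\omega_b &= K\,\dot{q}, \qquad K = 2 .
\end{aligned}
```

The fourth line follows from differentiating $\arctan(y_r/x_r)$ with respect to time and substituting $\dot{x}_r = v_{x_r}$, $\dot{y}_r = v_{y_r}$.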
The interceptor then performs a two-dimensional movement driven by the rotational angular velocity obtained from proportional guidance. Its movement model is similar to that of the UAV.
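As a rough illustration of how the interceptor's turn-rate command could be computed at each simulation step, the sketch below evaluates the line-of-sight angular velocity from relative position and velocity and scales it by the guidance coefficient. The function names, the argument order, and the UAV-minus-interceptor sign convention are assumptions; only the proportionality rule and the default `k = 2` follow the text.

```python
def los_rate(xb, yb, vxb, vyb, x, y, vx, vy):
    """Line-of-sight angular velocity (rad/s) from interceptor to UAV.

    (xb, yb), (vxb, vyb): interceptor position and velocity
    (x, y),   (vx, vy)  : UAV position and velocity
    """
    xr, yr = x - xb, y - yb          # relative position (assumed convention)
    vxr, vyr = vx - vxb, vy - vyb    # relative velocity
    r2 = xr * xr + yr * yr
    if r2 == 0.0:
        raise ValueError("interceptor and UAV positions coincide")
    # Time derivative of atan2(yr, xr)
    return (xr * vyr - yr * vxr) / r2


def pn_turn_rate(xb, yb, vxb, vyb, x, y, vx, vy, k=2.0):
    """Proportional-guidance command: turn rate = k * line-of-sight rate."""
    return k * los_rate(xb, yb, vxb, vyb, x, y, vx, vy)
```

When the line of sight does not rotate (for example, a head-on approach toward a stationary UAV), the command is zero and the interceptor keeps its heading; any crossing motion of the UAV produces a proportional turn toward it.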

Deep Deterministic Policy-Gradient Algorithm
The DDPG algorithm is a branch of reinforcement learning [5]. The basic process of reinforcement-learning training is that the agent performs an action based on the current observed state. This action acts on the agent's training environment, which returns a reward and a new state observation. The goal of training is to maximize the final reward. Reinforcement learning does not require any hand-crafted strategies or guidance during training; it only requires the reward function for the various states of the environment. This is also the only part of training that can be adjusted manually. Figure 4 shows the basic process of reinforcement learning.
The DDPG algorithm is an actor-critic algorithm that solves the problem of applying reinforcement learning in continuous spaces. There are two networks in the DDPG algorithm, namely, the state-action value function network $Q(s, a \mid \theta^{Q})$ with parameters $\theta^{Q}$ and the policy network $\mu(s \mid \theta^{\mu})$ with parameters $\theta^{\mu}$. At the same time, two concepts are introduced: the target network and experience replay. When the value function network is updated, the current value function is used to fit the future state-action value function. If both state-action value functions use the same network, it is difficult to fit during training. Therefore, the target network is introduced. The target network serves as the future state-action value function. It is the same as the state-action value function network being updated, except that it is not updated in real time; instead, it is updated from the value function network once that network has been updated to a certain extent. The policy network adopts the same training method in DDPG. Experience replay stores state transitions $(s_t, a_t, r_t, s_{t+1})$: every time the agent performs an action that causes a state transition, the transition is stored in the experience replay pool. When the value function is updated, it is not updated directly from the action of the current policy; instead, state transitions are sampled from the experience replay pool for the update. The advantage of such an update is that the training and learning of the network is more efficient.
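The two mechanisms described above, experience replay and a lagging target network, can be sketched in a few lines of plain Python. The class name `ReplayPool`, the fixed capacity, and the soft (Polyak) update with rate `tau` are illustrative assumptions; the text only states that transitions are stored and sampled, and that the target network is refreshed from the online network rather than updated in real time.

```python
import random
from collections import deque


class ReplayPool:
    """Fixed-capacity store of state transitions (s_t, a_t, r_t, s_{t+1})."""

    def __init__(self, capacity, seed=None):
        self._buffer = deque(maxlen=capacity)  # oldest transitions drop out first
        self._rng = random.Random(seed)

    def store(self, s, a, r, s_next):
        self._buffer.append((s, a, r, s_next))

    def sample(self, n):
        """Draw n distinct stored transitions uniformly at random."""
        return self._rng.sample(list(self._buffer), n)

    def __len__(self):
        return len(self._buffer)


def soft_update(target, online, tau):
    """Move target parameters a fraction tau toward the online parameters.

    Parameters are represented as flat lists of floats for illustration.
    """
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(online, target)]
```

Sampling from the pool breaks the correlation between consecutive transitions, and a small `tau` keeps the target values slowly varying, which is what makes the value-function fit tractable.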
Before training, the value function network, $Q(s, a \mid \theta^{Q})$, and the policy network, $\mu(s \mid \theta^{\mu})$, are first randomly initialized, and then the target networks, $Q'$ and $\mu'$, are initialized. It is also necessary to initialize a random action noise, $\mathcal{N}$, which helps the agent explore. During training, the agent selects and executes an action, $a_t = \mu(s_t \mid \theta^{\mu}) + \mathcal{N}_t$, based on the current policy network and the action noise, and receives a reward, $r_t$, and a new state observation, $s_{t+1}$, from the environment. The state transition $(s_t, a_t, r_t, s_{t+1})$ is stored in the experience replay pool. After that, $N$ state transitions $(s_i, a_i, r_i, s_{i+1})$ are randomly selected to update the value function network. The value function network is updated by minimizing its loss function. The training objective is the discounted cumulative reward, where $\gamma$ is the discount factor, and the policy network is updated according to the gradient of this objective. Finally, the target networks are updated.
To extend DDPG to a multi-UAV system, multiple actors and a critic must exist in the system. During each training step, the value function network evaluates the policies of all UAVs in the environment, and the UAVs update their respective policy networks based on this evaluation and independently choose the actions to execute. Figure 5 shows the structure of the multi-UAV DDPG algorithm.
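For reference, the update expressions used in the training procedure above have the following form in the standard DDPG formulation: the critic loss, its target value, the discounted objective, the sampled policy gradient, and the target-network update. The soft-update rate $\tau$ is a symbol assumed here, since the text does not name it; the remaining symbols follow the notation already introduced.

```latex
\begin{aligned}
L &= \frac{1}{N}\sum_{i}\left(y_i - Q(s_i, a_i \mid \theta^{Q})\right)^{2}, \\
y_i &= r_i + \gamma\, Q'\!\left(s_{i+1},\, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\right), \\
J &= \mathbb{E}\left[\, r_1 + \gamma r_2 + \gamma^{2} r_3 + \cdots \right], \\
\nabla_{\theta^{\mu}} J &\approx \frac{1}{N}\sum_{i}
  \nabla_{a} Q(s, a \mid \theta^{Q})\big|_{s = s_i,\, a = \mu(s_i)}\;
  \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\big|_{s = s_i}, \\
\theta^{Q'} &\leftarrow \tau\,\theta^{Q} + (1 - \tau)\,\theta^{Q'}, \qquad
\theta^{\mu'} \leftarrow \tau\,\theta^{\mu} + (1 - \tau)\,\theta^{\mu'} .
\end{aligned}
```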
The algorithm design also needs to construct the UAV's action space, state space, reward function, and termination conditions. In this paper, all UAVs are tested and simulated using small UAV models for the purpose of preliminary verification of the algorithm.

Action-Space Design
It can be seen from the motion model of the UAV that the action the UAV can perform is to change its angular velocity, so the action space of multiple UAVs is designed as $A = \{\omega_1, \omega_2, \ldots, \omega_n\}$, which is the collection of the angular velocities of the UAVs.

State-Space Design
To realize the coordinated penetration of UAVs, the state space should include the UAVs' positions, $(x_i, y_i)$, their velocities, $(v_{x_i}, v_{y_i})$, and the central position of the target area, $(x_t, y_t)$. At the same time, it is necessary to introduce the state observation of the interceptor position, $(x_b, y_b)$. Therefore, the state space is set to $S = \{x_i, y_i, v_{x_i}, v_{y_i}, x_t, y_t, x_b, y_b\}$, $i \in L$.

Termination Conditions Design
The termination conditions are divided into four items, namely, out of bounds, collision, timeout, and successful arrival.
a) Out of bounds: When the UAV moves beyond the environmental boundary, the mission is regarded as a failure; an end signal and a failure signal are given.
b) Collision: When the UAV is captured by the interceptor, that is, when the distance between the two falls within the interceptor's capture distance, the mission is regarded as a failure; an end signal and a failure signal are given.
c) Timeout: When the training time exceeds the maximum movement time, the task is regarded as a failure; an end signal and a failure signal are given.
d) Successful arrival: When the UAV successfully reaches the target area, the mission is successful; an end signal and a success signal are given.
When any UAV in the environment finishes training, all UAVs finish training and give a failure or success signal according to the distance from the target point.
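The four termination conditions can be expressed as a single per-step check, sketched below for one UAV. All names and parameters (`Outcome`, `check_termination`, `bounds`, `capture_dist`, `target_radius`, `max_time`) are hypothetical, and the order in which the conditions are tested is an assumption; the text does not specify a priority among them.

```python
import math
from enum import Enum


class Outcome(Enum):
    RUNNING = "running"
    OUT_OF_BOUNDS = "out_of_bounds"   # failure
    CAPTURED = "captured"             # failure
    TIMEOUT = "timeout"               # failure
    ARRIVED = "arrived"               # success


def check_termination(uav_xy, interceptor_xy, target_xy, t, *,
                      bounds, capture_dist, target_radius, max_time):
    """Classify the current step for one UAV.

    bounds is (xmin, xmax, ymin, ymax); distances are in the same unit as
    the coordinates; t and max_time are in the same time unit.
    """
    x, y = uav_xy
    xmin, xmax, ymin, ymax = bounds
    if not (xmin <= x <= xmax and ymin <= y <= ymax):
        return Outcome.OUT_OF_BOUNDS
    if math.dist(uav_xy, interceptor_xy) <= capture_dist:
        return Outcome.CAPTURED
    if math.dist(uav_xy, target_xy) <= target_radius:
        return Outcome.ARRIVED
    if t > max_time:
        return Outcome.TIMEOUT
    return Outcome.RUNNING
```

In a multi-UAV episode, this check would be run for every UAV at each step, and the episode would end for all UAVs as soon as any of them returns something other than `Outcome.RUNNING`.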

Reward-Function Design
The sparse reward problem is a common problem when designing the reward function. It affects the training process of the UAV, prolongs the training time, and can even prevent the training goal from being achieved. To better achieve collaborative tasks and solve the problem of sparse reward, the reward-function design is divided into four parts, namely, the distance-reward function, $R_d$, which is related to the distance between the UAV and the target; the cooperative reward, $R_{co}$, which is used to constrain the position of the UAV; the mission success reward, $R_s$; and the mission failure reward, $R_{fail}$. The reward function is linearized to improve the efficiency of UAV cluster training. One UAV is taken as an example to introduce the design of the reward function.
Let $d_i = \sqrt{(x_i - x_t)^2 + (y_i - y_t)^2}$, $i \in L$, denote the distance between UAV $i$ and the target area, and let $d_{target}$ denote the distance between the UAV's initial location and the target area.
The distance reward, $R_d$, is related to the distance from the UAV to the target area. The closer the UAV, the greater the distance-reward value. This reward is the key to whether the UAV can reach the target area. Its specific form is given in (11).
The cooperative reward is related to the cooperative parameters of the UAV cluster. Here, the difference between the farthest and closest distances from the UAVs to the target area is selected as the coordination parameter. Its specific expression is given in (12), where $d_{max} = \max\{d_1, d_2, \ldots, d_n\}$ and $d_{min} = \min\{d_1, d_2, \ldots, d_n\}$ are, respectively, the maximum and minimum distances between the UAVs and the target area in the environment. When there are two UAVs in the cluster, the distribution diagram is shown in Figure 6.
It can be seen from the mathematical expression and the distribution diagram that the cooperative reward has two major trends. The smaller the maximum-distance difference within the UAV cluster, the larger its value, which leads the UAVs toward time coordination. The closer the UAV is to the target area, the larger its value, which guides the UAV to reach the target area.

When the UAV receives the success signal and finally reaches the target area, the mission is successful, and a success reward is given. The success reward is also related to the farthest distance between the UAV cluster and the target point. The closer the distance, the greater the success reward, as shown in (13).
When the final mission of the UAV fails, that is, when it is intercepted by an interceptor and fails to reach the target area, a negative failure reward is given, as shown in (14).
By linearly superimposing the above reward functions, the final reward function is obtained. The reward function, $R$, of the UAV is shown in (15).
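The structure of the combined reward (a dense distance term, a dense coordination term, and terminal success and failure terms, summed linearly) can be sketched as follows. This sketch does not reproduce expressions (11) to (15): the linear forms, the normalization by `d_target`, and the coefficients `k_d`, `k_co`, `k_s`, and `k_fail` are placeholder assumptions chosen only to match the qualitative trends described in the text.

```python
def total_reward(i, distances, d_target, *, success=False, failed=False,
                 k_d=1.0, k_co=1.0, k_s=10.0, k_fail=10.0):
    """Illustrative reward for UAV i, given every UAV's distance to the target.

    distances : list of current distances d_1..d_n to the target area
    d_target  : initial distance to the target area, used for normalization
    """
    d_i = distances[i]
    d_max, d_min = max(distances), min(distances)

    r_d = -k_d * d_i / d_target                 # grows as UAV i approaches the target
    r_co = -k_co * (d_max - d_min) / d_target   # grows as the distance spread shrinks
    # Terminal terms: success scaled by the farthest UAV, flat penalty on failure
    r_s = k_s * max(0.0, 1.0 - d_max / d_target) if success else 0.0
    r_fail = -k_fail if failed else 0.0

    return r_d + r_co + r_s + r_fail
```

Because the two dense terms are nonzero at every step, the agent receives a learning signal long before any episode ends, which is the purpose of adding them alongside the sparse terminal rewards.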

Simulation-Scene Settings
Based on the environment modeling, a simulation experiment of multi-UAV-cooperative penetration was carried out. To simplify the training, the scene size and the speed of the UAV are set to smaller values. The number of UAVs in the cluster is set to two. In the environment, there are as many dynamic interceptors as there are UAVs. Each dynamic interceptor tracks one UAV, that is, each interceptor calculates its proportional-guidance angular velocity with respect to its assigned UAV. The initial positions of the two interceptors are at the center of the target area, and the radius of the target area is set to 60 m. Table 1 shows the simulation parameters during training. During the training process, the learning rates of the actor network and the critic network are set to $\alpha_1 = 0.0001$ and $\alpha_2 = 0.001$, the discount factor is set to $\gamma = 0.9$, and the action noise is set to 0.3.

Simulation Results and Analysis
After 10,000 trainings, the simulation results are shown in Figure 7. The two UAVs bypass the dynamic interceptors and finally reach the target area at the same time, with arrival times that differ by no more than the simulation step (1 s).
Figures 8 and 9 show the angular velocity change curves of the UAVs and interceptors and the line-of-sight change curves of the interceptors. It can be seen from the figures that the UAVs carried out a wide-angle maneuver, the interceptors' line-of-sight angle to the UAVs continued to increase, and the interception failed.
Figure 10 shows the distance curve between the two UAVs and the interceptors. The black line at the bottom of the figure represents the capture distance of the interceptor. It can be seen from the figure that the minimum distance between the two UAVs and the interceptors is above the black line, that is, they are not captured by the interceptors.
Figure 11 shows the distance between the two UAVs and the target point. In the figure, there is a certain gap between the initial distances of the two UAVs from the target area, which is about 10 m.
It can be seen from the figure that the final distance difference between the two UAVs and the target point is almost zero. At the beginning of the movement, the distance between the two UAVs and the target point is about 3 m. UAV A deliberately takes a detour while it keeps moving, so that the target point can finally be reached at the same time.
The simulation experiment shows that the DDPG algorithm can complete the cooperative penetration of the dynamic-tracking interceptor through the training of the UAV cluster, which meets the high-performance requirements of the multi-UAV system and is a method with huge application potential.

Conclusions
Based on the deep reinforcement-learning DDPG algorithm, this paper studies UAV-cooperative penetration. Based on the mission scenario model, the UAV's action space, state space, termination conditions, and reward functions are designed. Through the training of the UAVs, the coordinated penetration of multiple aircraft was realized. The main conclusions of this paper are as follows:
The collaborative method based on the DDPG algorithm designed in this paper can achieve coordinated penetration between UAVs. After the UAVs are trained, they can coordinate to evade interceptors without being intercepted by them. Compared with traditional algorithms, the UAV's penetration performance is stronger, the applicable environment is more complex, and the method has great application prospects.
The reinforcement-learning collaboration method in this paper realizes time collaboration between UAVs on the premise of exchanging state information between multiple aircraft, and the UAVs finally reach the target area at the same time from different initial locations, achieving time coordination. However, this paper does not consider factors such as communication delays or failures between UAVs, which can cause a UAV to receive wrong information. This problem may be solved by predicting the UAVs' states and by information fusion. Further research can be carried out in follow-up work.
This paper only considers movement in the engagement plane. In follow-up work, the movement can be expanded to three dimensions, and multi-UAV-cooperative penetration in a 3D environment will place higher requirements on the algorithm.

Figure 1. The motion scene of the multi-UAVs.
Figure 4. The basic process of reinforcement learning.
Figure 8. The parameter change curve of UAV A and Interceptor A during training.
Figure 9. The parameter change curve of UAV B and Interceptor B during training.
Figure 10. The distance between the UAV and the interceptor.
Figure 11. The distance curve between the UAV and the target.