Adaptive Collision Avoidance for Multiple UAVs in Urban Environments

Abstract: The increasing number of unmanned aerial vehicles (UAVs) in low-altitude airspace is seriously threatening the safety of the urban environment. This paper proposes an adaptive collision avoidance method for multiple UAVs (mUAVs), aiming to provide safe guidance for UAVs at risk of collision. The proposed method is formulated as a two-layer resolution framework that considers both speed adjustment and rerouting strategies. The first layer is established as a deep reinforcement learning (DRL) model with a continuous state space and action space that adaptively selects the most suitable resolution strategy for UAV pairs. The second layer is developed as a collaborative mUAV collision avoidance model, which combines three-dimensional conflict detection with a conflict resolution pool to perform resolution. To train the DRL model, a deep deterministic policy gradient (DDPG) algorithm is introduced and improved upon. The results demonstrate that the average time required to calculate a strategy is 0.096 s, the success rate reaches 95.03%, and the extra flight distance is 26.8 m, which meets real-time requirements and provides a reliable reference for human intervention. The proposed method can adapt to various scenarios, e.g., different numbers and positions of UAVs, with interference from random factors. The improved DDPG algorithm also significantly improves convergence speed and saves training time.


Introduction
Urban low-altitude airspace is an important natural resource that possesses great socioeconomic value, and rational management of urban airspace is of great significance in alleviating traffic congestion and reducing the rate of ground traffic accidents [1,2]. As the main subjects of urban air traffic, unmanned aerial vehicles (UAVs) have attracted widespread attention due to their flexibility, convenience, and low cost. By the end of 2022, China had 700,000 registered UAV owners, 15,130 companies operating UAVs, 958,000 registered UAVs, and about 57,000 h of average daily flight [3]. It can be foreseen that with the development of urban air traffic, the types of tasks performed by UAVs will inevitably tend to diversify, showing great promise in fields such as tourism, rescue, and logistics, and with this comes an increase in urban air traffic flow and the complication of flight situations.
As the number and size of UAVs increase, their operation in urban airspace will present additional security threats. The dense distribution of buildings, the complex structure of the airspace, and the high density of aircraft make it extremely easy for accidents such as dangerous approaches or even collisions to occur. Therefore, in the face of limited airspace resources, means of effectively avoiding the risk of collision have become a primary issue that needs to be addressed in order to build urban air traffic demonstration areas and develop the low-altitude economy. However, traditional collision avoidance algorithms lack sufficient success rates and do not satisfy the safety interval criteria and real-time requirements in multi-target and high-density urban scenarios. In addition, these methods generate collision avoidance trajectories based on a discrete state space, and the selectable actions are also discrete, and so are not able to adequately reflect the flexibility of UAVs.
To address these problems, we propose an innovative two-layer resolution framework for mUAVs based on DRL, which can adaptively provide avoidance strategies for UAVs based on a continuous action space and ensure that each UAV has decision-making capability, thus significantly improving the success rate and computational efficiency.

Related Prior Work
Many studies have been performed proposing methods for UAV collision avoidance. In general, existing methods can be grouped into the following three categories: heuristic optimization methods, optimal control theory methods, and artificial intelligence methods.
(1) Heuristic optimization methods Heuristic optimization methods divide the conflict process into a series of discrete state spaces and then perform an optimal search for approximate solutions in a certain cooperative manner [4]; they primarily include swarm intelligence optimization methods [5] and heuristic search algorithms such as A* and D*. Zeng et al. combined the ant colony algorithm and the A* algorithm to solve the unmanned ground vehicle (UGV) scheduling planning problem, avoiding conflicts during simultaneous path planning for UGVs at a lower cost [6]. Zhao et al. considered collision probability and the intention information of intruders, using the A* algorithm to optimize trajectory planning to avoid collision risks [7]. Yun et al. applied the enhanced D* lite algorithm in the field of robot path navigation in unknown dynamic environments [8]. Furthermore, these methods are usually combined with other algorithms to solve the conflict problem, such as clustering methods [9] and the Legendre pseudospectral method [10].
(2) Optimal control theory methods The optimal control theory methods select the permissible control rate according to the kinematic model or the time-domain mathematical model so that the UAV operates according to the constraints and thus achieves collision avoidance. These methods mainly comprise mixed-integer linear programming, nonlinear optimization, and dynamic programming. Radmanesh et al. proposed fast-dynamic Mixed Integer Linear Programming (MILP) for the path planning of UAVs in various flight formations, focusing on preventing typical UAVs from colliding with any intruder aircraft [11]. De Waen et al. targeted complex scenarios with multiple obstacles and divided the MILP problem into many smaller MILP subproblems for trajectory modeling, which ensures the scalability of MILP for solving conflict problems [12]. Alonso-Ayuso et al. developed an exact mixed-integer nonlinear optimization model based on geometric construction for tackling the aircraft conflict detection and resolution problem [13].
Heuristic-based search methods are reliable and effective for achieving collision avoidance and are able to resolve conflicts among small numbers of UAVs, which is the most common case in practice today. However, these methods are not well suited to mUAV conflicts, especially when the airspace is crowded, in which case the collision avoidance paths generated may suffer from secondary conflict problems. The optimal control methods take the minimum interval between UAVs as the optimization condition, and their relatively complex theory leads to a decrease in anti-interference capability and an increase in computation, so they are not able to meet real-time requirements.
(3) Artificial intelligence methods The widespread use of artificial intelligence (AI) in recent years has provided new ideas and implementation paths. Reinforcement learning (RL) is the study of how an agent can interact with the environment to learn a policy that maximizes the expected cumulative reward for a task. When RL is combined with the powerful understanding ability of deep learning, it clearly possesses better decision-making efficiency than humans in a nearly infinite state space [14,15]. The application of DRL to the field of UAV collision avoidance can solve the problems presented by the methods described above, while achieving better avoidance in urban airspace with variable environmental states and meeting strict real-time requirements.
The authors of [16,17] developed a Q-learning algorithm to design the dynamic movement of UAVs, but there is no assurance that it can handle high-dimensional input data. Singla et al. proposed a deep recurrent Q-network with temporal attention to realize the indoor autonomous flight of a UAV [18]. In [19], the DDQN algorithm was applied to ship navigation to achieve multi-ship collision avoidance in crowded waters. Li et al. designed a tactical conflict resolution method for air logistics transportation based on the D3QN algorithm, enabling UAVs to successfully avoid non-cooperative targets [20]. The value-based algorithms (e.g., D3QN and DDQN) can adapt to complex state spaces, but cannot provide satisfactory solutions for continuous control problems. The emergence of policy-based RL has solved such problems; one of the most widely used and mature approaches in practice is the DDPG algorithm proposed by DeepMind [21]. For example, references [22–25] addressed trajectory optimization in a two-UAV scenario based on the DDPG algorithm. Ribeiro et al. [22] utilized a geometric approach to model conflict detection between two UAVs and trained the agent in conjunction with the DDPG algorithm to generate a resolution strategy. The authors of [23,24] utilized the DDPG algorithm to solve the UAV path-following problem, taking into account the conflict risks during movement. In [25], a proper heading angle was obtained using the DDPG algorithm before the aircraft reached the boundary of the sector to avoid collisions. Alternatively, Proximal Policy Optimization (PPO) methods can be used in aircraft collision avoidance, and have shown a certain level of performance [26].
In summary, although a variety of UAV collision avoidance methods have been developed based on DRL, there are still several gaps in terms of actual application: (1) Scholars have typically rasterized the entire airspace when designing the state space [16,17,22], which has some specific limitations: UAVs can only move to an adjacent raster cell, which limits the action dimensions of UAVs and is not able to adequately reflect their flexibility. The use of a discrete state space also wastes airspace resources, as a whole raster area may become a no-fly zone due to some small buildings, thus reducing the space available for UAV flights. (2) The dimensional range is also an important factor in measuring the performance of a method; most existing UAV collision avoidance methods borrow from ground traffic, and thus the dimensional range is limited to 2D, which does not match the actual operational situation in the airspace, while also limiting the UAV avoidance actions that can be selected [18,20,26]. (3) When DRL theory is applied to mUAV collision avoidance, in the existing literature, only one UAV is regarded as the agent, and the other UAVs are regarded as dynamic obstacles without resolution ability [20]; their tracks are planned in advance. In actual operation, conflicts may arise from any UAV, so each UAV should have the ability to make its own decisions.

Our Contributions
To solve the above problems, in this paper, a more practical collision avoidance method for mUAVs is developed. The primary contributions of this study are the following: (1) In this paper, an adaptive decision-making framework for mUAV collision avoidance is proposed. The adaptive framework enables UAVs to autonomously determine the avoidance action to be taken in 3D space, providing the UAVs with more options for extrication strategies when faced with static or dynamic obstacles. This framework combines the conflict resolution pool in order to transform mUAV conflicts into UAV pairs for avoidance, controlling the computational complexity at the polynomial level, thereby providing a new idea for mUAV collision avoidance.
(2) A DRL model for UAV pairs is designed based on a continuous action space and state space, reflecting the maneuverability of UAVs and avoiding wastage of urban airspace resources.The model endows each UAV with decision-making ability, and utilizes a fixed sector format to explore conflicting obstacles, thus simplifying the state of the agent and making it more adaptable to dense urban building environments.
(3) The DDPG algorithm is introduced to train the agent, and its convergence speed is enhanced by proposing the mechanism of destination area dynamic adjustment.
A summary of surveys related to DRL methods in the field of UAV collision avoidance is provided in Table 1. The table shows that this research represents the first study to resolve mUAV conflicts in a 3D environment based on a continuous state and action space.
The rest of the paper is organized as follows. In Section 2, the two-layer resolution framework is presented, and the methods for collision avoidance of UAV pairs and mUAVs are proposed. In Section 3, the improved DDPG algorithm is proposed for training the agent. In Section 4, the validity of the method is verified using a designed city scenario. In Section 5, some conclusions and summaries are presented. The table in the abbreviations section shows the main symbols used in this paper.

Problem Formulation
In this section, we propose a two-layer resolution framework (Figure 1) and divide the problem formulation of mUAV collision avoidance into two parts: collision avoidance between UAV pairs, and collaborative mUAV collision avoidance. Firstly, the collision avoidance agent training model is designed based on DRL, and the agent can assign avoidance actions for UAV pairs in real time and complete collision avoidance for both dynamic obstacles (other UAVs) and static obstacles (buildings). Secondly, the collaborative mUAV collision avoidance model is proposed, in which mUAV conflicts are transformed into the form of UAV pairs and the agent is then used to deconflict them one by one.
The collision zone models of the UAV and the building are generated as described in an existing study [27], and are ellipsoidal and cylindrical in shape, as shown in Equations (1) and (2).
where a_u, b_u, h_u are the semi-axes of the ellipsoid, and R_o, h_o are the radius and height of the cylinder, respectively.
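Equations (1) and (2) did not survive extraction. A standard ellipsoidal/cylindrical collision-zone formulation consistent with the parameters above (a reconstruction, not necessarily the authors' exact equations, with (x_u, y_u, z_u) and (x_o, y_o) as the assumed UAV and building center coordinates) is:

```latex
% Eq. (1): ellipsoidal UAV collision zone centered at (x_u, y_u, z_u)
\frac{(x - x_u)^2}{a_u^2} + \frac{(y - y_u)^2}{b_u^2} + \frac{(z - z_u)^2}{h_u^2} \le 1

% Eq. (2): cylindrical building collision zone centered at (x_o, y_o)
(x - x_o)^2 + (y - y_o)^2 \le R_o^2, \qquad 0 \le z \le h_o
```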

First Layer: DRL-Based Method for Collision Avoidance between UAVs in a UAV Pair
In this section, the first layer of the framework is designed, and the agent training model is constructed with respect to three aspects: state space, action space, and reward function.

Continuous State Space
According to the explanation in the previous section, we replace the rasterized space with a continuous state space. The agent state vector includes the following three parts:
(1) The flight state vector of the UAV (ϕ: heading angle; V: horizontal speed; Z: the altitude of the UAV; ϕ_g: the relative heading angle of the destination to the UAV; d_g: the horizontal distance of the destination to the UAV), as shown in Figure 2. These attributes accurately reflect the current flight status of the UAV, and the agent guides the UAV to its destination based on the flight state vector.
(2) Interaction vectors between UAV pairs (ϕ_us: the difference in heading angle; Z_us: the difference in altitude; d_us: the horizontal distance between the two UAVs). These attributes reflect the position and heading relationships between the UAVs in a UAV pair, and the agent avoids collision between the UAVs in a UAV pair based on the interaction vectors.
(3) Building vectors. There are many buildings in urban airspace, and if all building information were fed to the agent, the state vector would have a high dimensionality and the speed of convergence would suffer. In this paper, considering the detection range of the UAV in the horizontal direction, a flight sector is used to map the obstacles affecting flight into a fixed-length vector, and these are regarded as the obstacle vectors, as shown in Figure 2.
The detection area is assumed to be a circle centered at the UAV's location with the detection distance as its radius; that is, the agent can obtain the position information of obstacles in this area. Starting from the heading angle and rotating counterclockwise, every 45° forms one sector, giving a total of eight sectors. The distance between the UAV and every obstacle in each sector is calculated, and the closest distance is taken as the sector's attribute; if there are no obstacles in the sector, the value of this attribute is the detection distance. Supposing that there are j obstacles in sector m, then:

d_iom = min_j (d^j_iom) if sector m contains obstacles, and d_iom = d_det otherwise, (3)

where R_u denotes the collision zone radius of the UAV, R^j_o denotes the collision zone radius of building j, d_det denotes the UAV detection distance, d^j_iom denotes the distance attribute between the UAV and building j, and d_iom denotes the distance attribute of sector m, so the obstacle vector of the UAV is (d_io1, d_io2, …, d_io8). In summary, the state vector received by the agent from the environment at moment t is:

S_t = [ϕ, V, Z, ϕ_g, d_g, ϕ_us, Z_us, d_us, d_io1, …, d_io8] (4)
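The eight-sector mapping described above can be sketched as follows; the function and variable names are our own, and the surface-to-surface distance formula for d^j_iom (center distance minus both collision-zone radii) is an assumption consistent with the notation above.

```python
import math

def sector_obstacle_vector(uav_pos, heading_deg, obstacles, r_uav, d_det):
    """Map detected obstacles into a fixed-length eight-sector distance vector.

    obstacles: list of (x, y, radius) building collision zones.
    Sectors of 45 degrees are counted counterclockwise from the heading;
    a sector containing no obstacle keeps the detection distance d_det.
    """
    vec = [float(d_det)] * 8
    ux, uy = uav_pos
    for ox, oy, r_obs in obstacles:
        dx, dy = ox - ux, oy - uy
        dist = math.hypot(dx, dy)
        if dist > d_det:
            continue  # outside the detection circle
        # bearing of the obstacle relative to the heading, in [0, 360)
        ang = (math.degrees(math.atan2(dy, dx)) - heading_deg) % 360.0
        m = int(ang // 45) % 8  # sector index
        # assumed form of d^j_iom: center distance minus both zone radii
        d_jiom = max(dist - r_uav - r_obs, 0.0)
        vec[m] = min(vec[m], d_jiom)
    return vec
```

A UAV at the origin heading along the x-axis, with buildings 10 m ahead and 10 m to its left, would see those surface distances in sectors 0 and 2 and the detection distance everywhere else.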

Continuous Action Space
When modeling the UAV action space based on deep reinforcement learning, the control of the UAV is usually achieved by adjusting spatial position, velocity, or acceleration. Given the continuous state space and the flexibility of UAVs, in this paper, the continuous action space is designed based on adjusting the velocity, and the maneuvering methods of UAVs are simplified and summarized into three processes: heading adjustment, horizontal speed adjustment (with the speed increment bounded within [−2 m/s, 2 m/s]), and altitude adjustment. Thus, the action space of the agent at moment t is:

A_t = (∆ϕ_t, ∆V_t, ∆Z_t) (5)
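Applying such a continuous action can be sketched as below; only the 2 m/s speed bound appears in the text, so the heading and altitude bounds here are illustrative assumptions.

```python
import numpy as np

# Assumed per-step bounds; only the speed bound of 2 m/s comes from the text.
MAX_DPHI = 30.0   # heading increment, degrees (assumption)
MAX_DV = 2.0      # horizontal speed increment, m/s (from the text)
MAX_DZ = 5.0      # altitude increment, m (assumption)

def apply_action(state, action):
    """Apply a continuous (dphi, dv, dz) action, clipped to the bounds."""
    dphi, dv, dz = np.clip(action,
                           [-MAX_DPHI, -MAX_DV, -MAX_DZ],
                           [MAX_DPHI, MAX_DV, MAX_DZ])
    phi, v, z = state
    # wrap heading to [0, 360) and forbid negative speed
    return ((phi + dphi) % 360.0, max(v + dv, 0.0), z + dz)
```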

Reward Function Design
The reward is a scalar feedback signal given by the environment that shows how well an agent performs at performing a certain strategy at a certain step.The reward function is a key component of the DRL framework.The purpose of the agent interacting with the environment is to maximize its reward value, so designing a suitable reward function for the agent can improve training performance and lead to faster convergence.
In this paper, we set four reward functions that consider: the shortest path to the destination, collision avoidance between the UAVs in a UAV pair, collision avoidance between the UAV and buildings, and preventing the UAV from flying out of the specified area. We use artificially designed "dense" rewards to achieve a dynamic balance between near-term and long-term rewards, which mitigates the sparse reward problem in DRL.
(1) Destination intent reward: when there are no obstacles in the sectors of the UAV pair, the destination intent reward is used to ensure that the UAV takes the shortest path to the destination. The entire movement of the UAV is divided into multiple "intensive" actions, and the reward function is set for them to ensure that each action of the UAV affects the final reward value, thus contributing to the improvement of the overall strategy, as shown in Equation (6).
(2) Building collision avoidance reward: when there are obstacles in the sectors of the UAV pair, it is necessary to ensure that the UAVs avoid colliding with buildings while flying to their destinations; therefore, we need to balance the two tasks, and the reward function in this case is shown in Equation (7).
(3) UAV collision avoidance reward: the UAV collision avoidance reward is used to avoid collisions between the UAV and other UAVs. A UAV alert area is set up, and UAV collision avoidance is made the main task when there are other UAVs within the alert area, as shown in Equation (8), where V_i^t is the direction vector of the velocity of the UAV at moment t.
(4) Additional reward: there are four final states of the UAV: reaching the destination, flying out of the control area, colliding with obstacles, and colliding with other drones. This additional reward provides a relatively large reward or penalty value when the UAV reaches its final state, which can be used to guide the UAV to its destination and avoid bad events such as collisions or loss of control, as shown in Equation (9).
In summary, the reward function received by the agent per unit of time is as follows:
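The four components above can be combined into a single dense reward, sketched below; all weights, thresholds, and terminal values are illustrative assumptions, not the paper's Equations (6)-(10).

```python
def reward(d_goal_prev, d_goal, d_obs_min, d_uav_min, done_state,
           d_safe_obs=10.0, d_alert_uav=20.0):
    """Dense reward sketch; weights and thresholds are illustrative only."""
    r = 0.0
    # (1) destination intent: reward progress toward the destination
    r += 1.0 * (d_goal_prev - d_goal)
    # (2) building avoidance: penalize closing in on the nearest obstacle
    if d_obs_min < d_safe_obs:
        r -= (d_safe_obs - d_obs_min)
    # (3) UAV avoidance dominates inside the alert area
    if d_uav_min < d_alert_uav:
        r -= 2.0 * (d_alert_uav - d_uav_min)
    # (4) additional terminal reward/penalty for the four final states
    terminal = {"goal": 100.0, "out_of_area": -50.0,
                "hit_building": -100.0, "hit_uav": -100.0}
    r += terminal.get(done_state, 0.0)
    return r
```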

The Interaction between the Agent and the Environment
In reinforcement learning, the interaction between the agent and its environment is often modeled by a Markov decision process. This process can be represented by a four-tuple (S, A, R, γ), where S is the current state of the agent (Equation (4)), A is the action taken by the agent (Equation (5)), R is the reward value obtained by the agent after taking the current action, and γ ∈ [0, 1] is a discount factor, which is a constant, as shown in Figure 3.

Figure 3.The interaction mode between the agent and the environment.
Assuming a discrete time domain t ∈ {0, 1, 2, …}, the agent starts from the initial state S_0 and observes the environment state S_t ∈ S at a certain time node t, taking action A_t ∈ A based on the state and a specific decision-making policy; the next state is then transferred to S_{t+1}, while the agent receives an immediate reward R_t = R(S_t, A_t). All the variables obtained from the Markov decision process can be recorded as a trajectory τ = (S_0, A_0, R_0, S_1, A_1, R_1, …), and each segment of the trajectory can be intercepted to form a set of training data for the subsequent training of the Target network and Evaluate network in the algorithm.

Three-Dimensional Conflict Detection

To ensure mUAV safety, the conflict risk of each UAV needs to be detected and then targeted for resolution. The conflict risk is detected based on the velocity vector and distance; the risk includes two types: UAV vs. UAV and UAV vs. building.
For conflict detection between the UAVs in a UAV pair, subscripts S and R designate the stochastic UAV and the reference UAV, respectively, in any UAV pair; P_R denotes the position of the reference UAV, P_S denotes the position of the stochastic UAV, and the velocities of the reference UAV and the stochastic UAV at the current moment are V_R and V_S.
In the modeling process, a combined collision zone is defined and assigned to the reference UAV so that the stochastic UAV can be regarded as a particle. A 3D collision coordinate system is established with the origin fixed at the position of the reference UAV; the relative position and velocity of the UAV pair are ∆P_uu = P_S − P_R and ∆V_uu = V_S − V_R. Let ∆l be the forward extension of the relative velocity ∆V_uu; if the intersection of ∆l and the combined collision region D is non-empty, i.e., ∆l ∩ D ≠ ∅, then the UAV pair meets the condition of conflict in the relative velocity dimension. The spatial distance of the UAV pair is then calculated, and if the spatial distance is less than the threshold while the relative velocity conflict condition is satisfied, it can be determined that there is a collision risk for this UAV pair.
For conflict detection between UAVs and buildings, the combined collision zone is also defined. Since buildings are fixed, it is only necessary to consider whether the velocity extension line of the UAV intersects the combined collision region; the spatial distance between the UAV and the building is then considered to determine the conflict, as shown in Figure 4.

Conflict Resolution Pool
Based on the UAV conflict detection method in the previous section, we propose the concept of a conflict resolution pool, where the UAVs detected as being at risk are stored as the objects of the collision avoidance agent.For UAV-to-UAV conflicts, the elements in the pool are UAV pairs, and for UAV-to-building conflicts, the elements are single UAVs, while the distances of UAVs at risk are deposited in the pool as the priorities of conflict resolution, with smaller distances indicating higher levels of risk and a more urgent need for collision avoidance, as shown in Equation (12).
where (u i , u j ) is the conflict UAV pair, u l is the UAV in conflict with the building, σ is the priority, and d is the spatial distance.
The conflict resolution pool transforms mUAV conflicts into UAV pairs for avoidance, which simplifies the cooperative collision avoidance problem to a great extent.If a trajectory search-based approach is used to calculate the safe path for each UAV individually, the search space will grow exponentially with the number of UAVs, and when the number exceeds a certain value, the complexity of the algorithm will be too high to be able to solve the problem within an acceptable time.The conflict resolution pool prevents the resolution energy being wasted on temporarily safe UAVs, so that the computational complexity of the method can be controlled at the polynomial level, and the ability to perform collision avoidance will be further improved.
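A distance-keyed min-heap is one natural realization of such a pool, since the smallest separation (highest risk) should be resolved first; the structure below is a sketch, not the paper's implementation.

```python
import heapq

def build_resolution_pool(uav_conflicts, building_conflicts):
    """Build the conflict resolution pool as a min-heap keyed on distance.

    uav_conflicts: list of (ui, uj, distance) UAV-pair conflicts.
    building_conflicts: list of (ul, distance) UAV-building conflicts.
    """
    pool = []
    for ui, uj, d in uav_conflicts:
        heapq.heappush(pool, (d, ("pair", ui, uj)))
    for ul, d in building_conflicts:
        heapq.heappush(pool, (d, ("building", ul)))
    return pool

# Resolution order: repeatedly pop the smallest-distance element, e.g.
#   while pool:
#       d, element = heapq.heappop(pool)
```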

Collaborative Resolution Process for mUAVs
In this section, we propose a working model for this method by combining the concepts of the collision avoidance agent, three-dimensional conflict detection, and the conflict resolution pool.
Assuming that there are n UAVs in the urban airspace, the specific steps are as follows:
Step 1: A pool K is built, consisting of all UAVs in the airspace, which is initialized for each timestamp as K = {u_1, u_2, …, u_n}.
Step 2: The three-dimensional conflict detection method is used to detect UAVs at risk, which are stored in the conflict resolution pool S, and then the priorities of the pool elements are calculated.
Step 3: The UAV pair with the highest priority is selected, as follows: if both UAVs of the pair are in pool K, the reinforcement learning agent is used to assign avoidance actions to them. For UAVs that are not in K, the actions assigned by the agent are ignored and their original actions are kept unchanged. Meanwhile, this UAV pair is removed from pool S.
Step 4: When there are no UAV pairs in the conflict resolution pool S, the two UAVs with the highest priorities that are in conflict with the building are selected and a UAV pair is formed, as follows: The agent is used to assign avoidance actions to them, and the corresponding UAVs are removed from conflict resolution pool S.
Step 5: The UAVs that have been assigned avoidance actions are removed from pool K until S = ∅, and for the UAVs that are still in pool K, their original actions are kept unchanged.
Figure 5 shows the process of collaborative collision avoidance.
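One timestep of Steps 1-5 can be sketched as follows; `detect_conflicts` and `agent_act` are hypothetical stand-ins for the three-dimensional conflict detection method and the trained agent, and the UAV-building step is simplified to individual resolution rather than pairing.

```python
def collaborative_resolution(uavs, detect_conflicts, agent_act):
    """One timestep of the collaborative resolution process (sketch).

    uavs: dict of id -> state. detect_conflicts(uavs) returns a list of
    UAV-pair conflicts (id_i, id_j, distance) and a list of UAV-building
    conflicts (id, distance); agent_act(id) returns an avoidance action.
    """
    K = set(uavs)                              # Step 1: pool of all UAVs
    pairs, buildings = detect_conflicts(uavs)  # Step 2: detection + priorities
    actions = {}
    # Step 3: resolve UAV-UAV conflicts by ascending distance (priority)
    for ui, uj, _ in sorted(pairs, key=lambda c: c[2]):
        for u in (ui, uj):
            if u in K:                         # ignore UAVs already handled
                actions[u] = agent_act(u)
                K.discard(u)                   # Step 5: remove from pool K
    # Step 4: then resolve UAV-building conflicts by priority
    for ul, _ in sorted(buildings, key=lambda c: c[1]):
        if ul in K:
            actions[ul] = agent_act(ul)
            K.discard(ul)
    # UAVs still in K keep their original actions
    return actions, K
```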

Deep Deterministic Policy Gradient
Faced with a continuous state space and action space, in this paper, the DDPG algorithm is taken to train the collision avoidance agent for the UAV pair.The DDPG algorithm utilizes four neural networks in the actor-critic framework: policy network (π(s; θ)), Q network (Q(s, a; ω)), target policy network (π(s; θ − )), and target Q network (Q(s, a; ω − )).
The actor calculates the optimal action for the current state based on the learned policy function (π(s; θ)). The critic estimates the value function (Q(s, a; ω)) given the state and the action, which provides the expected accumulated future reward for this state-action pair. In addition, the critic is responsible for calculating the loss function (i.e., the TD error) that is used in the learning process for both the policy network and the Q-network. To update the critic network, similar to Q-learning, the Bellman equation [28] is used to form the target:

y_i = r_i + γ Q(s_{i+1}, π(s_{i+1}; θ−); ω−)

Then, the loss function is defined, and the parameter ω is updated to minimize the loss between the original Q and the target:

L(ω) = (1/N) Σ_i (y_i − Q(s_i, a_i; ω))²

The actor utilizes the policy network (π(s; θ)) to select the best action, which maximizes the value function. The objective function in updating the actor is to maximize the expected return J(θ) = E[Q(s, π(s; θ); ω)], whose gradient with respect to the actor parameters, obtained by the chain rule and averaged over a mini-batch, is

∇_θ J ≈ (1/N) Σ_i ∇_a Q(s_i, a; ω)|_{a=π(s_i; θ)} ∇_θ π(s_i; θ)

The target network is a network used in the training phase. This network mirrors the original network being trained, and it provides the target values used to compute the loss function. In the DDPG algorithm, the target networks are modified using a soft update:

θ− ← τθ + (1 − τ)θ−,  ω− ← τω + (1 − τ)ω−

This means that the target weights are constrained to change slowly. The use of target networks with soft updates gives consistent targets during the temporal-difference backups and keeps the learning process stable.
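The two numerical ingredients of the update, the Bellman (TD) target used by the critic and the soft update of the target-network weights, can be illustrated with a minimal numpy sketch. The values of γ and τ below are assumed for illustration, not taken from the paper.

```python
import numpy as np

# Minimal sketch of the DDPG critic target and the target-network soft update.
# gamma and tau are assumed hyperparameters, not values from the paper.

def td_target(r, q_next, gamma=0.99, done=False):
    # y = r + gamma * Q(s', pi(s'; theta-); omega-); no bootstrap at terminal states
    return r + (0.0 if done else gamma * q_next)

def soft_update(target_w, online_w, tau=0.005):
    # theta- <- tau * theta + (1 - tau) * theta-, applied per weight array
    return [tau * w + (1.0 - tau) * t for t, w in zip(target_w, online_w)]

y = td_target(r=1.0, q_next=2.0, gamma=0.9)                # 1.0 + 0.9 * 2.0
new_w = soft_update([np.zeros(3)], [np.ones(3)], tau=0.1)  # moves 10% toward online
```

The small τ is what makes the target values drift slowly, which is the stabilizing property discussed above.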

An Improved Measure for DDPG
The DDPG algorithm has high execution efficiency, enabling continuous motion control of the agent. However, in the specific environment described in this paper, DDPG consumes too much time in agent training, making it difficult to respond quickly when the urban environment undergoes significant changes and the agent needs to be retrained. To address this problem, this section improves the algorithm mainly through the dynamic adjustment of the destination area.
Relative to the large spatial area of the city, the UAV destination is small and is reduced to a single point in Equation (9), where U^i_des ∈ D^i_u indicates that the UAV has reached its destination and the agent receives the corresponding reward. However, in the actual training process, this condition is difficult to satisfy while the UAV is exploring, so the agent rarely obtains a relatively large reward value, which slows convergence.
In this section, the way destination rewards are obtained in the algorithm is improved based on the Wright learning curve model, and a dynamic adjustment mechanism for the destination area is proposed. During the early stage of training, the destination area is expanded so that the agent can complete the task relatively easily and learn a primary strategy. Following the learning curve, the destination area is then gradually reduced and the agent learns progressively more difficult strategies, which improves the stability of learning and accelerates the convergence of the algorithm. The destination range is defined as a sphere in Equation (22), where D^i_des represents the destination area and D^i_u ∩ D^i_des ≠ ∅ means the UAV has reached its destination. The area radius is adjusted with the training episode according to the Wright learning curve model in Equation (23), where α is the learning rate, C is the attenuation coefficient, R is the initial destination area radius, and x indicates the training episode. The dynamic adjustment mechanism of the destination area further optimizes the reward toward being "dense" and allows the algorithm to learn useful experience in the early stages.
The original DDPG algorithm needs more training epochs to find the exact destination location due to the small, fixed destination area, and sometimes even fails to obtain the destination reward. The improved DDPG algorithm adds a dynamic adjustment mechanism for the destination area, which enlarges the destination area in the initial stage of training so that the agent can easily find the approximate destination location, ensuring that it moves in the right direction in subsequent training. As training progresses, the algorithm gradually shrinks the destination area based on the Wright learning curve, guiding the agent to the precise destination location. Compared with the original algorithm, the improved DDPG algorithm is more goal oriented and avoids ineffective exploration on the part of the agent, so it accelerates convergence and saves training resources.
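The radius schedule can be sketched as follows. The exact form of Equation (23) is not reproduced above, so this assumes a Wright-style power-law decay r(x) = R · C · x^(log₂ α), floored at a minimum radius so the target never vanishes; R, α, C, and r_min are placeholder values, not the paper's settings.

```python
import math

# Hypothetical sketch of the dynamic destination-radius schedule,
# assuming a Wright-style power-law decay. All constants are placeholders.

def destination_radius(x, R=20.0, alpha=0.8, C=1.0, r_min=1.0):
    # alpha < 1 makes the exponent negative, so the radius shrinks with episode x
    return max(r_min, R * C * x ** math.log2(alpha))

radii = [destination_radius(x) for x in (1, 10, 100, 1000)]
# The radius starts at R and decreases monotonically as training progresses.
```

Early episodes therefore see a generous destination sphere, while later episodes must hit a target close to the true destination point.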

Environment Setting and Hyperparameters
To analyze the performance of the mUAV collision avoidance method, a DJI Matrice 600 is selected as a case study; its form factor (L × W × H) is 1668 mm × 1668 mm × 759 mm, and the calculated ellipsoidal collision zone parameters are 1445 mm × 1445 mm × 657 mm. The scenario range is set to 1000 m × 1000 m × 50 m, considering that the collision avoidance area of small UAVs in the city will not be too large.
In the experimental scenario, we construct the spatial layout of buildings in the city and set up fixed-volume obstacles at fixed locations; the building collision zone is cylindrical, and the shape parameters are shown in Table 2. The experiment is based on eight UAVs, each with an initial speed and initial heading. At the beginning of training, a random origin and terminal are generated for each UAV; the agent assigns avoidance actions to the UAVs according to the action space in Equation (5), and a UAV returns to its origin to restart training if it has a collision accident. The experimental parameters are shown in Table 3, and the environment is shown in Figure 6.
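A conflict with a building reduces to a membership test against its cylindrical collision zone. The sketch below is a minimal illustration of that test; the center, radius, and height arguments are illustrative, not the actual shape parameters of Table 2.

```python
import math

# Minimal membership test for a cylindrical building collision zone.
# center_xy, radius, and height are illustrative values, not Table 2's.

def in_cylinder(point, center_xy, radius, height):
    x, y, z = point
    cx, cy = center_xy
    within_footprint = math.hypot(x - cx, y - cy) <= radius  # horizontal check
    within_height = 0.0 <= z <= height                       # vertical check
    return within_footprint and within_height

inside = in_cylinder((1.0, 0.0, 5.0), (0.0, 0.0), radius=2.0, height=10.0)
outside = in_cylinder((5.0, 0.0, 5.0), (0.0, 0.0), radius=2.0, height=10.0)
```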

Collision Avoidance Agent Training
For the various conflicts arising between UAV pairs, the trained agent can provide appropriate solutions. During training, the average reward obtained per episode is an important indicator of convergence and collision avoidance performance. Using the improved DDPG algorithm to train the agent, the rewards for each episode are shown in Figure 7. As Figure 7 shows, the reward obtained by the agent is not stable at the beginning of training, as the agent encounters heavily punished events during exploration, resulting in large drops in reward. With continued training, the agent gradually learns high-reward behavior, and the reward value increases. In the second half of training, the reward did not fall significantly again, which indicates that the improved algorithm learned a better and more stable strategy, so the reward oscillated less.
The comparison of the improved DDPG algorithm with the original algorithm is shown in Figure 8. It can be seen that the improved algorithm converges markedly faster, achieving a higher reward value and showing a convergence trend at around 380 training epochs, while the original algorithm only showed such an effect after around 1000 epochs. After the 2000th training epoch, the rewards obtained by the two algorithms did not differ much. However, in actual training, the improved DDPG algorithm obtained stable reward values and established the convergence trend at earlier epochs, so training can be ended earlier than with the original algorithm, which saves training time.

Numerical Results Analysis

Collision Avoidance Results
Using the two-layer resolution framework, we obtained the collision avoidance results, as shown in Figure 9, while recording the distance between each UAV and the nearest obstacle, as well as the distance between the two nearest UAVs. From a two-dimensional perspective, it is intuitively apparent that every UAV is in conflict with at least one obstacle and avoids obstacles with as little extra flight distance as possible. In addition, there is a risk of conflict between UAV5 and UAV6, UAV7 and UAV8, and UAV1 and UAV7, so the agent randomly selects one of the UAVs to perform the primary avoidance maneuver, while the other maintains almost its original direction (or makes minor adjustments), in order to minimize the impact of the avoidance behavior on normal navigation. In the three-dimensional view, all the UAVs have reached the intended altitude.
The distance between each UAV and the nearest obstacle first decreases and then increases, indicating that the UAVs are taking avoidance actions. The closest UAV pairs may differ at different times, but the overall trend of distance variation is consistent, showing that the UAVs can also avoid each other. Throughout the whole process, the minimum distance from buildings is about 7 m and the minimum distance from other UAVs is about 4 m, meeting the standard safety interval and proving that the model in this paper can ensure the safe operation of UAVs in cities with many buildings.

Avoidance Strategy Analysis
In the two-layer resolution framework, three strategies are used for collision avoidance and destination guidance. To analyze the agent's avoidance action selection pattern, the actions (heading angle, altitude, and speed changes) selected by all UAVs at each step are recorded, as shown in Figures 10-12.
As shown in Figure 10, each approach between a UAV and an obstacle leads to a significant change in the heading angle, and when the distance is kept at a relatively safe level, the change in heading angle fluctuates around 0°, indicating that the UAV is flying along a straight line in the horizontal direction. It can be concluded that the agent avoids collision with obstacles mainly by changing the heading angle of the UAV.
As shown in Figures 11 and 12, climbing and descending actions ensure that the height of the UAV is finally consistent with the destination height, demonstrating that the agent has guiding ability in three-dimensional space. The speed change is generally stable within a fixed range, and there is no excessive speed, as the speed adjustment is coupled with the heading angle to avoid collision with obstacles. In addition, because there are fewer obstacles near the destination, the UAV has a higher speed and a stable heading angle in the later stage, ultimately reaching the destination.
To fully illustrate the collision avoidance effect of the method, 300 different scenarios were designed in the simulation area by randomly generating the starting and ending points of eight UAVs. The considered performance metrics were the following: (1) collision avoidance success rate (SR), the percentage of conflicts resolved successfully; (2) computational efficiency (CE), the time required for the agent to calculate an action; (3) extra flight distance (ED), the extra distance the UAV flew due to collision avoidance.
(1) Collision avoidance success rate
In the 300 random scenarios, 4523 conflicts were recorded in the conflict resolution pool, including 867 conflicts with other UAVs and 3656 conflicts with buildings. If the agent cannot assign the correct avoidance action to a UAV, a collision will occur according to the current speed trend. The collision avoidance method described in this paper can guide UAVs out of collision risk, with the success rates shown in Table 4. The data in Table 4 show that the method has a higher success rate when resolving conflicts with fixed obstacles than with dynamic obstacles, because the invading UAV has positional uncertainty, its flight state may not be fully perceived by the current UAV, and avoidance actions may not be taken in time. However, the overall success rate reached 95.03%, indicating that the method is able to guide UAVs away from most collision risks and can provide an adequate and reliable reference for urban air traffic management.
(2) Computational efficiency
The total computation time and the number of avoidance actions performed by the UAVs in each scenario are recorded to calculate the average time the method takes to plan one action for a UAV, which is used as the evaluation index for computational efficiency, as shown in Equation (24):

CE_j = T^j_total / Σ_i num^j_i

where T^j_total is the total time required to calculate the avoidance actions in scenario j, and num^j_i is the number of avoidance actions performed by UAV i. The average time required for each scenario is shown in Figure 13. From the figure, the avoidance action calculation time across the 300 scenarios is on the order of 0.01-0.1 s, with an average of 0.0963 s, which meets the real-time requirements of collision avoidance.
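The metric can be computed in one line. This sketch reads Equation (24) as the total planning time of a scenario divided by the total number of avoidance actions; the input numbers below are illustrative, not measurements from the paper.

```python
# Sketch of the per-scenario computational-efficiency metric,
# CE_j = T_total^j / sum_i num_i^j. Inputs are illustrative.

def computational_efficiency(total_time_s, actions_per_uav):
    # total_time_s: total planning time of the scenario, in seconds
    # actions_per_uav: number of avoidance actions each UAV performed
    return total_time_s / sum(actions_per_uav)

ce = computational_efficiency(total_time_s=4.8, actions_per_uav=[10, 15, 25])
# 4.8 s spread over 50 actions -> about 0.096 s per action
```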
(3) Extra flight distance
When facing obstacles, the agent guides the UAV to change its heading or speed, which adds extra flight distance (ED) compared with the original trajectory. The ED metric is used to measure the impact of collision avoidance on the UAVs, as shown in Equation (25):

d_e = (1/n) Σ_{i=1}^{n} (d_ni − d_oi)

In Equation (25), d_e is the extra flight distance (ED), d_oi is the length of the i-th trajectory flown directly to the destination regardless of any conflicts, d_ni is the length of the i-th trajectory with collision avoidance taken into account, and n is the total number of trajectories in all scenarios.
From the perspective of flight efficiency and green transportation, the shorter the ED, the less flight energy is lost and the less impact there is on the original flight [29]. The average extra flight distance of the eight aircraft over the 300 scenarios is 26.8 m, which is relatively good and acceptable in terms of a collaborative resolution process for mUAVs.
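The ED metric can likewise be sketched directly from its definition: the mean over all trajectories of the avoidance path length minus the direct-to-destination length. The distances below are illustrative.

```python
# Sketch of the extra-flight-distance metric: mean of d_ni - d_oi over
# all n trajectories. The distances below are illustrative.

def extra_flight_distance(direct, avoided):
    # direct[i] = d_oi, avoided[i] = d_ni, one entry per trajectory
    return sum(dn - do for do, dn in zip(direct, avoided)) / len(direct)

ed = extra_flight_distance(direct=[100.0, 200.0], avoided=[120.0, 230.0])
# mean of the 20 m and 30 m detours -> 25.0 m
```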

Impact of Noisy States
UAVs may have positional uncertainty due to interference from random factors such as crosswinds, which can affect collision avoidance behavior. To investigate the robustness of the proposed method, we performed simulations with noisy states, adding noise to the state information of each UAV. It was assumed that each noise component was uniformly distributed, i.e., m = (m_x, m_y, m_z, m_v), m ∼ U(−ε, ε). The noise was added to the state information of the UAV, i.e., P̃^i_t = (x^i_t + m_x, y^i_t + m_y, z^i_t + m_z) and Ṽ^i_t = V^i_t + m_v, which changes the state of the agent in Equation (4).
Table 5 shows the SR and ED performance with noisy states. The table shows that noise influences the SR and ED performance: the UAV does not have accurate position and velocity information, leading to biased observations by the agent, which may output incorrect avoidance actions. However, even when noise with ε = 3 is added to both the position and velocity information, the SR is still at least 91.61% and the ED is 36.2 m, so a good level is maintained. Therefore, our method has a high tolerance to noisy observations.
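The perturbation model above is straightforward to implement: each component of the position/velocity observation receives independent uniform noise on [−ε, ε] before the agent sees it. The state tuple layout (x, y, z, v) below is an assumption for illustration.

```python
import random

# Sketch of the noisy-state perturbation: uniform noise on [-eps, eps]
# added per component. The (x, y, z, v) layout is an assumption.

def add_observation_noise(state, eps, rng=random):
    return tuple(s + rng.uniform(-eps, eps) for s in state)

clean = (10.0, 20.0, 5.0, 8.0)
noisy = add_observation_noise(clean, eps=3.0)
# every noisy component stays within eps of the clean one
```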

Different Numbers of UAVs
Table 6 presents the SR, CE, and ED performance in scenarios with different numbers of UAVs, J ∈ {2, 4, 8, 10, 12, 20}. The rates are averaged over 300 random realizations (all UAVs having random starting points and destinations). From the table, it can be noted that as the number of UAVs increases, the SR decreases due to the higher risk of collision and the ED shows an overall upward trend, while the CE is not affected by the number of UAVs.

Comparison with Other Algorithms
In the same scenario, two other algorithms are selected for comparison: the double deep Q-network (DDQN) and the artificial potential field (APF).
DDQN is a typical value-based DRL algorithm [18,19], while APF utilizes the repulsive force of obstacles and the attractive force of the target to guide UAV motion, and is widely used in research on collision avoidance [30,31]. DDQN requires the agent action space to be discrete, which is set to ∆ϕ ∈ {−3°, 0, 3°}, ∆Z ∈ {−1 m, 0 m, 1 m}, ∆V ∈ {−2 m/s, 0 m/s, 2 m/s}. The APF action space is consistent with our method. Experiments using DDQN were conducted based on the two-layer resolution framework, and the APF was divided into two categories: APF with the two-layer framework and APF without it. Table 7 presents the performance of the different algorithms. The improved DDPG algorithm has an absolute advantage in terms of the SR, which at 95.03% is much greater than that of the other algorithms. There was not much difference in CE, and only APF was higher than the other algorithms. Our method achieved the minimum ED, while DDQN had the largest value; because of the constraints of the discrete action space, the DDQN-trained agent can only choose among a limited set of action values, so the flexibility of the UAV cannot be fully utilized. In addition, the performance of the APF with the two-layer framework was superior to that of the original APF algorithm, indicating that the two-layer resolution framework can better avoid conflict risks when presented with mUAVs.
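The discrete action set that DDQN requires can be enumerated as the Cartesian product of the three per-step increments quoted above, which makes the contrast with DDPG's continuous action space concrete.

```python
from itertools import product

# Enumerate the 3 x 3 x 3 discrete action grid used by the DDQN baseline,
# versus the continuous action space available to DDPG.

HEADING_DEG = (-3.0, 0.0, 3.0)   # delta-phi, degrees
ALTITUDE_M = (-1.0, 0.0, 1.0)    # delta-Z, metres
SPEED_MPS = (-2.0, 0.0, 2.0)     # delta-V, m/s

ACTIONS = list(product(HEADING_DEG, ALTITUDE_M, SPEED_MPS))
# 27 discrete (heading, altitude, speed) combinations in total
```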
It is worth noting that, in training sessions, we observed that the improved DDPG also completed convergence faster than the DDQN; together with the comparison against the original DDPG in Section 4.2, this further indicates that the dynamic adjustment mechanism can reduce the number of training episodes.

Conclusions
In this paper, an adaptive method for mUAV collision avoidance in urban air traffic was studied. The main conclusions are as follows: Firstly, the proposed two-layer resolution framework provides a new concept for realizing mUAV collision avoidance, in which each UAV is endowed with decision-making ability and the computational complexity is controlled at the polynomial level. Using the improved DDPG algorithm to train the agent allows convergence to be completed faster, which greatly reduces training costs.
Secondly, the numerical results indicate that the proposed method is able to adapt to various scenarios, e.g., different numbers and positions of UAVs, and interference from random factors. More specifically, the average decision time of the method is 0.0963 s with eight UAVs, the overall resolution success rate is 95.03%, and the extra flight distance is 26.8 m. Our method performs better than APF, APF with the two-layer framework, and DDQN.
Thirdly, from the perspective of the avoidance process, changing the heading angle is the main means of avoiding collision; the minimum distance from buildings is about 7 m and the minimum distance from other UAVs is about 4 m, which further shows that the method is relatively sensitive to static obstacles. Our future research will focus on how to determine the appropriate safety interval and how to reflect it in the resolution process.
Despite the strengths of our proposed approach, there are some drawbacks that require further study. In this paper, distance and velocity vectors were calculated for conflict detection, which lacks an objective quantification of conflict risk and may affect the subsequent collision avoidance. A quantitative assessment of UAV conflict risk based on multiple factors would allow a more accurate determination of conflict targets and resolution strategies, which would be a valuable research direction in the future. Another valuable direction would be to combine kinematics and control theory: assigning appropriate resolution strategies for UAV collision avoidance at the level of urban air traffic management, while designing UAV controllers from the perspective of control performance, thus ensuring that UAVs can successfully complete avoidance actions by formulating suitable control parameters and suppressing the influence of external disturbances [32,33]. This would promote the engineering application of the method described in this paper.

Figure 1. Overall workflow of the two-layer resolution framework.


Figure 2. The diagram of the agent state vector. (a) Flight state vector. (b) Obstacle vectors.


Figure 3. The interaction mode between the agent and the environment.

Figure 4. The diagram of three-dimensional conflict detection.


Figure 7. The rewards for each episode during training.

Figure 8. Comparison of the convergence process.


Figure 10. The heading angle change trend and distance to obstacles of UAV.

Figure 11. The altitude change trend of UAV.

Figure 12. The speed change trend of UAVs.


Figure 13. The average time required for allocating actions in each scenario.


Figure 14. Illustrations of collision avoidance in different scenarios with J ∈ {2, 8, 20}.

Table 1. Related work on conflict resolution based on DRL.

a_u, b_u, h_u are the semi-axes of the ellipsoid, and R_o, h_o are the radius and height of the cylinder. A collaborative resolution model is proposed in which mUAV conflicts are transformed into the form of UAV pairs, and the agent is then used to deconflict them one by one. Each segment of the trajectory can be intercepted to form a set of training data for the subsequent training of the Target-network and Evaluate-network in the algorithm.

Table 2. The shape parameters of the building.

Table 4. Success rates of collision avoidance.

Table 5. SR and ED performance when states are noisy.


Table 7. SR, CE, and ED performance of the different algorithms with J = 8.