Multi-UAV Conflict Resolution with Graph Convolutional Reinforcement Learning

Safety is the primary concern in air traffic. In-flight safety between Unmanned Aerial Vehicles (UAVs) is ensured through pairwise separation minima, using conflict detection and resolution methods. Existing methods mainly deal with pairwise conflicts; however, due to an expected increase in traffic density, encounters involving more than two UAVs are likely to happen. In this paper, we model multi-UAV conflict resolution as a multi-agent reinforcement learning problem. We implement an algorithm based on graph neural networks in which cooperative agents can communicate to jointly generate resolution maneuvers. The model is evaluated in scenarios with 3 and 4 agents. Results show that the agents are able to successfully solve the multi-UAV conflicts through a cooperative strategy.


Introduction
Commercial and civil unmanned aircraft systems (UAS) applications are projected to see significant growth in the global market. According to SESAR, the European drone market will exceed €10 billion annually by 2035 and €15 billion annually by 2050 [SESAR JU, 2016]. Furthermore, considering the characteristics of the missions and application fields, most of the market value is expected to be in operations of small UAS (sUAS) in very-low-level (VLL) airspace. Such growth will be accompanied by an increase in traffic density and new challenges related to safety, reliability, and efficiency. Therefore, the development and implementation of conflict management systems are considered a precondition for integrating UAS into civil airspace. Most notably, the National Aeronautics and Space Administration (NASA) in the USA aims to create a UAS Traffic Management (UTM) system that will make it possible for many UAS to fly at low altitudes along with other airspace users [Barrado et al., 2020]. Europe is leading efforts to develop an equivalent UTM concept, referred to as U-space. It will provide a set of services (and micro-services) to accommodate current and future traffic, mainly but not exclusively in VLL airspace [Prevot et al., 2016]. Similar approaches are also followed in China and Japan [Zhang, 2018].
Considering airspace under UTM services, UAS must be capable of avoiding static conflicts such as buildings, terrain, and no-fly zones, and dynamic conflicts such as manned or unmanned aircraft. Here, a pairwise conflict is defined as a violation of the en-route separation minima between two UAVs [ica]. To ensure conflict-free operations, UTM provides Conflict Detection and Resolution (CD&R) services, which comprise three layers of safety depending on the time horizon (i.e., look-ahead time) [nas]: Strategic Conflict Mitigation, Tactical Conflict Mitigation, and Collision Avoidance (CA) [ica][nas]. In this work, we focus on tactical conflict resolution (CR) applicable to small UAS missions. This function is typically treated in two ways: Self-separation and Collision Avoidance [nas] [Radanovic et al., 2019]. The former is a maneuver executed seconds before the loss of separation minima, characterized by a slight deviation from the initial flight plan, and aims to prevent CA activation. The latter provides a last-resort safety layer characterized by imminent and sharp escape maneuvers. Both functions are encompassed within what is widely recognized as the Detect and Avoid capability [Consiglio et al., 2016, Johnson et al., 2017a].
In line with the state of the art, a loss of separation minima is referred to as loss of Well Clear (LoWC). While there is no standard definition of well clear (WC), two related functions are associated with this state: Remain Well Clear (RWC) and Collision Avoidance (CA) [Manfredi et al., 2017]. In terms of tactical CD&R, RWC is equivalent to the self-separation function. The definition and computation of RWC thresholds is an open research question, but they are mainly viewed as a protection volume around the UAS [Cook et al., 2017, Consiglio et al., 2019, Muñoz et al., 2016]. This volume can be specified by spatial thresholds, temporal thresholds, or both at the same time. We follow the hockey-puck model [Weinert et al., 2018, McLain and Duffield, 2017], characterized by distance-based thresholds. In addition, the near mid-air collision (NMAC) represents the last safety volume. As the name suggests, a distance smaller than NMAC represents a very severe loss of well clear that could result in a collision in the worst case. This distance is usually defined based on the dimensions of the UAS and its navigation performance [Modi et al., 2016].
Many existing works propose conflict resolution algorithms (see Section 2 for a more detailed overview). However, the majority of these works focus on pairwise conflicts. With the expected increase in traffic density [SESAR JU, 2016], multi-UAV conflicts (i.e., involving more than 2 UAVs) are expected to occur. In this paper, multi-UAV conflict resolution is modeled as a multi-agent reinforcement learning (MARL) problem. More specifically, we utilize graph convolutional reinforcement learning [Jiang et al., 2018], where air traffic is modeled as a graph. The UAVs present form the set of nodes, and single pairwise conflicts form the set of edges in the graph. The model used in this paper provides a communication mechanism between connected nodes in the graph. Such a mechanism facilitates learning and allows the agents to develop cooperative strategies. Multi-UAV conflicts are formally defined as compound conflicts, where multiple pairwise conflicts have tight spatial and temporal boundaries. In this work, we first train a model in scenarios with three UAVs. After that, the same model is retrained to solve compound conflicts with four UAVs. This technique allows us to re-use the previously learned policies and refine them on a new set of scenarios, while the new agent is trained from scratch. Results show that the agents are able to successfully solve compound conflicts in both cases.
The rest of the paper is organized as follows: some existing works are discussed in Section 2. Section 3 describes the theoretical background necessary for this paper. In Section 4, the experimental setup is presented. Results are presented and discussed in Section 5, while in Section 6, we draw conclusions and propose steps for further research.

Related Work
There are many essential contributions in the area of conflict resolution methods in aviation. These methods are widely classified into geometric, force field, optimized trajectory, and Markov Decision Process (MDP) (probabilistic) approaches [Skowron et al., 2019]. For detailed and comprehensive information on CD&R practices, we suggest Kuchar and Yang's review study [Kuchar and Yang, 2000] and, for more up-to-date content, the review in [Ribeiro et al., 2020a]. We focus only on the MDP method and provide a summary discussion below, as our work aligns with this group of methods. Aircraft operations, and especially UAS operations, are characterized by uncertain environments and stochastic events such as weather, multiple intruders, and Communication, Navigation, and Surveillance (CNS) failures; therefore, decision-making methods that adapt under such conditions are necessary. MDPs and the more recent Partially Observable MDPs (POMDPs) are methods that can perform well in such domains. Different techniques are used to solve MDP and/or POMDP problems, the most notable being reinforcement learning (RL) and deep reinforcement learning (DRL) methods. In [Bertram and Wei, 2020], the authors present an efficient MDP-based algorithm that provides self-separation functions for UAS in free airspace. A similar approach is followed in [Yang and Wei, 2020], where the authors give a scalable multi-agent computational guidance for separation assurance in Urban Air Mobility. In addition, they use RL techniques to solve the MDP problems. In a previous work [Hu et al., 2020], a conflict resolution system is applied to mitigate conflicts between UAS. Ribeiro et al. [Ribeiro et al., 2020b] consider a single-agent approach to conflict resolution through RL for unmanned aerial vehicles (UAVs).
Furthermore, recent works have adopted DRL methods, which behave better in multi-agent environments and consider uncertainties. In [Isufaj et al., 2021], the authors model pairwise conflict resolution as a multi-agent reinforcement learning (MARL) problem. They use Multi-Agent Deep Deterministic Policy Gradient (MADDPG) [Lowe et al., 2017] to train two agents, representing each aircraft in a conflict pair, capable of efficiently solving conflicts in the presence of surrounding traffic by considering heading and speed changes. In [Pham et al., 2019], the authors use the Deep Deterministic Policy Gradient (DDPG) technique to mitigate conflicts in high-density scenarios under uncertainty. Brittain et al. [Brittain et al., 2020] used a deep multi-agent reinforcement learning framework to ensure autonomous separation between aircraft. Dalmau et al. [Dalmau and Allard] used Message Passing Neural Networks (MPNN) to model air traffic control as a multi-agent reinforcement learning system where agents must ensure conflict-free flight through a sector.
While these papers consider a multi-aircraft (manned or unmanned) setting, they do not specifically consider small UAS performance capabilities (e.g., high yaw rate). Also, a common assumption is that the flight trajectories should be within a predefined airspace sector. In a UTM environment, airspace is not necessarily segregated into sectors. Additionally, small UAS characteristics can directly affect how the action space is modelled. Moreover, approaches with a multi-UAV setting do not consider the effects of cooperation on the resolution manoeuvres. In this work, we propose a multi-UAV conflict resolution method suitable for sUAS operations and attempt to achieve cooperation between the agents.
It is worth noting that these methods (RL and DRL) are considered very important for the development of the Airborne Collision Avoidance System (ACAS-X), which will be extended into ACAS-Xu, ACAS-sXu, and so on, to accommodate all airspace users [Manfredi and Jestin, 2016].
3 Theoretical Background

Reinforcement Learning
Reinforcement Learning (RL) is a paradigm of machine learning that deals with sequential decision making [Sutton et al., 1998]. A given RL problem is formalized by a Markov Decision Process (MDP), which is a discrete-time stochastic control process [Bellman, 1958] that consists of a 4-tuple (S, A, T, R), where:
• S is the state space,
• A is the action space,
• T : S × A × S → [0, 1] is the transition function, which is a set of conditional probabilities between states,
• R : S × A × S → ℝ is the reward function.
In RL, an agent makes decisions in an environment to maximize a certain notion of cumulative reward G, defined as follows:
$$G = \sum_{t=0}^{\infty} \gamma^{t} r_{t} \tag{1}$$
where γ is a discount factor between 0 and 1. Its task is to inform the agent how relevant immediate rewards are in relation to rewards further in the future. The higher γ is, the more the agent will care about future consequences.
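As a small illustration of the discounted return, the following Python sketch computes G for a finite episode. The function name and the sample rewards are illustrative assumptions and are not taken from the experiments in this paper.

```python
def discounted_return(rewards, gamma):
    """Compute G = sum_t gamma^t * r_t for a finite list of rewards."""
    g = 0.0
    # Accumulate backwards so that each step applies one more discount factor.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical episode with three negative-or-zero rewards.
example_rewards = [-1.0, -1.0, 0.0]
print(discounted_return(example_rewards, gamma=0.9))
```

With γ close to 1 the later rewards contribute almost as much as the first one, while with γ = 0 only the immediate reward counts.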
The agent improves incrementally by modifying its behaviour according to previous experience. The agent does not strictly require complete information or knowledge of the environment; it only needs to interact with it and gather information [François-Lavet et al., 2018].
The RL agent starts at an initial state $s_0 \in S$ and at each time step t must take an action $a_t \in A$. Then, the agent gets a reward $r_t \in \mathbb{R}$ from the environment. The state then transitions to $s_{t+1} \in S$, which is dictated by the action taken and the dynamics of the environment. Finally, the agent stops interacting with the environment when it reaches a defined goal state.
The agent's behavior is encoded into a policy π, which can be deterministic, π : S → A, or stochastic, π : S × A → [0, 1]. Two functions are used to predict the total future discounted reward: the value function $V^{\pi}$ and the action-value function $Q^{\pi}$, defined as follows:
$$V^{\pi}(s) = \mathbb{E}\left[ G \mid s_0 = s, \pi \right] \tag{2}$$
$$Q^{\pi}(s, a) = \mathbb{E}\left[ G \mid s_0 = s, a_0 = a, \pi \right] \tag{3}$$
The value function represents the expected future reward in the current state if the policy π is followed, while the action-value function represents the expected reward for state-action pairs when following policy π. Ultimately, the goal of RL algorithms is to estimate either of these functions.
Q-learning is one of the most prominent algorithms for solving RL problems. There, an agent must learn to estimate the optimal action-value function in the form of a table with as many state-action pair entries as possible [Dalmau and Allard]. However, in cases where the state space or action space (or both) are continuous, there are infinitely many state-action pairs, which makes it infeasible to store the values in a table. In those cases, a function is used to approximate the Q function. Such a function, with parameters µ, is optimized through an objective function based on the Bellman equation [Bellman, 1958].
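For the tabular case, the Bellman-based update can be sketched in a few lines of Python. This is the standard textbook Q-learning rule, not code from this paper; the learning rate `alpha`, the state labels and the action names are illustrative assumptions.

```python
from collections import defaultdict


def q_learning_update(q_table, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning step and return the updated Q(s, a)."""
    # Bootstrapped target: immediate reward plus discounted best next value.
    best_next = max(q_table[(s_next, a2)] for a2 in actions)
    td_target = r + gamma * best_next
    # Move the current estimate a fraction alpha towards the target.
    q_table[(s, a)] += alpha * (td_target - q_table[(s, a)])
    return q_table[(s, a)]


# Hypothetical usage: unseen state-action pairs default to a value of 0.
q = defaultdict(float)
q_learning_update(q, "s0", "left", -1.0, "s1", ["left", "right"], alpha=0.5)
```

The table grows with every new state-action pair that is visited, which is exactly what becomes infeasible in continuous spaces.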
In the case of Deep Q-Networks (DQN) [Mnih et al., 2013], the Q function approximators are neural networks. However, several issues arise when applying deep learning directly to an RL problem. First, in RL rewards can be sparse or delayed, which hinders neural networks, as they rely on directly gained feedback. Additionally, the data obtained from an RL problem is highly correlated. Lastly, the data distribution changes as the policy does, making it non-stationary, which further impairs the learning capabilities of neural networks. In order to overcome these issues, several modifications must be made. Experience replay is used to mitigate the issue of sample autocorrelation [Mnih et al., 2013]. In this technique, the agent's experience is stored at each time step in a replay buffer. The memory is sampled randomly and is used to update the networks. When the replay buffer becomes full, the simplest solution is to discard the oldest samples. The non-stationarity of the data makes the training unstable, which can lead to undesired phenomena such as catastrophic forgetting, where the agent suddenly "forgets" how to solve the task after apparently having learned a suitable policy. Such an issue can be mitigated using a target network, which is a network identical to the one used to learn the Q function that is held constant for a fixed number of time steps to serve as a stable target for learning.
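The experience replay mechanism described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the class name and the transition format are assumptions and do not reflect the implementation used in this paper.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity buffer; the oldest samples are discarded when it is full."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest item on overflow.
        self.memory = deque(maxlen=capacity)

    def push(self, transition):
        """Store one transition, e.g. (state, action, reward, next_state)."""
        self.memory.append(transition)

    def sample(self, batch_size, rng=random):
        """Draw a uniform random minibatch to reduce sample autocorrelation."""
        return rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


# Hypothetical usage: with capacity 3, only the 3 newest items survive.
buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.push(step)
```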

Multi-Agent Reinforcement Learning
Multi-Agent Reinforcement Learning (MARL) is an extension of classical RL where there is more than one agent in the environment. This is formalized through partially observable Markov games [Littman, 1994], which are decision processes for N agents.
Similarly to MDPs, Markov games have a set of actions. However, in this case, the environment is not fully observable by the agents. Therefore, the Markov game has a set of observations $O_1, \dots, O_N$, one for each agent. As in single-agent RL, in the MARL setting agents take actions according to their policy and obtain rewards. The goal of the agents is to maximize both their personal and the total expected reward.

Graph Convolutional Reinforcement Learning
While deep learning has proven effective in capturing patterns in Euclidean data, there are a number of applications where data are represented as graphs [Wu et al., 2020]. The complexity of graph data poses significant challenges for existing deep learning algorithms. A graph can be irregular and dynamic, as it can have a variable number of nodes and the connections between nodes can change over time. Furthermore, existing deep learning algorithms largely assume the data to be independent, which does not hold for graph data.
Recently, there has been an increasing number of works that extend deep learning approaches to graph data, called Graph Neural Networks (GNNs). Variants include Graph Attention Networks (GATs) [Veličković et al., 2017], Graph Convolutional Networks (GCNs) [Kipf and Welling, 2016] and Message Passing Neural Networks (MPNNs) [Gilmer et al., 2017]. We refer the reader to [Wu et al., 2020] for a comprehensive review of GNNs.
In the case of MARL, communication is often cited as a key ability for cooperative agents [Jiang et al., 2018, Dalmau and Allard]. In such a setting, agents exchange information before taking an action.
In this work, we use Graph Convolutional Reinforcement Learning [Jiang et al., 2018] (dubbed DGN by its authors), which is a GNN algorithm for cooperative agents.
In DGN, the multi-agent environment is modeled as a graph G = (V, E), where V is the set of nodes and E is the set of edges. Each agent is a node, and the local observation of the agent forms the features of the node. Each node i has a set of neighbors $B_i$, where (i, j) ∈ E, ∀j ∈ $B_i$. The set of neighbors is defined according to some criterion that depends on the environment, and it changes over time. In DGN, neighboring nodes can communicate with each other. Such a choice leads to the agents only considering local information when making their decisions. Another option would be to consider all agents in the environment; however, this comes with higher computational complexity.
DGN has three modules: an observation encoder, a convolutional layer and a Q network. The observation of an agent i at time step t, $o_i^t$, is encoded into a feature vector $h_i^t$ by a Multi-Layer Perceptron (MLP). The convolutional layer combines the feature vectors in the local region and generates a latent feature vector $h_i'^{\,t}$. The receptive field of the agents increases by stacking more convolutional layers on top of each other. An important property of the convolutional layer is that it should be invariant to the order of the input feature vectors. Furthermore, such a layer must be effective in learning how to abstract the relation between agents so as to combine the input features.
DGN uses multi-head dot-product attention [Zambaldi et al., 2018], an implementation of attention that runs the attention mechanism several times in parallel, to compute interactions between agents (we refer the reader to [Jiang et al., 2018, Zambaldi et al., 2018] for a detailed overview of the attention mechanism). Let us denote with $B_{+i}$ the set of neighbors $B_i$ together with agent i. The input features of agent i are projected into query Q, key K and value V representations by every attention head. For an attention head m, the relation between i and j ∈ $B_{+i}$ is as follows:
$$\alpha_{ij}^{m} = \frac{\exp\left(\tau \, W_Q^m h_i \cdot (W_K^m h_j)^{\top}\right)}{\sum_{k \in B_{+i}} \exp\left(\tau \, W_Q^m h_i \cdot (W_K^m h_k)^{\top}\right)}$$
where τ is a scaling factor and $W_Q^m$ and $W_K^m$ are the weight matrices of the query and key for attention head m. The value representations of the input features are weighted by the relation and summed together, which is done for each head m. The outputs of all attention heads for an agent i are concatenated and then fed into an MLP σ as follows:
$$h_i' = \sigma\left(\mathrm{concat}_{m}\left[\sum_{j \in B_{+i}} \alpha_{ij}^{m} W_V^m h_j\right]\right)$$
where $W_V^m$ is the weight matrix of the value for attention head m.
The graph representing the agents and the interactions between them is formalized through an adjacency matrix C, where the ith row contains a 1 for each agent in $B_i$ and a 0 for any agent not in the neighborhood of i. The feature vectors are merged into a feature matrix F of size N × L, where N is the number of agents and L is the length of the feature vector. The feature vectors in the local region of agent i are obtained by $C_i \times F$.
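To make the attention computation concrete, the following pure-Python sketch evaluates the relation weights and the weighted sum of values for one agent, and concatenates the outputs of several heads. It is a simplified illustration under stated assumptions: all function names are hypothetical, the weight matrices are plain nested lists, and the final MLP σ is omitted.

```python
import math


def matvec(w, x):
    """Multiply matrix w (list of rows) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def attention_head(h, neighbors_plus_i, i, w_q, w_k, w_v, tau):
    """Return (alphas, output) of one head for agent i over the set B_{+i}."""
    q = matvec(w_q, h[i])
    scores = []
    for j in neighbors_plus_i:
        k = matvec(w_k, h[j])
        scores.append(tau * sum(a * b for a, b in zip(q, k)))
    # Softmax over the neighborhood, shifted by the max for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    # Weighted sum of the value projections of the neighbors.
    values = [matvec(w_v, h[j]) for j in neighbors_plus_i]
    dim = len(values[0])
    output = [sum(a * v[d] for a, v in zip(alphas, values)) for d in range(dim)]
    return alphas, output


def multi_head(h, neighbors_plus_i, i, heads, tau):
    """Concatenate the outputs of all heads; heads is a list of (w_q, w_k, w_v)."""
    out = []
    for w_q, w_k, w_v in heads:
        _, head_out = attention_head(h, neighbors_plus_i, i, w_q, w_k, w_v, tau)
        out.extend(head_out)
    return out
```

Because the output is a weighted sum over the neighborhood, reordering the neighbors does not change the result, which is the order-invariance property required of the convolutional layer.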
The Q network in DGN is a standard deep Q-network, as described in the Reinforcement Learning subsection. However, in DGN, the outputs of the graph convolution layer are concatenated and fed into the network. At each time step, the tuple (O, A, O', R, C) is stored in the replay buffer, where O and O' are the current and next observations, A is the set of actions, R is the set of rewards and C is the adjacency matrix. During training, a random minibatch of size S is sampled from the buffer and the loss is minimized as follows:
$$L(\theta) = \frac{1}{S}\sum_{S}\frac{1}{N}\sum_{i=1}^{N}\left(y_i - Q(O_{i,C}, a_i; \theta)\right)^2$$
where $y_i$ indicates the return and $O_{i,C}$ denotes the observations in the receptive field of agent i as determined by C. Another factor that can impact the training of the Q network is the dynamic nature of the graph, which can change from one time step to the next. To mitigate this, the adjacency matrix C is kept unchanged in two successive time steps when computing the Q values in training. Finally, the target network with parameters θ' is updated from the Q network with parameters θ as follows:
$$\theta' = \beta\theta + (1 - \beta)\theta'$$
where β indicates the importance of the new parameters in the target network.
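The soft update of the target network can be written compactly. The sketch below treats the parameters as flat lists of floats, which is a simplifying assumption made for illustration; the function name is hypothetical.

```python
def soft_update(target_params, online_params, beta):
    """Return the new target parameters: beta * theta + (1 - beta) * theta'."""
    return [
        beta * theta + (1.0 - beta) * theta_target
        for theta_target, theta in zip(target_params, online_params)
    ]


# Hypothetical usage: a small beta moves the target slowly towards the Q network.
new_target = soft_update([0.0, 1.0], [1.0, 1.0], beta=0.1)
```

A small β keeps the target close to its previous value, which is what provides a stable learning target.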
4 Experimental Setup

Compound Conflicts
In this work, we consider multi-UAV conflicts. However, multiple pairwise conflicts can have varying spatial and temporal boundaries, i.e., they can differ in how much they overlap in space and time. Koca et al. [Koca et al.] introduce the concept of a compound ecosystem, with an ecosystem being the set of aircraft affected by the occurrence of a conflict. They propose that multiple ecosystems can be considered together if they have at least one common member and the conflicts overlap in time for more than 10% of their duration. For this work, we relax the requirements by not considering surrounding traffic, thereby proposing the concept of a compound conflict. As such, multiple pairwise conflicts can be considered collectively if and only if they share a common aircraft. We keep the temporal requirement the same as in [Koca et al.].
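The grouping rule above can be sketched as follows. Several details are assumptions made only for illustration: each pairwise conflict is a tuple (aircraft_a, aircraft_b, t_start, t_end), the temporal overlap is measured relative to the shorter of the two conflicts, and linked conflicts are merged transitively into one compound conflict.

```python
def overlap_fraction(c1, c2):
    """Temporal overlap divided by the shorter conflict's duration (assumption)."""
    start = max(c1[2], c2[2])
    end = min(c1[3], c2[3])
    shortest = min(c1[3] - c1[2], c2[3] - c2[2])
    if shortest <= 0:
        return 0.0
    return max(0.0, end - start) / shortest


def linked(c1, c2, min_overlap=0.10):
    """Two conflicts are linked if they share an aircraft and overlap enough in time."""
    shares_aircraft = bool({c1[0], c1[1]} & {c2[0], c2[1]})
    return shares_aircraft and overlap_fraction(c1, c2) > min_overlap


def compound_conflicts(conflicts, min_overlap=0.10):
    """Group pairwise conflicts into connected components of linked conflicts."""
    groups, seen = [], set()
    for idx in range(len(conflicts)):
        if idx in seen:
            continue
        seen.add(idx)
        stack, component = [idx], []
        while stack:
            cur = stack.pop()
            component.append(conflicts[cur])
            for other in range(len(conflicts)):
                if other not in seen and linked(conflicts[cur], conflicts[other], min_overlap):
                    seen.add(other)
                    stack.append(other)
        groups.append(component)
    return groups


# Hypothetical example: A-B and B-C share aircraft B and overlap in time.
example = [("A", "B", 0.0, 10.0), ("B", "C", 5.0, 15.0), ("D", "E", 0.0, 10.0)]
print(compound_conflicts(example))
```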

Traffic as a Graph
In this work, the multi-agent environment is represented as a graph. Therefore, we must define how the graph is created for a given traffic scenario. A graph G = (V, E) is fully defined by its set of nodes and its set of edges.
In DGN, the nodes are the agents present in the environment. We keep the same approach by considering the UAVs in a given traffic scenario as nodes. An edge is created between two UAVs if and only if a conflict between them has been detected. This choice is motivated by the fact that in DGN, agents communicate first and foremost with their neighbors. It therefore facilitates cooperation between UAVs that are in conflict.
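The graph construction described above amounts to building an adjacency matrix from the detected pairwise conflicts. The sketch below assumes that UAVs are indexed from 0 and that the conflict detection step returns a list of index pairs; both the function names and this input format are illustrative assumptions.

```python
def build_adjacency(n_agents, conflict_pairs):
    """Symmetric 0/1 matrix with an edge (i, j) iff a conflict between i and j was detected."""
    adj = [[0] * n_agents for _ in range(n_agents)]
    for i, j in conflict_pairs:
        adj[i][j] = 1
        adj[j][i] = 1
    return adj


def neighbors(adj, i):
    """Return the indices of the agents that agent i can communicate with."""
    return [j for j, flag in enumerate(adj[i]) if flag]


# Hypothetical scenario: UAV 1 is in conflict with both UAV 0 and UAV 2.
adjacency = build_adjacency(3, [(0, 1), (1, 2)])
print(neighbors(adjacency, 1))
```

UAVs 0 and 2 are not directly connected here, so they exchange information only through the stacked convolutional layers via UAV 1.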

Training Environment
In order to model compound conflict resolution as a MARL problem, the underlying Markov decision process must be formalized. Thus, we have to determine the state space, action space, and reward function. As we are considering cooperative agents, the ultimate goal is to maximize the joint reward. Therefore, all agents have the same reward structure.

State space
The representation of the states of the environment is one of the most critical factors that can impact the learning capability and performance of the agents. Typically, the state is formalized through a vector of a certain dimensionality, which should provide enough information to facilitate learning. Nevertheless, representations with higher dimensionality require a higher computational effort to train an effective model.
Therefore, in this work we take the state representation proposed by Isufaj et al. [Isufaj et al., 2021], where the state is formalized through the agents' position and speed information. More specifically, the state $s_i$ of an agent is the vector $s_i$ = [lat, lon, hdg, spd], i.e., latitude, longitude, heading and speed. These values are normalized into the range [0, 1] to make the model easier to train.
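A possible normalization of the state vector is sketched below using min-max scaling. The normalization bounds are not given in the text, so the latitude and longitude bounds in the example are purely hypothetical, as is the function name; the maximum speed of 15 m/s is borrowed from the UAV parameters stated later in the paper.

```python
def normalize_state(lat, lon, hdg, spd, lat_bounds, lon_bounds, max_spd):
    """Scale [lat, lon, hdg, spd] into [0, 1] using min-max normalization."""

    def scale(x, low, high):
        return (x - low) / (high - low)

    return [
        scale(lat, *lat_bounds),
        scale(lon, *lon_bounds),
        (hdg % 360.0) / 360.0,  # headings wrap around at 360 degrees
        spd / max_spd,
    ]


# Hypothetical scenario area of one degree in latitude and longitude.
state = normalize_state(41.5, 2.0, 90.0, 7.5, (41.0, 42.0), (1.5, 2.5), 15.0)
print(state)
```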

Action space
In this work, we only consider solutions through heading changes; thus, speed and altitude changes are ignored. As such, agents can choose one of three actions at each decision time step: turn left, turn right, or do nothing, where each turn corresponds to a heading change of 15° in the respective direction. Agents must make a decision every 2 seconds.
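The discrete action space can be illustrated with a small mapping from action index to heading change. The action indices and names below are assumptions, as is the convention that a left turn decreases the compass heading.

```python
HEADING_STEP_DEG = 15.0

# Hypothetical action indices: 0 = turn left, 1 = turn right, 2 = do nothing.
ACTIONS = {0: -HEADING_STEP_DEG, 1: +HEADING_STEP_DEG, 2: 0.0}


def apply_action(heading_deg, action):
    """Return the new heading in [0, 360) after applying one discrete action."""
    return (heading_deg + ACTIONS[action]) % 360.0


print(apply_action(10.0, 0))
```

With one decision every 2 seconds, an agent needs at least six consecutive turns in the same direction to change its heading by 90°.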

Reward function
Once the agents take an action according to their policy, they receive a reward $r_{i,t}$ from the environment for the current time step, which indicates the quality of the action. Thus, a carefully constructed reward function is crucial for achieving desirable performance [Isufaj et al., 2021].
In our case, the reward consists of three terms. First, the number-of-conflicts term penalizes agents according to the number of conflicts they are in: the more conflicts an agent is in, the stronger its incentive to solve them. Furthermore, through the severity term, the agents are encouraged to solve the most severe conflicts first. This term treats more severe conflicts, i.e., those with a smaller distance at the closest point of approach (CPA), as more important to solve first. Lastly, the deviation term penalizes the agents for solutions that make the agent drift from its original track. In this work, if an agent has deviated more than 90° from the original route, it is penalized heavily. Otherwise, it is penalized by the ratio of the current deviation to the maximal deviation. Such a term also indirectly incentivizes the agents to solve the conflicts as soon as possible, as the quicker the conflicts are solved, the less negative reward the agent will get. In the formal definition of the reward function, $w_1$, $w_2$, $w_3$ are positive weights that indicate the importance of each term, E(i) is the set of all agents that share an edge with i, $\mu_0$ and $\mu$ are the original and current headings, and $d_{cpa}$ and $d_{thresh}$ are the CPA and self-separation distances. In this paper, $w_1$, $w_2$, $w_3$ are kept equal; in future work, they could be made learnable parameters. The total reward for a given time step t is:
$$R_t = \sum_{i=1}^{N} r_{i,t}$$
where N is the total number of agents.
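The following Python sketch shows one way the three reward terms could be combined. It is not the exact reward function of this paper: the form of the severity term (1 − d_cpa / d_thresh), the value of the heavy penalty, and all function and parameter names are assumptions chosen only to match the verbal description above.

```python
def heading_deviation(original_hdg, current_hdg):
    """Smallest absolute angle between two headings, in degrees (0 to 180)."""
    diff = abs(current_hdg - original_hdg) % 360.0
    return min(diff, 360.0 - diff)


def agent_reward(cpa_distances, original_hdg, current_hdg, d_thresh,
                 w1=1.0, w2=1.0, w3=1.0, max_dev=90.0, heavy_penalty=10.0):
    """Negative reward of one agent; cpa_distances has one entry per neighbor in E(i)."""
    n_conflicts = len(cpa_distances)
    # Assumed severity form: a smaller CPA distance gives a larger penalty.
    severity = sum(1.0 - d / d_thresh for d in cpa_distances)
    dev = heading_deviation(original_hdg, current_hdg)
    # Heavy (hypothetical) penalty beyond max_dev, otherwise a fraction of it.
    deviation = heavy_penalty if dev > max_dev else dev / max_dev
    return -(w1 * n_conflicts + w2 * severity + w3 * deviation)


def total_reward(agent_rewards):
    """Total reward of a time step: the sum of the individual agent rewards."""
    return sum(agent_rewards)


# Hypothetical agent: one conflict at half the threshold distance, 45 degrees off track.
print(agent_reward([120.0], 0.0, 45.0, d_thresh=240.0))
```

Under these assumptions an agent with no conflicts that is on its original heading receives the maximum reward of 0.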

Simulation Environment
Simulations were run on the BlueSky air traffic simulator [Hoekstra and Ellerbroek, 2016]. The simulator was chosen primarily because it is an open-source tool, allowing for more transparency in developing and evaluating the proposed model. Furthermore, BlueSky has an Airborne Separation Assurance System (ASAS) that supports different CD&R methods. This allows different resolution algorithms to be evaluated under the same conditions and scenarios.

Data Generation
Algorithm 1 describes the procedure to generate the training scenarios. In this work, we consider compound conflicts with 3 and 4 UAVs. To create the multi-UAV conflict, first, a reference aircraft is initialized with a heading sampled from a uniform distribution from 0° to 360°. Then, this aircraft is added to the set of created aircraft. To generate each of the remaining conflicting UAVs, we sample an aircraft from the set of created ones. Then, a conflict angle is chosen from a predefined list of angles. Next, to add some variance to the intrusion headings, a perturbation in the range [−10°, 10°] is added to each case. After that, the severity of the conflict is decided by sampling from a uniform distribution between 0.1 and 1. Finally, we set the time at which the new aircraft enters into conflict with the randomly chosen aircraft to 15 seconds. The CRECONF function is taken from the BlueSky simulator, and it provides the location and speed of a new conflicting aircraft. However, as compound conflicts have temporal boundaries, no accidental conflicts are added within one look-ahead time, which is set to 8 seconds. This is checked by the CONFLICT function, also taken from BlueSky.
To define the metrics for self-separation, we follow a similar approach as in [Shi et al., 2020, Mullins et al., 2013]. This threshold depends on the UAV maneuverability and its maximum airspeed. The innermost layer is modeled according to the Near Mid-Air Collision concept, as a circle with radius $R_{NMAC}$ = 2 × Maximum Wing Span + Total System Error (TSE). The self-separation threshold is calculated from $V_m$ and $\omega_m$, the maximum airspeed and maximum yaw rate, respectively, and $t_m$, the time needed for the UAV to make an avoidance maneuver. The self-separation threshold was set to 240 m, taking into account $R_{NMAC}$ = 4 m, a maximum airspeed of 15 m/s, and a maximum yaw rate of 90°/s. As we are attempting to solve conflicts at the tactical level, a duration of 1 minute per scenario was deemed suitable. Note that the time metrics (i.e., tactical CD&R maneuver and look-ahead time) mentioned above are synthesized from the state of the art of CD&R for small UAS [Ho et al., 2018, Consiglio et al., 2019, Johnson et al., 2017b].
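The scenario generation procedure can be sketched as the loop below. This is not Algorithm 1 itself: the two callables `create_conflict` and `causes_extra_conflict` are hypothetical stand-ins for the BlueSky CRECONF and CONFLICT functions (their real signatures are not reproduced here), the list of conflict angles in the example is invented for illustration, and rejecting and resampling a candidate that causes an accidental conflict is an assumption.

```python
import random


def generate_scenario(n_uavs, conflict_angles, create_conflict, causes_extra_conflict,
                      rng=random, t_conflict=15.0, max_tries=100):
    """Create n_uavs aircraft so that each new one conflicts with an existing one."""
    # Reference aircraft with a uniformly sampled heading.
    created = [{"id": 0, "hdg": rng.uniform(0.0, 360.0)}]
    tries = 0
    while len(created) < n_uavs and tries < max_tries:
        tries += 1
        target = rng.choice(created)
        # Conflict angle from the list plus a perturbation of up to 10 degrees.
        angle = rng.choice(conflict_angles) + rng.uniform(-10.0, 10.0)
        severity = rng.uniform(0.1, 1.0)
        candidate = create_conflict(len(created), target, angle, severity, t_conflict)
        # Keep the candidate only if it adds no accidental conflict.
        if not causes_extra_conflict(candidate, created, target):
            created.append(candidate)
    return created


# Hypothetical stubs used only to make the sketch runnable without BlueSky.
def stub_create_conflict(new_id, target, angle, severity, t_conflict):
    return {"id": new_id, "hdg": (target["hdg"] + angle) % 360.0}


def stub_no_extra_conflict(candidate, created, target):
    return False


scenario = generate_scenario(3, [0.0, 90.0, 180.0], stub_create_conflict,
                             stub_no_extra_conflict, rng=random.Random(0))
```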
5 Simulation Results

Conflict Resolution Performance
The model was trained for 10000 episodes with scenarios of compound conflicts with 3 UAVs. Then, it was trained for a further 10000 episodes with scenarios of compound conflicts with 4 UAVs. In this way, we reuse the learned policies of the previous agents and fine-tune them in the four-agent case, while the new agent is trained from scratch. The models were trained on the Google Cloud Platform using an NVIDIA Tesla K80 GPU. The training lasted around 10 hours. Figure 1 shows the evolution of the cumulative reward for both cases. As the agents are cooperative, we are interested in the overall reward that is gained per episode and do not concern ourselves with the individual rewards. In this work, we use negative rewards, so the maximum the agents can get is 0. In the case of 4 agents the reward is somewhat lower; however, this is a result of there being one more agent present, which takes actions to solve the conflicts and thus incurs some negative reward for deviating from its track.
From the figure, it can be noted that the model converges in both cases. This means that the agents are able to improve their policies with gained experience. However, in the case with 3 agents, convergence happens at around 2000 episodes, while around 4000 episodes are required for the 4-agent case. In the latter case, there are more possible scenarios that can be generated, which increases the variety of situations that the agents are presented with. Furthermore, at the beginning of training the already present agents employ their learned policies, while the new agent is exploring the possible actions, which reduces the overall reward the agents get.
Figure 2 shows the number of losses of separation (LOSS). The average number of LOSS in the unmitigated case (for both 3 and 4 agents) is shown with the dashed line. As we can see, the reward performance translates directly into successfully avoiding LOSS. In the case with 3 agents, after convergence the average LOSS per episode is less than 1. This indicates that the agents are able to solve conflicts before violating the self-separation distance.
In the case of the 4-agent compound conflict, the average is around 1 LOSS per episode. However, our results show that the model always manages to avoid near misses in both cases, as the NMAC distance is never breached.
It is interesting to note that we can observe a similar evolution as in Figure 1, with the retrained model performing slightly worse. In general, the case with 4 agents had more pairwise conflicts present, which makes the problem more difficult.
Table 1 shows in how many episodes the compound conflict was solved, meaning that no LOSS occurred. We note that the difference is similar to the number of extra episodes the 4-agent model needed to converge. As such, once the model converges it can generally manage to solve the compound conflict, thus fulfilling its task successfully. This result shows that the agents are able to solve conflicts by communicating with their neighbors. In addition to solving conflicts, it is desirable for agents not to spend too much time in a LOSS, as this can increase the risk of collisions. Such information is shown in Figure 3. The results shown there further confirm that the agents are able to improve their performance. Following a similar trend, the 3-agent compound conflict seems simpler to solve, as the agents spend less than 5 seconds in a LOSS, with 5650 episodes not experiencing a LOSS (and therefore no time steps in LOSS).

Agent behavior
The results shown so far indicate that the agents manage to learn how to solve the task. We note that the model converges fast and maintains its knowledge of the system, thus avoiding the common forgetting issues.
However, it is important to understand what strategies they have learned. This information is shown in Figures 4 and 5.
For the sake of simplicity, we only show the frequency of actions for the last 200 episodes.
We note that in both settings, the agents take the turn left action in the majority of cases. While the direction of the action might not be as important, the learned strategy suggests that agents take the same action. This results in agents increasing the distance between them, as taking the same action in head-on or crossing scenarios results in them going in different directions. However, in overtaking scenarios such a strategy does not immediately solve the conflict. Nevertheless, through the reward, agents must learn that the conflict with the smallest CPA distance is the most urgent.
As such, it can happen that agents prefer to delay the solution in an overtaking scenario by making several small changes in the same direction. While this is not immediately desirable, attempting to make a heading change in the opposite direction could create a more severe conflict with the head-on or crossing agents.
In this work, we do not put any restrictions on the agents and do not inject expert knowledge into them; thus, they start learning from a blank slate. The results show that the agents are able to learn a strategy that successfully solves the compound conflicts in scenarios with 3 and 4 agents.

Conclusions and Future Work
In this paper, we tackle multi-UAV conflict resolution by modelling it as a MARL problem with cooperative agents. Air traffic is represented as a graph with aircraft as nodes. An edge is created between every two aircraft in a pairwise conflict. We use graph convolutional reinforcement learning, which provides a communication mechanism between connected agents. This means that conflicting aircraft are allowed to communicate with each other and develop cooperative strategies. In order to formally define a multi-UAV conflict, we propose the concept of compound conflicts, which are conflicts that have tight spatial and temporal boundaries.
We first train a model that learns how to solve compound conflicts with 3 agents. After that, the same model is retrained to solve compound conflicts with 4 agents. As a result, we are able to refine the policies learned in the previous setting, while the added agent learns a desirable policy.
Results show that the agents are able to improve their policies and thus solve the task. For both settings, we observe an improvement both in the number of LOSS and in the duration of LOSS, with the majority of scenarios after convergence having no LOSS (i.e., the compound conflict is solved). Furthermore, the agents are able to discover a strategy that increases the overall distance between them. As such, they effectively learn to solve the most severe conflicts first and then solve the remaining conflicts while making sure that no new conflicts are created.
However, there are several aspects that must be researched further. For instance, in this work we use a maximum of 4 agents in the scenario. In reality, the number of agents in a compound conflict cannot always be decided beforehand; thus, a solution that adapts to N agents must be sought. Furthermore, the reward function could be elaborated further to include terms that deal with the quality of solutions, such as optimizing for battery usage or the number of actions taken. Finally, the action space can be extended to include solutions by speed or altitude changes.

Figure 1: Evolution of the cumulative reward per episode.

Figure 2: Number of losses of separation in comparison with the average unmitigated case.

Figure 3: Number of time steps spent in LOSS.

Figure 4: Frequency of actions for the last 200 episodes for the compound conflict with 3 agents.

Figure 5: Frequency of actions for the last 200 episodes for the compound conflict with 4 agents.

Table 1: Number of episodes in which the compound conflict was solved.