## 1. Introduction

This work is part of a line of research aimed at improving air traffic management (ATM) procedures. More specifically, the line focuses on improving operations in the airport environment, and it is mainly motivated by the relentless increase in global air traffic [1].

All proposals in this field must strictly respect the extensive rules issued by the many civil aviation authorities in this sector. These authorities include the ICAO (International Civil Aviation Organization), a United Nations agency that promotes aviation safety and the orderly development of international civil aviation worldwide. The ICAO establishes the standards and regulations necessary for aviation safety, efficiency, and regularity, and for environmental protection. One of the many safety requirements that air operations must meet is maintaining a minimum lateral and vertical separation between aircraft in flight. For example, the minimum lateral separation is 5 NM for en-route airspace and 3 NM inside the terminal radar approach control area, whereas the minimum vertical separation is 2000 ft above 29,000 ft and 1000 ft below this altitude [2]. We use nautical miles (NM), feet (ft), and knots (kt), as they are the usual units of measurement in air navigation; in the International System of Units (SI), 1 NM = 1852 m, 1 ft = 0.3048 m, and 1 kt = 0.5144 m/s.

Air traffic conflict detection and resolution (CD and R) mechanisms [3,4] aim to maintain the separation between in-flight aircraft established by the aviation regulation. In this context, a “conflict” (or “collision”) is an event in which two or more aircraft experience a loss of minimum separation. Traditional CD and R mechanisms use geometric [5,6,7] or probabilistic [8,9,10] techniques for predicting future aircraft trajectories starting from, for example, known flight plans or current radar information. Trajectory prediction allows a conflict to be detected in advance and actions to be triggered to prevent it from happening.
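As an illustration of the separation minima quoted above, the following minimal sketch checks whether two aircraft have lost separation. It is not part of any CD and R system described here: the function name `loss_of_separation`, the rule that a conflict requires both minima to be violated at once, and the use of the higher aircraft to select the vertical minimum are assumptions made for the example.

```python
def loss_of_separation(lateral_nm, alt1_ft, alt2_ft, en_route=True):
    """Return True if both the lateral and the vertical minima are violated.

    Assumption: a conflict requires losing lateral AND vertical separation
    at the same time.
    """
    # Lateral minimum: 5 NM en route, 3 NM inside the terminal area.
    lateral_min_nm = 5.0 if en_route else 3.0
    # Vertical minimum: 2000 ft above 29,000 ft, 1000 ft below.
    # Assumption: the higher of the two aircraft selects the applicable minimum.
    vertical_min_ft = 2000.0 if max(alt1_ft, alt2_ft) > 29000.0 else 1000.0
    vertical_ft = abs(alt1_ft - alt2_ft)
    return lateral_nm < lateral_min_nm and vertical_ft < vertical_min_ft


# Two aircraft in the terminal area, 2 NM apart and 500 ft apart vertically.
in_conflict = loss_of_separation(2.0, 5000.0, 5500.0, en_route=False)
```

With these hypothetical values, both minima are violated, so `in_conflict` is true; increasing the vertical distance to 1000 ft would clear the conflict.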

According to the prediction horizon, CD and R mechanisms can be classified into three categories: long-term CD and R (horizons over 30 min), medium-term CD and R (horizons of up to 30 min), and short-term CD and R (horizons of up to 10 min). In this work, we focus on short-term CD and R. At this level, there are ground-based systems that assist air traffic controllers. They use information from surveillance radars or from other, more advanced systems, such as the Automatic Dependent Surveillance-Broadcast (ADS-B) system [11]. One example of a ground-based CD and R system is the Short-Term Conflict Alert (STCA) [12]. There is also a family of airborne devices that function independently from the ground-based Air Traffic Control (ATC) system, such as the Traffic Collision Avoidance System (TCAS) [13]. In this case, conflicts are detected by establishing direct communication between nearby aircraft. In [14], conflict resolution for autonomous aircraft in the medium and short terms is investigated.

Meanwhile, neural networks [15] are quickly growing in popularity, mainly due to the great advances being made in computing power. Thanks to this and to the proliferation of graphics accelerators, which have proven well suited to training these networks, more and more uses are being found for this technology, which requires long and computationally costly training but is highly efficient once the networks are trained.

In this context, the present work explores the possibility of using a neural network to detect in advance the breach of the requirement of minimum separation between aircraft. In particular, we focus on the prediction of conflicts during the approach flight phase [16]. Approach is one of the most critical flight phases, in which any hazard (for example, a conflict) could have fatal consequences. In fact, recent statistics [17] show that accidents in the final approach phases represent 27% of the total.

In the literature, numerous works have already proposed applying neural networks in the CD and R field. For example, in [18], a neural network is used in the environment of an airport to estimate the position of aircraft up to 30 s in advance, in order to predict possible conflicts among them. The network is trained by using data obtained at the airport. Similarly, in [19], the use of a neural network with multi-layer perceptron (MLP) architecture is proposed, which predicts the trajectory of two aircraft and prevents their collision. In [20], a neural network is employed to predict aircraft trajectories in the vertical plane. The network is trained by using a set of real trajectories. There are also more recent examples of the use of neural networks for aircraft conflict prediction. In [21], the use of neural networks and other machine learning techniques is proposed to determine the Closest Point of Approach (CPA) between two aircraft, in both medium-term and short-term CD and R. The backpropagation (BP) neural network proposed in [22] is able to predict aircraft trajectories in 4D space (3D and time) with high accuracy. The authors of [23] state that their Deep Long Short-Term Memory (D-LSTM) neural network for trajectory prediction can be applied to detect potential conflicts between aircraft. Finally, LSTM is combined with convolutional neural networks (CNN) in [24] for aircraft 4D trajectory prediction.

Our proposal differs from all these works in that the goal of the neural network is not to predict any aircraft trajectory or position, but rather to decide whether two aircraft that are initiating an approach maneuver will come into conflict at some point during the maneuver. For this reason, our method cannot be directly compared with the above-mentioned techniques. Instead, we evaluate the accuracy of the neural network in predicting whether or not a conflict will occur.

Although this is out of the scope of this paper, there are also several proposals that employ neural networks to avoid (resolve) the conflict, once it has been detected [25,26,27,28].

The rest of this work is structured as follows. First, Section 2 presents some basic notions about neural networks, necessary to understand our final implementation. Next, in Section 3, we formally describe the way in which aircraft execute approach procedures based on waypoints, and we provide a simple conflict detection algorithm. After that, Section 4 details how the training database for the neural network was obtained, together with the design and training process carried out. Finally, Section 5 provides some conclusions and outlines future work.

## 2. Neural Networks Basics

With the current increase in computing power and the popularization of GPU computing, neural network technology is now being applied to very diverse areas and covers a wide variety of topics, proving its worth by producing very good results. From medicine to economics, it is currently one of the most widely used tools for solving problems that would traditionally require very complex mathematical models. We consider that it can also be applied to typical ATM problems, such as predicting conflicts between approaching aircraft.

Artificial neural networks (or, simply, neural networks) are loosely inspired by the biological neural networks that form animal brains. Like their biological counterparts, the key feature of neural networks is that they can learn to perform tasks by considering examples, instead of being specifically programmed for that end. A neural network is based on a collection of nodes, called artificial neurons, that model the neurons of the biological system. The connections between these nodes mirror the synapses between neurons: each node passes on the information it has processed, or withholds it, depending on a certain activation function.

The computational model for neural networks was developed by McCulloch and Pitts in 1943 [29]. Early implementations (called Perceptron) were formed by a series of artificial neurons connected to the output layer. In 1975, Werbos developed his backpropagation algorithm, which made the training of multi-layer networks (called Multi-Layer Perceptron, or MLP) feasible and efficient [30]. As computing power increased through GPUs and distributed systems, the number of layers in neural network models could grow. These systems became known as deep learning networks and proved particularly good at solving image and visual recognition problems.

A neural network is often described as a black box learning model. However, that does not mean that its mechanism is unknown or too intricate to understand. Rather, neural networks are called black boxes because of their complexity. Once the network grows to have multiple layers and a large number of neurons per layer, the weights each neuron takes become impossible to interpret, so they do not mean anything to a human observer. However, we can look at the main elements of the neural network to get an idea of how those weights are computed and, later, used.

The first element to consider is the artificial neuron. There are several models of artificial neurons. The one used in the perceptron is still used in current neural networks and deep learning networks. It works by taking a set of binary inputs ${x}_{1},{x}_{2},\dots ,{x}_{n}$ and producing a single binary output. Inside the neuron, these binary inputs are multiplied by a set of weights ${w}_{1},{w}_{2},\dots ,{w}_{n}$, which can be understood as the importance each input has for the final result. If the sum of all the inputs multiplied by their weights is greater than a threshold value, the neuron outputs a $1$; otherwise, it outputs a $0$. Or, put in more precise algebraic terms,

$$\text{output}=\begin{cases}0 & \text{if } \sum_{j}{w}_{j}{x}_{j}\le \text{threshold}\\ 1 & \text{if } \sum_{j}{w}_{j}{x}_{j}>\text{threshold}\end{cases} \tag{1}$$

That threshold, used to decide if the neuron will activate, is usually expressed through the bias, $b=-\text{threshold}$, which moves it to the other side of the inequality in Equation (1). In addition, to simplify the notation, the sum of the products of the inputs and weights is usually written as the dot product of those two vectors. This way, we can define the neuron more easily as

$$\text{output}=\begin{cases}0 & \text{if } w\cdot x+b\le 0\\ 1 & \text{if } w\cdot x+b>0\end{cases} \tag{2}$$

With this neuron model, a complete neural network can be devised to approximate a function. However, a problem arises when the neural network must learn from a set of inputs. When trying to adjust the different weights, the binary nature of these neurons can trigger a catastrophic chain reaction: if a single weight is changed so that a neuron flips from $0$ to $1$, it can activate or deactivate many of the following neurons that were working fine before. That is where logistic neurons appear. Instead of being completely binary, these neurons can receive any real number between $0$ and $1$. In addition, instead of the step function that outputs $0$ or $1$, there is an activation function. Historically, this activation function has been the sigmoid, which is why these logistic neurons are also referred to as sigmoid neurons in some books. However, the recent rise in popularity of other activation functions, such as the hyperbolic tangent (tanh) or the Rectified Linear Unit (ReLU), has made the sigmoid neuron terminology dated. For simplicity, the sigmoid function is the one used here to explain how this kind of neuron works; the other activation functions are interchangeable. The sigmoid function is defined as

$$\sigma \left(z\right)=\frac{1}{1+{e}^{-z}} \tag{3}$$

where, in this case, $z$ is our dot product of weights and inputs plus the bias, that is, $w\cdot x+b$.

Although this may seem complex compared with the simple model described previously, it is actually not that far from it. If we look at the sigmoid function, we can see that when $z$ is very large and positive, the sigmoid approaches $1$, and when it is very negative, it approaches $0$. In fact, if the $\sigma $ function were a step function, we would have exactly the same kind of neurons as before. What is crucial about the sigmoid, however, is the smoothness of its shape. By making slight changes in the weights and bias, we obtain slight changes in the output of the neuron, rather than an extreme change like before. This allows the learning algorithm to make small changes to each neuron without completely disrupting the model.
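The contrast between the two neuron models can be sketched in a few lines of Python. This is an illustrative example only: the inputs, weights, and bias are hypothetical values chosen so that a small change in one weight moves $z$ across zero.

```python
import math


def perceptron(x, w, b):
    """Step neuron: outputs 1 if w.x + b > 0, otherwise 0."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z > 0 else 0


def sigmoid(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_neuron(x, w, b):
    """Sigmoid neuron: outputs a real number between 0 and 1."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)


# Hypothetical values: changing one weight by 0.02 moves z from -0.01 to +0.01.
x = [1.0, 1.0]
b = -1.0
w_before = [0.5, 0.49]
w_after = [0.5, 0.51]

step_before = perceptron(x, w_before, b)
step_after = perceptron(x, w_after, b)
smooth_before = sigmoid_neuron(x, w_before, b)
smooth_after = sigmoid_neuron(x, w_after, b)
```

The step neuron flips from 0 to 1, whereas the sigmoid neuron's output moves only slightly around 0.5, which is the behavior that makes gradual learning possible.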

In order to learn, we define the following to be true: $\text{real\_output}=\text{output}+\Delta \text{output}$. As we want to improve our network, we must compute that $\Delta \text{output}$ in order to make the small changes to the network. To compute it, we use the fact that small changes in the weights and bias produce small changes in the output, that is

$$\Delta \text{output}\approx \sum_{j}\frac{\partial \,\text{output}}{\partial {w}_{j}}\Delta {w}_{j}+\frac{\partial \,\text{output}}{\partial b}\Delta b \tag{4}$$

Thus, it is here where the choice of activation function comes into play. When solving the derivatives, the exponential character of the sigmoid function plays an important role in facilitating the computations needed to adjust the weights and bias of each neuron. That is why a suitable candidate cannot be just any function with a similar shape. The only drawback is that, now, our output layer will not output a definite Boolean value. However, when classifying, we can just take a threshold, which is usually $0.5$, to decide whether the real number means true or false.
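The claim that small changes in the weights and bias produce proportionally small changes in the output can be checked numerically. The sketch below is an assumption-laden illustration for a single sigmoid neuron: all numeric values are hypothetical, and it relies on the standard identity $\sigma'(z)=\sigma(z)(1-\sigma(z))$, so that the partial derivatives are $\sigma'(z)\,x_j$ for each weight and $\sigma'(z)$ for the bias.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(x, w, b):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Hypothetical neuron and hypothetical small changes to its parameters.
x = [0.2, 0.7]
w = [0.4, -0.3]
b = 0.1
dw = [0.01, -0.02]
db = 0.005

out = neuron_output(x, w, b)
slope = out * (1.0 - out)  # sigma'(z) = sigma(z) * (1 - sigma(z))

# First-order estimate: sum_j (d out / d w_j) * dw_j + (d out / d b) * db
estimated_change = sum(slope * xi * dwi for xi, dwi in zip(x, dw)) + slope * db

# Exact change obtained by re-evaluating the neuron.
new_w = [wi + dwi for wi, dwi in zip(w, dw)]
actual_change = neuron_output(x, new_w, b + db) - out

# Classification with the usual 0.5 threshold.
predicted_class = out >= 0.5
```

For changes this small, the estimated and actual changes agree closely, which is what lets the learning algorithm reason about the effect of each adjustment.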

After defining both the neurons and the activation functions, we must define the learning function. We have previously stated that the backpropagation technique was a breakthrough in computing the adjustments of the weights and bias, and we have now explained what those adjustments mean. First, we need a reference to know if the neural network is working correctly. This function is called the loss or cost function and is represented as $C$. As a simple example, the mean squared error (MSE) can be used as a cost function. In fact, Matlab [31], which is the platform employed in this work, uses this particular cost function for most of its network implementations. This function is defined as

$$C\left(w,b\right)=\frac{1}{2n}\sum_{x}{\parallel y\left(x\right)-a\parallel}^{2} \tag{5}$$

where $w$ and $b$ are our vectors of weights and bias, $n$ is the number of training inputs $x$, $y\left(x\right)$ is the output of the network, and $a$ is the desired output. Note that the methods explained here apply to supervised classification, which is the modality used for our final Matlab network.
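A minimal sketch of this cost function follows. The function name `mse_cost` is hypothetical, and the $1/(2n)$ scaling is an assumption of this example (a convention that simplifies the derivative); other implementations may scale the sum differently.

```python
def mse_cost(outputs, targets):
    """Quadratic cost: 1/(2n) * sum over samples of ||output - target||^2.

    `outputs` and `targets` are lists of equal length, each element being a
    list with the network output (or desired output) for one training input.
    The 1/(2n) factor is an assumption of this sketch.
    """
    n = len(outputs)
    total = 0.0
    for y, a in zip(outputs, targets):
        total += sum((yi - ai) ** 2 for yi, ai in zip(y, a))
    return total / (2.0 * n)


# Two hypothetical training inputs with one output value each.
perfect = mse_cost([[1.0], [0.0]], [[1.0], [0.0]])
imperfect = mse_cost([[0.8], [0.4]], [[1.0], [0.0]])
```

The cost is zero when every output matches its desired value and grows with the squared distance between them.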

Once the cost function is defined, the set of weights and bias must be optimized to minimize it. This is done with the learning function. A popular learning function is Stochastic Gradient Descent (SGD) [32]. This function will also be one of those used later when designing and training our network. It is an algorithm that seeks to minimize the cost function using derivatives. That is why the MSE is often used: as with the sigmoid function before, it is a function in which it is easy to make small changes to improve accuracy. In fact, other ad hoc functions that better suit whatever the network is approximating are sometimes used. However, for our task, the MSE works well enough both for the explanation and, later, for the implementation.

The error is minimized through a local search. This means that the training may reach a local minimum and be unable to leave that area. Even if finding the global minimum is a difficult task, keeping the search from getting stuck in a shallow local minimum is something that can be dealt with, as will be explained later.

Thus, to find the minimum of such functions, the solution is to use the derivative of the function. However, when the number of variables (weights and bias) grows as in a neural network (in a big neural network of thousands of neurons, the weights and bias can number in the billions), the derivatives become far too complex to solve analytically in any efficient way. Instead of looking for the absolute minimum directly, we can just “peek” at where the slope is going. If a minimum exists, it should eventually be reached by following the slope downhill.

To move toward the minimum, the first step is to define how the cost function changes, that is

$$\Delta C\approx \sum_{i=1}^{n}\frac{\partial C}{\partial {v}_{i}}\Delta {v}_{i} \tag{6}$$

where each ${v}_{i}$ is a variable of our network. In addition, we define the vector of changes in our variables as $\mathsf{\Delta}v\equiv \left(\mathsf{\Delta}{v}_{1},\mathsf{\Delta}{v}_{2},\dots ,\mathsf{\Delta}{v}_{n}\right)$ and the gradient of $C$, $\nabla C$, as the vector of partial derivatives. With this, Equation (6) can be rewritten as

$$\Delta C\approx \nabla C\cdot \Delta v \tag{7}$$

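which is interesting because it shows how to make $\Delta C$ negative. In particular, the changes in the variables can be chosen as

$$\Delta v=-\eta \nabla C \tag{8}$$

This would mean that

$$\Delta C\approx -\eta \nabla C\cdot \nabla C=-\eta {\parallel \nabla C\parallel}^{2} \tag{9}$$

and, given that ${\parallel \nabla C\parallel}^{2}$ is never negative and $\eta $ is positive, $\Delta C$ is never positive: the cost decreases whenever the gradient is nonzero.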
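This argument can be checked on a toy problem. The sketch below uses a hypothetical two-variable cost function, $C(v)={v}_{1}^{2}+3{v}_{2}^{2}$, which is not taken from this work, takes one step against the gradient with a small positive factor $\eta$, and compares the actual change in the cost with the first-order prediction $-\eta {\parallel \nabla C\parallel}^{2}$.

```python
def cost(v):
    """Hypothetical cost function C(v) = v1^2 + 3 * v2^2."""
    return v[0] ** 2 + 3.0 * v[1] ** 2


def gradient(v):
    """Vector of partial derivatives of the cost above."""
    return [2.0 * v[0], 6.0 * v[1]]


eta = 0.01          # small positive step factor (assumed value)
v = [1.0, -2.0]     # arbitrary starting point

grad = gradient(v)
delta_v = [-eta * g for g in grad]                 # step against the gradient
v_new = [vi + dvi for vi, dvi in zip(v, delta_v)]

predicted_delta_c = -eta * sum(g * g for g in grad)  # first-order estimate
actual_delta_c = cost(v_new) - cost(v)               # exact change
```

Both values are negative and close to each other; the small gap between them is the second-order term that the linear approximation ignores, and it shrinks as $\eta$ gets smaller.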

$\eta $ is a small, positive parameter that is chosen when defining the network, and is called the learning rate. The smaller it is, the lower the chance that a change in the variables will jump past the minimum, but the slower the computations become. That is why choosing the learning rate properly can change the outcome of the network after its learning process. After this, the set of variables can be updated as

$$v\to {v}^{\prime}=v-\eta \nabla C \tag{10}$$

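Finally, applying this to the components we have, that is, $w$ and $b$, gives

$${w}_{k}\to {w}_{k}^{\prime}={w}_{k}-\eta \frac{\partial C}{\partial {w}_{k}} \tag{11}$$

$${b}_{l}\to {b}_{l}^{\prime}={b}_{l}-\eta \frac{\partial C}{\partial {b}_{l}} \tag{12}$$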

A problem appears when we have a large number of inputs. Given that the gradient $\nabla C$ must be computed for every input, the computing time can be excessive for a very large database. That is when SGD can be used. This approach takes a relatively small batch of training inputs, chosen randomly from the database, to adjust the weights. By applying this several times, training is greatly sped up without deviating much from the true gradient of the whole database.
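The mini-batch idea can be sketched as follows. To keep the example short, it fits a single linear neuron $y=wx+b$ to noise-free toy data rather than training a full network; the function name `sgd_fit`, the data, the learning rate, and the batch size are all assumptions of this illustration.

```python
import random


def sgd_fit(data, eta=0.3, batch_size=10, steps=2000, seed=0):
    """Fit y = w * x + b by stochastic gradient descent on the quadratic cost."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(steps):
        # Estimate the gradient from a small random batch instead of all data.
        batch = rng.sample(data, batch_size)
        grad_w = sum((w * x + b - y) * x for x, y in batch) / batch_size
        grad_b = sum((w * x + b - y) for x, y in batch) / batch_size
        # Gradient-descent update with learning rate eta.
        w -= eta * grad_w
        b -= eta * grad_b
    return w, b


# Toy database: 100 points on the line y = 2x + 1, with x in [0, 2).
data = [(i / 50.0, 2.0 * (i / 50.0) + 1.0) for i in range(100)]
w, b = sgd_fit(data)
```

Each update looks at only 10 of the 100 points, yet the parameters still approach the slope and intercept that generated the data.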

Once we have defined all our elements and the gradient, the only thing missing is to know how to compute the gradient previously explained. It is here where the backpropagation algorithm enters as the solution. For the rest of this section, the following notation will be used:

- ${w}_{jk}^{l}$ will be the weight that connects the ${k}^{th}$ neuron in the ${\left(l-1\right)}^{th}$ layer with the ${j}^{th}$ neuron of the ${l}^{th}$ layer.
- ${b}_{j}^{l}$ will be the bias of the ${j}^{th}$ neuron of the ${l}^{th}$ layer.
- ${a}_{j}^{l}$ will be the activation of the ${j}^{th}$ neuron of the ${l}^{th}$ layer.
- ${z}_{j}^{l}$ will be what is called the weighted input, which summarizes the following formula: ${z}_{j}^{l}={\displaystyle \sum}_{k}{w}_{jk}^{l}{a}_{k}^{l-1}+{b}_{j}^{l}$.
- ${\mathsf{\delta}}^{l}$ will be the error of the ${l}^{th}$ layer.

With these definitions, we can define the activation (or output) of the ${j}^{th}$ neuron of the ${l}^{th}$ layer as

$${a}_{j}^{l}=\sigma \left(\sum_{k}{w}_{jk}^{l}{a}_{k}^{l-1}+{b}_{j}^{l}\right) \tag{13}$$

This notation is a bit cumbersome. To make it simpler to write and follow, the layers will be written using a matrix approach, meaning that ${w}^{l}$ is a matrix or array of all the weights of the ${l}^{th}$ layer. Then, Equation (13) can be rewritten as

$${a}^{l}=\sigma \left({w}^{l}{a}^{l-1}+{b}^{l}\right) \tag{14}$$

After this, the backpropagation algorithm is based on four fundamental equations. Proving them falls out of the scope of this work but knowing them will allow us to explain how the backpropagation works. The four equations are the following ones:

1. ${\delta}^{L}={\nabla}_{a}C\odot {\sigma}^{\prime}\left({z}^{L}\right)$, where $\odot $ denotes the element-wise product. Simply put, the error of the output layer $L$ can be computed as the derivatives of the cost function with respect to the activations of that layer, multiplied by the derivative of the activation function of that layer.
2. ${\delta}^{l}=\left({\left({w}^{l+1}\right)}^{T}{\delta}^{l+1}\right)\odot {\sigma}^{\prime}\left({z}^{l}\right)$. The error of a layer can be computed as the error of the following layer multiplied by the (transposed) weights of the following layer, multiplied by the derivative of the activation function of the current layer. This is a key concept, because it gives a way of computing the error of a layer from the error of the following one.
3. $\frac{\partial C}{\partial {b}_{j}^{l}}={\delta}_{j}^{l}$. The rate of change of the cost with respect to any bias is exactly the same as the error of that neuron.
4. $\frac{\partial C}{\partial w}={a}_{in}{\delta}_{out}$. Lastly, the rate of change of the cost with respect to any weight can be computed as the activation of its input neuron multiplied by the error of its output neuron.
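Although proving these equations is out of scope, the last two can be sanity-checked numerically. The sketch below, with hypothetical values throughout, takes a single sigmoid output neuron with quadratic cost $C=\frac{1}{2}{\left(a-t\right)}^{2}$ (an assumption of this example), computes the gradients from the error $\delta$, and compares them with central finite differences.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cost(w, b, a_in, t):
    """Quadratic cost of one sigmoid neuron for one training pair."""
    z = sum(wk * ak for wk, ak in zip(w, a_in)) + b
    return 0.5 * (sigmoid(z) - t) ** 2


# Hypothetical input activations, parameters, and target.
a_in = [0.3, 0.9]
w = [0.5, -0.4]
b = 0.2
t = 1.0

# Gradients from the backpropagation equations.
a = sigmoid(sum(wk * ak for wk, ak in zip(w, a_in)) + b)
delta = (a - t) * a * (1.0 - a)          # output error, sigma' = a * (1 - a)
grad_b = delta                           # dC/db = delta
grad_w = [ak * delta for ak in a_in]     # dC/dw_k = a_in_k * delta

# The same gradients estimated with central finite differences.
eps = 1e-6
num_grad_b = (cost(w, b + eps, a_in, t) - cost(w, b - eps, a_in, t)) / (2 * eps)
num_grad_w = []
for k in range(len(w)):
    w_plus = list(w)
    w_minus = list(w)
    w_plus[k] += eps
    w_minus[k] -= eps
    num_grad_w.append(
        (cost(w_plus, b, a_in, t) - cost(w_minus, b, a_in, t)) / (2 * eps)
    )
```

The analytic and numerical gradients agree to many decimal places, which is a common way of testing a backpropagation implementation.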

With all these elements, the backpropagation algorithm can finally be defined as follows:

1. Feedforward: For each layer $l$, compute ${z}^{l}={w}^{l}{a}^{l-1}+{b}^{l}$ and ${a}^{l}=\sigma \left({z}^{l}\right)$.
2. Output error: Compute the error of the last layer as ${\delta}^{L}={\nabla}_{a}C\odot {\sigma}^{\prime}\left({z}^{L}\right)$, where $\odot $ denotes the element-wise product. Using the MSE as the cost function and the sigmoid as the activation function, the derivatives are as simple as ${\delta}^{L}=\left({a}^{L}-{t}^{L}\right)\odot {\sigma}^{\prime}\left({z}^{L}\right)$, where ${t}^{L}$ is the expected output for the network in array format.
3. Backpropagate the error: Compute the error of every layer using the error of the following one. As the error of the last layer can be easily known, the rest of them can be computed iteratively. This is where the power of the algorithm lies: with just a forward and a backward pass, which have roughly the same computational cost, the weights are adjusted closer to the final result. As stated before, the error of the hidden layers can be computed as ${\delta}^{l}=\left({\left({w}^{l+1}\right)}^{T}{\delta}^{l+1}\right)\odot {\sigma}^{\prime}\left({z}^{l}\right)$. For the last layer reached in this backward pass, that is, the input layer, ${z}^{l}={x}^{l}$, the input values.
4. Compute the gradient and update the weights and biases. By putting everything together, the result is:

   $${w}^{l}\to {w}^{l}-\eta \,{\delta}^{l}{\left({a}^{l-1}\right)}^{T},\qquad {b}^{l}\to {b}^{l}-\eta \,{\delta}^{l} \tag{15}$$
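The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the Matlab implementation used in this work: the layer sizes, learning rate, toy data (logical OR), and the one-sample-at-a-time update are all hypothetical choices, and the quadratic cost with sigmoid activations is assumed throughout.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def feedforward(x, weights, biases):
    """Return the activations of every layer, starting with the input."""
    activations = [x]
    for w, b in zip(weights, biases):
        prev = activations[-1]
        z = [sum(wjk * ak for wjk, ak in zip(row, prev)) + bj
             for row, bj in zip(w, b)]
        activations.append([sigmoid(zj) for zj in z])
    return activations


def backprop_step(x, t, weights, biases, eta):
    """One gradient-descent update for a single training pair (x, t)."""
    acts = feedforward(x, weights, biases)
    # Output error: (a^L - t) * sigma'(z^L), with sigma' = a * (1 - a).
    delta = [(a - tj) * a * (1.0 - a) for a, tj in zip(acts[-1], t)]
    for l in range(len(weights) - 1, -1, -1):
        a_in = acts[l]
        if l > 0:
            # Error of the previous layer, using the weights before the update.
            prev_delta = [
                sum(weights[l][j][k] * delta[j] for j in range(len(delta)))
                * a_in[k] * (1.0 - a_in[k])
                for k in range(len(a_in))
            ]
        # Update: dC/dw_jk = a_in_k * delta_j and dC/db_j = delta_j.
        for j in range(len(delta)):
            for k in range(len(a_in)):
                weights[l][j][k] -= eta * a_in[k] * delta[j]
            biases[l][j] -= eta * delta[j]
        if l > 0:
            delta = prev_delta


def cost(data, weights, biases):
    """Quadratic cost over the whole data set, with the 1/(2n) scaling."""
    total = 0.0
    for x, t in data:
        out = feedforward(x, weights, biases)[-1]
        total += sum((o - tj) ** 2 for o, tj in zip(out, t))
    return total / (2.0 * len(data))


rng = random.Random(1)
sizes = [2, 3, 1]  # hypothetical layer sizes: 2 inputs, 3 hidden, 1 output
weights = [[[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [[0.0] * n_out for n_out in sizes[1:]]
data = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
        ([1.0, 0.0], [1.0]), ([1.0, 1.0], [1.0])]  # toy data: logical OR

cost_before = cost(data, weights, biases)
for _ in range(500):
    for x, t in data:
        backprop_step(x, t, weights, biases, eta=0.5)
cost_after = cost(data, weights, biases)
```

Comparing `cost_before` with `cost_after` shows the cost falling as the forward and backward passes are repeated. Note that the error of the previous layer is computed before the current layer's weights are modified, so every update in a pass uses the same set of weights.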