Air Traffic Prediction as a Video Prediction Problem Using Convolutional LSTM and Autoencoder

Abstract: Accurate prediction of future air traffic situations is an essential task in many applications in air traffic management. This paper presents a new framework for predicting air traffic situations as a sequence of images from a deep learning perspective. An autoencoder with convolutional long short-term memory (ConvLSTM) is used, and a mixed loss function technique is proposed to generate better air traffic images than those obtained by using a conventional L1 or L2 loss function. The feasibility of the proposed approach is demonstrated with real air traffic data.


Introduction
Air traffic prediction over different time horizons is one of the important research topics in the field of air traffic management for optimizing the use of airspace and airport resources, in addition to safety purposes such as conflict probes [1,2]. Various methods for aircraft trajectory prediction have been proposed [3-7], but their accuracy is still limited by many factors [8]. One of the major factors contributing to inaccurate trajectory prediction is that trajectories often deviate from the filed flight plans [9]. Various methods have been proposed to infer the true paths of flights, but it is still challenging to effectively incorporate the interacting dynamics of multiple aircraft [10-13]. A more holistic approach might be needed to consider the complex interconnections between trajectories.
Recently, several attempts have been made to consider the overall air traffic situation as image data in a specific airspace. One study proposed a method to detect an anomaly in ADS-B messages [14]. An air traffic image generated from a given ADS-B message was reproduced by an autoencoder model, and the prediction error was then used to determine the authenticity of the messages. In another study, a convolutional neural network (CNN) was used to extract important features from a multichannel air traffic scenario image to evaluate its complexity [15].
Building upon these previous studies, this paper introduces a new paradigm for air traffic prediction. In conventional approaches, the aircraft trajectories are individually predicted and combined to construct a future traffic situation. However, in the proposed method, the overall air traffic situation in a specific airspace at an instantaneous time is considered as an image, and a data-driven model is trained to learn how this image changes over time, as illustrated in Figure 1. In other words, the proposed method considers the task of air traffic prediction as a video prediction problem from a deep learning perspective. Once the model is trained with historical data, future air traffic situations can be predicted as a sequence of subsequent images based on the model. Note that a sequence of air traffic images contains a wealth of information about not only individual aircraft trajectories, but also their interactions with neighboring traffic and other environmental factors, such as the runway configuration. The remainder of this communication is organized as follows. Section 2 describes the analytical framework of the proposed approach for air traffic prediction, and Section 3 demonstrates the feasibility of the proposed approach with real air traffic data. Finally, concluding remarks and future work are presented in Section 4.

Methodology
Let X_t ∈ R^(w×h×c) denote the air traffic situation at time t as an image, for which w, h, and c are the width, height, and channel size of the image, respectively. The proposed problem of air traffic prediction involves generating the next K air traffic images (X_{t+1}, X_{t+2}, ⋯, X_{t+K}) with maximum likelihood given the previous J air traffic images (X_{t−J+1}, X_{t−J+2}, ⋯, X_t):

(X̂_{t+1}, ⋯, X̂_{t+K}) = arg max P(X_{t+1}, ⋯, X_{t+K} | X_{t−J+1}, X_{t−J+2}, ⋯, X_t)    (1)

In this paper, an autoencoder architecture is used to predict future air traffic images. The autoencoder architecture consists of two separate networks: (1) an encoder; and (2) a decoder. The encoder network is used to compress an input sequence (X_{t−J+1}, X_{t−J+2}, ⋯, X_t) into embedding vectors, v_s, while the decoder network acts as a predictor to generate a sequence of the next K image frames (X_{t+1}, X_{t+2}, ⋯, X_{t+K}) for the given embedding vectors, as shown in Figure 2.
To better capture the spatiotemporal correlation in air traffic images, convolutional long short-term memory (ConvLSTM) is used in both the encoder and decoder networks [16]. LSTM has been proven to be stable for handling the temporal features of sequential data, but it easily overlooks the spatial nature of images [17,18]. ConvLSTM is a variant of LSTM in which the input-to-state and state-to-state transitions are replaced by convolutional operations with an M × N grid called a kernel (Figure 3a). Therefore, ConvLSTM is better at retaining both the temporal and spatial information of an image than conventional LSTM, in which a two-dimensional (2D) image is flattened into a 1D vector (Figure 3b).
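A single ConvLSTM update on a 2D state can be sketched as follows (a minimal NumPy sketch on a single-channel image; the naive convolution helper and the weight names follow the standard gate formulation of [16], with bias terms omitted for brevity):

```python
import numpy as np

def conv2d_same(x, k):
    # Naive 'same'-padded 2D cross-correlation, x: (H, W), k: (M, N).
    M, N = k.shape
    ph, pw = M // 2, N // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    H, W = x.shape
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(xp[i:i + M, j:j + N] * k)
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def convlstm_step(x, h, c, W):
    """One ConvLSTM step on a single-channel H×W image. W holds the
    input-to-state (Wx*) and state-to-state (Wh*) kernels for the input
    (i), forget (f), output (o), and candidate (g) gates; every gate is
    computed by convolution, so h and c keep their 2D layout."""
    i = sigmoid(conv2d_same(x, W['Wxi']) + conv2d_same(h, W['Whi']))
    f = sigmoid(conv2d_same(x, W['Wxf']) + conv2d_same(h, W['Whf']))
    o = sigmoid(conv2d_same(x, W['Wxo']) + conv2d_same(h, W['Who']))
    g = np.tanh(conv2d_same(x, W['Wxg']) + conv2d_same(h, W['Whg']))
    c_next = f * c + i * g        # cell state stays an H×W map
    h_next = o * np.tanh(c_next)  # hidden state stays an H×W map
    return h_next, c_next
```

In a conventional LSTM, each gate would instead be a fully connected product over the flattened image, which discards the pixel neighborhood structure that the convolutions above preserve.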

Air Traffic Image Data
A dataset of air traffic images was created based on historical surveillance data for three months of flights over 900 square miles of airspace around Incheon International Airport (ICN), South Korea. The runway configurations of ICN are shown in Figure 4. Due to the close proximity to the border with North Korea, the north-wind configuration (33/34) is dominant. At the time corresponding to the selected data, three runways were in use, but a fourth runway (the westernmost runway in the figure) began to operate in 2021. Note that this study does not exclude any flights in these three months, although flight patterns could vary, especially at night or on weekends. Segregating the dataset by time of day or day of the week could further improve the performance of the proposed method and needs to be explored in future research.
As shown in Figure 5, the time, latitude, longitude, heading, and identification of each flight were extracted every 20 s from the radar surveillance data, and this information was converted into a traffic image combined with the runway configuration data (i.e., location, length, direction) and terrain data [19]. The resolution of the image was chosen to be 164 (w) × 180 (h), with 1 pixel corresponding to 0.03 square miles of airspace at an instantaneous time. This resolution was the maximum that we could apply in this study due to the limited computational power available, and it was ensured that the corresponding size of real airspace for each pixel was small enough not to contain more than one flight.
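The core of this conversion, mapping instantaneous aircraft positions onto the 164 × 180 pixel grid, can be sketched as follows (a simplified, single-channel occupancy version; the bounding-box coordinates are illustrative assumptions, and the paper's actual images additionally encode heading, identification, runways, and terrain):

```python
import numpy as np

# Illustrative bounding box around ICN (assumed values, not from the paper)
LAT_MIN, LAT_MAX = 37.2, 37.7
LON_MIN, LON_MAX = 126.2, 126.8
W, H = 164, 180  # image resolution used in the paper (width x height)

def rasterize(flights):
    """Render a list of (lat, lon) aircraft positions at one time instant
    into a single-channel H×W image; each occupied pixel is set to 1."""
    img = np.zeros((H, W), dtype=np.float32)
    for lat, lon in flights:
        if not (LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX):
            continue  # position outside the modeled airspace
        col = int((lon - LON_MIN) / (LON_MAX - LON_MIN) * (W - 1))
        row = int((LAT_MAX - lat) / (LAT_MAX - LAT_MIN) * (H - 1))  # north up
        img[row, col] = 1.0
    return img
```

Running this every 20 s over the surveillance records yields the image sequence that the model consumes; since each pixel covers at most one flight at the chosen resolution, the occupancy encoding loses no aircraft.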

Constructing the Autoencoder Model
As shown in Figure 2, both the encoder and the decoder contain three layers of ConvLSTM, and each layer contains 64 hidden states with 3 × 3 kernels. The model accepts five image frames as input and predicts the next five image frames as output. An additional output layer (3D CNN) is used to reconstruct an output image sequence after the last layer of ConvLSTM. The specifics of the model are shown in Table 1. For video prediction problems, two types of loss functions are often used: (1) L2 and (2) L1 [21]. As shown in Equations (2) and (3), the L2 loss function involves minimizing the sum of the squares of differences between the RGB values of the true and the predicted images. In contrast, the L1 loss function involves minimizing the absolute differences. In the equations, y(i, j) represents the RGB value of the pixel (i, j).

Mean squared error (L2):    L2 = Σ_(i,j) ( y(i,j) − ŷ(i,j) )²    (2)

Mean absolute error (L1):    L1 = Σ_(i,j) | y(i,j) − ŷ(i,j) |    (3)

where ŷ(i,j) denotes the RGB value of pixel (i,j) in the predicted image.
The L2 loss function could suffer from blurriness of the output images, while the L1 loss function could suffer from the vanishing of dynamic objects [22,23]. To mitigate these problems, we propose a mixed loss function technique called "L2 and then L1". The model was trained with the L2 loss function for the first three-quarters of the total training steps (i.e., 150 epochs), and the L1 loss function was then used for the last quarter of the training steps (i.e., 50 epochs). Note that the total number of epochs in this study was fixed at 200. The idea behind the mixed loss function is that pretraining the model with L2 could be more effective for capturing the variability in traffic patterns before it is trained with L1. Similar techniques, such as using a pretrained model or switching to a different loss function during the training process, are often used to train deep learning models better [24,25]. Although a more thorough theoretical justification of the proposed mixed loss function should be explored in future research, it turned out to be quite effective and allowed the model to suffer less from flights vanishing in the predicted images when compared with the case with only the L1 loss function, as discussed in the next section.
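The "L2 and then L1" schedule can be sketched as follows (per-image losses on raw pixel arrays; the function names are illustrative, and the switch point follows the 150/50 epoch split described above):

```python
import numpy as np

TOTAL_EPOCHS = 200
SWITCH_EPOCH = 150  # L2 for the first 150 epochs, L1 for the last 50

def l2_loss(y_true, y_pred):
    # Sum of squared per-pixel differences, Equation (2)
    return float(np.sum((y_true - y_pred) ** 2))

def l1_loss(y_true, y_pred):
    # Sum of absolute per-pixel differences, Equation (3)
    return float(np.sum(np.abs(y_true - y_pred)))

def loss_for_epoch(epoch, y_true, y_pred):
    """'L2 and then L1': pretrain with L2 to capture traffic variability,
    then fine-tune with L1 to sharpen trajectories."""
    if epoch < SWITCH_EPOCH:
        return l2_loss(y_true, y_pred)
    return l1_loss(y_true, y_pred)
```

In a training loop, `loss_for_epoch(epoch, ...)` simply selects which criterion to backpropagate at each epoch; no other part of the training setup changes at the switch.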
The model was trained with 5000 image sequences using the Adam optimizer [26] with a learning rate starting at 10^−3. With a batch size of 16, the training took approximately 60 h on an NVIDIA RTX 3070 (8 GB memory) with a PyTorch/PyTorch Lightning backend [27,28]. An extra 2000 images were used as a validation set to evaluate different choices for the number of layers of the autoencoder network, the kernel size of ConvLSTM, and the type of activation function of the output CNN in terms of the total computation time, the required memory size, and the quality of the predicted images.
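The encoder-decoder architecture described in this section can be sketched in PyTorch as follows (a simplified sketch: the layer count, hidden-state size, kernel size, and 3D-CNN output layer follow the text, while the zero state initialization, the way the decoder is seeded from the encoder states, and the class names are assumptions made for illustration):

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal ConvLSTM cell: all four gates are computed by one
    convolution over the concatenated input and hidden state."""
    def __init__(self, in_ch, hid_ch, kernel=3):
        super().__init__()
        self.conv = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, kernel,
                              padding=kernel // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.conv(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

class TrafficPredictor(nn.Module):
    """Encoder-decoder sketch: a stack of ConvLSTM layers encodes the J
    input frames, a second stack unrolls K future frames from the final
    encoder states, and a 3D convolution reconstructs the output images."""
    def __init__(self, channels=1, hidden=64, layers=3, horizon=5):
        super().__init__()
        self.hidden, self.horizon = hidden, horizon
        self.encoder = nn.ModuleList(
            [ConvLSTMCell(channels if l == 0 else hidden, hidden)
             for l in range(layers)])
        self.decoder = nn.ModuleList(
            [ConvLSTMCell(hidden, hidden) for _ in range(layers)])
        self.out = nn.Conv3d(hidden, channels, kernel_size=3, padding=1)

    def forward(self, frames):                 # frames: (B, J, C, H, W)
        B, J, _, H, W = frames.shape
        zeros = lambda: torch.zeros(B, self.hidden, H, W, device=frames.device)
        states = [(zeros(), zeros()) for _ in self.encoder]
        for t in range(J):                     # encode the input sequence
            x = frames[:, t]
            for l, cell in enumerate(self.encoder):
                states[l] = cell(x, states[l])
                x = states[l][0]
        outputs, x = [], states[-1][0]         # decoder seeded by encoder states
        for _ in range(self.horizon):          # unroll K future frames
            for l, cell in enumerate(self.decoder):
                states[l] = cell(x, states[l])
                x = states[l][0]
            outputs.append(x)
        seq = torch.stack(outputs, dim=2)      # (B, hidden, K, H, W)
        return torch.sigmoid(self.out(seq)).permute(0, 2, 1, 3, 4)
```

With the sizes in the text, this would be instantiated as, e.g., `TrafficPredictor(hidden=64, layers=3, horizon=5)` on 180 × 164 frames; the final sigmoid keeps the predicted pixel values in [0, 1].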

Results and Discussion
Since the output of the proposed air traffic prediction model is given in the form of images, a proper definition of how to measure its prediction accuracy is needed for quantitative validation. The per-pixel loss used in the training process is not a good indicator of the quality of trajectory prediction, and a different evaluation metric, which would probably require converting the predicted images back to trajectory data (latitude, longitude, heading, and identification), would be necessary. To that end, extensive experimental work and trade-off analysis are usually required [29], and this type of work is beyond the scope of this paper. At the current stage of this study, the focus is a prompt introduction of the proposed approach; therefore, only a visual evaluation of the predicted images is presented. Future work should pursue a more quantitative and exhaustive validation of the proposed approach.
The experimental results for one selected test sample are shown in Figure 6, which shows a series of air traffic images taken every 20 s. Figure 6a shows the model's input images, Figure 6b shows how the aircraft actually flew (i.e., the ground truth), and Figure 6c-e show the future air traffic images predicted by the proposed model with different loss functions (L2, L1, and L2 and then L1, respectively). Among the test samples, this example was chosen because it contained a mix of landing and departing traffic in different directions. In fact, the model's prediction performance varied depending on the geographical locations and flight directions of the traffic, and the air traffic scenario in Figure 6 was considered a good example to illustrate both the strengths and weaknesses of the proposed model.
As indicated by the red box in the figure, the images generated with L2 suffer from blurriness. In contrast, the images generated with L1 show more vivid trajectories, but some flights gradually vanish, as shown in the blue dotted boxes in Figure 6d. The model with the mixed loss function, however, generates more vivid and longer-lasting trajectories than the models with the L1 or L2 loss function alone.
The model is capable of predicting the trajectories of landing/departing traffic to follow the standard arrival routes (STARs) and the standard instrument departures (SIDs), even though such information was not included in the training process. For instance, the landing traffic on runway 33 (AC1-AC4) in the predicted image (Figure 6e) stays on its STAR and maintains separation between successive aircraft. Note that the training data contained flights vectored off SIDs or STARs; therefore, it might not be easy for the model to keep the traffic images on SIDs and STARs, as illustrated by the many blips in the predicted images with the L2 loss function in Figure 6c.
It is interesting to note that the model could predict that AC1 disappears from the predicted images because it has landed in scene #5. The proposed model could also make successful predictions for departing flights (such as AC5-AC7) from the same runway, which make a right U-turn along the SID when they reach 10.6 miles from the runway threshold, as shown in Figure 6e. It should also be noted that the proposed model allows a new aircraft to start to appear in the predicted image (AC10 in Figure 6e, scene #9), even though that aircraft did not exist in the input images. The proposed model still suffers from the loss of some flights in the predicted images. For instance, AC8 departed from runway 34 of ICN and AC9 departed from the nearby Gimpo Airport (GMP), and they vanish more quickly from the predicted images than other flights.
The performance of the proposed approach for air traffic prediction might not yet be sufficient for practical applications. However, it could be significantly improved by exploring other state-of-the-art techniques for video prediction and by collecting more air traffic data. Importantly, this new approach could help overcome one of the main hurdles in air traffic prediction, namely, incorporating simultaneous, multiple interactions between neighboring aircraft. Work remains to be done to further investigate the potential and limitations of the proposed method, in addition to how it can be used in a variety of applied contexts. The proposed approach might not be accurate enough to be used as a standalone application, but it could be used in a hybrid form together with conventional trajectory prediction methods.