1. Introduction
On-time flight performance is an important measure of the service quality of airports and airlines. During the period 2013–2019, while the number of flights in Europe increased by 16% [1], the average departure delay of European flights increased by 41% [2]. Such an increase has a negative impact on the airports’ and airlines’ quality of service. As Eurocontrol forecasts the number of flights to be restored to 2019 levels by 2024 [3], large increases in delay can be expected again in the future. Accurate flight delay predictions will therefore remain central to supporting airports and airlines in offering a high-quality service.
In the past years, several machine learning algorithms have been proposed to predict flight delays. Most studies predict flight delays using (i) binary classifiers (delayed/not delayed flight), (ii) multi-class classifiers (multiple delay classes), or (iii) regression (estimating the delay value).
Binary classifiers are proposed in Kim et al. [4], where recurrent neural networks are used to predict flight delays at airports in the US. The prediction horizon is several hours before the operation. Using this approach, delays are predicted with an accuracy of 0.87 (i.e., the rate of correctly predicted samples). In Lambelho et al. [5], binary classification of flight delays and cancellations is performed for Heathrow airport using three different classification algorithms: LightGBM, Multilayer Perceptron, and Random Forests. The authors predict flight delays and cancellations with an average F1-score of 0.56 using the LightGBM classifier. In Choi et al. [6], the authors propose binary classifiers for flight delays assuming a prediction horizon of five days and one day. The obtained flight delay predictions have an accuracy of 0.80 using the Random Forest classifier.
Multi-class departure delay predictions are obtained in Alonso and Loureiro [7] for Porto airport for a prediction horizon of several hours before the operation. In Chen and Li [8], flight delay is predicted using multi-label Random Forests classification. Flight delay values from routes flown by an aircraft earlier in a day are used to predict flight delay for the routes flown later in a day.
In Kalliguddi and Leboulluec [9], flight delays are estimated hours ahead of the operation using machine learning algorithms that perform regression. The authors consider delay states of the aviation network as features, in addition to flight schedule-related features. The results obtained using Random Forests have a root mean square error (RMSE) of 12.5 min. It is also shown that the delay states have the largest effect on on-time performance. In Manna et al. [10], the obtained flight delay predictions have an RMSE of 8.2 min and 10.7 min when considering departure delays and arrival delays, respectively. In Yu et al. [11], a deep-belief network is used to predict flight delays several hours before the operation. A reduction of 21% in the RMSE is obtained compared to the best benchmark algorithm, k-Nearest Neighbours. Thiagarajan et al. [12] propose both classification and regression algorithms to predict flight delay. Here, the regression approach using Random Forests produced an RMSE of 8.7 min. Ayhan et al. [13] and Shao et al. [14] introduce features based on flight trajectory data. Ayhan et al. [13] predict flight delays for domestic flights in Spain within an RMSE of 4 min. A range of prediction algorithms is employed, of which AdaBoost performs best. Shao et al. [14] find that the features based on trajectory data contribute the most to the predictive accuracy, and the best result is found using LightGBM.
The classification and regression results obtained in these studies generate an estimate for individual flight delay in the form of a class or a point estimate, respectively. The estimates are often evaluated using metrics based on the confusion matrix and metrics such as RMSE/MAE (Mean Absolute Error), respectively. In order to plan flight operations such as gate allocation or runway allocation in a robust manner, however, it is necessary to also consider the uncertainty of the predicted delays of individual flights. Such measures are not included when obtaining delay classes or point estimates, nor can they be derived directly from the commonly used evaluation metrics. Therefore, in this paper, we propose to estimate the probability distribution of flight delays on an individual flight basis, using machine learning algorithms. Such probability distributions can support planners to robustly plan flight operations.
Very few studies estimate the probability distribution of flight delays. The common approach is to fit historical delays to one probability distribution which is assumed to be representative for all considered flights [15,16,17,18,19]. In Mueller and Chatterji [15] and Novianingsih and Hadianti [16], airport and airline delay distributions are obtained by fitting historical delays to classes of probability distributions. Tu et al. [19] introduce a more complex model, where the national airspace delay distribution is assumed to be the sum of seasonal trends, a daily propagation pattern and random residuals. To the best of our knowledge, however, no studies have been performed that estimate a probability distribution for flight delays on an individual flight basis, i.e., probabilistic flight delay prediction.
To illustrate how probabilistic flight delay prediction on an individual basis can be useful for operation optimisation, we integrate these predictions into a probabilistic flight-to-gate assignment problem (FGAP). Şeker and Noyan [20] were among the first to incorporate probabilistic effects in their solution method for the FGAP. The authors evaluate the robustness of flight-to-gate assignments (FGAs) by modelling the departure and arrival flight delays as random variables. A set of scenarios is created, each with random disruptions to flight arrival and departure times. The number of gate conflicts is then minimized for each scenario. The random disruptions utilized in this study model flight delay; however, they are not based on delay predictions. The results of this study provide a general overview of the robustness of the used optimization methods, but it is not possible to directly evaluate the robustness using the actual delay experienced at the airport. Van Schaijk and Visser [21] and L’Ortye et al. [22] determine the probability that a given arriving/departing aircraft is present at a gate, for a range of time values. This is called the aircraft presence probability, and it is obtained using a regression model based on historic data of aircraft gate presence, using the features ‘airline identity’ and ‘origin/destination region of flight’. The aircraft presence probability of an arriving aircraft is in fact the cumulative distribution function (cdf) of the aircraft’s delay. The presence probability of a departing aircraft is the inverted cdf of the aircraft’s delay. Using these presence probabilities, robust flight-to-gate assignments are developed. The approach taken by Van Schaijk and Visser [21] makes use of only two features, leading to a limited variation in the constructed presence probabilities. It is possible to use many more features of the flights that need to be assigned to gates, leading to a more accurate prediction of their gate presence.
In this paper, we obtain probabilistic delay predictions for flights arriving and departing at a regional reference airport. To the best of our knowledge, this is the first time probabilistic predictions for flight delays on an individual flight basis are obtained. We employ two machine learning algorithms: Mixture Density Networks and Random Forest regression. We consider features based on flight schedules available at the reference airport, as well as the weather conditions recorded at the origin/destination airport of the flights. Suitable metrics are proposed to evaluate the performance of the considered machine learning algorithms, which estimate delay probability density functions (pdf). Furthermore, the impact of the choice of hyperparameters for these algorithms is analyzed.
The use of the obtained probabilistic predictions is demonstrated in the context of a robust flight-to-gate assignment problem. First, probabilistic predictions for arrival flight delays and departure flight delays are obtained using machine learning algorithms. These predictions are then used to estimate the probability of an aircraft being present at the reference airport. Lastly, these presence probabilities are integrated into a probabilistic FGAP model that aims to robustly assign arriving/departing aircraft to the gates of the reference airport. Here, robustness refers to the assignment model’s ability to account for potential flight delays. The results show that, by considering flight delay predictions, flights are allocated to gates more robustly relative to the case when no information about flight delays is considered.
The remainder of this paper is structured as follows: in Section 2, the datasets, machine learning algorithms for probabilistic flight delay predictions, and several performance metrics for these algorithms are introduced. The prediction results are then presented and discussed. In Section 3, the obtained probabilistic flight delay predictions are integrated into a flight-to-gate assignment model. Both a deterministic and a probabilistic model for the optimization of the FGAP are formulated. Both models are applied on a short and a long term, and the results regarding the robustness of the obtained solutions are presented and discussed. In Section 4, conclusions and recommendations for future work are provided.
2. Data-Driven Probabilistic Flight Delay Predictions
In this section, we obtain probabilistic flight delay predictions using two machine learning algorithms, Mixture Density Networks and Random Forests Regression.
2.1. Data Description
2.1.1. Flight Schedule Dataset
For this analysis, flight schedules available at Rotterdam The Hague Airport (RTM) between 1 January 2017 and 29 February 2020 are considered. In total, 17,365 departing and 17,336 arriving flights are considered. These flights arrive from and depart to 42 airports across Europe and North Africa. The shortest route included is to London City Airport (LCY), and the longest to Tenerife South Airport (TFS), with an average of 1300 km.
Figure 1 shows a map indicating all airports to or from which flights depart or arrive. The delay distribution of these flights is shown in
Figure 2. The departing flights have an average absolute delay of 17.8 min with a standard deviation of 25.1 min, and the arriving flights have an average absolute delay of 15.4 min with a standard deviation of 26.4 min. Here, the delay is considered to be the positive or negative time difference from the scheduled time of arrival/departure.
2.1.2. Weather Dataset
Using [23], we also consider the weather conditions, such as the temperature, pressure, and wind speed, measured at the origin/destination airport of all flights arriving/departing at RTM in the period 2017–2020. Measurements are available every 30 min.
2.2. Feature Selection
In this section, features are extracted and selected from the datasets described in Section 2.1. Feature selection is performed using the Pearson Correlation Coefficient. The correlation between any two features and the correlation between the features and the target (the flight delay) are calculated for a given training set. The features are selected as follows: for any two features that are correlated by more than the threshold value of 0.7, the feature that has the smallest correlation with the target variable is removed. Table 1 shows the features that have been selected for flight delay prediction. In Table 2, a description is provided for each of the selected features.
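The selection rule described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the use of the absolute value of the correlation, and the order in which feature pairs are visited are assumptions, and the example values are made up.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no correlation information
    return cov / math.sqrt(vx * vy)

def select_features(features, target, threshold=0.7):
    """Drop, from every highly correlated feature pair, the feature that is
    least correlated with the target. `features` maps name -> list of values.
    Assumption: correlations are compared by absolute value."""
    relevance = {name: abs(pearson(col, target)) for name, col in features.items()}
    kept = set(features)
    names = sorted(features)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if a in kept and b in kept:
                if abs(pearson(features[a], features[b])) > threshold:
                    kept.discard(a if relevance[a] < relevance[b] else b)
    return sorted(kept)

# Hypothetical training data: two nearly collinear features and one unrelated one.
example_features = {
    "distance": [300, 500, 800, 1200, 2000],
    "flight_time": [52, 68, 101, 139, 222],
    "wind_speed": [12, 3, 9, 4, 7],
}
example_delay = [5, 9, 12, 20, 31]
print(select_features(example_features, example_delay))
```

In this toy example only one of the two collinear features survives, while the weakly correlated wind speed is retained.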
The features Airport, Airline, Season, Time of day, Day of week, Day of month, Day of year, Airport latitude and longitude, Distance, Month, Year and Scheduled flights 2h and day are obtained or calculated from the flight schedule dataset. The feature Seats is derived from the aircraft type assigned to perform a flight. The features Temperature, Dewpoint, Visibility, Pressure, and Wind speed are obtained from the weather dataset.
The features are either categorical, time-related, or numerical. The categorical features are target encoded based on a binary delay threshold of 15 min. The encoded value of the sample feature is the delay rate of the category to which the sample belongs. For example: if 8 out of 20 samples flying on Tuesdays are more than 15 min delayed, all Tuesday flights are encoded with value 0.4 for the feature Day of the week. The time features are encoded using trigonometric functions to preserve the periodicity. Two features (sine and cosine) are extracted from every time feature. For example, the features Month sine and cosine are calculated using $\sin(2\pi m/12)$ and $\cos(2\pi m/12)$ for a given month $m$.
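Both encodings can be illustrated with a short sketch. The function names are hypothetical, and the cyclic encoding assumes the standard form sin(2πv/P), cos(2πv/P) for a value v with period P (P = 12 for months).

```python
import math

def target_encode(categories, delays, threshold=15.0):
    """Map each category to the fraction of its samples that are delayed
    by more than `threshold` minutes (the delay rate of the category)."""
    counts, delayed = {}, {}
    for category, delay in zip(categories, delays):
        counts[category] = counts.get(category, 0) + 1
        delayed[category] = delayed.get(category, 0) + (1 if delay > threshold else 0)
    return {category: delayed[category] / counts[category] for category in counts}

def cyclic_encode(value, period):
    """Return the (sine, cosine) pair of a periodic feature so that the
    first and last values of a cycle end up close together."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

# The Tuesday example from the text: 8 of 20 flights delayed by more than 15 min.
tuesday_rate = target_encode(["Tue"] * 20, [20.0] * 8 + [5.0] * 12)["Tue"]
print(tuesday_rate)            # delay rate of the category "Tue"
print(cyclic_encode(3, 12))    # month 3 lies a quarter of the way around the cycle
```

With this encoding, December (m = 12) and January (m = 1) are neighbours on the unit circle, which a plain numerical month value does not capture.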
The remaining features are numerically encoded, i.e., the encoded value is the same as the original feature value. Note that the time features are both trigonometrically and numerically encoded. For example, the data field Day of the week yields the features Day of the week sine, Day of the week cosine, and Day of the week. The encoding method of every selected feature is denoted in
Table 1. After encoding, all feature values are scaled to the interval
to eliminate undesired feature domination in neural network classifiers.
Table 1 shows that most features are selected for at least one of the departure/arrival pair, and that the trigonometrically encoded time features are selected more often than the non-encoded time features.
2.3. Machine-Learning Algorithms to Estimate the Probability Distribution of Flight Delays
Following feature selection, two algorithms are proposed to estimate the distribution of flight delays: Mixture Density Networks (MDN) and Random Forests regression (RFR). These algorithms belong to different classes of machine learning algorithms, namely neural networks and decision trees, respectively.
2.3.1. Mixture Density Networks (MDNs)
A Mixture Density Network [24] is a combination of a neural network and a Gaussian mixture model. Given the feature values $x_i$ of flight $i$, an MDN outputs the parameters of each Gaussian component $j$ in the mixture: the weight $\alpha_j(x_i)$, the mean $\mu_j(x_i)$, and the standard deviation $\sigma_j(x_i)$. With these parameters, the probability density function $p(y \mid x_i)$ of the target variable $y$, the flight delay, is determined. In general, the MDN is particularly suitable for estimating multimodal probability distributions [25,26,27,28,29,30,31]. It is therefore able to predict a distribution with peaks at, for example, two separate likely delay values.
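The flight delay probability distribution is constructed as the weighted sum of Gaussian distributions as follows:
$$p(y \mid x_i) = \sum_{j=1}^{m} \alpha_j(x_i)\, \mathcal{N}\!\left(y \mid \mu_j(x_i), \sigma_j(x_i)^2\right),$$
where $p(y \mid x_i)$ is the probability distribution of delay value $y$ given feature values $x_i$ from flight sample $i$, while $\alpha_j(x_i)$, $\mu_j(x_i)$ and $\sigma_j(x_i)$ are the weight, mean, and standard deviation of the $j$th Gaussian component, $j \in \{1, \dots, m\}$, with $m$ the total number of Gaussian components considered for the mixture.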
For any given flight, the features obtained in Section 2.2 are the input to the MDN, while the parameters $\alpha_j(x_i)$, $\mu_j(x_i)$, and $\sigma_j(x_i)$ of the $m$ Gaussian components are the output of the MDN. Thus, there are $3m$ outputs of the MDN. The weights use a softmax activation function, and the standard deviations use an exponential activation function, while the means are unrestricted.
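To illustrate how the raw network outputs become a valid mixture, the following sketch applies the softmax and exponential activations and evaluates the resulting density. It is an illustration under assumptions: the ordering of the outputs (weights, then means, then standard deviations) and the function names are hypothetical, and no actual network is involved.

```python
import math

def mdn_parameters(raw_outputs):
    """Split 3*m raw outputs into mixture weights, means and standard deviations.
    Assumed layout: [m weight logits, m means, m log standard deviations]."""
    m = len(raw_outputs) // 3
    z_alpha = raw_outputs[:m]
    mu = list(raw_outputs[m:2 * m])
    z_sigma = raw_outputs[2 * m:]
    shift = max(z_alpha)                      # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in z_alpha]
    total = sum(exps)
    alpha = [e / total for e in exps]         # softmax: positive weights that sum to one
    sigma = [math.exp(z) for z in z_sigma]    # exponential: strictly positive deviations
    return alpha, mu, sigma

def mixture_pdf(y, alpha, mu, sigma):
    """Density of the Gaussian mixture at delay value y (in minutes)."""
    return sum(
        a * math.exp(-0.5 * ((y - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
        for a, m, s in zip(alpha, mu, sigma)
    )

# Hypothetical raw outputs for a two-component mixture (m = 2, so 6 outputs).
raw = [0.0, 0.0, 5.0, 40.0, math.log(3.0), math.log(10.0)]
alpha, mu, sigma = mdn_parameters(raw)
print(alpha, mu, sigma)
print(mixture_pdf(5.0, alpha, mu, sigma))
```

The activations guarantee that the mixture is a proper density regardless of the values the network produces, which is why the means can be left unrestricted.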
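The neural network is trained using backpropagation, i.e., the network parameters (the weights and biases of each node) are updated using an error function $E$, which is the negative logarithm of the likelihood that the model derived from the output of the current network gives rise to the training data [24]. This likelihood is the product of the likelihood of every data point, given the current network parameters. Formally [24],
$$E = -\sum_{i=1}^{N_{\mathrm{train}}} \ln \left( \sum_{j=1}^{m} \alpha_j(x_i)\, \mathcal{N}\!\left(y_i \mid \mu_j(x_i), \sigma_j(x_i)^2\right) \right),$$
where $y_i$ is the actual delay of training sample $i$, $\alpha_j(x_i)$, $\mu_j(x_i)$ and $\sigma_j(x_i)$ are the weight, mean, and standard deviation of Gaussian component $j$, and $N_{\mathrm{train}}$ is the total number of samples in the training set.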
For every data point fed to the neural network, the derivatives of the error with respect to all network parameters are used to update the weights and biases of the network. Following training, the MDN is applied to a test set and multimodal probability distributions for the delay of each flight in the test set are estimated. The MDN method is illustrated schematically in
Figure 3.
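The error function used for training, the negative log-likelihood of the training data under the predicted mixtures, can be sketched as below. The function names and the sample layout are assumptions made for illustration; practical implementations usually evaluate this quantity with a log-sum-exp formulation to avoid underflow.

```python
import math

def gaussian_pdf(y, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def negative_log_likelihood(samples):
    """Sum of -ln p(y_i | x_i) over all samples.
    Each sample is a tuple (actual_delay, weights, means, standard_deviations)."""
    total = 0.0
    for y, alpha, mu, sigma in samples:
        likelihood = sum(a * gaussian_pdf(y, m, s) for a, m, s in zip(alpha, mu, sigma))
        total -= math.log(likelihood)
    return total

# Hypothetical predictions for one flight with an actual delay of 10 min.
well_placed = [(10.0, [1.0], [10.0], [5.0])]
badly_placed = [(10.0, [1.0], [40.0], [5.0])]
print(negative_log_likelihood(well_placed))
print(negative_log_likelihood(badly_placed))
```

A mixture that places probability mass near the observed delay yields a lower error, which is the behaviour the training procedure rewards.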
2.3.2. Random Forests Regression and Kernel Density Estimation
Random Forests regression (RFR) is a class of decision tree-based machine learning algorithms [32]. The regular RFR algorithm is an ensemble method that combines the results of a number of decision trees. When building each tree, a random subset of the feature values of each training data point is used to make branches. The algorithm outputs a point estimate for the target variable (flight delay) of every test sample by averaging the output values of all considered decision trees. However, for our analysis, we are interested in estimating the probability distribution for the delay of the given flight, rather than a point estimate.
In order to obtain the flight delay distribution of a flight in the test phase, the output values of the decision trees are not averaged, but collected, and a kernel density estimation (KDE) is performed [
33]. A KDE results in a normalized probability density function. Two settings of the KDE are the kernel type and the bandwidth. In our analysis, a bandwidth of
is used to render the estimated distribution smooth. Gaussian kernels have been selected for their generality.
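The step from collected tree outputs to a delay density can be sketched as a Gaussian kernel density estimate. The per-tree predictions below are made-up values, and the bandwidth of 2.0 min is an arbitrary placeholder for illustration, not the setting used in this study.

```python
import math

def kde_pdf(y, tree_predictions, bandwidth):
    """Gaussian kernel density estimate at delay value y, built from the
    point estimates of the individual decision trees."""
    n = len(tree_predictions)
    norm = n * bandwidth * math.sqrt(2.0 * math.pi)
    return sum(
        math.exp(-0.5 * ((y - p) / bandwidth) ** 2) for p in tree_predictions
    ) / norm

# Hypothetical outputs of five trees for one flight (minutes of delay).
tree_outputs = [4.0, 6.0, 7.5, 9.0, 30.0]
print(kde_pdf(7.0, tree_outputs, 2.0))    # near the cluster of tree outputs
print(kde_pdf(20.0, tree_outputs, 2.0))   # between the cluster and the outlier
```

Because every tree contributes one kernel, disagreement between trees translates directly into a wider or multimodal estimated distribution.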
Random Forests regression is a well-established technique that has been applied in many research areas. However, there are very few examples of studies utilizing the algorithm to obtain probability distributions. Förster et al. [34] use quantile values, obtained from Quantile Random Forests, to construct a right-continuous cumulative distribution function of an aircraft’s time-to-fly from the turn onto the final approach course to the runway threshold. Schlosser et al. [35] and Rahman et al. [36] use Random Forests algorithms to obtain probability distributions for precipitation forecasts and drug sensitivity, respectively. Both studies make use of feature probability distributions estimated via maximum likelihood to make splitting decisions when constructing the decision trees. Stochastic variables are introduced during or before the growing of the decision trees. In contrast, in this study, the feature values and splitting decisions are kept deterministic throughout the Random Forests algorithm. In this way, the probability density function is estimated from deterministic feature values without the need for stochastic variables. Furthermore, the working of the original Random Forests regression algorithm need not be changed.
In
Figure 4, an example of obtained probability distributions is shown for both methods. For both distributions, the actual delay value of the flight example is indicated.
2.4. Hyperparameter Tuning
The hyperparameters of the MDN and the RFR prediction algorithms have been optimized using a grid search. The hyperparameters leading to the lowest mean CRPS scores have been selected.
Table 3 shows the selected hyperparameters and their search range. For MDN, a network with three hidden layers of 50 nodes is selected. The output layer of the network consists of 24 nodes, with which an 8-modal Gaussian distribution function is constructed. For RFR, 200 decision trees with a maximum depth of 10 layers are constructed. For every branch split, three out of every four features are considered, and at least seven training samples are required.
2.5. Performance Metrics for Probabilistic Forecasting
As discussed before, many studies perform point estimate prediction on flight delays, such as [9,10,11,12]. The most pervasive metrics for point estimate prediction are the root mean square error (RMSE) and mean absolute error (MAE), measured between the actual point and the predicted point. In this study, probabilistic forecasting is performed. Thus, metrics such as the RMSE and MAE cannot be applied directly, since they cannot be used to compare an entire delay distribution with a point value for actual delay. In this paper, the following six metrics are proposed to evaluate the performance of the MDN and RFR algorithms.
2.5.1. Continuous Ranked Probability Score
Since our aim is to estimate probability distributions for flight delays, a metric is needed that evaluates these distributions. The algorithms aim to obtain a distribution centered on the actual flight delay value, with a small standard deviation. To measure the extent to which the probabilistic prediction algorithms are able to achieve this, the Continuous Ranked Probability Score (CRPS) [37] is proposed. For an estimated flight delay probability distribution $p(y \mid x_i)$ and actual delay value $y_i$, we define:
$$\mathrm{CRPS}_i = \int_{-\infty}^{\infty} \left( F_i(y) - H(y - y_i) \right)^2 \, dy, \qquad (4)$$
where $F_i(y)$ is the cumulative distribution function of $p(y \mid x_i)$ and $H$ is the Heaviside step function.
The CRPS is a generalization of the MAE for probabilistic predictions. It measures the deviation of the estimated delay cumulative distribution function from a step function at the actual delay value. This means that the CRPS attains the value 0 in the limit of a correct point prediction with absolute certainty. Since the CRPS is minimized if the model outputs the ideal distribution, the CRPS is a proper scoring rule. Therefore, it is an indication of both the sharpness and the calibration of the probabilistic forecast [38].
Figure 5 shows a case where the actual delay is 10 min, and includes examples of cumulative distributions with varying sharpness and calibration. Both a reduced sharpness and a reduced calibration in the distribution will increase the CRPS value. Since the CRPS is calculated for every flight in the test set, we introduce the metrics ‘CRPS mean’ and ‘CRPS std’, the mean and standard deviation of all CRPS values, respectively.
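The CRPS of a Gaussian mixture can be approximated by integrating the squared difference between the predicted cdf and the step function numerically. The sketch below uses a midpoint rule; the integration limits, the step size, and the function names are assumptions chosen for illustration.

```python
import math

def mixture_cdf(y, alpha, mu, sigma):
    """Cumulative distribution function of a Gaussian mixture at y."""
    return sum(
        a * 0.5 * (1.0 + math.erf((y - m) / (s * math.sqrt(2.0))))
        for a, m, s in zip(alpha, mu, sigma)
    )

def crps(alpha, mu, sigma, actual, lo=-200.0, hi=400.0, step=0.05):
    """Midpoint-rule approximation of the integral of (F(y) - H(y - actual))^2.
    The range [lo, hi] must cover essentially all of the predicted mass."""
    n = int(round((hi - lo) / step))
    total = 0.0
    for k in range(n):
        y = lo + (k + 0.5) * step
        heaviside = 1.0 if y >= actual else 0.0
        total += (mixture_cdf(y, alpha, mu, sigma) - heaviside) ** 2
    return total * step

# Actual delay of 10 min, as in the example of Figure 5 (predictions are hypothetical).
sharp_and_calibrated = crps([1.0], [10.0], [5.0], 10.0)
less_sharp = crps([1.0], [10.0], [20.0], 10.0)
less_calibrated = crps([1.0], [30.0], [5.0], 10.0)
print(sharp_and_calibrated, less_sharp, less_calibrated)
```

Both a wider distribution and a shifted distribution increase the score, which reflects that the CRPS penalizes a loss of sharpness as well as a loss of calibration.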
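2.5.2. $\mathrm{RMSE}_\mu$ and $\mathrm{MAE}_\mu$
Since the RMSE and MAE are not suitable to assess an estimated flight delay distribution, we propose the variants $\mathrm{RMSE}_\mu$ and $\mathrm{MAE}_\mu$, which are calculated by comparing the mean value of the estimated distribution against the actual delay value. Before introducing the formal notation of these metrics, it is necessary to define the mean value of the estimated distribution. For MDN, the mean is defined as the weighted average of the component means, i.e., $\mu_i = \sum_{j=1}^{m} \alpha_j(x_i)\, \mu_j(x_i)$ is the distribution mean of flight sample $i$, with $\alpha_j(x_i)$ and $\mu_j(x_i)$ the weight and mean of component $j$. When using RFR, the mean is defined as the mean of the point estimates obtained from each decision tree. The distribution means are referred to as $\mu_i$ with $i \in \{1, \dots, N\}$, where $N$ is the number of flights in the test set.
The $\mathrm{RMSE}_\mu$ and $\mathrm{MAE}_\mu$ are then defined as:
$$\mathrm{RMSE}_\mu = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( \mu_i - y_i \right)^2}, \qquad (5)$$
$$\mathrm{MAE}_\mu = \frac{1}{N} \sum_{i=1}^{N} \left| \mu_i - y_i \right|, \qquad (6)$$
where $y_i$ is the actual delay of flight sample $i$. The $\mathrm{RMSE}_\mu$ and $\mathrm{MAE}_\mu$ are used to characterize the average deviation of the mean of the estimated distribution from the actual delay and thus measure only the calibration of the distribution and not the sharpness.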
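The two mean-based error metrics can be computed as in the following sketch, which first reduces each predicted distribution to its mean and then applies the usual RMSE and MAE formulas. Function names and example values are hypothetical.

```python
import math

def mixture_mean(alpha, mu):
    """Mean of a Gaussian mixture: weighted average of the component means."""
    return sum(a * m for a, m in zip(alpha, mu))

def rmse_mae_of_means(distribution_means, actual_delays):
    """RMSE and MAE between the distribution means and the actual delays."""
    errors = [m - y for m, y in zip(distribution_means, actual_delays)]
    n = len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    return rmse, mae

# Hypothetical distribution means and actual delays (minutes) for two flights.
rmse_value, mae_value = rmse_mae_of_means([10.0, 20.0], [13.0, 16.0])
print(rmse_value, mae_value)
print(mixture_mean([0.25, 0.75], [0.0, 20.0]))
```

For Random Forests regression, the distribution mean passed to `rmse_mae_of_means` would simply be the average of the per-tree point estimates.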
2.5.3. Metrics Based on the Standard Deviation
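For MDN, the standard deviation of a multimodal probability density function for a flight sample $i$ is calculated as follows [24]:
$$\sigma_i = \sqrt{\sum_{j=1}^{m} \alpha_j(x_i) \left[ \sigma_j(x_i)^2 + \left( \mu_j(x_i) - \mu_i \right)^2 \right]}, \qquad (7)$$
with $\alpha_j(x_i)$ the weight, $\mu_j(x_i)$ the mean and $\sigma_j(x_i)$ the standard deviation of component $j$, and $\mu_i = \sum_{j=1}^{m} \alpha_j(x_i)\, \mu_j(x_i)$ the mean of the distribution.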
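For the RFR algorithm, the standard deviation of the delay distribution is calculated in a similar fashion, since a Kernel Density Estimation with Gaussian kernels can be considered a multimodal Gaussian as well. This Gaussian has equal weights $1/T$, the RF regression point estimates as means and the KDE bandwidth $h$ as the standard deviation. This leads to the following expression for $\sigma_i$:
$$\sigma_i = \sqrt{h^2 + \frac{1}{T} \sum_{j=1}^{T} \left( \hat{y}_{ij} - \mu_i \right)^2}, \qquad (8)$$
with $T$ the number of estimators used in the algorithm, $\hat{y}_{ij}$ the $j$th point estimate for the delay of flight sample $i$, and $\mu_i$ the mean of these point estimates. The distribution standard deviations are referred to as $\sigma_i$ with $i \in \{1, \dots, N\}$. Having obtained the distribution standard deviations in Equations (7) and (8), we can introduce the two metrics based on these. The first metric is the sample average of the standard deviation:
$$\bar{\sigma} = \frac{1}{N} \sum_{i=1}^{N} \sigma_i, \qquad (9)$$
where $N$ is the number of flights in the test set. In order to define the second metric, we first introduce $w_i$: an indicator of whether the actual delay $y_i$ lies within one standard deviation $\sigma_i$ from the distributional mean $\mu_i$ of that sample. The second metric $W_\sigma$ is then defined as the average of this quantity over all $N$ samples, i.e., the fraction of samples for which the actual delay lies within one standard deviation of the mean. It measures the ability of the probabilistic algorithm to predict a narrow delay distribution on or near the correct delay value. Together with the $\bar{\sigma}$, it characterizes the spread of the estimated distribution and thus measures only the sharpness of the distribution and not the calibration. Formally,
$$W_\sigma = \frac{1}{N} \sum_{i=1}^{N} w_i, \qquad (10)$$
with
$$w_i = \begin{cases} 1 & \text{if } \left| y_i - \mu_i \right| \le \sigma_i, \\ 0 & \text{otherwise.} \end{cases} \qquad (11)$$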
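The six metrics defined in Equations (4)–(11) are used to assess the estimated flight delay distributions obtained using MDN and RFR. The metrics CRPS mean, CRPS std, the mean-based $\mathrm{RMSE}_\mu$ and $\mathrm{MAE}_\mu$, and the average standard deviation $\bar{\sigma}$ have the same unit as the target variable, i.e., minutes of delay, whereas $W_\sigma$, the fraction of samples whose actual delay lies within one standard deviation of the distribution mean, is expressed as a percentage.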
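The two metrics based on the standard deviation can be sketched as follows, treating the kernel density estimate as an equally weighted Gaussian mixture whose component standard deviation equals the bandwidth. The function names, the example values, and the bandwidth are assumptions for illustration only.

```python
import math

def mdn_std(alpha, mu, sigma):
    """Standard deviation of a Gaussian mixture with the given parameters."""
    mean = sum(a * m for a, m in zip(alpha, mu))
    variance = sum(a * (s ** 2 + (m - mean) ** 2) for a, m, s in zip(alpha, mu, sigma))
    return math.sqrt(variance)

def rfr_std(tree_predictions, bandwidth):
    """Standard deviation of a Gaussian KDE over the per-tree point estimates."""
    n = len(tree_predictions)
    mean = sum(tree_predictions) / n
    variance = bandwidth ** 2 + sum((p - mean) ** 2 for p in tree_predictions) / n
    return math.sqrt(variance)

def sharpness_metrics(means, stds, actual_delays):
    """Return the average standard deviation (minutes) and the percentage of
    samples whose actual delay lies within one standard deviation of the mean."""
    n = len(stds)
    mean_std = sum(stds) / n
    within = sum(1 for m, s, y in zip(means, stds, actual_delays) if abs(y - m) <= s)
    return mean_std, 100.0 * within / n

print(mdn_std([0.5, 0.5], [0.0, 10.0], [1.0, 1.0]))
print(rfr_std([0.0, 10.0], 1.0))
print(sharpness_metrics([10.0, 20.0], [5.0, 5.0], [12.0, 30.0]))
```

The first two calls return the same value, which illustrates that the KDE expression is the equal-weight special case of the mixture expression.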
2.6. Results—Probabilistic Flight Delay Predictions
We analyze both departing and arriving flights. For both, train and test sets are constructed using 5-fold cross-validation. The MDN and RFR algorithms have been used to estimate the distribution of the arrival and departure flight delays. The use of weather measurements implies that the prediction horizon associated with these flight delay predictions is at most several days long. Table 4 shows the performance obtained using these algorithms.
Table 4 shows that both MDN and RFR are able to predict departure and arrival delays within an average CRPS of 11 min. The RFR algorithm results in a smaller prediction error than the MDN algorithm. In addition, the delays of the arriving flights are predicted with larger error than those of the departing flights. This is explained by the fact that the bulk of the arriving flights has a considerably smaller delay than the bulk of the departing flights, as seen in Figure 2. Because the algorithms are trained mostly using arrival samples having a small delay, they have a decreased prediction performance for test samples with large delays. This decreased performance contributes greatly to the larger CRPS values.
Furthermore, Table 4 shows that the MDN algorithm predicts flight delays with a larger standard deviation than the RFR algorithm, and in turn the actual delay falls within this standard deviation more often. This is explained by the fact that the RFR algorithm produces a narrower prediction curve than the MDN algorithm, on average.
2.7. Impact of the Choice of the Hyperparameters
In this section, the influence of the values of important hyperparameters on the probabilistic flight delay prediction performance is assessed. The focus lies on the ability of the algorithms to construct a representative delay distribution; therefore, the mean CRPS is used to quantify the performance.
An important hyperparameter of the MDN algorithm is the number of modes. A distribution with more modes allows for more complex shapes, while a distribution with only one mode corresponds to a regular Gaussian distribution. In Figure 6a, the performance of the MDN algorithm for a varying number of modes is shown. Using multiple modes leads to a better performance than using a regular Gaussian function. When adding more than three modes, this improvement stagnates.
An important hyperparameter of the RFR algorithm is the maximum tree depth. A greater tree depth leads to a better distinction between different flights in the training set, but a tree depth that is too large can lead to overfitting. In that case, the error on the test set is not further reduced, while the computational time still increases. In Figure 6b, the performance of the RFR algorithm for varying values of the tree depth is shown. By analyzing a range of values between 10 and 30, it is found that a consistent performance is obtained from a max depth value of roughly 20.
4. Conclusions
In Section 2, two probabilistic forecasting algorithms, Mixture Density Networks and Random Forest regression, have been applied to the problem of flight delay prediction. The algorithms were trained using features extracted from a flight schedule dataset and a weather dataset, which contained data from Rotterdam The Hague Airport. Six performance metrics were defined to evaluate the probabilistic predictions, and the influence of the hyperparameters on the probabilistic prediction performance was investigated.
The results show that it is possible to estimate probability distributions for future flight delays within a CRPS of 11 min, several days in advance. The probabilistic flight delay predictions can provide airport coordinators not only with an estimate for the flight delays of all incoming flights, but also with a measure of the certainty of these estimates. In this way, better informed decisions regarding strategic flight schedules can be made, and on-time performance prediction can be improved.
Subsequently, in Section 3, the probabilistic predictions were used as input to a probabilistic linear programming model optimizing the flight-to-gate assignment problem, with the goal of increasing the robustness of this assignment. The results for the flight-to-gate assignment problem show a reduction of up to 74% in the average number of conflicted aircraft per day by incorporating the probabilistic flight delay predictions. The robustness can be adjusted by varying the maximum permissible overlap probability threshold in the probabilistic optimization model. The application of flight delay predictions to the flight-to-gate assignment problem provides a framework for increasing robustness for flight-to-gate operations at airports.
Future work includes the application of the introduced approach to increasing the robustness of flight-to-gate assignments to a larger airport, taking into account e.g., varying assignment costs and airline gate usage, and, secondly, the integration of probabilistic flight delay predictions into models for other airport operations. Examples are arrival/departure sequencing and scheduling, and electric taxiing operational planning.