Predictive Modeling of Flight Delays at an Airport Using Machine Learning Methods

Abstract: Flight delays represent a significant challenge in the global aviation industry, resulting in substantial costs and a decline in passenger satisfaction. This study addresses the critical issue of predicting flight delays exceeding 15 min using machine learning techniques. The arrival delays at a Turkish airport are analyzed utilizing a novel dataset derived from airport operations. This research examines a range of machine learning models, including Logistic Regression, Naïve Bayes, Neural Networks, Random Forest, XGBoost, CatBoost, and LightGBM. To address the issue of imbalanced data, additional experiments are conducted using the Synthetic Minority Over-Sampling Technique (SMOTE), in conjunction with the incorporation of meteorological data. This multi-faceted approach ensures robust forecast performance under varying conditions. The SHAP (SHapley Additive exPlanations) method is employed to interpret the relative importance of features within the models. The study is based on a three-year period of flight data obtained from a Turkish airport. The dataset is sufficiently extensive and robust to provide a reliable foundation for analysis. The results indicate that XGBoost is the most proficient model for the dataset, demonstrating its potential to deliver highly accurate predictions with an accuracy of 80%. The impact of weather factors on the predictions is found to be insignificant in comparison to scenarios without weather data in this dataset.


Introduction
Air travel, renowned for its remarkable speed compared to other modes of transportation, is also highly susceptible to unforeseen delays. The airline industry often formulates flight plans well in advance, with some airlines planning up to 300 days ahead [1]. Nevertheless, this extensive planning period is beset by variables and circumstances that can disrupt flight schedules, resulting in delays.
The consequences of flight delays have a profound impact on a multitude of stakeholders. In terms of economic impact, flight delays cost the United States (U.S.) an estimated 33 billion dollars in 2019 [2]. Furthermore, sectors that are indirectly affected by flight delays, including retailers, accommodation businesses, and tourism, collectively incur costs that exceed two billion U.S. dollars. Beyond these financial losses, the consequences of flight delays are felt by all parties involved, including passengers, airlines, airport operators, and the environment. Repetitive and unpredictable delays have been identified as one of the most significant challenges faced by both airlines and passengers [3,4]. For passengers, flight delays can derail travel plans, leading to missed connections and delayed vacations and business appointments, which in turn exacerbates passenger dissatisfaction [5]. Furthermore, persistent delays can prompt leisure travelers to explore alternative transportation options, which in turn affects the number of passengers carried by airlines [6,7]. Airlines are also affected by delays in terms of their operating costs. The occurrence of persistent delays on specific routes has the potential to erode an airline's appeal to customers over time [8]. Even airports are not immune to the negative effects of delays, as passenger satisfaction levels decline significantly when delays occur [5,9]. Another crucial aspect to be considered is the environmental impact of flight delays. It was estimated that U.S. airlines expend a considerable 2 billion liters of jet fuel annually as aircraft remain idling on runways due to flight delays [10]. This heightened fuel consumption not only results in economic losses but also amplifies carbon emissions, thereby exacerbating environmental concerns [11]. In light of these considerations, it is evident that flight delays have a multitude of adverse effects on stakeholders, encompassing both financial costs and other detrimental consequences.
To mitigate the adverse effects of flight delays, it is of the utmost importance to be able to accurately predict when such delays are likely to occur. The primary objective of this study is to analyze the arrival performance of a selected airport and to predict whether flights will be delayed by more than 15 min, utilizing readily available weather and airport data. Arrival delays not only affect the onboard passengers but also have an impact on future flight plans utilizing the same aircraft. Given the uniqueness of the dataset in comparison to those utilized in the existing literature, a variety of machine learning techniques, including Logistic Regression, Naïve Bayes, Neural Networks, XGBoost, Random Forests, CatBoost, and LightGBM, are explored to determine the most suitable approach.
This paper proceeds with a detailed examination of the research stages. The study commences with the explicit definition of the research question, followed by a comprehensive review of the existing literature. The methods employed are then elucidated, with detailed information provided about the dataset. Finally, the findings are presented and discussed in depth, leading to recommendations for future research.

Problem Definition
Air travel plays a pivotal role in modern society, offering unparalleled convenience and efficiency in long-distance transportation. One of the most persistent challenges currently facing the aviation industry is the occurrence of flight delays. Such delays have the potential to disrupt travel plans, lead to additional costs, and ultimately result in passenger dissatisfaction. In order to address this issue, the objective of this study is to develop a predictive model for flight delays at a Turkish airport using machine learning techniques.
Flight delays are commonly defined as instances where flights depart later than their scheduled departure times. It is, however, important to note that the Federal Aviation Administration (FAA) does not categorize delays of 15 min or less as tardy. Accordingly, for the purposes of this study, only those instances of flight delay exceeding the 15 min threshold are considered as delayed, in alignment with the FAA standards [12].
In order to mitigate potential biases and improve the robustness of our models, we explore the use of a resampling technique called the Synthetic Minority Over-Sampling Technique (SMOTE). SMOTE is a data augmentation technique that generates synthetic samples by interpolating between existing minority class instances, thereby addressing the class imbalance issue. This process helps to prevent models from being biased towards the majority class [13]. In our case, the objective of SMOTE is to achieve a more balanced representation of both on-time and delayed flights by generating synthetic instances of the minority class (delayed flights).
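The interpolation at the heart of SMOTE can be illustrated with a minimal sketch (in practice a library implementation such as imbalanced-learn's SMOTE would be used; the toy "delayed flight" feature vectors below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(42)

def smote_sample(minority, n_new, k=3):
    """Minimal SMOTE sketch: for each synthetic point, pick a minority
    instance, pick one of its k nearest minority neighbours, and
    interpolate at a random position on the segment between them."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        x = minority[i]
        d = np.linalg.norm(minority - x, axis=1)   # distances to all minority points
        neighbours = np.argsort(d)[1:k + 1]        # skip the point itself
        j = rng.choice(neighbours)
        gap = rng.random()                         # uniform in [0, 1)
        synthetic.append(x + gap * (minority[j] - x))
    return np.array(synthetic)

# Hypothetical 2-feature representations of four delayed flights.
delayed = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2], [0.9, 1.9]])
new_points = smote_sample(delayed, n_new=6)
```

Because each synthetic point lies on a segment between two real minority points, the new samples stay inside the region the minority class already occupies rather than being arbitrary noise.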
There is little consensus regarding the impact of meteorological conditions on aviation operations. According to the Bureau of Transportation Statistics (2022), while extreme weather conditions affect flight conditions 4% of the time, in 2020, 45.8% of flights were delayed due to weather conditions. The financial impact of flight delays due to weather conditions and other circumstances differs between airlines. In the event of a flight delay resulting from inclement weather, the airline is not required to make additional payments to passengers, and it is possible that some airlines may attempt to exploit this situation. Consequently, a novel experiment was devised incorporating weather conditions into the available data, with the objective of enhancing the accuracy of flight delay predictions.
Feature selection is a crucial aspect of machine learning model development. This study investigates whether comparable predictive results can be achieved using fewer features. In particular, SHAP is explored as a method for quantifying each feature's contribution, so that less informative features can be identified and discarded. This analysis aims to determine whether model performance can be optimized by reducing the dimensionality of the dataset without sacrificing predictive accuracy.
In conclusion, this article addresses the complex issue of flight delay prediction at a Turkish airport. By considering the guidelines set forth by the FAA, meteorological conditions, data imbalances, and feature selection techniques, our objective is to gain a comprehensive understanding of the factors contributing to flight delays. Ultimately, we aim to develop accurate predictive models that can assist the aviation industry in optimizing operations and enhancing passenger experiences.
This study makes several significant contributions to the field of flight delay prediction. Firstly, we analyzed the performance of various machine learning models, including Logistic Regression, Naïve Bayes, Neural Networks, Random Forest, XGBoost, CatBoost, and LightGBM, with the objective of identifying the optimal model for our dataset. This diversity allowed for a comprehensive evaluation of model performance. Secondly, SMOTE was employed to address the issue of imbalanced datasets, and the effect of this technique was assessed. Additionally, the integration of weather data into the models enabled the examination of the impact of weather conditions on flight delays, thereby enhancing the overall accuracy of the models and the understanding of the contribution of weather variability to the predictions.
Furthermore, we employed a novel dataset that has not been previously explored in the literature for the purpose of predicting flight delays, which enhances the originality of our study and its contribution to the field. Additionally, the performance of the models was evaluated on both the training and the test datasets, with recommendations provided for improving the models' generalization ability. This will facilitate more reliable and consistent predictions in real-world applications. Finally, we employed SHAP values to identify the key features affecting flight delays, thereby enhancing the interpretability of the models and providing more reliable and understandable predictions for decision-makers.

Literature Review
In the field of flight delay prediction, a variety of methods have been employed in the literature. These methods can be categorized into regression models, probability-based models, operational research models, network-based models, and machine learning methods.
Regression models have been used extensively, such as hybrid regression and time-series models applied to Frankfurt Airport, explaining 60-69% of variability [14], and a multivariate regression model achieving a precision of 5.3 min [15]. Probability-based approaches have also been utilized, including a probabilistic model combined with genetic algorithms to estimate delay distributions at Denver International Airport [16], and survival analysis with the Cox model to analyze flight delays probabilistically [17].
Operations research techniques have also played a role, with dynamic optimization models for flight operations [18], optimization of aircraft routes and departure times using mixed-integer programming [19], and multi-objective optimization leading to a 12-31% improvement in flight delays [20]. Network-based models are also used, such as a clustered-airport modeling approach for optimizing flight operations [21], the minimum-cost flow problem for system design [22], and the investigation of delay propagation among airports with Granger causality [23,24].
Despite the historical utilization of these methods, their practical application has not been widely adopted. In recent years, there has been a notable shift towards machine learning techniques. This transition has been driven by the abundance of big datasets in the aviation industry and the demonstrated success of machine learning in pattern recognition.
The earliest documented use of machine learning methods in flight delay prediction identified in our literature review dates to 2014. Since 2016, research employing machine learning techniques has been on the rise, with an evolving array of methods. Table 1 provides a concise summary of findings from these studies, encompassing information on data sources, datasets used, and selected techniques. Among the methods employed, Random Forest (RF) stands out as the most utilized technique in the papers reviewed. Following closely, Decision Trees (DT), Gradient Boosting (GB), and Artificial Neural Networks (ANN) have been preferred in successive order. In terms of dataset origins, the United States predominantly serves as the source, with select studies encompassing data from Egypt, Iran, Brazil, and various European countries or individual airline companies. While most publications cover information on multiple airports, some studies focus on just one or a few specific airports.
In conclusion, each method for flight delay prediction has its own set of strengths and weaknesses. Regression models and probability-based approaches offer simplicity and flexibility but may encounter difficulties when faced with complex relationships and data requirements. Operations research models offer powerful optimization capabilities but can be complex and difficult to scale. Network-based models offer a comprehensive view of delay propagation but are data-intensive and computationally demanding. Machine learning models are distinguished by their high accuracy and capacity to handle complex relationships. However, these models require substantial data and computational resources.
In the digital age, data are widely acknowledged as a form of currency. Consequently, it is of paramount importance for businesses to make predictions based on their own data sources. In this context, flight data belonging to a single airport operator were utilized in this study. As this dataset has not been previously used in the literature, seven different machine learning techniques were applied. The choice of the ANN technique was driven by its frequent preference in the literature. Naïve Bayes (NB) and Logistic Regression were selected due to their suitability for binary classification tasks. Furthermore, the inclusion of XGBoost, LightGBM, and CatBoost was motivated by their frequent comparison in the context of gradient boosting-based techniques.
This study addresses several notable gaps in the existing literature on flight delay prediction. By utilizing a novel dataset from a Turkish airport, which has not been previously explored, the study shows originality and provides a new perspective on flight delay prediction. Furthermore, it conducts a comprehensive comparison of various machine learning models, including Logistic Regression, Naïve Bayes, Neural Networks, Random Forest, XGBoost, CatBoost, and LightGBM, allowing for a thorough evaluation of model performance across different scenarios. The integration of weather data into the predictive models enables an examination of the impact of weather conditions on flight delays, offering a more holistic view of the influencing factors. Furthermore, this study addresses data imbalance by employing SMOTE to balance the dataset, demonstrating an innovative approach to handling data imbalance. Additionally, feature importance analysis using SHAP values identifies key features affecting flight delays, enhancing the interpretability of the models and providing more reliable and understandable predictions for decision-makers.

Table 1 (excerpt). Data source; techniques; key findings:
- (preceding entry, truncated) The analysis revealed that DT achieved the highest accuracy with a score of 0.70.
- [34] Egypt Air flight delay data; REPTree, Forest, Stump, J48: REPTree achieved the highest accuracy of 80.3%. Among rule-based classifiers, PART demonstrated the best accuracy, reaching 83.1%. Considering both accuracy and running time, REPTree emerged as the most efficient classifier.
- [35] 70 airports in the U.S.; Logistic Regression (LR): the proposed approach achieved an accuracy rate of approximately 80%.
- [36] Brazilian National Civil Aviation Agency; ANN, RF, SVM, NB, KNN: incorporating balancing techniques into the models led to a noteworthy enhancement, resulting in an accuracy rate of approximately 60%.
- [37] Beijing International Airport; Deep Belief Network (DBN) refined with support vector regression (SVR): approximately 99.3% of the predicted values fell within a 25 min deviation from the observed values.
- [38] U.S.; RF combined with an approximate delay propagation model: the model achieved a departure delay accuracy of 0.83 and an arrival delay accuracy of 0.87.
- [39] Air traffic data systems; DT, NB, Classification Rule Learners, SVM, Bagging Ensemble, Boosting Ensemble, RF: for predicting weather-related and volume-related Ground Delay Programs, the Boosting Ensemble algorithm was the most effective, achieving the highest Kappa statistic of 0.66.
- [40] Low-cost airlines in Europe; LSTM, LR, BM, TL: the study highlighted the potential for small airlines to initiate model training through transfer learning and subsequently fine-tune the models using their own data.

Methods
One of the initial decisions to be made when utilizing machine learning techniques is how to conduct the learning process. In this study, all methods were employed with the supervised learning technique, given that the outputs of the dataset were available. Accordingly, 70% of the available data were utilized in the training phase and 30% in the testing phase. Another issue that requires resolution is the selection of hyperparameter values; Bayesian optimization was used for hyperparameter tuning in this study. The advantage of the Bayesian technique is that it performs a probabilistic search rather than evaluating every hyperparameter combination. In other words, the Bayesian technique selects several hyperparameter settings, evaluates their quality, and decides which samples to try next, so results can be obtained more rapidly. Finally, all the selected machine learning techniques are suitable for binary classification, i.e., for predicting the status of aircraft (late/on time). In the rest of this section, a brief introduction of each technique and the reason for its selection is presented.
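The 70/30 supervised split can be sketched with scikit-learn on stand-in data (the feature values and delay labels below are synthetic, not the study's dataset; a stratified split additionally preserves the late/on-time ratio in both partitions):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the flight dataset: 200 flights, 3 features,
# binary label (1 = delayed by more than 15 min). Values are illustrative.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (rng.random(200) < 0.2).astype(int)   # roughly 20% delayed, mimicking imbalance

# 70/30 split as in the study; stratify=y keeps the delay ratio
# similar in the training and testing partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0)
```

Bayesian hyperparameter tuning would then be run on the training partition only, so that the 30% test partition remains untouched until the final evaluation.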

Naïve Bayes
The Naïve Bayes algorithm is a straightforward probabilistic classifier that calculates probabilities based on the frequency and combinations of values within a given dataset. The fundamental principle of this approach is the assumption of strong independence, which states that the presence or absence of any given class feature has no bearing on the presence or absence of any other feature. In essence, the Naïve Bayes algorithm operates using conditional probability [50,51]. One of its advantages is its ability to generate classifications from large datasets without the need for complex iterative parameter estimation procedures [52]. It was therefore chosen for its simplicity and effectiveness with large datasets, which makes it a good baseline model.
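The conditional-probability mechanics can be shown in miniature; the counts and per-feature frequencies below are purely hypothetical and only illustrate how the independence assumption turns frequency estimates into a posterior:

```python
# Naïve Bayes in miniature: estimate P(delayed | evening, rain) from class
# frequencies and per-class feature frequencies, assuming the features
# are conditionally independent given the class. All counts are hypothetical.
n_delayed, n_on_time = 200, 800
p_delayed = n_delayed / (n_delayed + n_on_time)          # prior = 0.2
p_on_time = 1 - p_delayed

# Per-class conditional probabilities (hypothetical frequencies).
p_evening_given_delayed, p_evening_given_on_time = 0.6, 0.3
p_rain_given_delayed, p_rain_given_on_time = 0.5, 0.1

# Independence assumption: per-feature likelihoods simply multiply.
score_delayed = p_delayed * p_evening_given_delayed * p_rain_given_delayed
score_on_time = p_on_time * p_evening_given_on_time * p_rain_given_on_time
posterior_delayed = score_delayed / (score_delayed + score_on_time)
```

This is exactly why no iterative parameter estimation is needed: the model is fully specified by frequency counts.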

Artificial Neural Networks
Artificial neural networks (ANNs) are designed to mimic the functionality of the human brain. Specifically, ANN models emulate the electrical activity of the brain and nervous system. Due to their structure, they can offer solutions to various problems, such as nonlinear and stochastic problems [53]. Artificial neurons or nodes, which are information processing units grouped in layers and joined by synaptic weights (connections), constitute perceptron-type neural networks. The technique has the advantages of working with incomplete data and being fault-tolerant, so that the corruption of one or more cells in the network does not prevent it from producing output [54]. An ANN was chosen for its ability to capture complex, non-linear relationships in the data, as shown in [30,32,36].

Logistic Regression
Logistic regression is a well-established technique first developed by D. R. Cox [55]. It is sensitive to the problem of overfitting, which is usually addressed with L2 regularization [44]. The technique assumes linearity only between the logits of the explanatory variables and the output [56]. The method can be used only when the dependent variable is categorical. Unlike artificial neural networks, practitioners can see the relationship between the variables and the output. Another advantage is that a solution can be found in a short time, as less processing time is required [57]. Therefore, logistic regression is preferred for its interpretability and speed, which makes it suitable for quick insights and baseline comparisons, as seen in [35,44].
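A minimal scikit-learn sketch on synthetic data illustrates both points: L2 regularization is controlled via the penalty and C parameters, and the fitted coefficients remain directly inspectable, unlike the weights of a neural network:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy binary task: feature 0 is informative, feature 1 is pure noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + 0.3 * rng.normal(size=300) > 0).astype(int)

# penalty="l2" is the default; C is the inverse regularization strength
# that tempers overfitting.
clf = LogisticRegression(penalty="l2", C=1.0).fit(X, y)

# The coefficients can be read off directly: feature 0 should carry
# almost all the weight, and its sign shows the direction of the effect.
coefs = clf.coef_[0]
```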

Random Forests
Random Forests (RFs), introduced by [58], are a model type that provides a transparent and understandable estimation, since they are based on the idea of using decision trees in combination. The main idea of the Random Forests method is to generate a large number of predictive trees and then aggregate randomly selected subsets of the generated trees. The fundamental concept underlying the method is bagging: the combination of decision trees and the bagging technique constitutes the Random Forests method. The RF method was chosen for its robustness and its ability to handle high-dimensional data and interactions between features with high accuracy, as demonstrated in [25,31,38].

XGBoost
XGBoost was developed to improve the gradient-boosted decision tree (GBDT) technique and has been successful in many machine learning competitions. XGBoost has many advantages over classical methods [59]. The first is that it contains a novel tree-learning technique for sparse datasets. Second, it employs a theoretically justified weighted quantile sketch approach that allows approximate tree learning to handle instance weights. Parallel and distributed computing reduces the training time, allowing researchers to conduct model explorations more quickly. Finally, out-of-core computation enables data scientists to process massive amounts of information. As one of the most widely preferred algorithms, XGBoost was included for its high performance and efficiency, especially in handling sparse datasets and complex patterns, as reported by [29,31,49].

LightGBM
LightGBM is another widely used technique that has gained recognition, particularly in classification problems. Its stated motivation was the inadequacy of the available methods for large datasets with many features [60]. Building on XGBoost, it takes a different approach to classification problems by adding and combining two techniques: gradient-based one-side sampling and exclusive feature bundling. LightGBM was thus included for its speed and efficiency, especially with large datasets with many features, as evidenced by [31,42].

CatBoost
CatBoost is a technique developed especially for categorical features. The method adds two new fundamental functions: a permutation-driven ordered boosting algorithm and a novel algorithmic approach specifically for processing categorical information [61]. It is claimed that CatBoost can also handle cases where the number of categories is too large for conventional GBDT models. It achieves this by dividing the dataset into random subsets, converting the labels to integer values, and converting the remaining categorical features to numerical values. It was selected for its superior handling of categorical data and its avoidance of overfitting through ordered boosting, as demonstrated in [43].
By employing a diverse set of models, we aimed to leverage the strengths of each technique to develop a comprehensive and accurate predictive model for flight delays.

Dataset
Although aviation data are published as open source in some countries, such a secondary data source is not available in most countries. For this reason, secondary data were obtained from an airport while conducting this study. The dataset contained the records of 40,077 flights (8625 late, 31,452 on time) covering three years (2016, 2017, and 2018). The dataset we used is unique, which allowed for the generalization and evaluation of results when working on different data.
While creating the dataset, care was taken to ensure that the features recorded at the airport were similar to those used in the literature for estimating flight delays. Table 2 shows the features obtained from the airport, the data types of these features, and the studies in the literature that made flight delay estimations using relevant data. The study included meteorological data and data pertaining to the airport. The relevant weather data were obtained via the online website rp5.ru; the weather observation closest in time to each flight was extracted from a source containing weather information updated every 20 min. In addition, the variables presented in Table 2 were employed in the estimation of flight delays in conjunction with meteorological conditions. Table 3 presents the weather-related features utilized and the types of data associated with each feature. Furthermore, the flight type variable was included, which indicated whether the flight in question was scheduled or not; non-scheduled flights encompassed special, ferry, and technical flights. Encoding of categorical data and tools for managing the unbalanced dataset were required throughout the data preparation process. There were 7 unique observations for the day of the week feature. For the other categorical features, there were 31 for day of the month, 12 for month, 97 for origin (departure airport), 17 for mean wind direction, 419 for total cloud cover, and 2 for flight type. For proper modeling, all categorical features were converted to numerical values. One-hot encoding, which is frequently used for categorical data conversion, was selected for this purpose. Using this approach, new features were created from the unique observations of each categorical feature; after encoding, the number of features used in the study was 595. A further data preparation step was required for the imbalanced dataset. A dataset is imbalanced if the classification categories are not relatively equally represented; due to the lower number of delayed flights, this issue was present in the dataset of this study. SMOTE [67], one of the recommended approaches, was preferred to address this problem. The minority class disadvantage is minimized with this technique, which generates synthetic data similar to the minority class. The data acquired without utilizing SMOTE are also presented to further analyze the outcomes obtained.
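The one-hot expansion can be sketched with pandas (the example rows are hypothetical; the same mechanism turns the airport's categorical levels into the hundreds of binary features described above):

```python
import pandas as pd

# One-hot encoding in miniature: each unique category becomes its own
# binary column, which is how 7 + 31 + 12 + ... categorical levels
# expand into several hundred features. Example rows are hypothetical.
flights = pd.DataFrame({
    "day_of_week": ["Mon", "Tue", "Mon"],
    "flight_type": ["scheduled", "ferry", "scheduled"],
})
encoded = pd.get_dummies(flights, columns=["day_of_week", "flight_type"])
# 2 unique days + 2 unique flight types -> 4 binary columns.
```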

Results
We conducted four distinct trials using the seven selected methods on our prepared dataset to predict flight delays using Python. These trials were designed to examine the impact of different factors on prediction accuracy. In the first trial, we estimated flight delays solely by optimizing hyperparameters, without incorporating weather data. The second trial added the SMOTE technique on top of the same hyperparameter optimization. In the third and fourth trials, we introduced weather data into the dataset; the third trial did not include the SMOTE technique, while the fourth trial employed it.

Performance Evaluation
The effectiveness of the learning models was assessed during both the training and the testing phases for all trials. The success of the techniques was determined through a comprehensive evaluation using multiple performance metrics, each offering a different insight into a model's performance:
Accuracy: This metric measures the overall correctness of a model's predictions and is widely used in the literature. It is computed from the confusion matrix. However, for imbalanced datasets, relying solely on accuracy can be misleading [66].
Recall (sensitivity): Recall represents the percentage of relevant instances that were correctly identified by the model. It is a vital metric for cases where finding all relevant instances is crucial.
Precision (positive predictive value): Precision indicates the percentage of retrieved instances that are relevant to the problem at hand. High precision is important when minimizing false positives is essential.
F-1 score: The F-1 score is the harmonic mean of precision and recall. It provides a balanced evaluation that considers both false positives and false negatives.
Balanced error rate: This metric offers an assessment of error that considers class imbalances and can be particularly valuable for imbalanced datasets.
These metrics collectively offer a robust evaluation of a technique's performance, considering different aspects of predictive accuracy and suitability for the task.
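For concreteness, all of the metrics above can be computed directly from confusion-matrix counts (the counts below are hypothetical, with "delayed" as the positive class):

```python
# Hypothetical confusion-matrix counts for a delay classifier.
tp, fp, fn, tn = 60, 20, 40, 280   # "delayed" is the positive class

accuracy  = (tp + tn) / (tp + fp + fn + tn)
recall    = tp / (tp + fn)                   # sensitivity
precision = tp / (tp + fp)                   # positive predictive value
f1        = 2 * precision * recall / (precision + recall)
# Balanced error rate: mean of the per-class error rates, so it is not
# dominated by the majority (on-time) class.
ber = 0.5 * (fn / (tp + fn) + fp / (fp + tn))
```

Note how the example exposes the pitfall mentioned above: accuracy is 0.85 even though 40 of 100 delayed flights were missed, which the recall and balanced error rate make visible.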
The results of the first trial, conducted with only the airport data, are presented in Table 5. XGBoost and CatBoost consistently demonstrated strong performance across multiple metrics in both the training and the testing phases, which makes them promising candidates for predicting flight delays. Logistic Regression, ANN, and LightGBM also delivered reasonable results, while Naïve Bayes performed less effectively in most metrics. In the second trial, the same dataset was balanced with SMOTE; the results are presented in Table 6. Overall, the application of the SMOTE technique helped improve recall in several models during the training phase, at the expense of precision in some cases. However, it did not consistently improve model performance in the testing phase, with recall often decreasing. XGBoost, CatBoost, and LightGBM continued to demonstrate strong performance even with SMOTE, suggesting their robustness in dealing with both imbalanced and balanced data. Table 7 reflects the varying performance of the models when combined airport and weather data were used. XGBoost, CatBoost, and LightGBM continued to emerge as strong candidates, demonstrating excellent accuracy, recall, precision, and balanced error rates. On the other hand, Naïve Bayes excelled in recall but at the expense of precision and accuracy. Logistic Regression and ANN showed reasonable performance, while Random Forest exhibited suboptimal results with zero recall, which made it less suitable for this dataset and task. Table 8 reveals the performance of the models when using a combination of weather and airport data with the application of the SMOTE technique. XGBoost, CatBoost, and LightGBM continued to demonstrate strong performance, showcasing high accuracy, recall, precision, and balanced error rates. Naïve Bayes exhibited exceptionally high recall but at the cost of precision, which makes it a suitable choice if recall is the top priority. Logistic Regression and ANN showed reasonable and balanced performance, while Random Forest demonstrated suboptimal results, with low recall and precision. The techniques that showed signs of overfitting were Naïve Bayes, ANN, XGBoost, Random Forest, CatBoost, and LightGBM: these models showed high performance on the training data but significantly lower performance on the test data, indicating that they learned specific patterns in the training data that did not generalize well to new data.

Analyses of Feature Importance
In this section, we delve into the analysis of feature importance. The forecasting model employed in this study incorporated 595 features, resulting in increased decision complexity. Feature selection is therefore of critical importance for determining the extent to which a simplified model can produce results comparable in quality to those of a more comprehensive model.
Various feature selection approaches exist in the literature, but in this study, we focused on SHAP (SHapley Additive exPlanations) values. SHAP values, originating from cooperative game theory, are widely used to enhance the transparency and interpretability of machine learning models. They aim to measure each input feature's contribution to individual predictions or to the final model outcome [68].
A defining property of SHAP values is additivity: the SHAP values of all input features always sum to the difference between the model's output for the prediction under examination and the baseline (expected) output obtained when none of the features is considered.
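This additivity property can be verified directly on a toy example. The sketch below computes exact Shapley values for a three-feature value function by averaging each feature's marginal contribution over all feature orderings; the feature names and model outputs are invented for illustration and are not the paper's model.

```python
from itertools import permutations

def shapley_values(features, value):
    """Exact Shapley values: average each feature's marginal
    contribution over all orderings of the feature set."""
    phi = {f: 0.0 for f in features}
    orders = list(permutations(features))
    for order in orders:
        coalition = set()
        for f in order:
            before = value(frozenset(coalition))
            coalition.add(f)
            phi[f] += value(frozenset(coalition)) - before
    for f in phi:
        phi[f] /= len(orders)
    return phi

# Toy "model output" as a function of which features are known.
outputs = {
    frozenset(): 0.30,                          # baseline (expected) output
    frozenset({"sched"}): 0.55,
    frozenset({"pax"}): 0.35,
    frozenset({"dist"}): 0.32,
    frozenset({"sched", "pax"}): 0.62,
    frozenset({"sched", "dist"}): 0.60,
    frozenset({"pax", "dist"}): 0.40,
    frozenset({"sched", "pax", "dist"}): 0.70,  # full prediction
}
phi = shapley_values(["sched", "pax", "dist"], outputs.__getitem__)

# Additivity: the Shapley values sum to prediction minus baseline.
total = sum(phi.values())
print(total, 0.70 - 0.30)
```

The exact computation enumerates all orderings and so scales factorially; practical SHAP implementations use model-specific approximations (e.g., TreeSHAP for gradient-boosted trees), but the additivity guarantee above holds either way.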
Figure 1 indicates the relative importance of each feature in influencing the model's output. According to the SHAP results shown in Figure 1, the variables in order of importance were scheduled time, number of passengers (PAX), atmospheric pressure at the arrival airport (P0), travel distance, the arrival airport's temperature, relative humidity, dew point temperature, and mean wind speed, the cargo carried by the flight, and flights departed from Sabiha Gökçen Airport. Understanding these features' impacts is crucial for interpreting and improving the model, as it allows one to focus on the most influential factors when addressing flight delay predictions.
Figure 2 indicates that flights with late scheduled times had a positive impact and were more likely to be delayed. It can be observed that flights with a smaller number of passengers were more likely to be delayed at the destination airport. Longer distances had a positive impact on delay, indicating that longer flights may be more likely to result in arrival delays. The significance of air pressure (P0), relative humidity (U), mean wind speed (FF), and dew point temperature (Td) indicated that the weather conditions exerted a considerable influence on flight delays. A reduction in wind speed or an increase in air pressure appeared to have a negative impact on flight delays, resulting in a greater number of on-time arrivals. Conversely, high wind speeds or low air pressure at the arrival airport were associated with increased delays. The three models (XGBoost, CatBoost, and LightGBM) were selected for analysis due to their high-accuracy results, which demonstrated their robustness and reliability in predicting flight delays. Although the three models exhibited similarly impressive performance, XGBoost was selected for in-depth analysis due to its consistently higher recall rates, which is essential for reducing the number of false negatives when forecasting flight delays. The subsequent results were obtained by utilizing the 10 most significant features in conjunction with XGBoost, as detailed in Table 9. The XGBoost model with selected features demonstrated excellent performance on the training data, with high accuracy, recall, precision, F1 score, and AUC score and a low balanced error rate (Table 9). However, the model's performance on the testing data was slightly lower, suggesting that it may have some difficulty in generalizing to new, unseen data. Further model refinement and evaluation may be necessary to improve its generalization capabilities.
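The feature-reduction step can be sketched as ranking features by mean absolute SHAP value and retaining the top k before retraining. The scores below are made up for illustration and are not the study's actual SHAP values; only the ranking logic is shown.

```python
def top_k_features(mean_abs_shap, k=10):
    """Rank features by mean |SHAP| and keep the k most influential."""
    ranked = sorted(mean_abs_shap, key=mean_abs_shap.get, reverse=True)
    return ranked[:k]

# Illustrative (made-up) mean |SHAP| scores for candidate features.
scores = {
    "scheduled_time": 0.41, "pax": 0.33, "pressure_P0": 0.27,
    "distance": 0.22, "temperature": 0.18, "humidity_U": 0.15,
    "dew_point_Td": 0.12, "wind_FF": 0.10, "cargo": 0.07,
    "origin_SAW": 0.05,
    "runway_config": 0.02,  # hypothetical feature, dropped with k=10
}
selected = top_k_features(scores, k=10)
print(selected[0], len(selected))  # most influential feature and count
```

In practice the retained feature subset would then be used to refit the classifier (XGBoost in this study) and the train/test metrics compared against the full 595-feature model, as reported in Table 9.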

Conclusions
In this section, we interpret and analyze the results of our study on flight delay prediction at a Turkish airport using machine learning techniques.
Looking at the results, based on the accuracy rates reported in the literature, XGBoost, CatBoost, and LightGBM performed consistently well across different scenarios. They maintained robust performance when dealing with unbalanced data, when integrating weather information, and when SMOTE was applied. Specifically, XGBoost achieved an accuracy of 80% and recall rates of 20% to 27% across different trials, indicating its robustness. CatBoost and LightGBM also demonstrated strong performance, with accuracies around 80% and recall rates ranging from 13% to 24%. Logistic Regression, although generally less robust compared to more advanced models, showed improved performance when SMOTE was applied. With only airport data and SMOTE, Logistic Regression achieved accuracies of 64% and 61%. When both weather and airport data were used with SMOTE, the accuracy improved to 65% on the test set, with a recall rate of 60%. In the literature, most studies report results based primarily on accuracy. In this context, our findings suggest that XGBoost, CatBoost, and LightGBM, with their higher accuracy rates, are more effective for predicting flight delays compared to Logistic Regression. Therefore, they appear to be strong candidates for predicting flight delays at this Turkish airport.
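The metrics compared above can all be reproduced from a binary confusion matrix with "delayed" as the positive class. The sketch below uses made-up counts (not the study's actual results) to show how accuracy, recall, precision, and the balanced error rate relate to one another on imbalanced data.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, recall, precision, and balanced error rate from
    binary confusion-matrix counts (positive class = delayed)."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    recall = tp / (tp + fn)        # share of delayed flights caught
    precision = tp / (tp + fp)     # share of delay alerts that were correct
    fnr = fn / (fn + tp)           # missed delays
    fpr = fp / (fp + tn)           # false alarms on on-time flights
    ber = (fnr + fpr) / 2          # balanced error rate
    return accuracy, recall, precision, ber

# Made-up counts: 1000 flights, of which 250 were truly delayed.
acc, rec, prec, ber = classification_metrics(tp=60, fp=15, fn=190, tn=735)
print(f"accuracy={acc:.3f} recall={rec:.3f} precision={prec:.3f} ber={ber:.3f}")
```

The example illustrates why accuracy alone can be misleading on imbalanced data: a model can reach roughly 80% accuracy while catching only about a quarter of the delayed flights, which is why recall and the balanced error rate are reported alongside accuracy throughout this study.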
In the context of predicting delayed flights on time, the most unfavorable scenario is a model that performs poorly in terms of recall, i.e., one that fails to correctly identify a significant proportion of delayed flights (a high number of false negatives). XGBoost consistently demonstrated higher recall rates across a range of trials and scenarios. Nevertheless, the performance of the models on the test dataset, while satisfactory, indicates a need for further improvement in terms of generalization. This finding indicates that our models may benefit from additional fine-tuning and evaluation to enhance their performance in real-world settings.
The incorporation of weather data into our models proved to be a valuable addition, as it provided insights into the impact of weather on flight delay predictions. The experiments demonstrated that the weather conditions exert some influence on flight delays. Nevertheless, while the incorporation of weather data enhanced the model performance to a certain extent, it did not markedly alter the overall prediction accuracy in comparison to that of models that did not include weather data. This suggests that while the weather conditions are a significant factor, airport-specific operational factors may be more influential in determining flight delays at the studied location. Future research could investigate the interplay between weather conditions and operational factors in greater depth in order to gain a more comprehensive understanding of their combined effect on flight delays. Furthermore, the limited impact of the weather data on overall prediction accuracy indicates the potential necessity for more granular weather data or the investigation of other environmental factors that might affect flight operations.
Feature selection plays a critical role in model performance and interpretability. Our analysis of feature importance using SHAP values indicated that several key features significantly influence flight delays. Factors such as scheduled time, number of passengers, time to arrival, atmospheric pressure at the arrival airport, travel distance, and the number of flights departed from the airport showed the highest impacts on predicting flight delays. Understanding these influences is essential for aviation professionals to optimize operations and passenger experiences.

Discussion and Future Recommendations
Here, we will consider the implications of our findings and the potential avenues for further research in the field of aviation and machine learning.
While our machine learning models achieved noteworthy results on the training dataset, the slightly reduced performance on the testing dataset highlights the need for model generalization.To enhance their ability to handle unseen data effectively, further research could explore techniques such as additional feature engineering.
The transition from traditional methods to machine learning in the aviation industry is becoming increasingly apparent. The incorporation of large datasets and advanced algorithms has the potential to revolutionize flight delay prediction and aviation operations. As our models continue to improve, they can be valuable tools for airlines, airports, and regulators, allowing them to proactively manage delays and enhance passenger experiences.
At this juncture, it is academically prudent to engage in a discourse on ethical considerations. In addition to the positive results brought by accurate predictions, it is important to consider ethical issues when using machine learning techniques. This is important not only to ensure the accuracy of a study's results but also to uphold fairness, impartiality, and respect for the rights and well-being of all relevant stakeholders. In this context, one of the crucial aspects is data privacy and security, emphasizing the need for data anonymization and compliance with data protection regulations, such as the GDPR. To prevent biases in the training data, characteristics such as airline names or aircraft types were not included.
When examining the impact on stakeholders, it becomes evident that passengers may experience increased discomfort if they are misdirected based on prediction outcomes, which potentially leads to missed connecting flights or important events. Airlines, on the other hand, could face disruptions to their operations, resulting in financial losses and damage to their reputation due to incorrect predictions. Airport authorities might struggle with efficient resource allocation and, contrary to expectations, may encounter traffic congestion, delays, and additional costs.
To maintain ethical standards, it is crucial not only to ensure the accuracy of predictions but also to strive for fairness and the enhancement of the well-being of all stakeholders. Thus, monitoring and, if necessary, improving models during their usage become imperative in preserving ethical standards.
Our study has limitations, including the reliance on a specific dataset from a single airport. Future research could expand to include data from multiple airports and airlines to generalize the findings. Additionally, exploring advanced machine learning techniques like deep learning and recurrent neural networks (RNNs) may offer further insights. Moreover, regression algorithms could be tested to predict the delay durations of delayed flights.
The practical implications of our study extend to the operations of the aviation industry and passenger services. By harnessing the potential of machine learning, airlines can optimize flight schedules, allocate resources more efficiently, and reduce the costs associated with delays. Passengers, too, can benefit from more reliable travel experiences.

Figure 1. Feature impacts on model output.

Figure 2. Feature impacts on model output with positive and negative effects.

Table 1. Flight delay prediction studies.
[33]: The model firstly performed a binary classification to predict whether a flight would be delayed and, in the second stage, conducted a regression analysis to estimate the duration of the delay in minutes. The highest departure delay classification accuracy, 86.48%, was obtained with Gradient Boosting. [32]: Kaggle.com data; ANN and Deep Belief Network. Deep neural networks attained an accuracy rate of 77%, whereas ANN achieved an accuracy rate of 89%. [33]: GitHub data; RF, DT, Support Vector Machine (SVM), and Naïve Bayes (NB).

Table 2. Flight characteristics, types, and studies in the literature.

Table 3. Weather condition features and data types.

Table 4. Experimental design for hyperparameter space. Best values are shown in bold.

Table 5. First trial, airport data.

Table 6. Second trial, airport data with SMOTE. Best values are shown in bold.

Table 7. Results from the combined use of airport and weather datasets.

Table 8. Results of weather and airport data using the SMOTE technique.

Table 9. Results of XGBoost with selected features.