1. Introduction
Flight delays are a significant issue affecting passengers, airlines, and the entire aviation infrastructure. In an era of increasing globalization and rapid growth of air transport, understanding the causes, consequences, and management mechanisms of delays has become a critical element in improving the efficiency and quality of services in this sector. This study focuses on addressing this challenge by exploring innovative methods to predict and mitigate flight delays, with the aim of improving the operational performance of the aviation industry.
The objective of this article is to formulate and assess a novel methodology for predicting flight delays through the use of scorecards. These scorecards function as a streamlined analytical instrument that allocates weights to diverse factors affecting delays, thereby establishing an accessible predictive model. In contrast to conventional, intricate techniques, the scorecard system is devised to offer an intuitive and user-friendly solution for airlines and airport operators, reducing the requirement for specialized expertise in data analysis.
Delays directly impact passenger satisfaction, a key determinant of airline competitiveness. Punctuality and reliability are among the most valued aspects of air travel, and disruptions can lead to frustration, stress, and missed connections [1].
In addition to inconveniencing passengers, delays entail significant economic consequences. Airlines incur increased operational expenditures, including additional fuel, crew expenses, and penalties associated with service disruptions [2]. Furthermore, delays contribute to environmental concerns by increasing fuel consumption and greenhouse gas emissions during prolonged waiting periods. Efficient delay management can not only improve operational efficiency but also support the broader goals of sustainability in the aviation sector.
Flight delays are influenced by various factors, which can be classified as operational, weather-related, air traffic control, external, passenger-induced, and supply chain disruptions [3]. Operational delays often stem from internal airline processes, such as technical maintenance, the rescheduling of crews, or extended ground services. Weather conditions, such as storms, fog, or strong winds, represent a significant external factor that can make flights unsafe or impossible [4]. Air traffic control constraints, such as congestion in the airspace or on runways, can also result in prolonged delays. Additionally, external disruptions, such as security threats, infrastructure failures, or changes in regulations, as well as passenger behavior, contribute to the complexity of the issue [5].
Traditional delay forecasting methods, including advanced statistical techniques and machine learning models, have demonstrated efficacy but are often resource-intensive, requiring specialized expertise and significant computational resources. Statistical approaches such as ARIMA models effectively capture trends and seasonality but lack the accessibility needed for broader implementation [6]. Machine learning methods, though powerful in capturing nonlinear relationships, pose challenges in deployment and maintenance [7,8]. Consequently, there is a pressing need for simpler and more cost-effective tools to address this pervasive issue.
To bridge this gap, this research explores the potential of scorecards as an innovative and simplified method for the prediction of delays. Scorecards, widely used in various industries for performance evaluation, offer a structured framework that integrates multiple performance indicators. By assigning weights to critical factors influencing delays, scorecards provide a straightforward predictive model that can be easily adopted by airlines and airport operators without requiring advanced analytical capabilities.
The proposed scorecard system aligns with the broader trend toward data-driven decision making in aviation. It enables stakeholders to monitor key performance indicators, such as on-time performance and delay durations, while incorporating external variables such as weather conditions and air traffic constraints. This approach facilitates proactive strategies for delay mitigation, supporting continuous improvement, and benchmarking against industry standards.
In summary, the objective of this article is to offer a practical solution to a pressing issue in air transportation by providing an effective, accessible, and actionable tool for predicting and managing flight delays. The analysis of scorecard results can deliver not only a forecasting tool but also an intuitive and clear visualization of the data that helps to understand the process, points to the main factors, and explains the reasons for a change.
This research directly contributes to the ongoing development of data-driven solutions in transportation systems by integrating interpretable modeling techniques with real-world aviation data. In particular, the use of scorecards—a method well known in finance but rarely applied in the transport sector—offers a transparent and lightweight alternative to traditional machine learning models. This approach emphasizes practical applications of analytics to optimize operations and improve service performance in aviation and logistics.
The main contributions of this paper are as follows:
The introduction of a novel application of scorecards for flight delay prediction, demonstrating that a transparent and interpretable model can match or exceed the predictive power of more complex machine learning approaches.
The development and validation of a high-performing model using a large-scale dataset of over 30 million flights from the U.S. Bureau of Transportation Statistics.
Comparative analysis highlighting the advantages of the scorecard approach over traditional statistical and ML models in terms of interpretability, simplicity, and operational usability.
Operational insights derived from model outputs, supporting decision-making in airline resource allocation, delay mitigation, and service reliability improvement.
The discussion of sustainability implications, including how scorecard-based monitoring can contribute to lower fuel usage, reduced CO2 emissions, and improved passenger satisfaction.
This article is structured as follows: the rationale for addressing the issue of flight delays is outlined first, followed by a review of conventional forecasting methods and their limitations. Subsequently, the proposed scorecard approach is introduced, along with its practical implementation and advantages. The paper concludes with a discussion of its implications for operational efficiency and sustainability in the aviation industry. By addressing a multifaceted challenge with a scalable and user-friendly tool, this research contributes to advancing the efficiency, reliability, and environmental sustainability of air transport systems.
2. Literature Review
2.1. Significance of the Flight Delays Study
Flight delays represent a critical challenge for airlines, airports, and passengers, which significantly impacts operational costs and customer satisfaction. The intricacy inherent in conventional delay prediction methodologies frequently requires the application of sophisticated statistical analyses accompanied by mathematical modeling, rendering these approaches cumbersome and demanding in terms of resources. Recent studies have highlighted various machine learning approaches, such as supervised learning models and deep learning networks, which have shown promise in enhancing prediction accuracy and efficiency. For instance, supervised machine learning techniques, including random forests and support vector machines, were used to predict flight-time deviations, achieving notable accuracy through a grid search for optimal parameters [
9]. Similarly, it has been demonstrated that the integration of meteorological data with deep convolutional neural networks could significantly improve the classification performance of flight delays [
10].
However, despite their predictive power, such models are often opaque (‘black boxes’) and require extensive computational resources and technical expertise for both development and interpretation. This constrains their practical applicability in real-time, high-consequence settings, such as airline operations.
Despite the advancements in machine learning applications for flight delay prediction, the use of scorecards as a predictive tool remains largely unexplored. This study introduces an innovative application of scorecards, which may streamline the prediction process by offering stakeholders a more intuitive and accessible approach. The existing literature predominantly focuses on deterministic predictions, often yielding binary outcomes such as on-time or delayed statuses [
11]. These approaches rarely provide probability estimates or clear interpretability for operational staff, making it difficult to understand or act upon model outputs. In contrast, the proposed scorecard method aims to provide a probabilistic framework that could offer nuanced insights into potential delays, thereby addressing the limitations of traditional models. For example, in [
12], work on the predictive modeling of flight delays using a decision tree illustrates the potential of machine learning to enhance decision-making processes.
Moreover, the integration of various influencing factors, such as historical flight data, weather conditions, and airport congestion, is crucial for developing robust predictive models. Research indicates that effective feature engineering is essential to capture relevant indicators that influence flight delays [
13]. The proposed scorecard approach could facilitate the incorporation of these diverse factors in a more structured manner, potentially leading to improved accuracy in delay predictions. Unlike machine learning models that often require retraining and complex optimization procedures, scorecards offer a flexible structure that can be manually adjusted as operational conditions change, providing better adaptability with minimal effort. Furthermore, the ability to adapt to real-time data emphasizes the importance of dynamic systems in enhancing prediction capabilities [
14]. Therefore, the importance of this study lies in its potential to introduce a simplified but effective method for predicting flight delays, which could ultimately improve operational efficiency and passenger satisfaction.
In summary, the existing literature lacks interpretable solutions that combine predictive power with ease of implementation. This study aims to fill that gap by introducing scorecards as a middle ground—offering sufficient accuracy for operational use, while maintaining full transparency and usability for non-technical personnel.
2.2. Conventional Forecasting Methods
The forecasting of delays in air transportation is a critical area of research, particularly as it impacts operational efficiency, passenger satisfaction, and overall airline profitability. Traditional forecasting methods have evolved significantly, integrating various statistical methods.
One of the foundational methods in delay forecasting is the autoregressive integrated moving average (ARIMA) model, which has been widely used for time series analysis in various contexts, including air traffic forecasting [15,16,17]. For instance, studies have shown that ARIMA models can effectively capture trends and seasonality in air passenger traffic data, making them suitable for predicting future volumes and potential delays [18,19]. The integration of external variables, such as GDP and oil prices, into ARIMAX models has also been demonstrated to improve the accuracy of forecasting, particularly in the face of economic shocks such as the COVID-19 pandemic [20].
Although ARIMA and its variants are effective in modeling seasonality and long-term trends, they often assume linear relationships and stationary time series, which may not reflect the dynamic and irregular nature of flight delays. These models also struggle to incorporate categorical or non-numeric factors, such as airport congestion levels or flight-specific operational variables. This limits their practical usefulness in complex operational environments where nonlinear, multidimensional influences play a major role.
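For reference, the linearity and stationarity assumptions mentioned above can be read directly from the standard textbook form of an ARIMA(p, d, q) model. The notation below (backshift operator B, constant c, noise term ε_t) is the conventional one and is introduced here only for illustration; it does not appear in the cited studies as quoted in this article.

```latex
% ARIMA(p, d, q) in backshift notation, where B y_t = y_{t-1}
\phi(B)\,(1 - B)^{d}\, y_t = c + \theta(B)\,\varepsilon_t ,
\qquad
\phi(B) = 1 - \sum_{i=1}^{p} \phi_i B^{i},
\qquad
\theta(B) = 1 + \sum_{j=1}^{q} \theta_j B^{j}
```

The series y_t enters only through linear combinations of its own lagged values and lagged noise terms, and the d-fold differenced series (1 − B)^d y_t is assumed to be stationary. Categorical inputs such as a congestion level have no natural place in this form without additional encoding.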
In addition to ARIMA, hybrid models that combine traditional statistical methods with machine learning techniques have gained traction. For example, the use of ensemble methods, which aggregate predictions from multiple models, has been shown to enhance the robustness of forecasts by capturing complex patterns in the data [
21]. These methods are particularly beneficial in the context of air traffic, where the interplay of various factors can lead to significant variability in delays. The application of machine learning algorithms, such as support vector machines and neural networks, has also been explored, providing a means to model non-linear relationships and improve prediction accuracy [
22,
23].
However, although ensemble and hybrid models may increase predictive power, they often lack transparency and interpretability, making them difficult to deploy in environments where quick and explainable decisions are required. In addition, their computational complexity may hinder their use in real-time systems, particularly for airlines with limited technological infrastructure or data science expertise.
Additionally, the impact of external factors, such as weather conditions and operational disruptions, on air traffic delays has been a focal point in recent research. Studies have indicated that incorporating weather-related variables into forecasting models can significantly enhance their predictive capabilities [24]. For example, the use of weather impact indices in conjunction with traffic volume data has been shown to produce more accurate delay predictions, particularly for short-term forecasts [25].
The COVID-19 pandemic has further complicated the forecast landscape, introducing unprecedented volatility in air travel patterns. Researchers have adapted traditional and machine learning methods to account for these disruptions, highlighting the importance of flexibility in forecasting approaches [
26]. The disorder introduced by the pandemic necessitated a reassessment of pre-existing models, prompting the investigation of novel methodologies that can more effectively accommodate abrupt modifications in air traffic dynamics [
27]. These shocks have highlighted the fragility of overly complex or rigid forecasting systems, reinforcing the need for adaptive yet interpretable solutions that can be updated or recalibrated quickly in response to sudden changes.
It should be noted that the forecasting of delays in air transportation has evolved through the integration of traditional statistical methods and advanced machine learning techniques. The combination of ARIMA models with external variables, the use of ensemble methods, and the adaptation of forecasting approaches in response to external shocks such as COVID-19 are crucial in enhancing the accuracy and reliability of delay predictions. Future research should continue to explore these hybrid methodologies and their applicability in various operational contexts within the aviation industry.
2.3. An Innovative Approach–Scorecards
The innovative approach of utilizing scorecards in forecasting delays in air transportation represents a significant advancement in the field of aviation management [28]. Scorecards, as a performance measurement tool, facilitate the systematic evaluation of various factors that influence flight delays, thereby improving decision-making processes and operational efficiency [29].
In recent years, scorecards have emerged as a transformative tool in aviation delay forecasting, offering a practical, flexible, and accessible alternative to traditional methods. These tools provide a systematic framework for integrating multiple performance indicators, enabling airlines to assess and respond to the complex interplay of factors contributing to delays. Unlike traditional statistical or machine learning approaches, which can be resource-intensive and require specialized expertise, scorecards offer an intuitive solution that is easier to implement and maintain while still delivering actionable insights. By assigning weights to key variables such as weather conditions, air traffic inefficiencies, operational disruptions, and external factors, scorecards present a simplified yet robust means of predicting and managing flight delays [
30].
What distinguishes scorecards from traditional and AI-based approaches is their ability to bridge the gap between operational usability and analytical depth. They are interpretable by design and can be implemented without the need for specialized programming or data science expertise.
One of the primary advantages of scorecards lies in their ability to incorporate a wide range of both internal and external performance indicators into a cohesive framework. Internal metrics include on-time performance rates, average delay durations, and turnaround times, all of which directly affect operational efficiency and passenger satisfaction. External variables, such as regulatory changes, economic conditions, and weather disruptions, are also critical components of effective delay forecasting. By systematically integrating these factors, scorecards offer a holistic view of the operational environment, enabling airlines to anticipate and mitigate delays more effectively [
31].
The use of scorecards aligns with the broader shift of the aviation industry toward data-driven decision making, where analytics play an increasingly critical role in improving operational performance. Scorecards enable airlines to organize and visualize complex datasets in a structured manner, making it easier to identify patterns, trends, and areas of concern. For example, tracking KPIs such as the frequency and duration of delays allows airlines to pinpoint operational bottlenecks and inefficiencies, informing targeted interventions. Importantly, scorecards do not require retraining from scratch when operational parameters shift; instead, their rules and weights can be adjusted incrementally based on updated domain knowledge or performance monitoring, allowing for agile model maintenance. Research has demonstrated that the implementation of scorecards has been linked to improved operational efficiency in various sectors, including aviation, where timely and informed decisions are essential to maintaining competitiveness and profitability [
32,
33].
Additionally, scorecards provide an adaptable framework capable of accommodating the dynamic and often unpredictable nature of the aviation industry. The ability to incorporate external factors into delay forecasting is particularly valuable in light of recent disruptions such as the COVID-19 pandemic, which introduced unprecedented volatility in global air travel. By integrating variables such as fluctuating passenger demand, changing regulatory requirements, and evolving safety protocols, scorecards enable airlines to remain agile and responsive to shifting circumstances [
34,
35]. This adaptability makes scorecards an indispensable tool for managing uncertainty and ensuring operational resilience in the face of external challenges.
In addition to their practical applications in delay forecasting, scorecards also facilitate continuous improvement and benchmarking, two critical components of modern performance management. By regularly reviewing performance against established benchmarks, airlines can identify areas for improvement, implement best practices, and measure progress over time. Benchmarking not only enhances the accuracy and reliability of delay forecasts but also fosters a culture of accountability and excellence within organizations. This is particularly important in the highly competitive aviation industry, where operational efficiency and service quality are key differentiators. For example, comparing on-time performance rates with industry peers can help airlines identify gaps in their operations and develop strategies to close those gaps, ultimately improving both customer satisfaction and financial performance [
36,
37].
The balanced scorecard approach, a widely recognized methodology within performance management, provides a useful framework for organizing and prioritizing the various metrics included in delay forecasting. This approach emphasizes the importance of balancing operational metrics with broader strategic goals, ensuring that airlines not only address immediate challenges but also align their efforts with long-term objectives such as sustainability and growth. For example, incorporating environmental metrics into the scoring framework allows airlines to monitor and minimize their carbon footprint, supporting global efforts to reduce greenhouse gas emissions and achieve sustainability targets [
33]. This cross-functional transparency is difficult to achieve with many existing ML-based approaches, which often require specialized interpretation layers or expert involvement to translate model results into actionable insights.
One of the key strengths of scorecards is their potential to bridge the gap between complexity and usability. Although traditional forecasting methods, such as ARIMA models or advanced machine learning algorithms, are effective in capturing trends and patterns, they often require significant computational resources and technical expertise. This can limit their accessibility and practical applicability for smaller airlines or operators with limited resources. In contrast, scorecards offer a more user-friendly alternative that combines the analytical rigor of advanced models with the simplicity needed for widespread adoption. By providing clear and actionable insights, scorecards empower airline managers and operators to make informed decisions without relying on specialized analysts or expensive software tools [
30]. This accessibility is particularly valuable for small or medium-sized airlines that lack advanced IT infrastructure or data science teams, allowing them to benefit from predictive modeling without prohibitive costs or complexity.
Furthermore, the implementation of scorecards can contribute to enhanced collaboration and communication within organizations. By presenting performance data in a clear and visual format, scorecards make it easier for different teams and departments to understand and align around common goals. For example, operations teams can use scorecards to monitor turnaround times and identify delays, while customer service teams can track passenger satisfaction metrics to ensure a seamless travel experience. This cross-functional approach not only improves operational efficiency but also enhances the overall quality of service, ultimately strengthening the airline’s competitive position in the market [
32].
In conclusion, scorecards represent a significant innovation in delay forecasting, offering a flexible, accessible, and data-driven approach to managing one of the most pressing challenges in the aviation industry. By integrating various performance metrics, incorporating external variables and supporting continuous improvement, scorecards provide airlines with the tools necessary to navigate the complexities of delay management effectively. Their capacity to adjust to dynamic operational conditions, combined with their focus on simplicity and ease of use, renders them an indispensable asset for airlines aiming to augment operational performance, elevate passenger satisfaction, and secure long-term sustainability. As the aviation sector continues to evolve, the adoption of scorecards and similar innovative methodologies will be essential for maintaining competitiveness and ensuring high levels of service quality in an increasingly complex and demanding global market.
In recent years, the growing use of AI/ML models in regulated industries, such as banking and insurance, has brought increased attention to the need for transparency in predictive systems. According to the European Banking Authority (EBA), for an AI/ML model to be approved for operational use in such environments, it must meet strict interpretability and auditability standards. Initially, the emphasis was on developing “fully interpretable models”, but this has evolved into the broader concept of “explainable models”, which do not necessarily require full transparency of internal logic, but must still provide insight into key model components, such as the list and order of input factors and their relative influence on the predicted outcome [
38]. In parallel, the field of explainable artificial intelligence (XAI) has emerged, aiming to enhance complex AI/ML models with complementary tools—such as feature importance rankings, surrogate models, or counterfactual analysis—that make their decisions more understandable to human users [
39].
In the context of transportation, recent studies have explored XAI applications to support interpretable delay predictions, passenger behavior modeling, and risk assessments in multimodal logistics. By positioning scorecards within this broader movement, our study responds to the increasing demand for transparent, auditable, and operationally feasible predictive tools, particularly in critical and high-impact sectors such as aviation.
The long-term hypothesis is formulated as follows: in its first steps, every business area will be covered by intuitive and fully interpretable scorecard methods. Later, once the field opens up to advanced analytics and many processes are handled by fully automatic tools with many embedded predictive models, AI/ML models will follow as a natural consequence; even then, scorecards should still be used to verify and test the correctness of the AI/ML models.
2.4. Summary of Innovations and Contribution
This study introduces a novel and operationally practical approach to the prediction of flight delays through the use of scorecards. Unlike traditional forecasting models, such as ARIMA, decision trees, or neural networks, the proposed method balances predictive capability with full interpretability and low implementation cost. Below, we summarize the key innovations and contributions of this work in comparison to existing methodologies:
Interpretability and transparency: In contrast to many machine learning models often regarded as ‘black boxes’, the scorecard method provides a fully interpretable framework where each input variable has a visible and explainable impact on the final prediction.
Operational usability: The scorecard model is designed with practical deployment in mind. It can be integrated into existing airline and airport systems via dashboards or APIs and does not require advanced computational infrastructure or ongoing retraining.
Adaptability with minimal complexity: Although many ML-based models require complex tuning and optimization, scorecards allow manual recalibration, making them suitable for dynamic operational environments.
Domain transferability: Although demonstrated on U.S. flight data, the modeling approach is general and can be replicated in other national or regional contexts. Variables may differ by region, but the underlying method remains applicable.
Filling a methodological gap: The current literature emphasizes either advanced, opaque models or simplistic binary classifiers. Scorecards serve as a middle ground—offering a probabilistic output while remaining easy to interpret by non-technical users.
The integration of analytical precision with business reasoning intrinsic to the scorecard renders it an effective instrument for data-driven decision-making within the aviation sector. It responds directly to the industry’s growing need for explainable, efficient, and scalable forecasting solutions.
In summary, the objective of this article is to present a pragmatic solution for an urgent issue in the field of air transportation, by providing an effective, accessible, and actionable instrument for the prediction and management of flight delays. The study of scorecard results can provide not only a forecasting tool but also intuitive and clear visualizations of data to understand the process, identify the main contributing factors, and explain the reasons for change.
This research addresses a gap in the existing literature, where scorecards—commonly used in finance and risk management—have not yet been explored in the context of flight delay prediction. Although machine learning models dominate this space, they often suffer from opacity and high computational complexity. This study poses the following key research question: Can scorecard systems effectively predict flight delays with an accuracy comparable to ML models while maintaining full interpretability and operational simplicity?
The solution to this inquiry not only informs the modeling methodology in this paper but also ascertains whether a more straightforward, human-interpretable instrument can augment or even supplant complex, opaque predictive models in essential transportation applications.
3. Classification of Flight Delays
In the United States, there is no unified federal regulation that clearly defines the thresholds for various types of flight delays or mandates standardized passenger compensation. As a result, airlines and researchers often rely on alternative classification schemes when analyzing delay events. In this study, we adopt a structured delay categorization model inspired by the European Union’s Regulation (EC) 261/2004, which outlines consistent criteria for delay compensation, cancellations, and denied boarding across member states.
Based on this regulatory framework and previous academic sources, we distinguish the following delay statuses [12]:
Early arrival–when the aircraft lands before its scheduled arrival time.
On time–when arrival occurs at or within 15 min of the scheduled time.
Delayed–when the delay ranges between 15 min and 2 h.
Significantly delayed–when the delay exceeds 2 h but does not surpass 6 h.
Exceptional delay–when the delay surpasses 6 h or is otherwise categorized as extreme.
These categories are translated into a five-level scale, with Level 1 representing early arrivals and Level 5 indicating extraordinary disruptions.
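The five-level scale can be illustrated with a short Python sketch. The function name `delay_level` is hypothetical, and two boundary choices are assumptions rather than statements from the text: a delay of exactly 15 min is treated as on time, and the "significantly delayed" band is capped at 6 h so that it does not overlap with the exceptional category.

```python
def delay_level(arrival_delay_min: float) -> int:
    """Map an arrival delay in minutes to the five-level scale.

    Negative values mean the aircraft landed before its scheduled time.
    Boundary handling (<= 15, <= 120, <= 360) is an assumption of this sketch.
    """
    if arrival_delay_min < 0:
        return 1  # early arrival
    if arrival_delay_min <= 15:
        return 2  # on time (at or within 15 min of schedule)
    if arrival_delay_min <= 120:
        return 3  # delayed (more than 15 min, up to 2 h)
    if arrival_delay_min <= 360:
        return 4  # significantly delayed (more than 2 h, up to 6 h)
    return 5      # exceptional delay (more than 6 h)
```

For example, `delay_level(45)` yields Level 3, while `delay_level(400)` yields Level 5.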
In addition to categorizing delay severity, the U.S. Department of Transportation (DOT) provides an operational classification of delay causes. These include the following:
Carrier-related delays, such as technical issues, ground operations, or overbooking.
Late-arriving aircraft, where previous flight delays cascade into subsequent segments.
National Airspace System (NAS) delays due to air traffic control restrictions, limited runway availability, or moderate weather.
Security-related delays, resulting from events such as passenger screening bottlenecks or terminal evacuations.
Weather delays, caused by severe meteorological conditions that impact flight safety.
Each cause is considered independently and contributes a measurable portion to the total delay time. By breaking down delays into component parts, it becomes possible to analyze the cumulative impact and explore causal relationships.
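As a minimal sketch of this decomposition, the snippet below converts the minutes attributed to each of the five DOT cause categories into shares of the total delay. The dictionary keys and the function name `cause_shares` are illustrative assumptions and do not correspond to official field names.

```python
# Illustrative keys for the five DOT cause categories (not official field names).
CAUSES = ("carrier", "late_aircraft", "nas", "security", "weather")


def cause_shares(minutes_by_cause: dict[str, float]) -> dict[str, float]:
    """Return each cause's share of the total attributed delay minutes."""
    total = sum(minutes_by_cause.get(cause, 0.0) for cause in CAUSES)
    if total == 0:
        # No attributed delay: every share is zero.
        return {cause: 0.0 for cause in CAUSES}
    return {cause: minutes_by_cause.get(cause, 0.0) / total for cause in CAUSES}
```

A flight with 30 min of carrier delay and 10 min of weather delay would, under this sketch, be attributed 75% to the carrier and 25% to weather.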
Analyzing the underlying causes of flight delays is essential for generating actionable data-driven insights and building reliable predictive models. The United States, as the largest aviation market—handling around 30% of global air traffic and possessing the highest number of operational airports—provides a uniquely rich dataset. Since 1987, detailed flight-level records have been systematically published by the Bureau of Transportation Statistics, offering researchers access to a long-term, high-volume data environment.
For the purposes of this study, we focused on a recent and representative time window: January 2018 to December 2022. This five-year period includes more than 30 million domestic flight records, which were obtained in CSV format directly from the Bureau’s online repository.
To handle the large data volume efficiently and ensure accurate processing, the raw files were imported into SAS 9.4 software, which is optimized for large-scale statistical computing and enables the processing of datasets that far exceed the row limitations of spreadsheet applications. In addition, custom-built analytical scripts and data transformation procedures developed by the authors were employed to clean, integrate, and restructure the data. This approach allowed for scalable computation, exploratory analysis, and the implementation of scorecard-based modelling on a robust technical foundation.
4. Research Method
4.1. Scorecard Methodology
Scorecards are a mathematical tool for classifying and evaluating objects, phenomena, or processes using multiple criteria. They are widely used in areas such as risk analysis, credit scoring, customer behavior forecasting, or quality assessment. Their operation is based on the transformation of various characteristics (attributes) into scores, which are then added together to produce an overall score. A detailed methodology for the construction of scorecards is outlined in the following, together with the mathematical underpinnings.
The method is based on classical risk scorecard models used in the banking environment [40,41], where scorecards are used in key processes such as IFRS 9 provision calculation, IRB risk-weighted assets calculation, and the many credit risk acceptance processes in which credits are granted.
The methodological design of the model is consistent with Basel II documents such as [42,43].
It also builds on well-known methods described in books and articles such as [40,44,45,46]. The methodology is described in detail in [47].
4.2. Database
The dataset involved the collection of historical flight data in the U.S., containing information such as departure and arrival times, route, aircraft type, and other factors that can affect delays. An exploratory analysis of the data was then conducted to identify key factors that affect delays. The next step was the construction of scorecards, which involved assigning weights to individual factors based on their impact on delays, normalizing and scaling these factors to facilitate the interpretation of the results. The efficacy of the scorecards was validated utilizing test data, and the outcomes were assessed by juxtaposing the forecasted delays with the actual data. Finally, a validation and analysis of the results was carried out to assess the accuracy and reliability of the model.
Once the data were collected, an exploratory analysis was conducted to identify the key factors influencing flight delays. This analysis included the following:
Examining the distributions of individual variables (
Table 1).
The analysis of correlations between variables and delays.
Identifying data gaps and possibly filling or removing them.
Grouping flights by categories such as aircraft type, route length, or departure time to understand which groups are most susceptible to delays.
The results of the exploratory analysis helped to select the key variables that were used in the construction of the scorecards. It also allowed us to understand the scale of the problem and the specific relationships between delays and various operational and external factors.
Table 2 categorizes the primary causes of flight delays measured in minutes. Each type of delay reflects a specific reason contributing to overall flight disruptions:
CarrierDelay: Delays caused by airline-related issues such as maintenance problems or crew unavailability.
WeatherDelay: Delays resulting from adverse weather conditions that affect flight operations.
NASDelay: Delays attributed to the National Airspace System (NAS), including air traffic control and airport congestion.
SecurityDelay: Delays due to heightened security checks or airport security incidents.
LateAircraftDelay: Delays caused by the late arrival of the incoming aircraft intended for the next flight segment.
Before presenting any scorecard, it is necessary to examine the causes of flight delays in the available data. Causes were collected from 2003 to 2006. A total of 7,030,910 flights with delays of more than 60 min were selected. For each flight, separate variables give the minutes of delay associated with each cause: departure delay, carrier delay, weather delay, security delay, late aircraft delay, and NAS delay. In some cases (722,205), there was a positive-only departure delay. The minutes were summed for each cause, and each sum is expressed as a percentage of the total arrival delay minutes.
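The share calculation described above can be sketched as follows. This is a minimal illustration rather than the authors' SAS code: the cause columns reuse the names listed for Table 2, while `ArrDelay`, `DepDelay`, the `cause_shares` helper, and the three sample records are assumptions made for the example.

```python
# Cause columns named in the text; ArrDelay and DepDelay are assumed column names.
CAUSE_COLUMNS = ["DepDelay", "CarrierDelay", "WeatherDelay",
                 "NASDelay", "SecurityDelay", "LateAircraftDelay"]


def cause_shares(flights, total_column="ArrDelay"):
    """Return each cause's minutes as a percentage of total arrival delay minutes."""
    total_minutes = sum(f[total_column] for f in flights)
    if total_minutes <= 0:
        raise ValueError("total arrival delay minutes must be positive")
    shares = {}
    for column in CAUSE_COLUMNS:
        # Only positive minutes count as delay; early departures add nothing.
        cause_minutes = sum(max(f.get(column, 0), 0) for f in flights)
        shares[column] = 100.0 * cause_minutes / total_minutes
    return shares


# Made-up records for illustration only (all values in minutes).
sample_flights = [
    {"ArrDelay": 100, "DepDelay": 90, "CarrierDelay": 40, "WeatherDelay": 0,
     "NASDelay": 10, "SecurityDelay": 0, "LateAircraftDelay": 50},
    {"ArrDelay": 60, "DepDelay": 60, "CarrierDelay": 60, "WeatherDelay": 0,
     "NASDelay": 0, "SecurityDelay": 0, "LateAircraftDelay": 0},
    {"ArrDelay": 40, "DepDelay": 30, "CarrierDelay": 0, "WeatherDelay": 20,
     "NASDelay": 20, "SecurityDelay": 0, "LateAircraftDelay": 0},
]

shares = cause_shares(sample_flights)
```

Because the departure delay overlaps with the individual causes, the shares computed this way are not expected to sum to 100%.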
Figure 1 presents the relative contribution of various delay causes to the total arrival delay time based on U.S. flight data from 2018 to 2022. Departure delays emerge as the dominant factor, accounting for approximately 94% of the total arrival delay minutes. This suggests that delays that occur before take-off are rarely made up during the flight.
Late aircraft delays contribute around 41% and carrier-related delays approximately 32%, indicating that previous flight disruptions and airline operational issues are also substantial contributors. In contrast, weather-related delays account for just 6% and security-related issues are negligible (0.1%).
The proportions remain relatively stable throughout the year, although a slight decrease in weather-related delays is observed during warmer months, which is consistent with seasonal meteorological patterns.
The same analysis was extended to selected groups of origin airports to examine how the causes of delays vary by airport size and traffic volume. Specifically, two subsets were analyzed: the top five origin airports, representing 1,687,276 flights, and the bottom 60 origin airports, comprising 3592 flights.
Within the top five airports, the overarching distribution of delay causations exhibits a pattern that closely aligns with the general trend, with a marginal increase in weather-related delays to 8% (
Figure 2). These findings imply that although high-traffic airports effectively manage the majority of delay types, they nevertheless exhibit a degree of susceptibility to the impacts of seasonal weather.
In contrast, smaller regional airports (bottom 60) demonstrate less stability in delay shares over months. As shown in
Figure 3, these airports experience very high levels of departure delays (up to 98%) with notable contributions from late aircraft delays (33%) and weather-related delays (9%). These fluctuations indicate a higher sensitivity to operational or environmental disruptions, potentially due to limited resources or infrastructure.
Delay analysis can be carried out by building scorecards. This yields many useful conclusions and interpretations even when the data or the research subject is not well known. Scorecards can be built almost automatically by any analyst or data scientist; the results can then be presented to experts for analysis and interpretation, and finally used to draw conclusions for further research. To make the analyses efficient without losing representativeness or detail, all analyses below were conducted on a random sample of 2,055,823 flights drawn from all 205,493,100 available flights (flights that were not cancelled, from 1987 to 2023).
The modelling uses a classical approach to verify causal relations between a delayed flight and its cause.
The following two statuses are defined in the modelling event:
The observed share of delayed flights is defined as the delay rate (DR)—this metric reflects the actual, historical proportion of delayed flights in the dataset. The corresponding model-estimated value is the predicted delay rate (PDR), which indicates the probability of delay as forecasted by the scorecard model. These definitions help to distinguish between observed performance and predictive estimation. All data are randomly split into two subsets—training and validation—used, respectively, for model construction and evaluation (see
Table 3).
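As an illustration of these definitions, the sketch below computes the observed delay rate and performs a random split. It assumes that a flight counts as delayed when its arrival delay exceeds 60 min, the threshold used elsewhere in this article; the column name `ArrDelay`, the helper names, and the synthetic records are hypothetical, and the 0.6 fraction mirrors the 60%/40% split reported in Section 4.3.

```python
import random


def split_train_validation(records, train_fraction, seed=0):
    """Randomly split records into training and validation subsets without duplicates."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def delay_rate(records, threshold_minutes=60):
    """Observed delay rate (DR): share of flights whose arrival delay exceeds the threshold."""
    if not records:
        raise ValueError("no records")
    delayed = sum(1 for r in records if r["ArrDelay"] > threshold_minutes)
    return delayed / len(records)


# Synthetic flights: every tenth flight arrives 90 min late, the rest 5 min late.
flights = [{"ArrDelay": 90 if i % 10 == 0 else 5} for i in range(1000)]
train, validation = split_train_validation(flights, train_fraction=0.6)
dr_all = delay_rate(flights)
dr_train = delay_rate(train)
dr_validation = delay_rate(validation)
```

With a simple random split, the DR of both subsets should stay close to the DR of the full sample, which is one quick sanity check on the partition.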
4.3. Model-Building Process
The model-building process involves a structured sequence of steps designed to transform raw data into a predictive framework, ensuring a robust and reliable model tailored to the analysis objectives. The process consists of the following steps:
Random sample and data partition: Two datasets are created, a training set and a validation set, based on a simple random sample method (without duplications) in proportions of 60%/40%.
Binning, categorization, and grouping: Groups and categories of variables are created. Based on entropy statistics, every interval variable is split into categories. This is a typical method used in decision tree techniques. Nominal variables are also grouped, especially when the number of unique categories is too big. After binning, every variable is transformed into a logit variable.
Variable preselection: Based on univariate statistics for every variable, unimportant variables are indicated, such as those with too small of a predictive power or that are too unstable over time.
Multifactor analysis and multidimensional variable selection: The heuristic method, called branch and bound, available in SAS software, is used [
48] along with RFE (recursive feature elimination), available in Python 3.13.
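The binning and logit transformation steps listed above can be sketched in simplified form. The example finds only a single entropy-minimising cut point, whereas a full binning would split recursively into several categories, and it assumes that the logit variable is the logit of the delay rate within each bin. The function names, the smoothing constant, and the data are hypothetical and do not reproduce the authors' SAS procedures.

```python
import math


def entropy(delayed, total):
    """Binary entropy (in bits) of a group with `delayed` events among `total` flights."""
    if total == 0 or delayed in (0, total):
        return 0.0
    p = delayed / total
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def best_entropy_split(values, targets):
    """Find the cut point that minimises the weighted entropy of the two resulting bins."""
    pairs = sorted(zip(values, targets))
    n = len(pairs)
    total_delayed = sum(t for _, t in pairs)
    best_cut, best_score = None, float("inf")
    left_delayed = 0
    for i in range(1, n):
        left_delayed += pairs[i - 1][1]
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot cut between equal values
        score = (i * entropy(left_delayed, i)
                 + (n - i) * entropy(total_delayed - left_delayed, n - i)) / n
        if score < best_score:
            best_cut = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_score = score
    return best_cut


def bin_logit(delayed, total, smoothing=0.5):
    """Logit of the delay rate in a bin; smoothing avoids log(0) in pure bins."""
    p = (delayed + smoothing) / (total + 2 * smoothing)
    return math.log(p / (1 - p))


# Synthetic example: departure delay in minutes and a 0/1 delayed flag.
dep_delay = [0, 5, 10, 15, 20, 70, 80, 90, 100, 120]
is_delayed = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
cut = best_entropy_split(dep_delay, is_delayed)
```

Applying `best_entropy_split` again within each resulting bin would produce the multi-category grouping typical of decision tree techniques.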
Before diving into specific steps such as model assessment, implementation, and monitoring, it is essential to highlight that each stage plays a crucial role in ensuring the model’s reliability, relevance, and practical applicability. The process consists of the following steps:
Model assessment: There are no unique model criteria. The most popular are the following: predictive power (Gini); stability, such as R, Gini, or delta Gini–relative difference between Gini on training and validating datasets; collinearity measures, such as Max VIF—maximal variance inflation factor [
49], MAX Pearson—maximal Pearson correlation statistics on a pair of variables, and MAX Con Index—maximal condition index [
50]; and variable significance measures (
p-values).
Model implementation: It is recommended to calculate all delay factors, especially the predicted delay rate, on a yearly basis and use them as measures of change in delays.
Monitoring and testing: After every yearly calculation, some factors can be tested through additional checks, possibly on a smaller scale or only for extreme cases, such as the highest or lowest predicted delay rates, to verify and monitor the agreement between the observed and predicted delay measures. It should also be emphasized that the predicted delay rates can be used to rank all flights, and selected flights can then be investigated in a more detailed study to understand how and why delays change. Another interesting possibility is to create recommendations for a particular flight to reduce its delay rate by means of a partial score analysis: the flight should change a characteristic present in the model so as to obtain a higher partial score, because a lower score corresponds to a higher delay rate. These ideas are described further in the following subsections.
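Two of the assessment measures named above can be sketched briefly. The example assumes that delta Gini is the absolute difference between the training and validation Gini divided by the training Gini, which is one common reading of "relative difference", and that MAX Pearson is the largest absolute pairwise correlation among model variables. All names and numbers are hypothetical.

```python
import math
from itertools import combinations


def delta_gini(gini_train, gini_validation):
    """Relative difference between training and validation Gini (assumed definition)."""
    return abs(gini_train - gini_validation) / gini_train


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def max_abs_pearson(columns):
    """Largest absolute pairwise correlation among the given variables."""
    return max(abs(pearson(columns[a], columns[b]))
               for a, b in combinations(columns, 2))


# Hypothetical Gini values and variable columns, for illustration only.
stability = delta_gini(0.95, 0.94)
columns = {"x1": [1, 2, 3, 4], "x2": [2, 4, 6, 8], "x3": [4, 1, 3, 2]}
collinearity = max_abs_pearson(columns)
```

A small delta Gini suggests the model generalises from training to validation data, while a correlation close to 1, as between `x1` and `x2` here, would flag a collinearity problem.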
5. Results
5.1. Model Statistics
Table 4 and
Table 5 present the key statistical measures of the model performance.
Table 4 provides percentage-based metrics, which include measures of model discrimination and gains across different deciles.
Table 5 outlines numerical values for metrics such as the KS score, PSI score, multicollinearity indicators, and lift values at various levels. These metrics collectively evaluate the effectiveness and stability of the model.
The model has a very high value of predictive power measured by Gini statistics at the level of 94.61%, see
Table 4 and
Table 5. This means that the information used in the model is enough to explain the delay of the flight. The Gini statistics can be interpreted as follows: taking any two flights, one delayed and one non-delayed, there is a 97.30% ((Gini + 1)/2) chance our model can predict which of them will be delayed. More information about model statistics is provided in the dictionary below.
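The pairwise interpretation of the Gini statistic can be checked with a short sketch. It computes the share of correctly ordered (delayed, non-delayed) pairs directly and converts between that share and Gini. The score values are hypothetical, and the sketch assumes that a lower score indicates a higher delay risk, as in the scorecard of Section 5.3.

```python
def pairwise_auc(scores_delayed, scores_on_time):
    """Share of (delayed, on-time) pairs ranked correctly; ties count as half.

    A lower score is assumed to mean a higher delay risk.
    """
    correct = 0.0
    for d in scores_delayed:
        for o in scores_on_time:
            if d < o:
                correct += 1.0
            elif d == o:
                correct += 0.5
    return correct / (len(scores_delayed) * len(scores_on_time))


def gini_from_auc(auc):
    """Gini statistic from the pairwise ordering probability."""
    return 2.0 * auc - 1.0


def auc_from_gini(gini):
    """Pairwise ordering probability from the Gini statistic: (Gini + 1) / 2."""
    return (gini + 1.0) / 2.0


# Hypothetical scores for three delayed and four on-time flights.
auc = pairwise_auc([800, 850, 900], [880, 950, 1000, 1100])
gini = gini_from_auc(auc)

# The reported Gini of 94.61% corresponds to a pairwise probability of 97.305%.
reported_probability = auc_from_gini(0.9461)
```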
5.2. Variable Descriptions and Reports
5.2.1. The Departure Delay
A variable category report is very useful from a business point of view. It helps to understand the root cause or causal relations between flight characteristics and being delayed. In the report, the main statistics are the observed delay rate (DR) and the share of the population, i.e., how many flights fall into the category.
The information (
Table 6) presented by category can be useful for defining benchmarks so that the airport does not allow an airplane to stay longer at the airport when the probability of a final delay of more than 60 min is too high. Namely, if an airplane departs more than 63.5 min late, it has a 90.4% chance of not making up the lost minutes and arriving more than 60 min late. Conversely, a departure delayed by less than 21.5 min has a 0.3% chance of arriving more than 60 min late.
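A category report of this kind can be produced with a few lines of code. The sketch below uses the two cut points quoted in the text (21.5 and 63.5 min) and treats everything between them as a single bin, which is a simplification, since Table 6 may contain more categories. The function name and the data are hypothetical.

```python
from bisect import bisect_right


def category_report(values, targets, cut_points):
    """Return, per bin, the share of flights and the observed delay rate (DR)."""
    counts = [0] * (len(cut_points) + 1)
    delayed = [0] * (len(cut_points) + 1)
    for value, target in zip(values, targets):
        # Bin i holds values with cut_points[i-1] <= value < cut_points[i].
        index = bisect_right(cut_points, value)
        counts[index] += 1
        delayed[index] += target
    total = len(values)
    report = []
    for i, count in enumerate(counts):
        report.append({
            "bin": i,
            "share": count / total,
            "DR": delayed[i] / count if count else None,
        })
    return report


# Synthetic departure delays (minutes) and 0/1 flags for arrival delay > 60 min.
dep_delay = [0, 10, 20, 30, 50, 60, 70, 90, 120, 200]
late_arrival = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
report = category_report(dep_delay, late_arrival, cut_points=[21.5, 63.5])
```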
5.2.2. Variable Month: The Month of Departure Time
The analysis examines the impact of the departure month on various delay metrics, providing information on seasonal patterns and their influence on flight punctuality. The results are presented in
Table 7.
For that variable, it is useful to also present an evolution over time (
Figure 4). The figure shows how the DR statistics change over ten-year periods.
It can be seen that the delay rate shifts in 2000: the warm months, 5.5 ≤ Month < 8.5 (partial score 288, red color), suddenly become more delayed than before. It can also be observed that the shares are not stable (
Figure 5), particularly for partial score 291 (Month < 5.5) and the cold months, which show a higher number of flights.
5.2.3. The Size of Origin Airport
Before the analysis, the number of flights for every origin and destination is calculated from all available flights (
Table 8). This information can be related to some issues at particular airports, such as heavy aircraft traffic.
In general, small delays can be observed at very big and small airports; however, additional information can be obtained from the evolution over time (
Figure 6).
It can be observed that for the biggest airports (partial score 294), the delay rate has been decreasing over the last 20 years; this can be explained by the improved quality of services at big airports.
5.2.4. The Size of Destination Airport
This analysis explores how the size of the destination airport (Dest_size) influences flight performance, offering insights into operational efficiency and potential congestion impacts. The detailed report on the variable category is presented in
Table 9.
In general, small delays can be observed at small airports, but additional information can be obtained from the evolution over time, presented in
Figure 7.
We can observe that for the biggest airports (partial score 297), the delay rate has been decreasing over the last 20 years; this can be explained by the improved quality of services at big airports.
5.3. Model Scorecard
The final form of the scorecard is presented in
Table 10. The final score for a flight is calculated as a sum of all partial scores assigned to a particular flight. A lower score corresponds to a higher value for the delay rate (DR).
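The additive scoring rule can be illustrated with a small sketch. The scorecard below is hypothetical: the variable names, bin bounds, and partial scores are placeholders chosen for the example and are not the values of Table 10.

```python
# Hypothetical scorecard: each variable maps to ordered bins (upper bound, partial score).
# The bounds and scores are placeholders, not the values of Table 10.
SCORECARD = {
    "DepDelay": [(21.5, 120), (63.5, 60), (float("inf"), 10)],
    "Month": [(5.5, 30), (8.5, 20), (float("inf"), 25)],
}


def partial_score(variable, value):
    """Partial score of the first bin whose upper bound exceeds the value."""
    for upper_bound, score in SCORECARD[variable]:
        if value < upper_bound:
            return score
    raise ValueError(f"no bin for {variable}={value}")


def total_score(flight):
    """Final score: the sum of partial scores over all scorecard variables."""
    return sum(partial_score(name, flight[name]) for name in SCORECARD)


# A punctual departure in March versus a 90 min departure delay in July.
punctual = total_score({"DepDelay": 5, "Month": 3})
late = total_score({"DepDelay": 90, "Month": 7})
```

In this toy example the late flight receives the lower total score, matching the convention that a lower score corresponds to a higher delay rate.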
5.4. Variable Statistics
In
Table 11,
Table 12 and
Table 13, variable statistics are presented. Based on the importance statistics, we can conclude that the most important variable in the model is the departure delay. The rest of the measures are described in the dictionary below.
The importance statistics are consistent with the prior analysis based on causal variables. The predominant cause of delay is the departure delay, accounting for 86% of cases. The subsequent cause is the size of the destination. Additionally, as previously observed, monthly variations may contribute to minor discrepancies.
5.5. Calibration and Forecasting
The developed model enables the prediction of the delay rate (PDR) using a calibrated formula, which is as follows:
Formula (1) is derived from historical data and serves as a reliable tool to estimate delay rates based on the given score. The calibration ensures that the predicted PDR aligns closely with the observed delay rates (DR) over time.
The robustness of this formula has been validated by testing it against historical data in increments of ten years. In each case, the predicted PDR has consistently demonstrated a strong correspondence with the actual observed DR values, confirming the formula’s accuracy and reliability. This historical consistency underscores the model’s effectiveness in predicting delay rates under varying conditions. The results are presented in
Table 14.
Very similar values for DR and PDR suggest that all the information necessary to predict flight delays is included in the model.
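The comparison of DR and PDR over ten-year increments can be sketched as follows. The example takes already-computed PDR values as input and does not reproduce the calibration formula itself. Grouping by calendar decade, the field names `Year`, `delayed`, and `pdr`, and the sample records are assumptions made for the illustration.

```python
from collections import defaultdict


def dr_vs_pdr_by_period(flights, period_years=10):
    """Compare observed DR with mean predicted PDR for each block of years."""
    sums = defaultdict(lambda: [0, 0.0, 0])  # delayed count, PDR sum, flight count
    for flight in flights:
        start = flight["Year"] - flight["Year"] % period_years
        bucket = sums[start]
        bucket[0] += flight["delayed"]
        bucket[1] += flight["pdr"]
        bucket[2] += 1
    return {
        start: {"DR": d / n, "PDR": p / n}
        for start, (d, p, n) in sorted(sums.items())
    }


# Made-up flights with an observed 0/1 flag and a model-predicted delay rate.
sample = [
    {"Year": 1995, "delayed": 1, "pdr": 0.8},
    {"Year": 1998, "delayed": 0, "pdr": 0.1},
    {"Year": 2004, "delayed": 0, "pdr": 0.2},
    {"Year": 2009, "delayed": 1, "pdr": 0.7},
]
comparison = dr_vs_pdr_by_period(sample)
```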
5.6. Ranking of Flights and Further Research
The existence of a formula for PDR gives the possibility of validating the evolution over time of predicted delay rates and of focusing on some extreme flights with the biggest PDR to test and analyze in detail. It can also create a business case for managing some actions associated with a high PDR to avoid losses and inform flight customers about possible delays. The current model gives the possibility of calculating the PDR directly, at the time of departure.
This raises a new question: can an arrival delay of more than 60 min be predicted before the airplane departs, based only on information available before the time of departure?
To answer this question, the second model is built. The full documentation of that model is not presented, but some interesting topics are pointed out.
The predictive power is small: the Gini statistic is 20.4%. It can be interpreted as follows: taking any two flights, one delayed and one non-delayed, there is a 60.20% ((Gini + 1)/2) chance that our model can predict which of them will be delayed, which is only slightly better than the 50% achieved by a random model. This means that, in this case, there is not enough information to predict the event.
Before diving into the variables used in the second model, it is important to highlight that its objective is to determine the likelihood of an arrival delay exceeding 60 min based solely on pre-departure information. Despite its limited predictive power, the model sheds light on key factors that influence delays, providing a foundation for further investigation and improvement. Below is the list of variables included in the model:
Month: the month of the planned departure time.
Dest_size: the number of flights per destination for all available flights.
Origin_WDMH_size: the number of flights per weekday, month, and hour of planned departure from the origin for all available flights; this is a more advanced variable connected with heavy aircraft traffic issues.
Reporting_Airline: the airline operating the flight.
CRSElapsedTime: the planned length of the flight.
It should be underlined that there are three new variables: Origin_WDMH_size, Reporting_Airline, and CRSElapsedTime.
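The two count-based variables can be derived by simple aggregation, as in the sketch below. The output names `Dest_size` and `Origin_WDMH_size` follow the text, while the input column names (`Origin`, `Dest`, `DayOfWeek`, `Month`, `DepHour`), the helper name, and the records are assumptions for the example.

```python
from collections import Counter


def add_size_variables(flights):
    """Attach Dest_size and Origin_WDMH_size counts to every flight record."""
    dest_counts = Counter(f["Dest"] for f in flights)
    wdmh_counts = Counter(
        (f["Origin"], f["DayOfWeek"], f["Month"], f["DepHour"]) for f in flights
    )
    enriched = []
    for f in flights:
        key = (f["Origin"], f["DayOfWeek"], f["Month"], f["DepHour"])
        enriched.append({**f, "Dest_size": dest_counts[f["Dest"]],
                         "Origin_WDMH_size": wdmh_counts[key]})
    return enriched


# Made-up records: two flights share the same origin, weekday, month, and hour.
sample = [
    {"Origin": "AAA", "Dest": "BBB", "DayOfWeek": 1, "Month": 7, "DepHour": 8},
    {"Origin": "AAA", "Dest": "CCC", "DayOfWeek": 1, "Month": 7, "DepHour": 8},
    {"Origin": "CCC", "Dest": "BBB", "DayOfWeek": 3, "Month": 7, "DepHour": 17},
]
enriched = add_size_variables(sample)
```

Both counts are computed over all available flights before sampling, so that they reflect actual traffic levels rather than the size of the modelling sample.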
5.6.1. The Number of Flights per Weekday, Month, and Hour
The variable “Origin_WDMH_size” captures the total number of flights scheduled from the origin airport, broken down by weekday, month, and hour. This metric provides valuable insights into peak traffic periods and potential congestion, which may contribute to delays. The detailed breakdown is presented in
Table 15.
It can be observed that big airports with a large number of flights in the same weekday, month, and hour of planned departure tend towards decreasing delay rates (
Figure 8) over the last 20 years, whereas the smallest airports with the least traffic (partial score 322) show an increasing trend.
5.6.2. The Reporting Airlines
The variable “Reporting_Airline” represents the airline responsible for operating the flight. This factor is crucial to understanding how different carriers perform in terms of punctuality and delay management.
Table 16 provides a detailed category report for this variable.
From
Figure 9, it can be concluded that airline MQ (partial score 310) had the highest DR in the past, followed by a decreasing tendency, but in the last 10 years its DR has been slightly increasing.
5.6.3. Flight Duration Analysis
The variable “CRSElapsedTime” represents the scheduled duration of a flight, measured in minutes. This metric helps assess how planned flight lengths correlate with delay patterns, offering insights into whether longer or shorter flights are more prone to delays.
Table 17 provides a detailed category report for this variable.
The smallest DR (
Figure 10) is for the shortest duration flights, partial score 322. It is observed that almost all lines do not cross except for the partial scores of 310 and 312, which represent the longest duration flights and show an inversion of their DR over time.
It is notable that an increasing trend is observed for long-duration flights, while a decreasing trend is noted for short-duration flights (
Figure 11). An interesting conclusion can be drawn from the variable importance report, as shown in
Table 18.
No significant importance for certain variables is observed in the model. The most important variable is identified as the month, which intuitively aligns with cyclic weather patterns. The model’s predictive performance is not strong, indicating the need for additional data. However, based on variable importance, areas of interest or priorities can be defined. Weather data should be added first, followed by information on airport traffic and other relevant factors.
6. Discussion
The scorecard-based model developed in this study offers a transparent and accessible solution for predicting flight delays, making it highly suitable for operational use. Its structure facilitates straightforward integration into existing systems through dashboards or lightweight APIs, enabling real-time monitoring of the predicted delay rate (PDR). Such tools empower airline and airport personnel to make proactive decisions, such as resource reallocation or passenger notifications, before delays occur.
One of the main advantages of the scorecard system is its interpretability. Unlike complex ‘black-box’ models, scorecards present results in the form of cumulative scores derived from weighted variable categories. This structure allows an intuitive understanding of why a particular flight may be at a higher risk of delay. For example, stakeholders can see how specific factors—such as departure time, airport congestion, or weather conditions—contribute to the final score. These partial scores not only signal risk, but also guide targeted operational responses. To further support this, we recommend including annotated visual summaries of key scorecard outputs in dashboards, emphasizing which variables most influence delay probabilities in a given context.
Nevertheless, the model’s simplicity comes with limitations. The use of binning, while critical to interpretability, can reduce model granularity and adaptability over time. Static categorization may mask evolving patterns or lead to suboptimal thresholds in dynamic environments. To mitigate this, regular recalibration and reassessment of variable groupings are essential. By periodically updating scorecard parameters with new data, the model can retain its accuracy while preserving the clarity of interpretation that makes it practical.
It is important to emphasize that although the model was developed using flight data from the United States, the methodology itself is universally applicable and operationally neutral. The scorecard-based framework was designed to be independent of specific regulatory or infrastructural conditions, making it adaptable to a wide range of regional contexts.
Applying this method in another country would involve retraining the scorecard using locally sourced flight data, which can naturally result in different variables, thresholds, or weightings. However, the step-by-step process of variable selection, binning, scoring, and model calibration remains unchanged. This ensures that the core structure of the scorecard model can be replicated across global aviation systems, regardless of differences in traffic volume, weather conditions, or operational practices.
Furthermore, the transparency and interpretability of the model render it particularly apt for deployment on a global scale, especially in regions where technical or regulatory conditions necessitate tools that are explainable and capable of being audited. The use of scorecards also reduces the need for advanced infrastructure or specialized machine learning expertise, enhancing its usability in both developed and developing aviation markets.
In this sense, while U.S. data provided a robust foundation for initial model development, the methodology is not limited to a specific geographic context and is fully transferable to global applications.
7. Featured Application
The scorecard-based approach to flight delay prediction developed in this study provides a practical and accessible analytical tool for stakeholders in the air transportation industry. Its simplicity, interpretability, and scalability make it particularly suitable for integration into real-time operational environments, where decision-making must be both fast and explainable.
Unlike complex statistical models or black-box machine learning algorithms, scorecards allow for a transparent presentation of the influence of key operational variables—such as departure time, airport traffic, or weather conditions—on the probability of a flight delay. This transparency is critical for adoption by airline managers, airport coordinators, and regulatory bodies, especially in environments where model interpretability is a regulatory or operational necessity.
The scorecard can be implemented as part of airport management systems or airline scheduling platforms to continuously assess delay risk during various stages of flight preparation and execution. For example, based on available pre-departure information, operators can use the scorecard to flag flights with a high predicted delay rate (PDR). Such alerts could trigger proactive actions—such as aircraft reassignment, crew schedule adjustments, or early passenger communication—that help mitigate the impact of potential delays.
Moreover, the lightweight nature of the scorecard model makes it ideal for deployment in environments with limited computational resources, such as smaller regional airports or mobile airline control platforms. The model requires only a basic input dataset and delivers near-instant predictions that are easily interpretable by non-technical staff. This aligns well with the increasing demand for democratized access to analytics tools within operational teams.
The application of scorecards also supports benchmarking and performance monitoring efforts. By aggregating predicted and actual delay rates over time, airlines can identify bottlenecks, seasonal patterns, or systemic inefficiencies. Additionally, the interpretability of the model allows it to serve as a validation or calibration layer for more advanced AI/ML models. As explainable artificial intelligence (XAI) becomes increasingly important in regulated industries, scorecards can offer a foundation for trustworthy model governance and auditing.
In the broader context of logistics and transport optimization, the adoption of scorecards exemplifies the shift toward hybrid analytical approaches—combining simplicity with data-driven precision. The method presented in this paper offers an innovative, cost-effective, and practical solution to the pervasive problem of flight delays, with potential for further extension into multimodal logistics, passenger flow management, and infrastructure planning.
8. Conclusions
Scorecards are an innovative tool that significantly improves operational efficiency and service quality in the aviation sector. Their simplicity, flexibility, and ability to integrate multiple factors make this solution highly beneficial for both airlines and airports. By monitoring key performance indicators (KPIs) such as punctuality, ground handling time, and crew availability, scorecards enable a quicker identification of potential issues and the implementation of preventive measures. For instance, in cases of delays caused by adverse weather conditions, scorecards can pinpoint the most at-risk flights, allowing for better resource planning.
The simplicity of interpreting scorecard results facilitates data-driven decision-making without requiring advanced analytical skills. This tool is particularly useful in dynamic operational environments, where quick decisions are crucial. Moreover, high punctuality and operational reliability supported by scorecards enhance competitiveness, boosting passenger loyalty and market position.
This study aimed to answer the following research question: Can a scorecard-based model provide a simple, interpretable, and effective method for predicting flight delays? The results clearly confirm this. This study demonstrated that scorecards can be used to accurately predict flight delays using historical U.S. flight data. The model achieved high predictive performance (Gini = 94.6%) while maintaining full interpretability, addressing the research goal of developing a simple yet effective forecasting method.
Compared to traditional statistical and machine learning approaches, the scorecard method offers a compelling balance between accuracy, transparency, and ease of implementation.
Implementing scorecards also leads to a significant reduction in operational costs. Improvements in punctuality and predictability reduce expenses related to additional fuel consumption, crew overtime costs, and penalties or compensation for delayed passengers. Better resource management also results in a more efficient use of airport infrastructure and fleet. Furthermore, proactive delay management ensures that passengers are informed earlier about potential disruptions, minimizing frustration and improving customer satisfaction. The insights generated by scorecards can also be used to create transparent reports, fostering trust and loyalty among passengers.
Nonetheless, some limitations remain. The current model relies on historical and operational data but does not yet incorporate real-time external factors such as weather nowcasting or real-time congestion data. Additionally, binning methods used in scorecard construction, while interpretable, may lead to the loss of granularity in some scenarios.
Future research should aim to address these limitations by integrating real-time data feeds, incorporating dynamic variables, and testing the model in non-U.S. operational contexts. The inclusion of explainable AI (XAI) extensions may also offer a hybrid solution that preserves interpretability while boosting adaptability.
The potential for further research into scorecards is vast. Incorporating artificial intelligence (AI) into scorecard development could enable real-time updates of weights and scores based on live data, supporting the prediction of rare but critical events such as large-scale operational disruptions. Simultaneously, personalized scoring models tailored to the specific needs of different airports and airlines could enhance their effectiveness. Integrating scorecards with operational management systems, such as fleet management or enterprise resource planning (ERP) systems, requires a further exploration of interoperability and data standardization. In the context of sustainability, scorecards could include environmental indicators such as CO2 emissions and fuel consumption, supporting airlines in achieving their environmental goals. Long-term analyses evaluating the impact of scorecards on operational efficiency, costs, and passenger satisfaction would also provide valuable insights for future optimization.
In conclusion, this study confirmed that scorecards represent a viable and effective method for flight delay prediction, especially where interpretability and fast decision-making are required. Their versatility, low implementation cost, and strong predictive power position them as a valuable tool for the aviation industry’s ongoing digital transformation.