Abstract
The aviation industry operates as a complex, dynamic system generating vast volumes of data from aircraft sensors, flight schedules, and external sources. Managing this data is critical for mitigating disruptive and costly events such as mechanical failures and flight delays. This paper presents a comprehensive application of predictive analytics and machine learning to enhance aviation safety and operational efficiency. We address two core challenges: predictive maintenance of aircraft engines and forecasting flight delays. For maintenance, we utilise NASA’s C-MAPSS simulation dataset to develop and compare models, including one-dimensional convolutional neural networks (1D CNNs) and long short-term memory networks (LSTMs), for classifying engine health status and predicting the Remaining Useful Life (RUL), achieving classification accuracy up to 97%. For operational efficiency, we analyse historical flight data to build regression models for predicting departure delays, identifying key contributing factors such as airline, origin airport, and scheduled time. Our methodology highlights the critical role of Exploratory Data Analysis (EDA), feature selection, and data preprocessing in managing high-volume, heterogeneous data sources. The results demonstrate the significant potential of integrating these predictive models into aviation Business Intelligence (BI) systems to transition from reactive to proactive decision-making. The study concludes by discussing the integration challenges within existing data architectures and the future potential of these approaches for optimising complex, networked transportation systems.
1. Introduction
1.1. Big Data in the Aviation Industry
The aviation industry is characterised by vast amounts of complex, unstructured data that are subject to continuous change and can be classified as Big Data owing to their stochastic and dynamic nature. Consequently, analysing and visualising such data frequently necessitates the use of specialised software tools whose outputs serve as critical support or influential factors in decision-making processes aimed at optimising business objectives, enhancing customer satisfaction, and achieving a competitive advantage. This is particularly evident for several reasons:
The aviation industry currently faces heightened competition, largely due to deregulation and the proliferation of low-cost carriers, which attract potential customers through cost-effective offerings
Business processes within the aviation sector are intrinsically linked to decentralised data sources, including meteorological, economic, and diagnostic information
Operational responsibilities and task management must align with long-term historical data archives to identify patterns of application or best practices
Airlines require robust marketing strategies and a rigorous approach to service quality management, supported by predictive analytics to monitor market trends and customer feedback
Data such as market conditions or device/sensor status is frequently generated and distributed in real time at high intensity, necessitating improvements in the performance of integrated, industry-specific business intelligence tools []
The analysis of these characteristics presents both a challenge and an opportunity for optimising existing processes or adopting novel methodologies within aviation management. Aviation companies, airports, aircraft manufacturers, suppliers, governments, and other aviation-related organisations rely heavily on data for operational planning and process execution. However, the scale and complexity of these datasets pose significant technical and human challenges in collecting, sorting, and mining aviation databases—a task that exceeds the capabilities of conventional desktop computing systems. Furthermore, the so-called “3V” model of Big Data—comprising Volume, Variety, and Velocity—is particularly pertinent to this context. Volume necessitates specialised software for processing large-scale data with high performance and scalable storage solutions, while Variety introduces data from disparate sources in diverse formats, thereby complicating research analysis and ETL (Extract, Transform, Load) processes. Finally, Velocity refers to the continuous generation of data from industrial or economic processes such as aircraft sensors, air traffic, and weather monitoring [].
In addition to these three characteristics, a fourth dimension—referred to as “Veracity”—further complicates Big Data management. This pertains to data of suboptimal quality, which may be corrupted, contain unauthorised values or inaccurate measurements, or originate from unreliable sources. Such issues can lead to conflicting analytical results and adverse consequences for decision-making processes and integration with Big Data tools []. According to the IBM Big Data and Analytics Hub, one-third of business leaders express distrust in their data sources for critical decisions, resulting in annual losses exceeding $3 trillion due to misinformed choices based on imprecise information []. Thus, alongside the deployment of BI tools and predictive models, a rigorous approach to managing unstructured data is essential for effective enterprise analytics.
The scholarly article “Cross-Platform Aviation Analytics Using Big-Data Methods” identifies eight primary sources of Big Data within the aviation industry: flight tracking records, passenger details, airport operations, aircraft specifications, meteorological information, airline data, market intelligence, and aviation safety reports, as visualised in Figure 1 []. It is crucial to note that these data types are interdependent; no single source can independently provide a comprehensive overview of the industry’s current state. Consequently, prior to processing, integration of these diverse datasets into a unified repository is performed, after which they are fed into specialised analytical tools for further examination.
Figure 1.
Sources of data and data processing in the aviation industry.
1.2. Application of Business Intelligence Systems
Business intelligence systems can be defined in various ways; however, their most essential characteristic is that they consist of tools or a collection of technologies designed to extract actionable business insights from Big Data repositories. These systems aim to improve business operations and economic outcomes within specific contexts while enabling knowledge workers—such as managers, executives, and analysts—to make more informed and timely decisions. Specifically, they are highlighted for the following purposes:
Benchmarking of business process performance and goal achievement—enabling comparisons between entities, such as two airlines, to assess the impact of modified processes or new practices on market conditions
Quantitative analysis through predictive analytics, modelling, business process simulation, and statistical techniques—for instance, identifying patterns in passenger preferences for specific destinations, services, or travel habits, which can optimise flight schedules, service intervals, inventory management, and other operational aspects
Departmental or company-wide reporting—a feature consistently present across all domains of activity
Utilisation of Electronic Data Interchange (EDI) tools to facilitate information exchange between internal and external organisational entities, where standardisation remains a critical consideration
Deployment of knowledge management software to identify, organise, and disseminate organisational information to relevant stakeholders, primarily managers, who rely on these insights to make evidence-based decisions
Implementation of methodologies and procedures for interactive data collection techniques []
In addition, traditional analytical frameworks, models, and methods must be reconfigured to deliver decision-support services via cloud computing or Big Data platforms. Given the context in which their application is studied, this inherent characteristic aligns with the evolving demands of modern organisational environments [].
This paper investigates the feasibility of implementing a business intelligence system based on an ETL (Extract, Transform, Load) process combined with predictive models for analysing historical data, with the objective of addressing two critical challenges in modern aviation:
Monitoring aircraft equipment conditions to establish planned and proactive maintenance strategies that prevent system failures and minimise equipment unavailability
Developing an optimal flight schedule by leveraging historical data on previous delays to mitigate future disruptions []
The primary contributions of this work to the current state of knowledge are as follows:
A systematic comparison of 1D CNN and LSTM architectures for both binary/multiclass health classification and Remaining Useful Life (RUL) prediction on the NASA C-MAPSS dataset, achieving up to 97% classification accuracy and identifying the most influential sensor features through SHAP-based interpretability—an approach rarely applied in prior aviation prognostics studies.
A transparent, reproducible pipeline for flight delay prediction using real-world U.S. flight records, which quantifies the relative impact of airline identity, origin airport, and scheduled departure time while exposing the limitations of polynomial regression in the presence of extreme outliers and sparse airport-level data.
An integrated methodological framework that bridges Exploratory Data Analysis (EDA), feature importance analysis, and model validation to support the deployment of predictive analytics within aviation Business Intelligence (BI) systems, offering a practical blueprint for transitioning from reactive to proactive decision-making in dynamic transportation environments.
Beyond a theoretical and practical examination of these use cases, this study presents an objective comparison of results derived from alternative approaches or models. Advantages and limitations are evaluated from both human and technological perspectives for each method, while potential extensions, upgrades specific to the aviation domain, and adaptability to other fields are discussed. The methodology employed in this research draws on a range of sources, including academic books, peer-reviewed scientific papers, industry articles, and reputable online resources from experts or organisations specialising in economics and information technology, whereas the data utilised for analysis consists of publicly accessible archived datasets or simulation results.
The remainder of this paper is structured as follows. Section 2 reviews related work in predictive maintenance and flight delay forecasting, highlighting key gaps and opportunities in current research. Section 3 outlines the research methodology, including the formulation of the central problem, research objectives, and the design science–inspired approach used to develop predictive analytics pipelines. Section 4 provides context for the case studies by discussing the role of Business Intelligence systems in aviation, the operational challenges of predictive maintenance, and the impact of flight delays and cancellations. Section 5 details the materials and methods, covering dataset descriptions, preprocessing steps, feature selection strategies, and the design of machine learning models for both engine health assessment and delay prediction. Section 6 presents and discusses the experimental results, including classification performance, Remaining Useful Life estimation accuracy, and regression-based delay forecasting across multiple airlines and airports. Finally, Section 7 offers concluding remarks, reflects on limitations, and outlines directions for future work—including integration with real-time monitoring systems, domain adaptation, and ethical considerations in safety-critical AI deployment.
2. Related Work
A substantial body of scholarly research has explored the application of predictive analytics and machine learning in aviation, particularly in the domains of passenger experience, operational efficiency, and system reliability. Shiwakoti et al. [] examined passengers’ perceptions of digital technologies adopted by airlines during the COVID-19 pandemic, finding that AI-driven predictive tools can enhance user trust when integrated with transparent operational practices. However, their study remained largely qualitative and did not propose or evaluate concrete predictive models. Our work extends this insight by operationalising trust through model interpretability: we integrate SHAP-based explanations into deep learning pipelines for engine health monitoring, thereby transforming abstract “transparency” into actionable, engineer-verifiable feature attributions—a step not addressed in [].
Noviantoro and Huang [] presented a data mining framework for airline passenger satisfaction, emphasising preprocessing, attribute selection, and performance evaluation in Big Data contexts. While their methodological rigour in feature engineering is valuable, their focus was limited to survey-based satisfaction metrics rather than operational outcomes like delays or mechanical failures. In contrast, our study adapts and expands their data preparation principles to time-series sensor data and real-world flight records, shifting the objective from customer sentiment to system reliability and schedule adherence—thereby broadening the applicability of their framework to safety-critical decision-making.
Predictive maintenance has been explored in industrial settings beyond aviation. Kang et al. [] demonstrated the use of artificial neural networks for Remaining Useful Life (RUL) prediction in production-line equipment, establishing foundational architectures for degradation modelling. Yet their approach treated RUL as a purely regression-based problem without considering discrete health states or operational context. Our work improves upon this by introducing a dual-task learning paradigm—simultaneous classification (normal/degrading/failure) and RUL regression—on the NASA C-MAPSS dataset, which better aligns with real-world maintenance protocols that require both diagnostic and prognostic outputs.
Similarly, Truong et al. [] applied business analytics to predict on-time flight performance, identifying key factors such as air traffic congestion and weather. While their regression-based approach offers useful macro-level insights, it lacks granularity in modelling airline- and airport-specific delay dynamics and does not incorporate modern machine learning techniques. Our study addresses this gap by developing airline-aware delay prediction models that quantify heterogeneity across carriers and origins, and by benchmarking linear, polynomial, and regularised regression under cross-validation—revealing the fragility of polynomial fits in the presence of outliers, a limitation unexamined in [].
The transferability of predictive maintenance concepts across industries has also been noted. A study on vehicle maintenance using the “Keep the Machine Running” application [] illustrates how BI platforms can monitor mechanical assets in automotive contexts. However, this work assumes relatively stable operating environments and does not account for the high-dimensional, time-varying sensor streams typical of jet engines. Our research extends this cross-domain vision by demonstrating how aviation-specific challenges—such as multi-component degradation, operational setting shifts, and sparse failure data—necessitate tailored feature selection (e.g., sensor monotonicity analysis) and hybrid deep learning architectures, thus refining the notion of “domain transfer” for complex cyber-physical systems.
Finally, the NASA C-MAPSS dataset, introduced in the seminal work by Saxena et al. [], has become the de facto benchmark for engine prognostics. While [] focused on simulating realistic degradation physics, it did not propose or evaluate machine learning models for RUL prediction. Subsequent studies have used C-MAPSS for algorithm benchmarking, but often report only aggregate error metrics without interpretability or operational feasibility analysis. Our contribution lies in closing this loop: we not only achieve high classification accuracy (up to 97%) but also identify the most influential sensors (e.g., high-pressure compressor pressure, fuel flow ratio) through SHAP, enabling maintenance teams to validate predictions against physical failure modes—an advancement that bridges the gap between simulation-based research and field-deployable diagnostics.
Collectively, the reviewed literature confirms the growing role of data-driven methods in aviation but reveals three persistent gaps. First, predictive maintenance and flight operations are typically studied in isolation, despite their operational interdependence. Second, many studies prioritise predictive accuracy over interpretability, limiting trust and adoption in safety-critical settings. Third, model evaluation often neglects real-world constraints such as data sparsity, extreme outliers, and heterogeneous operational contexts. This paper directly addresses these limitations by (1) integrating engine health monitoring and delay forecasting within a unified BI framework, (2) embedding SHAP-based interpretability into deep learning models, and (3) rigorously evaluating model robustness under domain-specific challenges—thereby advancing both the methodological rigour and practical relevance of predictive analytics in dynamic aviation systems.
3. Research Methodology
This study is motivated by a pressing operational challenge in modern aviation: the absence of integrated, scalable, and interpretable predictive analytics frameworks capable of simultaneously addressing mechanical reliability and scheduling inefficiencies within a unified decision-support architecture. While the literature contains numerous isolated studies on either engine prognostics or flight delay prediction, real-world aviation systems demand a holistic perspective that acknowledges the interdependence of asset health, operational tempo, and external disruptions. The central research problem, therefore, is formulated as follows: How can data-driven machine learning models be systematically designed, validated, and contextualised to support proactive decision-making in dynamic aviation environments, with specific focus on aircraft engine health monitoring and flight departure delay forecasting? This problem is inherently interdisciplinary, spanning predictive maintenance engineering, time-series analytics, regression modelling, and business intelligence system design.
To address this problem, the research pursues three interrelated objectives. First, it seeks to develop and comparatively evaluate deep learning architectures—specifically one-dimensional Convolutional Neural Networks (1D CNNs) and Long Short-Term Memory (LSTM) networks—for both multi-state classification of engine degradation and regression-based estimation of Remaining Useful Life (RUL) using the NASA C-MAPSS dataset. Second, it aims to construct and benchmark interpretable regression models for predicting flight departure delays using real-world U.S. flight records, with explicit attention to the influence of airline identity, origin airport, and scheduled departure time as contextual covariates. Third, and most critically, the study endeavours to synthesise these technical efforts into a coherent methodological pipeline that integrates Exploratory Data Analysis (EDA), feature importance quantification, model validation, and operational feasibility assessment—thereby offering a transferable blueprint for embedding predictive analytics into aviation Business Intelligence (BI) ecosystems.
The methodological approach adopted in this research is grounded in the principles of design science research (DSR), which emphasises the construction, evaluation, and iteration of technological artefacts to solve complex real-world problems. Within this paradigm, the artefacts in question are end-to-end predictive modelling pipelines that transform raw, heterogeneous aviation data into actionable operational insights. The research design is fundamentally empirical and quantitative, combining simulation-based analysis for engine prognostics with observational data analysis for flight operations. This dual-track strategy ensures methodological robustness while accommodating the distinct data generation mechanisms of each use case: synthetic but physically informed degradation trajectories in C-MAPSS versus real-world but noisy and incomplete flight records from the U.S. Bureau of Transportation Statistics.
The overall research workflow proceeds through a sequence of interdependent phases. It begins with data acquisition and contextual framing, wherein domain requirements are mapped to data availability and modelling constraints. This is followed by an intensive Exploratory Data Analysis (EDA) phase, during which data completeness, distributional characteristics, temporal patterns, and feature redundancies are rigorously assessed. Critical preprocessing steps—including handling of missing values, removal of non-informative features (e.g., constant operational settings), label derivation for RUL, and temporal encoding of flight schedules—are then applied to produce clean, structured datasets suitable for machine learning. Feature engineering is performed with domain awareness: for engine data, temporal windows and sensor-derived health indicators are constructed; for flight data, categorical variables such as airline and airport are encoded using One-Hot Encoding to preserve semantic distinctions without imposing artificial ordinality.
Model development proceeds along two parallel tracks. For predictive maintenance, supervised learning models are trained to perform both classification (normal, degrading, critical) and regression (RUL estimation) tasks. The architectures evaluated include 1D CNNs for local temporal pattern extraction, LSTMs for long-range dependency modelling, and hybrid variants. For flight delay prediction, a suite of regression models—linear regression, polynomial regression, Ridge regression, and decision trees—is implemented to assess trade-offs between model complexity, interpretability, and robustness to outliers. All models are evaluated using discipline-appropriate metrics: classification performance is assessed via accuracy, F1-score, and Area Under the ROC Curve (AUC), while regression quality is measured through Mean Squared Error (MSE), Mean Absolute Error (MAE), and coefficient of determination (R2). To guard against overfitting and ensure generalisability, K-fold cross-validation is employed systematically across both modelling domains.
While our individual models employ established architectures, the methodological novelty of this work resides in three domain-informed design choices: (1) a dual-task engine health assessment framework that unifies diagnostic classification and prognostic RUL estimation to mirror real-world maintenance workflows; (2) a context-sensitive delay prediction strategy that treats airline–airport pairs as atomic units of analysis, capturing operational heterogeneity ignored by aggregate models; and (3) an open, modular analytics architecture that decouples predictive logic from visualisation layers, enabling transparent, auditable, and vendor-agnostic integration into aviation BI ecosystems. These choices reflect a shift from algorithm-centric benchmarking to operationally grounded system design.
A distinguishing feature of this methodology is its emphasis on interpretability and operational relevance. Rather than treating models as black boxes, the study integrates SHAP (SHapley Additive exPlanations) to quantify the contribution of individual sensors or operational factors to model predictions. This not only enhances trust among domain experts but also facilitates the identification of physically meaningful degradation signatures or delay drivers. Furthermore, model limitations—such as sensitivity to extreme delay outliers or RUL overestimation in early degradation phases—are explicitly documented and discussed in the context of real-world deployment constraints. The entire workflow is designed to be modular and reproducible, enabling future extension to other transport modes or integration with real-time monitoring systems.
This methodological framework bridges the gap between theoretical machine learning research and practical aviation operations. By embedding domain knowledge into every phase—from feature selection to model validation—it ensures that the resulting artefacts are not only statistically sound but also operationally meaningful. The subsequent sections of this paper detail the implementation of this methodology, beginning with a contextual overview of aviation BI systems and followed by a granular description of datasets, preprocessing routines, and model architectures.
4. Case Study Context
4.1. Business Intelligence Systems in Aviation
The integration of data-driven intelligence into aviation operations has evolved beyond static reporting toward dynamic, forward-looking decision support. While commercial Business Intelligence (BI) platforms—both general-purpose (e.g., Tableau, Power BI) and aviation-specialised (e.g., IATA AirSAT, Teradata Aviation Analytics)—offer pre-configured dashboards and KPI tracking, they are fundamentally descriptive or diagnostic in nature. These systems excel at retrospective analysis but lack native support for prognostic modelling, particularly when such modelling requires custom deep learning architectures, real-time feature engineering, or explainability mechanisms tailored to safety-critical contexts.
This limitation is especially pronounced in two high-impact operational domains: predictive maintenance of aircraft engines and flight delay forecasting. In predictive maintenance, the degradation process is inherently nonlinear, multivariate, and embedded in high-frequency sensor time series. Commercial BI tools typically treat sensor data as scalar metrics for threshold-based alerts, ignoring temporal patterns that precede failure. Similarly, in delay prediction, the interplay between airline-specific operational policies, airport congestion dynamics, and exogenous factors (e.g., weather) demands flexible, adaptive regression frameworks that can incorporate interaction effects and handle sparse, skewed distributions—capabilities absent in rigid, template-driven BI reporting modules.
Moreover, the closed-source nature of most industry BI platforms impedes model transparency, a non-negotiable requirement in aviation. Regulatory bodies such as EASA and FAA increasingly emphasise model interpretability and auditability for any system influencing maintenance or scheduling decisions. Black-box vendor models that cannot be inspected, validated, or modified by airline engineers or safety officers pose significant certification and liability risks. This is further compounded by data governance constraints: airlines are often unwilling to upload sensitive operational telemetry to third-party cloud BI services due to cybersecurity and intellectual property concerns.
In response to these challenges, a growing body of research advocates for open, modular analytics pipelines built on scientific computing ecosystems (e.g., Python, R) that prioritise reproducibility, version control, and integration with existing data lakes. Such approaches enable end-to-end control—from raw sensor ingestion to SHAP-based explanation—while remaining compatible with enterprise data architectures through APIs or containerization. This paradigm shift reflects a broader trend in critical infrastructure sectors: rather than retrofitting predictive capabilities into legacy BI suites, organisations are developing domain-specific analytical microservices that feed insights into existing dashboards via standardised interfaces.
Our study aligns with this emerging best practice. Rather than evaluating or comparing commercial BI platforms—a task better suited to enterprise IT procurement studies—we focus on designing, implementing, and validating a predictive core that could inform or enhance any BI system. The choice of Python-based open-source libraries (TensorFlow, Scikit-learn, SHAP) is thus not merely technical but strategic: it ensures full reproducibility, facilitates peer review, and lowers barriers to adoption by research institutions and smaller carriers lacking enterprise software licences.
Critically, this approach does not reject BI systems outright; instead, it repositions them as consumers of predictive outputs rather than generators of insight. For instance, RUL estimates or delay risk scores produced by our models can be exposed as REST endpoints or Kafka streams, enabling real-time integration into cockpit alerts, maintenance planning tools, or passenger notification systems—without requiring BI vendors to embed complex ML logic internally.
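As an illustration of this integration pattern, the following minimal sketch (not part of the study's codebase) shows how an RUL estimate could be served over REST; the use of FastAPI, the endpoint path, and the model artefact name are our own illustrative assumptions.

```python
# Minimal sketch: serving RUL estimates over REST so that BI dashboards can consume them.
# FastAPI, the /rul path, and the artefact name "rul_lstm.keras" are illustrative assumptions.
from fastapi import FastAPI
from pydantic import BaseModel
import numpy as np
from tensorflow import keras

app = FastAPI()
model = keras.models.load_model("rul_lstm.keras")  # hypothetical trained LSTM artefact

class SensorWindow(BaseModel):
    readings: list[list[float]]  # one window of sensor values: (timesteps, features)

@app.post("/rul")
def predict_rul(window: SensorWindow) -> dict:
    x = np.asarray(window.readings, dtype="float32")[np.newaxis, ...]  # add batch dimension
    rul_cycles = float(model.predict(x, verbose=0)[0, 0])
    return {"rul_cycles": rul_cycles}  # downstream BI tools read this JSON payload
```

A maintenance-planning dashboard could then poll this endpoint, or an equivalent Kafka consumer could subscribe to a stream of such predictions, without any ML logic living inside the BI layer itself.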
By anchoring our methodology in this open-analytics philosophy, we address a key gap in the literature: most prior work either (a) treats BI platforms as monolithic solutions without dissecting their analytical limitations, or (b) develops predictive models in isolation without considering how they might interface with operational decision ecosystems. Our contribution lies in demonstrating a viable, transparent, and extensible pathway from raw aviation data to actionable foresight—one that prioritises scientific rigour over vendor convenience.
4.2. Predictive Maintenance in Aviation
The production lines of aircraft manufacturers comprise a substantial quantity of equipment, sensors, and machines via which components are assembled to create finished products. While various aircraft parts are produced, situations may arise in which their quality deteriorates, leading to frequent failures of individual or critical components that hinder system operations. Such failures significantly impact performance and operational costs within the aviation industry, often resulting in marked reductions in availability due to compromised safety, reliability, and costly maintenance periods. To mitigate these risks, maintenance activities for such assets are planned proactively, with reducing associated costs considered a crucial advantage in the highly competitive manufacturing sector, where maintenance expenses can escalate to 70% of total operational costs. Similarly, in the automotive industry, integrating predictive maintenance models represents an explicitly strategic initiative, enabling timely interventions and often incorporating these practices into production processes for preventive purposes due to the heightened likelihood of component failures. One approach to address such challenges is predictive maintenance, a strategy that involves proactive assessment of system conditions to schedule maintenance activities and prevent unforeseen failures in the near future [].
The foundation of predictive maintenance lies in monitoring the current mechanical state, operational efficiency, and stability of equipment, alongside other parameters influencing overall system performance. This practice is typically facilitated through the deployment of advanced sensors, diagnostic instruments, and simulation software for historical data analysis []. The primary objective of this process is to extend the interval between repairs while simultaneously minimising the frequency of repairs, unplanned outages, malfunctions, and related costs. Multiple activities—such as vibration monitoring, tribology analysis, thermographic measurements, and others—are employed in this effort, with operational parameters continuously monitored to optimise equipment availability and significantly reduce maintenance expenditures.
From a business intelligence perspective, integrating predictive maintenance systems into existing work environments presents several challenges. First, real-time collection of aircraft failure data is difficult without a robust Big Data infrastructure, comprehensive data warehouses, domain expertise, and customised software capable of managing such data flows. Second, establishing a high-quality ETL pipeline and conducting exploratory analysis becomes essential given the variability in data structure, decentralised sources, and disparate formats. Finally, privacy concerns often prevent companies from publicly sharing this sensitive data, thereby limiting the validation of developed models. As an alternative, datasets generated by simulation tools are available; however, input parameters must be calibrated according to the specifications of the target system during simulations to ensure relevance and accuracy.
4.3. Flight Delays and Cancellations
As air travel becomes increasingly prevalent, flight delays or cancellations have emerged as significant determinants of passenger experiences within the aviation industry. Any period of waiting for a service can adversely affect customers in multiple ways, potentially inducing frustration, impatience, insecurity, and dissatisfaction with the provided service. Data from historical records up to 2007, sourced from the US Bureau of Transportation Statistics, indicates an annual increase in flight delays of 1.1% []. Beyond this upward trend, flight delays pose serious challenges, including:
- Adverse user experiences, where time constitutes a critical resource for customers
- Operational difficulties at smaller airports due to heightened congestion and delays
- Reputational damage to airlines stemming from unfavourable reviews and negative experiences, often prompting prospective passengers to seek alternative service providers
- Financial losses encompassing compensation claims, corrective measures, reimbursements, or unexpected expenses arising from severe incidents such as equipment malfunctions []
According to the “Total Delay Impact Study,” air transport delays in the United States during 2007 were estimated to cost $32.9 billion for passengers and the aviation industry, contributing to a $4 billion reduction in GDP []. Consequently, predictive models for flight delays can enhance airline operational efficiency and passenger satisfaction while supporting economic growth within the sector through optimised flight scheduling, improved arrival/departure times, and identification of correlations with other aviation-related variables. A critical component of this task involves generating and presenting reports or parameter values to non-technical personnel or individuals lacking analytics expertise; in this context, data visualisation tools and exploratory data analysis (EDA) libraries are indispensable for effective communication of findings.
5. Materials and Methods
5.1. Datasets (C-MAPSS and Kaggle)
The focus of this study is on predictive maintenance of aircraft engines, using the C-MAPSS dataset commonly employed to simulate engine operation. C-MAPSS (Commercial Modular Aero-Propulsion System Simulation) is a simulation tool for larger commercial turbofan engines, implemented within the MATLAB and Simulink environment. It comprises a large number of user-editable input parameters that depend on operational profiles, closed-loop controller configurations, environmental conditions, and other factors. For this paper, a dataset was utilised featuring a closed-loop configuration with fourteen input parameters—including fuel flow and system health indicators—that enable simulation of degradation or failure in any of the five rotating components (fan, low-pressure compressor (LPC), high-pressure compressor (HPC), high-pressure turbine (HPT), and low-pressure turbine (LPT)), as presented in Table 1. The outputs include sensor response surfaces and operating margins, with a graphical user interface available to facilitate controller design and simulation, as shown in Figure 2.
Table 1.
The input parameters for simulating degradation scenarios of rotating components.
Figure 2.
A simplified engine model with system response parameters []. Major engine sections are shown: Fan (green), Low-Pressure Compressor (LPC, blue), High-Pressure Compressor (HPC, yellow), High-Pressure Turbine (HPT, red), and Low-Pressure Turbine (LPT, orange). Key parameters include mass flow rates (W21, W22, W25, W31, W32, W48, W50), pressures (P2, P15, P24, P30, P50), temperatures (T2, T24, T30, T48, T50), fan and core speeds (Nf, Nc), and fuel flow (Wf).
The dataset comprises multiple multivariate time series from various engines. Each engine begins with a different degree of initial wear and manufacturing variation that is unknown to the user; this variation is considered normal rather than a fault condition. Three operational settings are essential for running the simulation and are included within the dataset. Initially, each time series represents normal engine operation; subsequently, a fault develops at some point in the sequence. In the training set, the fault grows in magnitude until system failure, whereas each test series ends some time before failure occurs.
The data is provided as a ZIP file containing text files with 26 columns. Each row corresponds to a data reading recorded during one operating cycle, with the following column structure (a brief loading sketch is given after the list):
- Unique engine identifier (ID)
- Time elapsed in cycles
- Three operational settings
- Twenty-one sensor outputs from the engine []
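As a minimal sketch, the 26-column layout above can be read with pandas as follows; the column names are our own convention for illustration, and "train_FD001.txt" is one of the four training subsets distributed with the dataset.

```python
# Minimal sketch: reading one C-MAPSS training file into the 26-column layout described above.
# Column names are our own convention; "train_FD001.txt" is one of the four provided subsets.
import pandas as pd

COLUMNS = (["engine_id", "cycle"]
           + [f"op_setting_{i}" for i in range(1, 4)]
           + [f"s{i}" for i in range(1, 22)])

train_df = pd.read_csv("train_FD001.txt", sep=r"\s+", header=None, names=COLUMNS)

# Quick sanity check: distribution of engine lifetimes (maximum cycle per engine), cf. Figure 3
print(train_df.groupby("engine_id")["cycle"].max().describe())
```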
The dataset utilised for flight delay prediction was retrieved from a publicly available Kaggle source and comprises three tables exported as CSV files, namely:
- “airlines.csv”—which contains records of airline names and identifiers
- “airports.csv”—which contains records of airport names, locations, and codes
- “flights.csv”—which serves as the primary dataset for recording flight schedules, metadata, planned and actual times across various stages of flights, and flags indicating causes of delays or cancellations []
The “flights.csv” file includes the following columns (see the loading sketch after this list):
- Flight-related temporal data—YEAR, MONTH, DAY, DAY_OF_WEEK
- Identifiers—AIRLINE, FLIGHT_NUMBER, TAIL_NUMBER
- Origin and destination information—ORIGIN_AIRPORT, DESTINATION_AIRPORT, DISTANCE
- Temporal metrics—TAXI_IN, ARRIVAL_TIME, ARRIVAL_DELAY, DEPARTURE_TIME, DEPARTURE_DELAY, TAXI_OUT, SCHEDULED_TIME, ELAPSED_TIME, AIR_TIME
- Timestamps—SCHEDULED_ARRIVAL, WHEELS_ON, WHEELS_OFF, SCHEDULED_DEPARTURE
- Status indicators—DIVERTED, CANCELLED
- Cancellation-related attributes—CANCELLATION_REASON, AIR_SYSTEM_DELAY, SECURITY_DELAY, AIRLINE_DELAY, LATE_AIRCRAFT_DELAY, WEATHER_DELAY []
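A minimal loading sketch under the column names listed above is shown below; restricting the frame to the delay-relevant columns mirrors the attribute selection discussed later in Section 5.5.1.

```python
# Minimal sketch: loading the Kaggle tables and keeping only the columns used later
# for delay modelling (column names as listed above).
import pandas as pd

flights = pd.read_csv("flights.csv", low_memory=False)
airlines = pd.read_csv("airlines.csv")   # airline names and identifiers
airports = pd.read_csv("airports.csv")   # airport names, locations, and codes

KEEP = ["YEAR", "MONTH", "DAY", "AIRLINE", "ORIGIN_AIRPORT", "DESTINATION_AIRPORT",
        "SCHEDULED_DEPARTURE", "DEPARTURE_TIME", "DEPARTURE_DELAY",
        "SCHEDULED_ARRIVAL", "ARRIVAL_TIME", "ARRIVAL_DELAY", "DISTANCE"]
flights = flights[KEEP]
```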
5.2. Justification of Model Selection
The choice of machine learning architectures for both classification and regression tasks was guided by a confluence of theoretical, empirical, and operational considerations specific to aviation predictive analytics. In the domain of engine health monitoring, the input data consist of multivariate time series generated by sensors under varying operational conditions. These sequences exhibit temporal dependencies, non-stationary dynamics, and gradual degradation patterns that evolve over hundreds of operational cycles. Traditional statistical models (e.g., linear regression, ARIMA) are ill-suited for such data due to their inability to capture complex, nonlinear interactions across sensor streams and operational settings.
1D Convolutional Neural Networks (1D CNNs) were selected because they excel at extracting local temporal patterns—such as transient spikes in pressure or temperature—that often precede mechanical failure. Unlike 2D CNNs designed for spatial data, 1D CNNs apply filters along the time axis, making them computationally efficient and highly effective for sensor-based time-series classification. Their hierarchical feature learning capability allows automatic detection of discriminative motifs without manual feature engineering, a critical advantage given the high dimensionality (21 sensors) and redundancy (e.g., correlated core speed sensors) observed in the C-MAPSS dataset.
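As a minimal sketch (with assumed window length, sensor count, and layer sizes rather than the study's exact architecture), such a 1D CNN classifier over fixed-length sensor windows could be defined as follows:

```python
# Minimal sketch: a 1D CNN that classifies fixed-length sensor windows into three health states.
# WINDOW, N_FEATURES, and all layer sizes are illustrative assumptions.
from tensorflow import keras
from tensorflow.keras import layers

WINDOW = 30        # assumed window length in operating cycles
N_FEATURES = 14    # assumed number of retained sensors/settings

def build_cnn_classifier() -> keras.Model:
    model = keras.Sequential([
        keras.Input(shape=(WINDOW, N_FEATURES)),
        # Convolutions slide along the time axis, extracting short-term motifs
        layers.Conv1D(32, kernel_size=5, padding="same", activation="relu"),
        layers.Conv1D(64, kernel_size=5, padding="same", activation="relu"),
        layers.GlobalAveragePooling1D(),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(3, activation="softmax"),  # normal / monitoring / failure
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```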
Long Short-Term Memory (LSTM) networks, in contrast, were chosen to model long-range temporal dependencies inherent in degradation processes. While 1D CNNs capture short-term anomalies, LSTMs maintain a memory cell that can retain information over extended sequences—enabling them to track slow drifts in sensor baselines that signal progressive wear. This dual capability aligns with the dual objectives of our study: diagnostic classification (short-term fault detection) and prognostic RUL estimation (long-term trend forecasting). The inclusion of RNNs as a baseline further allows us to assess whether simpler recurrent structures suffice or whether gated mechanisms (as in LSTMs) are necessary for performance gains.
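Analogously, a minimal LSTM sketch for RUL regression (hyperparameters again assumed for illustration) might take the following form:

```python
# Minimal sketch: an LSTM regressor mapping a sensor window to a single RUL estimate in cycles.
# WINDOW, N_FEATURES, and all layer sizes are illustrative assumptions.
from tensorflow import keras
from tensorflow.keras import layers

WINDOW = 30
N_FEATURES = 14

def build_lstm_rul() -> keras.Model:
    model = keras.Sequential([
        keras.Input(shape=(WINDOW, N_FEATURES)),
        layers.LSTM(64, return_sequences=True),   # tracks long-range degradation trends
        layers.LSTM(32),
        layers.Dense(32, activation="relu"),
        layers.Dense(1, activation="linear"),     # RUL in operating cycles
    ])
    model.compile(optimizer="adam", loss="mse", metrics=["mae"])
    return model
```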
For the classification task, we adopted a multi-state health labelling scheme (normal, monitoring, failure) rather than binary failure detection. This reflects real-world maintenance protocols, where early warnings trigger inspection rather than immediate grounding. The selected deep learning models naturally support multiclass outputs through softmax layers, and their end-to-end training ensures joint optimisation of feature extraction and decision boundaries—unlike pipeline approaches that decouple representation and classification.
For RUL regression, we evaluated both data-driven deep learning (LSTM) and physics-informed statistical models (exponential degradation, similarity-based). The LSTM was retained despite its higher computational cost because it makes no assumptions about degradation shape, unlike exponential models that enforce monotonic decline. However, we also included classical approaches to benchmark against interpretable, low-parameter alternatives—a necessity in safety-critical domains where model transparency affects certification and trust.
In the flight delay prediction task, the data exhibit heterogeneous structure: a mix of categorical variables (airline, airport) and continuous temporal features (scheduled departure time), alongside heavy-tailed delay distributions with extreme outliers. Given these characteristics, we prioritised interpretable regression models over black-box alternatives (e.g., deep neural networks) for three reasons:
- Operational transparency: Dispatchers and schedulers require clear attribution of delay causes (e.g., “Spirit Airlines at JFK at 6 PM averages 22-min delays”), which linear and polynomial models provide through coefficient inspection.
- Data sparsity: Many airline–airport pairs have limited historical records. Complex models overfit sparse regimes, whereas regularised linear models (Ridge regression) stabilise estimates by penalising large coefficients.
- Baseline robustness: As shown in prior aviation analytics studies [,], simple regression models often match or exceed the performance of complex learners when data quality is inconsistent or outliers dominate.
Polynomial regression was included to test for nonlinear temporal effects (e.g., U-shaped delay patterns across the day), but its susceptibility to overfitting—especially in the presence of extreme delays—necessitated K-fold cross-validation and outlier sensitivity analysis. Decision trees were added as a nonparametric alternative to capture interaction effects (e.g., airline × airport) without assuming linearity. Collectively, this methodological portfolio balances expressiveness, interpretability, robustness, and domain alignment. It reflects a deliberate shift from “maximizing accuracy at all costs” to “optimizing for deployability in real-world aviation BI systems”—a perspective increasingly emphasised in applied AI research for critical infrastructure.
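The following sketch illustrates this modelling portfolio under K-fold cross-validation; the derived feature SCHEDULED_DEPARTURE_HOUR (hour extracted from the scheduled departure) and the concrete degrees and regularisation strength are illustrative assumptions, not the study's exact settings.

```python
# Minimal sketch: benchmarking linear, polynomial and Ridge regression for departure delays
# under 5-fold cross-validation. SCHEDULED_DEPARTURE_HOUR is a derived feature; the
# polynomial degree and Ridge alpha are illustrative.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

CATEGORICAL = ["AIRLINE", "ORIGIN_AIRPORT"]
NUMERIC = ["SCHEDULED_DEPARTURE_HOUR"]

def make_encoder(poly_degree=None) -> ColumnTransformer:
    # One-Hot Encoding preserves airline/airport semantics without artificial ordinality;
    # polynomial expansion of the hour feature tests for nonlinear intra-day delay patterns.
    transformers = [("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL)]
    if poly_degree:
        transformers.append(
            ("hour", PolynomialFeatures(degree=poly_degree, include_bias=False), NUMERIC))
    return ColumnTransformer(transformers, remainder="passthrough")

def benchmark(df: pd.DataFrame) -> None:
    X, y = df[CATEGORICAL + NUMERIC], df["DEPARTURE_DELAY"]
    models = {
        "linear": make_pipeline(make_encoder(), LinearRegression()),
        "poly-3": make_pipeline(make_encoder(poly_degree=3), LinearRegression()),
        "ridge":  make_pipeline(make_encoder(), Ridge(alpha=1.0)),
    }
    for name, pipe in models.items():
        mse = -cross_val_score(pipe, X, y, cv=5, scoring="neg_mean_squared_error")
        print(f"{name}: MSE = {mse.mean():.1f} ± {mse.std():.1f}")
```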
5.3. Implementation Environment
The predictive models and data processing pipelines described in this study were implemented following established software engineering principles to ensure modularity, reproducibility, and maintainability. The entire codebase was developed in Python 3.9, leveraging open-source scientific libraries within a structured project architecture. Key dependencies include NumPy 2.2.0 for numerical operations, Pandas 2.3.3 for data manipulation, Scikit-learn 1.7.1 for classical machine learning models and preprocessing utilities, Matplotlib 3.10.7 and Seaborn 0.13.2 for visualisation, and SHAP 0.48.0 for model interpretability. Deep learning models (1D CNN, LSTM, RNN) were implemented using TensorFlow 2.12 with Keras 3.11.2 as the high-level API, enabling GPU acceleration via CUDA 11.8 and cuDNN 8.6.
All experiments were executed on a workstation equipped with an AMD Ryzen 5600x CPU (6 cores, 3.7 GHz), 32 GB DDR4 RAM, and an MSI GeForce RTX 4070 VENTUS 2X 12G OC (Micro-Star International Co., Ltd., New Taipei City, Taiwan), running Ubuntu 22.04 LTS. For reproducibility, the computational environment was containerised using Docker (v24.0), with a custom image built from the official “python:3.9-slim” base. This ensured consistent dependency resolution across development and evaluation phases.
The software architecture follows a modular design that organises the workflow into distinct components. A data component manages both raw and processed datasets and includes scripts for exploratory data analysis, missing-value handling, feature encoding, and labelling of remaining useful life (RUL). The modelling component provides reusable implementations of algorithms such as 1D CNNs, LSTMs, and linear, polynomial, or Ridge regression. Evaluation modules compute metrics such as F1-score, MSE, MAE, and R2, and handle cross-validation as well as model explainability using SHAP. Additionally, the framework includes Jupyter notebooks for exploration and visualisation, as well as configuration-driven experiment scripts that automate training and evaluation with hyperparameter sweeps.
Version control was managed via Git, with the repository hosted on a private GitLab instance. Each experiment was logged using MLflow, capturing code versions, hyperparameters, metrics, and artefacts (e.g., trained models, plots). This enabled traceability and facilitated comparison across model variants (e.g., LSTM with vs. without early stopping). While formal unit testing was limited due to the research-oriented nature of the project, critical data preprocessing functions (e.g., RUL calculation, One-Hot Encoding) were validated against hand-computed edge cases. Additionally, all figures and numerical results were regenerated from the final codebase prior to manuscript submission to ensure consistency. This engineering-aware implementation not only supports the scientific validity of our findings but also enhances the potential for future integration into production-grade aviation BI systems, where maintainability, auditability, and scalability are paramount.
5.4. Predictive Maintenance Models
This section describes the process of preparing input data, conducting exploratory analysis and preprocessing, and presenting the criteria used to select the most relevant attributes. It also outlines the development and comparative analysis of two predictive modelling approaches:
- Classification—for predicting the likelihood of engine failure within the next n operating cycles
- Regression—in the form of Remaining Useful Life (RUL) prediction, which estimates the engine’s remaining lifespan []
5.4.1. Data Analysis, Pre-Processing and Visualisation
Conducting exploratory data analysis is a critical prerequisite for effective data pre-processing. It involves describing the dataset’s characteristics, including variable types and completeness, and implementing a strategy for data cleansing to extract relevant information []. Numerous methods exist for this process; some of the most important include:
- Checking for missing values: The presence of missing values can significantly degrade prediction outcomes by skewing mean values across specific data ranges. To address this, three commonly employed strategies are applied: (1) excluding records containing illegal or missing values, (2) removing columns with predominantly empty entries, or (3) imputing missing fields using the mean value of the corresponding variable type. For example, analysis of the third operational setting column reveals no observable variation in its values. However, the underlying reason for this remains unspecified—whether the setting consistently returns a constant reading or the data is unrecorded. Given the lack of variability, this column may be excluded at this stage
- Checking for duplicate records: Data sources such as repositories, databases, or simulators may lack constraints to prevent duplication, resulting in repeated records. This can negatively impact predictive models by increasing the frequency of identical cases and values across columns. Typically, such duplicates are removed during ETL processes or via data integration tools. In this context, however, each record corresponds to a distinct change in sensor state, making removal unnecessary
- Label encoding: To manage complex variations within a single column effectively, it is recommended to transform categorical data into numerical values through label encoding. This technique is particularly useful in scenarios requiring predictive maintenance models for devices or machines, as it enhances clarity and facilitates intuitive result interpretation. For instance, encoding the remaining useful life (RUL) column—which may contain three states: normal/predicted operation, standby/maintenance/failure expectation, and inevitable failure—is beneficial for tracking outcomes systematically, and
- Checking for outliers: Outliers represent rare data points that can disrupt model performance by deviating from expected ranges or containing undefined values. This issue may be mitigated by adjusting dataset quantiles—reducing upper-bound values while increasing lower-bound values. In some contexts, such as sensor readings constrained to non-negative real numbers dependent on operational time, outliers must be evaluated for potential causes such as failures, overloads, or measurement errors. However, no such cases were observed in the simulation data employed here
After examining and visualising the available training and testing data, several consistent characteristics were identified across all conducted simulations:
- The expected average lifespan of an aircraft engine ranges from 190 to 210 operating cycles. Only a small subset of engines remains failure-free beyond 300 cycles, whereas less reliable units exhibit operational durations of approximately 130 cycles, as illustrated in Figure 3
Figure 3.
Average engine lifespan.
- The third operational setting remains consistently unchanged across all engines during their operational lifespan
- Sensor measurements (1, 5, 10, 16, 18, and 19), which pertain to the fan module, Engine Pressure Ratio (EPR), and fuel-to-air ratio in the combustor, exhibit minimal variability for the observed engine population and are thus deemed insignificant for model creation. In contrast, Sensor 6 (Bypass-duct Pressure) records either static values or stochastic fluctuations between two states, with no discernible causal pattern in the latter case
- Observing the correlation matrix reveals high interdependence between sensors 9 and 14, which provide data on physical and corrected core speed, as shown in Figure 4
Figure 4.
(a) The sensor correlation matrix along with (b) the scatter plot of the values from the ninth and fourteenth sensors.
- The most pronounced deviations in sensor readings occur during the final 50 operating cycles, characterised by abrupt fluctuations, as illustrated in Figure 5
Figure 5.
Sensor measurements from individual engines during the final fifty operating cycles prior to failure. Each coloured line represents one engine’s temporal profile.
Within the data pre-processing script, the following steps are implemented:
- The RUL is designated as the target variable, with engine number and cycle count used to index records for streamlined selection of specific engines and time intervals
- Columns corresponding to the third operational setting and less influential sensors are removed
- A new column for RUL is created, initialised with values representing the difference between failure time and current record time, and
- Test data and actual RUL outcomes are loaded and indexed using engine number as a unique identifier
Following pre-processing, data have been cleansed, restructured, and standardised to meet requirements for predictive model development. The elimination or imputation of invalid/undefined values enhances algorithmic precision, efficiency, and performance []. While additional transformations such as scaling, cohort analysis, or survival analysis may be necessary depending on the use case, partial implementation of these techniques is already documented in the scientific paper detailing simulation procedures.
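As a minimal sketch of these steps, the RUL target and a three-state health label can be derived as follows; the column names follow the earlier loading sketch, and the 45- and 15-cycle thresholds are illustrative assumptions rather than the study's exact values.

```python
# Minimal sketch: deriving the RUL target and a three-state health label for the training
# data. Column names follow the earlier loading sketch; the 45/15-cycle thresholds and the
# list of dropped columns are illustrative assumptions.
import pandas as pd

LOW_INFO = ["op_setting_3", "s1", "s5", "s10", "s16", "s18", "s19"]  # see Section 5.4.1

def prepare_training_frame(train: pd.DataFrame,
                           warn_at: int = 45, fail_at: int = 15) -> pd.DataFrame:
    last_cycle = train.groupby("engine_id")["cycle"].transform("max")  # failure time per engine
    train = train.assign(RUL=last_cycle - train["cycle"])              # cycles until failure
    train["health_state"] = 0                                          # 0 = normal operation
    train.loc[train["RUL"] <= warn_at, "health_state"] = 1             # 1 = monitoring/maintenance
    train.loc[train["RUL"] <= fail_at, "health_state"] = 2             # 2 = failure expected
    return (train.drop(columns=LOW_INFO)                               # remove low-information columns
                 .set_index(["engine_id", "cycle"], drop=False))       # indexed record selection
```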
5.4.2. Feature Selection and Importance
Feature importance serves as a quantitative measure to assess the relevance of input variables and identify those most influential in decision-making processes. It facilitates comprehension of predictive model operations, supports feature management strategies, and enables visualisation of results through various graphical representations, thereby enhancing insights into dataset attributes. The four primary methodologies for evaluating feature importance include:
- Perturbation: introducing noise to a specific attribute
- Missing values: substituting attribute values with zeros
- Permutation: reassigning attribute values using permissible combinations, and
- Application of deep learning algorithms that generate SHAP (SHapley Additive exPlanations) values []
Irrespective of the method employed, impact is quantified through mean squared error comparisons between modified and unmodified data, as a greater effect on predictive outcomes typically indicates higher attribute significance. Figure 6, which presents the SHAP values, is included as it provides the most reliable assessment of feature importance.
Figure 6.
Overview of the results of feature importance measurement using SHapley Additive exPlanations (SHAP) values.
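As a minimal sketch (not the study's exact procedure) of how SHAP values such as those in Figure 6 can be produced, the model-agnostic KernelExplainer can be applied to any fitted predictor; X_train, y_train, and X_eval denote assumed feature frames of the retained sensor columns and the corresponding RUL targets, with a random forest standing in for whichever model is being explained.

```python
# Minimal sketch: estimating SHAP values with the model-agnostic KernelExplainer.
# X_train / X_eval are assumed DataFrames of the retained sensor features and y_train
# the RUL targets; the RandomForest is a stand-in for the model under inspection.
import shap
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
explainer = shap.KernelExplainer(model.predict, shap.sample(X_train, 50))  # small background sample
shap_values = explainer.shap_values(X_eval.iloc[:200])   # per-feature attribution values
shap.summary_plot(shap_values, X_eval.iloc[:200])        # beeswarm plot akin to Figure 6
```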
Based on these findings, it can be concluded that sensor data from sources 4, 7, 11, 12, and 21—used to measure the following parameters:
- Output (standard and static) high-pressure compressor pressure
- Low-pressure turbine output temperature
- Fuel flow-to-power unit ratio, and
- Low-pressure turbine cold air flow
constitute the most significant features for training the selected model, as illustrated in Figure 7. It is crucial to note that models with sparse connectivity (few connections) should be avoided for such measurements, as this may compromise accuracy. In practice, if a notable increase in the importance of an attribute previously classified as constant or highly stochastic is observed, attention must be directed toward model architecture, and a densely connected network should be implemented [].
Figure 7.
(a) Changes in the importance of operational settings and (b) measurements from the five most significant sensors.
5.5. Flight Delay Prediction Models
Based on the input data provided in CSV format from the aforementioned Kaggle repository, the task involves constructing a predictive model for forecasting flight delays using linear and polynomial regression. Although the data analysis phase has been partially outlined in the preceding section, this discussion will focus on the characteristics of the input dataset, implications of data modifications, and their visualisation.
5.5.1. Data Pre-Processing
Data preprocessing extends beyond merely removing or correcting records with missing or erroneous values; it also entails adjusting the input data to facilitate manipulation within the selected development environment. Several modifications may be implemented:
- Columns representing flight times (four for each segment) can be converted into “datetime” objects to streamline date and time handling in Python
- The scheduled departure timestamp, currently presented as a four-digit number (first two digits denoting hours, last two minutes), should also be transformed into a “datetime” object, assuming departures and arrivals occur on the same day (see the sketch following this list)
- Relevant data points include flight dates, airline identifiers, airport codes, actual and scheduled arrival/departure times with delay information, and travel distances. Additional columns may be incorporated; however, for simplicity, they are excluded in this model development phase
- Finally, it is essential to verify column completeness, remove records with missing values, or replace them with mean values as appropriate.
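A minimal sketch of the timestamp conversion described above is given below, operating on the flights frame from the earlier loading sketch; a value of 2400 is folded back to 00:00 in line with the same-day assumption.

```python
# Minimal sketch: converting the four-digit scheduled departure (HHMM) together with the
# date columns into a single pandas datetime; 2400 is treated as 00:00 of the same day.
import pandas as pd

def scheduled_departure_to_datetime(df: pd.DataFrame) -> pd.Series:
    hhmm = df["SCHEDULED_DEPARTURE"].astype(int)
    return pd.to_datetime({
        "year": df["YEAR"], "month": df["MONTH"], "day": df["DAY"],
        "hour": (hhmm // 100) % 24, "minute": hhmm % 100,
    })

# flights["SCHEDULED_DEPARTURE_DT"] = scheduled_departure_to_datetime(flights)
# flights = flights.dropna(subset=["DEPARTURE_DELAY", "ARRIVAL_DELAY"])  # drop incomplete records
```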
5.5.2. Exploratory Analysis and Visualisation
Upon examining the cleaned data, the following observations can be made:
- The number of cancelled flights equals the number of cancellations with a specified reason, indicating that a cancellation reason is recorded for every cancelled flight. All flights delayed by fifteen minutes or more have a stated reason for the delay, such as air system issues, airline operations, or weather conditions. Airlines such as Southwest, Delta, American, and Skywest account for the majority of flight records and generally experience relatively low cancellation rates, while American Eagle stands out with the highest cancellation rate at approximately 5%
- Aircraft-related delays account for one-third of all recorded delays, with half of these attributed specifically to Southwest Airlines. Delays categorised as airline-related constitute between 25% and 33% of total delays, with 58% of such cases associated with Hawaiian Airlines. The remainder of all recorded delays is caused by the air traffic system, with Spirit Airlines exhibiting the highest frequency of these delays and Hawaiian Airlines accounting for approximately 2%
- Weather-related delays are exclusively tied to the time and location of specific flights rather than to the airline itself, due to significant variability in their occurrence. The average time between landing at the destination and reaching the airport gate is typically less than ten minutes, whereas the time between departing the gate and take-off exceeds ten minutes for all airports. Southwest is among the airlines with the shortest such intervals
- Average flight speeds range between 400 and 450 miles per hour across airlines. United Airlines has the fastest average speed, while Hawaiian Airlines exhibits the slowest performance, characterised by high variability in flight speeds
- Total departure delays exceed arrival delays for all airlines except Hawaiian Airlines. Spirit Airlines and Frontier Airlines record the highest average departure delays, whereas Alaska Airlines is the only airline that arrives at its destinations earlier on average.
5.5.3. Comparison of Airline Data
The most critical information when analysing airlines involves the examination of flight volumes and delays during departures and arrivals. This data is presented graphically in Figure 8.
Figure 8.
Relationship between flight and delay data of airlines.
The graph in the upper left corner of Figure 8 shows that major carriers such as Southwest Airlines operate significantly more flights than the five smallest carriers combined. For pre-take-off delays, a mean of approximately 11 ± 7 min is observed once Hawaiian Airlines and Alaska Airlines are excluded, in line with the calculated average value. At the bottom of the figure, the distribution of all departure delays for January reveals notable dispersion: although the estimated mean delay of 11 min is accurate, it mainly reflects a large proportion of on-time departures, with occasional extreme delays of several hours skewing the mean. More precise statistics for shorter delays depend on the specific airline under consideration; typically, the number of delays under 5 min is roughly 3–4 times lower than the number of delays lasting up to 45 min.
A normalised distribution of flight delays can be modelled with the exponential function f(t) = a·e^(−a·t), where the parameter a is inversely proportional to the number of delays. Airlines with the highest values of a, namely Hawaiian Airlines and Delta Airlines, show the lowest incidence of delays, whereas Skywest Airlines exhibits the greatest frequency of delays, as shown in Figure 9.
Figure 9.
Comparison of flight delay data using the exponential distribution.
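As an illustration only, the exponential model above could be fitted to a normalised histogram of one airline's delays with SciPy; the delay data below is a synthetic placeholder, not the study's dataset, and the binning choices are assumptions.

```python
# Sketch of fitting f(t) = a * exp(-a * t) to a normalised delay histogram.
import numpy as np
from scipy.optimize import curve_fit

def exp_model(t, a):
    return a * np.exp(-a * t)

# Placeholder delay durations in minutes (synthetic, for illustration only).
delays = np.random.default_rng(0).exponential(scale=12.0, size=5000)

density, edges = np.histogram(delays, bins=60, range=(0, 120), density=True)
centres = 0.5 * (edges[:-1] + edges[1:])

(a_hat,), _ = curve_fit(exp_model, centres, density, p0=[0.1])
print(f"estimated a = {a_hat:.3f}")  # a larger value of a implies fewer long delays
```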
5.5.4. Analysis of Data Correlation
If both geographic location and temporal factors are significant determinants of flight delays, it is essential to incorporate data on the number of airports visited by an airline to capture these influences accurately. The relationship between departure location and flight delay constitutes another critical factor requiring consideration in flight analysis. To visualise this relationship effectively, the number of airports has been limited due to constraints on single-plot representation. Despite this limitation, key data characteristics remain clearly discernible. For instance, “American Eagle Airlines” consistently experiences prolonged delays across all departure locations, whereas “Delta Airlines” typically records shorter delays of up to 5 min. Additionally, specific airports with recurrent delayed departures—such as Denver and New York—are identifiable. Overall, flight delay variability is substantial, underscoring the necessity of specifying both airline and airport when seeking reliable results, as visualised in Figure 10.
Figure 10.
Representation of the influence of the origin airport on flight delays by airline companies.
The temporal distribution of flight delays also reveals discernible cycles in frequency and intensity. This pattern aligns with expectations, as delays in US air traffic tend to accumulate during busy daytime hours and dissipate overnight, when reduced airport traffic allows schedules to recover; this further strengthens the correlation between departure time and delay occurrence. Consequently, the scheduled departure time is a highly significant variable, with the number of delays increasing monotonically over the course of the day, as illustrated in Figure 11.
Figure 11.
Correlation between departure time and flight delays.
5.5.5. Implementation of a Predictive Model and Cross-Validation
Due to the necessity of integrating information about both the airline and airport, a predictive model for departure times can be developed across three flight tracking scenarios for the selected airline:
- Within the selected airport
- Independent of the airport, and
- For data grouped by departure and arrival information
The first model incorporates features such as the airline carrier, airport ID, departure time (converted to minutes), mean flight delay, polynomial terms of the departure time up to the third degree, and scheduled departure times for predictions. However, this model is unsuitable for deployment in a production environment, though it may serve as a preliminary test to verify algorithmic performance and to generate training and testing datasets from a smaller sample. Insufficient statistical data presents another challenge: if the selected airline has a low traffic volume, there may be inadequate data for regression analysis. Additionally, extreme delays within isolated samples could introduce biased results, where unusually high delay values disproportionately influence the model’s output, as illustrated in Figure 12.
Figure 12.
The influence of extreme values on flight delays (the blue line connects the mean delay values over time, while the removal of delays longer than one hour results in new values represented by the green line).
The second model addresses limitations of the first by accounting for potential relationships and extrapolations across different airports through a single unified fit. This approach enables more accurate delay predictions even when individual airports have limited data. However, the airport identifiers must be encoded to be usable by regression algorithms; the most straightforward method is One Hot Encoding, which creates a binary indicator column for each unique origin airport instead of imposing an arbitrary numerical ordering. While linear and polynomial regression can be applied, model validation through parameter fine-tuning is essential to prevent overfitting [].
Observation of the scatter plot reveals that extreme delay values, likely caused by exceptional circumstances such as equipment malfunction or adverse weather, are rare and unrepresentative of typical operations, yet they disproportionately influence the polynomial curve and inflate prediction errors. Selecting an appropriate polynomial degree for regression therefore requires care, and cross-validation is recommended to minimise error. The most widely used technique, K-fold cross-validation, partitions the dataset into k subsets and iteratively evaluates model performance by holding out one subset for testing while training on the remaining k−1 subsets, as summarised in Table 2 and illustrated in Figure 13. The regression lines for linear and polynomial regression are presented in Figure 14.
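A minimal sketch of how the One Hot Encoding and the K-fold selection of the polynomial degree described above might be combined in scikit-learn is given below; the feature frame, target values, column names, and degree range are illustrative assumptions, not the authors' exact implementation.

```python
# Sketch: one-hot encode the origin airport, add polynomial terms of the
# scheduled departure time, and pick the degree by K-fold cross-validation.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures

# Placeholder data: origin airport, departure time in minutes, delay target.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "ORIGIN_AIRPORT": rng.choice(["ATL", "ORD", "DEN", "JFK"], size=500),
    "DEP_MINUTES": rng.integers(0, 1440, size=500),
})
y = rng.exponential(scale=11.0, size=500)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
for degree in range(1, 6):
    pipe = Pipeline([
        ("features", ColumnTransformer([
            ("airport", OneHotEncoder(handle_unknown="ignore"), ["ORIGIN_AIRPORT"]),
            ("time", PolynomialFeatures(degree=degree, include_bias=False),
             ["DEP_MINUTES"]),
        ])),
        ("regression", LinearRegression()),
    ])
    scores = cross_val_score(pipe, X, y, cv=cv, scoring="neg_mean_squared_error")
    print(f"degree {degree}: mean MSE = {-scores.mean():.2f}")
```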
Table 2.
The results of K-fold validation.
Figure 13.
The estimated regression polynomial curve.
Figure 14.
Regression lines for (a) linear and (b) polynomial regression. The green line shows the fitted model, the red dashed lines indicate the 95% prediction intervals, and the blue shading represents the 95% confidence interval for the mean.
The third model follows a methodology identical to that of the preceding models, with the sole distinction being the inclusion of an additional feature: planned arrival time at the destination airport. Ridge regression is employed due to the high correlation within the expanded dataset, and optimal parameters are determined using K-fold cross-validation. The results of this process are summarised in Table 3, while the corresponding regression line is shown in Figure 15.
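As a sketch of how the ridge regularisation strength could be tuned by K-fold cross-validation in the third model, the snippet below uses scikit-learn's grid search; the placeholder feature matrix (assumed to already contain the encoded airport, departure time, and planned arrival time) and the grid of values are assumptions.

```python
# Sketch: select the ridge regularisation strength via K-fold cross-validation.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold

rng = np.random.default_rng(0)
X_features = rng.normal(size=(500, 10))           # placeholder encoded features
y_delay = rng.exponential(scale=11.0, size=500)   # placeholder delays in minutes

search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": np.logspace(-3, 3, 13)},
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="neg_mean_squared_error",
)
search.fit(X_features, y_delay)
print(search.best_params_, "mean MSE:", -search.best_score_)
```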
Table 3.
Results of K-fold cross-validation for the parameters of ridge regression. Mean squared errors (MSE) are reported for different values of the ridge regularisation parameter. Bold red highlights the lowest MSE across all folds.
Figure 15.
Regression line for ridge regression. The green line shows the fitted model, the red dashed lines indicate the 95% prediction intervals, and the blue shading represents the 95% confidence interval for the mean.
6. Results
Before presenting the experimental results, it is essential to contextualise the representativeness of the datasets used and the real-world applicability of the proposed methods. The NASA C-MAPSS dataset, while widely adopted as a benchmark for engine prognostics, is a high-fidelity simulation that models gradual, monotonic degradation under controlled operational profiles. It does not capture sudden faults (e.g., bird strikes), maintenance-induced anomalies, or multi-component cascading failures common in real fleets. Consequently, models trained on C-MAPSS are best suited for algorithm prototyping and relative comparison rather than direct deployment on live aircraft. Similarly, the Kaggle flight dataset—comprising U.S. domestic flights from 2015—is real but temporally and geographically bounded. It lacks granular contextual variables such as real-time weather radar feeds, air traffic control directives, or crew duty logs, which are known to dominate delay causality. Moreover, 2015 predates the pandemic-induced operational shifts and recent staffing shortages that have reshaped delay dynamics. Despite these limitations, the proposed methodology remains highly applicable in specific scenarios: the dual-task engine health framework can be integrated into ground-based monitoring centres for condition-based maintenance, while the context-aware delay pipeline is well-suited for tactical scheduling adjustments at major hubs where historical patterns are stable. The open, modular architecture further facilitates incremental integration into existing BI systems without requiring full platform replacement.
6.1. Binary and Multiclass Classification
The core function of a predictive model is classification, which systematically organises data samples according to their features and the interplay of multiple criteria. Binary and multiclass classification are two prevalent forms, distinguished by the number of possible output classes: binary classification applies to scenarios with two mutually exclusive outcomes describing whether an event occurs, whereas multiclass classification encompasses three or more categories. In this context, the two possible outcomes may pertain to fault detection, while the remaining useful life (RUL) value can be categorised into three states (a labelling sketch follows the list):
- The engine operates normally (RUL > 50),
- The engine requires close monitoring (25 ≤ RUL ≤ 50), and
- Engine failure is anticipated (RUL < 25).
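A minimal sketch of deriving these class labels from an RUL column is shown below, assuming a hypothetical pandas DataFrame named engines; the binary fault threshold is an illustrative choice rather than the authors' exact setting.

```python
# Sketch: derive multiclass health states and a binary fault label from RUL.
import pandas as pd

engines = pd.DataFrame({"RUL": [112, 48, 12]})   # hypothetical example values

def health_state(rul: float) -> str:
    if rul > 50:
        return "normal"              # engine operates normally
    if rul >= 25:
        return "monitor"             # engine requires close monitoring
    return "failure_expected"        # engine failure is anticipated

engines["STATE"] = engines["RUL"].apply(health_state)

# Binary label for fault detection; the cut-off of 25 cycles is illustrative.
engines["FAULT"] = (engines["RUL"] < 25).astype(int)
print(engines)
```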
With data that has been pre-processed, cleaned, and normalised, a broader range of classifiers becomes applicable. However, when handling large-scale datasets characterised by numerous features and hyperparameters, alongside the potential integration of Big Data technologies, combining convolutional neural networks (CNNs) with high-performance supervised learning algorithms may achieve a balance between the computational efficiency of CNNs and the accuracy of the selected classification method. The subsequent results from a binary classification scenario are summarised in Table 4 and include an example confusion matrix and ROC curve, as illustrated in Figure 16.
Table 4.
Results for the used binary classification models.
Figure 16.
Confusion matrix and receiver operating characteristic (ROC) curve of recurrent neural network (RNN) classification. The blue dashed line represents the performance of a random classifier (AUC = 0.5); any model performing above this line has predictive power greater than chance.
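For illustration, a compact 1D CNN binary classifier of the kind discussed above, together with the confusion matrix and ROC evaluation, might be sketched in Keras as follows; the window length, sensor count, layer sizes, and random placeholder data are assumptions rather than the authors' exact configuration.

```python
# Sketch: 1D CNN over fixed-length sensor windows with ROC evaluation.
import numpy as np
from sklearn.metrics import auc, confusion_matrix, roc_curve
from tensorflow import keras
from tensorflow.keras import layers

window, n_sensors = 30, 14          # assumed window length and sensor count
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, window, n_sensors))   # placeholder windows
y_train = rng.integers(0, 2, size=1000)                # placeholder labels
X_test = rng.normal(size=(200, window, n_sensors))
y_test = rng.integers(0, 2, size=200)

model = keras.Sequential([
    layers.Input(shape=(window, n_sensors)),
    layers.Conv1D(32, kernel_size=5, activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    layers.Conv1D(64, kernel_size=3, activation="relu"),
    layers.GlobalAveragePooling1D(),
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X_train, y_train, epochs=5, batch_size=64, validation_split=0.1)

probs = model.predict(X_test).ravel()
print(confusion_matrix(y_test, probs > 0.5))
fpr, tpr, _ = roc_curve(y_test, probs)
print("AUC =", auc(fpr, tpr))
```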
For multiclass classification tasks, standard convolutional and recurrent neural networks demonstrate satisfactory performance, achieving anticipated precision rates of approximately 90%. The results derived from the developed models are summarised in Table 5, while Figure 17 presents the boxplot of LSTM binary classification results across ten runs.
Table 5.
Results for the used multiclass classification models.
Figure 17.
Boxplot of long short-term memory (LSTM) binary classification results in ten runs. The green line represents the median, the boxes indicate the interquartile range (Q1–Q3), and the whiskers show the data range excluding outliers.
6.2. Predicting the Remaining Useful Life of Aircraft
The prediction of an aircraft’s remaining useful life (RUL) can be accomplished through diverse methodologies. While some approaches rely on descriptive statistical techniques and the development of statistical models, others, particularly within the domain of business intelligence, depend on machine learning algorithms and artificial intelligence technologies. The primary objective of each method is to identify patterns or correlations between various factors in historical data, enabling the estimation of the duration for which a device or machine can operate without requiring additional maintenance []. Given that aircraft availability is a critical factor for task execution, as well as for the economic and technological advancement of airlines, omitting such analysis from business processes may result in substantial financial and social losses. Proactive implementation of RUL prediction facilitates increased productivity and reduces the likelihood of unexpected delays by enabling scheduled maintenance interventions.
Several models are employed for RUL prediction, including:
- Exponential degradation models—useful for examining the exponential relationship between component age and reliability
- Similarity-based predictive models—identify similarities in states preceding failure, thereby aiding in the detection of short-term operational patterns that contribute to deterioration []
- LSTM networks (implemented using the architecture illustrated in Figure 18; a minimal sketch follows the figure)
Figure 18.
Example architecture of an LSTM model.
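The following is a minimal Keras sketch of an LSTM model in the spirit of Figure 18, here configured for RUL regression on windowed sensor sequences; the layer sizes and window dimensions are assumptions, not the architecture used in the study.

```python
# Sketch: stacked LSTM for RUL regression on windowed sensor sequences.
from tensorflow import keras
from tensorflow.keras import layers

window, n_sensors = 30, 14          # assumed window length and sensor count

lstm_model = keras.Sequential([
    layers.Input(shape=(window, n_sensors)),
    layers.LSTM(64, return_sequences=True),
    layers.Dropout(0.2),
    layers.LSTM(32),
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="linear"),   # predicted RUL in cycles
])
lstm_model.compile(optimizer="adam", loss="mse", metrics=["mae"])
lstm_model.summary()
```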
The exponential degradation model can be expressed as h(t) = φ + θ·e^(βt), where h(t) denotes the health indicator as a function of time t, φ represents the intercept term, considered constant, and θ and β are random parameters defining the model’s slope. Specifically, θ follows a lognormal distribution, while β is Gaussian-distributed.
The selection of a specific model for RUL prediction dictates the attribute selection method employed. The exponential degradation model uses monotonicity to ensure a consistent degradation trend over time, whereas the similarity-based model employs trendability—the correlation between changes in two attributes—for attribute selection, as visualised in Figure 19 [].
Figure 19.
The (a) monotonicity and (b) trendability of available attributes.
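The two selection measures can be sketched using common working definitions: monotonicity as the average imbalance between increasing and decreasing steps per engine run, and trendability as the absolute correlation between the changes of two attributes. The exact formulas used in the study may differ, and runs below is a hypothetical list of per-engine DataFrames of sensor readings over time.

```python
# Sketch of monotonicity and trendability measures under assumed definitions.
import numpy as np

def monotonicity(runs, column):
    """Average |#increases - #decreases| / (n - 1) across run-to-failure runs."""
    scores = []
    for run in runs:
        diffs = np.diff(run[column].to_numpy())
        if len(diffs):
            scores.append(abs((diffs > 0).sum() - (diffs < 0).sum()) / len(diffs))
    return float(np.mean(scores))

def trendability(run, column_a, column_b):
    """Absolute correlation between the step-wise changes of two attributes."""
    da = np.diff(run[column_a].to_numpy())
    db = np.diff(run[column_b].to_numpy())
    return float(abs(np.corrcoef(da, db)[0, 1]))
```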
Subsequently, it is necessary to construct a health indicator as a quantitative measure of system state, derived from the aggregation of the most significant factors (typically through linear degradation) and used within the formula for the selected model. Most machine learning libraries allow principal component analysis to identify the largest variance components in a dataset, with one such component then serving as the health indicator, as presented in Figure 20.
Figure 20.
(a) The distribution of engines based on their health indicator, along with (b) the exponential estimation of the health indicator for the fifth engine.
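A minimal sketch of deriving a health indicator from the most significant sensors via principal component analysis is given below; the training DataFrame, its sensor column names, and the placeholder values are assumptions for illustration.

```python
# Sketch: use the first principal component of the key sensors as a health indicator.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

sensor_cols = ["s4", "s7", "s11", "s12", "s21"]
rng = np.random.default_rng(0)
train_df = pd.DataFrame(rng.normal(size=(200, 5)), columns=sensor_cols)  # placeholder

scaled = StandardScaler().fit_transform(train_df[sensor_cols])
pca = PCA(n_components=1)
train_df["HEALTH_INDICATOR"] = pca.fit_transform(scaled).ravel()
print("explained variance ratio:", pca.explained_variance_ratio_[0])
```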
The results of the employed predictive models are presented in Table 6.
Table 6.
Results of applied RUL prediction models.
Notably, RUL predictions based on similarity yield a mean absolute error of 16.81%. Given that these models were developed using data from fifty engines, this result may be considered satisfactory. In contrast, the LSTM model produces a mean absolute error exceeding 100%, indicating that although it may outperform the exponential degradation model in RUL estimation, its errors can significantly deviate from actual values. This discrepancy poses risks when a shorter remaining useful life is misclassified as safe. As illustrated in Figure 21a, the predictive model results follow a linear regression trend, yet the deviations shown in Figure 21b highlight substantial inconsistencies in accuracy. While extending these models could improve predictions, unpredictable failures or those arising from lower-quality components would negatively impact overall reliability. It is also essential to account for the diverse operational and maintenance conditions under which not only engines but also other aircraft components are managed, as well as their monitoring practices. Expert involvement will be required for this purpose.
Figure 21.
(a) Scatter plot of predicted versus true RUL with a fitted linear regression line (blue) and its 95% confidence interval and (b) a line graph for deviations.
In this section, an approach for developing a predictive model to estimate the remaining useful life (RUL) of aircraft engines and their key characteristics has been presented. The input data undergoes preprocessing, significant features are extracted from the dataset, and various models are employed for binary and multiclass classification tasks. In binary classification, the achieved accuracy of 95% requires further validation on more extensive historical datasets and special cases. Meanwhile, one-dimensional convolutional neural networks (CNNs) yield satisfactory results for multiclass classification, with general error levels of approximately 7%, or around 2% PAAF (predicted after actual failure) []. Similar outcomes are observed in RUL prediction; however, numerous factors may further influence the final results and necessitate rigorous monitoring. Under such conditions, applying the models remains feasible, but planning preventive or scheduled maintenance should not rely directly on the predicted values alone.
In practical settings, the implementation of these findings could be achieved through online remote monitoring systems integrated with continuous learning mechanisms, leveraging Big Data cloud services for data collection. However, this approach presents certain challenges, including potential disturbances or noise during data transmission and difficulties in maintaining stable connectivity. These risks may be mitigated by employing continuous model parameter tuning, implementing noise reduction techniques, or incorporating additional parameters that influence such occurrences. Furthermore, ensuring a high level of model accuracy requires validation against actual system performance, sensor reliability, and external maintenance considerations, including spare parts availability, delivery lead times, and aircraft unavailability affecting flight operations and passenger scheduling.
6.3. Comparison of Created Flight Delay Models’ Accuracy
Table 7 presents the results of applying the regression algorithms in terms of mean squared error, R2 score, and average delay. The majority of the algorithms satisfy the initial assumption of an average delay of 10 min, with overfitting being more prevalent in polynomial regression; in the second testing case, for example, the algorithm performs accurately only on the training data, whereas predictions for delays at the end of January show greater discrepancies. It is also worth noting that alternative regularised algorithms, such as Lasso (L1) and elastic net regression, may be better suited for such predictive modelling tasks.
Table 7.
Accuracy of presented algorithms in the training and testing phase.
As a result, predictive modelling can serve as a valuable tool for forecasting flight delays and mitigating their impact on airlines and the broader aviation industry. By analysing historical data—including the temporal distribution of flight operations alongside factors contributing to delays or cancellations—the presented models demonstrate acceptable performance for the targeted dataset in most cases. Integrating such models with business intelligence or reporting tools enables consistent, proactive measures to enhance service availability and customer experience. However, the operation of these systems requires continuous monitoring and effective management of unforeseen events that may disrupt even the most meticulously developed models. This can be achieved through flexible responses to delays, improved protocols for emergency situations, and enhanced communication strategies with customers.
7. Concluding Remarks and Future Work
Predictive modelling has become ubiquitous and an essential component of modern business operations, enabling organisations to leverage historical or real-time data sources in order to forecast future events and behaviours with the aim of generating market value, enhancing business performance, strengthening market position, and achieving long-term success. This trend is particularly pronounced in the airline industry, where the inherent variability of operational processes, external factors, and environmental conditions significantly influences decision-making for optimising operational efficiency, passenger experience, and safety while minimising costs associated with large volumes of heterogeneous data. The dynamic nature of this sector aligns closely with evolving market conditions; however, a failure to adapt suitably and promptly can result in severe financial repercussions, social consequences, or even fatal outcomes.
This paper outlines the process of developing and evaluating predictive models designed for remaining useful life (RUL) prediction and flight delay forecasting through the application of diverse classification algorithms. The outcomes achieved from these predictive maintenance models demonstrate broader applicability beyond the aviation sector, with potential utility in industries such as automotive manufacturing or equipment monitoring systems. Through systematic research into optimal data handling strategies and the fine-tuning of input parameters, enhanced model performance and robustness can be attained. A foundational prerequisite for this is a thorough understanding of the collected dataset, necessitating rigorous pre-processing and exploratory analysis utilising visualisation techniques and attribute selection methodologies as described.
Also, while this study demonstrates the potential of machine learning–based predictive analytics for aviation maintenance and operations, several methodological and practical limitations must be acknowledged. First, the reliance on simulated and historical datasets constrains the external validity of our findings. The C-MAPSS dataset, though widely used as a benchmark, models degradation under idealised, physics-informed assumptions that may not fully capture real-world failure modes such as sudden sensor faults, cascading component failures, or maintenance-induced anomalies. Similarly, the Kaggle flight delay dataset, while extensive, lacks granular contextual variables—such as real-time weather conditions, air traffic control directives, or crew availability—that are known to significantly influence delay dynamics. Consequently, model performance reported here may not translate directly to live operational environments without recalibration or online learning mechanisms.
Second, the generalisability of the models across fleets and regions is limited. The C-MAPSS simulations represent a single turbofan engine type under fixed operational profiles, and the flight data are restricted to U.S. domestic carriers in a specific year. Predictive models trained on such narrow domains may fail when applied to different aircraft families, international routes, or post-pandemic scheduling regimes characterised by altered demand patterns and staffing shortages. This underscores the risk of overfitting to domain-specific artefacts rather than learning transferable degradation or delay signatures. Third, the computational and interpretability trade-offs inherent in our modelling choices present practical deployment challenges. While 1D CNNs and LSTMs achieve high classification accuracy, their black-box nature complicates certification in safety-critical aviation contexts, despite our use of SHAP for post hoc explanation. Moreover, the computational overhead of deep learning models may be prohibitive for edge deployment on aircraft or at small airports with limited IT infrastructure. Conversely, the interpretable regression models used for delay prediction lack the capacity to capture complex nonlinear interactions (e.g., weather × airport congestion × airline policy), limiting their predictive power in volatile scenarios.
Finally, the evaluation framework focuses predominantly on statistical metrics (e.g., MSE, F1-score) without incorporating operational cost functions or decision-theoretic criteria. For instance, a 10-min delay prediction error may be tolerable for scheduling but catastrophic for crew duty-time compliance. Similarly, RUL overestimation poses greater safety risks than underestimation, yet standard MAE treats both symmetrically. Future work should integrate asymmetric loss functions and domain-specific utility metrics to align model evaluation with real-world decision consequences. These limitations do not invalidate the study’s contributions but highlight the need for cautious, context-aware deployment and continuous model validation in production settings.
Simultaneously, it is imperative to address the ethical implications associated with AI technologies in critical environments throughout all stages—data collection, processing, training, and testing—to mitigate the risk of model bias. Continuous monitoring and maintenance of deployed algorithms require up-to-date input data, reliable communication channels with relevant sources, and a cautious approach incorporating validation protocols, transparent reporting practices, and deliberate consideration of future developments. The capabilities of these systems can be further enhanced through interdisciplinary integration, particularly by combining data from domains such as risk management, marketing, and customer support. For instance, synthesising information on meteorological variations, traffic patterns, and passenger behaviour may yield a holistic understanding of their interrelated impacts on airline operations. Concurrently, the refinement of promotional strategies, service offerings, and customer engagement must be guided by empirical insights derived from consumer feedback and competitive analysis.
Proactive mitigation of safety risks and accident prevention through predictive maintenance demands substantial investment in model testing and access to extensive historical datasets. While these requirements can be partially addressed through simulation-based approaches—demonstrated effectively in scenarios involving preventive engine maintenance—such implementations necessitate precise specifications and a gradual, methodical transition from simulated environments to real-world hardware or data infrastructure systems.
Author Contributions
Conceptualisation, E.M., E.K., and D.H.; methodology, E.M., E.K., D.H., and N.Ž.; software, E.M., E.K., and D.H.; validation, E.M., E.K., and D.H.; formal analysis, E.M., E.K., N.Ž., D.H., and N.H.; investigation, E.M., E.K., and N.H.; resources, E.M., E.K., N.Ž., D.H., and N.H.; data curation, E.M., E.K., N.Ž., D.H., and N.H.; writing—original draft preparation, E.M., E.K., and D.H.; writing—review and editing, E.M., E.K., N.Ž., D.H., and N.H.; visualisation, E.M., E.K., N.Ž., D.H., and N.H.; supervision, E.M., E.K., N.Ž., D.H., and N.H.; project administration, E.M., E.K., and N.H.; funding acquisition, E.K., and N.Ž. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Data Availability Statement
No new data were created or analysed in this study.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Pérez-Campuzano, D.; Morcillo Ortega, P.; Rubio Andrada, L.; López-Lázaro, A. Artificial Intelligence Potential within Airlines: A Review on How AI Can Enhance Strategic Decision-Making in Times of COVID-19. J. Airl. Airpt. Manag. 2021, 11, 2. [Google Scholar] [CrossRef]
- TechTarget. What Is 3Vs (Volume, Variety and Velocity). Available online: https://www.techtarget.com/whatis/definition/3Vs (accessed on 22 January 2023).
- Abellera, R. AI Meets BI; Auerbach Publications: Boca Raton, FL, USA, 2020. [Google Scholar]
- IBM Big Data & Analytics Hub. The Four V’s of Big Data. Available online: https://opensistemas.com/wp-content/uploads/2020/06/4-Vs-of-big-data-1.jpg (accessed on 22 January 2023).
- Larsen, T. Cross-Platform Aviation Analytics Using Big-Data Methods. In Proceedings of the 2013 Integrated Communications, Navigation and Surveillance Conference (ICNS), Herndon, VA, USA, 23–25 April 2013; pp. 1–9. [Google Scholar] [CrossRef]
- Noviantoro, T.; Huang, J.-P. Investigating Airline Passenger Satisfaction: Data Mining Method. Res. Transp. Bus. Manag. 2022, 43, 100726. [Google Scholar] [CrossRef]
- Shiwakoti, N.; Hu, Q.; Pang, M.K.; Cheung, T.M.; Xu, Z.; Jiang, H. Passengers’ Perceptions and Satisfaction with Digital Technology Adopted by Airlines during COVID-19 Pandemic. Future Transp. 2022, 2, 988–1009. [Google Scholar] [CrossRef]
- Kang, Z.; Catal, C.; Tekinerdogan, B. Remaining Useful Life (RUL) Prediction of Equipment in Production Lines Using Artificial Neural Networks. Sensors 2021, 21, 932. [Google Scholar] [CrossRef] [PubMed]
- Truong, D.; Friend, M.A.; Chen, H. Applications of Business Analytics in Predicting Flight On-Time Performance in a Complex and Dynamic System. Transp. J. 2018, 57, 24–52. [Google Scholar] [CrossRef]
- Sangupamba, O.M.; Prat, N.; Comyn-Wattiau, I. Business Intelligence and Big Data in the Cloud: Opportunities for Design-Science Researchers; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2014; pp. 75–84. [Google Scholar] [CrossRef]
- Saxena, A.; Goebel, K.; Simon, D.; Eklund, N. Damage Propagation Modeling for Aircraft Engine Run-to-Failure Simulation. In Proceedings of the 2008 International Conference on Prognostics and Health Management (PHM), Denver, CO, USA, 6–9 October 2008; IEEE: Piscataway, NJ, USA, 2008. [Google Scholar] [CrossRef]
- Barnhart, C.; Neels, K.; Hansen, M.; Odoni, A. Total Delay Impact Study: A Comprehensive Assessment of the Costs and Impacts of Flight Delay in the United States; Technical Report; Report Number: 01219967; National Center of Excellence for Aviation Operations Research: Berkeley, CA, USA, 2010. Available online: https://rosap.ntl.bts.gov/view/dot/6234 (accessed on 29 October 2025).
- NASA. NASA’s Open Data Portal. CMAPSS Jet Engine Simulated Data. Available online: https://data.nasa.gov/dataset/cmapss-jet-engine-simulated-data (accessed on 29 October 2025).
- Kaggle. 2015 Flight Delays and Cancellations. Available online: https://www.kaggle.com/datasets/usdot/flight-delays (accessed on 23 January 2023).
- Baptista, M.L.; Goebel, K.; Henriques, E.M.P. Relation between Prognostics Predictor Evaluation Metrics and Local Interpretability SHAP Values. Artif. Intell. 2022, 306, 103667. [Google Scholar] [CrossRef]
- Wang, Y.; Zhao, Y.; Addepalli, S. Remaining Useful Life Prediction Using Deep Learning Approaches: A Review. Procedia Manuf. 2020, 49, 81–88. [Google Scholar] [CrossRef]
- Muneer, A.; Taib, S.M.; Naseer, S.; Ali, R.F.; Aziz, I.A. Data-Driven Deep Learning-Based Attention Mechanism for Remaining Useful Life Prediction: Case Study Application to Turbofan Engine Analysis. Electronics 2021, 10, 2453. [Google Scholar] [CrossRef]
- Li, H.; Wang, Z.; Li, Z. An Enhanced CNN-LSTM Remaining Useful Life Prediction Model for Aircraft Engine with Attention Mechanism. PeerJ Comput. Sci. 2022, 8, e1084. [Google Scholar] [CrossRef] [PubMed]
- Taha, H.A.; Sakr, A.H.; Yacout, S. Aircraft Engine Remaining Useful Life Prediction Framework for Industry 4.0. In Proceedings of the International Conference on Industrial Engineering and Operations Management, Toronto, ON, Canada, 23–25 October 2019. [Google Scholar]
- Kasturi, E.; Devi, S.P.; Kiran, S.V.; Manivannan, S. Airline Route Profitability Analysis and Optimization Using Big Data Analytics on Aviation Data Sets under Heuristic Techniques. Procedia Comput. Sci. 2016, 87, 86–92. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).