1. Introduction
Predictive maintenance is a critical pillar of operational efficiency in the airline industry, as unexpected aircraft groundings can lead to substantial revenue losses and operational setbacks. Yet high uncertainty in forecasting component failures hampers effective inventory planning, increasing the risk of delays and compounding financial losses. For mixed fleet airlines managing tens of thousands of components, relying on a single predictive model—especially those rooted in conventional methodologies—falls short of capturing the diverse and complex behavior of all parts. Despite advancements, this challenge remains unresolved, highlighting the need for more sophisticated, data-driven approaches to enhance predictive maintenance strategies.
The comprehensive exploration of the advancements and potential of predictive maintenance as a superior alternative to traditional strategies—especially in optimizing the utilization of the remaining useful life (RUL) of components through data-driven methods and AI technologies—is detailed in [
1]. Yan et al. proposed a predictive maintenance framework leveraging Prognostics and Health Management (PHM) data to optimize maintenance costs and improve reliability for aircraft air-conditioning systems [
2]. However, the framework’s focus on specific subsystems limits its scalability to more complex, interconnected systems. Similarly, ref. [
3] explored the use of digital twin technology for predictive maintenance, providing a foundation for more dynamic and precise maintenance routines. Despite its innovation, however, the approach is data-intensive, requiring extensive high-quality operational datasets to realize its full potential. Moreover, ref. [
4] introduced a scheduling framework for predictive maintenance based on RUL prediction using a hybrid CNN-BiLSTM (Convolutional Neural Network and bidirectional long short-term memory) deep learning model, demonstrating accurate RUL estimations for aircraft engines. Nevertheless, the method’s reliance on high-quality labeled datasets and its lack of consideration for data imbalance and real-time processing constraints present notable challenges for broader adoption. Similarly, summarization strategies have been suggested to improve predictive maintenance workflows by reducing data volume and computational demands. However, these strategies often depend on a high level of preprocessing expertise and have a limited ability to generalize to diverse aviation data types, such as sparsely recorded or inconsistent logs and components [
5].
Long short-term memory (LSTM) networks have been employed extensively in predictive maintenance to improve prediction accuracy by utilizing time series maintenance data. Ref. [
6] presents a hybrid framework where an autoencoder-based anomaly detector flags rare aircraft faults, and a bidirectional gated recurrent unit (GRU) network predicts their subsequent occurrence; the model ingests error codes, flight-deck effect reports, and maintenance logs, and its performance is assessed via accuracy, precision, recall, and confusion-matrix metrics. Similarly, ref. [
7] applies a Random Forest classifier to a real aircraft central maintenance system log-based dataset, achieving above seventy percent precision while correctly forecasting over half of the unscheduled aircraft component replacements. Ref. [
8] designed an effective multi-modal LSTM network to preprocess time series data and to predict anomalies in piston engine aircraft. Ref. [
4] proposed an ensemble model combining CNN and bidirectional LSTM with Bayesian optimization for RUL prediction. Ref. [
9] developed a hybrid CNN-LSTM framework for jointly optimizing production and predictive maintenance scheduling that effectively reduced total costs while improving efficiency. Ref. [
10] couples an LSTM model with swarm-based optimization and a Support Vector Machine (SVM) to improve hazard prediction from aircraft communications data. Despite their strong predictive capacity, LSTM-based architectures generally require large volumes of finely sampled, continuous time series data to capture degradation dynamics and mitigate overfitting. As a result, their performance can drop significantly when applied to sparse, irregularly sampled, or purely event-driven maintenance records.
These limitations become even more pronounced in maintenance datasets that contain only event logs—often hundreds of flight hours apart—where the long stretches of “no-event” data give an LSTM little temporal structure to learn and can cause vanishing-gradient problems or over-regularization. As a result, recent research has shifted toward representation-learning and data-augmentation techniques that can extract robust health indicators from sparse streams and supplement scarce failure cases before any sequence model is applied. Ref. [
11] leveraged a deep autoencoder to derive a latent space health indicator for accurate RUL prediction. Ref. [
12] presents a transfer-learning framework that employs consensus self-organizing models to improve predictions of equipment remaining useful life. Ref. [
13] provides a domain-agnostic survey of data-scarcity remedies and explicitly singles out Generative Adversarial Networks (GANs) as a principal strategy for synthesizing additional training samples, arguing that GAN-based augmentation can mitigate the limitations of small, imbalanced, or non-generalizable datasets. To estimate the RUL of aircraft auxiliary power units, ref. [
14] uses routine auxiliary power unit (APU) monitoring data to train Random Forest models of normal behavior, rates each unit’s health by how far it drifts from the best model, and then applies a Bayesian time series method to predict the RUL. Ref. [
15] proposes a hybrid method that generates cautious point estimates of remaining useful life and translates them into a precise schedule for predictive maintenance. Even though these approaches are applied to predictive maintenance and RUL estimation, they often do not account for the challenges that arise when maintenance event logs are used and there is data imbalance across multiple components.
When maintenance datasets primarily consist of event logs, survival analysis provides a principled framework to address the associated data sparsity and imbalance across components. This approach enables the estimation of failure probabilities and RUL without requiring dense temporal measurements. Survival analysis has gained traction in predictive maintenance for aviation, providing crucial insights into the RUL of aircraft components and enabling proactive maintenance scheduling. Ref. [
16] introduced DeepHit, a deep learning-based survival model that directly learns the distribution of survival times and competing risks, reducing the need for the strong parametric assumptions typical of traditional approaches. Ref. [
17] adopts a survival analysis approach, fitting a Cox proportional hazards model to estimate the survival curve while treating occurrences of other faults or events as explanatory features. Ref. [
18] demonstrates that, for log data that fit a Weibull pattern, event-driven predictive maintenance can be clearly interpreted. Ref. [
3] combined probability-based survival models with digital twin technology to enhance maintenance strategies, enabling crews to forecast support needs, record part changes in real time, and keep each plane’s configuration up to date. Collectively, these studies underscore the utility of survival analysis techniques for aviation predictive maintenance, while also highlighting unresolved challenges such as data sparsity and the need for more adaptable models.
In this study, we propose a latent space classification framework that utilizes a shared encoder backbone trained on the dataset to address the challenge of data imbalance across part numbers (PNs). By learning global feature representation through an autoencoder, the model effectively transfers knowledge from high-data components to those with sparse maintenance histories, enabling reliable classification performance even in low-sample regimes. The study is based on datasets sourced from an airline MRO company operating with a fleet of over 500 aircraft, offering a realistic and scalable setting for fleet-wide maintenance analysis. This approach is compared against DeepHit, a survival analysis model designed for time-to-event prediction, allowing for a systematic evaluation of probabilistic versus discriminative modeling strategies in the context of predictive maintenance. Through extensive experiments across varying prediction horizons, we analyze the sensitivity, robustness, and practical utility of each method for fleet-wide maintenance planning and inventory decision support under sparse and irregular observational data.
The structure of this paper is as follows. In
Section 2, the data preparation process is described, covering data sources, preprocessing techniques, and feature selection methods used to improve model performance. In
Section 3, we focus on the maintenance event prediction model, describing the methodology, machine learning techniques used, and the model architecture. Finally, in
Section 4, the results and validation are provided, where the predictive accuracy of the model is evaluated and its implications for the management of aircraft maintenance inventory are discussed.
2. Data Preparation
The dataset in this study includes aircraft maintenance data collected over ten years, from 2014 to 2024. The data, sourced from two comprehensive datasets, provides valuable insights into the lifecycle of aircraft components, covering both their installation and subsequent maintenance activities. These databases are sourced from an airline MRO company that operates under a large airline with a fleet of over 500 aircraft.
The first dataset focuses on installed components and contains detailed records of parts installed on aircraft during the operation period. This dataset includes information on part numbers, serial numbers, the specific aircraft on which components were installed, and the dates of installation. It also captures details such as the flight hours and cycles logged by each component since its installation, the initial condition of the parts, and the age of both the aircraft and the components at the time of installation. Specifically, the fault-free operation interval for each component is directly quantified by the SN_FH and SN_FC metrics, which record the continuously accumulated usage from installation to the maintenance event. In addition, projected flight-count and flight-hour values three months ahead are provided for use in the estimation study, in line with the planning strategies. The mapping of calendar-based prediction horizons to operational metrics was established using the projected monthly utilization rates derived from the airline’s flight schedules. In
Figure 1, the real installation data is shown.
The second dataset includes maintenance logs, documenting activities carried out on the aircraft over the period. This dataset provides a rich account of maintenance transactions, including the types of maintenance performed, the components involved, and the brief reasons behind the interventions. Key information includes part and serial numbers, sub-categories of the components, aircraft identifiers, and the dates of maintenance actions. Additionally, it records whether the maintenance was scheduled or unscheduled, the condition of the components at the time of maintenance, and flight data such as hours and cycles linked to specific components. Age metrics for both aircraft and components are also provided, along with the total flight counts and flight hours for each part number. The real maintenance data is shown in
Figure 2.
To construct a unified event-level dataset suitable for predictive modeling, two complementary sources were merged: one containing records of currently installed components and another comprising historical maintenance logs. Each entry was labeled as either INSTALLED or MAINTENANCE to distinguish between components that are in service and those removed or intervened upon. Operational attributes were extracted to capture distinct dimensions of component life: flight hours (SN_FH) and flight cycles (SN_FC) were selected to quantify operational exposure, while component age (SN_AGE) and aircraft age (AC_AGE) were included to account for calendar-based aging effects that progress independently of flight operations.
Since maintenance records are temporally sparse and irregular, we avoid using sequential models such as LSTMs or recurrent neural networks (RNNs), which typically require dense and consistent time series data. Instead, we adopt a first-order Markovian abstraction, assuming that the current feature vector $\mathbf{x}_t$ acts as a sufficient statistic for the system’s degradation history. This assumption allows the probability of a maintenance event $y_t$ to be approximated solely from the instantaneous state, rather than conditioning on the entire historical trajectory:

$$P\big(y_t = 1 \mid \mathbf{x}_t, \mathbf{x}_{t-1}, \ldots, \mathbf{x}_0\big) \approx P\big(y_t = 1 \mid \mathbf{x}_t\big).$$

This memoryless formulation is valid because $\mathbf{x}_t$ is explicitly engineered to include cumulative integrals of operational stress—such as SN_FH and SN_FC. By embedding these cumulative metrics into the input vector, the historical usage trajectory is effectively encoded into the current state representation, enabling the use of static classifiers without sacrificing temporal context.
By structuring the input data under this assumption, we aim to preserve temporal relevance while maintaining compatibility with static models. This abstraction facilitates learning from maintenance histories without requiring fully sequential modeling, which is often infeasible in aviation datasets due to missing records, asynchronous updates, and inconsistent sampling frequencies. To capture component-level usage history, cumulative metrics—TOTAL_SN_FH and TOTAL_SN_FC—were computed by aggregating previous flight hours and cycles associated with the same serial number. These cumulative metrics quantify the component’s total operational usage over its lifespan. Additionally, the number of prior maintenance events was encoded using the NO_OF_PREV_MAINTENANCE feature, offering insight into individual part reliability over time by capturing the recurrence frequency of failures.
Following integration, variables representing the operational context were processed to capture distinct risk factors. AC_TYPE accounts for utilization differences across fleets, while CONDITION (e.g., new vs. repaired) represents the component’s baseline reliability profile upon installation. Subsequently, these categorical variables were transformed into binary representations using one-hot encoding. Continuous features were scaled to the range [0, 1] via min-max normalization to ensure numerical stability across models. The resulting dataset comprises both current operational states and enriched maintenance history, forming a structured basis for training latent space classifiers and survival analysis models under sparse and imbalanced conditions.
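As an illustration of this preparation pipeline, the sketch below applies the same steps—cumulative usage aggregation per serial number, one-hot encoding, and min-max scaling—to a toy set of records. The column names follow the paper’s schema, but the values and the exact implementation details are illustrative assumptions, not the authors’ code.

```python
import pandas as pd

# Toy event-level records; column names follow the paper's schema,
# values are illustrative only.
df = pd.DataFrame({
    "SN":        ["A1", "A1", "B2"],
    "SN_FH":     [1200.0, 800.0, 300.0],
    "SN_FC":     [400, 250, 90],
    "AC_TYPE":   ["A320", "A320", "B737"],
    "CONDITION": ["NEW", "REPAIRED", "NEW"],
})

# Cumulative usage accumulated by the same serial number *before*
# the current event, and the count of prior maintenance events.
df["TOTAL_SN_FH"] = df.groupby("SN")["SN_FH"].cumsum() - df["SN_FH"]
df["TOTAL_SN_FC"] = df.groupby("SN")["SN_FC"].cumsum() - df["SN_FC"]
df["NO_OF_PREV_MAINT"] = df.groupby("SN").cumcount()

# One-hot encode categorical context variables.
df = pd.get_dummies(df, columns=["AC_TYPE", "CONDITION"])

# Min-max scale continuous features to [0, 1].
for col in ["SN_FH", "SN_FC", "TOTAL_SN_FH", "TOTAL_SN_FC"]:
    lo, hi = df[col].min(), df[col].max()
    df[col] = (df[col] - lo) / (hi - lo) if hi > lo else 0.0
```

In a real pipeline the scaling parameters would be fitted on the training split only and reused for the test split.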
Table 1 provides a detailed breakdown of the data fields available in the unified dataset, categorizing them by their origin and utility in the modeling process. The table distinguishes between raw operational metrics and those derived through feature engineering, such as cumulative usage statistics (
TOTAL_SN_FH,
TOTAL_SN_FC) and previous maintenance numbers (
NO_OF_PREV_MAINT). The final column explicitly identifies the subset of features selected as inputs for the predictive models. While unique identifiers (
PN,
SN) remain in the dataset for tracking and validation, they are excluded from the training feature vector to prevent overfitting, ensuring the model generalizes based on operational behavior rather than specific identity tags. Similarly, fields such as
ATA_CHAPTER,
AC,
LONGITUDE/LATITUDE,
DATE, and
REASON_CATEGORY are listed to reflect the original data structure but are excluded from training as they do not contribute to the learning process. Collectively, these selected features construct a robust operational profile, enabling the model to correlate specific usage patterns and installation attributes with the likelihood of maintenance events.
Figure 3 illustrates the skewed distribution of available maintenance data across various part numbers. While some PNs exhibit a large number of maintenance records, many have relatively few observations. Such data scarcity can impede the effectiveness of predictive maintenance algorithms, as these models rely on sufficiently large and diverse training examples to learn robust patterns. In extreme cases where a PN has fewer than 100 maintenance records, the lack of data hinders the model’s ability to capture important failure modes or maintenance needs, ultimately degrading its overall performance and generalizability.
The final dataset comprises a total of 35,005 records collected from a fleet of over 500 aircraft between 2014 and 2024, covering 80 distinct PNs. While the aggregate dataset shows a balanced distribution with maintenance events constituting approximately 50.92% of the total observations, a significant data imbalance exists across varying part numbers, as illustrated in
Figure 3. Regarding data quality, missing values in the
SN_AGE field (observed in 57.84% of records) were handled via mean imputation, while missing entries in the
CONDITION field (4.63%) were encoded as a distinct “No_Condition” category to preserve information. No missing values were observed in the remaining fields. As listed in
Table 1, continuous features (e.g.,
SN_FH,
SN_FC) represent cumulative usage metrics and were normalized to the range [0, 1] for model stability.
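The missing-value treatment described above can be sketched as follows; the field names follow the paper’s schema, while the records and values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

records = pd.DataFrame({
    "SN_AGE":    [5.0, np.nan, 9.0, np.nan],    # years; gaps to impute
    "CONDITION": ["NEW", None, "REPAIRED", None],
})

# Mean imputation for the numeric SN_AGE field.
records["SN_AGE"] = records["SN_AGE"].fillna(records["SN_AGE"].mean())

# Missing CONDITION entries become an explicit category,
# preserving "missingness" as information for the model.
records["CONDITION"] = records["CONDITION"].fillna("No_Condition")
```

Encoding missingness as its own category (rather than dropping or imputing it) lets the classifier learn whether the absence of a condition record is itself predictive.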
3. Maintenance Event Prediction
This section details the methodological frameworks employed for the prediction of aircraft maintenance events, a critical component in optimizing fleet availability and supply chain management. We investigate two distinct predictive paradigms: a probabilistic survival analysis approach and a hybrid classification-based approach. First, we introduce the application of DeepHit, a deep neural network tailored for time-to-event prediction, which estimates the probability of component survival over continuous flight hours to derive actionable risk thresholds. Subsequently, we present a proposed hybrid architecture that leverages an autoencoder for dimensionality reduction coupled with traditional machine learning classifiers—specifically Random Forest, K-Nearest Neighbors, and Decision Trees—to categorize maintenance needs within a compact latent feature space.
To provide a comprehensive operational overview, the proposed framework proceeds in a structured workflow. Initially, raw data from maintenance logs and installation records are merged and preprocessed, involving the engineering of cumulative usage metrics—such as total flight hours and cycles—and feature normalization. Following data preparation, the methodology applies two parallel modeling strategies. The first utilizes the DeepHit network to learn probability distributions for survival analysis directly from the processed data. The second path implements a hybrid latent space classification strategy, where an autoencoder first compresses high-dimensional inputs into a lower-dimensional latent representation. These learned latent features are then utilized to train Random Forest, K-Nearest Neighbors, and Decision Tree classifiers to predict maintenance events. Finally, the models are evaluated across varying prediction horizons (3-month, 6-month, and 1-year) to assess their utility for fleet-wide inventory planning.
3.1. Maintenance Prediction with Survival Analysis: DeepHit
DeepHit is a deep learning model designed to address challenges in time-to-event prediction, particularly in multi-risk scenarios with data imbalance [
16]. Unlike traditional survival analysis methods, it learns the joint probability distribution of event times and outcomes directly from data, making it robust and flexible for predictive maintenance tasks in aviation. Its architecture combines a shared sub-network that captures common latent features with risk-specific sub-networks that model each competing risk separately. This structure enables DeepHit to predict the likelihood of various risks occurring at specific time points, with a softmax layer ensuring normalized and interpretable outputs.
DeepHit’s training process is centered on a composite loss function that effectively handles partially observed data. The primary component is the log-likelihood loss, $\mathcal{L}_1$, which ensures that the model accurately predicts observed event times and types while accommodating instances with incomplete information. This loss is defined in [16] as

$$\mathcal{L}_1 = -\sum_{i=1}^{N} \Big[ \mathbb{1}\{k^{(i)} \neq \varnothing\} \log y^{(i)}_{k^{(i)},\, s^{(i)}} + \mathbb{1}\{k^{(i)} = \varnothing\} \log \Big( 1 - \sum_{k=1}^{K} \hat{F}_k\big(s^{(i)} \mid \mathbf{x}^{(i)}\big) \Big) \Big],$$

where $k^{(i)}$ denotes the specific event type observed for the $i$-th data point, while $k^{(i)} = \varnothing$ indicates that the observation is censored. $\mathbb{1}\{\cdot\}$ is the indicator function, $y^{(i)}_{k,t}$ is the predicted probability of event $k$ occurring at time $s^{(i)}$, and $\hat{F}_k$ represents the cumulative incidence function for event $k$.

To refine the discrimination capability, a ranking loss $\mathcal{L}_2$ is incorporated to penalize the incorrect ordering of risk predictions between pairs of aircraft components. This is expressed as

$$\mathcal{L}_2 = \sum_{k=1}^{K} \alpha_k \sum_{i \neq j} A_{k,i,j}\, \eta\Big(\hat{F}_k\big(s^{(i)} \mid \mathbf{x}^{(i)}\big),\, \hat{F}_k\big(s^{(i)} \mid \mathbf{x}^{(j)}\big)\Big),$$

where $A_{k,i,j}$ is defined as $\mathbb{1}\{k^{(i)} = k,\; s^{(i)} < s^{(j)}\}$, an indicator function that selects valid pairs for comparison where subject $i$ experiences an event earlier than subject $j$, and $\eta(x, y) = \exp\big(-(x - y)/\sigma\big)$ is a convex loss function quantifying the concordance error with a scaling parameter $\sigma$. The parameter $\sigma$ acts as a scaling hyperparameter for the ranking loss function, controlling the steepness of the penalty for incorrect risk orderings. The total loss combines these objectives, balanced by a hyperparameter $\beta$:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_1 + \beta\, \mathcal{L}_2.$$
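The composite objective can be illustrated with a small discrete-time sketch. The function below is a simplified, assumed implementation (the names `pmf`, `beta`, and `sigma` are ours, and the exponential form of $\eta$ follows the original DeepHit paper); it is not the authors’ code, and it omits the per-risk weights $\alpha_k$ for brevity.

```python
import numpy as np

def deephit_loss(pmf, event, time, beta=0.5, sigma=0.1):
    """Composite DeepHit-style loss on a discrete time grid.

    pmf   : (N, K, T) predicted joint probabilities over K risks, T time bins
    event : (N,) observed event type in {0..K-1}, or -1 if censored
    time  : (N,) index of the observed event / censoring time bin
    """
    N, K, T = pmf.shape
    cif = np.cumsum(pmf, axis=2)           # cumulative incidence per risk

    # L1: log-likelihood for observed events and censored records.
    l1 = 0.0
    for i in range(N):
        if event[i] >= 0:                  # event observed at time[i]
            l1 -= np.log(pmf[i, event[i], time[i]] + 1e-8)
        else:                              # censored: survives past time[i]
            l1 -= np.log(1.0 - cif[i, :, time[i]].sum() + 1e-8)

    # L2: ranking loss over acceptable pairs (i fails before j, for risk k).
    l2 = 0.0
    for k in range(K):
        for i in range(N):
            if event[i] != k:
                continue
            for j in range(N):
                if time[i] < time[j]:      # valid comparison pair
                    diff = cif[i, k, time[i]] - cif[j, k, time[i]]
                    l2 += np.exp(-diff / sigma)

    return l1 + beta * l2
```

A well-calibrated model concentrates probability mass on the true event bin for observed failures and keeps the cumulative incidence low for censored units, which drives both terms down.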
The architecture employed in this study, illustrated in
Figure 4, is a deep neural network specifically adapted for survival analysis with competing risks. The model features a multi-task architecture comprising a shared sub-network and $K$ parallel cause-specific sub-networks (corresponding to the $K$ maintenance reasons). The input covariates $\mathbf{x}$ are first processed by the shared sub-network to capture latent representations $\mathbf{z}$ common to all event types. To preserve original feature information while leveraging these learned representations, a residual connection concatenates the original covariates with the shared output to form the vector $(\mathbf{x}, \mathbf{z})$. This vector serves as the input for the subsequent cause-specific sub-networks.
DeepHit produces a set of survival functions for each aircraft part, predicting the probability of the part remaining installed (i.e., not undergoing maintenance) at each specific flight hour. For each part $j$, the model outputs a survival function $S_j(t)$ over discrete flight hours $t \in \{0, 1, \ldots, T_{\max}\}$, where $T_{\max}$ is the maximum considered flight hour. The survival function $S_j(t)$ represents the probability that part $j$ has not undergone maintenance up to flight hour $t$. The maintenance probability (cumulative failure probability) at time $t$, denoted here as $F_j(t)$, is the complement of the survival function:

$$F_j(t) = 1 - S_j(t).$$

This relationship ensures that $F_j(t)$ represents the probability that part $j$ has undergone maintenance by or at flight hour $t$.
To predict the overall maintenance need, a threshold-based approach is applied to determine specific maintenance requirements for individual parts. If there are $N$ parts under consideration, the maintenance likelihood for each part $j$ at flight hour $t$ is $F_j(t)$. A maintenance event is flagged only when this probability exceeds a predefined threshold $\tau$. The binary decision indicator, $\delta_j(t)$, for each part $j$ at time $t$ is defined as

$$\delta_j(t) = \begin{cases} 1, & \text{if } F_j(t) \geq \tau, \\ 0, & \text{otherwise}, \end{cases}$$

where $\tau$ is a tunable probability threshold (e.g., $\tau = 0.5$). Consequently, the total count of predicted maintenance events across all parts at flight hour $t$, denoted as $M(t)$, is computed by summing the active decision indicators:

$$M(t) = \sum_{j=1}^{N} \delta_j(t).$$

This formulation ensures that only components with a sufficiently high likelihood of failure are included in the forecast, effectively filtering out false positives arising from low-confidence predictions. By varying the threshold $\tau$, different maintenance strategies can be simulated, allowing operators to balance between early interventions (low $\tau$) and conservative maintenance policies (high $\tau$).
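This thresholding rule reduces to a few lines of array arithmetic. The sketch below (with illustrative survival values and our own function name) converts survival curves into per-flight-hour maintenance counts:

```python
import numpy as np

def maintenance_forecast(survival, tau=0.5):
    """Convert survival curves into maintenance counts per flight hour.

    survival : (N, T) array, survival[j, t] = S_j(t) for part j
    tau      : probability threshold applied to F_j(t) = 1 - S_j(t)
    Returns (delta, M): binary decision indicators and per-hour event counts.
    """
    failure = 1.0 - survival            # F_j(t) = 1 - S_j(t)
    delta = (failure >= tau).astype(int)
    return delta, delta.sum(axis=0)     # M(t) = sum over parts j of delta_j(t)

# Three parts observed over four flight-hour bins (illustrative values).
S = np.array([
    [1.0, 0.90, 0.6, 0.30],
    [1.0, 0.80, 0.4, 0.20],
    [1.0, 0.95, 0.9, 0.85],
])
delta, M = maintenance_forecast(S, tau=0.5)   # M -> [0, 0, 1, 2]
```

Sweeping `tau` over a grid of values reproduces the trade-off described above between early intervention and conservative policies.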
3.2. Proposed Method: Maintenance Prediction with Autoencoder and Latent Space Classifier
A unified model capable of handling diverse part numbers with varying data availability is essential for ensuring robust predictive maintenance. However, maintenance events are rare, and for certain part numbers, the number of recorded observations is limited. We adopt a hybrid approach, where a deep learning-based autoencoder is employed as a backbone feature extractor, and machine learning classifiers are used for final classification within the latent space. This strategy allows us to leverage the representational power of neural networks while ensuring effective learning with limited data using traditional classifiers.
The autoencoder first transforms high-dimensional input data into a lower-dimensional latent representation, capturing essential feature structures while eliminating redundant information. The learned latent space serves as an effective feature space for classification, where three machine learning algorithms are evaluated: (i) K-Nearest Neighbors, (ii) Decision Trees, and (iii) Random Forest. These algorithms were selected for their ability to work efficiently with moderate-sized datasets, unlike deep learning models that require extensive training data. This approach enables the model to adapt to different part numbers, ensuring robust classification even when maintenance records are sparse.
3.2.1. Autoencoder Architecture
In this section, we describe the autoencoder architecture, which transforms high-dimensional input data into a compact latent space while preserving essential features.
The autoencoder model used in this study consists of three main components, an encoder, a latent space, and a decoder, as illustrated in
Figure 5, enabling dimensionality reduction and reconstruction of the input data while preserving essential information and eliminating redundant features. The encoder transforms the high-dimensional input into a compact latent representation, capturing the core attribute features of the data and effectively reducing its dimensionality. The latent space serves as the compressed representation of the input data, retaining the critical features necessary for reconstruction and acting as an effective feature space for downstream tasks such as classification. The decoder reconstructs the original input from the latent representation, aiming to closely replicate the input while minimizing the reconstruction error. Training of the autoencoder focuses on reducing this reconstruction error, ensuring that the model captures the intrinsic structure of the input data in the latent space. These latent features are then utilized to train machine learning classifiers to effectively predict component maintenance schedules.
The autoencoder is trained using a mean squared error (MSE) loss function, which quantifies the reconstruction error between input data $\mathbf{x}_i$ and its reconstructed output $\hat{\mathbf{x}}_i$:

$$\mathcal{L}_{\text{MSE}} = \frac{1}{N} \sum_{i=1}^{N} \big\| \mathbf{x}_i - \hat{\mathbf{x}}_i \big\|^2.$$
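As a minimal numeric sketch, the reconstruction loss can be computed as below; averaging the squared norm per sample (rather than per element) is our assumed convention, and the inputs are illustrative:

```python
import numpy as np

def mse_reconstruction_loss(x, x_hat):
    """Mean over N samples of the squared reconstruction error ||x_i - x_hat_i||^2."""
    x, x_hat = np.asarray(x, float), np.asarray(x_hat, float)
    return np.mean(np.sum((x - x_hat) ** 2, axis=1))

x     = np.array([[0.2, 0.8], [0.5, 0.5]])
x_hat = np.array([[0.1, 0.9], [0.5, 0.5]])
loss = mse_reconstruction_loss(x, x_hat)   # ((0.1**2 + 0.1**2) + 0) / 2 = 0.01
```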
3.2.2. Latent Space Classifiers
In this section, we introduce the machine learning classifiers applied to this latent space and justify their selection for handling limited data (
Figure 6). Three different machine learning algorithms—K-Nearest Neighbors (KNNs), Decision Trees, and Random Forest—are employed for classification in the latent space. These models are selected based on their ability to learn effectively from limited data, making them suitable for scenarios where the number of available samples for certain part numbers is relatively small. Unlike deep learning models, which require large-scale datasets to generalize well and avoid overfitting, traditional machine learning algorithms can efficiently extract patterns from moderate-sized datasets without extensive hyperparameter tuning or computational demands.
KNN is chosen due to its instance-based learning approach, which relies on local neighborhood relationships to make predictions. This property allows it to adapt well to variations in feature distributions within the latent space. Decision Trees, on the other hand, provide an interpretable structure that recursively partitions the feature space to maximize information gain at each step. This method is particularly useful for identifying key decision boundaries within the reduced latent representation. Finally, Random Forest, an ensemble-based technique composed of multiple Decision Trees, enhances classification robustness by aggregating predictions across multiple trees to reduce variance and improve generalization. These models collectively offer a balance between interpretability, computational efficiency, and adaptability to limited data, making them an appropriate choice for predictive maintenance tasks in aviation [
19].
The Random Forest classifier, an ensemble of $T$ Decision Trees, is trained on bootstrap samples of the data, with each tree $t$ producing a prediction $\hat{y}_t(\mathbf{z})$. The final prediction is obtained through majority voting [20]:

$$\hat{y} = \operatorname{mode}\big\{ \hat{y}_1(\mathbf{z}),\, \hat{y}_2(\mathbf{z}),\, \ldots,\, \hat{y}_T(\mathbf{z}) \big\}.$$

The model also calculates feature importance $I_k$ for each feature $k$, based on the reduction in impurity $\Delta i(s)$ across all splits involving $k$:

$$I_k = \frac{1}{T} \sum_{t=1}^{T} \sum_{s \in S_{t,k}} \Delta i(s),$$

where $S_{t,k}$ is the set of splits on feature $k$ in tree $t$.
The K-Nearest Neighbors classifier assigns a label to each instance based on the majority class among its $k$ Nearest Neighbors in the feature space. Given an input $\mathbf{z}$, the predicted class $\hat{y}$ is [21]

$$\hat{y} = \arg\max_{c} \sum_{\mathbf{z}_i \in \mathcal{N}_k(\mathbf{z})} \mathbb{1}\{ y_i = c \},$$

where $\mathcal{N}_k(\mathbf{z})$ represents the set of $k$ Nearest Neighbors of $\mathbf{z}$.
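This majority-vote rule can be sketched in a few lines; the toy 2-D latent vectors, labels, and the Euclidean distance metric below are illustrative assumptions:

```python
import numpy as np

def knn_predict(z, Z_train, y_train, k=3):
    """Majority vote among the k nearest training points in latent space."""
    dists = np.linalg.norm(Z_train - z, axis=1)      # Euclidean distances
    neighbors = y_train[np.argsort(dists)[:k]]       # labels of k nearest
    classes, counts = np.unique(neighbors, return_counts=True)
    return classes[np.argmax(counts)]                # argmax over class votes

# Toy 2-D latent vectors labeled 0 = no maintenance, 1 = maintenance.
Z = np.array([[0.1, 0.1], [0.2, 0.0], [0.9, 0.8], [1.0, 1.0], [0.85, 0.9]])
y = np.array([0, 0, 1, 1, 1])
pred = knn_predict(np.array([0.9, 0.9]), Z, y, k=3)   # all 3 neighbors are class 1
```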
Decision Trees partition the latent space into a hierarchy of decision nodes. Each split is determined by finding the feature $k^{\ast}$ that maximizes the reduction in impurity $\Delta I$ [22]:

$$\Delta I = I(D) - \frac{|D_L|}{|D|}\, I(D_L) - \frac{|D_R|}{|D|}\, I(D_R),$$

where $D$ is the dataset at the current node, and $D_L$ and $D_R$ are the left and right child nodes, respectively.
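The impurity-reduction criterion can be illustrated with Gini impurity, one common choice (the text does not specify which impurity measure is used); the labels and split masks below are illustrative:

```python
import numpy as np

def gini(labels):
    """Gini impurity I(D) of a label array."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def impurity_reduction(labels, left_mask):
    """Delta I for splitting D into D_L (mask True) and D_R (mask False)."""
    D_L, D_R = labels[left_mask], labels[~left_mask]
    n = len(labels)
    return gini(labels) - len(D_L) / n * gini(D_L) - len(D_R) / n * gini(D_R)

y = np.array([0, 0, 1, 1])
# A split that separates the classes perfectly vs. one that does not.
perfect = impurity_reduction(y, np.array([True, True, False, False]))
useless = impurity_reduction(y, np.array([True, False, True, False]))
```

A tree builder scans candidate features and thresholds and keeps the split with the largest reduction, here `perfect` (0.5) over `useless` (0.0).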
Model evaluation is performed across different temporal horizons, including one-month and three-month prediction periods. For each test instance, the encoder maps the input features into the latent space, which is then passed to the classifiers for prediction.
The proposed latent space representation, combined with the aforementioned classifiers, provides a robust framework for predictive maintenance. The backbone model effectively captures and encodes critical features across the entire dataset, mapping high-dimensional input data into a compact latent space. By leveraging this learned representation, the machine learning classifiers operate on a feature space that preserves essential information while reducing noise and redundancy. This approach enhances classification performance, even in scenarios with limited data availability, as the latent space enables more efficient learning of underlying patterns. Consequently, the integration of deep feature extraction with traditional machine learning classifiers improves predictive maintenance outcomes by ensuring more reliable and generalizable decision-making.
4. Data-Driven Validations and Results
To evaluate the proposed framework against a rigorous baseline, we selected DeepHit, a prominent deep learning approach in survival analysis. While traditional methods such as the Cox proportional hazards (CPH) model and Random Survival Forests (RSF) are common baselines, DeepHit has demonstrated superior performance in capturing non-linear relationships and handling complex, time-varying covariates in multi-risk scenarios. Its architecture, combining a shared sub-network with risk-specific sub-networks, allows it to learn joint probability distributions directly from data, making it robust for predictive maintenance.
4.1. Experimental Setup
The DeepHit model utilizes a multi-task neural network architecture configured with a shared sub-network to extract common latent features, feeding into distinct cause-specific sub-networks. Each sub-network comprises two fully connected layers of 64 neurons. Batch normalization was applied to the outputs of these dense layers to mitigate internal covariate shift, immediately preceding the non-linear tanh activations. Training was optimized using the Adam optimizer with a learning rate of for 200 epochs, utilizing a batch size of 128, minimizing a composite loss of log-likelihood and ranking constraints. To quantify maintenance predictions, we used multiple probability thresholds to convert the predicted cumulative incidence functions (CIFs) into discrete maintenance events.
The proposed autoencoder-based latent space classifiers were trained using the configuration detailed in Section 3.2. In the proposed method, the encoder consists of three fully connected layers with progressively decreasing dimensions: 64, 32, and 16 neurons, respectively. We conducted an ablation study on the latent dimension size (testing sizes of 8, 16, 32, and 64); the selected dimension offered the optimal trade-off between reconstruction fidelity (MSE loss) and classification separability. To prevent overfitting and stabilize learning, batch normalization is applied after the first two layers. Additionally, a weight regularization term with a gain factor is introduced to constrain weight magnitudes and improve generalization. The decoder mirrors the encoder's structure, reconstructing the original input by successively expanding the feature space through layers of 16, 32, and 64 neurons before the final output layer, which employs a sigmoid activation function. This final activation ensures that reconstructed values remain within a normalized range, making the model particularly effective for datasets with bounded feature distributions. The model was trained for 200 epochs using the Adam optimizer and a batch size of 256. The hyperparameters for the latent space classifiers were selected based on extensive empirical analysis to maximize performance, and the results presented correspond to the best-performing configurations. Specifically, for the K-Nearest Neighbors classifier, the number of neighbors was tuned as part of this analysis. For the Decision Tree classifier, the maximum depth was set to 100. Additionally, for the Random Forest classifier, the number of estimators was set to 200 to ensure robust ensemble learning.
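As a rough, library-level analogue of the configuration above, the sketch below trains scikit-learn's `MLPRegressor` to reconstruct its own input with a 64-32-16-32-64 hidden stack, where `alpha` plays the role of the weight regularization gain. Batch normalization and a sigmoid output layer are not expressible in this API, and the data is synthetic, so this approximates rather than reproduces the described model.

```python
# Autoencoder analogue: an MLP trained to reconstruct its input through a
# 16-dimensional bottleneck. The architecture mirrors the 64-32-16-32-64
# stack from the text; batch norm and the sigmoid output are not available
# in this API, so this is an approximation on synthetic data.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
# synthetic bounded features with low-rank structure, so compression is possible
Z = rng.normal(size=(500, 8))
W = rng.normal(size=(8, 64)) * 0.5
X = 1.0 / (1.0 + np.exp(-(Z @ W)))            # values in (0, 1)

ae = MLPRegressor(
    hidden_layer_sizes=(64, 32, 16, 32, 64),  # encoder -> bottleneck -> decoder
    alpha=1e-4,                # L2 penalty: the analogue of the gain factor
    batch_size=256,
    max_iter=200,              # 200 epochs, matching the text
    random_state=0,
)
ae.fit(X, X)                   # autoencoding objective: reconstruct the input
mse = float(np.mean((ae.predict(X) - X) ** 2))
print(f"reconstruction MSE: {mse:.4f}")
```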
For all methods, an 80–20% train–test split was used. The data splitting process employed a stratified sampling strategy based on part numbers, ensuring that every component type is proportionally represented in both the training and testing sets and preventing scenarios where data-sparse components are isolated solely in the test set.
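The part-number-stratified split can be sketched as follows; the part numbers, class proportions, and feature array are illustrative stand-ins.

```python
# Stratified 80-20 split keyed on part number, so that even rare part types
# appear in both the training and testing sets. All data is synthetic.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
part_number = rng.choice(["PN-A", "PN-B", "PN-C"], size=200, p=[0.6, 0.3, 0.1])
features = rng.normal(size=(200, 5))

X_tr, X_te, pn_tr, pn_te = train_test_split(
    features, part_number, test_size=0.2, stratify=part_number, random_state=0
)
# every part number is represented in both splits
print(sorted(set(pn_tr)), sorted(set(pn_te)))
```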
4.2. Classifier Performance on the Test Set
Figure 7 provides a side-by-side comparison of different classification strategies on the test set. In these boxplots, the interquartile ranges represent the spread of average results for each part number, illustrating how performance varies across different PNs. Since the primary utility of the proposed framework is maintenance prediction to support binary inventory decisions (stock or do not stock) at fixed horizons, we prioritized the classification metrics F1, accuracy, precision, and recall over rank-based survival metrics (e.g., the C-index). Converting standard classifier outputs to continuous risk scores for such metrics was avoided to prevent misleading comparisons arising from uncalibrated probability estimates.
The F1 scores and accuracies highlight the overall effectiveness of each model in correctly identifying the positive class, while precision and recall give deeper insights into trade-offs between false alarms and missed detections. Here, KNN, Decision Tree, and Random Forest show relatively high median values with tight interquartile ranges for most metrics, reflecting more consistent performance. In contrast, the DeepHit classifiers exhibit varying results depending on the chosen probability threshold. Even though a threshold of 0.10 is optimal when maximizing the F1 score, we present results across a range of thresholds to explicitly demonstrate the behavioral shifts in performance: lower thresholds (e.g., 0.10) typically increase sensitivity but can reduce precision, whereas higher thresholds (e.g., 0.50) lower the false alarm rate at the cost of missing more positive cases.
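The per-part-number metric distributions summarized in the boxplots can be computed along these lines; the labels and predictions below are synthetic stand-ins, and each per-PN score dictionary would correspond to one point in a box.

```python
# Per-part-number evaluation: each metric is computed separately for every
# PN, and the resulting per-PN values form the distribution shown in a
# boxplot. Labels and predictions are synthetic (80% agreement by design).
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

rng = np.random.default_rng(0)
pns = rng.choice(["PN-A", "PN-B", "PN-C"], size=300)
y_true = rng.integers(0, 2, size=300)
y_pred = np.where(rng.uniform(size=300) < 0.8, y_true, 1 - y_true)

per_pn = {}
for pn in np.unique(pns):
    m = pns == pn
    per_pn[pn] = {
        "f1": f1_score(y_true[m], y_pred[m]),
        "accuracy": accuracy_score(y_true[m], y_pred[m]),
        "precision": precision_score(y_true[m], y_pred[m]),
        "recall": recall_score(y_true[m], y_pred[m]),
    }
for pn, scores in per_pn.items():
    print(pn, {k: round(v, 2) for k, v in scores.items()})
```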
Figure 8, Figure 9 and Figure 10 illustrate the predicted versus actual maintenance decisions for 80 parts using three latent space classifiers: K-Nearest Neighbors, Decision Tree, and Random Forest. In each subplot, the classifiers aim to distinguish components that require maintenance in the test set. Overall, all three models demonstrate a strong alignment between predicted and actual outcomes, with Random Forest outperforming the others in terms of consistency and accuracy. This visual comparison reinforces the suitability of traditional classifiers in leveraging learned latent representations for effective binary classification, especially when part-level features are compactly encoded and data is limited.
Figure 11, Figure 12 and Figure 13 show the predictive results of the DeepHit model applied with different probability thresholds (0.10, 0.25, and 0.50). The aggregate predicted maintenance counts at a probability threshold of 0.25 align most closely with the ground truth values, avoiding the systematic overestimation observed at the 0.10 threshold. However, analysis of the classification metrics in Figure 7 reveals that this aggregate alignment is misleading regarding predictive utility. While the 0.25 threshold brings the total number of predicted events closer to reality, the boxplots demonstrate a significant reduction in recall compared to the 0.10 threshold. At lower thresholds, the model predicts more maintenance events, capturing more true positives but also increasing the risk of false alarms. As the threshold increases, predictions become more conservative, reducing false positives but also missing potential failures.
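The threshold effect described here can be illustrated directly: a predicted CIF is compared against each probability threshold at the decision horizon, so a lower threshold flags maintenance more readily (higher recall, more false alarms) while a higher one stays conservative. The CIF values below are made up for illustration.

```python
# Converting a predicted cumulative incidence function (CIF) into discrete
# maintain / do-not-maintain decisions at several probability thresholds.
# The CIF values are illustrative, not model output.
import numpy as np

cif = np.array([0.05, 0.12, 0.22, 0.31, 0.38, 0.44])  # cumulative failure prob. per month
horizon = 2                      # e.g. a 3-month decision window (bins 0-2)

decisions = {t: bool(cif[horizon] >= t) for t in (0.10, 0.25, 0.50)}
for t, flag in decisions.items():
    print(f"threshold {t:.2f}: predict maintenance = {flag}")
```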
4.3. Classifier Performance Across Prediction Horizons
Figure 14 compares the performance of various latent space classifiers—KNN, Decision Tree, Random Forest, and DeepHit (at thresholds 0.10, 0.25, and 0.50)—across three different prediction horizons: 3 months, 6 months, and 1 year. Each row in the figure corresponds to a specific time period, and each column shows one of the evaluation metrics: F1 score, accuracy, precision, and recall. Across all periods, traditional classifiers (KNN, Decision Tree, and Random Forest) generally outperform DeepHit-based methods in terms of F1 score and accuracy. Random Forest consistently shows strong performance with high median scores and low variability, indicating robust generalization across different parts.
As the prediction window lengthens from 3 months to 1 year, a general decline is observed in F1 score, accuracy, and recall across most classifiers, reflecting the increasing difficulty of making accurate predictions over extended horizons. Notably, the interquartile range for F1 scores tightens at the 1-year mark, indicating more consistent performance across different parts despite the lower average metrics. One possible explanation is that by one year, most components either clearly require maintenance or clearly do not, reducing variability in predicted outcomes across parts. Precision remains relatively stable for many models and even shows slight improvement in some cases, especially for the more conservative classifiers like Random Forest and DeepHit with higher thresholds (e.g., 0.50). This behavior suggests that while models become less sensitive to detecting true maintenance events over time (lower recall), their specificity in correctly identifying necessary interventions does not degrade as markedly. DeepHit models exemplify this trade-off clearly: higher thresholds lead to higher precision but lower recall, whereas lower thresholds capture more true positives at the expense of false positives.
Figure 15, Figure 16 and Figure 17 show the predictions of the latent space classifiers and DeepHit (with thresholds 0.10, 0.25, and 0.50) for 3-month, 6-month, and 1-year prediction horizons, respectively. The results indicate that DeepHit significantly overestimates maintenance counts at lower thresholds (0.10 and 0.25), demonstrating a tendency to produce false positives when adopting less restrictive criteria.
As the prediction horizon expands from 3 months (Figure 15) to 6 months (Figure 16) and ultimately to 1 year (Figure 17), the variability in classifier predictions tends to decrease, leading to more consistent and stable results. This phenomenon could be attributed to reduced data noise over longer horizons; for instance, certain components predicted to require immediate maintenance within 3 months may continue operating beyond that period, thus introducing uncertainty into short-term predictions. Longer-term predictions allow classifiers to better generalize and reduce false positives and negatives caused by transient anomalies or short-term operational variations. These observations emphasize the necessity of carefully selecting appropriate thresholds and classifiers tailored to specific maintenance objectives and prediction horizons.
The predicted maintenance outcomes produced by the proposed framework can be directly utilized to support seamless airline operations by enabling proactive planning and timely interventions. By anticipating which components are likely to require maintenance within a specified horizon, airlines can align their maintenance schedules and logistics operations to reduce unplanned downtimes. These predictions also serve as a basis for dynamic inventory management—allowing planners to better estimate required quantities of spare parts to stock, lease, or redistribute across maintenance bases. This predictive capability ensures high fleet availability, minimizes last-minute part sourcing, and enhances overall operational resilience, particularly for high-rotation or mission-critical components.
Effective inventory planning in aviation depends not only on accurate maintenance forecasts but also on aligning predictions with operational planning horizons. Short-term predictions (e.g., 3-month) are essential for immediate procurement and reactive inventory management but are prone to higher variability due to transient operational anomalies and noise in maintenance records. In contrast, longer-term forecasts (6-month to 1-year) enable strategic planning by smoothing out short-term fluctuations and providing more stable demand estimates. The proposed latent space classification framework, supported by a shared encoder trained on global data, demonstrates superior robustness across all horizons, especially in data-scarce settings. By maintaining consistent prediction accuracy and minimizing false alarms, it enables both responsive short-term actions and reliable long-term stocking strategies. This dual capability provides airlines with a practical advantage in balancing just-in-time logistics and cost-efficient spare part provisioning across varied planning intervals. Furthermore, unlike complex sequential models that require processing long historical dependencies, the proposed framework operates on aggregated snapshot features. This architectural simplicity ensures low computational overhead, enabling rapid retraining and near real-time inference, making it highly suitable for daily operational updates in an airline environment.
5. Conclusions
This study presents a predictive maintenance framework tailored for aviation, addressing challenges posed by sparse, imbalanced, and irregular maintenance records across diverse aircraft components. Designed for a large-scale airline operation managing a fleet of over 500 aircraft, the proposed latent space classification approach leverages a shared encoder backbone trained on all available part data, mitigating data imbalance and enhancing generalization for part numbers with limited historical observations. This encoder-based method is systematically compared with DeepHit, a state-of-the-art survival analysis model, enabling a thorough evaluation of discriminative versus probabilistic strategies for forecasting component-level maintenance needs.
The comparative results show that while DeepHit provides valuable insights into survival probabilities and maintenance timing, its sensitivity to threshold selection and its performance degradation in low-data regimes limit its robustness across heterogeneous fleets. In contrast, the latent space classifiers—particularly the Random Forest model—maintain consistent accuracy across part numbers and prediction horizons, demonstrating superior adaptability in sparse data environments. This stability becomes increasingly important in longer-term forecasts, where operational noise is reduced and demand trends become clearer.
Accurate predictions of maintenance events have a direct and measurable impact on inventory control and logistics coordination in airline operations. By aligning forecast horizons—such as 3-month, 6-month, and 1-year intervals—with planning cycles, the proposed framework enables both short-term tactical decision-making and long-term strategic inventory management. The encoder-based classification approach supports seamless scheduling by producing reliable estimates across all horizons, thereby reducing unexpected component shortages and improving the spare part provisioning process. This capability is particularly critical in high-scale operations, where minimizing disruption and maximizing fleet availability hinge on anticipating part demand with high fidelity.
In conclusion, the proposed framework offers a practical and scalable solution for predictive maintenance and inventory planning in aviation. By combining global feature learning with part-specific classifiers, it bridges the gap between data-rich and data-poor components, delivering actionable insights that enhance reliability, reduce downtime, and support cost-effective operations across complex fleets.
Beyond its predictive accuracy, the real-world implication of this framework lies in its potential to transform reactive maintenance into a proactive strategy. By integrating these forecasts into inventory planning, airlines can optimize spare part logistics and significantly reduce costs associated with unexpected Aircraft-on-Ground (AOG) events. However, a limitation of the current study is its reliance on historical maintenance logs and cumulative usage metrics alone. While effective for capturing general degradation trends, this approach does not account for external operational variances—such as harsh environmental conditions or specific route characteristics—that may accelerate wear for individual components independently of flight hours. Future work will focus on improving the interpretability of model decisions, particularly in high-stakes maintenance scenarios. In addition, incorporating additional data sources, such as sensor streams, environmental records, or operational context logs, can further enhance the predictive power and robustness of the models. Such integration will help contextualize maintenance needs more accurately and enable deeper insight into component behavior beyond what is captured in log-based datasets alone.