Evaluating RUL Predictive Models: A Risk-Based Predictive Maintenance Approach

El-Thalji, Idriss; Usman, Ali; Ali, Waqar

doi:10.3390/ai7050169


by Idriss El-Thalji ¹,²,*, Ali Usman ² and Waqar Ali ²

¹ Department of Engineering Management, Prince Sultan University, Riyadh 11586, Saudi Arabia

² Department of Mechanical and Structural Engineering and Materials Science, University of Stavanger, 4036 Stavanger, Norway

* Author to whom correspondence should be addressed.

AI 2026, 7(5), 169; https://doi.org/10.3390/ai7050169

Submission received: 2 March 2026 / Revised: 10 May 2026 / Accepted: 12 May 2026 / Published: 14 May 2026

(This article belongs to the Special Issue AI for Industrial Operation and Maintenance: Recognition Challenges with Limited Data Condition)


Abstract

Remaining Useful Life (RUL) forecasting models are essential to enable predictive maintenance strategies. However, selecting the most appropriate model based solely on conventional accuracy metrics may be insufficient for practical decision making, where an adequate prediction horizon is required to plan maintenance activities. This study investigates the impact of prediction horizon on model performance and its implications for maintenance decision making. A multi-horizon evaluation approach is applied to assess model accuracy across different predictive horizons. The results show that accuracy and prediction error fluctuate across prediction horizons. Across both datasets, predictive accuracy was generally lowest at the long horizon (11.64–86.62%), remained variable at the medium horizon (18.13–82.04%), and was highest at the short horizon (30.29–98.25%). The results demonstrate that model performance varies significantly with the prediction horizon, highlighting a trade-off between prediction accuracy and the time available for operational planning. These findings emphasize that models with high short-term accuracy may not necessarily support effective maintenance decisions if sufficient lead time is not provided. The findings show how prediction horizon considerations should be integrated into a risk-based evaluation framework, in which model performance is interpreted in relation to the operational consequences of prediction errors. A complete risk-based predictive maintenance framework is proposed to support a shift toward comprehensive, risk-based evaluation as a prerequisite for reliable and effective RUL prediction in predictive maintenance systems.

Keywords: remaining useful life; data-driven RUL; risk-based predictive maintenance; forecasting models; vibration data; asset management

1. Introduction

Predictive maintenance (PdM) has enabled a shift from reactive or scheduled maintenance to proactive, data-driven approaches. This shift allows organisations to anticipate potential equipment failures and address them before they occur, minimising downtime, reducing costs, and improving overall operational efficiency. However, although PdM uses condition data and analytics to forecast when equipment is likely to fail, not all failures carry the same level of risk. Some failures may lead to safety hazards, environmental damage, or significant downtime, while others have minimal consequences. Risk assessment helps identify and prioritise the most critical assets or failure modes, considering both the likelihood of failure and the severity of its consequences.

Several studies [1,2,3,4,5] have addressed the need for risk-informed predictive maintenance, where risk assessment is integrated with the predictive maintenance strategy to improve decision making and reduce operational costs. Risk-informed predictive maintenance leverages risk assessment to determine the optimal timing and prioritisation of maintenance actions within an ongoing condition-based strategy. It is important to highlight that the risk-informed predictive maintenance approach relates to the utilisation phase, i.e., when the systems are in operation. This means that the predictive maintenance models are already designed and in use. The question is then how the predictive techniques were evaluated and selected, and what the risk is in terms of accuracy over the entire prediction horizon considered. This need is highlighted in several studies [6,7,8]. A preliminary framework was developed to determine technical specifications for predictive models in the project phase [6,7]. However, recent review papers [9,10,11,12,13] show that the risk approach is not yet used to evaluate remaining useful life (RUL) models.

Remaining Useful Life (RUL) prediction plays a key role in predictive maintenance by supporting data-driven maintenance planning and reducing operational risks. Data-driven RUL prediction models have evolved into a diverse methodological landscape encompassing classical machine learning, deep learning, and hybrid approaches. Deep learning for RUL prediction uses temporal models such as LSTM and its variants [14,15,16,17,18], convolutional feature extractors (1D CNN and hybrids) [15,18,19], attention/Transformer architectures [20,21], and generative methods such as GANs and physics-informed generators [22], often combined in hybrids. Existing studies have largely focused on improving model accuracy using statistical, machine learning, and deep learning approaches, typically evaluated through aggregate metrics such as RMSE, MAE, or MAPE. The vast majority of studies predict a single scalar RUL at each time point, reflecting single time horizon evaluation. Several recent works have begun to explore multi-step, autoregressive, or multi-scale formulations, but they remain a minority of published approaches [20,22,23].

The model evaluation is often limited to reporting performance without explicitly linking it to maintenance decision-making requirements. In practical applications, the effectiveness of an RUL model depends not only on its overall accuracy but also on its reliability across different prediction horizons. Short-term predictions may be highly accurate but provide limited time for intervention, whereas long-term predictions offer greater planning flexibility but are associated with higher uncertainty. Despite this, the relationship between horizon-dependent performance and operational consequences remains underexplored in the literature.

To address this gap, this study demonstrates a horizon-dependent evaluation perspective that links model performance with maintenance decision contexts. The study analyzes how prediction accuracy and error vary across horizons and interprets these variations in relation to potential operational consequences. In addition, complementary evaluation metrics, including the Prediction Horizon Error Profile (PHEP) and the Shape RMSE, are introduced to provide deeper insight into model reliability. This work therefore contributes to advancing toward a more comprehensive, risk-based evaluation of RUL models.

Two well-known open datasets are used for this study. The first dataset contains vibration data for a defective high-speed shaft bearing in a 2 MW wind turbine over 50 days. The second dataset comes from a bearing set-up at the Intelligent Maintenance Systems Centre at the University of Cincinnati. In this study, five predictive techniques to forecast the remaining useful life (RUL) are evaluated over three predictive horizons and levels of consequences.

In the following section, the theoretical background and the demonstration case are explained. Later, the results of the demonstrated case and the predictive techniques studied are summarised and illustrated, followed by conclusions.

2. Methods and Materials

2.1. RUL Prediction Models

RUL prediction draws on a wide range of algorithms, including reliability-based models, traditional machine learning, deep learning, and physics-based models [11]. These algorithms have been developed to suit different types of data and application needs. Classical statistical and reliability models are still widely used, especially where historical failure data is limited. Models like Auto-Regressive Integrated Moving Average (ARIMA), Facebook Prophet [24], Proportional Hazard Model (PHM), Cox Proportional Hazards Regression [25], Kalman Filters, Particle Filters and Hidden Markov Models (HMM) [26] are fundamentally built on statistical or probabilistic principles, and often incorporate explicit representations of the degradation process over time. These models often rely on strong assumptions about system behavior (e.g., linearity, Gaussian noise, or predefined state transitions) and require handcrafted features or deep domain knowledge to perform well. Condition monitoring data such as vibration data, however, often contain several non-linear patterns, such as exponential increases in amplitude due to crack growth, the appearance of harmonic content due to misalignment, or threshold effects caused by lubrication loss [27].

In contrast to statistical and reliability-based models, machine learning models are highly flexible and data-driven, capturing nonlinear patterns directly from raw sensor data without requiring explicit knowledge of the internal physics of the system [28]. In traditional machine learning, common algorithms include linear regression, support vector regression (SVR), random forest regression, gradient-boosted decision trees (such as XGBoost and LightGBM) and k-Nearest Neighbors (k-NN). These algorithms are effective when meaningful features or degradation indicators can be extracted from historical or sensor data [29]. However, these models do not have an internal state or memory to track past observations. This means that they do not capture temporal patterns such as acceleration of degradation or seasonal behavior, which conflicts with the sequential nature of time series data [29].

Deep learning models can forecast RUL considering time series features, e.g., dependencies and long sequences [30]. Deep models automatically learn multiple levels of abstraction. For example, lower layers detect simple patterns (e.g., edges, spikes), whereas middle layers combine patterns into features (e.g., trends, cycles), and higher layers form abstract concepts (e.g., fault signatures, degradation types). There are three mechanisms to learn temporal structure and dependencies in sequential data: recurrence (used in Recurrent Neural Networks), convolution or sliding filter (used in Convolutional Neural Networks), and self-attention (used in transformers). Figure 1 illustrates the main categories of RUL predictive models.

Recurrent Neural Networks (RNNs) depend on the recurrence mechanism to maintain a hidden state that carries forward past information. The mechanism creates a short-memory capability to capture temporal dependencies (e.g., trends, cycles). However, RNNs forget fast because they lack mechanisms to protect or filter important information [31]. Both long short-term memory (LSTM) and gated recurrent unit (GRU) networks are considered special types of RNNs, specifically designed to enable learning of long-term dependencies in time series data. Both introduce gating mechanisms that allow the network to control information flow, deciding what to remember, what to forget, and what to update at each time step. For example, if a vibration spike is irrelevant, the model can forget it, while if a new vibration trend appears, the model can add it. LSTM and GRU process each time step sequentially, not in overlapping windows. That means they may miss local patterns: sudden short events such as a crack or impact spike, repeating bursts of oscillation, or harmonic patterns. These patterns are short and high-resolution, and are therefore easy to miss.

Convolutional neural networks (CNNs) use a convolution or sliding filter (kernel) to compute a weighted sum of values, which acts as overlapping scan windows [32]. Transformer-based architectures, such as the Time Series Transformer and Informer, transform each input element into a context-rich representation using attention mechanisms [33], rather than recurrence or convolution. Self-attention allows every position in the sequence to interact with every other position and processes the entire sequence at once [34]. It enables the model to learn which time points are important, regardless of how far apart they are. Although both CNNs and self-attention mechanisms rely on weighted sums to extract features, they differ fundamentally in how they choose their weights and where they apply them. CNNs use fixed filters that focus on local neighborhoods, making them ideal for detecting repeating short-term patterns [32]. In contrast, self-attention calculates dynamic weights based on the relationships between all time steps, enabling the model to learn global dependencies. Transformers look at all time steps at once, dynamically deciding what matters most, whether it is a spike lasting a few seconds or a trend from several minutes back [35].
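The contrast between fixed local weights and dynamic global weights can be illustrated with a minimal NumPy sketch. This is only an illustration, not the architecture used in this study: the signal values and the smoothing kernel are hypothetical, and the attention function uses identity query/key/value projections as a simplifying assumption, whereas Transformer layers learn these projections.

```python
import numpy as np

def conv1d_valid(x, kernel):
    """Slide a fixed kernel over x; each output is a weighted sum of one local window."""
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel) for i in range(len(x) - k + 1)])

def self_attention(x):
    """Scaled dot-product self-attention on a (T, d) sequence.

    Assumption: identity projections are used instead of learned Q, K, V matrices.
    """
    d = x.shape[1]
    scores = x @ x.T / np.sqrt(d)                  # (T, T) similarity between all time steps
    scores -= scores.max(axis=1, keepdims=True)    # subtract row maximum for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax: each row sums to 1
    return weights @ x, weights

# Hypothetical vibration amplitudes with two impact-like spikes far apart in time.
signal = np.array([0.1, 0.1, 0.2, 1.5, 0.2, 0.1, 0.1, 1.4])

local = conv1d_valid(signal, np.array([0.25, 0.5, 0.25]))  # same weights at every position
context, attn = self_attention(signal.reshape(-1, 1))      # weights depend on the data
```

In this sketch, each entry of `local` only sees three neighbouring samples, while each row of `attn` distributes weight over all eight time steps, so the two distant spikes can attend to each other.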

Transformer-based architectures have shown strong potential in time series forecasting due to their ability to capture long-range dependencies through attention mechanisms [36,37]. In the context of RUL prediction, this may lead to more consistent performance across different predictive horizons compared to recurrent models such as LSTM and GRU, which often exhibit reduced accuracy at longer horizons. This capability is particularly relevant for predictive maintenance, where reliable early predictions are critical for planning.

RNNs, CNNs, and Transformers are all powerful in learning from time series data, but are not inherently designed to denoise or compress raw sensor data [38]. Raw signals from industrial equipment are often high-dimensional and noisy, posing challenges for direct learning. To effectively process raw sensor data for predictive maintenance and RUL forecasting, two primary solution strategies are commonly employed [39]. First, feature engineering and signal processing techniques are applied, such as filtering, Fourier transforms, and statistical descriptors like peak or kurtosis. Second, autoencoders are used to compress high-dimensional data into lower-dimensional representations while preserving key system behaviors, often acting as denoisers or feature extractors. An autoencoder is a type of neural network designed to learn a compressed representation (encoding) of input data and then reconstruct the original input from that compressed form. Autoencoders are trained to minimize the reconstruction error between the input and the output. A high reconstruction error typically means that the input does not match what the model has learned to represent. This may occur because the input is anomalous (it deviates from what the autoencoder has learned), because it contains noise or reflects different operating conditions, or because it is compressed too aggressively.

Hybrid approaches address specific limitations of individual models in predicting RUL by leveraging their complementary strengths [30]. A common example is the CNN–LSTM architecture, where the CNN acts as an extractor of local features, such as repetitive oscillations or shock impulses, while the LSTM captures how those patterns evolve over time. In another example, autoencoder-RNN (like LSTM or GRU) models are applied, where the autoencoder compresses and denoises the complex multivariate input, which is then fed into an RNN that learns the degradation trends over time to estimate the remaining life. The CNN–Transformer combination can be highly effective when long-term dependencies and noise present major challenges. Although high reconstruction error is a powerful signal for detecting anomalies, it becomes a limitation when predicting Remaining Useful Life [13,40]. RUL forecasting depends on smooth, interpretable features that track gradual degradation over time. If an autoencoder is trained only on healthy data, it may fail to represent fault conditions accurately, leading to noisy or misleading predictions.

Physics-informed machine learning models are also described as hybrid models that integrate domain knowledge, such as degradation laws, into the learning process [41]. The choice of algorithm depends on factors such as the availability and quality of the data, the need for interpretability or real-time performance, and whether uncertainty estimation is required [42]. Combining domain knowledge with machine learning might lead to more reliable and practical RUL predictions in industrial applications.

2.2. Evaluation Metrics for RUL Forecasting

To evaluate the performance of Remaining Useful Life (RUL) forecasting models, several key metrics are used to assess prediction accuracy and trend alignment, such as mean absolute error (MAE), root mean square error (RMSE), mean absolute percentage error (MAPE) and coefficient of determination (R²) [11].

The Mean Absolute Error (MAE) measures the average magnitude of errors between predicted and true RUL values. It is simple to interpret and treats all errors equally:

MAE = \frac{1}{n} \sum_{i = 1}^{n} |{\hat{y}}_{i} - y_{i}|

(1)

The Root Mean Square Error (RMSE) computes the square root of the mean squared differences between the predicted and actual RUL values. It penalizes large errors more than MAE and is sensitive to outliers:

RMSE = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} {({\hat{y}}_{i} - y_{i})}^{2}}

(2)

The Mean Absolute Percentage Error (MAPE) expresses prediction errors as a percentage of the actual RUL values, making it easier to compare across different asset types or degradation ranges:

MAPE = \frac{100}{n} \sum_{i = 1}^{n} |\frac{{\hat{y}}_{i} - y_{i}}{y_{i}}|

(3)

Accuracy is sometimes approximated as a complement of MAPE:

Accuracy = 100 % - MAPE (%)

(4)

This definition should be used with caution, especially when MAPE values are high or when domain-specific tolerances are critical.

The coefficient of determination (R²) indicates how well the predicted RUL values explain the variance in the actual RUL values. An R² close to 1 implies strong predictive performance, while a negative value suggests that the model performs worse than simply predicting the mean RUL:

R^{2} = 1 - \frac{\sum_{i = 1}^{n} {(y_{i} - {\hat{y}}_{i})}^{2}}{\sum_{i = 1}^{n} {(y_{i} - \bar{y})}^{2}}

(5)
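Equations (1) to (5) can be computed together in a few lines. The following is a minimal NumPy sketch; the function name `rul_metrics` and the example values are hypothetical and are not taken from the study's datasets.

```python
import numpy as np

def rul_metrics(y_true, y_pred):
    """Compute MAE, RMSE, MAPE (%), accuracy (%) and R^2 as in Equations (1)-(5)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true

    mae = np.mean(np.abs(err))                    # Equation (1)
    rmse = np.sqrt(np.mean(err ** 2))             # Equation (2)
    # Equation (3): undefined if any actual value is zero (e.g., RUL = 0 at failure).
    mape = 100.0 * np.mean(np.abs(err / y_true))
    accuracy = 100.0 - mape                       # Equation (4)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot                    # Equation (5)

    return {"MAE": mae, "RMSE": rmse, "MAPE": mape, "Accuracy": accuracy, "R2": r2}

# Hypothetical actual and predicted RUL values (in days), for illustration only.
actual = [50.0, 40.0, 30.0, 20.0, 10.0]
predicted = [48.0, 41.0, 33.0, 18.0, 12.0]
metrics = rul_metrics(actual, predicted)
```

Because MAPE divides by the actual value, the same absolute error weighs more heavily near the end of life, where the actual RUL is small. This is one reason the MAPE-based accuracy should be read together with the other metrics.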

2.3. Risk-Based Predictive Maintenance

Risk-based maintenance management aims to manage and prioritize maintenance activities based on the risks associated with equipment failures [43], considering both the likelihood and the consequences of failure. Considering these two elements of risk helps to classify assets and failure modes and to determine the most critical ones, which have the highest potential consequences for safety, production, and operating expenses. The NORSOK Z-008 [44] standard provides guidelines for criticality classification. After 2011, the emphasis shifted more towards consequences, to prioritise corrective maintenance activities and handle high-consequence but low-frequency failure events. Risk assessment serves three purposes in the risk-based maintenance approach, where it is used as a determining criterion to select or prioritize preventive and corrective maintenance actions. First, the equipment is classified by the risk index, which determines the preventive maintenance concept. Second, the risk index determines the maintenance task and needs, i.e., maintenance interval and level of service. For example, critical equipment classes are usually inspected and maintained more frequently. Third, the risk index is used to prioritize the corrective maintenance tasks; in this case only consequences are used, as the failures have already occurred (there is no need to consider the likelihood of failure).

In the context of predictive maintenance, risk can be used for two purposes. First, it should be used to select the appropriate predictive technique. In this case, risk is related to the accuracy, reliability, and predictive horizon that each predictive technique provides. For example, it might be risky to select a predictive technique that has a short predictive horizon, even though it has high accuracy and reliable predictions. The second purpose is to use risk assessment to determine when preventive maintenance action should be taken after the fault is detected and the failure is predicted. In this case, the risk is related to the confidence level of the prediction and the consequences of delayed maintenance. Figure 2 illustrates the difference between risk-based and risk-informed approaches.

For the risk-based predictive maintenance approach, the risk is related to two criteria: model accuracy and predictive horizon. The predictive horizon is the threshold for the prediction of the lead time to failure as desired by the ISO 13381 standard [45]. The predictive horizon is an important criterion as it is related to the consequence of delaying maintenance and the associated levels of damage and service that are required over time. Moreover, a short predictive horizon offers limited maintenance planning time and might rule out opportunistic maintenance.

In Figure 3, an example illustrates the accuracy of three predictive models across three predictive horizons. “Model A” might look like the best model; however, if the predictive horizon criterion is considered, “Model B” may be preferable, as it offers better accuracy for long-term prediction, which better satisfies maintenance planning needs. Therefore, it is important to classify predictive models by their accuracy and consequence levels, as shown in Figure 3. For example, if a model falls in the low–high class, it should be considered a risky model: it has a low accuracy rate, while the consequence of not detecting and predicting the failure is high.
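The accuracy and consequence classification described above can be sketched as a simple lookup. The sketch below is only an assumption-laden illustration: the accuracy thresholds (50% and 80%), the three-level scale, and the function names are hypothetical and are not specified in this study.

```python
# Hypothetical thresholds separating low, medium and high accuracy (in percent).
LOW_ACCURACY_LIMIT = 50.0
HIGH_ACCURACY_LIMIT = 80.0
CONSEQUENCE_LEVELS = ("low", "medium", "high")

def accuracy_class(accuracy_pct):
    """Map an accuracy percentage to a class using the assumed thresholds."""
    if accuracy_pct < LOW_ACCURACY_LIMIT:
        return "low"
    if accuracy_pct < HIGH_ACCURACY_LIMIT:
        return "medium"
    return "high"

def risk_class(accuracy_pct, consequence):
    """Return an 'accuracy-consequence' label such as 'low-high'."""
    if consequence not in CONSEQUENCE_LEVELS:
        raise ValueError(f"unknown consequence level: {consequence}")
    return f"{accuracy_class(accuracy_pct)}-{consequence}"

def is_risky(accuracy_pct, consequence):
    """Flag the low-accuracy, high-consequence class as risky."""
    return risk_class(accuracy_pct, consequence) == "low-high"

# Hypothetical example: a model with 30% accuracy at a horizon with high consequences.
label = risk_class(30.0, "high")
```

In practice, the thresholds and the set of classes treated as risky would be set by the asset owner according to the criticality classification in use.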

2.4. Demonstration Case

To demonstrate the risk-based predictive maintenance approach, two datasets were selected: a wind turbine dataset [46] and the IMS bearing dataset [47]. The first dataset is an open dataset related to a 2 MW wind turbine operating under real-world conditions, publicly accessible through GitHub [46]. The dataset provides vibration data collected using MEMS-based accelerometers mounted radially on the high-speed shaft bearing. The monitored bearing is subjected to progressive degradation for 50 consecutive days [48]. Data acquisition was performed at fixed 10-min intervals while the turbine operated at a nominal rotational speed of 1800 rpm. A total of 50 data sets were compiled, each data set containing vibration recordings captured at a high sampling rate of 97,656 Hz for 6 s [48]. The stacked time series plot in Figure 4 displays raw vibration waveforms over 50 days, showing fluctuations and a degradation trend (higher impact) towards the end.

The second dataset is widely used as a benchmark for RUL prediction, developed by the Intelligent Maintenance Systems Centre at the University of Cincinnati and hosted by the NASA Prognostic Data Repository. It consists of run-to-failure experiments in which multiple bearings were operated under constant conditions. The horizontal stack vibration signals, shown in Figure 5, illustrate the vibration signals over the entire lifetime, reflecting the degradation evolution of the bearing from healthy to faulty conditions.

Five predictive models are used to demonstrate the risk-based predictive maintenance approach: ARIMA, Facebook Prophet, LSTM, GRU, and PatchTST. An autoencoder is applied to compress the data and extract features, which are then fed into all considered models. Three experiments were conducted to evaluate the performance of the models on different predictive horizons: short, medium, and long. These horizons were implemented by varying the training-to-testing data ratios as follows: (1) the long-term horizon used a 20–30 day training-testing split, (2) the medium-term horizon used a 40–10 day split, and (3) the short-term horizon used a 47–3 day split. Several studies [49,50,51] show that the performance of the classifiers is affected by varying the training-testing split ratio.
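The three horizon configurations can be expressed as chronological splits of the 50-day series. The following is a minimal sketch; the dictionary and function names are hypothetical, and the placeholder array stands in for the daily features extracted from the vibration data.

```python
import numpy as np

# (training days, testing days) for each predictive horizon, as described above.
HORIZON_SPLITS = {"long": (20, 30), "medium": (40, 10), "short": (47, 3)}

def split_by_horizon(series, horizon):
    """Split a daily series chronologically: first days for training, last days for testing."""
    train_days, test_days = HORIZON_SPLITS[horizon]
    if len(series) != train_days + test_days:
        raise ValueError("series length must equal training days plus testing days")
    return series[:train_days], series[train_days:]

# Placeholder for one feature value per day over 50 days (not real data).
daily_feature = np.arange(50, dtype=float)

splits = {name: split_by_horizon(daily_feature, name) for name in HORIZON_SPLITS}
```

The split is chronological rather than random, so the test segment always covers the final days before failure and no future observations leak into training.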

In this study, these three configurations are grounded in practical assumptions about fault progression and associated service levels. In the long-term horizon, 30 days before damage, it is assumed that the bearing begins to develop early-stage faults, which can be addressed with relatively low service intervention. In the medium-term horizon, 10 days before damage, the fault is presumed to have propagated, affecting adjacent components and requiring a higher level of service. In the short-term horizon, representing the final 3 days before system failure, it is assumed that the damage has reached a critical threshold, compromising multiple components and resulting in full system degradation. The applied hyperparameters and their optimisation ranges are shown in Table 1.

3. Results and Discussion

The results relate to testing five different RUL predictive models over three different predictive horizons and on two different datasets. First, the RUL predictions of the five compared models are shown, and the effect of the predictive horizon is discussed. Second, the RUL predictive adequacy based on several metrics is illustrated and discussed. Third, new metrics, “Shape RMSE” and Prediction Horizon Error Profile (PHEP), are introduced and estimated. Fourth, the sensitivity related to model hyperparameter settings and optimisation ranges is analysed. Fifth, the sensitivity related to the predictive horizon is analysed and discussed. Sixth, the risk matrix is built, and the classification of the RUL predictive algorithms is illustrated. Finally, the comparison results of the multi-dataset test are illustrated.

3.1. Accuracy over Multi Predictive Horizons

The accuracy results for dataset 1 across three different horizons for the five selected RUL predictive algorithms are summarised in Table 2. It can be observed that the accuracy of all models improves as the training period increases and the predictive horizon shortens. Judged on accuracy alone, the ARIMA model would be selected, as it has the highest accuracy score on the short predictive horizon, 90.63%. However, a prediction made three days before damage gives the maintenance department little time to react, and the level of damage is already high. Therefore, the maintenance engineer might focus more on the long predictive horizon and select the Facebook Prophet model, which scored 72.67% accuracy. It should be noted that the near straight-line forecasts observed for the neural network and transformer models likely stem from limited data and insufficient optimization, as these models typically require richer datasets and careful tuning. Therefore, their behavior in this study should not be taken as evidence of ineffectiveness for RUL prediction.

In the long predictive horizon case in Figure 6, all the predictive models studied show a low level of accuracy. Given the short training time, it is challenging to predict such a long time frame. The capability of each model is generally low, as the models have not been able to mimic the actual degradation, either in values or in trend. As illustrated in Figure 7, RUL predictions were more accurate under the medium predictive horizon than under the long-term horizon. However, the highest prediction accuracy was achieved within the short predictive horizon, shown in Figure 8. Moreover, the ARIMA model prediction curve in Figure 8 shows that it did not effectively capture the direction and shape of the actual degradation curve, even though it achieves the highest accuracy among all models. This indicates that accuracy should not be the only metric used to evaluate RUL predictions.

3.2. RUL Predictive Adequacy

The accuracy, which is based on the mean absolute percentage error (MAPE), expresses the prediction errors as a percentage of the actual RUL values. As observed in Table 2, several models have a poor prediction curve in terms of shape and direction (compared to the actual degradation curve), although they have an acceptable or high accuracy rate. This can also be observed in the coefficient of determination (R²), which indicates how well the predicted RUL values explain the variance in the actual RUL values. As shown in Table 3, the ARIMA model has an accuracy of 90.63% but a negative R² score. This can also be seen in the ARIMA curve in Figure 8. Although the ARIMA prediction has low error, it has high variance. In vibration-based predictive maintenance, it is not uncommon to observe a situation in which a model shows high prediction “accuracy” based on MAPE yet yields a negative R² score.

This apparent contradiction often arises from the nature of vibration data, which typically remain stable at low levels for extended periods before suddenly rising as a fault develops. In such cases, a simplistic model might predict a constant low value throughout. Because this prediction aligns closely with most of the early, stable data, the Mean Absolute Percentage Error (MAPE) appears low, suggesting high accuracy. However, this same model may completely fail to capture the actual degradation trend or abrupt changes leading to failure. As a result, R², which measures how well the model explains variance, can become negative, indicating that the model performs worse than simply predicting the average. This highlights the risk of relying on a single metric such as MAPE and underscores the importance of evaluating both trend fidelity (via R²) and magnitude accuracy (via MAPE).
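A small numerical example makes this effect concrete. The values below are hypothetical and chosen only for illustration: a monitored indicator stays nearly flat and then jumps at the last step, while a simplistic model predicts a constant value throughout.

```python
import numpy as np

# Hypothetical indicator: stable for five steps, then a sudden rise at the end.
actual = np.array([100.0, 101.0, 102.0, 103.0, 104.0, 130.0])
# Simplistic model: predicts the initial level at every step.
predicted = np.full_like(actual, 100.0)

mape = 100.0 * np.mean(np.abs((predicted - actual) / actual))
accuracy = 100.0 - mape

ss_res = np.sum((actual - predicted) ** 2)       # squared error of the constant prediction
ss_tot = np.sum((actual - actual.mean()) ** 2)   # squared error of predicting the mean
r2 = 1.0 - ss_res / ss_tot
```

Under these assumed values, the MAPE-based accuracy is roughly 94.5%, while R² is about −0.40: the constant prediction is close to most points in percentage terms but explains none of the variance and misses the final rise entirely.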

To assess the prediction adequacy, it is important to adopt a multi-metric evaluation strategy. This includes using different metrics to capture various aspects of performance, such as magnitude accuracy and trend fidelity. In a multi-metric evaluation strategy, assigning weights to different performance metrics makes it possible to reflect the real-world priorities of the application. By weighting metrics such as MAPE, RMSE and R² according to their relevance, a composite score can be created. These weights should be based on expert judgment or business impact, or optimized through experimentation. Ultimately, this approach transforms model evaluation from a purely statistical exercise into a decision-driven process.
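One possible way to form such a composite score is sketched below. It is not the scoring scheme of this study: the weights, the scaling of each metric to the range 0 to 1, the `rmse_scale` parameter, and the function name are all assumptions made for illustration.

```python
def composite_score(mape_pct, rmse, r2, rmse_scale, weights=(0.4, 0.3, 0.3)):
    """Weighted score in [0, 1]; higher is better.

    Assumptions: MAPE and RMSE are converted to 'goodness' values by
    subtracting their scaled value from 1, and negative R^2 is clipped to 0.
    rmse_scale is a user-chosen RMSE level that counts as fully unacceptable.
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")

    def clip01(value):
        return max(0.0, min(1.0, value))

    w_mape, w_rmse, w_r2 = weights
    s_mape = clip01(1.0 - mape_pct / 100.0)   # magnitude accuracy in relative terms
    s_rmse = clip01(1.0 - rmse / rmse_scale)  # magnitude accuracy in absolute terms
    s_r2 = clip01(r2)                         # trend fidelity
    return w_mape * s_mape + w_rmse * s_rmse + w_r2 * s_r2

# Hypothetical model: MAPE of 10%, RMSE of 2 days (scale 10 days), negative R^2.
score = composite_score(10.0, 2.0, -0.5, rmse_scale=10.0)
```

With these assumed weights, a model with high MAPE-based accuracy but negative R² loses the entire trend-fidelity share of the score, which reflects the multi-metric reasoning described above.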

Predictive adequacy, however, refers to the extent to which a prediction model provides sufficiently reliable, accurate, trend-consistent, and decision-relevant estimates for the intended maintenance horizon. To achieve that, more metrics are required to capture the changes over several predictive horizons.

3.3. RUL Predictive Adequacy Needs More Metrics

The adequacy of RUL prediction models is still evaluated mainly using global metrics such as accuracy and R². These metrics are beneficial because accuracy provides an intuitive indication of overall prediction closeness, while R² reflects how well the model explains the variance in the degradation or RUL trend. However, metrics that assess prediction performance across multiple forecasting horizons and the similarity of the predicted trajectory shape remain limited. Consequently, current evaluation practices may not fully capture whether a model is suitable for maintenance decisions that depend on both the timing of the prediction and the realism of the predicted degradation path. In this context, the proposed approach supplements conventional metrics with Shape RMSE and Prediction Horizon Error Profile (PHEP). Shape RMSE is intended to evaluate the similarity between the predicted and actual degradation trajectories. This combination enables a more comprehensive assessment of RUL predictive adequacy beyond overall accuracy and goodness-of-fit alone.

To complement global accuracy and goodness-of-fit metrics, Shape RMSE is proposed to quantify the similarity between the predicted and actual RUL trajectories. Unlike conventional RMSE, this metric is computed on normalized series so that it reflects the agreement in shape and trend rather than absolute magnitude. It is defined as:

$$\text{Shape-RMSE} = \sqrt{\frac{1}{N} \sum_{t = 1}^{N} \left(\hat{y}_{t}^{norm} - y_{t}^{norm}\right)^{2}} \tag{6}$$

where

$$y_{t}^{norm} = \frac{y_{t} - \min(y)}{\max(y) - \min(y)}, \qquad \hat{y}_{t}^{norm} = \frac{\hat{y}_{t} - \min(\hat{y})}{\max(\hat{y}) - \min(\hat{y})} \tag{7}$$

with $y_{t}$ representing the actual RUL value, $\hat{y}_{t}$ the predicted RUL value, and $N$ the number of evaluated time steps. Smaller values of Shape-RMSE indicate that the predicted curve better follows the actual degradation pattern.
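The definition in Equations (6) and (7) can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the study; the function names `min_max` and `shape_rmse`, the handling of constant series, and the example series are assumptions.

```python
import math


def min_max(series):
    """Min-max normalise a series to [0, 1], as in Equation (7)."""
    lo, hi = min(series), max(series)
    if hi == lo:
        # Assumption: a constant series has no shape to compare.
        raise ValueError("constant series cannot be min-max normalised")
    return [(value - lo) / (hi - lo) for value in series]


def shape_rmse(actual, predicted):
    """RMSE between the normalised actual and predicted series, Equation (6)."""
    if len(actual) != len(predicted):
        raise ValueError("series must have the same length")
    y_norm = min_max(actual)
    y_hat_norm = min_max(predicted)
    squared = [(p - a) ** 2 for a, p in zip(y_norm, y_hat_norm)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical series: the second has the same shape at twice the magnitude,
# the third reaches the same end points but along a different path.
actual = [30, 20, 10, 0]
same_shape = [60, 40, 20, 0]
other_shape = [30, 29, 28, 0]

same = shape_rmse(actual, same_shape)
other = shape_rmse(actual, other_shape)
```

Because each series is normalised by its own minimum and maximum, a prediction that is wrong in absolute magnitude but correct in shape scores zero, which is why Shape-RMSE is meant to complement magnitude-based metrics rather than replace them.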

Since lower Shape RMSE indicates better agreement with the actual degradation shape, the lowest value is considered the best. Accordingly, Prophet showed the best shape matching in the long and medium horizons (0.1015 and 0.1734), while PatchTST performed best in the short horizon (0.0540), as summarised in Table 4. Although ARIMA delivered the highest accuracy in the short predictive horizon, Shape RMSE indicated that Prophet better preserved the degradation trajectory shape, consistent with the visual comparison of the predicted and actual trends. However, although PatchTST obtained the lowest Shape RMSE in the short horizon, visual inspection indicates that Prophet follows the actual degradation trajectory more closely. This suggests that Shape RMSE alone may not fully represent shape fidelity in very short prediction windows.

3.4. Sensitivity of RUL Prediction

Predictive models, particularly those based on machine learning, often rely on sensitive hyperparameters and are influenced by randomness during training. These factors can cause performance to vary significantly across different runs, even when using the same data. For example, in a neural network used to predict the remaining useful life (RUL), the initial weights are set randomly. If these weights are too high or too low, the network may converge to very different solutions. Similarly, changing the learning rate from 0.001 to 0.01 can lead to faster convergence or to complete instability. Randomness also plays an important role in data handling. With random splits of the training and test data, a favourable split may give 90% accuracy, whereas a different split may give only 70%. If stochastic gradient descent is used in models such as LSTM, the random order of the data points can also affect learning unless the random seed is fixed.
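The effect of a random split can be made concrete with a small experiment. The sketch below is illustrative only: it uses synthetic data and an ordinary least-squares line instead of the models evaluated in this study, and the function name `split_score` and all parameter values are assumptions. It shows that the same seed reproduces the same score, while different seeds give different scores on identical settings.

```python
import random
import statistics


def split_score(seed, n=60, test_fraction=0.3):
    """Test MAPE (%) of a least-squares line under one random split."""
    rng = random.Random(seed)  # fixed seed -> reproducible run
    # Synthetic, noisy, linearly decreasing RUL-like series (illustrative).
    data = [(t, 100.0 - t + rng.gauss(0.0, 5.0)) for t in range(n)]
    rng.shuffle(data)  # random train/test split
    n_test = int(n * test_fraction)
    test, train = data[:n_test], data[n_test:]

    # Ordinary least-squares fit on the training part.
    mean_t = statistics.fmean(t for t, _ in train)
    mean_y = statistics.fmean(y for _, y in train)
    slope = (sum((t - mean_t) * (y - mean_y) for t, y in train)
             / sum((t - mean_t) ** 2 for t, _ in train))
    intercept = mean_y - slope * mean_t

    # Mean absolute percentage error on the held-out part.
    return 100.0 * statistics.fmean(
        abs((y - (intercept + slope * t)) / y) for t, y in test
    )


scores = [split_score(seed) for seed in range(5)]
spread = max(scores) - min(scores)  # run-to-run variation across seeds
```

Reporting the spread over several seeds, rather than a single run, gives a more honest picture of how much of a reported difference between models is due to chance.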

The variation of prediction error (MAPE) with respect to the number of training days for different forecasting models is analysed, as shown in Table 5 and illustrated in Figure 9. Several key observations can be drawn. The results indicate that increasing the number of training days generally improves prediction accuracy by reducing MAPE. Prophet shows the most consistent reduction in error, while ARIMA achieves low errors but with higher fluctuation. LSTM, GRU, and PatchTST perform poorly at shorter training windows and improve only when more training data are available. This suggests that statistical models are more robust in data-limited settings, whereas neural and transformer models are more data-dependent. Overall, the figure highlights that model performance varies significantly across training configurations, and that increasing data does not guarantee consistent improvement. These findings emphasize the importance of evaluating models across multiple scenarios and support the need for a multi-horizon, risk-based evaluation approach for reliable predictive maintenance decision making.

In RUL forecasting, model performance can vary considerably depending on both the amount of historical data available for training and the length of the predictive horizon; therefore, evaluating models under different training windows and horizon settings is necessary to obtain a more reliable assessment of predictive adequacy. Instead of reporting errors at only one selected training window, Table 6 summarises the range of MAPE values within ±3 days around key training lengths (25, 40, and 45 days). This sensitivity analysis assesses how RUL prediction performance at each horizon varies within a ±3-day neighbourhood around the selected training window, thereby revealing the local stability and robustness of each model to small changes in training-window length. The results show clear differences in robustness across models. In general, the error range is broader under long predictive horizons and becomes narrower as the horizon shortens, suggesting reduced uncertainty and improved stability in short-horizon RUL prediction. At a range of 42–48 training days, Prophet remains the most stable and accurate overall (7.75–14.46). In contrast, LSTM, GRU, and PatchTST continue to show higher and wider error ranges, implying stronger dependence on larger datasets and/or further optimization.

To compare how sensitive each model’s prediction error is to small changes in the training window, a boxplot was included, as shown in Figure 10. Sensitivity is expressed as the change in MAPE (ΔMAPE) when the training window is shifted within a ±3-day range around the selected reference point. A narrower box and a median close to zero indicate greater robustness to small changes in training-window length. Prophet exhibits the most stable behavior, with a compact interquartile range and central tendency near zero, whereas ARIMA shows the widest spread and the most extreme outliers, indicating the highest sensitivity. LSTM, GRU, and PatchTST demonstrate intermediate behavior, with moderate spreads and some outlying cases. Overall, the results suggest that Prophet is the most robust model to local training-window variation, while ARIMA is the most sensitive.
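The ±3-day analysis described above amounts to a simple computation once MAPE has been evaluated for each training-window length. The following sketch assumes a dictionary that maps training-window length (in days) to MAPE; the function name `window_sensitivity` and the numerical values are hypothetical and are not results from Table 6 or Figure 10.

```python
def window_sensitivity(mape_by_window, reference, half_width=3):
    """MAPE range and ΔMAPE within ±half_width days of a reference window."""
    ref_mape = mape_by_window[reference]
    # Keep only the windows inside the neighbourhood of the reference.
    nearby = {
        window: mape
        for window, mape in mape_by_window.items()
        if abs(window - reference) <= half_width
    }
    # ΔMAPE: change in error when the window is shifted from the reference.
    deltas = {
        window: mape - ref_mape
        for window, mape in nearby.items()
        if window != reference
    }
    return {"min": min(nearby.values()),
            "max": max(nearby.values()),
            "deltas": deltas}


# Hypothetical MAPE values (%) per training-window length in days.
mape_by_window = {42: 14.0, 43: 12.5, 44: 11.0, 45: 10.0,
                  46: 9.5, 47: 9.0, 48: 8.0, 50: 7.0}
result = window_sensitivity(mape_by_window, reference=45)
```

The `min` and `max` entries correspond to the kind of range reported per model and horizon, while the `deltas` values are the quantities that a boxplot of ΔMAPE would summarise.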

In maintenance practice, long-horizon predictions are important because they provide sufficient time for planning, resource allocation, and intervention preparation. At the same time, short-horizon predictions remain critical because the consequences of inaccurate forecasts become much more severe as failure approaches. Therefore, the Prediction Horizon Error Profile (PHEP) is needed to reveal how prediction error evolves across the RUL horizon, so that model adequacy can be judged not only by overall accuracy, but also by when the model performs well or poorly. Figure 11 presents the PHEP of the forecasting models, showing how MAPE changes with days remaining before failure. Since PHEP is expressed here in terms of MAPE, lower values indicate better predictive performance and higher values indicate greater error propagation. While most models begin to show visible error growth in the medium horizon and strong propagation in the short horizon, Prophet maintains a relatively flat and stable error profile. This suggests that Prophet is better able to preserve prediction reliability across the full RUL horizon, including the consequence-critical period close to failure.

Although one may expect prediction error to decrease as the forecasting horizon shortens, this is not always observed in RUL forecasting. Near failure, degradation often becomes more abrupt and nonlinear, which can make short-horizon prediction more difficult and cause renewed error growth. The presence of two peaks in Figure 11 indicates that prediction error does not increase uniformly toward failure, but rather rises at two distinct stages of the RUL horizon. The first peak likely reflects a transition region where the degradation pattern begins to change and the models start losing stability. The second peak, which is usually stronger and closer to failure, reflects the critical stage where degradation becomes steeper, more nonlinear, or more irregular, making accurate prediction much harder. These results show that some models exhibit increasing error and instability near failure, highlighting the importance of horizon-dependent evaluation in risk-based predictive maintenance.

The Prediction Horizon Error Profile (PHEP) is therefore introduced to evaluate how prediction error changes across the remaining useful life horizon. Unlike aggregate error metrics, PHEP provides a horizon-wise view of model performance by computing the prediction error at each time-to-failure point. It is defined as:

$$\mathrm{PHEP}(h) = \frac{1}{N_{h}} \sum_{i = 1}^{N_{h}} \left| \frac{y_{i, h} - \hat{y}_{i, h}}{y_{i, h}} \right| \times 100 \tag{8}$$

where $y_{i, h}$ and $\hat{y}_{i, h}$ denote the actual and predicted RUL values, respectively, for instance $i$ at horizon $h$, and $N_{h}$ is the number of prediction instances available at that horizon. Lower values of PHEP indicate better predictive performance, while higher values reveal stronger error propagation at the corresponding prediction horizon.
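Equation (8) can be read as a MAPE computed separately for each horizon. A minimal sketch is given below; the function name `phep`, the record layout `(horizon, actual, predicted)`, the decision to skip instances with zero actual RUL, and the example values are assumptions made for illustration.

```python
from collections import defaultdict


def phep(records):
    """Horizon-wise MAPE (%) following Equation (8).

    records: iterable of (horizon, actual_rul, predicted_rul) tuples.
    """
    errors = defaultdict(list)
    for horizon, actual, predicted in records:
        if actual == 0:
            # Assumption: the percentage error is undefined for zero
            # actual RUL, so such instances are skipped.
            continue
        errors[horizon].append(abs((actual - predicted) / actual) * 100.0)
    # Average the absolute percentage errors available at each horizon.
    return {h: sum(e) / len(e) for h, e in sorted(errors.items())}


# Hypothetical prediction instances at 30, 10, and 3 days before failure.
records = [
    (30, 30.0, 24.0),
    (30, 30.0, 33.0),
    (10, 10.0, 8.0),
    (3, 3.0, 1.5),
]
profile = phep(records)
```

Plotting the resulting dictionary against the horizon gives an error profile of the kind shown in Figure 11, with one value per days-to-failure point.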

3.5. Risk-Based Predictive Maintenance

The risk of a predictive RUL model is the likelihood that prediction errors lead to maintenance actions that are either too late or too early, causing operational, economic, or safety consequences. Thus, risk is not only “how wrong the model is,” but also how costly or dangerous that error is. In RUL prediction, the consequence should be evaluated by asking: what happens if the model is wrong by this amount at this horizon? If the model overestimates the remaining life, maintenance may be delayed and the asset may fail before intervention. If the model underestimates the remaining life, maintenance may be performed too early, causing unnecessary replacement, wasted component life, and avoidable operational cost. A practical way to evaluate consequences is to classify the effects of prediction error into a few impact dimensions, such as safety, production loss, maintenance cost, spare-parts waste, and environmental or service impact. Each dimension can then be scored on a simple scale from 1 to 3, where 1 indicates low consequence, 2 indicates medium consequence, and 3 indicates high or severe consequence. The overall consequence can then be estimated by combining these scores, either equally or using weights that reflect the operational context.
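The consequence scoring described above can be sketched as a weighted average of the 1 to 3 scores. The dimension names follow the text, but the function name `consequence_score`, the individual scores, and the weights are hypothetical values chosen only to illustrate the calculation.

```python
def consequence_score(scores, weights=None):
    """Weighted average of 1-3 consequence scores across impact dimensions."""
    for name, value in scores.items():
        if value not in (1, 2, 3):
            raise ValueError(f"{name}: score must be 1, 2, or 3")
    if weights is None:
        # Equal weighting when no operational context is specified.
        weights = {name: 1.0 for name in scores}
    total_weight = sum(weights[name] for name in scores)
    return sum(weights[name] * scores[name] for name in scores) / total_weight


# Hypothetical scores for one prediction-error scenario.
scores = {
    "safety": 3,
    "production_loss": 2,
    "maintenance_cost": 2,
    "spare_parts_waste": 1,
    "environmental_or_service_impact": 1,
}
equal = consequence_score(scores)

# Hypothetical weights that emphasise safety.
weights = {
    "safety": 0.4,
    "production_loss": 0.15,
    "maintenance_cost": 0.15,
    "spare_parts_waste": 0.15,
    "environmental_or_service_impact": 0.15,
}
weighted = consequence_score(scores, weights)
```

With these illustrative numbers, emphasising safety raises the overall consequence from 1.8 to 2.1, which shows how the weighting can move a scenario toward a higher consequence class.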

A practical engineering estimate is to assign increasing consequence levels to the three prediction intervals. For a wind turbine, prediction errors occurring 30 days before failure may be associated with low consequence because there is still sufficient time for verification and maintenance planning. Errors occurring 10 days before failure may be assigned medium consequence, as the available response window becomes limited and the likelihood of operational disruption increases. Errors occurring 3 days before failure should be considered high consequence, since wrong maintenance decisions at this stage may directly lead to unplanned downtime, emergency repair, and loss of production. In this framework, consequence is evaluated for two main categories of predictive error: undetected events and false prediction events. The first corresponds to missed or late detection of impending failure, which may lead to severe operational and safety consequences. The second corresponds to erroneous or misleading predictions that trigger unnecessary or mistimed maintenance actions. Hence, the consequence measure reflects both failure-related losses and the cost of incorrect preventive actions.

Accuracy was categorised into three levels based on asset manager experience: high (>95%), medium (80–95%), and low (<80%). These levels were used to distinguish between strongly reliable, moderately reliable, and weakly reliable prediction performance across the different failure horizons. These thresholds were used as a practical classification scheme for comparative interpretation within the proposed risk matrix, rather than as universal accuracy standards for all RUL applications. Table 7 presents a risk matrix for classifying RUL prediction algorithms according to both their prediction accuracy and the consequence associated with the prediction horizon. The rows represent the algorithm performance levels, while the columns represent the severity of consequence at different time intervals before failure, namely 30 days, 10 days, and 3 days. The matrix highlights that model usefulness is not determined by accuracy alone, but also by how early reliable predictions can be provided. Algorithms showing acceptable performance at earlier horizons exhibit greater early-warning potential and may support consequence reduction through earlier maintenance planning. In contrast, algorithms that achieve strong performance only close to failure remain associated with higher operational consequences due to the limited response time available. It should also be noted that, at the current stage, none of the evaluated algorithms provides high prediction accuracy at the early stage (30 days before failure). Although several models show some early-warning capability, their performance at this horizon remains within the low-accuracy range.
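The classification logic behind such a risk matrix can be sketched as follows. The accuracy thresholds and the horizon-to-consequence mapping are those stated in the text; the function names and the treatment of values exactly at 80% and 95% as medium are assumptions. The example uses the Prophet accuracies reported for Dataset 1 in this paper (72.67%, 66.12%, and 89.81% at the long, medium, and short horizons).

```python
# Consequence level assigned to each prediction horizon (days before failure).
CONSEQUENCE_BY_HORIZON_DAYS = {30: "low", 10: "medium", 3: "high"}


def accuracy_level(accuracy_pct):
    """Classify accuracy: high (>95%), medium (80-95%), low (<80%)."""
    if accuracy_pct > 95.0:
        return "high"
    if accuracy_pct >= 80.0:
        return "medium"
    return "low"


def risk_cell(accuracy_pct, horizon_days):
    """Return the (accuracy level, consequence level) cell of the matrix."""
    return (accuracy_level(accuracy_pct),
            CONSEQUENCE_BY_HORIZON_DAYS[horizon_days])


# Prophet on Dataset 1, using the accuracies reported in this paper.
prophet_dataset_1 = {30: 72.67, 10: 66.12, 3: 89.81}
cells = {h: risk_cell(acc, h) for h, acc in prophet_dataset_1.items()}
# cells[30] == ("low", "low")
# cells[10] == ("low", "medium")
# cells[3] == ("medium", "high")
```

The result is consistent with the observation that no model reaches the high-accuracy class at the 30-day horizon for this dataset.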

Among the evaluated models, Prophet appears to be the least risky algorithm due to its consistently superior performance across all horizons, whereas PatchTST is the most risky because it remains in the low-accuracy region under all consequence levels. ARIMA and GRU become relatively less risky in the short horizon, while LSTM and PatchTST remain less suitable for consequence-critical decisions. However, the probability of wrong prediction should not be defined by a single performance metric such as accuracy, because no individual indicator is sufficient to capture the different dimensions of predictive reliability in Remaining Useful Life forecasting. For example, a model may achieve acceptable numerical accuracy while failing to reproduce the degradation trend, or it may perform well overall but remain unreliable at critical decision horizons. Therefore, the probability of wrong prediction is interpreted as a composite measure of predictive inadequacy, informed jointly by pointwise accuracy, goodness-of-fit, shape similarity, and horizon-specific error behavior. Accordingly, the probability of wrong prediction is expressed as:

$$P_{wrong} = f\left(E_{acc}, E_{R^{2}}, E_{ShapeRMSE}, E_{PHEP}\right) \tag{9}$$

where $E_{acc}$ denotes normalized inaccuracy derived from conventional error metrics, $E_{R^{2}}$ denotes the penalty associated with poor explanatory fit, $E_{ShapeRMSE}$ denotes the penalty associated with low shape similarity, and $E_{PHEP}$ denotes the penalty associated with poor horizon-wise predictive behavior. Since the true statistical probability of wrong prediction is difficult to estimate directly, it is approximated here through a composite predictive unreliability index:

$$P_{wrong} \approx w_{1} E_{acc} + w_{2} E_{R^{2}} + w_{3} E_{ShapeRMSE} + w_{4} E_{PHEP} \tag{10}$$

subject to

$$\sum_{i = 1}^{4} w_{i} = 1 \tag{11}$$

where $w_{i}$ represents the relative importance of each performance dimension in the maintenance decision context. In conclusion, this risk matrix suggests that RUL prediction for this dataset remains challenging, especially at longer horizons. Therefore, it is important to perform multi-dataset testing.
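The composite index in Equations (10) and (11) is a weighted sum of four penalties. The sketch below assumes that each penalty has already been normalised to [0, 1]; how that normalisation is done is not specified here, and the function name `p_wrong`, the penalty values, and the weights are hypothetical.

```python
import math


def p_wrong(penalties, weights):
    """Composite predictive unreliability index, Equations (10) and (11)."""
    if set(penalties) != set(weights):
        raise ValueError("penalties and weights must use the same keys")
    if not math.isclose(sum(weights.values()), 1.0):
        raise ValueError("weights must sum to 1 (Equation (11))")
    for name, value in penalties.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name}: penalty must lie in [0, 1]")
    # Weighted sum of the four penalty terms.
    return sum(weights[name] * penalties[name] for name in penalties)


# Hypothetical normalised penalties and weights for one model.
penalties = {"acc": 0.2, "r2": 0.1, "shape_rmse": 0.3, "phep": 0.4}
weights = {"acc": 0.4, "r2": 0.2, "shape_rmse": 0.2, "phep": 0.2}
index = p_wrong(penalties, weights)
```

Because the penalties lie in [0, 1] and the weights sum to 1, the index also lies in [0, 1], which makes it convenient to combine with a consequence score in a risk matrix.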

3.6. Multiple Dataset Testing

Multiple-dataset testing is used to assess the robustness and generalizability of a prediction model across different datasets, operating conditions, or degradation patterns. Instead of relying on results from a single dataset, this approach examines whether the observed performance trends remain consistent under varying data characteristics. The comparison across the two datasets (Dataset 1 and 2), summarised in Table 8, shows that model performance is strongly dataset- and horizon-dependent. In both datasets, accuracy generally improves as the predictive horizon shortens, confirming that short-horizon RUL prediction is easier than long-horizon forecasting. Prophet is the most consistent model overall, achieving the best long- and medium-horizon accuracies in both Dataset 1 (72.67% and 66.12%) and Dataset 2 (86.62% and 81.03%). At the short horizon, ARIMA performs best in both datasets, reaching 90.63% in Dataset 1 and 98.25% in Dataset 2, while Prophet remains highly competitive with 89.81% and 96.76%, respectively. Dataset 2 generally yields higher accuracies than Dataset 1 for ARIMA, GRU, LSTM, and Prophet, suggesting that it may contain a more learnable degradation structure. In contrast, PatchTST performs inconsistently across the two datasets and shows the weakest generalization, particularly in Dataset 1, where accuracy drops to 18.13% and 67.64% in the medium and short horizons, respectively. Overall, the two-dataset evaluation confirms that model adequacy cannot be judged from a single dataset alone, as both robustness and ranking may change depending on the data characteristics.

Compared with Dataset 1, Dataset 2 shows a more pronounced end-of-life escalation, with a clearer transition from low-amplitude stable behavior to highly volatile and amplified vibration responses near failure. This indicates a stronger degradation signature and a more visible failure progression. Dataset 2 appears to be more learnable, as it exhibits a stronger and more continuous degradation signature. However, this interpretation should be made cautiously, since model performance still depends on the forecasting horizon and the model architecture.

The pyramid plot, Figure 12, confirms three main patterns: accuracy rises toward shorter horizons, Dataset 2 is easier to predict than Dataset 1, and Prophet is the most consistent model overall. ARIMA performs especially well at the short horizon, whereas PatchTST remains the weakest model across most conditions.

For Dataset 2, the risk matrix in Table 9 shows that model reliability improves markedly as the prediction horizon shortens. Prophet is the most consistent model across all horizons, ARIMA is strongest at the short horizon, and PatchTST remains the weakest and most risky model throughout. A notable result is that several models change risk class between Dataset 1 and Dataset 2. This demonstrates that the risk profile of an RUL prediction algorithm is not intrinsic to the model alone, but is strongly affected by dataset characteristics. Differences in degradation clarity, signal variability, and learnable temporal structure can shift a model from low-accuracy to medium- or high-accuracy regions, or the reverse. Therefore, risk-based assessment should be performed across multiple datasets to avoid dataset-specific conclusions and to provide a more reliable judgment of model adequacy.

4. Conclusions

The results indicate that the accuracy of the model tends to increase as the predictive horizon shortens, largely due to the greater availability of training data and the reduced complexity of forecasting. However, this gain in accuracy comes at a cost: short-term predictions, though more precise, provide limited usefulness for proactive maintenance planning.

The findings support a risk-based interpretation of predictive model performance. ARIMA demonstrated the highest accuracy in the short predictive horizon for both datasets, making it a strong candidate for near-failure detection and immediate reactive or just-in-time maintenance decisions. In contrast, the Facebook Prophet model, while less accurate than ARIMA in the short horizon, performed relatively better in the long predictive horizon, which is more suitable for strategic planning and early intervention. By incorporating model accuracy at each horizon into a risk matrix, we can align model selection with the potential consequences of predictive errors. This approach helps identify which models are most suitable for specific decision windows. A risk-based approach to model evaluation involves evaluating each predictive model not just by its overall accuracy, but by how well it performs across different forecasting horizons: short, medium, and long term. Each horizon carries distinct operational implications and levels of risk. A recommended evaluation flowchart for the predictive model is illustrated in Figure 13, which indicates the need to perform accuracy tests, single-horizon reliability tests, multi-horizon reliability tests, and multi-dataset testing.

It is recommended to use a risk matrix to evaluate each predictive model during the project phase, prior to deployment in operational settings. Rather than selecting a single model, combining multiple models may be beneficial to enhance prediction reliability. However, conducting a thorough risk assessment during the project phase is essential to understand expected performance, validate whether the models meet operational needs, and ensure informed decision making before implementation.

Furthermore, the study emphasizes the importance of understanding horizon-dependent model performance. Single-horizon assessments may mask variations in model behavior, whereas multi-horizon evaluations reveal how prediction accuracy and reliability evolve across different time horizons. This perspective provides a clearer understanding of model suitability under varying operational conditions. In this context, metrics such as Shape RMSE can be used to quantify the similarity between the predicted and actual RUL trajectories and offer better insight regarding the shape prediction, enabling a more comprehensive evaluation of reliability.

The selection of prediction horizons in this study (3, 10, and 30 days) was guided by practical considerations derived from asset management practices in a wind farm context, where different time horizons correspond to varying levels of operational criticality and maintenance planning needs. Short-term horizons support immediate intervention, medium-term horizons enable maintenance scheduling, and long-term horizons facilitate strategic planning. It is important to note that these intervals are not universal and may vary depending on the type of equipment, degradation characteristics, and operational environment (e.g., onshore versus offshore conditions). Therefore, the proposed multi-horizon evaluation approach should be viewed as a flexible framework, where horizon selection must be adapted to the specific industrial context to ensure meaningful risk-based interpretation. Although RUL predictions are subject to uncertainty, the present work emphasizes horizon-dependent accuracy to analyze model behavior and its implications for maintenance planning during the project phase. Future work will integrate uncertainty quantification to enable fully risk-informed predictive maintenance.

In conclusion, this study contributes to the field of predictive maintenance by first highlighting the limitations of relying solely on aggregate accuracy metrics for evaluating RUL models. Second, it introduces a horizon-dependent evaluation perspective that links predictive performance with maintenance decision requirements. Third, the empirical analysis demonstrates that model suitability varies across prediction horizons, emphasizing that the most accurate model at one horizon may not be optimal for another. These findings support the need for more comprehensive and context-aware evaluation approaches in RUL prediction.

Author Contributions

Conceptualization, I.E.-T., A.U., and W.A.; methodology, I.E.-T. and A.U.; software, I.E.-T., A.U., and W.A.; validation, I.E.-T., A.U., and W.A.; formal analysis, I.E.-T., A.U. and W.A.; investigation, I.E.-T., A.U. and W.A.; resources, I.E.-T.; data curation, I.E.-T., A.U. and W.A.; writing—original draft preparation, I.E.-T., A.U. and W.A.; writing—review and editing, I.E.-T., A.U., and W.A.; visualization, I.E.-T., A.U. and W.A.; supervision, I.E.-T.; project administration, I.E.-T.; funding acquisition, I.E.-T. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data presented in the study are openly available from public repositories referenced in the manuscript. The first dataset, the MathWorks Wind Turbine High-Speed Bearing Prognosis Data Repository, is publicly available from MathWorks [46]. The second dataset, the IMS Bearing Data Set, is publicly available from the NASA Prognostics Data Repository [47].

Conflicts of Interest

The authors declare no conflicts of interest.

References

Xu, T.; Tang, T.; Wang, H.; Yuan, T. Risk-Based Predictive Maintenance for Safety-Critical Systems by Using Probabilistic Inference. Math. Probl. Eng. 2013, 2013, 947104.
Agarwal, V.; Manjunatha, K.A.; Gribok, A.V.; Mortenson, T.J.; Bao, H.; Reese, R.D.; Ulrich, T.A.; Palas, H. Scalable Technologies Achieving Risk-Informed Condition-Based Predictive Maintenance Enhancing the Economic Performance of Operating Nuclear Power Plants; Technical Report; Idaho National Laboratory (INL): Idaho Falls, ID, USA, 2021.
Walker, C.M.; Agarwal, V.; Appiah, R. Development of a Scalable Risk-Informed Predictive Maintenance Cloud-Based Strategy at Nuclear Power Plants; Technical Report; Idaho National Laboratory (INL): Idaho Falls, ID, USA, 2023.
Liao, R.; He, Y.; Feng, T.; Yang, X.; Dai, W.; Zhang, W. Mission reliability-driven risk-based predictive maintenance approach of multistate manufacturing system. Reliab. Eng. Syst. Saf. 2023, 236, 109273.
Fingerhut, F.; Tsiporkova, E.; Boeva, V. Interpretable Data-Driven Risk Assessment in Support of Predictive Maintenance of a Large Portfolio of Industrial Vehicles. In Proceedings of the 2024 IEEE International Conference on Big Data (BigData), Washington, DC, USA, 15–18 December 2024; pp. 2870–2879.
El-Thalji, I. Predictive maintenance (PdM) analysis matrix: A tool to determine technical specifications for PdM ready-equipment. IOP Conf. Ser. Mater. Sci. Eng. 2019, 700, 012033.
Nordal, H.; El-Thalji, I. Assessing the Technical Specifications of Predictive Maintenance: A Case Study of Centrifugal Compressor. Appl. Sci. 2021, 11, 1527.
El-Thalji, I. Emerging Practices in Risk-Based Maintenance Management Driven by Industrial Transitions: Multi-Case Studies and Reflections. Appl. Sci. 2025, 15, 1159.
Nordal, H.; El-Thalji, I. Lifetime Benefit Analysis of Intelligent Maintenance: Simulation Modeling Approach and Industrial Case Study. Appl. Sci. 2021, 11, 3487.
Ferreira, C.; Gonçalves, G. Remaining Useful Life prediction and challenges: A literature review on the use of Machine Learning Methods. J. Manuf. Syst. 2022, 63, 550–562.
Kumar, S.; Raj, K.K.; Cirrincione, M.; Cirrincione, G.; Franzitta, V.; Kumar, R.R. A Comprehensive Review of Remaining Useful Life Estimation Approaches for Rotating Machinery. Energies 2024, 17, 5538.
Li, H.; Zhang, Z.; Li, T.; Si, X. A review on physics-informed data-driven remaining useful life prediction: Challenges and opportunities. Mech. Syst. Signal Process. 2024, 209, 111120.
Qiao, X.; Jauw, V.L.; Seong, L.C.; Banda, T. Advances and Limitations in Machine Learning Approaches Applied to Remaining Useful Life Predictions: A Critical Review. Int. J. Adv. Manuf. Technol. 2024, 133, 4059–4076.
Deutsch, J.; He, M.; He, D. Using Deep Learning-Based Approach to Predict Remaining Useful Life of Rotating Components. IEEE Trans. Syst. Man Cybern. Syst. 2018, 48, 11–20.
Kong, Z.; Cui, Y.; Xia, Z.; Lv, H. Convolution and Long Short-Term Memory Hybrid Deep Neural Networks for Remaining Useful Life Prognostics. Appl. Sci. 2019, 9, 4156.
Wang, B.; Lei, Y.; Yan, T.; Li, N. Multi-Scale Remaining Useful Life Prediction Using Long Short-Term Memory. Sustainability 2022, 14, 15667.
Nie, L.; Xu, S.; Zhang, L.; Yin, Y. Multi-Head Attention Network with Adaptive Feature Selection for RUL Predictions of Gradually Degrading Equipment. Actuators 2023, 12, 158.
Ensarioğlu, K.; İnkaya, T.; Emel, E. Remaining Useful Life Estimation of Turbofan Engines with Deep Learning Using Change-Point Detection Based Labeling and Feature Engineering. Appl. Sci. 2023, 13, 11893.
Xie, Z.; Du, S.; Lv, J.; Deng, Y.; Jia, S. A Hybrid Prognostics Deep Learning Model for Remaining Useful Life Prediction. Electronics 2020, 10, 39.
Wang, H.; Peng, M.; Miao, Z.; Liu, Y.; Ayodeji, A.; Hao, C. Remaining Useful Life Prediction for Aero-Engines Using a Time-Enhanced Multi-Head Self-Attention Model. Aerospace 2023, 10, 80.
Wang, J.; Yan, J.; Li, C.; Gao, R.X.; Zhao, R. Utilizing Multiple Inputs Autoregressive Models for Bearing Remaining Useful Life Prediction. arXiv 2023, arXiv:2311.16192.
Xiong, J.; Fink, O.; Zhou, J.; Ma, Y. Controlled physics-informed data generation for deep learning-based remaining useful life prediction under unseen operation conditions. Mech. Syst. Signal Process. 2023, 197, 110359.
Narang, A.; Anakiya, A.; Singh, K.; Kiran, M.B.; Gaur, H.; Thakur, S. Remaining Useful Life Prediction of Ball Bearings Using Deep Learning Techniques. In Proceedings of the 2024 14th International Conference on Cloud Computing, Data Science & Engineering (Confluence), Noida, India, 18–19 January 2024; pp. 908–913.
Ratre, S.; Jayaraj, J. Sales Prediction Using ARIMA, Facebook’s Prophet and XGBoost Model of Machine Learning. In Proceedings of the Machine Learning, Image Processing, Network Security and Data Sciences; Doriya, R., Soni, B., Shukla, A., Gao, X., Eds.; Springer: Singapore, 2023; pp. 101–111.
Andersen, P.K. Fifty years with the Cox proportional hazards model: History, influence, and future. J. R. Stat. Soc. Ser. Stat. Soc. 2023, 187, 578–579.
Mor, B.; Garhwal, S.; Kumar, A. A Systematic Review of Hidden Markov Models and Their Applications. Arch. Comput. Methods Eng. 2021, 28, 1429–1448.
Zhao, Y.; Liu, Z.; Zhang, H.; Han, Q.; Liu, Y.; Wang, X. On-line condition monitoring for rotor systems based on nonlinear data-driven modelling and model frequency analysis. Nonlinear Dyn. 2024, 112, 5229–5245.
Raissi, M.; Karniadakis, G.E. Hidden physics models: Machine learning of nonlinear partial differential equations. J. Comput. Phys. 2018, 357, 125–141.
Agarwal, V.; Singh, M.; Kumar, K.P. A Comprehensive Review of Linear Regression, Random Forest, XGBoost, and SVR: Integrating Machine Learning and Actuarial Science for Health Insurance Pricing. In Proceedings of the Data Science and Security; Shukla, S., Sayama, H., Kureethara, J.V., Mishra, D.K., Eds.; Springer: Singapore, 2024; pp. 355–367.
Esfahani, Z.; Salahshoor, K.; Farsi, B.; Eicker, U. A New Hybrid Model for RUL Prediction through Machine Learning. J. Fail. Anal. Prev. 2021, 21, 1596–1604.
Das, S.; Tariq, A.; Santos, T.; Kantareddy, S.S.; Banerjee, I. Recurrent Neural Networks (RNNs): Architectures, Training Tricks, and Introduction to Influential Research. In Machine Learning for Brain Disorders; Colliot, O., Ed.; Springer: New York, NY, USA, 2023; pp. 117–138.
Alzubaidi, L.; Zhang, J.; Humaidi, A.J.; Al-Dujaili, A.; Duan, Y.; Al-Shamma, O.; Santamaría, J.; Fadhel, M.A.; Al-Amidie, M.; Farhan, L. Review of Deep Learning: Concepts, CNN Architectures, Challenges, Applications, Future Directions. J. Big Data 2021, 8, 53.
Lara-Benítez, P.; Gallego-Ledesma, L.; Carranza-García, M.; Luna-Romera, J.M. Evaluation of the Transformer Architecture for Univariate Time Series Forecasting. In Proceedings of the Advances in Artificial Intelligence; Alba, E., Luque, G., Chicano, F., Cotta, C., Camacho, D., Ojeda-Aciego, M., Montes, S., Troncoso, A., Riquelme, J., Gil-Merino, R., Eds.; Springer: Cham, Switzerland, 2021; pp. 106–115.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2023, arXiv:1706.03762.
Forootani, A.; Khosravi, M. Synthetic Time Series Forecasting with Transformer Architectures: Extensive Simulation Benchmarks. arXiv 2025, arXiv:2505.20048.
Dintén, R.; Zorrilla, M.; Veloso, B.; Gama, J. Building of transformer-based RUL predictors supported by explainability techniques: Application on real industrial datasets. Inf. Fusion 2026, 127, 103892.
Saleem, U.; Liu, W.; Riaz, S.; Li, W.; Hussain, G.A.; Rashid, Z.; Arfeen, Z.A. TransRUL: A Transformer-Based Multihead Attention Model for Enhanced Prediction of Battery Remaining Useful Life. Energies 2024, 17, 3976.
Wen, K.; Dang, X.; Lyu, K. RNNs are not Transformers (Yet): The Key Bottleneck on In-context Retrieval. arXiv 2024, arXiv:2402.18510.
Garcia, J.; Rios-Colque, L.; Peña, A.; Rojas, L. Condition Monitoring and Predictive Maintenance in Industrial Equipment: An NLP-Assisted Review of Signal Processing, Hybrid Models, and Implementation Challenges. Appl. Sci. 2025, 15, 5465.
Wang, H.; Shi, A. Remaining Useful Life Prediction of Rolling Bearings Based on an Improved U-Net and a Multi-Dimensional Hybrid Gated Attention Mechanism. Appl. Sci. 2025, 15, 7166.
Hao, Z.; Liu, S.; Zhang, Y.; Ying, C.; Feng, Y.; Su, H.; Zhu, J. Physics-Informed Machine Learning: A Survey on Problems, Methods and Applications. arXiv 2023, arXiv:2211.08064.
Popović, S.; Viduka, D.; Bašić, A.; Dimić, V.; Djukic, D.; Nikolić, V.; Stokić, A. Optimization of Artificial Intelligence Algorithm Selection: PIPRECIA-S Model and Multi-Criteria Analysis. Electronics 2025, 14, 562.
Khalifa, M.; Khan, F.; Thorp, J. Risk-based maintenance and remaining life assessment for gas turbines. J. Qual. Maint. Eng. 2015, 21, 100–111.
NORSOK Z-008; Risk Based Maintenance and Consequence Classification. NORSOK: Tingvoll, Norway, 2017.
ISO 13381-1:2025; Condition Monitoring and Diagnostics of Machine Systems—Prognostics—Part 1: General Guidelines and Requirements. International Organization for Standardization: Geneva, Switzerland, 2025. Available online: https://www.iso.org/standard/88029.html (accessed on 14 February 2026).
MathWorks. Wind Turbine High-Speed Bearing Prognosis Data Repository. Available online: https://www.mathworks.com/help/predmaint/ug/wind-turbine-high-speed-bearing-prognosis.html (accessed on 17 February 2025).
Lee, J.; Qiu, H.; Yu, G.; Lin, J.; Rexnord Technical Services. Bearing Data Set. 2007. Available online: www.kaggle.com/datasets/vinayak123tyagi/bearing-dataset/data (accessed on 17 February 2026).
Bechhoefer, E.; van Hecke, B.; He, D. Processing for improved spectral analysis. In Proceedings of the Annual Conference of the Prognostics and Health Management Society, New Orleans, LA, USA, 14–17 October 2013; pp. 1–6.
Bichri, H.; Chergui, A.; Hain, M. Investigating the Impact of Train/Test Split Ratio on the Performance of Pre-Trained Models with Custom Datasets. Int. J. Adv. Comput. Sci. Appl. 2024, 15, 331–339.
Sivakumar, M.; Parthasarathy, S.; Padmapriya, T. Trade-off between training and testing ratio in machine learning for medical image processing. PeerJ Comput. Sci. 2024, 10, e2245.
Radočaj, D.; Jurišić, M.; Plaščak, I.; Galić, L. Randomness in Data Partitioning and Its Impact on Digital Soil Mapping Accuracy: A Comparison of Cross-Validation and Split-Sample Approaches. Agronomy 2025, 15, 2495.

Figure 1. Tree of RUL forecasting models.

Figure 2. Risk-based and risk-informed approaches in predictive maintenance contexts.

Figure 3. Evaluating predictive models in terms of accuracy over several time horizons.

Figure 4. Stacked time series vibration plot of dataset 1.

Figure 5. Stacked time series vibration plot of dataset 2.

Figure 6. RUL predictions under long predictive horizon.

Figure 7. RUL predictions under medium predictive horizon.

Figure 8. RUL predictions under short predictive horizon.

Figure 9. Prediction error (MAPE %) of forecasting models over training days.

Figure 10. Distribution of prediction error sensitivity across forecasting models. The orange horizontal line represents the median sensitivity, and the green triangle denotes the mean sensitivity.

Figure 11. Prediction Horizon Error Profile (PHEP) of the forecasting models.

Figure 12. Pyramid plot of model accuracy over several predictive horizons.

Figure 13. Flowchart to evaluate the performance of RUL predictive models.

Table 1. Hyperparameter settings and selected values used in the forecasting models.

Model	Parameter	Optimization Range	Selected Value
LSTM	Sequence Length	user-defined	2
	LSTM Units	[8, 16, 32]	16
	Dropout	[0.0–0.5]	0.1
	Dense Hidden Units	[8, 16, 32]	16
	Epochs/Batch Size	[100–300]/[2, 4]	300/4 or 2
GRU	Sequence Length	user-defined	2
	GRU Units	[8, 16, 32]	16
	Dropout	[0.0–0.5]	0.1
	Dense Hidden Units	[8, 16, 32]	16
	Epochs/Batch Size	[100–300]/[2, 4]	300/4 or 2
ARIMA	$(p, d, q)$	$p \in [0, 3]$, $d \in [0, 2]$, $q \in [0, 3]$	Best AIC-based order
Prophet	Daily/Weekly/Yearly Seasonality	{False, True}	False/True/False
	Changepoint Prior Scale	[0.01–0.5]	0.05
	Seasonality Prior Scale	[1–20]	10.0
	Added Weekly Seasonality	Fourier order [1–10]	period = 7, order = 3
PatchTST-inspired	Sequence Length	user-defined	2
	Attention Heads	[1–8]	2
	Key Dimension	[2–64]	4
	Feedforward Units	[8–128]	16
	Dense Hidden Units	[8, 16, 32]	16
	Epochs/Batch Size	[100–300]/[2, 4]	300/4 or 2
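
Table 1 states only that the ARIMA order is the "best AIC-based order" within the listed ranges. A minimal sketch of such a grid search is given below. The helper `fit_aic` is a hypothetical, user-supplied callable (it does not come from the article) that fits one candidate order and returns its AIC, so the search is independent of any particular ARIMA library.

```python
from itertools import product
from typing import Callable, Iterator, Optional, Tuple

Order = Tuple[int, int, int]


def candidate_orders() -> Iterator[Order]:
    """Enumerate every (p, d, q) with p in 0..3, d in 0..2, q in 0..3 (Table 1)."""
    return product(range(0, 4), range(0, 3), range(0, 4))


def select_order(fit_aic: Callable[[Order], float]) -> Order:
    """Return the candidate order with the lowest AIC.

    `fit_aic` is a hypothetical callable: it fits a model with the given
    order and returns its AIC. Candidates whose fit raises are skipped.
    """
    best_order: Optional[Order] = None
    best_aic = float("inf")
    for order in candidate_orders():
        try:
            aic = fit_aic(order)
        except Exception:
            continue  # invalid or non-convergent candidate
        if aic < best_aic:
            best_order, best_aic = order, aic
    if best_order is None:
        raise ValueError("no candidate order could be fitted")
    return best_order
```

With statsmodels, for instance, `fit_aic` could wrap `ARIMA(series, order=order).fit().aic`; whether the authors used that library is not stated in the table.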

Table 2. Accuracy (%) of predictive models across different prediction horizons and consequence levels, using dataset 1.

Model	Long Horizon (Low Consequence)	Medium Horizon (Medium Consequence)	Short Horizon (High Consequence)
ARIMA	11.64	56.56	90.63
Prophet	72.67	66.12	89.81
LSTM	34.46	59.04	79.72
GRU	33.77	54.77	90.39
PatchTST	33.99	18.13	67.64

Table 3. Performance comparison for short-term predictive horizon, with 47 training days and 3 testing days.

Model	MAE	MSE	RMSE	R² Score	MAPE (%)	Accuracy (%)
ARIMA	0.0252	0.0010	0.0313	−0.5988	9.3651	90.63
Prophet	0.0292	0.0009	0.0294	−0.4160	10.1910	89.81
LSTM	0.0558	0.0037	0.0612	−5.1204	20.2753	79.72
GRU	0.0275	0.0008	0.0279	−0.2698	9.6050	90.39
PatchTST	0.0905	0.0087	0.0935	−13.2818	32.3619	67.64
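
The Accuracy column in Table 3 is consistent with 100 − MAPE (for ARIMA, 100 − 9.3651 ≈ 90.63). The sketch below computes the six reported metrics using their standard textbook definitions; that the authors used exactly these definitions is an assumption, and the sample values in the usage lines are illustrative, not taken from the datasets.

```python
import math
from typing import Dict, Sequence


def regression_metrics(y_true: Sequence[float], y_pred: Sequence[float]) -> Dict[str, float]:
    """Standard error metrics; accuracy is taken as 100 - MAPE (assumption).

    Requires non-zero targets (for MAPE) and non-constant targets (for R^2).
    """
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    mape = 100.0 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n
    return {
        "mae": mae,
        "mse": mse,
        "rmse": rmse,
        "r2": r2,
        "mape": mape,
        "accuracy": 100.0 - mape,
    }


# Illustrative three-point test window (made-up values).
metrics = regression_metrics([0.30, 0.28, 0.26], [0.33, 0.30, 0.29])
print(metrics)
```

R² becomes negative whenever the squared prediction error exceeds the spread of the targets around their own mean, which can happen easily on a three-point test window even when MAPE is low.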

Table 4. Shape RMSE comparison of forecasting models across different scenarios.

Model	Long Predictive Horizon (20 Training/30 Testing Days)	Medium Predictive Horizon (40 Training/10 Testing Days)	Short Predictive Horizon (47 Training/3 Testing Days)
ARIMA	0.3846	0.7464	0.5636
GRU	0.5307	0.5147	0.1024
LSTM	0.5323	0.4746	0.5999
PatchTST	0.5959	0.3228	0.0540
Prophet	0.1015	0.1734	0.0681
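
The three scenarios in Tables 3 and 4 pair a training length with a testing length (20 and 30, 40 and 10, 47 and 3 days), each summing to 50 days. A minimal sketch of the corresponding chronological split follows; the 50-day record length is inferred from those sums, and the function and dictionary names are illustrative.

```python
from typing import Dict, List, Sequence, Tuple

# Horizon name -> (training days, testing days), as listed in the tables.
HORIZONS: Dict[str, Tuple[int, int]] = {
    "long": (20, 30),
    "medium": (40, 10),
    "short": (47, 3),
}


def chronological_split(
    series: Sequence[float], train_days: int, test_days: int
) -> Tuple[List[float], List[float]]:
    """Split a daily series in time order: first block trains, next block tests."""
    if train_days + test_days > len(series):
        raise ValueError("series is shorter than train_days + test_days")
    train = list(series[:train_days])
    test = list(series[train_days:train_days + test_days])
    return train, test


# Placeholder daily health indicator: day indices 0..49 stand in for real data.
daily_indicator = [float(day) for day in range(50)]
for name, (n_train, n_test) in HORIZONS.items():
    train, test = chronological_split(daily_indicator, n_train, n_test)
    print(name, len(train), len(test))
```

No shuffling is applied, so the test block always lies strictly after the training block, which is what makes the testing length act as the prediction horizon.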

Table 5. Prediction error (MAPE %) of forecasting models across different training window sizes.

Train Days	ARIMA	Prophet	LSTM	GRU	PatchTST
20	88.36	27.33	65.49	65.60	66.01
21	93.19	19.80	72.85	75.87	72.00
22	96.29	43.83	91.98	93.16	79.90
23	95.15	51.50	95.35	94.13	84.85
24	95.56	58.22	95.77	96.16	91.29
25	100.81	64.96	101.41	100.39	97.46
26	105.52	71.80	97.46	97.51	86.05
27	103.51	74.48	97.94	98.71	97.02
28	110.82	76.72	99.83	98.47	104.66
29	112.14	77.69	100.69	103.35	105.36
30	112.39	78.45	96.31	97.73	107.47
31	53.63	76.88	90.73	96.31	105.95
32	84.16	76.40	100.76	94.94	111.72
33	46.76	75.46	102.50	97.80	114.12
34	38.81	72.74	98.29	98.76	119.89
35	16.32	68.64	99.51	99.20	123.04
36	21.32	61.42	92.00	85.28	123.18
37	15.21	22.10	90.98	91.17	117.48
38	37.77	16.27	54.79	83.11	104.64
39	16.80	38.57	50.97	79.48	95.92
40	43.44	33.88	46.78	42.70	81.87
41	93.18	20.32	51.90	58.23	76.85
42	30.14	14.46	45.12	55.45	65.27
43	45.12	10.08	36.60	48.14	55.76
44	8.35	9.20	48.31	41.55	53.85
45	10.60	9.70	35.67	35.05	46.16
46	27.56	7.75	26.60	29.54	37.78
47	9.37	10.19	20.99	25.80	32.36
48	17.94	10.06	24.44	29.53	32.82
49	7.54	7.52	34.75	32.13	38.26

Table 6. Prediction error ranges (MAPE, %) of different models across selected training windows under a ±3-day variation in training length.

Model	Long Predictive Horizon, 25 Training Days (±3)	Medium Predictive Horizon, 40 Training Days (±3)	Short Predictive Horizon, 45 Training Days (±3)
ARIMA	95.15–110.82	15.21–93.18	8.35–45.12
Prophet	43.83–76.72	10.08–38.57	7.75–14.46
LSTM	91.98–101.41	36.60–90.98	20.99–48.31
GRU	93.16–100.39	42.70–91.17	25.80–55.45
PatchTST	79.90–104.66	55.76–117.48	32.36–65.27
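
The ranges in Table 6 can be recovered from Table 5 by taking the minimum and maximum MAPE over the seven training lengths centred on each window. The sketch below does this for the ARIMA and Prophet columns only, to keep it short; the variable and function names are illustrative.

```python
from typing import Dict, List, Tuple

FIRST_TRAIN_DAY = 20  # Table 5 starts at 20 training days and ends at 49.

# MAPE (%) per training length, copied from Table 5 (days 20..49).
MAPE_BY_MODEL: Dict[str, List[float]] = {
    "ARIMA": [
        88.36, 93.19, 96.29, 95.15, 95.56, 100.81, 105.52, 103.51, 110.82, 112.14,
        112.39, 53.63, 84.16, 46.76, 38.81, 16.32, 21.32, 15.21, 37.77, 16.80,
        43.44, 93.18, 30.14, 45.12, 8.35, 10.60, 27.56, 9.37, 17.94, 7.54,
    ],
    "Prophet": [
        27.33, 19.80, 43.83, 51.50, 58.22, 64.96, 71.80, 74.48, 76.72, 77.69,
        78.45, 76.88, 76.40, 75.46, 72.74, 68.64, 61.42, 22.10, 16.27, 38.57,
        33.88, 20.32, 14.46, 10.08, 9.20, 9.70, 7.75, 10.19, 10.06, 7.52,
    ],
}


def mape_range(model: str, center_day: int, half_width: int = 3) -> Tuple[float, float]:
    """Min and max MAPE over training lengths center_day +/- half_width."""
    values = MAPE_BY_MODEL[model]
    start = center_day - half_width - FIRST_TRAIN_DAY
    stop = center_day + half_width - FIRST_TRAIN_DAY
    if start < 0 or stop >= len(values):
        raise ValueError("window falls outside the tabulated training lengths")
    window = values[start:stop + 1]
    return min(window), max(window)


for model in MAPE_BY_MODEL:
    print(model, [mape_range(model, day) for day in (25, 40, 45)])
# ARIMA   -> (95.15, 110.82), (15.21, 93.18), (8.35, 45.12)
# Prophet -> (43.83, 76.72), (10.08, 38.57), (7.75, 14.46)
```

The width of each range is a simple indicator of how sensitive a model is to a few days more or less of training data: Prophet's short-horizon band spans under 7 percentage points, while ARIMA's medium-horizon band spans almost 78.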

Table 7. Risk matrix to classify RUL prediction models, using dataset 1.

Accuracy Rate	Low Consequence (30 Days)	Medium Consequence (10 Days)	High Consequence (3 Days)
High Accuracy (>95)	–	–	–
Medium Accuracy (80–95)	–	–	ARIMA (90.63%), GRU (90.39%), Prophet (89.81%)
Low Accuracy (<80)	ARIMA (11.64%), GRU (33.77%), LSTM (34.46%), PatchTST (33.99%), Prophet (72.67%)	ARIMA (56.56%), GRU (54.77%), LSTM (59.04%), PatchTST (18.13%), Prophet (66.12%)	LSTM (79.72%), PatchTST (67.64%)

Note: The background colors indicate the associated risk level of each accuracy–consequence combination: green denotes lower risk or more acceptable prediction adequacy, yellow denotes moderate or cautionary risk, light red denotes elevated risk, and red denotes the highest risk or least acceptable prediction adequacy.
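
The placement of models in Table 7 follows from two lookups: the accuracy band (>95, 80–95, <80) and the consequence level tied to each horizon. A minimal sketch that rebuilds the matrix cells from the Table 2 accuracies is shown below. Treating exactly 95% and exactly 80% as medium accuracy is an assumption, since the table does not say how boundary values are handled, and the colour-coded risk levels are not reproduced.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Accuracy (%) from Table 2, keyed by testing days (30, 10, 3).
ACCURACY: Dict[str, Dict[int, float]] = {
    "ARIMA": {30: 11.64, 10: 56.56, 3: 90.63},
    "Prophet": {30: 72.67, 10: 66.12, 3: 89.81},
    "LSTM": {30: 34.46, 10: 59.04, 3: 79.72},
    "GRU": {30: 33.77, 10: 54.77, 3: 90.39},
    "PatchTST": {30: 33.99, 10: 18.13, 3: 67.64},
}

# Consequence level attached to each horizon in the Table 7 header.
CONSEQUENCE: Dict[int, str] = {30: "Low", 10: "Medium", 3: "High"}


def accuracy_band(accuracy: float) -> str:
    """Map an accuracy percentage to the band used in the risk matrix."""
    if accuracy > 95.0:
        return "High"
    if accuracy >= 80.0:  # boundary handling is an assumption
        return "Medium"
    return "Low"


def build_risk_matrix(
    accuracy: Dict[str, Dict[int, float]] = ACCURACY,
) -> Dict[Tuple[str, str], List[str]]:
    """Group models by (accuracy band, consequence level)."""
    matrix: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for model, by_horizon in accuracy.items():
        for test_days, value in by_horizon.items():
            cell = (accuracy_band(value), CONSEQUENCE[test_days])
            matrix[cell].append(model)
    return dict(matrix)


for cell, models in sorted(build_risk_matrix().items()):
    print(cell, models)
```

Swapping in the dataset 2 accuracies from Table 8 yields the cells of the second risk matrix in the same way.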

Table 8. Comparison of model accuracy (%) across two datasets and prediction horizons. Column labels give training days–testing days.

Model	Dataset 1, 20–30	Dataset 1, 40–10	Dataset 1, 47–3	Dataset 2, 20–30	Dataset 2, 40–10	Dataset 2, 47–3
ARIMA	11.64	56.56	90.63	77.35	82.04	98.25
GRU	33.77	54.77	90.39	76.56	73.21	95.18
LSTM	34.46	59.04	79.72	84.86	70.72	95.14
PatchTST	33.99	18.13	67.64	46.12	33.79	30.29
Prophet	72.67	66.12	89.81	86.62	81.03	96.76
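
The accuracy ranges quoted in the abstract (long 11.64–86.62%, medium 18.13–82.04%, short 30.29–98.25%) are the minimum and maximum of each horizon's column pooled over both datasets in Table 8. The short sketch below reproduces them; the names are illustrative.

```python
from typing import Dict, List, Tuple

HORIZONS = ("long", "medium", "short")  # 20–30, 40–10, 47–3 train–test days

# Accuracy (%) from Table 8: model -> dataset -> (long, medium, short).
TABLE_8: Dict[str, Dict[str, Tuple[float, float, float]]] = {
    "ARIMA": {"dataset1": (11.64, 56.56, 90.63), "dataset2": (77.35, 82.04, 98.25)},
    "GRU": {"dataset1": (33.77, 54.77, 90.39), "dataset2": (76.56, 73.21, 95.18)},
    "LSTM": {"dataset1": (34.46, 59.04, 79.72), "dataset2": (84.86, 70.72, 95.14)},
    "PatchTST": {"dataset1": (33.99, 18.13, 67.64), "dataset2": (46.12, 33.79, 30.29)},
    "Prophet": {"dataset1": (72.67, 66.12, 89.81), "dataset2": (86.62, 81.03, 96.76)},
}


def accuracy_range_by_horizon() -> Dict[str, Tuple[float, float]]:
    """Pool all models and both datasets, then take min and max per horizon."""
    pooled: Dict[str, List[float]] = {name: [] for name in HORIZONS}
    for by_dataset in TABLE_8.values():
        for values in by_dataset.values():
            for name, value in zip(HORIZONS, values):
                pooled[name].append(value)
    return {name: (min(vals), max(vals)) for name, vals in pooled.items()}


print(accuracy_range_by_horizon())
# {'long': (11.64, 86.62), 'medium': (18.13, 82.04), 'short': (30.29, 98.25)}
```

The low end of the short-horizon range comes from PatchTST on dataset 2 (30.29%), the only case where accuracy falls as the horizon shortens.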

Table 9. Risk matrix to classify RUL prediction models, using dataset 2.

Accuracy Rate	Low Consequence (30 Days)	Medium Consequence (10 Days)	High Consequence (3 Days)
High Accuracy (>95)	–	–	ARIMA (98.25%), GRU (95.18%), LSTM (95.14%), Prophet (96.76%)
Medium Accuracy (80–95)	LSTM (84.86%), Prophet (86.62%)	ARIMA (82.04%), Prophet (81.03%)	–
Low Accuracy (<80)	ARIMA (77.35%), GRU (76.56%), PatchTST (46.12%)	GRU (73.21%), LSTM (70.72%), PatchTST (33.79%)	PatchTST (30.29%)

Note: The background colors indicate the associated risk level of each accuracy–consequence combination: green denotes lower risk or more acceptable prediction adequacy, yellow denotes moderate or cautionary risk, light red denotes elevated risk, and red denotes the highest risk or least acceptable prediction adequacy.
