1. Introduction
Prognostic and Health Management (PHM) of complex systems such as aircraft engines leverages expert domain knowledge and advanced sensor analysis to provide real-time monitoring of the health status of vital system components. The precise estimation of Remaining Useful Life (RUL) is an essential health-state indicator in modern PHM systems. By providing vital information about equipment degradation, RUL estimation helps prevent catastrophic failures, unplanned shutdowns, and financial losses [1].
RUL estimation techniques are classified into two main classes: model-based and data-driven methods [2]. Model-based methods require prior domain knowledge of complex systems from experts to build a physical model, making their application difficult. On the other hand, recent advancements in sensor and communication technology have improved data collection, enhancing the effectiveness of data-driven methods and giving them a clear advantage.
Data-driven methods can be further classified into conventional Machine Learning (ML) techniques [3] and Deep Learning (DL) approaches [4]. The initial stages of data-driven methods are data collection and pre-processing. Conventional ML methods typically rely on handcrafted statistical features to represent the degradation process, whereas many advanced DL frameworks are designed to learn complex hierarchical features directly from raw sensor data, often reducing the need for extensive manual feature engineering. However, the specific framework, the characteristics of the dataset and the designed architecture of the network influence the effectiveness of the automatic feature extraction [5].
Several studies have employed conventional machine learning methods to represent the degradation process and estimate the RUL. In one study, Empirical Mode Decomposition (EMD) is used to analyze non-linear and non-stationary data and extract robust features, which are then fed into a Random Forest (RF) model optimized through Bayesian optimization for superior RUL prediction [6]. To achieve adaptability and real-time capability of RUL prediction, another research effort uses the Unscented Kalman Filter (UKF) to recursively update the degradation parameters within a logistic regression model [7]. While this method offers advantages in online adaptability, it can be sensitive to the initial state estimation and the underlying assumptions about the system's dynamics. A hybrid similarity-based clustering method enhances the performance of ML regression models by training them on groups clustered according to a degradation index [8]. Additionally, a Support Vector Regression model is presented in [9], while the XGBoost algorithm, an ensemble method that uses decision trees as base learners, achieves the best performance among conventional machine learning algorithms [10]. An extensive comparison of kernel Adaptive Filtering methods is presented in [11], showing effectiveness and reliability in cases where computational cost is crucial.
Deep learning models are well suited to accurately predicting remaining useful life (RUL), with the compelling advantage of learning complex hierarchical features directly from raw sensor data. In one study [12], a combination of four deep neural networks with an attention mechanism achieved impressive accuracy in RUL prognosis. Several research papers combine recurrent neural networks with convolutional layers or attention mechanisms to integrate spatial and temporal information. Long Short-Term Memory (LSTM) and bidirectional gated recurrent unit layers are utilized alongside a temporal self-attention mechanism in [13,14]. In another approach, the features extracted from ensemble empirical mode decomposition and wavelet packet transform are fed into genetically optimized RNN and LSTM blocks [15]. Convolutional operations effectively capture spatial knowledge through the extraction of multi-scale features and produce RUL predictions in [16,17]. In another study [18], a two-phase DL model combines a reformed convolutional LSTM layer with an attention mechanism to effectively estimate RUL. Also, with the use of variational inference, the degradation representation is encoded in a latent space, improving regularization and overall performance [19]. In another approach, a Graph Spatial-Temporal Neural Network effectively captures the correlations between sensor responses and health states [20], and an Embedded Attention-based Parallel network enhances the representation capability of the extracted features [21]. An innovative neural approach uses ordinary differential equations (ODEs) as a residual network and shows the validity of ODEs in modeling the degradation process [22].
As discussed in the architectures above, most neural frameworks for RUL estimation use explicit layers to directly produce output from input based on their mathematical formulations. In the current research paper, we employ an implicit technique, where the input–output mapping is not defined by a fixed computational architecture but is dynamically estimated as the fixed point of a dynamic equation that describes the system.
Implicit layers typically take one of the following forms:
Neural Ordinary Differential Equations (Neural ODEs), where the model learns a function that represents the evolving dynamics of the hidden state over time. This continuous transformation is solved using numerical integration methods. Because these models share parameters across time, they are memory-efficient [23].
Optimization-based implicit layers, which define their output as the solution of a mathematical optimization problem embedded within the neural architecture. Gradients are estimated using implicit differentiation. The model not only learns from data but also guarantees the satisfaction of certain mathematical properties or constraints [24].
Deep Equilibrium Models (DEMs), where the hidden representation is computed as the equilibrium of a fixed-point equation. This fixed point is computed through iterative root-finding methods, allowing the model to represent arbitrarily deep computations without explicitly stacking layers, while using constant memory during training [25].
While these frameworks provide valuable approaches for implicit modeling, they often lack internal expressive mechanisms for modeling multivariate spatio-temporal dependencies in the time-series sensor data of complex systems.
In this research study, we use implicit layers as equilibrium blocks. A fixed-point equation, derived from the underlying architecture of the implicit layer, describes the nonlinear dynamics of the system. This state equation is solved using iterative numerical solvers, producing a converged representation that corresponds to the equilibrium point of the fixed-point equation. So, instead of stacking a predetermined number of layers, we estimate the convergence point of an implicit layer, considering that the equilibrium point more accurately reflects the system dynamics.
Explicit deep networks use a fixed number of operating layers. In contrast, implicit models operate with an adaptive number of iterations until convergence. This flexibility allows them to converge effectively to an equilibrium solution, making them a more robust option for various applications. However, applying classic back-propagation becomes challenging because memory requirements increase significantly due to the unrolling of multiple iterations. Instead, we use implicit differentiation, based on the Implicit Function Theorem (IFT), to compute gradients without unrolling the intermediate iterations [24].
The Deep Equilibrium (DE) block integrates convolutional operations and a novel attention-based Dual-Input Interconnection mechanism, created specifically for implicit deep models on multivariate time-series data. The convolutional component extracts local spatial and temporal patterns from raw sensor inputs, producing an input feature map representing short-term dependencies. The input feature mapping and a latent representation vector, which encodes the internal health state of the system, are dynamically processed by the Dual-Input Interconnection mechanism. This allows the model to perform a cross-attention-like operation, where the input mapping is projected as keys and values and the health state as queries in a shared embedding space. So, the latent state is adaptively updated based on the most relevant observed patterns of the input. The equilibrium state is used as a health indicator for system monitoring since it captures long-term degradation patterns and local sensor behavior. So, the DE block allows for a highly expressive and memory-efficient representation that captures the dynamics of the underlying system by iteratively updating the latent state until convergence. The architecture differs from conventional attention-based architectures in DE models, where self-attention is applied only to the latent representation vector and the input vector mapping is used as a residual connection.
A fundamental challenge of DE models is maintaining stability during the training process, since slight deviations in the input can result in significant deviations in the computed fixed point. Researchers are actively addressing these stability issues by implementing regularization techniques. In the current research effort, we address instability by incorporating Group Normalization across the layers of the DE block. This technique divides the channels of a feature map into groups and normalizes the values within each group independently. By doing so, we control the magnitude of the activations and help prevent instability.
As a final key component of the framework, we employ a feed-forward neural regression model augmented with the Monte Carlo (MC) Dropout technique [26,27]. Unlike traditional models that disable dropout during inference, this method keeps dropout active throughout, resulting in varied outputs for each forward pass. The mean of the multiple stochastic responses is the estimated RUL, and their variance measures the uncertainty of each prediction. In this way, we increase the neural network's generalization capability and improve performance. After the training process, the uncertainty of each prediction is calibrated on a validation set, enhancing the overall reliability of the model. The main component of the calibration process is a Gaussian Mixture Model, which groups the validation samples. Subsequently, each group is calibrated by estimating a scaling factor through an optimization problem. Therefore, by incorporating uncertainty into the prediction process, the model's reliability is improved and an insightful confidence interval is provided. In this way, the proposed framework becomes more robust, reliable, and informative.
The CMAPSS (Commercial Modular Aero-Propulsion System Simulation) dataset [28] is a commonly used benchmark in PHM, particularly for RUL estimation, and is used to evaluate the presented framework. This dataset is generated from a detailed simulation model of turbofan engines, which captures their complex dynamics and degradation behavior. CMAPSS consists of four sub-datasets, each with a different level of complexity resulting from varying operating conditions and fault modes. The experimental results demonstrate competitive performance comparable to recent state-of-the-art frameworks. Moreover, the proposed DEM can be applied to any complex system, since it has a general-purpose design, provided that multivariate sequential sensor data is available.
The contributions of the research paper are as follows:
We propose a Deep Equilibrium Model (DEM) for RUL estimation that effectively captures both spatial and temporal patterns in multivariate sensor data through implicit modeling. The architecture consistently achieves competitive performance across diverse operating conditions on the CMAPSS dataset.
The core element of the Deep Equilibrium Model is a novel Dual-Input Interconnection Attention Block, which enables iterative and adaptive updates of the latent degradation representation by jointly processing the internal health state and the spatio-temporal features extracted from convolutional blocks. Unlike standard Transformer self-attention mechanisms used in DE frameworks, which typically operate only on the latent representation and incorporate input features as a static residual, the proposed attention-based block performs a cross-attention-like interaction between two distinct inputs. This design enhances the model’s ability to capture complex degradation dynamics, leading to a more expressive and context-aware health representation.
The Calibrated Monte Carlo Dropout technique improves the reliability of the framework, providing a confidence interval for each estimation. An innovative calibration method based on a Gaussian Mixture Clustering Model is presented.
The remainder of the paper is structured as follows.
Section 2 describes the general principles of DE models, focusing on the estimation of the fixed point during the forward pass and the application of implicit differentiation in the backward pass.
Section 3 describes in detail the proposed architecture, focusing on the innovative design of the Dual-Input Interconnection layer.
Section 4 presents the experimental analysis and the simulation results, and the final section summarizes the conclusions and outlines future work.
2. Deep Equilibrium Models
DE models are implicit layers built from modern DL architectures. Their goal is to reach an equilibrium point that captures the system's non-linear dynamics, using root-finding techniques rather than explicit layer stacking. The characteristic fixed-point equation of a DE model is

$$ z = f_\theta(z, x), \quad (1) $$

where $x$ is the input of the equilibrium layer, $z$ denotes the latent representation state, and $f_\theta$ is the implicit function that defines the balance between internal representation states and external influences. Also, we denote by $z^*$ the equilibrium representation vector of Equation (1), which is derived using a fixed-point solver.
The first part of Figure 1 illustrates the architecture of the DE layer. The second part shows its unfolding process until it reaches convergence. We observe its similarity with RNNs, since both update an internal representation, but with a significant difference: recurrent networks process the input sequentially, one element at a time, each corresponding to a specific time step. In contrast, the DE layer processes all the input information at each update step and updates its internal representation until convergence. So, the DE layer analyzes input information from consecutive time steps in parallel, capturing temporal coherence. This approach, along with internal mechanisms that capture spatial knowledge, enhances the model's ability to understand the complex non-linear dynamics of modern systems.
Alternatively, DE layers are strongly similar to Residual Networks (ResNets), as we can observe through their unfolding. While ResNets achieve depth by stacking multiple layers, DE layers effectively achieve depth through the iterative update of an internal representation until it reaches a fixed point. Moreover, DE layers share parameters across the updating process, making them more robust to over-fitting. Also, the training of DE layers is more computationally efficient than ResNets’ since it is based on implicit differentiation.
One crucial aspect of DE training is the existence and uniqueness of fixed points. The stability of the process is an active research topic. One key requirement for ensuring convergence is 1-Lipschitz continuity during training, which guarantees that the function does not increase distances between inputs, leading to controlled updates in the iterative fixed-point solver [29]. A function $f$ is 1-Lipschitz when

$$ \| f(x_1) - f(x_2) \| \le \| x_1 - x_2 \| \quad \forall\, x_1, x_2. \quad (2) $$

Spectral normalization constrains the spectral norm of the Jacobian matrix so that its largest singular value is less than 1. This makes the mapping 1-Lipschitz and guarantees fixed-point convergence [30]. By applying 1-Lipschitz activation functions such as ReLU, SoftThreshold, and GroupSort, we preserve bounded transformations and reduce instability [31]. Also, the use of monotone operators and energy-based stability techniques such as Lyapunov theory [32] provides theoretical guarantees for the convergence of DE layers.
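For illustration, a minimal Python sketch of how such a contraction constraint can be imposed in practice is given below; the module name, dimensions, and layer choices are illustrative assumptions and do not correspond to the exact implementation of the proposed model.

```python
import torch
import torch.nn as nn

# Minimal sketch (not the paper's architecture): constraining a candidate
# implicit function f(z, x) to be approximately 1-Lipschitz in z by applying
# spectral normalization to its weight and using a 1-Lipschitz activation.
class ContractiveCell(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # spectral_norm rescales the weight so its largest singular value is ~1
        self.lin_z = nn.utils.spectral_norm(nn.Linear(dim, dim, bias=False))
        self.lin_x = nn.Linear(dim, dim)
        self.act = nn.ReLU()  # ReLU is 1-Lipschitz

    def forward(self, z, x):
        return self.act(self.lin_z(z) + self.lin_x(x))
```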
2.1. Forward Pass of DE Layer
Unlike explicit layers that transform the input using their nonlinear mapping function, DE layers output the fixed point $z^*$, determined by solving Equation (1). This root-finding problem can be solved efficiently using numerical methods such as simple fixed-point iteration until convergence, Newton's and Broyden's methods, or Anderson Acceleration. Fixed-point iteration applies Equation (1) repeatedly until convergence. The approach is simple and intuitive but may be slow, especially if the function $f$ has a contraction (Lipschitz) constant near 1.
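A minimal sketch of plain fixed-point iteration is shown below, assuming that $f$ is implemented as a callable f(z, x) returning a tensor of the same shape as z; the function name, tolerance, and iteration limit are illustrative.

```python
import torch

def fixed_point_iteration(f, x, z0, tol=1e-4, max_iter=100):
    """Naive fixed-point solver: repeatedly apply z <- f(z, x) until the
    relative residual ||f(z, x) - z|| / ||z|| falls below `tol`.
    Convergence is only guaranteed when f is a contraction in z."""
    z = z0
    for k in range(max_iter):
        z_next = f(z, x)
        res = (z_next - z).norm() / (z.norm() + 1e-8)
        z = z_next
        if res < tol:
            break
    return z, k, res
```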
Newton's Method updates $z$ as

$$ z_{k+1} = z_k - \big(J_f(z_k) - I\big)^{-1}\big(f(z_k, x) - z_k\big), \quad (3) $$

where $J_f = \partial f / \partial z$ is the Jacobian of $f$. The estimation of the inverse Jacobian required in Equation (3) is often expensive.
To decrease the computational cost, Broyden's method approximates the inverse Jacobian $\big(J_f - I\big)^{-1}$ by a matrix $B$ that is refined through an extra iterative process of low-rank updates:

$$ B_{k+1} = B_k + \frac{\big(\Delta z_k - B_k \Delta g_k\big)\, \Delta z_k^{\top} B_k}{\Delta z_k^{\top} B_k \Delta g_k}, \quad (4) $$

where $\Delta z_k = z_{k+1} - z_k$ and $\Delta g_k = g(z_{k+1}) - g(z_k)$ with $g(z) = f(z, x) - z$. This method avoids direct Jacobian inversion while still achieving fast convergence.
Anderson Acceleration, to reduce oscillations and thus enhance the convergence speed, uses a linear combination of past iteration estimates:

$$ z_{k+1} = \sum_{i=0}^{m-1} \alpha_i\, f(z_{k-i}, x). \quad (5) $$

The coefficients $\alpha_i$ of Equation (5) are estimated by solving a least-squares problem whose objective is to minimize the residual $f(z, x) - z$. The mathematical expression of the optimization problem is:

$$ \min_{\alpha} \Big\| \sum_{i=0}^{m-1} \alpha_i \big( f(z_{k-i}, x) - z_{k-i} \big) \Big\|_2 \quad \text{s.t.} \quad \sum_{i=0}^{m-1} \alpha_i = 1. \quad (6) $$

The Anderson method significantly accelerates convergence, with the disadvantage of a higher memory cost from storing past iterates. The main advantage of Anderson acceleration over Broyden's method is its typically faster and more robust convergence, especially in high-dimensional and non-linear fixed-point problems.
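The following Python sketch illustrates Anderson acceleration for a single sample, following Equations (5)-(6); the function signature, the history depth m, and the small ridge term lam are implementation conveniences rather than choices taken from the paper.

```python
import torch

def anderson_acceleration(f, z0, m=5, max_iter=50, tol=1e-4, lam=1e-4):
    """Sketch of Anderson acceleration for the fixed point z = f(z): the next
    iterate is a constrained least-squares combination of the last m function
    evaluations. `lam` stabilizes the small linear system numerically."""
    z_hist = [z0.reshape(-1)]
    f_hist = [f(z0).reshape(-1)]
    z = f_hist[-1]
    for _ in range(1, max_iter):
        fz = f(z.view_as(z0)).reshape(-1)
        z_hist.append(z)
        f_hist.append(fz)
        z_hist, f_hist = z_hist[-m:], f_hist[-m:]
        G = torch.stack(f_hist) - torch.stack(z_hist)   # residuals g_i = f(z_i) - z_i
        n = G.shape[0]
        # Minimize ||sum_i alpha_i g_i|| subject to sum_i alpha_i = 1.
        A = G @ G.T + lam * torch.eye(n, dtype=G.dtype, device=G.device)
        alpha = torch.linalg.solve(A, torch.ones(n, dtype=G.dtype, device=G.device))
        alpha = alpha / alpha.sum()
        z_new = (alpha[:, None] * torch.stack(f_hist)).sum(dim=0)
        res = (fz - z).norm() / (z.norm() + 1e-8)
        z = z_new
        if res < tol:
            break
    return z.view_as(z0)
```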
2.2. Backward Pass of DE Layer—Implicit Differentiation
The forward pass of DE models requires iterative solvers, making conventional backpropagation infeasible due to the increased storage demand of the intermediate gradients generated during the process. To overcome this problem, gradients are computed using the Implicit Function Theorem (IFT), a technique that avoids unrolling the iterative process [25].
The goal is to compute the gradient of the loss with respect to the parameters,

$$ \frac{\partial \ell}{\partial w} = \frac{\partial \ell}{\partial z^*}\, \frac{\partial z^*}{\partial w}, \quad (7) $$

where $w$ denotes the parameters of the DE layer and $\ell$ is the training loss. The IFT states that the gradient of the fixed point $z^*$ with respect to the parameters can be derived without storing the solver's intermediate iterations.

We consider the fixed-point equation $z^* = f(z^*, x; w)$, and by taking the total derivative with respect to $w$ and applying the chain rule, we have

$$ \frac{\partial z^*}{\partial w} = \frac{\partial f}{\partial z^*}\, \frac{\partial z^*}{\partial w} + \frac{\partial f}{\partial w}. \quad (8) $$

By solving Equation (8) for $\partial z^* / \partial w$, we get

$$ \frac{\partial z^*}{\partial w} = \Big( I - \frac{\partial f}{\partial z^*} \Big)^{-1} \frac{\partial f}{\partial w}, \quad (9) $$

and substituting into Equation (7),

$$ \frac{\partial \ell}{\partial w} = \frac{\partial \ell}{\partial z^*} \Big( I - \frac{\partial f}{\partial z^*} \Big)^{-1} \frac{\partial f}{\partial w}. \quad (10) $$

Setting $u^{\top} = \frac{\partial \ell}{\partial z^*} \big( I - \frac{\partial f}{\partial z^*} \big)^{-1}$, Equation (10) becomes

$$ \frac{\partial \ell}{\partial w} = u^{\top}\, \frac{\partial f}{\partial w}. \quad (11) $$

Instead of inverting the matrix $\big( I - \partial f / \partial z^* \big)$ directly, which is expensive for large-scale problems, we solve the following equation for $u$ using an iterative solver:

$$ u^{\top} \Big( I - \frac{\partial f}{\partial z^*} \Big) = \frac{\partial \ell}{\partial z^*}. \quad (12) $$

The term $\partial f / \partial z^*$ denotes the Jacobian of $f$ with respect to $z$ evaluated at the fixed point $z^*$. Also, $u^{\top}\, \partial f / \partial z^*$ is a vector-Jacobian product, which can be efficiently computed with modern auto-grad packages without explicitly forming the full Jacobian. Finally, to estimate $u$ we follow the iterative update rule:

$$ u_{k+1}^{\top} = \frac{\partial \ell}{\partial z^*} + u_k^{\top}\, \frac{\partial f}{\partial z^*}. \quad (13) $$
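The sketch below illustrates how this forward/backward scheme can be wired together in PyTorch; the names f, solver, and the iteration counts are illustrative stand-ins (solver is any fixed-point solver, such as the sketches above, that returns the converged tensor), not the exact implementation used in the paper.

```python
import torch

def deq_fixed_point(f, x, z_init, solver, backward_iters=50):
    """Sketch of a DEQ forward/backward pass via the Implicit Function Theorem:
    the forward solver is not unrolled, and gradients follow Eqs. (11)-(13)."""
    # Forward: find z* = f(z*, x) without tracking the solver's iterations.
    with torch.no_grad():
        z_star = solver(lambda z: f(z, x), z_init)

    # One tracked evaluation re-attaches z* to the graph (provides df/dw).
    z_star = f(z_star.requires_grad_(), x)

    # A second, detached evaluation is used only for vector-Jacobian products.
    z0 = z_star.clone().detach().requires_grad_()
    f0 = f(z0, x)

    def backward_hook(grad):          # grad = dl/dz*
        u = torch.zeros_like(grad)
        for _ in range(backward_iters):
            # Equation (13): u <- dl/dz* + u (df/dz*), a vector-Jacobian product
            u = grad + torch.autograd.grad(f0, z0, grad_outputs=u,
                                           retain_graph=True)[0]
        return u

    z_star.register_hook(backward_hook)
    return z_star
```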
3. Architecture of DE Layer
The inner architecture of the DE layer is crucial, since it determines its expressiveness and effectiveness and ensures that the model can learn complex representations of the data. Multi-sensor time-series data, commonly used in RUL estimation, have both spatial and temporal dependencies: sensor responses are sequential, variations in the reading of one sensor affect the others, and sensor readings change over time, reflecting the system's future behavior. These challenges can be addressed using spatio-temporal DL layers that integrate correlations among sensors and capture long-range dependencies.
Figure 2 provides a visual diagram of the architecture of the proposed framework for RUL estimation. The equilibrium model (DE block) consists of two key components: a convolutional block (Figure 3) and a Dual-Input Interconnection mechanism (Figure 4). Additionally, a Monte Carlo Dropout Feedforward Neural Network is incorporated as the final block, improving both the performance and reliability of the framework.
3.1. Deep Equilibrium Block
As the initial component of the DE block, a convolutional block is employed to capture local temporal patterns and extract hierarchical features from the input sequence. Convolutional Neural Networks (CNNs) consist of multiple convolutional blocks that automatically learn adaptive, hierarchical representations of the input. CNNs are well known for their success in processing visual data using 2D convolutions; however, they are also effective for one-dimensional data such as time series. A 1D convolutional layer applies filters along a single dimension, making it particularly suitable for extracting features from sequential data.
The mathematical formulation of a 1D convolutional layer for an input signal $x$ is defined as:

$$ y[n] = \sum_{m=0}^{M-1} w[m]\, x[n - m], \quad (14) $$

where $w$ denotes the convolutional kernel of length $M$ and $y$ is the produced output. Furthermore, this operation can be extended to multiple input and output channels and can incorporate padding and stride parameters to control both the output size and the computational cost [33].
We consider each sensor time series as an input channel and apply shared kernel filters across time steps. In Figure 3, we can see that three convolutional layers are used, each followed by a ReLU activation function. The first convolutional layer increases the number of channels by applying a kernel filter of size 1, while the other two extract low-level and high-level temporal features. Also, a residual connection between the last two convolutional layers enhances the learning of diverse representations across time. So, the main goal of the convolutional block is to capture local patterns in the input and robustly represent short-term dependencies.
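A minimal sketch of such a convolutional block is given below; the channel counts, kernel sizes, and class name are illustrative assumptions rather than the exact values used in the proposed architecture.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Sketch of the convolutional block described above: a kernel-size-1 layer
    that expands the channels, followed by two temporal convolutions with a
    residual connection between their outputs."""
    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        self.expand = nn.Conv1d(in_channels, hidden_channels, kernel_size=1)
        self.conv1 = nn.Conv1d(hidden_channels, hidden_channels,
                               kernel_size, padding=kernel_size // 2)
        self.conv2 = nn.Conv1d(hidden_channels, hidden_channels,
                               kernel_size, padding=kernel_size // 2)
        self.act = nn.ReLU()

    def forward(self, x):             # x: (batch, sensors, time)
        h = self.act(self.expand(x))  # channel expansion, kernel size 1
        h1 = self.act(self.conv1(h))  # low-level temporal features
        h2 = self.act(self.conv2(h1)) # high-level temporal features
        return h1 + h2                # residual connection between the last two layers
```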
The DE block incorporates a Dual-Input Interconnection mechanism, allowing dynamic feature interaction between two different input mappings. The first input is the latent representation $z$ that evolves over time, while the second input is the extracted feature mapping $\bar{x}$, which is produced by the preceding convolutional block and captures short-term spatio-temporal patterns from the sensor data. We linearly project $z$ and $\bar{x}$ using the learnable weight matrices $W_Q$, $W_K$ and $W_V$ to obtain the Query, Key and Value as

$$ Q = z\, W_Q, \qquad K = \bar{x}\, W_K, \qquad V = \bar{x}\, W_V. \quad (15) $$

The transformation of the latent vector $z$ and the convolutional feature mapping $\bar{x}$ to Query $Q$, Key $K$ and Value $V$ allows their interaction, since it is a projection into a shared embedding space. The Query represents the information that the model is looking for, based on its internal understanding, which is the latent vector $z$. The Key provides an encoding of the input mapping $\bar{x}$, while the Value contains the information of the input that will be operated on, based on the interaction between the Query and the Key.
In the sequel, we estimate the interaction weight $A$ as

$$ A = \mathrm{softmax}\!\left( \frac{Q K^{\top}}{\sqrt{d}} \right), \quad (16) $$

representing the weighted similarity score between Query and Key and providing information about the relevance of each part of the input to the current state of the system. The Softmax layer transforms the scores into a probability distribution over the input vector mapping, focusing on its most important parts. The final Dual Interconnection Output is computed as:

$$ z' = z + \big( A\, V \big)\, W_O, \quad (17) $$

where $W_O$ is a learnable weight matrix. The value projection $V$ of the input in the latent space is weighted by the attention score $A$, amplifying the most important features. Finally, the result is added to the latent representation vector $z$, establishing a residual connection. The residual connection prevents information loss and enhances the flow of gradients during training. So, through the adaptive process described in Equation (17) and illustrated in Figure 4, the latent space vector $z$ is updated dynamically, combining long-term dependency capture with the fusion of multi-channel input information.
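The sketch below illustrates this cross-attention-like update in PyTorch; the single-head formulation, tensor layout, and class name are simplifying assumptions and not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualInputInterconnection(nn.Module):
    """Sketch of the Dual-Input Interconnection update: queries come from the
    latent health state z, keys and values from the convolutional feature map,
    and the attended values update z through a residual connection."""
    def __init__(self, dim):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)
        self.w_k = nn.Linear(dim, dim, bias=False)
        self.w_v = nn.Linear(dim, dim, bias=False)
        self.w_o = nn.Linear(dim, dim, bias=False)

    def forward(self, z, x_feat):                 # z: (B, T, d), x_feat: (B, T, d)
        q = self.w_q(z)                           # queries from the latent health state
        k = self.w_k(x_feat)                      # keys from the convolutional features
        v = self.w_v(x_feat)                      # values from the convolutional features
        scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
        attn = F.softmax(scores, dim=-1)          # interaction weights A, Eq. (16)
        return z + self.w_o(attn @ v)             # residual update of z, Eq. (17)
```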
Figure 2 shows the components of the proposed model, where we can observe the use of ReLU activation functions and Group Normalization (GroupNorm) layers at the inputs of the DE layer to achieve stability during training and accelerate convergence. GroupNorm layers [34] divide the feature mapping into groups and normalize each group independently. GroupNorm is more robust than its alternative, Batch Normalization, since it does not depend on batch statistics. With the use of GroupNorm, we maintain the spatial coherence of the convolutional block output by normalizing a collection of channels rather than the complete feature mapping or each individual channel. Also, the combination of ReLU activation functions, which are 1-Lipschitz, with GroupNorm layers ensures non-expansive outputs while maintaining the ability to represent complex patterns.
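For illustration, the snippet below shows GroupNorm applied to a (batch, channels, time) feature map; the batch size, channel count, and number of groups are illustrative values only.

```python
import torch
import torch.nn as nn

# Illustrative GroupNorm usage: 32 channels split into 4 groups, each group
# normalized independently of the batch size (values are placeholders).
x = torch.randn(8, 32, 15)                       # batch of 8, 32 channels, window of 15
gn = nn.GroupNorm(num_groups=4, num_channels=32)
y = gn(x)                                        # same shape, per-group normalization
```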
3.2. Monte Carlo Dropout Feed-Forward Neural Network
As the final element of the framework, we employ a Monte Carlo Dropout feed-forward neural network with three linear layers as the regression model. The outputs of the first two layers are passed through ReLU activation functions and dropout layers. The difference from a standard network is that during inference we keep dropout enabled and pass the same input through the network multiple times. Finally, we take the mean of the outputs as the final prediction and their standard deviation as the uncertainty. In this way, we obtain for each input an uncertainty about the model's prediction, as a measure of its confidence.
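A minimal sketch of this inference procedure is given below, assuming model is any network containing dropout layers; the helper name and the number of stochastic passes are illustrative.

```python
import torch

def mc_dropout_predict(model, x, n_samples=100):
    """Sketch of Monte Carlo Dropout inference: dropout stays active at test
    time and the same input is passed through the network several times.
    The mean of the stochastic outputs is the RUL estimate and their standard
    deviation is the (uncalibrated) predictive uncertainty."""
    model.train()  # keeps dropout active; in practice only dropout modules need train mode
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_samples)])
    return preds.mean(dim=0), preds.std(dim=0)
```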
Also, to improve the reliability of the Monte Carlo Dropout network, a calibration method is applied that dynamically adjusts the uncertainties. The calibration is performed on a validation set derived from the training dataset.
Rather than applying a single scaling factor to all uncertainties, we group the validation samples based on their mean predictions and standard deviations and subsequently estimate a different scaling factor for each group. The assumption is that the model's uncertainty should be inflated in regions where it predicts with large errors, while smaller errors result in less uncertainty inflation. To find these regions, we use Gaussian Mixture clustering, exploiting its strength in building robust elliptical clusters in a probabilistic way. The adjustment of the predicted standard deviations within each cluster is designed such that approximately 95% of the actual target values fall within the estimated prediction interval. This adaptive scaling ensures that the uncertainty estimates accurately reflect the model's reliability, providing trustworthy confidence intervals.
The mathematical formulation of the uncertainty calibration method is as follows. Let $\mu_i$, $\sigma_i$ and $y_i$ be the predicted mean, standard deviation and true target for sample $i$, respectively. We want to adjust $\sigma_i$ by a scaling factor $k$ such that the probability that $y_i$ falls within the range $[\mu_i - k\,\sigma_i,\; \mu_i + k\,\sigma_i]$ equals our predefined confidence level $c$.

We use the validation dataset and build a Gaussian Mixture Model (GMM) to assign each validation sample to a cluster based on the prediction means and standard deviations of the Monte Carlo Dropout network. Finally, to find the optimum scaling factor $k_f$ for each cluster $f$, we solve

$$ k_f = \arg\min_{k} \left| \frac{1}{N_f} \sum_{i \in f} \mathbb{1}\!\left[\, |y_i - \mu_i| \le k\,\sigma_i \,\right] - c \right|, $$

where $\mathbb{1}[\cdot]$ is the indicator function and $N_f$ is the number of samples in cluster $f$. At inference, we assign each point $j$ to a cluster $g$ using the GMM and scale the uncertainties as $\tilde{\sigma}_j = k_g\, \sigma_j$.
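The sketch below illustrates one way this GMM-based calibration could be realized with scikit-learn; the cluster count, the 1-D search over k, and all function names are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def calibrate_uncertainties(mu_val, sigma_val, y_val, n_clusters=3, conf=0.95):
    """Sketch: cluster validation samples by (mean, std) with a GMM and pick a
    per-cluster scaling factor k_f so that roughly `conf` of the true targets
    fall inside mu +/- k_f * sigma."""
    feats = np.column_stack([mu_val, sigma_val])
    gmm = GaussianMixture(n_components=n_clusters, random_state=0).fit(feats)
    labels = gmm.predict(feats)

    scales = {}
    for f in np.unique(labels):
        idx = labels == f
        best_k = 1.0
        for k in np.linspace(0.5, 10.0, 200):      # simple 1-D search over k
            cover = np.mean(np.abs(y_val[idx] - mu_val[idx]) <= k * sigma_val[idx])
            if cover >= conf:
                best_k = k
                break
        scales[f] = best_k
    return gmm, scales

def scaled_sigma(gmm, scales, mu, sigma):
    """At inference, assign each sample to a cluster and scale its uncertainty."""
    labels = gmm.predict(np.column_stack([mu, sigma]))
    return sigma * np.array([scales[g] for g in labels])
```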
4. Experimental Analysis and Results
4.1. Description of Dataset—CMAPSS Dataset
To test and validate the effectiveness of the proposed Deep Equilibrium neural framework for RUL prediction, we use the CMAPSS (Commercial Modular Aero-Propulsion System Simulation) dataset provided by the NASA data repository as an evaluation benchmark [28]. The dataset provides realistic simulations of aircraft engine degradation over time and contains four sub-datasets (FD001, FD002, FD003 and FD004), each corresponding to different operational conditions and fault modes. The dataset includes a combination of three operational parameters (altitude, Mach number, and fuel flow), which create diverse degradation conditions. Fault modes refer to the mechanisms that lead to the degradation of the aircraft engine; in the CMAPSS dataset, two specific fault modes are represented: High-Pressure Compressor (HPC) degradation and fan degradation. The training dataset consists of run-to-failure simulations, where each engine starts from a healthy condition and gradually degrades until failure, so the last sample for each engine is considered failed, meaning that the target RUL equals zero. In contrast, in the testing dataset, the engine simulations are terminated at a point before overall failure, with the corresponding true RUL values provided.
Table 1 and Table 2 provide a detailed description of the characteristics of each sub-dataset and information about the sensor readings, respectively.
4.2. Data Pre-Processing
The data pre-processing process is crucial and can be broken down into smaller parts. Initially, the multivariate time-series data is normalized so that the response of each sensor follows a standard normal with a mean of 0 and a standard deviation of 1. The fact that FD002 and FD004 sub-datasets relate to various operational conditions is of important consideration during the normalization process. To enhance data representation at these two sub-datasets, we first apply the k-means clustering algorithm according to the operational conditions, and then we normalize each cluster independently. Therefore, denoting that the
sample point belongs to the
cth operational cluster, the normalized process is given by the equation
where
and
are the mean and the std of the cluster, respectively.
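A minimal sketch of this condition-wise normalization is shown below; the cluster count, array layout, and function name are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def condition_wise_normalize(X_sensors, X_conditions, n_clusters=6):
    """Sketch of the normalization step for FD002/FD004: samples are clustered
    by their operational settings and each sensor is standardized within its
    own cluster."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X_conditions)
    labels = km.labels_
    X_norm = np.empty_like(X_sensors, dtype=float)
    for c in range(n_clusters):
        idx = labels == c
        mu = X_sensors[idx].mean(axis=0)
        std = X_sensors[idx].std(axis=0) + 1e-8    # avoid division by zero
        X_norm[idx] = (X_sensors[idx] - mu) / std
    return X_norm, km
```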
Next, a feature selection process takes place, since several sensor readings in the dataset remain constant over time or do not provide meaningful information about the degradation process. The Random Forest ensemble regressor is used to analyze and quantify the importance of each feature. The Random Forest ensemble builds a set of decision trees, each trained on a random subset of the data and a random subset of features at each split; the final estimation is the mean of the predictions of all the individual trees. The significance of each feature is evaluated as a measure of its contribution to the reduction in the variance of the predictions (Mean Decrease in Impurity, MDI) [35]. This method is selected due to its robustness and ability to handle non-linear relationships. Other feature selection methods, such as mutual information or recursive feature elimination (RFE), could also be applied, leading to alternative feature subsets and influencing model performance. A comparative analysis of feature selection strategies is left as future work, to explore whether such alternatives could yield improvements in RUL estimation accuracy, especially under varying operational conditions.
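For illustration, a minimal sketch of MDI-based feature selection with scikit-learn is given below; the threshold value, estimator count, and function name are illustrative (the configured threshold is discussed in Section 4.3).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def select_features(X, y, feature_names, threshold=0.01):
    """Sketch: fit a Random Forest and keep only features whose impurity-based
    importance (MDI) is at least `threshold`."""
    rf = RandomForestRegressor(n_estimators=100, random_state=0, n_jobs=-1)
    rf.fit(X, y)
    keep = rf.feature_importances_ >= threshold
    selected = [name for name, k in zip(feature_names, keep) if k]
    return selected, rf.feature_importances_
```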
The next pre-processing step applies an Exponentially Weighted Average Smoothing (EWAS) function to each sensor reading. The CMAPSS time-series data, as measurements of sensors in turbofan engines, contain high-frequency noise and fluctuations. By applying the EWAS method, we smooth out noise, remove sudden spikes and improve the identification of degradation trends in the time-series data. The EWAS process is described by the following formula:

$$ s_t = \alpha\, x_t + (1 - \alpha)\, s_{t-1}, $$

where $\alpha$ is the smoothing factor, and $x_t$ and $s_t$ are the sensor reading and the smoothed value at time $t$, respectively.
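A minimal sketch of this smoothing applied to a single sensor channel follows; the default smoothing factor here is illustrative (the configured value is given in Section 4.3).

```python
import numpy as np

def ewas(x, alpha=0.1):
    """Sketch of Exponentially Weighted Average Smoothing:
    s_t = alpha * x_t + (1 - alpha) * s_{t-1}."""
    s = np.empty_like(np.asarray(x, dtype=float))
    s[0] = x[0]
    for t in range(1, len(x)):
        s[t] = alpha * x[t] + (1 - alpha) * s[t - 1]
    return s

# Equivalent with pandas: pd.Series(x).ewm(alpha=alpha, adjust=False).mean()
```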
A common practice in the literature is to limit the maximum value of the RUL, considering that, for an initial period, the engines are healthy and degradation has not yet started; after degradation begins, the RUL decreases linearly. To allow fair comparisons with papers that evaluate performance on the same dataset, we adopt this practice and set the maximum RUL value to 125 engine cycles. Also, as a final pre-processing step, we normalize each target value by dividing it by the maximum value, resulting in a target range of [0, 1] for the regression task.
4.3. Setting Hyper-Parameters and Configuration
We construct the input of the DE model using 15 subsequent sensor readings, so the window interval of the input is set to 15. During the feature selection process, we use a significance threshold of , meaning that features that contribute less than this amount are rejected. In the experimental setup, to effectively remove noise and capture the degradation trends, we set the smoothing factor of EWAS to . Figure 5 shows the smoothing process for the FD001 sub-dataset.
We set the training batch size to 256. The Adam optimizer is used to train the DE model with a learning rate of . Also, a reduction scheduler of the learning rate by is applied every 10 epochs. The training is completed after a fixed number of 35 epochs.
Given that the number of input channels is D, each convolutional block utilizes filters. For each sub-dataset, the number of input channels D varies due to the applied feature selection process. Therefore, the output of the concatenation layer in the convolutional block produces a feature mapping of channels. Also, the GroupNorm layers employ 4 normalization groups, and the linear layers of the Dual-Input Interconnection block do not change the dimension of the feature mapping. The dropout ratio of the Monte Carlo feedforward NN is set to .
Finally, to estimate the fixed-point of the DE model, we use the Anderson Acceleration method. In the utilization of Anderson Acceleration, which enhances the convergence of the forward pass of the DE model, we set the maximum number of iterations to 200 and the relative residual tolerance to . The weights of the linear projections in the Dual-Input Interconnection Block are initialized with random values drawn from a normal distribution centered at 0, with a standard deviation of 0.01.
The hyper-parameters (learning rate, batch size, dropout ratio, number of convolutional filters, attention projection dimension) were selected through a combination of trial-and-error experimentation and reference to values commonly reported in the related literature [1,2,3,12,17]. A limited manual tuning process was employed on a validation split from the training data, where the RMSE performance metric was monitored. A full grid search over the hyper-parameters could further improve the performance of the RUL estimation framework, but it was not conducted in this study.
4.4. Experimental Results
To evaluate and compare the performance of the proposed DE model against state-of-the-art frameworks, we use two standard metrics: the Root Mean Square Error (RMSE) and the PHM08 scoring metric. The RMSE measures the average magnitude of the prediction error, is sensitive to large deviations, and intuitively tells us how far, on average, the model's predictions are from the true RUL values. The PHM08 score is a widely adopted tool for evaluating RUL predictions, since it penalizes early and late predictions with different scaling factors for each type of error. In the current research paper, we used separate scaling factors for early and for late predictions, chosen such that late predictions incur a larger penalty, an important aspect in real-world scenarios.
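For illustration, the sketch below computes the two metrics using the widely used exponential form of the PHM08 score; the constants 13 and 10 are the commonly reported defaults and are assumptions here, since the paper's exact scaling factors are stated in the text above.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Square Error between predicted and true RUL values."""
    return float(np.sqrt(np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2)))

def phm08_score(y_true, y_pred, a1=13.0, a2=10.0):
    """Commonly used exponential PHM08 score: early predictions (d < 0) are
    penalized by exp(-d/a1) - 1 and late predictions (d >= 0) by exp(d/a2) - 1,
    so late errors grow faster. Constants are illustrative defaults."""
    d = np.asarray(y_pred) - np.asarray(y_true)
    return float(np.sum(np.where(d < 0, np.exp(-d / a1) - 1, np.exp(d / a2) - 1)))
```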
Table 3 presents the performance of various state-of-the-art frameworks in predicting RUL on the CMAPSS dataset. Comparing the results, we notice that the proposed DE model achieves better performance, particularly on the more challenging sub-datasets FD002 and FD004, which involve multiple working conditions. Moreover, for the extremely difficult sub-dataset FD004, the presented model shows an improvement in the RMSE metric by compared to the second-best model and by compared to the average performance of the models under comparison. Also, results on the PHM08 scoring metric indicate that the model performs better in both early and late predictions. An interesting comparison is with the Neural ODE model, which represents implicit deep learning approaches. We observe that our proposed DEM consistently outperforms the Neural ODE, demonstrating its ability to capture degradation patterns in RUL prediction.
Figure 6 displays the output of the DE model alongside the actual RUL values for the testing samples of each sub-dataset. The shaded red region indicates the confidence interval, which is determined by the calibrated standard deviation of the predictions. The plots show that the model effectively tracks the degradation trend of the engines, especially in the critical final stages before failure. Furthermore, the green scatter points indicate predictions that fall within the confidence interval, while the red scatter points correspond to significant prediction errors lying outside the confidence interval.
4.5. Comparison Without Monte Carlo Dropout Technique
The final component of the DE model is a Monte Carlo Dropout neural model with three linear layers. By utilizing the MC Dropout technique, we prevent over-fitting of the network during training, enhance generalization, and provide a confidence interval.
Table 4 presents the performance of the DE model with and without Monte Carlo Dropout enabled during inference across the four sub-datasets. We notice that using MC Dropout during inference slightly improves the performance, achieving lower RMSE and score values. The application of MC Dropout not only enhances the model's predictive accuracy but also provides a crucial reliability tool, which is highly valuable for real-world PHM applications.
4.6. Effectiveness of the Proposed Calibration Method on Predictive Uncertainty
Table 5 evaluates the prediction interval coverage before and after applying the proposed calibration method across the four sub-datasets. As coverage performance, we define the ratio of true RUL target values that fall within the confidence intervals of the predictions. We notice that the proposed calibration method, based on GMM clustering, significantly improves the coverage ratio for all sub-datasets, constructing prediction intervals that better reflect the true variability of the data and the model's uncertainty. Indeed, the confidence coverage has increased by for sub-dataset FD001, by for sub-dataset FD002, by for sub-dataset FD003 and by for sub-dataset FD004. So, the proposed overall DE model enhances maintenance decisions by providing not only predictions but also reliable confidence intervals.
4.7. Convergence Behavior of the Deep Equilibrium (DE) Model During Training
Figure 7 demonstrates the convergence dynamics of the proposed DE model during training. The top sub-plot illustrates the residual errors as the model converges to the fixed point for each training epoch. The bottom one shows the number of fixed-point iterations needed for convergence for each training epoch.
As we can observe, the residual errors remain consistently low after the initial epochs. Also, the number of convergence iterations for all sub-datasets is significantly smaller than the maximum of 200 iterations set for the Anderson Acceleration algorithm. So, we notice that the DE model reliably reaches a fixed point with minimal error within a small number of iterations. The steady residuals and the low iteration counts, even for the FD002 and FD004 sub-datasets where the operating conditions vary, demonstrate the robustness of the proposed DE model during the training process.
4.8. Computational Cost and Time Overhead
To evaluate the practicality and feasibility in real-world applications, we present an analysis of the computational requirements of the proposed Bayesian Deep Equilibrium framework on the FD001 sub-dataset of CMAPSS. The model was trained on a system with an NVIDIA GeForce GTX 1060 with 6 GB of memory, using a batch size of 256 over 35 epochs. The sequence window is set to 15, yielding a model with trainable parameters, making it lightweight and suitable even for low-resource environments. The training time was 17 min, which corresponds to an average of s per epoch.
A computational analysis of the inference process is also provided. To estimate predictive uncertainty during inference, we use 100 Monte Carlo forward passes in the Bayesian framework. This increases the inference overhead by around . However, the response time for each engine is s, making the model suitable for real-time industrial applications.
4.9. Failure Analysis on the CMAPSS Dataset
It is crucial to provide deeper insights into the robustness of the DE model observed within the CMAPSS dataset and analyze the specific conditions where its predictive performance decreased. Observing the performance across the four CMAPSS sub-datasets, we notice increased sensitivity and higher error rates for datasets with multiple operating conditions. The model has difficulties with engines that show sudden degradation patterns or irregular sensor responses in different operating modes. Furthermore, sensor noise and sudden measurement anomalies are commonly encountered in realistic aircraft engine operations, disrupting the equilibrium convergence process and resulting in significant prediction inaccuracies. The model’s reliance on strict Lipschitz constraints for stability further contributes to sensitivity since small deviations can considerably affect performance, underscoring the need for careful parameter initialization, regularization, and normalization strategies in practical applications.