A Novel Remaining Useful Life Probability Prediction Approach for Aero-Engine with Improved Bayesian Uncertainty Estimation Based on Degradation Data

Abstract: As the heart of an aircraft, the aero-engine is not only the main power source for flight but also an essential guarantee of flight safety. Therefore, it is of great significance to find effective methods for remaining useful life (RUL) prediction of aero-engines in order to avoid accidents and reduce maintenance costs. With the development of deep learning, data-driven approaches show great potential for this problem. Although many attempts have been made, few works consider the error in the point prediction caused by uncertainties. In this paper, we propose a novel RUL probability prediction approach for aero-engines that explicitly accounts for prediction uncertainties. Before forecasting, principal component analysis (PCA) is first used to reduce the dimension of the sensor data and extract the correlation between the multivariate data, which reduces the network computation. Then, a multi-layer bidirectional gate recurrent unit (BiGRU) is constructed to predict the RUL of the aero-engine, while prediction uncertainties are quantified by the improved variational Bayesian inference (IVBI) with a Gaussian mixture distribution. The proposed method gives not only the point prediction of RUL but also the confidence interval of the prediction, which is very helpful for real-world applications. Finally, the experimental study illustrates that the proposed method is feasible and superior to several other comparative models.


Introduction
With the vigorous development of the aviation industry, aircraft are widely used in many fields, such as civil aviation, the military, and transportation. The aero-engine is the core component of an aircraft and provides its main flight power [1,2]. Aero-engines usually work continuously in a high-temperature, high-pressure, dusty environment, so their health state deteriorates with long-term operation. As such, it is of great importance to establish a comprehensive prognostics and health management (PHM) system for aero-engines [3][4][5][6]. PHM includes fault detection and isolation, fault diagnosis, remaining useful life (RUL) prediction, health management [7,8], etc. PHM is becoming an important part of the design and use of new-generation aircraft, ships, vehicles, and other systems. RUL prediction is one of the main tasks of PHM and has been researched by many scholars in the past decades.
The existing RUL prediction methods can be divided into physical model-based methods, statistical data model-based methods, machine learning model-based methods, and hybrid model-based methods. Physical model-based prediction methods require a deep understanding of system structures and precise analysis of failure mechanisms in order to establish a reliable degradation model for RUL prediction [9,10]. However, because system structures are complex and surrounding conditions are changeable, it is next to impossible to build a universal degradation model with sufficient accuracy in practical applications. The statistical data model-based methods fit historical data with a random relationship model or stochastic process model to obtain the RUL prediction [11,12]. Though such methods consider the impact of system uncertainty on RUL, they are unable to mine the internal, complex, in-depth features of the monitoring data and, thus, cannot make sufficient use of massive monitoring data. The primary advantage of machine learning model-based methods is that they do not require a priori information about system degradation while making use of complex relationships in the monitoring data [13,14]. The hybrid model-based methods combine the advantages of different models [15], but they often require a more complex structure than a single model. As ever larger volumes of monitoring data become available, machine learning model-based methods have become the mainstream direction.
Deep learning-based methods are representative of machine learning model-based methods [16][17][18]. In recent years, deep learning has been successfully applied to the RUL prediction problem of aero-engines due to its strong ability in nonlinear modeling. For example, a multi-objective deep belief network ensemble (MODBNE) method was proposed in [19]. It could extract features automatically and estimate the RUL of an aero-engine by building multiple deep belief networks (DBNs). The convolutional neural network (CNN) is also favored by many scholars because of its advantages in feature extraction. In ref. [20], Li et al. constructed a new deep convolutional neural network approach, which could predict the RUL of an aero-engine by learning features of the originally collected sensor data without accurate physical or expert knowledge. Considering the high-dimensional characteristics and complexity of aero-engine monitoring data, Liu et al. [21] adopted deep convolutional neural networks to extract information from the input data and then used a strong classifier, the light gradient boosting machine (LightGBM), to replace the fully connected layer of the deep convolutional neural networks and improve the accuracy of RUL prediction. Kim et al. [22] proposed a convolutional neural network-based multi-task learning method for RUL prediction that considers the influence of the health status detection process on RUL estimation.
Unfortunately, the CNN-based methods mentioned above lack temporal memory and therefore cannot extract the temporal dependence in the degradation data. In recent years, approaches based on the recurrent neural network (RNN) and its variants have become dominant in RUL prediction. In ref. [23], Liu et al. presented an improved multistage long short-term memory network with a clustering method for the RUL prediction of aero-engines, which integrated clustering analysis and long short-term memory (LSTM) to improve on the low prediction accuracy of traditional single-parameter and single-stage models. Xia et al. [24] researched an LSTM method with a multi-layer self-attention (MLSA) mechanism for aero-engine RUL estimation. The degradation data characteristics and time step characteristics were extracted simultaneously by the MLSA mechanism, while LSTM was used to capture the degradation process of the system. To suppress the raw signal noise and improve the quality of input data, Jin et al. [25] provided a novel bidirectional LSTM-based two-stream network. One stream came from the raw data and the other from a series of new handcrafted feature flows (HFFs). Then, bidirectional LSTM (BiLSTM) was used to estimate the RUL of aero-engines based on these processed data. Moreover, Li et al. [26] constructed an integrated deep multiscale feature fusion network (IDMFFN) for aero-engine RUL prediction. Different scale features were first extracted by convolutional filters with different sizes, and then a gate recurrent unit (GRU) layer was built to replace the traditional, fully connected layer to obtain more accurate prediction results. Song et al. [27] put forward a hierarchical scheme with LSTM networks that treats RUL prediction as a bi-level optimization problem; the lower level predicts the time series in the near future, and the upper level estimates the RUL by merging the measurement data with the predictions obtained by the lower level.
Despite the advancement of the above deep learning models, most of them only provide point values of RUL and neglect the prediction uncertainties. In reality, RUL prediction is often affected by many uncertain factors, including data uncertainty introduced by noise in the data themselves, measurement errors of the sensors, or changes in operating conditions (also called aleatoric uncertainty) [28,29], and model uncertainty introduced by limited access to monitoring data (also called epistemic uncertainty) [30,31]. Therefore, it is necessary to analyze these uncertainties in deep learning to avoid making incorrect decisions. However, these kinds of uncertainties were rarely considered in the existing deep learning-based aero-engine RUL prediction methods [32]. To narrow this gap, in this paper, we construct a novel RUL probability prediction method for aero-engines based on a bidirectional gate recurrent unit (BiGRU) and improved variational Bayesian inference (IVBI), which provides the RUL prediction together with its confidence interval (CI). Finally, the proposed method is verified using the commercial modular aero-propulsion system simulation (CMAPSS) data set and compared with five other existing models. Experimental studies show the feasibility and effectiveness of our model. The main contributions of this work are as follows:
• An RUL probability prediction neural network based on degradation data is proposed for aero-engines, built on a BiGRU network and the IVBI technique, which gives not only the RUL prediction but also an estimate of the prediction uncertainties.
• A new IVBI method is proposed by replacing the traditional single Gaussian distribution in variational Bayesian inference with a Gaussian mixture distribution to improve the generalization capability and prediction ability of the proposed method.
• The performance of the proposed model is validated on the CMAPSS data set. Comparisons with five other advanced deep learning methods show that our method is the most effective one under all of the considered evaluation indices.
The rest of this paper is structured as follows. Section 2 gives a description of the proposed framework for aero-engine RUL prediction. Section 3 reports our model performance on the CMAPSS data set and discusses the results of comparative experiments. Finally, the conclusions are shown in Section 4.

Bidirectional Gate Recurrent Unit
The RNN is a neural network structure with a memory of historical information, which gives it unique advantages in dealing with time-series problems and has attracted great attention in recent years. However, a problem that cannot be ignored in the traditional RNN is the vanishing and exploding of gradients. LSTM, as an upgraded version of RNN [33], alleviates the vanishing and exploding gradient problems by turning gradient multiplication into gradient addition. By merging the forget gate, input gate, and output gate into a reset gate and an update gate, GRU, as depicted in Figure 1, further simplifies LSTM [34]. GRU can achieve a prediction performance similar to that of LSTM but with a simpler structure and fewer parameters, which greatly improves the training speed.
The reset gate in GRU determines how to combine the new input information with the previous memory through the activation function σ. In the standard GRU formulation, it is calculated as

$$r_t = \sigma\left(W_r \cdot [h_{t-1}, x_t]\right) \quad (1)$$

where $W_r$ indicates the weight from the input layer to the hidden layer of the reset gate, $h_{t-1}$ is the hidden state at time $t-1$, and $x_t$ indicates the input data at time $t$. The candidate hidden state $\tilde{h}_t$ contains the input information at time $t$ and a selective retention of the hidden state $h_{t-1}$. The calculation formula is

$$\tilde{h}_t = \tanh\left(W_h \cdot [r_t \odot h_{t-1}, x_t]\right) \quad (2)$$

where $W_h$ is the weight of the hidden state and $\odot$ denotes element-wise multiplication. The update gate defines the amount of previous memory saved to the current time step. The update gate calculation formula is

$$z_t = \sigma\left(W_z \cdot [h_{t-1}, x_t]\right) \quad (3)$$

where $W_z$ indicates the weight from the input layer to the hidden layer of the update gate. Finally, the hidden state decides how to combine the past hidden state and the current candidate information. The calculation formula is

$$h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t \quad (4)$$

The standard GRU network only considers the forward influence of historical information on future time steps, overlooking the information that the reverse direction of the sequence may carry. To address this limitation, the BiGRU [35] model is introduced as a modification of the conventional GRU. By processing sequences of observations in both the forward and reverse directions, as illustrated in Figure 2, the BiGRU network gains additional insights into the sequence. The objective is to extract as much information as possible from a given subsequence of observations, thereby providing a better basis for RUL prediction. For this reason, this paper adopts the BiGRU model.
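The gate computations described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: biases are omitted, each weight matrix acts on the concatenation of the hidden state and the input, the update gate is taken to weight the previous hidden state, and the helper names (`gru_step`, `run_gru`, `bigru`, `init_params`) and the sizes used are hypothetical rather than taken from the paper.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W_r, W_z, W_h):
    """One GRU step; every weight matrix acts on the concatenation [h, x]."""
    hx = np.concatenate([h_prev, x_t])
    r_t = sigmoid(W_r @ hx)                                        # reset gate
    z_t = sigmoid(W_z @ hx)                                        # update gate
    h_cand = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))    # candidate state
    # Assumed convention: z_t weights the previous memory.
    return z_t * h_prev + (1.0 - z_t) * h_cand

def run_gru(xs, params, hidden):
    """Run a GRU over a sequence xs of shape (T, n_in), starting from a zero state."""
    h = np.zeros(hidden)
    states = []
    for x_t in xs:
        h = gru_step(x_t, h, *params)
        states.append(h)
    return np.stack(states)

def bigru(xs, fwd_params, bwd_params, hidden):
    """Concatenate forward states with backward states re-aligned to forward time."""
    fwd = run_gru(xs, fwd_params, hidden)
    bwd = run_gru(xs[::-1], bwd_params, hidden)[::-1]
    return np.concatenate([fwd, bwd], axis=1)

def init_params(rng, n_in, hidden):
    """Hypothetical small random initialisation for (W_r, W_z, W_h)."""
    return tuple(rng.normal(0.0, 0.1, size=(hidden, hidden + n_in)) for _ in range(3))

rng = np.random.default_rng(0)
T, n_in, hidden = 30, 7, 8            # illustrative sizes only
xs = rng.normal(size=(T, n_in))
states = bigru(xs, init_params(rng, n_in, hidden), init_params(rng, n_in, hidden), hidden)
print(states.shape)                   # (30, 16)
```

Each time step ends up with a state of twice the hidden size, because the forward and backward passes each contribute their own summary of the sequence.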

Bayesian Neural Network and Improved Variational Inference
The Bayesian neural network is a powerful framework for modeling uncertainty [36]. It introduces uncertainty into a neural network by changing the constant weights (W in Figure 3) of the traditional neural network into probabilistic weights that obey a certain distribution; the structural comparison is shown in Figure 3. The basic idea of the Bayesian framework is to obtain updated knowledge about the model, called the posterior probability, from previously learned knowledge, called the prior probability, and observed knowledge, called the likelihood. Then, the model is used to predict unknown samples. For a given training set {X, Y}, X represents the input variables, and Y represents the labels. Let P(Y|X, w) be the model likelihood function, and let P(w) be the prior distribution of the model. By Bayes' rule, the posterior distribution is given by

$$P(w|X, Y) = \frac{P(Y|X, w)\,P(w)}{P(Y|X)} \quad (5)$$

For a new input x, the Bayesian framework can output a probability distribution of the prediction variable y based on x and the historical data set {X, Y}. The predictive distribution is given as follows

$$P(y|x, X, Y) = \int P(y|x, w)\,P(w|X, Y)\,dw \quad (6)$$

Although the integral in Equation (6) is easy to calculate in the low-dimensional case, a deep neural network contains high-dimensional parameters, so it is difficult to obtain the exact posterior distribution P(w|X, Y) and, hence, an accurate value of the integral. To address this problem, approximation techniques are necessary. Currently, the two most popular approximation methods are Markov chain Monte Carlo (MCMC) [37] and variational inference (VI) [38]. MCMC, a sampling-based approach, does not require a specific model assumption, but it is computationally expensive because a large number of samples of the posterior distribution must be drawn. As a result, it is less suitable for problems with high-dimensional parameters. On the other hand, VI, an approximation-based method, requires an assumed model for the posterior distribution, but it offers faster computation for generating prediction results.
Considering the computational efficiency of the model, we employ VI in this study. Hypothetical prior distribution selection and distribution difference measurement are two key components in VI, which are introduced in detail as follows.
• Improved hypothetical prior distribution
Initially, a prior distribution of moderate complexity is carefully chosen. An overly simplistic model, while easier to optimize, may yield considerable approximation errors when attempting to approximate relatively complex distributions, thus compromising the accuracy of the approximate solution. Conversely, an excessively complex model, although capable of approximating any distribution, may introduce significant challenges in the optimization process. Previous studies have frequently employed Gaussian distributions for VI, but single Gaussian models often exhibit limited accuracy when approximating complex distributions.
In this research, we adopt a Gaussian mixture distribution as the prior distribution. The Gaussian mixture distribution comprises two individual Gaussian models with mean values $\mu_1$, $\mu_2$ and variances $\sigma_1^2$, $\sigma_2^2$, providing a trade-off between computational complexity and prediction accuracy. The distribution is mathematically represented as follows:

$$P(w) = \rho\,\mathcal{N}(w \mid \mu_1, \sigma_1^2) + (1 - \rho)\,\mathcal{N}(w \mid \mu_2, \sigma_2^2) \quad (7)$$

where $\rho$ denotes the proportion of the mixture, and $\theta = \{\mu_1, \mu_2, \sigma_1, \sigma_2\}$ denotes the parameter set of the Gaussian mixture distribution.
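As a small illustration of the two-component mixture described above, the sketch below evaluates its density for a scalar weight. It assumes that σ1 and σ2 are used as standard deviations inside the Gaussian density; the function names and the parameter values are hypothetical and chosen only for demonstration.

```python
import math

def gaussian_pdf(w, mu, sigma):
    """Density of a univariate Gaussian with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((w - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_pdf(w, rho, mu1, sigma1, mu2, sigma2):
    """Two-component Gaussian mixture with mixing proportion rho in [0, 1]."""
    return rho * gaussian_pdf(w, mu1, sigma1) + (1.0 - rho) * gaussian_pdf(w, mu2, sigma2)

# Hypothetical parameters: one wide and one narrow component, both centred at zero.
params = dict(rho=0.25, mu1=0.0, sigma1=1.0, mu2=0.0, sigma2=0.1)
for w in (0.0, 0.2, 1.0):
    print(w, mixture_pdf(w, **params))
```

With ρ = 0 or ρ = 1 only one component survives, so the mixture reduces to a single Gaussian.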

• Distribution difference measurement
The Kullback-Leibler (KL) [39] divergence describes the difference between two probability distributions, P and Q. In neural networks, if P is unknown and we want to approximate it with a known distribution Q, the KL divergence is usually taken as the error function, and the optimal approximate distribution Q can be obtained by minimizing the KL divergence, which is calculated as

$$\mathrm{KL}\left(Q(w|\theta)\,\|\,P(w|X, Y)\right) = \int Q(w|\theta)\,\log\frac{Q(w|\theta)}{P(w|X, Y)}\,dw \quad (8)$$

In this work, a known probability distribution Q(w|θ) is directly used to approximate the posterior distribution P(w|X, Y). We turn the posterior solution problem in Equation (6) into a parameter optimization problem by minimizing the KL divergence between Q(w|θ) and P(w|X, Y). The formula is as follows

$$\theta^{*} = \arg\min_{\theta}\,\mathrm{KL}\left(Q(w|\theta)\,\|\,P(w|X, Y)\right) \quad (9)$$

According to the definition of the KL divergence in Equation (8), we rewrite the formula in the form of an expectation to facilitate the calculation:

$$\theta^{*} = \arg\min_{\theta}\,\mathbb{E}_{Q(w|\theta)}\left[\log Q(w|\theta) - \log P(w|X, Y)\right] \quad (10)$$

Then, the loss function of our network consists of two parts: one is the loss between the real RUL value and the predicted RUL value, which is measured by the MSE, and the other is the loss of the uncertainty estimation, which is measured by the KL divergence. The final model loss function is defined as

$$\mathrm{Loss} = \mathrm{MSELoss}(y_{true}, y_{pred}) + \mathrm{KL}\left(Q(w|\theta)\,\|\,P(w|X, Y)\right) \quad (11)$$

where MSELoss is a loss function provided by the PyTorch framework, $y_{true}$ is the RUL label of the data set, and $y_{pred}$ is the RUL predicted by the model.
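A rough NumPy sketch of a loss of this shape is given below. The true posterior is not available in closed form; expanding it with Bayes' rule splits the KL term into a data-fit part and a KL term between Q(w|θ) and the prior, and with a Gaussian likelihood the data-fit part is proportional to the squared error. The sketch therefore adds an MSE term to a Monte Carlo estimate of the KL divergence between Q and the mixture prior. A factorised Gaussian Q, the unweighted sum of the two terms, the prior parameters, and all function names are assumptions of this sketch, not details confirmed by the paper.

```python
import numpy as np

def log_gaussian(w, mu, sigma):
    """Element-wise log density of N(mu, sigma^2)."""
    return -0.5 * np.log(2.0 * np.pi) - np.log(sigma) - 0.5 * ((w - mu) / sigma) ** 2

def log_mixture_prior(w, rho=0.25, sigma1=1.0, sigma2=0.1):
    """Log density of a zero-mean two-component mixture; requires 0 < rho < 1."""
    a = np.log(rho) + log_gaussian(w, 0.0, sigma1)
    b = np.log1p(-rho) + log_gaussian(w, 0.0, sigma2)
    return np.logaddexp(a, b)

def kl_monte_carlo(mean, std, rng, n_samples=2000, log_prior=log_mixture_prior):
    """Estimate E_Q[log Q(w) - log P(w)] with reparameterised samples w = mean + std * eps."""
    mean, std = np.asarray(mean, dtype=float), np.asarray(std, dtype=float)
    eps = rng.standard_normal((n_samples,) + mean.shape)
    w = mean + std * eps
    log_q = log_gaussian(w, mean, std).sum(axis=-1)
    log_p = log_prior(w).sum(axis=-1)
    return float(np.mean(log_q - log_p))

def total_loss(y_true, y_pred, mean, std, rng):
    """MSE between labels and predictions plus the estimated KL term."""
    mse = float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))
    return mse + kl_monte_carlo(mean, std, rng)

rng = np.random.default_rng(0)
q_mean, q_std = np.zeros(4), np.full(4, 0.05)      # hypothetical variational parameters
print(total_loss([112.0, 98.0], [110.0, 101.0], q_mean, q_std, rng))
```

In a trainable implementation the same quantities would be computed with an automatic differentiation framework so that the gradient with respect to the variational parameters is available.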

The RUL Probability Prediction Framework
This section introduces the proposed probabilistic RUL prediction framework. The overall framework is depicted in Figure 4. First of all, the sensor data are filtered so that only pertinent information is retained, and they are subsequently normalized using the minimum-maximum normalization technique, as in Equation (12). This normalization aims to mitigate the adverse effects arising from singular sample data.

$$x^{*} = \frac{x - x_{min}}{x_{max} - x_{min}} \quad (12)$$

where $x$ is the sensor data, $x^{*}$ is the normalized data, $x_{min}$ is the minimum value in the data, and $x_{max}$ is the maximum value in the data. Then, the widely employed data compression technique, PCA, is applied. Numerous studies have demonstrated its effectiveness in data dimensionality reduction and denoising. PCA is utilized to extract the principal features of the data, thereby reducing its dimensionality and enhancing the computational efficiency.
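The minimum-maximum normalization step maps each sensor channel to [0, 1] by subtracting its minimum and dividing by its range. A minimal sketch follows; applying the training-set minimum and maximum to the test data and mapping constant columns to zero are assumptions of this sketch, and the function name is hypothetical.

```python
import numpy as np

def min_max_normalize(x, x_min=None, x_max=None):
    """Scale each column of x to [0, 1]; reuse x_min/x_max if they are supplied."""
    x = np.asarray(x, dtype=float)
    x_min = x.min(axis=0) if x_min is None else np.asarray(x_min, dtype=float)
    x_max = x.max(axis=0) if x_max is None else np.asarray(x_max, dtype=float)
    # Guard against division by zero for columns that never change.
    span = np.where(x_max > x_min, x_max - x_min, 1.0)
    return (x - x_min) / span, x_min, x_max

train = np.array([[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]])
train_scaled, lo, hi = min_max_normalize(train)
test_scaled, _, _ = min_max_normalize(np.array([[7.0, 10.0]]), lo, hi)
print(train_scaled[:, 0])   # [0.  0.5 1. ]
print(test_scaled)          # [[1.5 0. ]]
```

Values outside the training range, as in the test row here, fall outside [0, 1]; whether to clip them is a separate design choice.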
Subsequently, the core prediction model, BiGRU-IVBI, is built by embedding the Bayesian neural network into the BiGRU. The deterministic weight values in the BiGRU are replaced by our prior probability distribution so that the uncertainty factors in the neural network can be analyzed. The model is trained with gradient descent to minimize the loss function and obtain the optimal distribution parameters. Finally, the RUL probability distribution prediction is obtained, from which the point prediction and the CI of the aero-engine RUL can also be derived.
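One common way to turn a network with probabilistic weights into a point prediction and a CI is to draw the weights repeatedly, run a forward pass for each draw, and summarize the resulting predictions. The sketch below illustrates this with a toy linear model standing in for the BiGRU; the model, its parameter values, and the use of the sample mean and empirical quantiles are all assumptions for illustration, not details given in the text.

```python
import numpy as np

def summarize_rul_samples(samples, level=0.95):
    """Return mean prediction and an empirical central interval along axis 0."""
    samples = np.asarray(samples, dtype=float)
    alpha = (1.0 - level) / 2.0
    point = samples.mean(axis=0)
    lower = np.quantile(samples, alpha, axis=0)
    upper = np.quantile(samples, 1.0 - alpha, axis=0)
    return point, lower, upper

# Toy stand-in for a Bayesian network: a linear model with Gaussian weights.
rng = np.random.default_rng(0)
w_mean = np.array([-0.8, 120.0])        # hypothetical weight means
w_std = np.array([0.05, 2.0])           # hypothetical weight standard deviations
x = np.array([[10.0, 1.0], [50.0, 1.0], [100.0, 1.0]])   # [cycle, bias] inputs

samples = np.stack([x @ (w_mean + w_std * rng.standard_normal(2)) for _ in range(500)])
point, lower, upper = summarize_rul_samples(samples)
for p, lo_, hi_ in zip(point, lower, upper):
    print(f"RUL ~ {p:.1f}  (95% interval {lo_:.1f} to {hi_:.1f})")
```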

Data Description
As shown in Figure 5, an aero-engine normally includes a fan, low-pressure compressor (LPC), low-pressure turbine (LPT), high-pressure compressor (HPC), high-pressure turbine (HPT) [2], and other modules. In this paper, the CMAPSS data set, which was generated by NASA using commercial modular aero-propulsion system simulation software for PHM-themed competitions, is used for experiments. The data come from turbofan engines, and each engine has 21 sensors. There are four data sets under different conditions; we choose FD001, which was generated under a single fault mode (degradation of the high-pressure compressor), as the experimental data set, including its training set and test set. The number of training samples is 17,631, and the model is tested on 100 engines with a total of 10,096 samples. Specific information about the sensors is displayed in Table 1.

• Data visualization and analysis
We first visualize the original monitoring data in Figure 6 to better observe the data characteristics. It can be observed from the figure that not all of the sensor data contain useful information; some sensor data remain unchanged. It is noteworthy that although sensor 6 fluctuates, the fluctuation is only noise.

• Correlation analysis
We also conduct a correlation analysis between the sensor data and RUL. As can be seen from Figure 7, some sensor data show no correlation with RUL, so we combine this with the results of the data visualization in Figure 6 and finally decide to discard the data from sensors 1, 5, 6, 10, 16, 18, and 19. We also find that some sensor data are highly correlated with each other, indicating the existence of redundant features. It is therefore necessary to extract the main features.
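The screening step above can be sketched as computing the Pearson correlation between each sensor channel and the RUL label and dropping channels that are constant or only weakly correlated. The threshold value, the function name, and the synthetic data are hypothetical; the paper's own selection combined the correlation results with visual inspection.

```python
import numpy as np

def select_sensors(X, rul, min_abs_corr=0.1):
    """Return indices of columns whose |Pearson correlation| with rul reaches the threshold."""
    X = np.asarray(X, dtype=float)
    rul = np.asarray(rul, dtype=float)
    keep, corrs = [], []
    for j in range(X.shape[1]):
        col = X[:, j]
        if col.max() == col.min():        # constant sensor: correlation is undefined
            corrs.append(0.0)
            continue
        c = float(np.corrcoef(col, rul)[0, 1])
        corrs.append(c)
        if abs(c) >= min_abs_corr:
            keep.append(j)
    return keep, np.array(corrs)

# Synthetic example: a constant channel, an informative one, and an alternating one.
rul = np.arange(100, dtype=float)[::-1]
X = np.column_stack([
    np.full(100, 5.0),
    2.0 * rul + 1.0,
    np.tile([1.0, -1.0], 50),
])
keep, corrs = select_sensors(X, rul)
print(keep)
```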

Principal Component Analysis
Next, the data features are extracted by PCA. As shown in Figure 8, the first seven principal components retain more than 90% of the information in the raw data, so the first seven principal component features are selected.
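Choosing the number of principal components from a cumulative explained-variance threshold can be sketched as follows. The sketch reads "information" as explained variance, uses an eigendecomposition of the covariance matrix, and runs on synthetic data; the function names and the data are assumptions for illustration only.

```python
import numpy as np

def pca_fit(X, var_threshold=0.90):
    """Return the data mean, the leading components, and the cumulative variance ratio."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    cov = np.cov(X - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]               # re-sort to descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratio = np.cumsum(eigvals) / eigvals.sum()
    # Smallest k whose cumulative ratio reaches the threshold.
    k = min(int(np.searchsorted(ratio, var_threshold)) + 1, len(eigvals))
    return mean, eigvecs[:, :k], ratio

def pca_transform(X, mean, components):
    """Project centred data onto the selected components."""
    return (np.asarray(X, dtype=float) - mean) @ components

# Synthetic data: three latent factors observed through six noisy channels.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 6))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 6))

mean, components, ratio = pca_fit(X)
Z = pca_transform(X, mean, components)
print(Z.shape, np.round(ratio, 3))
```

The mean and components fitted on the training data would then be reused to project the test data.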

Evaluation Metrics
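The performance is evaluated using four indicators: the root mean squared error (RMSE), mean absolute error (MAE), symmetric mean absolute percentage error (SMAPE), and score function (Score). In their commonly used forms, these indicators are defined as follows

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_{pred,i} - y_{true,i}\right)^{2}}$$

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_{pred,i} - y_{true,i}\right|$$

$$\mathrm{SMAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\frac{\left|y_{pred,i} - y_{true,i}\right|}{\left(\left|y_{pred,i}\right| + \left|y_{true,i}\right|\right)/2}$$

$$\mathrm{Score} = \sum_{i=1}^{n} s_i, \qquad s_i = \begin{cases} e^{-d_i/13} - 1, & d_i < 0 \\ e^{d_i/10} - 1, & d_i \geq 0 \end{cases}, \qquad d_i = y_{pred,i} - y_{true,i}$$

where $n$ is the number of test samples, $y_{pred}$ denotes the predicted RUL, and $y_{true}$ denotes the real RUL. RMSE measures the deviation between the predicted RUL and the real RUL. MAE is more robust to outliers. SMAPE expresses the error relative to the magnitude of the RUL values. Score is the scoring function used in the 2008 PHM data challenge; it penalizes late predictions more heavily than early ones.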
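The four indicators can be computed as in the sketch below. It assumes commonly used definitions: SMAPE with the mean of the absolute predicted and true values in the denominator, and the asymmetric exponential score of the 2008 PHM data challenge with constants 13 (early predictions) and 10 (late predictions). These exact variants and the function names are assumptions rather than details confirmed by the text.

```python
import numpy as np

def rmse(y_true, y_pred):
    d = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.sqrt(np.mean(d ** 2)))

def mae(y_true, y_pred):
    d = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.mean(np.abs(d)))

def smape(y_true, y_pred):
    """Symmetric MAPE in percent; undefined where both values are zero."""
    t = np.asarray(y_true, dtype=float)
    p = np.asarray(y_pred, dtype=float)
    return float(100.0 * np.mean(np.abs(p - t) / ((np.abs(t) + np.abs(p)) / 2.0)))

def phm08_score(y_true, y_pred):
    """Asymmetric score: late predictions (d >= 0) are penalised more than early ones."""
    d = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.sum(np.where(d < 0, np.exp(-d / 13.0) - 1.0, np.exp(d / 10.0) - 1.0)))

y_true = [100.0, 50.0]
y_pred = [110.0, 37.0]      # one late prediction (+10) and one early prediction (-13)
print(rmse(y_true, y_pred), mae(y_true, y_pred))
print(smape(y_true, y_pred), phm08_score(y_true, y_pred))
```

In this example the late error of 10 cycles and the early error of 13 cycles contribute equally to the score, which shows the asymmetry of the penalty.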

Experimental Results
There are 100 engines in the data set. Here, we choose engines No. 21, No. 35, No. 51, and No. 100 as examples. The RUL prediction results are shown in Figure 9.
The figures show that there is great uncertainty in the initial stage. Then, the CI gradually narrows as the aero-engine degrades, and, finally, the fault interval is accurately predicted. From the partially enlarged image, it can be seen that even if the point prediction does not completely overlap with the real-life degradation curve, the CI can still provide a reliable prediction range. The same conclusion can be made from the results of other engines in the data set, which are not displayed due to the space limitation.
To illustrate the effectiveness of the BiGRU-IVBI model more comprehensively, we compared our model, under the same network configuration, with the Bayesian multi-layer perceptron (BMLP), Bayesian long short-term memory (BLSTM), Bayesian gate recurrent unit (BGRU), and Bayesian bidirectional long short-term memory (BBiLSTM) models. Table 2 shows the RUL prediction results of the comparison with a 95% CI. It can be clearly seen from the data in Table 2 that our BiGRU-IVBI method has the best performance among the compared models, with the lowest RMSE, MAE, Score, and SMAPE. To make the results more intuitive, a line chart is also provided in Figure 10. In addition, we conducted an experiment using unprocessed data without the PCA feature-extraction process. It is evident from Table 2 and Figure 10 that these results are inferior to the results obtained using PCA, which demonstrates the effectiveness of PCA in improving the prediction results. All of the above results indicate that our model is feasible and performs best among the compared models. As seen in Table 3, we also compare the proposed method with some advanced methods reported in the literature on the CMAPSS data set, and the results show that our method achieves a clear improvement in the RMSE on data set FD001. This means that the RUL predicted by our proposed method is closer to the real RUL.

Methods and References    RMSE
MODBNE [19]               17.96
DCNN [20]                 12.61
LightGBM [21]             12.79
MT-CNN [22]               12.48
MLSA [24]                 11.57
BiLSTM [25]               11.96
IDMFFN [26]               12.18
LSTM [27]                 11.80
BiGRU-IVBI                9.91

We also conducted comparative experiments for different window sizes and finally set the window size to T = 30. The comparative experimental results of the different window sizes are shown in Table 4 and Figure 11.
In order to illustrate the effect of the parameter ρ in the Gaussian mixture model on the prediction performance, we also compare the experimental results of several Gaussian mixture models with different mixing ratios ρ, and the comparison results are shown in Table 5 and Figure 12. In order to ensure the fairness of the experiment, the average value of five experiments is taken for all of the above test data. We vary ρ from 0 to 1 with a step size of 0.25. It can be found from the table and the figure that the RUL prediction error of the network is at its minimum when ρ = 0.25. It is worth noting that the Gaussian mixture prior used in this paper degenerates to a single Gaussian distribution when ρ = 0 or ρ = 1. The experimental results show that the Gaussian mixture prior used in this paper has certain advantages in probabilistic RUL prediction and can be extended to other RUL prediction tasks.
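The window size T mentioned above refers to cutting each engine's multivariate time series into fixed-length segments that serve as network inputs. A minimal sketch of such a sliding window is given below; labelling each window with the RUL at its last cycle, the stride of one cycle, and the function name are assumptions of this sketch, since the text does not spell out these details.

```python
import numpy as np

def sliding_windows(features, rul, window=30):
    """Cut one engine's series (n, d) into overlapping windows of length `window`."""
    features = np.asarray(features, dtype=float)
    rul = np.asarray(rul, dtype=float)
    n = features.shape[0]
    if n < window:
        raise ValueError("series is shorter than the window")
    X = np.stack([features[i:i + window] for i in range(n - window + 1)])
    y = rul[window - 1:]            # assumed label: RUL at the last cycle of each window
    return X, y

# Hypothetical engine with 35 cycles and 7 features per cycle.
n_cycles, n_features = 35, 7
features = np.arange(n_cycles * n_features, dtype=float).reshape(n_cycles, n_features)
rul = np.arange(n_cycles, dtype=float)[::-1]
X, y = sliding_windows(features, rul, window=30)
print(X.shape, y.shape)             # (6, 30, 7) (6,)
```

A larger window gives the network more history per sample but yields fewer samples per engine, which is one reason the window size is worth tuning.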

Conclusions
A novel aero-engine RUL probability prediction framework based on BiGRU-IVBI is proposed in this paper, which takes into account the uncertainties during the degradation process of aero-engines. The representative features are first extracted by the PCA algorithm. Then, the BiGRU network is trained for RUL prediction with prediction uncertainties quantified by the proposed IVBI method, where the Gaussian mixture distribution is introduced to improve the generalization capability and prediction accuracy of the traditional variational Bayesian inference approach. The proposed method can give not only the point estimate of RUL prediction but also the CI of the estimate, which is of great importance for real-world application. Finally, the performance of the proposed method is verified on the CMAPSS data set, and the prediction accuracy is higher than other comparative models, which proves the effectiveness and superiority of the proposed framework.
Further research will be committed to finding more effective uncertainty prediction methods to obtain a higher RUL prediction accuracy. At the same time, exploring the application of the proposed method to the RUL prediction in situations where multiple degradation mechanisms are in place is also the direction of our future research.