Remaining Useful Life Estimation of Bearings Using Data-Driven Ridge Regression

Abstract: Predicting the remaining useful life (RUL) of mechanical bearings is a challenging industrial task since RUL can differ even for the same equipment due to many uncertainties such as operating condition, model inaccuracy, and sensory noise in various industrial applications. This paper proposes an RUL prediction method combining analytical model-based and data-driven approaches to forecast when a failure will occur based on the time series data of bearings. Feature importance ranking and principal component analysis construct a reliable and predictable health indicator from various statistical time, frequency, and time–frequency domain features of the observed signal. The adaptive sliding window method then optimizes the parameters of the degradation model based on the ridge regression of the time series sequence with the sliding window. The proposed adaptive scheme provides significant performance improvement in terms of the RUL estimation accuracy and robustness against the possible errors of the degradation model compared to the traditional Bayesian approaches.


Introduction
Modern industrial companies must continuously maintain their production resources by using appropriate maintenance strategies to improve availability, reliability, and safety while reducing their maintenance costs [1][2][3]. In this domain, predictive maintenance is a promising research direction because it enables manufacturers to monitor the working condition of machinery and diagnose faults or predict the next failure [4,5]. While conventional corrective maintenance intervenes only after a failure occurs, predictive maintenance schedules interventions according to the predicted health condition of the equipment [6].
The bearing prevents direct metal-to-metal contact between multiple elements of most motor-driven machines for various industrial applications [7,8]. Besides, it reduces energy consumption as sliding motion is replaced with low-friction rolling. Faults of rolling element bearings result in significant machinery failures because they incur friction, heat, and, ultimately, the wear and tear of parts. Predicting the remaining useful life (RUL) of bearings is one of the major goals of predictive maintenance for mechanical systems [1]. The RUL of a system is the length of remaining useful time before it is repaired or replaced. Engineers can schedule maintenance, optimize operating efficiency, and avoid unplanned downtime based on the RUL estimation. However, the RUL prediction is challenging since the failure is typically the result of the long and slow degradation of multiple components of the complex system [9,10]. The traditional model-based prognostic approach provides reliable results by using an analytical model to reflect the

Related Works
During the last decade, numerous methods have been proposed to estimate RUL of the mechanical system using measurement data and suitable models [20,21]. RUL estimation methods are generally categorized into model-based and data-driven approaches [22]. To estimate the current health status and forecast future failure, model-based methods use mathematical models of the degradation trend of machines such as the partial differential equations [23] and state-space model [24,25]. However, they require comprehensive domain knowledge to establish the degradation trend of the failure process of the system. Moreover, an adequate physical model is infeasible to derive, especially for real complex systems, since the essential data are hard to obtain.
Since the degradation process is a functional relationship between measured data and health status, the data-driven methods apply statistical tools and machine learning algorithms to identify the degradation trend of bearings based on the observed data [22,24]. Time series prediction methods such as the Kalman filter [26] and the autoregressive integrated moving average [27], and machine learning techniques such as the support vector machine (SVM) [28] and its variation [29], are used to characterize the degradation process. Sutrisno et al. [30] extract the main feature from the vibration signals of ball bearings by using principal component analysis (PCA). A least-square SVM then constructs a regression model to predict RUL. Since most monitoring signals of bearings remain stable for over 90% of the time and only rise suddenly close to failure, a single regression model built upon training samples is not effective in predicting RUL. Wang et al. [31] separate the fault detection module and the RUL prediction module. The fault detection module uses PCA to estimate the fault type based on the frequency feature of the envelope analysis. A regression model predicts the RUL value if the fault is detected. Liu et al. [32] divided the entire degradation process of bearings into multiple health states where a regression model is locally applied. After selecting the main features out of the extracted fault features, they condensed the degradation trend into three representative states by using unsupervised learning and supervised learning. SVM is then used as the primary tool to predict the RUL value of the regression problem.
Bayesian approaches are also used to estimate the RUL value. Mosallam et al. [33] proposed a data-driven approach consisting of offline and online phases to predict the RUL value. In the offline phase, the proposed method builds different health indicators representing degradation as a function of time using unsupervised variable selection, PCA, and trend extraction. In the online phase, the K-nearest neighbors algorithm finds the most similar offline health indicator based on the time series data. Then, the Bayesian filter estimates the degradation state using the linear model of the selected health indicator. The method is evaluated using battery and turbofan engine degradation simulation data from the National Aeronautics and Space Administration data repository. Yu et al. [34] developed a collaboration algorithm combining the Bayesian method and expectation conditional maximization algorithm to estimate RUL of the newly made system using real-time sensing data. Residual life distributions and posterior distributions are first calculated through the Bayesian updating method based on random initial priori distributions. Then, the prior distributions are revised and improved for future predictions by the expectation conditional maximization algorithm. A set of simulation results is used to verify the applicability of the proposed fusion algorithm. While previous Bayesian approaches [33,34] consider the specific degradation model and the error fluctuation for the given degradation path, it is hard to guarantee the optimal static model by only evaluating the estimation performance based on the partial information. Furthermore, the stochastic parameters of the degradation model may not follow the general assumption of the normal distribution in practice.
Deep learning (DL) is a strong candidate technique for constructing the relationship between raw observed data and the high-level degradation process through multiple layers. Many researchers have applied DL techniques to fault detection and diagnosis [35][36][37], while only a few DL-based studies address RUL prediction. Guo et al. [38] proposed an RUL prediction approach combining a convolutional neural network (CNN) and an outlier region correction. The proposed CNN consists of multiple convolution and pooling layers to extract features and a logistic regression to convert the output features into a health indicator. Furthermore, an outlier correction technique detects and removes outliers based on a confidence interval of the constructed health indicator. Ren et al. [39] used time and frequency domain features as the input and then built a multi-bearing RUL prediction model using a deep neural network. This collaborative prediction relies on the RUL model of the same type of bearings under the identically controlled operating condition. While these works [38,39] mainly evaluate the performance benefits in terms of the health indicator estimation, predicting the health indicator is not enough to estimate the RUL value. Zhao et al. [40] combined a CNN, which extracts deep features from raw sequential data, with a bi-directional long short-term memory network (LSTM), which encodes the temporal information, for the RUL prediction of a high-speed machine. The bi-directional LSTM captures long-term dependencies of time series data in forward and backward ways. A fully-connected layer and a regression layer are then added to predict the target RUL value.
The data collection of whole-life bearings is expensive since the degradation process generally takes several months or years. Furthermore, the degradation data of bearings are not even available or cover only partial characteristics of data distribution in practice. Since many external disturbances considerably affect the degradation behaviors of bearings, the degradation processes of the same bearings have significantly different signal distributions, even under identically controlled operating conditions. The traditional Bayesian approaches may fail to provide accurate RUL estimation since it is not feasible to derive a universal analytical degradation model in practice. Furthermore, most existing DL works will face substantial performance losses if the data distributions of training and test sets are not equal [41]. Prior knowledge can help avoid such problems by combining analytical model-based and data-driven approaches. In this paper, we formulate the RUL estimation problem as the ridge regression to balance the training accuracy and generalization over the sliding window. The sliding window-based model optimization essentially reduces the performance sensitivity of the RUL estimation with the analytical degradation model. Figure 1 illustrates two main components of the prognosis workflow to predict RUL of degraded bearings, namely health indicator construction and RUL estimation. The prognosis framework predicts when a failure will occur based on the historical and current machine data, including temperature, pressure, or vibration measurements. The health indicator construction module consists of feature extraction and feature postprocessing to construct a reliable and predictable health indicator as a system degrades. Potential indicators include mean, standard deviation, kurtosis, and peak-to-peak value to quantify the chaotic behavior of a signal. 
Feature postprocessing techniques identify suitable features to serve as the health indicator in the offline phase. In our approach, feature importance ranking and PCA reduce the feature dimension by eliminating irrelevant features.  In the next step, the RUL estimation module predicts the future value of the extracted health indicator based on the degradation model to estimate the RUL value in the online phase. The proposed prognostics integrate machine learning and mathematical degradation model of the time series data of the health indicator. The proposed feature importance ranking and PCA computes the health indicator based on the time series data in the online phase. We predict the time to cross the specific threshold value of the health indicator using the degradation model. An adaptive sliding window method then optimizes the model parameters of the degradation model by using the ridge regression of the time series data via the sliding window in the online phase.

Health Indicator Construction
In this section, we present the feature extraction and postprocessing to construct the health indicator.

Feature Extraction
Various signal processing techniques, including time domain, frequency domain, and time-frequency domain analysis, are used to construct the features. We consider not only basic statistical measures such as mean, standard deviation, and root mean square (RMS) but also the higher-order statistics such as skewness and kurtosis. All these statistics can be expected to change as a deteriorating fault signature intrudes the nominal signal [4,7]. Table 1 shows all statistical features used to build the health indicator, including 10 time domain features, 5 frequency domain features, and 12 time-frequency domain features.

Table 1. Statistical features.
Domain | Statistical features
Time domain | mean, standard deviation, RMS, maximum-to-minimum difference, skewness, kurtosis, energy, signal median absolute deviation, crest factor, and shape factor
Frequency domain | mean frequency, skewness, and kurtosis of the power spectrum; mean and standard deviation of the local maxima of the power spectrum
Time-frequency domain | mean, standard deviation, skewness, and kurtosis of the conditional spectral moment with different orders between 2 and 4

• Time domain feature: Simple statistical features of time domain signals can serve as health indicators for predicting RUL. For instance, the average value or variance of a specific signal increases as the system performance degrades [7]. Furthermore, the higher-order statistics provide insight into system behavior through the third moment (skewness) and fourth moment (kurtosis) of the signal [4]. We use various statistical metrics of the time domain analysis including mean, standard deviation, RMS, skewness, kurtosis, maximum-to-minimum difference, energy (the sum of squares), signal median absolute deviation, crest factor (the peak value divided by the RMS), and shape factor (the RMS divided by the mean of the absolute value).
• Frequency domain feature: Spectral analysis extracts useful features for predicting the RUL of machinery such as bearings, gears, and engines [7,11]. The frequency domain features include power bandwidth, mean frequency, signal-to-noise ratio, and local maxima of the power spectrum of the signal. For example, the peak value of a signal spectrum or the frequency at which the peak magnitude occurs changes as the machine degrades. The mean frequency, kurtosis, and skewness of the power spectrum, and the mean and standard deviation of the local maxima of the power spectrum, are used as the statistical features of the frequency domain.
• Time-frequency domain feature: Another way to quantify the chaotic behavior is through time-frequency spectral properties such as spectral kurtosis and spectral entropy. Spectral kurtosis in the frequency domain, for example, is considered a powerful method for the RUL prediction of wind turbines [42]. Furthermore, the time-frequency moment effectively characterizes the frequency changes in time of non-stationary signals [43]. The short-time Fourier transform technique is used to capture the time-varying frequency behavior because the classical Fourier analysis fails to analyze the time-varying behavior. The conditional spectral moment of the time-frequency distribution of a signal is computed for a given sampling rate and order between 2 and 4. We then use statistical metrics such as the mean, standard deviation, skewness, and kurtosis of the conditional spectral moment with different orders.
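The ten time domain features listed above can be illustrated with a short sketch. It is a minimal example rather than the authors' implementation: the function name `time_domain_features` is hypothetical, population (biased) moments are assumed for the standard deviation, skewness, and kurtosis, and the input is assumed to be a non-constant signal snapshot.

```python
import math
import statistics


def time_domain_features(signal):
    """Compute ten time domain statistics of one signal snapshot.

    Assumptions: population (biased) moments, non-constant input.
    """
    n = len(signal)
    mean = sum(signal) / n
    # Central moments of order 2, 3, and 4.
    m2 = sum((v - mean) ** 2 for v in signal) / n
    m3 = sum((v - mean) ** 3 for v in signal) / n
    m4 = sum((v - mean) ** 4 for v in signal) / n
    energy = sum(v * v for v in signal)
    rms = math.sqrt(energy / n)
    abs_mean = sum(abs(v) for v in signal) / n
    peak = max(abs(v) for v in signal)
    median = statistics.median(signal)
    return {
        "mean": mean,
        "std": math.sqrt(m2),
        "rms": rms,
        "max_min_difference": max(signal) - min(signal),
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2,
        "energy": energy,
        "median_abs_deviation": statistics.median(abs(v - median) for v in signal),
        "crest_factor": peak / rms,       # peak value divided by RMS
        "shape_factor": rms / abs_mean,   # RMS divided by mean absolute value
    }
```

In practice such a function would be applied to each recorded vibration snapshot, producing one feature vector per sampling sequence.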
The noise of various features substantially degrades the accuracy of the RUL prediction. Furthermore, the feature importance rank based on the monotonicity, as described below, is vulnerable to noise. We apply a moving average filter to reduce the noise effect of the features where no future feature value is used.
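A moving average that uses no future feature value is a causal filter: each output depends only on the current and earlier samples. The sketch below shows one way to write it. The function name and the handling of the first samples (averaging over however many samples are available) are assumptions, not details given in the text.

```python
def causal_moving_average(values, window):
    """Smooth a feature sequence using only current and past samples."""
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    running_sum = 0.0
    for k, value in enumerate(values):
        running_sum += value
        if k >= window:
            # Drop the sample that just left the window.
            running_sum -= values[k - window]
        # Before the window is full, average over the samples seen so far.
        smoothed.append(running_sum / min(k + 1, window))
    return smoothed
```

Because the filter is causal, the smoothed feature lags the raw feature, which is the delay effect that a later threshold choice has to account for.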

Feature Postprocessing
Feature postprocessing techniques construct a suitable health indicator to predict RUL built upon various statistical metrics in Table 1. For reliable RUL estimations, a health indicator must be observable and correlated with the system degradation process over time. The feature importance ranking and PCA identify the health indicator and threshold values for the RUL prediction.
We first select main features out of all available features in Table 1 to build a reliable RUL prediction model. A suitable condition indicator has a consistent positive or negative behavior as a system gets closer to failure. We use the monotonicity as the feature selection metric to quantify the importance of the features [42]. The monotonicity of the ith feature is defined as

$$\mathrm{monotonicity}(x_i) = \frac{\left| N_p - N_n \right|}{N - 1}, \qquad \delta_i(t) = x_i(t+1) - x_i(t), \tag{1}$$

with $x_i(t)$ the ith feature signal at sampling sequence $t$, $N$ the number of measured samples, and $\delta_i$ the difference of the ith feature signal $x_i$ over sampling sequences. $N_p$ (respectively, $N_n$) is the number of positive (respectively, negative) values of $\delta_i$ for all training data. The monotonicity evaluates the feature importance score on a scale ranging from 0 to 1. A higher ranked feature tracks the degradation process more reliably and, hence, is more suitable to train the RUL prediction model. Features with a large importance score are selected for feature postprocessing based on the training data in the offline phase.
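The monotonicity score can be sketched in a few lines. This example assumes the common definition in which the score is the absolute difference between the number of positive and negative first differences, divided by N - 1. The function names and the default selection threshold of 0.4 are illustrative.

```python
def monotonicity(feature):
    """Score in [0, 1]: 1 for a strictly monotonic sequence, 0 for a balanced one."""
    n = len(feature)
    if n < 2:
        raise ValueError("need at least two samples")
    diffs = [b - a for a, b in zip(feature, feature[1:])]
    n_pos = sum(d > 0 for d in diffs)
    n_neg = sum(d < 0 for d in diffs)
    return abs(n_pos - n_neg) / (n - 1)


def select_features(features_by_name, min_score=0.4):
    """Keep the names of features whose monotonicity exceeds min_score."""
    return [name for name, series in features_by_name.items()
            if monotonicity(series) > min_score]
```

Because the score counts sign changes of consecutive differences, it is sensitive to noise, which is why smoothing the features beforehand matters.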
Once we identify the main features, PCA extracts the health indicator by reducing the feature dimension. PCA essentially reduces a system of a large number of features into a few principal components using a linear transformation while maintaining most of the variability from the feature set [30,31]. Note that the mean and the standard deviation obtained from training data are used to normalize the features of the entire datasets. After PCA is applied to the feature set of the training data, we select the specific principal component out of multiple principal components as the health indicator if it increases as the machine degrades.
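The normalization and projection steps can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions: PCA is computed through a singular value decomposition of the standardized training features, the first principal component is used as the health indicator, and its arbitrary sign is flipped so that the indicator increases over the training run. The function names are hypothetical.

```python
import numpy as np


def fit_health_indicator(train_features):
    """Fit normalization statistics and the first principal direction.

    train_features: array of shape (n_samples, n_features), ordered in time.
    """
    mu = train_features.mean(axis=0)
    sigma = train_features.std(axis=0)
    z = (train_features - mu) / sigma
    # Rows of vt are principal directions, sorted by explained variance.
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    direction = vt[0]
    # The sign of a principal direction is arbitrary; orient it so the
    # projected score grows with time (assumption for this sketch).
    scores = z @ direction
    if np.corrcoef(np.arange(len(scores)), scores)[0, 1] < 0:
        direction = -direction
    return mu, sigma, direction


def health_indicator(features, mu, sigma, direction):
    """Project features (training or test) using the training statistics."""
    return ((features - mu) / sigma) @ direction
```

Test data are projected with the `mu`, `sigma`, and `direction` obtained from the training data only, which matches the note that training statistics normalize the entire dataset.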

Parameter Estimation of Degradation Model
We estimate the RUL value by developing a degradation model of the time series health indicator. The degradation model identifies a dynamic model that describes the failure process of the system behavior. It predicts the health indicator and the time to cross a specific threshold of the health indicator as the failure indicator.
We assume that the health indicator $h(t_i)$ of bearings is observed at discrete times $t_1, t_2, \ldots$, where $t_i \ge 0$. Exponential degradation models are useful when the machine component experiences cumulative degradation [8]. Besides, the linear degradation model is useful if the observed system does not have cumulative degradation processes [33,44]. Hence, the considered degradation model of Equation (2), $\hat{h}_w(t_i)$, combines the exponential model and the linear model, where $\phi$, $\beta_w$, and $\alpha_w$ are parameters updated from the health indicator over the sliding window. Compared to the existing exponential degradation model [8], the exponential part of Equation (2) reduces the flexibility of the degradation model. The sliding window method adapts the local model parameters of the time series data to compensate for the fundamental limits of the analytical degradation model. We evaluate the RUL prediction accuracy and robustness of the proposed scheme in Section 7.
The adaptive sliding window method relies on ridge regression to balance training accuracy and generalization accuracy. We formulate the ridge regression problem by combining the prediction error term and the regularization term with different weights on the model parameters $x = [\phi, \alpha_w, \beta_w]$ to avoid overfitting and reduce the critical effect of the exponential term of Equation (2). The objective of the parameter optimization problem of the adaptive sliding window method has two terms: the first term is the mean square error between the measured health indicators $h(t_j)$ and the estimated health indicators $\hat{h}_w(t_j)$ over the sliding length $s$, and the second term is the weighted regularization term with diagonal matrix $W \in \mathbb{R}^{3 \times 3}$. Each element of the diagonal matrix $W$ corresponds to the weight of the corresponding decision variable of $x$.
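As a worked illustration of the model and cost described above, one plausible instance is written out below. It is an assumed form chosen to be consistent with the stated parameter vector and the quadratic cost, not a verbatim reproduction of the paper's equations. The assumptions are that $\alpha_w$ scales the exponential term, that $\beta_w$ is the linear slope, that the exponent carries no free rate parameter (so the model is linear in $x$), and that the regularizer is $x^\top W x$.

```latex
% Assumed combined degradation model (linear in the parameters)
\hat{h}_w(t_i) = \phi + \alpha_w e^{t_i} + \beta_w t_i

% Assumed ridge problem over the last s samples
\min_{x} \; \frac{1}{s} \sum_{j=i-s+1}^{i} \bigl( h(t_j) - \hat{h}_w(t_j) \bigr)^2 + x^\top W x,
\qquad x = [\phi, \alpha_w, \beta_w]^\top

% Matrix form, with one row of A per sample in the window
J(x) = \frac{1}{s} \lVert A x - b \rVert_2^2 + x^\top W x,
\qquad
A = \begin{bmatrix} 1 & e^{t_{i-s+1}} & t_{i-s+1} \\ \vdots & \vdots & \vdots \\ 1 & e^{t_i} & t_i \end{bmatrix},
\quad
b = \begin{bmatrix} h(t_{i-s+1}) \\ \vdots \\ h(t_i) \end{bmatrix}

% Setting the gradient (2/s) A^T (A x - b) + 2 W x to zero gives
x^\star = \bigl( A^\top A + s W \bigr)^{-1} A^\top b
```

Under these assumptions, a larger diagonal entry of $W$ for $\alpha_w$ shrinks the exponential coefficient more strongly, which is one way to limit the influence of the exponential term on short windows.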
Since the objective function is a regularized quadratic cost function, we rewrite it in matrix form, which yields the optimal solution in closed form.
Determining the optimal length of the sliding window is a complex task since it must take into account both the reduction of the noise effect and the fault detection delay. The sliding window length is set to 30 s based on the estimation accuracy of the health indicator of the training sets for different sliding window lengths from 10 to 102 s.
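The sliding window optimization can be sketched in code as follows. This is a minimal sketch, not the authors' implementation. It assumes a model of the form phi + alpha * exp(t) + beta * t, a regularizer x^T W x with diagonal W, and time values scaled (for example to hours) so that exp(t) stays finite. The function names, the grid search for the threshold crossing, and all parameter values are hypothetical.

```python
import numpy as np


def fit_window(times, indicator, weights):
    """Closed-form ridge fit of [phi, alpha, beta] on one window (assumed model)."""
    t = np.asarray(times, dtype=float)
    h = np.asarray(indicator, dtype=float)
    # Design matrix: intercept, exponential term, linear term.
    A = np.column_stack([np.ones_like(t), np.exp(t), t])
    W = np.diag(weights)
    s = len(t)
    # Minimizer of (1/s) * ||A x - h||^2 + x^T W x.
    return np.linalg.solve(A.T @ A + s * W, A.T @ h)


def predict(params, t):
    phi, alpha, beta = params
    return phi + alpha * np.exp(t) + beta * t


def estimate_rul(params, t_now, threshold, horizon, step):
    """Time until the fitted model first reaches threshold, or None within horizon."""
    future = np.arange(t_now, t_now + horizon + step, step)
    crossed = np.nonzero(predict(params, future) >= threshold)[0]
    if crossed.size == 0:
        return None
    return float(future[crossed[0]] - t_now)


def sliding_rul(times, indicator, window, weights, threshold, horizon, step):
    """Refit on the most recent `window` samples at every step and estimate RUL."""
    estimates = []
    for i in range(window, len(times) + 1):
        params = fit_window(times[i - window:i], indicator[i - window:i], weights)
        estimates.append(estimate_rul(params, times[i - 1], threshold, horizon, step))
    return estimates
```

Refitting on each window lets the local parameters follow a health indicator that is nearly flat for most of the run and rises sharply near failure, at the cost of one small 3 x 3 linear solve per step.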

Evaluation Setup
In this section, we describe the benchmark dataset of the mechanical bearings and the existing Bayesian approach against which we compare our proposed method.

PHM Challenge Problem
PRONOSTIA is an experimental platform developed by the FEMTO-ST institute to evaluate and validate various fault detection and diagnostic algorithms of ball bearings [45]. The platform consists of three parts, namely a rotating part, a measurement part, and a degradation generation part. It provides experimental measurements of accelerated degradation of mechanical bearings with constant or variable rotating speed and load force. Dedicated sensors measure the vibration and temperature of the platform. Two miniature accelerometers are installed orthogonally on the external race of the bearing to monitor the vertical and horizontal vibrations. The temperature sensor is located near the external ring of the bearing. These measurements are used to extract the health indicator of ball bearings.
The PRONOSTIA platform provides various run-to-failure experimental data since it enables experimental degradation within a few hours. Seventeen experimental data cases are generated under three different operating conditions of rotating speed and load force, as summarized in Table 2. The experimental data include 6 run-to-failure training sets and 11 remaining test sets to build RUL estimation models and evaluate the accuracy. Note that we use the notation i_j to denote the jth dataset of operating condition i where i ∈ {A, B, C}. The vibration and temperature signals are recorded with sampling frequencies of 25.6 kHz and 10 Hz, respectively. Each experiment is terminated if the vibration signal reaches 20 g to prevent testbed damage. Hence, RUL is defined as the time until the accelerometer reading exceeds 20 g. Note that the RUL prediction method is only able to use the measurements of ball bearings since the detailed information of the degradation process is not provided. Figure 2 depicts the raw horizontal vibration signal of the cases Bearing A_1, Bearing A_2, Bearing B_1, Bearing B_2, Bearing C_1, and Bearing C_2 obtained during a whole experiment. The degradation processes of bearings have considerably different behaviors with various experimental durations (1-7 h). Efficient estimator design is challenging due to a small amount of training data with a high variation of the life duration of all bearings. As the machine progressively approaches failure, the vibration signal impulsiveness increases. While the acceleration under operating condition A grows slowly over time, it rises suddenly under condition C. Furthermore, we observe significantly different behaviors even under the same operating condition, such as cases Bearing A_1 and Bearing A_2. The raw vibration signals do not have a clear correlation with the operating conditions.
In a typical framework, the machine learning trains a regression model based on the data of some bearings and uses it to predict the RUL value of the other bearings. However, even if we obtain the measurements of the same type of bearing under the identical operating condition, their probabilistic density functions are considerably different due to various uncertainties. Hence, most existing machine learning algorithms result in a substantial performance loss because of the violation of the i.i.d. condition even with the same operating condition. While many mathematical models using frequency domain features efficiently detect the faults of the inner and outer races of ball bearings, they are hard to apply in practice due to the extraction difficulty of the frequency features under complex interactions [45]. Moreover, the statistical features of various cases tend to have different noise levels depending on the degradation process. Hence, a single regression model trained on limited data is difficult to generalize for other cases.

Bayesian Approach
The Bayesian method estimates the parameters of the degradation model based on its prior probability distribution and current measurements [8]. Two degradation models are considered, namely exponential and linear degradation models, to predict the RUL of bearings using experimental data. The discrete-time exponential degradation model is defined as

$$\hat{h}_e(t_i) = \phi_e + \alpha_e \exp\left( \theta_e t_i + \epsilon(t_i) - \frac{\sigma^2}{2} \right), \tag{6}$$

where the predicted health indicator $\hat{h}_e(t_i)$ is a function of time, $\phi_e$ is the constant intercept, $\alpha_e$ and $\theta_e$ are stochastic parameters deciding the slope of the model, and $\epsilon$ is a normally distributed random error with $N(0, \sigma^2)$. In addition, $\alpha_e$ is lognormal-distributed and $\theta_e$ is Gaussian-distributed [8].
We also define the discrete-time linear degradation model as

$$\hat{h}_l(t_i) = \phi_l + \beta_l t_i + \epsilon(t_i), \tag{7}$$

where $\phi_l$ is the model intercept, $\beta_l$ is a Gaussian-distributed random parameter determining the slope, and $\epsilon$ is a normally distributed random noise with $N(0, \sigma^2)$. At each time step $t_i$, the Bayesian estimator updates the distribution of the model parameters of Equations (6) and (7) based on the previous knowledge of the model and the latest observation of $h(t_i)$ as the basis for the RUL prediction. We initially set the slope of the exponential degradation model based on the historical training data. If historical data are not available, the priors of the slope parameters are randomly selected with large variances to rely more on the observed data during the initial setup.
Since the performance of Bayesian methods considerably depends on the models and the data, as discussed below, we also propose an adaptive Bayesian method to choose one of the models between the exponential degradation model and the linear degradation model based on the residual error between the proposed model and measured health indicator.
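The model selection step of the adaptive Bayesian method can be sketched as follows. The text only states that the choice is based on the residual error between the model and the measured health indicator. Using the mean squared residual over a set of recent samples is an assumption of this sketch, as are the function name and the dictionary interface.

```python
def select_model(measured, predictions_by_model):
    """Return the name of the model with the smallest mean squared residual.

    measured: recent health indicator measurements.
    predictions_by_model: maps a model name to its predictions at the same times.
    """
    def mean_squared_residual(predicted):
        return sum((m - p) ** 2 for m, p in zip(measured, predicted)) / len(measured)

    return min(predictions_by_model,
               key=lambda name: mean_squared_residual(predictions_by_model[name]))
```

For a slowly and steadily increasing health indicator, a rule of this kind would be expected to pick the linear model, while a sharply accelerating indicator would favor the exponential model.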

Performance Evaluation
In this section, we evaluate the performance of the RUL estimation obtained by the adaptive sliding window method and the Bayesian method for various experimental data. We first discuss the feature extraction and postprocessing and then evaluate the accuracy of the RUL prediction. Figure 3 shows the power spectrum as a function of time and frequency of Bearing B_2. The colorbar indicates the fault severity normalized to 1, dependent on the RUL value of the experiment. As the machine progressively approaches to failure, the power spectrum around 13 kHz gradually increases. Hence, the statistical features of the power spectrum are potential indicators of degraded bearings.

Feature Extraction and Postprocessing
Various statistical features of time domain, frequency domain, and time-frequency domain signals, as listed in Table 1, are used to construct the health indicator. We only select main features with monotonicity greater than 0.4 as the input to PCA for feature fusion by considering all training sets. Note that we only use the training data to select the main features for the health indicator. These features are standard deviation, RMS, energy, median absolute deviation of the time domain, and mean and standard deviation of the conditional spectral moment with orders 2 and 3 in the time-frequency domain. Besides, the mean, skewness, kurtosis, and standard deviation of the local maxima of the power spectrum are selected in the frequency domain. One interesting finding is that the operating condition is not a critical factor in the feature importance ranking using the monotonicity. Once the main features are selected using the monotonicity rank, PCA is applied to construct the health indicator by reducing the dimension of feature spaces. Before performing PCA, the mean and standard deviation of training data normalize the whole data. Figure 4 shows the projected data onto the first two principal components obtained by PCA of Bearing A_2. We clearly observe that the first principal component increases as the machine gets closer to failure. We use the first principal component as the health indicator for the RUL prediction. Figure 5 presents the health indicators of various training cases Bearing A_1, Bearing A_2, Bearing B_1, Bearing B_2, Bearing C_1, Bearing C_2 as a function of experimental time. We observe that the health indicator monotonically increases under various operating conditions with different cases. While each case shows significantly different health indicator at the initial state, all cases converge to a similar health indicator value as the machine approaches failure. 
Based on these observations, we set the threshold as 9, around 90% of the maximum value of the health indicators of the training data, to reduce the delay effect of smoothing.
Another interesting observation is that the operating condition is not a dominant factor to characterize the health indicator. For instance, two different cases Bearing A_1 and Bearing A_2 of same operating condition are considerably different. On the other hand, the health indicators of both Bearing B_2 and Bearing C_2 are similar and rise rapidly when the bearing is closer to the end, even if they have different operating conditions. Hence, it clearly shows that the health indicator significantly varies due to the complex interactions of machine components and high uncertainty.  Figure 6 presents the health indicator and RUL of true measurements, adaptive sliding window method, and Bayesian methods using the exponential and the linear degradation models of Bearing B_6 as a function of experimental time. In Figure 6b, we also show the α bound of the true RUL where α is set to 20%. Figure 6a depicts that the health indicator of Bearing B_6 gradually increases as a function of experimental time. Both the adaptive sliding window method and the Bayesian method using the linear degradation model follow the true health indicator. On the contrary, the Bayesian method using the exponential degradation model fails to estimate the health indicator. The main reason is the fundamental model limits of the exponential degradation model with respect to the slowly degraded health indicator of Bearing B_6. By comparing Figure 6a,b, we observe that predicting the health indicator is not enough to estimate the RUL value due to the complex interaction between machine components with high uncertainty during whole experimental times. In fact, some experimental cases show sudden performance degradation when it gets closer to failure, as shown in Figure 2. Hence, it is infeasible to provide an accurate estimation of RUL at the beginning of the experimental operations. 
However, the adaptive sliding window method still provides the reasonable estimation of RUL within the α-bound of the true RUL for Bearing B_6 as it is closer to failure, as shown in Figure 6b. Both Bayesian methods using different models generally provide the conservative RUL prediction during the experimental time. Although the Bayesian method using the exponential degradation model does not provide the reasonable RUL estimation for Bearing B_6 due to the fundamental model limit, the performance of different methods considerably depends on the degradation trends of the real health indicator of each case, as we show below.

RUL Evaluation
We use the error ratio metric to evaluate the RUL prediction of different methods. Let us denote $y_i$ and $\hat{y}_i$ as the actual RUL to be predicted and the estimated RUL of the ith dataset, respectively. The error ratio of the ith dataset is defined by

$$R_i = \frac{y_i - \hat{y}_i}{y_i}. \tag{8}$$

The effects of RUL underestimation and overestimation are not the same in practice. While early predictions of RUL, $R_i > 0$, are considered a reasonable estimation at the cost of early removal, an RUL estimate that exceeds the actual RUL, $R_i < 0$, incurs more severe consequences.
Based on this observation, the IEEE PHM Data Challenge defines a scoring function

$$\mathrm{Score} = \frac{1}{11} \sum_{i=1}^{11} A_i,$$

where

$$A_i = \begin{cases} \exp\left( -\ln(0.5) \cdot P_i / 5 \right), & P_i \le 0, \\ \exp\left( +\ln(0.5) \cdot P_i / 20 \right), & P_i > 0, \end{cases}$$

and $P_i$ denotes the percent error equal to Equation (8) multiplied by 100 for the ith testing set [45]. The accuracy of the RUL prediction becomes more critical as the machine approaches failure over time. To analyze the RUL prediction performance over time, we evaluate the error ratio of different solutions in each interval of 10% of the whole experimental time.
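The asymmetric scoring can be illustrated with a short sketch. It assumes the exponential penalty used in the IEEE PHM 2012 Data Challenge, in which overestimated RUL (negative percent error) is penalized four times faster than underestimated RUL. The function names are illustrative, and the score is averaged over however many test sets are passed in.

```python
import math


def error_ratio(actual_rul, predicted_rul):
    """Positive for early (underestimated) predictions, negative for late ones."""
    return (actual_rul - predicted_rul) / actual_rul


def accuracy(percent_error):
    """Per-dataset accuracy in (0, 1]; equals 1 only for a perfect prediction."""
    if percent_error <= 0:
        # Late prediction: accuracy halves for every 5% of overestimation.
        return math.exp(-math.log(0.5) * percent_error / 5.0)
    # Early prediction: accuracy halves for every 20% of underestimation.
    return math.exp(math.log(0.5) * percent_error / 20.0)


def phm_score(actual_ruls, predicted_ruls):
    """Mean accuracy over all test sets."""
    accuracies = [accuracy(100.0 * error_ratio(a, p))
                  for a, p in zip(actual_ruls, predicted_ruls)]
    return sum(accuracies) / len(accuracies)
```

With this penalty, a prediction that is 20% early and one that is 5% late receive the same accuracy of 0.5, which reflects the more severe consequences of overestimating RUL.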
In this evaluation, we show the RUL prediction between 60% and 100% of the experimental time of each case since the first 40% of data are not a critical factor for the practical purpose of maintenance. For instance, the error ratio between 90% and 100% of the experimental time is the average error ratio over the last 10% of the experimental data.
Figure 7 presents the normalized histogram of error ratios in each 10% interval from 60% to 100% of the experimental time using different approaches for all cases. The adaptive sliding window method, the Bayesian methods using the exponential and the linear degradation models, and the adaptive Bayesian method provide considerably different error ratios of the RUL estimation as a function of time. One interesting observation is that the overall error ratio of the Bayesian methods gradually shifts from positive values toward 0 over time. This means that these methods underestimate the RUL less severely as the machine approaches failure. While the histogram of the adaptive Bayesian method is similar to that of the exponential degradation model for error ratios around 0, the share of large error ratios greater than 0.6 is reduced thanks to the linear degradation model. The RUL prediction error ratio of the adaptive Bayesian method is thus improved compared to those using either the exponential or the linear degradation model alone.
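The interval-wise evaluation can be sketched as follows. The snippet averages the error ratio over each 10% slice of one run between 60% and 100% of the experimental time. It assumes uniformly sampled RUL sequences, an error ratio of the form (y − ŷ)/y, and a run long enough that every slice contains at least one sample; the function name `interval_error_ratios` and the synthetic data are hypothetical.

```python
import numpy as np


def interval_error_ratios(true_rul, pred_rul, start_pct=60, step_pct=10):
    """Average error ratio in each [lo%, hi%) slice of the experimental time."""
    true_rul = np.asarray(true_rul, dtype=float)
    pred_rul = np.asarray(pred_rul, dtype=float)
    ratio = (true_rul - pred_rul) / true_rul
    n = len(ratio)
    result = {}
    for lo in range(start_pct, 100, step_pct):
        hi = min(lo + step_pct, 100)
        # Integer index edges avoid floating-point issues at slice boundaries.
        i0, i1 = (n * lo) // 100, (n * hi) // 100
        result[(lo, hi)] = float(np.mean(ratio[i0:i1]))
    return result


# Synthetic run with 100 samples: the true RUL counts down from 100 to 1,
# and the estimate is consistently 10% early (illustrative only).
true_rul = np.arange(100, 0, -1, dtype=float)
pred_rul = 0.9 * true_rul
ratios = interval_error_ratios(true_rul, pred_rul)
for interval, value in ratios.items():
    print(interval, round(value, 3))
```

Collecting these per-interval values over all cases gives the data from which a histogram per interval can be drawn.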
While the error ratio using Bayesian methods is spread around 0.5, the one using the adaptive sliding window method is concentrated around 0. We observe that most absolute error ratios between 90% and 100% of the experimental time are less than 0.1 or larger than 1 for all cases. The main reason for the large negative or positive ratios is the small value of the true RUL in the denominator of Equation (8) as the machine approaches the end of its life.
The score of the adaptive sliding window method is 0.6028, greater than that of the adaptive Bayesian method (0.4605). Note that the score values are 0.2554 and 0.3827 for the Bayesian methods using the exponential and the linear degradation models, respectively. The score of the recently developed multiscale CNN method [10] is 0.3624, while the adaptive sliding window method gives 0.3117 for the five test sets of operating condition A. Note that Zhu et al. [10] only provided the score value for operating condition A. However, our proposed scheme significantly reduces the computational complexity while guaranteeing robustness against possible uncertainty. Note that DL models are vulnerable to input uncertainties or adversarial attacks [14,15].
Mean absolute error (MAE) and mean absolute percentage error (MAPE) are used to evaluate the prediction performance with the following formulas:
$$\text{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right|, \qquad \text{MAPE} = \frac{1}{N} \sum_{i=1}^{N} \left| \frac{y_i - \hat{y}_i}{y_i} \right|,$$
where $N$ is the number of samples. Note that we do not show the root-mean-squared error since it has a similar trend to the MAE results. Figure 8 shows the MAE and MAPE between the true RUL and the predicted RUL obtained by the adaptive sliding window method, the Bayesian methods using the exponential and the linear degradation models, and the adaptive Bayesian method for all cases. We observe that the approaches have considerably different MAEs of the RUL estimation depending on the case.
Since we do not reuse any specific model parameters of the training sets to evaluate the testing sets, the performance on the training sets is not necessarily better than that on the testing sets. The adaptive sliding window method provides the smallest MAE for 12 cases out of 17 cases. It outperforms the other methods in all cases with operating conditions B and C except Bearing B_7. The RUL MAE of the adaptive sliding window method is worse than that of the other approaches only for Bearing A_3, Bearing A_4, Bearing A_5, Bearing A_7, and Bearing B_7. This is the main reason for the high positive or negative error ratio of the adaptive sliding window method in Figure 7. The adaptive Bayesian method outperforms the adaptive sliding window method for five cases out of 17 cases. These five cases consist of four cases with condition A and a single case with condition B. Among these five cases, the Bayesian estimator using the exponential degradation model performs better than the one using the linear degradation model for three cases with condition A. This is an interesting observation since the linear degradation model outperforms the exponential one for 10 cases out of 17 cases. While the adaptive sliding window method generally outperforms the Bayesian method using the linear degradation model, it cannot fully replace the Bayesian method using the exponential degradation model. The main reason is the fundamental limit of the degradation model of Equation (2).
The overall trends of the MAPE results are similar to those of the MAE results. The average of all MAPE values of the adaptive sliding window method is 0.3087, lower than that of the adaptive Bayesian method (0.4343). Note that the average values of all MAPEs are 0.6056 and 0.5229 for the Bayesian methods using the exponential and the linear degradation models, respectively.
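For reference, the MAE and MAPE used above can be computed as in the following sketch. It assumes the MAPE is expressed as a fraction rather than a percentage, consistent with the reported averages below 1. The helper `relative_reduction` and the sample RUL values are hypothetical; only the two average MAPE values in the last line are taken from the results above.

```python
import numpy as np


def mae(true_rul, pred_rul):
    """Mean absolute error between true and predicted RUL."""
    true_rul = np.asarray(true_rul, dtype=float)
    pred_rul = np.asarray(pred_rul, dtype=float)
    return float(np.mean(np.abs(true_rul - pred_rul)))


def mape(true_rul, pred_rul):
    """Mean absolute percentage error, expressed as a fraction."""
    true_rul = np.asarray(true_rul, dtype=float)
    pred_rul = np.asarray(pred_rul, dtype=float)
    return float(np.mean(np.abs((true_rul - pred_rul) / true_rul)))


def relative_reduction(baseline, proposed):
    """Fractional reduction of an error metric relative to a baseline."""
    return (baseline - proposed) / baseline


# Hypothetical RUL values, for illustration only.
true_rul = [100.0, 50.0, 10.0]
pred_rul = [90.0, 55.0, 12.0]
example_mae = mae(true_rul, pred_rul)
example_mape = mape(true_rul, pred_rul)
print(example_mae, example_mape)

# Average MAPE values reported above: adaptive Bayesian vs. sliding window.
mape_reduction = relative_reduction(0.4343, 0.3087)
print(round(mape_reduction, 3))
```

With the reported averages, the last line evaluates to roughly 0.29, i.e., a relative MAPE reduction of about 29% with respect to the adaptive Bayesian method.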

Conclusions
In this paper, we propose an RUL prediction method for mechanical bearings that combines model-based and data-driven approaches. The adaptive sliding window method optimizes the local parameters of the degradation model based on the ridge regression of the time series health indicator to estimate the RUL value. We compare the performance of the proposed scheme to existing Bayesian approaches for predicting the RUL value on the experimental data of the PHM data challenge problem. The adaptive sliding window method provides the smallest mean absolute error for 12 cases out of 17 cases. Furthermore, it achieves around 28% improvement in terms of mean absolute error and mean absolute percentage error compared to the Bayesian approaches. One interesting finding is that predicting the health indicator is not enough to estimate the RUL value due to the complex interaction between machine components with high uncertainty during the whole experimental time. Even though some existing Bayesian approaches estimate the health indicator well, their performance depends considerably on the degradation model. The adaptive sliding window approach provides significant performance improvement and robustness against possible errors of the degradation model compared to the Bayesian approaches. Robustness to model error is crucial for deploying the algorithm in industrial systems since it is not feasible to derive a universal degradation model even for the same bearing.
In practical industrial systems, full coverage of representative data for all possible failures and their combinations is typically not available, although it is a critical factor for accurate RUL estimation. For instance, our degradation model is not general enough to capture the various behaviors caused by the many uncertainties in practice. RUL estimation is essentially a behavior forecasting problem on time series data using sparse and delayed measurements with high noise and uncertainty. Furthermore, each failure mode typically has long-term dependencies along with short-term ones in the time series data. Future research will focus on bridging the gap between the traditional knowledge of mathematical models and deep learning techniques to estimate RUL using time series data.
Author Contributions: P.P. and P.D.M. conceived the main idea and the network model; P.P. and M.J. contributed to data analysis and simulation. All authors have read and agreed to the published version of the manuscript.