Bearing Prognostics: An Instance-Based Learning Approach with Feature Engineering, Data Augmentation, and Similarity Evaluation

: We propose an instance-based learning approach with data augmentation and similarity evaluation to estimate the remaining useful life (RUL) of a mechanical component for health management. The publicly available PRONOSTIA datasets, which provide accelerated degradation test data for bearings, are used in our study. The challenges with the datasets include a very limited number of run-to-failure examples, no failure mode information, and a wide range of bearing life spans. Without a large number of training samples, feature engineering is necessary. Principal component analysis is applied to the spectrogram of vibration signals to obtain prognostic feature sequences. A data augmentation strategy is developed to generate synthetic prognostic feature sequences using learning instances. Subsequently, similarities between the test and learning instances can be assessed using a root mean squared (RMS) difference measure. Finally, an ensemble method is developed to aggregate the RUL estimates based on multiple similar prognostic feature sequences. The proposed approach demonstrates comparable performance with published solutions in the literature. It serves as an alternative method for solving the RUL estimation problem.


Introduction
An important task of prognostics and health management (PHM) is to estimate the remaining useful life (RUL). It is the time to failure if the monitored system continues to operate without intervention. Commonly, RUL estimation is tackled by first extracting prognostic features from data collected as part of the monitoring process to represent the degradation state. Then, a prognostic model is developed based on prognostic features and used to generate RUL estimates.
From the existing literature, RUL estimation methods can be categorized into physics or model-based, data-driven, and hybrid methods [1][2][3]. Physics-based methods require a description of degradation mechanisms for the system and its components, which can be challenging when such knowledge is limited. On the other hand, data-driven methods attempt to derive prognostic models from routinely collected data using statistics and machine learning approaches [4][5][6][7]. One commonly used hybrid approach is to describe the degradation process, e.g., trends and patterns, using the physics-based model and update model parameters with measured data using Bayesian inference, the maximum likelihood estimation (MLE) method, the extended Kalman filter (EKF), or particle filter (PF) [2].
With significant advances in machine learning and deep learning technologies, datadriven methods have attracted an increasing attention in the PHM community. For example, Babu et al. proposed a convolutional neural network (CNN) based regression approach to estimate RUL [5]. Zheng  short-term memory (LSTM) for the RUL estimation [6]. These approaches usually aim to construct a generalized prognostic model to map the underlying relationship between prognostic features and the estimated RUL based on a wide range of historic run-to-failure samples. However, in real-world applications, they have often encountered difficulties in constructing such generalized models without sufficient historic degradation samples.
As one of the machine learning approaches, the instance-based learning (IBL) approach utilizes the experience gained particularly from similar historic instances to solve new problem instances, rather than building a generalized prognostic model. Usually, there are three major steps in the IBL application for RUL prediction [8,9]: (i) instance retrieval from a historical dataset; (ii) prediction through local models based on the instances retrieved; (iii) aggregation of local predictions.
The instance retrieval step aims to select a set of learning instances of run-to-failure from historical datasets, whose degradation processes are deemed similar to the given test instance. Once completing the retrieval step, a local RUL prediction can be estimated by identifying the location of the current time cycle in the life cycle of each learning instance retrieved. In the aggregation step, an ensemble method is usually employed to generate the final estimation result from all local RUL predictions.
In order to perform instance retrieval, similarity evaluation between a learning instance and a test instance usually adopts the Euclidean distance measure. In early IBL applications for RUL prediction, similarity measure is based on a set of attributes that characterize each instance. In an application proposed by Bonissone et al., the RUL prediction for a given test locomotive was obtained by a fuzzy aggregation of a cluster of peers. In the cluster, each peer had a similar instance to the given locomotive in terms of a set of attributes related to the usage and maintenance history [10]. Xue et al. used a similar approach to make RUL prediction for a given aircraft engine. [11].
For RUL prediction applications, existing IBL approaches usually evaluate instance similarity based on point-wise comparison of feature sequences (i.e., time series), which consists of prognostic feature vectors or health indicators (HI) constructed at every observation time cycle. For example, Wang et al. created a library of degradation models using run-to-failure datasets from multiple engine units. Each of those prognostic models was characterized by HIs using a linear regression technique. The RUL for a given test engine was estimated using a weighted sum of RUL estimates based on the most similar instances [12]. In Wang's Ph.D. thesis [13], a RUL prediction method named Trajectory Similarity Based Prediction (TSBP) was further developed from the author's previous work [12]. In the TSBP method, two relaxing variables, namely time lag and scaling factor, were introduced in the Euclidean distance-based similarity definition to accommodate minor discrepancies in the degradation process between two comparing instances. Minimizing the distance-based similarity with the time lag is used to deal with the discrepancy in degradation process alignment, the scaling factor for discrepancy in degradation acceleration rate. Khelif et al. proposed a new similarity measure for instance retrieval in their IBL approach to improve the prediction accuracy [8]. The algorithm includes a comparison of the degradation processes between the test and learning instances, while giving heavier weights to recent observations. In an IBL approach for RUL prediction proposed by Ramasso et al. [9], the instance retrieval step is undertaken using a Euclidean distance measure only with respect to the last block of the test observation trajectory. In another paper presented by Ramasso [14], for each instance the entire time series of health indicator was decomposed into a lower envelope and upper envelope. The two envelopes formed a planar figure called a polygon, which was considered an imprecise health indicator (IHI). The proposed similarity evaluation approach makes use of computational geometry tools in order to estimate the nearest instances to a given testing instance.
It is important to recognize that system degradation behaviors are influenced by many factors including material and manufacturing variability. Hence, the same design and make of a component or a device may present different degradation behaviors under the same operating condition. It is challenging for the IBL approach to make accurate RUL estimations with limited historical degradation datasets. In this paper, we propose a novel IBL approach supported by data augmentation to address the shortage of historical data. We illustrate the method and its performance using the PRONOSTIA bearing datasets, which were provided to the IEEE PHM 2012 Prognostic Challenge [15,16]. In addition to dealing with limited run-to-failure examples in the learning datasets, the proposed method is capable of handling situations where learning and test instances may have different degradation rates. A key idea is the application of a data augmentation technique to generate multiple synthetic prognostic feature sequences by modifying the learning instance (i.e., run-to-failure example). From the multiple synthetic prognostic feature sequences, an optimal RUL estimate for the test instance can be obtained using a similarity evaluation method.
It is worth noting that the similarity measure used in this paper has a distance-based definition (named "RMS difference measure" in this paper) similar to the TSBP method [13] where the time lag and scaling factor are considered. In this paper, we introduce two manipulating variables in similarity evaluation accordingly from the data augmentation point of view. In the TSBP method, the similarity definition with the two relaxing variables was designed to accommodate minor discrepancies between the comparing instances for instance retrieval. Tighter constraints were applied to the two variables while implementing TSBP. In this paper, we implement a data augmentation strategy by modifying the degradation rate with either stretching or shrinking factor in scaling. Our proposed IBL approach can be implemented with different data augmentation strategies for different prognostic applications. More manipulating variables may be introduced in the similarity evaluation accordingly.
The remainder of this paper is organized as follows. Section 2 introduces the bearing RUL estimation problem and the PRONOSTIA datasets. Section 3 describes the methodology of the proposed IBL approach. Section 4 presents the evaluation results and discussion. A comparison of the proposed IBL approach with the previous studies is also made in Section 4. Finally, Section 5 gives a conclusion of this study.

Problem Description
The open-access PRONOSTIA bearing datasets were produced by the FEMTO-ST Institute [15,16]. In the bearing experimental platform, namely PRONOSTIA, two accelerometers were mounted on the bearing housing to measure vibration in directions parallel and perpendicular to the gravity axis. Data were collected at 10 s intervals at a 25.6 kHz sampling rate for a 0.1 s duration. Hence, an observation contained 2560 sample points in each time cycle.
The bearing datasets were collected under three different operating conditions in variation of load and speed. For each operating condition (e.g., speed: 1800 rpm and load: 4000 N), only two sets of data from the entire run-to-failure experiments were provided. These are used as learning instances in this study to build the RUL estimation model. Partial data from run-to-failure experiments obtained by truncating the time series after a certain number of time cycles are also provided and used as test instances. In the 2012 PHM Prognostic Challenge [15,16], RUL was defined as the time span until the vibration amplitude exceeds 20 g. Figure 1 shows vibration signals in the horizontal direction, i.e., perpendicular to gravity, under the operating condition I, i.e., speed of 1800 rpm and load of 4000 N. A total of seven datasets are provided including two learning instances labelled as Bearing1-1 and Bearing1-2, and five test instances labelled as Bearing1-3, 1-4, 1-5, 1-6, and 1-7. The two run-to-failure datasets show very different lifespans. A large variation in data length can be seen among the seven bearing instances (from 1 h to 7 h). The test bearing has 13 rolling elements with 3.5 mm diameter. Its inner race diameter is 22.1 mm and outer race diameter is 29.1 mm. The bearing has a contact angle of zero degrees. 1 and Bearing1-2, and five test instances labelled as Bearing1-3, 1-4, 1-5, 1-6, and 1-7. The two run-to-failure datasets show very different lifespans. A large variation in data length can be seen among the seven bearing instances (from 1 h to 7 h). The test bearing has 13 rolling elements with 3.5 mm diameter. Its inner race diameter is 22.1 mm and outer race diameter is 29.1 mm. The bearing has a contact angle of zero degrees.

Raw Signal
Root-Mean-Square Crest-Factor Kurtosis (a) Bearing1-1 Bearing1-5 For the 2012 PHM challenge, no information about the type of failure, e.g., fatigue spalling and its location or poor lubrication, was given. The data provider from the FEMTO-ST Institute has advised that theoretical models based on bearing characteristic frequencies were highly unlikely to be useful in identifying bearing fault types, because degradation of all bearing components may take place at the same time. In addition, theoretical reliability framework, e.g., L10 law, ball-pass-frequency-inner (BPFI), ball-passfrequency-outer (BPFO) does not match the experimental observations very well [15].
There have been a variety of statistical measures that can be used to extract prognostic features from vibration signals, including Peak, Peak-to-Peak, Energy, Variance, Standard Deviation (SD), Root-Mean-Square, Entropy, Kurtosis, Skewness, Crest Factor, K-Factor, Impulse Factor, Margin Factor (i.e., Clearance Factor), Shape Factor (i.e., Form Factor, Wave Factor), and the trigonometric function based features SD-IHC and SD-IHS (i.e., SD of inverse hyperbolic cosine (IHC), SD of inverse hyperbolic sine (IHS)) [3,7]. For each time cycle containing a set of data points that are within the signal sampling duration, these statistical measures can be computed.
Some statistical parameters are reported to have the ability to indicate the severity of bearing faults [15]. For example, signal energy is generally proportional to the severity of bearing damage and hence parameters such as Root-Mean-Square and Peak-to-Peak should increase as faults progress. Other parameters, including the Crest Factor and the Impulse Factor, are sensitive to incipient faults and hence should increase as faults develop in the bearing but will fall back down as faults become widespread. Figure 1 shows the values and trends of Root-Mean-Square, Crest Factor, and Kurtosis for the 7 bearing instances under the operating condition of 1800 rpm and 4000 N. They were calculated for every time cycle. However, it is apparent that not a single parameter demonstrates consistent patterns across all 7 instances as bearing faults progress. Thus, it is difficult to derive the RUL estimation model based on these statistical parameters.
In addition, the PRONOSTIA data provider demonstrated that the evolution of the signal power spectrum density (PSD) shows monotonic increasing trend during bearing degradation for an ideal case such as Bearing1-1. However, they have also advised that, in most cases, the degradation appears suddenly and does not depict a slow monotonic behavior. In those cases, finding a prediction model is much more difficult based on PSD [15].

An Instance-Based Learning (IBL) Approach
In this section, we formulate the RUL estimation problem and present an overview of the proposed IBL approach. We then provide details of major components of the proposed approach, using the horizontal vibration signals of the 7 bearing instances under the operating condition I (i.e., 1800 rpm/4000 N) in the PRONOSTIA bearing datasets.

Formulation of the RUL Estimation Problem
Given a learning instance (denoted by superscript (l)) of the vibration signal, the run-to-failure degradation process can be represented by where x i (l) represents the prognostic feature vectors of dimension K at time cycle i indexed from the beginning of bearing usage. M denotes the end of useful life.
Features are extracted from raw signals at every time cycle. Recall that the PRONOSTIA datasets provide recordings of 0.1 s duration at intervals of every ∆t = 10 s. In Equation (1), i is the RUL at time cycle i. Hence: Signals 2021, 2

667
For a test instance, the degradation process can be represented much the same way except we use superscript (p) to distinguish test from learning instances with feature vectors represented by: In the above, index N denotes the last data recording. Obviously, RUL N is the ultimate goal. The key to the IBL approach lies in the selection of a learning instance from historical datasets, whose degradation process is deemed similar to the given test instance. Once a learning instance has been identified, the RUL problem can be solved by identifying the location of the current time cycle in the life cycle of the learning instance in order to determine the remaining useful life RUL

Overview of Methodology
As illustrated in Figure 2, the proposed IBL approach consists of three major components, namely, spectrogram generation, feature extraction, and RUL estimation:

•
Spectrogram generation applies the short-time-Fourier-transform (STFT) technique on the raw signals during each time cycle to generate frequency domain signals, so that spectrograms can be obtained for both learning and test instances. RUL estimation provides the RUL estimate for the test instance by identifying its similar prognostic feature sequences, which are generated synthetically from the learning instance using the data augmentation method.
Signals 2021, 2 FOR PEER REVIEW 6 Features are extracted from raw signals at every time cycle. Recall that the PRONO-STIA datasets provide recordings of 0.1 s duration at intervals of every ∆t = 10 s. In Equation (1), ( ) is the RUL at time cycle i. Hence: For a test instance, the degradation process can be represented much the same way except we use superscript (p) to distinguish test from learning instances with feature vectors represented by: In the above, index N denotes the last data recording. Obviously, ( ) is to be estimated and finding ( ) is the ultimate goal.
The key to the IBL approach lies in the selection of a learning instance from historical datasets, whose degradation process is deemed similar to the given test instance. Once a learning instance has been identified, the RUL problem can be solved by identifying the location of the current time cycle in the life cycle of the learning instance in order to determine the remaining useful life ( ) .

Overview of Methodology
As illustrated in Figure 2, the proposed IBL approach consists of three major components, namely, spectrogram generation, feature extraction, and RUL estimation: • Spectrogram generation applies the short-time-Fourier-transform (STFT) technique on the raw signals during each time cycle to generate frequency domain signals, so that spectrograms can be obtained for both learning and test instances. RUL estimation provides the RUL estimate for the test instance by identifying its similar prognostic feature sequences, which are generated synthetically from the learning instance using the data augmentation method.

Spectrogram Generation and Feature Extraction
In the PRONOSTIA datasets, data were collected every 10 s for a period of 0.1 s. In every time cycle of 0.1 s, 2560 data points were recorded, thus with a time resolution of 3.91 µs and frequency resolution of 10 Hz. Each spectrum obtained for each time cycle contains 1280 frequency components. In order to reduce noise effects, spectrograms are smoothed using moving averaging filters in both time and frequency directions. We chose

Spectrogram Generation and Feature Extraction
In the PRONOSTIA datasets, data were collected every 10 s for a period of 0.1 s. In every time cycle of 0.1 s, 2560 data points were recorded, thus with a time resolution of 3.91 µs and frequency resolution of 10 Hz. Each spectrum obtained for each time cycle contains 1280 frequency components. In order to reduce noise effects, spectrograms are smoothed using moving averaging filters in both time and frequency directions. We chose 80 and 40 as the window sizes in time and frequency, respectively. Figure 3a,b show the spectrograms of Bearing1-1 as the learning instance and Bearing1-3 as the test instance.  For a learning instance, the spectrogram has 1280 frequency components at eac the M time steps so the PCA technique is utilized for dimension reduction. The PCA reduce a larger number of features to the representative principal components usi linear transformation while maintaining most of the variability of the dataset [17]. In proposed IBL approach, PCA is first applied to the spectrogram of the learning inst to generate a set of PCs (i.e., a set of eigenvectors). We chose the first five PCs to repre the bearing degradation states. In Figure 4, we show the mean and first five PCs o learning instance Bearing1-1. They account for 91.27%, 5.13%, 1.48%, 0.14%, 0.03% o spectrogram variability respectively, for a total of 99.86% of the data variability. For a learning instance, the spectrogram has 1280 frequency components at each of the M time steps so the PCA technique is utilized for dimension reduction. The PCA can reduce a larger number of features to the representative principal components using a linear transformation while maintaining most of the variability of the dataset [17]. In the proposed IBL approach, PCA is first applied to the spectrogram of the learning instance to generate a set of PCs (i.e., a set of eigenvectors). We chose the first five PCs to represent the bearing degradation states. In Figure 4, we show the mean and first five PCs of the learning instance Bearing1-1. They account for 91.27%, 5.13%, 1.48%, 0.14%, 0.03% of the spectrogram variability respectively, for a total of 99.86% of the data variability.
Signals 2021, 2 FOR PEER REVIEW 7 80 and 40 as the window sizes in time and frequency, respectively. Figure 3a,b show the spectrograms of Bearing1-1 as the learning instance and Bearing1-3 as the test instance. For a learning instance, the spectrogram has 1280 frequency components at each of the M time steps so the PCA technique is utilized for dimension reduction. The PCA can reduce a larger number of features to the representative principal components using a linear transformation while maintaining most of the variability of the dataset [17]. In the proposed IBL approach, PCA is first applied to the spectrogram of the learning instance to generate a set of PCs (i.e., a set of eigenvectors). We chose the first five PCs to represent the bearing degradation states. In Figure 4, we show the mean and first five PCs of the learning instance Bearing1-1. They account for 91.27%, 5.13%, 1.48%, 0.14%, 0.03% of the spectrogram variability respectively, for a total of 99.86% of the data variability. By projecting spectrograms onto the five major PCs generated above, the corresponding PC coefficients of both learning and test instances can be obtained. The spectrum at each time cycle i (or j) can be approximately represented by coefficients x k,i (or x k,j ) on a total of 5 major principal components PC k (k = 1, 2, 3, 4, 5) as shown in Figure 4.
These coefficients are considered as the prognostic features that are included in the prognostic feature vector x i (l) or x j (p) in Equations (2) and (5) to represent a bearing's degradation state.
As a result, the five PC feature sequences can be obtained as illustrated in Figure 3c,d for Bearing1-1 and Bearing1-3, respectively. From these figures, it can be observed that the 1st PC mainly captures the variability of the learning instance data over time and also the test instance has a similar trend with respect to the 1st PC feature sequence. It is reasonable to associate machine health conditions with such variability in the bearing vibration data. Particularly, we can see that the 1st PC mainly reflects the variability of the vibration signal at two frequency ranges 0-3 kHz and 10-12 kHz, as shown in Figure 4.
The procedure is summarized as follows: <Procedure-I: Spectrogram Generation and Feature Extraction>

RUL Estimation
The step of RUL estimation is approached with three components, namely, data augmentation, similarity evaluation, and estimate ensemble. Details are provided below.

Data Augmentation
In machine learning applications, a data augmentation strategy is often used to generate additional learning data by modifying the historical samples [18][19][20]. By increasing the learning datasets, classification models can be trained to be more robust and generalized without over-fitting. For example, in image analysis, learning samples may be acquired under a limited set of conditions. However, classification may need to be applied to images under conditions beyond that of the learning samples, such as different orientation, location, scale, brightness, etc. These situations can be accounted for by training the classification model with additional image samples, which may be generated by synthetically modifying existing samples through translation, rotation, scaling, and illumination. Augmenting the learning datasets can result in a trained classification model to be more robust and generalized with-out over-fitting. Taking a similar approach, we propose a segment scaling method in order to generate synthetic feature sequences by modifying the learning instance in the IBL approach for bearing RUL estimation.
We take the learning instance Bearing1-1 as an example. Figure 5b shows the 1st PC feature sequence over time. Based on the magnitude trend of the feature curve, three typical stages within the lifespan of the learning instance can be observed: (i) normal stage where no obvious degradation occurs, the magnitude remains nearly constant; (ii) degradation stage where a gradually changing trend is clear; (iii) close-to-failure stage where the bearing is close to failure as signaled by steep or abrupt changes in the feature curve. The close-to-failure stage may last a short period, while the degradation stage can last quite long.
begins on entering the degradation stage until the current time cycle. Correspondingly, specify a degradation segment of length Tl within the feature sequence of the learning instance. The learning degradation segment should present similar trending as the test segment. For the learning instance, the specified segment also begins on entering the degradation stage until ending the entire degradation stage. Understandably, the learning degradation segment should encompass the test segment in magnitude so that it is possible to determine the relative location of the test case degradation. Hence, the following condition Equation (7) should be satisfied for the specified test and learning segments.
where ( ) and ( ) are prognostic feature values of the test and learning instances at the ending times of their respective segments, as illustrated in Figure 5a,b. The learning degradation segment is then stretched (α > 1) or shrunk (α < 1) in length with different scaling factor α, as illustrated Figure 5b-d. By scaling, multiple segments representing different prognostic feature sequences from the given learning instance can be generated. The heuristics are based on the following assumptions [21,22]: (a) The same type of bearing faults may progress in different degradation rates due to different conditions including fault initial status, material heterogeneity, and lubrication condition, even though under the same loading and rotation conditions; (b) As the segments within the  Figure 5a shows the 1st PC feature sequence of the test instance. If we can determine the equivalent location of the test instance relative to the lifespan of the learning instance, we can estimate the RUL of the test instance. As illustrated in Figure 5a, the proposed segment scaling method starts by identifying a defect-degradation segment in the first PC feature sequence of the test instance Bearing1-3. The specified segment for the test instance begins on entering the degradation stage until the current time cycle. Correspondingly, specify a degradation segment of length T l within the feature sequence of the learning instance. The learning degradation segment should present similar trending as the test segment. For the learning instance, the specified segment also begins on entering the degradation stage until ending the entire degradation stage. Understandably, the learning degradation segment should encompass the test segment in magnitude so that it is possible to determine the relative location of the test case degradation. Hence, the following condition Equation (7) should be satisfied for the specified test and learning segments.
s are prognostic feature values of the test and learning instances at the ending times of their respective segments, as illustrated in Figure 5a,b.
The learning degradation segment is then stretched (α > 1) or shrunk (α < 1) in length with different scaling factor α, as illustrated Figure 5b-d. By scaling, multiple segments representing different prognostic feature sequences from the given learning instance can be generated. The heuristics are based on the following assumptions [21,22]: (a) The same type of bearing faults may progress in different degradation rates due to different conditions including fault initial status, material heterogeneity, and lubrication condition, even though under the same loading and rotation conditions; (b) As the segments within the close-to-failure stage and normal stage are not used in the similarity comparisons between the test and learning instances, the segment scaling method is only applied on the learning degradation segment; (c) The time of the close-to-failure stage, ∆ as indicated in Figure 5b, is approximately considered as a constant in all synthetic learning sequences, as this last stage of bearing life is usually relatively shorter than its gradual degradation stage.
This segment scaling method is implemented by resampling with a linear or spline interpolation function [23]. At first, the interpolation function is constructed to approximate the specified segment of the learning instance.
where x i is the interpolated feature value at the time cycle of index i. Then, a new feature sequence in the scaled length αT l is generated by re-interpolating the linear or spline function as illustrated in Figure 5c,d.
where x i is the interpolated feature value at the time cycle of index i on the synthetic sequence, and the scaled length αT l needs to be rounded if it is not an integer. At last, the normal segment, the synthetic degradation segment, and the close-to-failure segment are concatenated sequentially to generate the entire synthetic feature sequence. Accordingly, multiple synthetic learning feature sequences can be generated with different scaling factors.

Similarity Evaluation
If we can determine "a segment" in the learning instance feature sequence that is "similar" to that of the test instance, we can conclude its RUL without much difficulty. This segment may be original or synthetic as described above. We now describe how to assess similarity. In this study, we use the root mean squared (RMS) difference measure to evaluate similarity between the test and learning instances, as described below. Figure 6 illustrates the similarity matching by using the 1st PC, that is, k = 1. In particular, it depicts the prognostic feature segment of the test instance (e.g., Bearing1-3), P k , and one of the synthetic feature sequences (e.g., α = 0.75) of the learning instance (e.g., Bearing1-1), L k .
By displacing P k along the time axis, the similarity measure method computes the difference between L k and each shifted P k in the form of RMS using the following function of time displacement T g : Recall that T p, and αT l, are the lengths of the two feature segments being compared. T g is in the range of [−T p , αT l, − T p ]. The best matching between the two segments is indicated by a minimum value of similarity measure d min . Figure 6b shows the similarity measure d P k , L k , T g for the given segments of comparison as a function of displacement T g . At d min , the displacement is denoted as T g,min . For each given synthetic feature sequence (e.g., α = 0.75), we identify the d min (α) and the RUL measured in time cycles then can be calculated by: In the above equation, ∆ as indicated in Figure 6a, is constant for all synthetic feature sequences because it is excluded from scaling. In the example shown in Figure 6, the prognostic segments are chosen between 300-and 1800-time cycles from the test instance and between 20-and 2750-time cycles from the learning instance. Therefore, Tp =1500, Tl =2730.  Note that in the above computation each comparison pair of feature sequences, Lk and Pk, are normalized with the mean and standard deviation of the feature segment Pk, using Equations (14) and (15) In the above equation, ∆ as indicated in Figure 6a, is constant for all synthetic feature sequences because it is excluded from scaling. In the example shown in Figure 6, the prognostic segments are chosen between 300-and 1800-time cycles from the test instance and between 20-and 2750-time cycles from the learning instance. Therefore, T p = 1500, T l = 2730. The learning instance Bearing1-1 is recorded as failing at 2785 cycles, hence ∆ = 53. At the best matching, T g, min =−34, α = 0.75. Thus, RUL (0.75) = 0.75 × 2730 − (1500 − 34) + 53 = 635 (time cycles).
Note that in the above computation each comparison pair of feature sequences, L k and P k , are normalized with the mean µ and standard deviation σ of the feature segment P k , using Equations (14) and (15) (L k − µ(P k ))/σ(P k ) (14) (P k − µ(P k ))/σ(P k )

Estimate Ensemble
Once synthetic data are generated, we can try to find the best match between the test instance and the learning instance by the similarity measures of all real and synthetic data. As the synthetic learning sequences are generated in the discrete way (with a set of scaling factors), the estimation result may be biased from the actual RUL if it is only based on one synthetic sequence. Hence, we propose an ensemble method based on multiple synthetic learning sequences to achieve the best out-come of assessing RUL.
As illustrated in Figure 7, multiple synthetic feature sequences are generated from the original feature sequence of the learning instance (e.g., Bearing1-1) with different scaling factor α. From each synthetic sequence, an RUL estimate, RUL(α), for the test instance (e.g., Bearing1-3) can be obtained with the corresponding minimum similarity measure d min (α). A scaling factor α* for the best similarity matching sequence can be identified from all learning sequences including the original and synthetic ones. After that, the final RUL estimation can be aggregated from the best matching sequence and the multiple synthetic sequences, whose scaling factors are evenly distributed on both sides of α*, for example, two sequences with α > α* and two sequences with α < α*, as shown in Figure 7. The formula for computing the ensemble RUL estimate is defined as Equation (16). In the averaging formula of the ensemble method over M number of the selected sequences, each RUL estimate is weighted by its corresponding similarity measure, d min .
In this example as illustrated in Figure 7, nine synthetic sequences are generated from the original feature sequence of Bearing1-1, with the scaling factor α of 0.55, 0.60, 0.65, 0.70, 0.75, 0.80, 0.85, 0.90, and 0.95, respectively. Table 1 presents the results of a similarity evaluation between the nine synthetic sequences and the selected segment in the range of [300, 1800] (time cycles) of Bearing1-3. The best similarity matching sequence can be identified as α = 0.75. Based on the seven selected synthetic feature sequences with α = 0.60, 0.65, 0.70, 0.75, 0.80, 0.85, and 0.90, the ensemble RUL estimate, RUL e , can be computed as 642 (time cycles) using Equation (18). The actual RUL at the last time cycle 1800 (i.e., the length of future sequence) of Bearing1-3 is RUL a = 573 (time cycles). Thus, our proposed IBL approach can achieve a good performance with the relative error E r = −12.07%, which is calculated using the formula Equation (17).
Signals 2021, 2 FOR PEER REVIEW

Estimate Ensemble
Once synthetic data are generated, we can try to find the best match between th instance and the learning instance by the similarity measures of all real and synthetic As the synthetic learning sequences are generated in the discrete way (with a set of s factors), the estimation result may be biased from the actual RUL if it is only based o synthetic sequence. Hence, we propose an ensemble method based on multiple syn learning sequences to achieve the best out-come of assessing RUL.
As illustrated in Figure 7, multiple synthetic feature sequences are generated the original feature sequence of the learning instance (e.g., Bearing1-1) with differen ing factor α. From each synthetic sequence, an RUL estimate, RUL(α), for the test ins (e.g., Bearing1-3) can be obtained with the corresponding minimum similarity me dmin(α). A scaling factor α* for the best similarity matching sequence can be identified all learning sequences including the original and synthetic ones. After that, the fina estimation can be aggregated from the best matching sequence and the multiple syn sequences, whose scaling factors are evenly distributed on both sides of α*, for exa two sequences with α > α* and two sequences with α < α*, as shown in Figure 7. Th mula for computing the ensemble RUL estimate is defined as Equation (16). In the aging formula of the ensemble method over M number of the selected sequences RUL estimate is weighted by its corresponding similarity measure, dmin. In this example as illustrated in Figure 7, nine synthetic sequences are generated the original feature sequence of Bearing1-1, with the scaling factor α of 0.55, 0.60 0.70, 0.75, 0.80, 0.85, 0.90, and 0.95, respectively. Table 1 presents the results of a sim evaluation between the nine synthetic sequences and the selected segment in the ran [300, 1800] (time cycles) of Bearing1-3. The best similarity matching sequence can be tified as α = 0.75. Based on the seven selected synthetic feature sequences with α = 0.65, 0.70, 0.75, 0.80, 0.85, and 0.90, the ensemble RUL estimate, , can be com as 642 (time cycles) using Equation (18). The actual RUL at the last time cycle 180 the length of future sequence) of Bearing1-3 is = 573 (time cycles). Thus, ou posed IBL approach can achieve a good performance with the relative error = −12 which is calculated using the formula Equation (17).  In summary, the procedure for the RUL estimation method with data augmentation is described as follows: <Procedure-II: Data Augmentation and RUL Estimation>

Data Augmentation
(1.a) Specify the pair of comparison feature segments within the test instance and learning instance, P k and L k (i.e., α = 1.0). (1.b) Generate multiple synthetic sequences L k (i.e., α > 1.0 or α < 1.0) from the original one using the segment scaling method described in Section 3.4.1.

Similarity Evaluation
(2.a) Compute the similarity measure as expressed in Equation (12) between the P k and each sequence L k using the similarity evaluation method as described in Section 3.4.2 (2.b) Identify the minimum similarity measure d min and the corresponding displacement T g with respect to P k . (2.c) Repeat the steps (2.a) and (2.b) for each pair of sequences P k and L k .

Estimate Ensemble
(3.a) Identify the best similarity matching pair of sequences from all the similarity evaluation results. (3.b) Select several learning sequences (e.g., 4 sequences in this study) around the best similarity matching sequence identified in (3.a). (3.c) Compute the final RUL estimate using Equations (16) and (17)

Results and Comparison
In the previous section, we described the proposed IBL approach step by step with the example of the test instance Bearing1-3 and the learning instance Bearing1-1 in the PRONOSTIA bearing datasets. The RUL estimation results for the four other test bearings under the operating condition I (1800 rpm and 4000 N), including Bearings 1-4, 1-5, 1-6, and 1-7, are discussed in this section.

RUL Estimation for Test Instance Bearings
The spectrogram and its corresponding PC feature sequences for the learning instance Bearing1-2 are shown in Figure 8a,b, respectively. In Figure 9, we show the first five PCs of the learning instance Bearing1-2. They each account for data variability of 49.78%, 37.45%, 9.24%, 1.71%, 0.98%, respectively. As shown in Figure 9, we can observe that the 1st PC mainly reflects the variability of the vibration signal at two frequency ranges 0-3 kHz and 10-12 kHz, the 2nd PC at two frequency ranges 0-3 kHz and 3-5 kHz, and so on.      Applying the proposed IBL approach to the bearing datasets, we compare the five PC feature sequences between each pair of a test instance and the augmented feature sequences of the learning instance. The best similarity matching sequence pair can be identified from the test instance vs. each learning instance. It should be noted that we only present the best matching sequence pair and the corresponding RUL estimate results for each test instance here.
As a result, the 2nd PC feature sequences were chosen for test instances Bearing1-5, 1-6, and 1-7 paired with the learning instance Bearing1-2. The 1st PC feature sequences are used for Bearing1-4 paired with the learning instance Bearing1-2. Figure 10 shows the five PC feature sequences of the test instances Bearings 1-4, 1-5, 1-6, and 1-7, which are generated by projecting their spectrograms on the PCs of learning instance Bearing1-2. Recall that the first five PC sequences of Bearing1-1 have been illustrated in Figure 4 in the previous Section 3.3. For comparison, Table 2 summarizes the PCs for data variability with respect to Bearing1-1 and Bearing1-2. Applying the proposed IBL approach to the bearing datasets, we compare the five PC feature sequences between each pair of a test instance and the augmented feature sequences of the learning instance. The best similarity matching sequence pair can be identified from the test instance vs. each learning instance. It should be noted that we only present the best matching sequence pair and the corresponding RUL estimate results for each test instance here.
As a result, the 2nd PC feature sequences were chosen for test instances Bearing1-5, 1-6, and 1-7 paired with the learning instance Bearing1-2. The 1st PC feature sequences are used for Bearing1-4 paired with the learning instance Bearing1-2. Figure 10 shows the five PC feature sequences of the test instances Bearings 1-4, 1-5, 1-6, and 1-7, which are generated by projecting their spectrograms on the PCs of learning instance Bearing1-2.

RUL Estimation for Test Instance Bearing1-4
Using Bearing1-2 as the learning instance, the pair of the 1st PC sequences with similar trends is compared to estimate the RUL for the test instance Bearing1-4. As illustrated in Figure 11, Bearing1-4 has passed its degradation stage and entered its close-to-failure stage at time cycle 1039 and records are given up to cycle 1137. Therefore, the usage of Bearing1-4 can be estimated pessimistically, and the RUL is estimated as 0 at time cycle 1137. On the other hand, the RUL of Bearing1-4 can also be optimistically estimated as the duration of close-to-failure stage of Bearing1-2, which is known as 45-time cycles from the turning point. Averaging the pessimistic and optimistic estimates, we conclude the final RUL estimate as RUL e = 23 (time cycles). As the actual RUL of Bearing1-4 is known as RUL a = 34 (time cycles), the estimation has the relative error E r = 32.4%.

RUL Estimation for Test Instance Bearing1-5
Based on the specified segment of the range [0, 825] (time cycles) within the 2nd PC feature sequence of the learning instance Bearing1-2, the nine synthetic feature sequences are generated with the scaling factor α of 2.80, 2.90, 3.00, 3.10, 3.20, 3.30, 3.40, 3.50, and 3.60, as shown in Figure 12. The duration ∆ is equal to 45 (time cycles) in this case. Table 3 presents the results of similarity evaluation of synthetic sequences and the selected segment in the range of [0, 2303] (time cycles) of Bearing1-5. Based on the seven synthetic sequences of α = 2.90, 3.00, 3.10, 3.20, 3.30, 3.40, and 3.50, the final RUL estimate of Bearing1-5 can be computed as RUL e = 160 (time cycles) using the proposed ensemble method. As the actual RUL of Bearing1-5 is known as RUL a = 161 (time cycles), the relative error of estimation is calculated as E r = 0.92%. Using Bearing1-2 as the learning instance, the pair of the 1st PC sequences with similar trends is compared to estimate the RUL for the test instance Bearing1-4. As illustrated in Figure 11, Bearing1-4 has passed its degradation stage and entered its close-to-failure Bearing1-4 can be estimated pessimistically, and the RUL is estimated as 0 at time cycle 1137. On the other hand, the RUL of Bearing1-4 can also be optimistically estimated as the duration of close-to-failure stage of Bearing1-2, which is known as 45-time cycles from the turning point. Averaging the pessimistic and optimistic estimates, we conclude the final RUL estimate as = 23 (time cycles). As the actual RUL of Bearing1-4 is known as = 34 (time cycles), the estimation has the relative error = 32.4%.

RUL Estimation for Test Instance Bearing1-5
Based on the specified segment of the range [0, 825] (time cycles) within the 2nd PC feature sequence of the learning instance Bearing1-2, the nine synthetic feature sequences are generated with the scaling factor α of 2.80, 2.90, 3.00, 3.10, 3.20, 3.30, 3.40, 3.50, and 3.60, as shown in Figure 12. The duration ∆ is equal to 45 (time cycles) in this case. Table  3

RUL Estimation for Test Instance Bearing1-5
Based on the specified segment of the range [0, 825] (time cycles) within the 2nd PC feature sequence of the learning instance Bearing1-2, the nine synthetic feature sequences are generated with the scaling factor α of 2.80, 2.90, 3.00, 3.10, 3.20, 3.30, 3.40, 3.50, and 3.60, as shown in Figure 12. The duration ∆ is equal to 45 (time cycles) in this case. Table  3    The similarity evaluation for RUL estimation was made based on the pair of 2nd PC feature sequences between the test instance Bearing1-6 and the learning instance Bearing1-2. Initially, we specify the comparing segment as the monotonic segment in the range of [0, 1590] (time cycles) within the degradation stage of Bearing1-6. Using Bearing1-2, the segment of [0, 825] (time cycles) is specified to generate nine synthetic sequences with scaling factors of 3.40, 3.60, 3.80, 4.00, 4.20, 4.40, 4.60, 4.80, and 5.00, as shown in Figure 13. The duration ∆ is equal to 45 (time cycles) in this case. Table 4

thus
= 2654 − 710 = 1944 (time cycles). However, from the 2nd PC feature sequence, we can observe that sudden and abrupt interruptions occurred during the range of [1590, 2300] (time cycles) of the test instance. Thus, we may consider that the degradation process has entered the close-to-failure stage after the time cycle 1590. Reasonably, the RUL of Bearing1-6 can be estimated as = ∆ = 45 (time cycles) at the last time cycle 2300. As the actual RUL is known as = 146 (time cycles), the relative error of estimation can be calculated as = 69.2%.    However, from the 2nd PC feature sequence, we can observe that sudden and abrupt interruptions occurred during the range of [1590, 2300] (time cycles) of the test instance. Thus, we may consider that the degradation process has entered the close-to-failure stage after the time cycle 1590. Reasonably, the RUL of Bearing1-6 can be estimated as RUL e = ∆ = 45 (time cycles) at the last time cycle 2300. As the actual RUL is known as RUL a = 146 (time cycles), the relative error of estimation can be calculated as E r = 69.2%.

RUL Estimation for Test Instance Bearing1-7
The similarity evaluation for the test instance Bearing1-7 is performed based on the 2nd PC feature sequence using Bearing1-2 as the learning instance. From the segment of [0, 825] (time cycles) of Bearing1-2, the nine synthetic feature sequences are generated with the scaling factor α = 2.40, 2.50, 2.60, 2.70, 2.80, 2.90, 3.00, 3.10, and 3.20, as shown in Figure 14. The duration ∆ is equal to 45 (time cycles). Table 5 presents the results of similarity evaluation between each synthetic sequence and the degradation segment of [0, 1500] (time cycles) of Bearing1-7. Selecting the seven synthetic sequences of α = 2.50, 2.60, 2.70, 2.80, 2.90, 3.00, and 3.10, the final RUL estimate can be computed as RUL e = 822 (time cycles) using the proposed ensemble method. As the actual RUL of Bearing1-7 is known as RUL a = 757 (time cycles), the relative error of estimation can be calculated as E r = −8.61%. the scaling factor α = 2.40, 2.50, 2.60, 2.70, 2.80, 2.90, 3.00, 3.10, and 3.20, as shown in Figu  14. The duration ∆ is equal to 45 (time cycles). Table 5 presents the results of similari evaluation between each synthetic sequence and the degradation segment of [0, 1500] (tim cycles) of Bearing1-7. Selecting the seven synthetic sequences of α = 2.50, 2.60, 2.70, 2.8 2.90, 3.00, and 3.10, the final RUL estimate can be computed as = 822 (time cycle using the proposed ensemble method. As the actual RUL of Bearing1-7 is known as = 757 (time cycles), the relative error of estimation can be calculated as = −8.61%.

. Summary of Results and Discussions
In the IEEE PHM 2012 Prognostic Challenge, a scoring function was used to evalua the RUL estimation results as follows [15]. Based on the relative error Er obtained usin Equation (17), the score of RUL estimation for each test bearing is defined as: This scoring function can penalize both underestimates and overestimates such th (i) the perfect estimate corresponding to a perfect score of 1 (i.e., Er = 0), (ii) score dedu tion is given to estimates that are shorter than the actual bearing RUL (i.e., Er > 0), and (i a more severe score deduction is given to estimates that exceed the actual bearing RU (i.e., Er < 0). Figure 15 depicts the scoring function. The overall estimation score averag the scores for N test bearings.

Summary of Results and Discussions
In the IEEE PHM 2012 Prognostic Challenge, a scoring function was used to evaluate the RUL estimation results as follows [15]. Based on the relative error Er obtained using Equation (17), the score of RUL estimation for each test bearing is defined as: This scoring function can penalize both underestimates and overestimates such that (i) the perfect estimate corresponding to a perfect score of 1 (i.e., Er = 0), (ii) score deduction is given to estimates that are shorter than the actual bearing RUL (i.e., Er > 0), and (iii) a more severe score deduction is given to estimates that exceed the actual bearing RUL (i.e., Er < 0). Figure 15 depicts the scoring function. The overall estimation score averages the scores for N test bearings.
Signals 2021, 2 FOR PEER REVIEW 19 69.2%, and −3.15%, respectively. For Bearing1-3, Bearing1-5, Bearing1-6, and Bearing1-7, the similarity evaluation was made using the proposed IBL approach with the data augmentation method. For Bearing1-4, its RUL was estimated directly using the reference of the close-to-failure stage duration of Bearing1-2, because the test bearing is considered to have entered the close-to-failure stage at its last time cycle. In this study, we used (dmin < 0.5000) as the criteria for identifying a pair of similar feature sequences between the test and learning instance.   The RUL estimation results for Bearing1-3, 4, 5, 6, and 7 (under the operating condition I) are summarized in Table 6, which achieve the relative errors −6.58%, 32.4%, 19.6%, 69.2%, and −3.15%, respectively. For Bearing1-3, Bearing1-5, Bearing1-6, and Bearing1-7, the similarity evaluation was made using the proposed IBL approach with the data augmentation method. For Bearing1-4, its RUL was estimated directly using the reference of the close-to-failure stage duration of Bearing1-2, because the test bearing is considered to have entered the close-to-failure stage at its last time cycle. In this study, we used (d min < 0.5000) as the criteria for identifying a pair of similar feature sequences between the test and learning instance. In this paper, the seven synthetic feature sequences associated with different scaling factors are selected to calculate the final RUL using the ensemble method, that is, M = 7 in the Equation (16). Worthy to be studied, we use the Mann-Withney test [24,25] to investigate the statistical difference among the prediction results obtained by setting M to 1, 3, 5, 7, and 9. The hypotheses include: (1) H0-the medians of two sets of RUL prediction scores (e.g., M = 3 vs. M = 1) are the same, and (2) H1-the medians are the same for the two sets of scores [25]. A p-value of 0.05 is used for the hypothesis test. Table 7 shows the prediction scores for test instances Bearing1-3, 1-5, 1-6, and 1-7 obtained with five different settings, M = 1, 3, 5, 7, and 9. Note that Bearing1-4 is not included in the test, as the ensemble method is not applied for this instance. The overall score (i.e., average score) of M = 7 for the four test instances apparently outperforms others in the comparison. However, we cannot conclude that there are significant differences among the different settings, as all p-values are greater than 0.05 by the Mann-Withney test. Reasonably we consider that the very limited number of test instances might have affected the test of significant difference in statistics.
Due to the difficulty to solve this challenging problem, most of the researchers did not present the comprehensive solutions with RUL prediction results, while they focused on extracting bearing prognostic features and constructing HI using different feature extraction and fusion methods. For example, Javed et al. extracted prognostic features (i.e., SD-IHC and SD-HIS) from the wavelet packet decomposition (WPD). A cumulative function then transformed time series of the prognostic features into descriptors depicting bearing degradation status [26]. In the method proposed by Mosallam et al., multiple features were fused through feature selection using the pairwise symmetrical uncertainty measure and PCA. The empirical mode decomposition (EMD) algorithm was then used to identify a monotonic trend of degradation [27]. Duong et al. employed the WPD technique to extract a set of prognostic features. The prognostic features showing the best degradation trends were iteratively accumulated to construct one fused prognostic feature [28]. Cosme et al. trained an adaptive neuro-fuzzy inference system (ANFIS) model for only onestep-ahead prediction of degradation state [29]. In the Kundu et al.'s study, the most sensitive prognostic feature was selected using the Chi-Square test to build the General Log-Linear Weibull (GLL-Weibull) models [30]. Xia et al. proposed a hierarchical method by integrating deep neural networks (DNN) and artificial neural networks (ANN) [31]. Li et al. proposed a CNN-based deep learning approach. Multiple spectrograms were used as inputs to the CNN to map to their corresponding RUL prediction targets [32]. In the two studies [31,32], they mixed all learning and test bearing examples and trained/tested prediction model with one-holdout validation strategy. Apparently, their experiments did not aim to deal with the challenging problem of shortage of learning examples.

Review of Representative Solutions
In this section, we compare the performance of our proposed IBL approach with the previous studies [35][36][37][38][39][40][41][42]. Eight representative solutions are briefly reviewed as follows and summarized in Table 8.  [35]. In their third solution, the RUL estimation was based on making comparisons on the durations of degradation stages between the learning and test bearing instances. Anomalies were detected by analyzing the changes of frequency signatures in the signal spectrogram. The authors attempted to create a prognostic feature by taking the average of five Peaks of the vibration signal. Their RUL estimation results achieved the best performance in the competition.
As the second winner in the IEEE PHM 2012 competition, Wang extracted time-domain statistical features, i.e., Peak, RMS, Kurtosis, and Energy from two demodulated envelope settings on vibration signals [36]. The PCA technique was applied to these extracted feature data to generate Hoteling's T 2 statistic as the representative prognostic feature. In this work, no local regression model was developed to predict RUL for test bearings. Instead, the average of the detect-to-failure times of the learning bearings were directly used to estimate RUL for the test bearings.
Hong et al. developed a method to evaluate the bearing degradation state and estimate the RUL for the bearing datasets [37]. For feature extraction, the EMD technique was used to generate a set of intrinsic mode functions (IMFs) based on the signal Energy and Entropy profiles from WPD. A self-organizing map (SOM) network was then trained on the IMF data obtained under normal stage of the bearing usage. To evaluate the degradation state, the trained SOM network was applied on the observed IMF data to produce the minimum quantization error (MQE), which acted as a prognostic feature, named the confidence value (CV). Finally, the Gaussian Process Regression (GPR) method was applied on the generated CV sequential data to make the RUL prediction.
Singleton et al. introduced a stochastic modeling approach using the EKF technique [38]. This approach adopted two types of prognostic features, one is the signal Variance and another is the Entropy within the identified signature frequency band. Once the prognostic features were extracted, an analytical function approximating the degradation process was used to initialize the parameters of EKF. The initialized EKF was then applied to the test bearing data to predict the RUL under different operating conditions. For the signal Variance feature, an exponential form of ae bt was found to be a suitable modeling function, whereas a function in the form of (a − be −ct ) was used to model the time-frequency Entropy feature. Model parameters a, b, and c are updated with measurement data.
Lei et al. presented a hybrid method [39] that adopted the feature fusion strategies similar to the method by Hong et al. [37]. The extracted prognostic features consist of WPD-based features and time-domain features. To fuse the prognostic features, a weighted minimum quantization error (WMQE) measure was constructed based on the trend-ability and monotonicity of prognostic feature sequences and correlations among those sequences. To predict the RUL, their method adopted the Paris-Erdogen model that is commonly used to depict fatigue crack propagation in mechanical materials. The model parameters were initialized using the MLE method and RUL was predicted using the model updated with the PF technique.
Guo et al. proposed a recurrent neural network (RNN) based health condition indicator (RNN-HI) [40]. Based on the similarity, monotonicity, and correlation metrics, they selected a set of prognostic features from the time-domain and time-frequency domain features similar to those by Lei et al. [39]. The RNN-HI sequence was then constructed by feeding the set of selected features into the RNN model. As a hybrid approach, they described the bearing degradation process using the double exponential model ae bt + ce dt and adapted the PF technique for the model parameter estimation.
Chen et al. employed a RNN based on encoder-decoder framework with attention mechanism to predict values of HI [41]. At each time cycle, five band-pass energy values of frequency spectrum were extracted as prognostic features, which were directly into the RNN based model for HI prediction. Upon completing the HI prediction for each time cycle, final RUL value was estimated using linear regression over the up-to-date time series of HI.
Huang et al. also applied an RNN with LSTM modeling technique for HI prediction [42]. From the learning bearing examples, this method generated a set of representative templates of frequency spectrum for both normal and failure states. For a given test bearing, similarity-based features were calculated between new vibration data and the set of learned templates. In addition, eight typical statistical time domain features, including Mean, SD, Min-to-Max (i.e., Peak-to-Peak), Skewness, Kurtosis, Energy, Entropy, and Energy/Entropy, were also included for prognostics. These features were used to estimate HI values via a trained CNN model. Finally, the obtained values of health indicators were then extrapolated to estimate future HI of the test bearing and its RUL.

Performance Comparison
The performances of the proposed approach and the eight previous solutions [32][33][34][35][36][37][38][39] are compared in Tables 9 and 10, which compare the relative error and score, respectively. The overall scores are computed by averaging five test bearings under the operating condition of 1800 rpm/4000 N. From the comparison, we can observe that the proposed IBL approach can achieve a comparable performance with the eight representative solutions to the challenging RUL estimation problem. The overall score of the proposed approach (i.e., Score = 0.3951) and the other eight solutions are all in the range from 0.0000 to 0.5000. However, none of the nine approaches can consistently provide accurate RUL estimations for all five test bearings. For example, the proposed approach has excellent performance for Bearing1-5 with almost full score, while it is unable to estimate as accurately for Bearing1-6. On the other hand, the estimation results from Lei et al. show high errors for Bearing1-5 (i.e., Score = 0.0313), even though this solution for Bearing1-3 is ranked at the top on the score table (i.e., Score = 0.9526). The latest study of Chen et al. provided very actuate RUL prediction for Bearing1-6 (i.e., Score = 0.9683), but performed extremely poorly with zero scores for Bearing1-3 and Bearing1-4.
Even though there is the limited number of test instances in this study, we tentatively apply the Mann-Withney test [24] to see whether the proposed approach is significantly different from the existing solutions statistically. Table 11 provides the p-values of the test between the proposed approach with the eight existing solutions. It can be found that the approach is significant different with the Wang's solution but not considered differently with the others by setting a significance level p-value = 0.05. We can also summarize that the previous studies generally adopted the same technical path. Firstly, they extracted a set of prognostic features for health indicator (HI) construction using different statistical analysis, signal processing, and machine learning methods [35][36][37][38][39][40][41][42]. With advances in deep learning technology, the latest three solutions employed RNN-based models to estimate HI [40][41][42]. Secondly, previous studies attempted to model bearing degradation profiles depicted by HI. Subsequently, bearing RUL is estimated by predicting when the HI crossed a predefined threshold. A variety of time-series modeling techniques were employed to build the RUL prediction model, including conventional linear and nonlinear regression/extrapolation, GPR, EKF (with exponential model), and PF (with Paris-Erdogan model and exponential model).
Unlike the existing solutions, we proposed an IBL approach where RUL for a given test bearing is estimated based on a run-to-failure example showing similar degradation behavior. Instead of modeling the bearing degradation process, we performed similarity comparisons directly using entire bearing degradation profiles. To deal with the challenges of limited run-to-failure examples in the learning datasets, a data augmentation technique is applied to generate multiple synthetic prognostic feature sequences by modifying the learning instance. To extract prognostic feature sequences, the PCA technique is applied to the spectrograms of vibration signals. To evaluate the similarity between a pair of the learning and test instances, the data projection base (i.e., set of PCs) is derived from the spectrogram of the learning instance.
Therefore, the proposed approach can be considered an alternative approach for the challenging bearing datasets. It is also our intent to produce a comparative review of the previous studies with the proposed approach that can serve as a reference for researchers interested in the RUL estimation problem particularly using the PRONOSTIA datasets.

Conclusions
Since the IEEE PHM 2012 Prognostic Challenge published the PRONOSTIA bearing datasets, the RUL prediction problem has attracted much interest from researchers. Unlike the existing approaches in the literature, we propose an alternative solution to address the limitations on the number of run-to-fail examples in RUL estimation. The proposed IBL approach is summarized as follows: • An efficient feature extraction method is employed for extracting the prognostic feature sequences. In this method, based on the spectrograms of vibration signals, the PCA technique is utilized to extract prognostic feature sequences for the bearing instance. For a pair of the learning and test instances to be compared, the data projection base (i.e., set of PCs) is derived from the spectrogram of the learning instance. • Inspired by the data augmentation strategy adopted in the machine learning for image classification applications, the segment scaling method is proposed to generate a set of synthetic prognostic feature sequences by modifying the observed prognostic feature sequence of the learning instance. • An ensemble method is used to aggregate the multiple RUL estimates obtained from similar prognostic feature sequences identified via the similarity evaluation. In the averaging function of the ensemble method, each RUL estimate is weighted by its corresponding similarity measure (i.e., RMS difference measure).
As a result, the performance comparison of our proposed approach with the eight representative solutions can become a valuable reference for researchers who attempt to achieve higher accuracy RUL estimations on the PRONOSTIA datasets in the future. For further development, our proposed IBL approach can be implemented as a generic scheme to incorporate with different data augmentation strategies, such as translating signal in time domain, shifting signal in frequency domain, and scaling signal in time domain, for different prognostic applications. It can be expected that the proposed IBL approach can be applied to different types of components and industrial systems such as wind turbines and aircraft engines, which may consist of a large number of components and have complicated degradation process modes.