Long Short-Term Memory Neural Network with Transfer Learning and Ensemble Learning for Remaining Useful Life Prediction

Prediction of remaining useful life (RUL) is of great significance for improving the safety and reliability of manufacturing equipment. However, in real industry, it is difficult for RUL prediction models trained on a small sample of faults to obtain satisfactory accuracy. To overcome this drawback, this paper presents a long short-term memory (LSTM) neural network with transfer learning and ensemble learning and combines it with an unsupervised health indicator (HI) construction method for RUL prediction. This study consists of the following parts: (1) utilizing the characteristics of deep belief networks and self-organizing map networks to translate raw sensor data into a synthetic HI that can effectively reflect system health; and (2) introducing transfer learning and ensemble learning to provide the required degradation knowledge for the LSTM-based RUL prediction model and thereby improve its performance. The performance of the proposed method is verified on two bearing datasets collected from experiments, and the results show that the proposed method outperforms comparable methods.


Introduction
In mechanical equipment, rolling element bearings are among the most critical and vulnerable components because they transfer motion and withstand loads during the operation of machinery [1,2]. To guarantee the normal operation of machinery and to avoid catastrophic events, condition-based maintenance (CBM) or predictive maintenance (PM) needs to be scheduled using remaining-useful-life (RUL) predictions of bearings [3]. Thus, the accuracy of RUL prediction is of great significance for the safety and reliability of manufacturing equipment.
With the increasing complexity of mechanical equipment, traditional RUL approaches are unable to meet the correspondingly higher standards required of modern machinery. Therefore, data-driven RUL methods have been widely studied in recent years [4,5]. The framework of data-driven methods mainly consists of data acquisition, construction of health indicators (HI), and RUL prediction [6]. According to a recent review, deep learning has attracted increasing attention in the field of RUL prediction. Lei et al. [7] utilized a multi-scale dense-gate recurrent neural network to identify relevant information at different timescales and improve model performance. Yang et al. [8] proposed an improved long short-term memory neural network to estimate bearing performance degradation. In [9], a hybrid model based on long short-term memory and Elman neural networks predicted RUL for lithium-ion batteries. Although existing prognosis methods achieve satisfactory accuracy, two deficiencies still limit their industrial extension and application.
The first shortcoming is that most HI construction methods adopt hand-designed labels for supervised training, which makes HI construction time-consuming and laborious. A deep belief network (DBN) is a deep learning method [10] that can extract the most suitable feature representations from raw data due to its nonlinear network structure. It has gradually been applied in the field of prognostics and health management (PHM) [11]. Peng et al. [12] utilized deep belief networks with an unsupervised algorithm to construct an HI and achieved RUL prediction with an improved particle filter. Xu et al. [13] proposed an improved unsupervised deep belief network to estimate bearing performance degradation.
The second shortcoming is that RUL prediction models are trained under the assumption that large training datasets with fault trends are available. However, in real conditions, only relatively small sets of fault data can be obtained for the predicted target, which makes it difficult to achieve high RUL prediction accuracy. To overcome this drawback, transfer learning and ensemble learning are introduced, solving the problem of insufficient accuracy due to a lack of historical measurement data. Transfer learning uses the knowledge learned from the source domain to improve learning in the target domain. Ensemble learning is an approach in which the results of multiple learning algorithms are combined to improve performance. In the field of CBM, transfer learning and ensemble learning have been applied mainly to fault diagnosis; there are only a few RUL prediction applications. For instance, Shen et al. [14] proposed deep convolutional neural networks with ensemble learning and transfer learning to estimate the capacity of lithium-ion batteries. Zhang et al. [15] presented an instance-based ensemble deep transfer learning network to recognize degradation of ball screws. In [16], transfer component analysis (TCA) was used to find a common feature representation between different bearings, and an SVM prediction model was then constructed to achieve RUL prediction. Zhang et al. [17] utilized an ensemble-learning-based prognostic approach to improve model performance in RUL prediction of aircraft engines.
To overcome the aforementioned drawbacks, this paper presents a hybrid model for HI construction based on a deep belief network (DBN) and a self-organizing map (SOM), together with an RUL prediction model based on a long short-term memory (LSTM) neural network with transfer learning and ensemble learning. As shown in Figure 1, the HI is first constructed from measured vibration data by the hybrid DBN-SOM model. Following that, an LSTM with transfer learning and ensemble learning is constructed to improve RUL prediction performance for the target bearing by using auxiliary data that carry the inherent degradation trend under different working conditions. Finally, the performance of the proposed method is demonstrated on experimental bearing datasets. The main contributions of this paper are summarized as follows.
(1) An unsupervised HI construction method based on DBN and SOM is proposed. The process of feature extraction and feature fusion does not require artificial labels, and there is no need to determine a fault threshold (FT) based on the experience of researchers. The constructed HI therefore eliminates the influence of manual intervention.
(2) Transfer learning and ensemble learning are introduced into a long short-term memory neural network to improve the accuracy and robustness of the model when it is trained with a small amount of data. Experimental results indicate that the LSTM with transfer learning and ensemble learning (LSTM-ETL) outperforms the standard LSTM.

The rest of the paper is organized as follows: Section 2 reviews existing studies related to the basic theory. In Section 3, the proposed method is introduced. In Section 4, the performance of the proposed method is validated through an experimental dataset. Finally, conclusions are drawn in Section 5.

Deep Belief Network
A deep belief network is formed by stacking multiple restricted Boltzmann machines (RBMs); as a deep learning method, it can extract essential patterns from raw data thanks to its strong nonlinear mapping capability [18]. For instance, Peng et al. [12] proposed a deep belief network to construct an HI and employed an improved particle filter model for RUL prediction.
In a deep belief network, the training process is divided into an unsupervised pretraining phase and a supervised fine-tuning phase. In the pre-training phase, the output of the previous RBM is invoked as the input of the next RBM to update network parameters through a contrastive divergence (CD) algorithm. In the fine-tuning phase, a back-propagation (BP) algorithm is employed to update network parameters from the bottom layer to the top layer, as shown in Figure 2.
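The layer-wise pre-training described above can be illustrated with a minimal numpy sketch of CD-1 (contrastive divergence with a single Gibbs step). The layer sizes, learning rate, and epoch count below are illustrative assumptions, not the paper's settings, and the supervised BP fine-tuning phase is omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    """One restricted Boltzmann machine layer trained with CD-1."""
    def __init__(self, n_visible, n_hidden, lr=0.05):
        self.W = rng.normal(0, 0.01, (n_visible, n_hidden))
        self.b_v = np.zeros(n_visible)   # visible bias
        self.b_h = np.zeros(n_hidden)    # hidden bias
        self.lr = lr

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.b_h)

    def cd1_step(self, v0):
        # Positive phase: hidden activations driven by the data.
        h0 = self.hidden_probs(v0)
        h0_sample = (rng.random(h0.shape) < h0).astype(float)
        # Negative phase: one Gibbs step to obtain a reconstruction.
        v1 = sigmoid(h0_sample @ self.W.T + self.b_v)
        h1 = self.hidden_probs(v1)
        # Contrastive-divergence parameter updates.
        self.W += self.lr * (v0.T @ h0 - v1.T @ h1) / len(v0)
        self.b_v += self.lr * (v0 - v1).mean(axis=0)
        self.b_h += self.lr * (h0 - h1).mean(axis=0)

# Greedy layer-wise pre-training: the output of the previous RBM
# becomes the input of the next one, as in the pre-training phase.
layer_sizes = [8, 6, 4]
data = rng.random((32, 8))
rbms, inp = [], data
for n_v, n_h in zip(layer_sizes[:-1], layer_sizes[1:]):
    rbm = RBM(n_v, n_h)
    for _ in range(20):
        rbm.cd1_step(inp)
    rbms.append(rbm)
    inp = rbm.hidden_probs(inp)

print(inp.shape)  # activations of the top hidden layer
```

The top-layer activations `inp` are what a subsequent fine-tuning (or, in this paper, an SOM) would consume.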

Long Short-Term Memory Neural Network
Compared with a traditional feed-forward neural network, a recurrent neural network (RNN) is a deep learning model into which a notion of time has been introduced, allowing it to relate data across time. As the time sequence grows, the RNN [19] structure effectively deepens through back-propagation across hidden layers, so RNNs can suffer from exploding or vanishing gradients during training.
Therefore, for long sequences, RNNs cannot relate the current information to information far back in the time sequence. To solve this problem, a memory cell is used in the LSTM architecture. As shown in Figure 3, the memory cell replaces the hidden layer of a traditional RNN and contains three kinds of gate: the forget gate, the input gate, and the output gate. As a result, LSTMs are more suitable than traditional RNNs for predicting time-sequence data.
The input of the LSTM consists of the previous output h_{t−1} and the current input x_t, which are fed into the memory cell to determine the information discarded by the forget gate. The forget gate f_t controls what information from the previous memory cell is to be discarded, as defined by:

f_t = σ(w_fx x_t + w_fh h_{t−1} + b_f)

where σ indicates the sigmoid function; w_fx and w_fh indicate the weight from the input layer to the hidden layer of the forget gate and the weight from the previous hidden layer to the hidden layer of the forget gate at time t, respectively; and b_f is the bias of the forget gate.
Then, the memory cell determines the new information to store, combining the input gate i_t, the candidate value g_t, the forget gate f_t, and the previous state c_{t−1}:

i_t = σ(w_ix x_t + w_ih h_{t−1} + b_i)
g_t = tanh(w_gx x_t + w_gh h_{t−1} + b_g)
c_t = f_t ⊗ c_{t−1} + i_t ⊗ g_t

where w_ix and w_ih indicate the weight from the input layer to the hidden layer of the input gate and the weight from the previous hidden layer to the hidden layer of the input gate at time t, respectively; b_i and b_g are the biases of the input gate and the candidate gate, respectively; c_t indicates the cell state at the present time; tanh(·) indicates the hyperbolic tangent function; and ⊗ indicates element-wise multiplication. Finally, the network decides the output from the current state c_t and the output gate o_t, which is defined as:

o_t = σ(w_ox x_t + w_oh h_{t−1} + b_o)
h_t = o_t ⊗ tanh(c_t)

where w_ox and w_oh indicate the weight from the input layer to the hidden layer of the output gate and the weight from the previous hidden layer to the hidden layer of the output gate at time t, respectively; and b_o is the bias of the output gate.
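The gate equations above can be collected into a single memory-cell update. The following numpy sketch is for shape-checking only; the dimensions and small random weights are illustrative assumptions, not trained parameters:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM memory-cell update following the gate equations above."""
    f_t = sigmoid(p["w_fx"] @ x_t + p["w_fh"] @ h_prev + p["b_f"])  # forget gate
    i_t = sigmoid(p["w_ix"] @ x_t + p["w_ih"] @ h_prev + p["b_i"])  # input gate
    g_t = np.tanh(p["w_gx"] @ x_t + p["w_gh"] @ h_prev + p["b_g"])  # candidate value
    c_t = f_t * c_prev + i_t * g_t                                  # new cell state
    o_t = sigmoid(p["w_ox"] @ x_t + p["w_oh"] @ h_prev + p["b_o"])  # output gate
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(1)
n_in, n_hid = 3, 5
p = {f"w_{g}x": rng.normal(0, 0.1, (n_hid, n_in)) for g in "figo"}
p |= {f"w_{g}h": rng.normal(0, 0.1, (n_hid, n_hid)) for g in "figo"}
p |= {f"b_{g}": np.zeros(n_hid) for g in "figo"}

h = c = np.zeros(n_hid)
for x_t in rng.normal(size=(10, n_in)):  # run a length-10 sequence
    h, c = lstm_step(x_t, h, c, p)
print(h.shape)  # hidden state after the last step
```

Because h_t = o_t ⊗ tanh(c_t), each component of the hidden state stays within [−1, 1] regardless of sequence length.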


Proposed HI Construction
It is well-known that the performance of bearings gradually decreases with time. Correspondingly, to represent the health state of a bearing, an effective and universal method of HI construction is needed. This section provides the procedure for HI construction based on unsupervised learning. The key idea is that the extracted features should retain the essential information of the raw data. Since the DBN has strong nonlinear mapping capability, it is able to extract essential patterns from raw data. Given data T collected over n time steps, the collected data are first normalized to [0, 1] according to t* = (t − t_min)/(t_max − t_min), where t is a raw data value, t_min is the minimum of the raw data, and t_max is the maximum of the raw data. After the DBN structure is determined, the normalized training dataset T* is employed to train the DBN, which is obtained through the pre-training phase and the fine-tuning phase on the normalized dataset. Finally, the extracted features h_n are expressed as:

h_i = σ(W_i h_{i−1} + b_i), h_0 = x, i = 1, 2, ..., n

where W_i and b_i are parameters of the DBN, x denotes the raw vibration signal, σ denotes the activation function of the DBN, n is the number of hidden layers, and h_i denotes the output of the ith hidden layer.
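A minimal sketch of this normalize-then-extract step follows. The random weights are stand-ins for a trained DBN; the 3000-1000-500-100 layer sizes are those reported later in the paper, while the sample count is arbitrary:

```python
import numpy as np

def minmax_normalize(t):
    """Scale raw samples to [0, 1] via t* = (t - t_min) / (t_max - t_min)."""
    return (t - t.min()) / (t.max() - t.min())

def dbn_features(x, weights, biases):
    """Forward pass through the trained DBN: h_i = sigmoid(W_i h_{i-1} + b_i)."""
    h = x
    for W, b in zip(weights, biases):
        h = 1.0 / (1.0 + np.exp(-(h @ W + b)))
    return h  # output of the top hidden layer

rng = np.random.default_rng(2)
sizes = [3000, 1000, 500, 100]           # layer sizes used in the paper
Ws = [rng.normal(0, 0.01, (a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(b) for b in sizes[1:]]

raw = rng.normal(size=(4, 3000))         # 4 hypothetical vibration samples
feats = dbn_features(minmax_normalize(raw), Ws, bs)
print(feats.shape)
```

Each raw 3000-point sample is thus compressed into a 100-dimensional feature vector for the SOM fusion step.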
To fully utilize the feature information extracted from the raw data, it is necessary to fuse the features through an appropriate method. A self-organizing map is an unsupervised algorithm composed of only an input layer and a competitive layer; it can map high-dimensional data into a two-dimensional topology. Therefore, to take advantage of the degradation information in the extracted features, an SOM is employed to fuse them. After the SOM structure is determined, the extracted training features h_n are employed to train the SOM, which is obtained through competitive learning on the training features. Finally, the feature set is fed into the trained SOM to construct the HI according to h_i = ||f − m_BMU||, where h_i is the HI at time i, f denotes the input feature vector, and m_BMU stands for the weight vector of the best matching unit (BMU) of input f [20]. The flowchart of the proposed HI construction based on unsupervised learning is shown in Figure 4.
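The BMU-distance computation can be sketched as follows; the 10×10 SOM grid and its random weights are illustrative assumptions standing in for a trained map:

```python
import numpy as np

def som_hi(feature, som_weights):
    """HI as the distance between a feature vector and its best matching unit:
    h_i = || f - m_BMU ||, following the construction described above."""
    # som_weights: (rows, cols, dim) grid of SOM weight vectors
    dists = np.linalg.norm(som_weights - feature, axis=-1)
    bmu = np.unravel_index(np.argmin(dists), dists.shape)
    return float(np.linalg.norm(feature - som_weights[bmu]))

rng = np.random.default_rng(3)
grid = rng.random((10, 10, 100))        # hypothetical trained 10x10 SOM
features = rng.random((5, 100))         # DBN features over 5 time steps
hi_curve = [som_hi(f, grid) for f in features]
print(len(hi_curve))
```

Since a healthy-state SOM is trained on early-life features, the BMU distance grows as the bearing drifts away from its healthy manifold, yielding an increasing HI curve.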

LSTM Parameter Transfer
As mentioned before, the LSTM algorithm suffers from low prediction accuracy when only small sets of fault data are available. Hence, we introduce knowledge learned from the source domain to improve the performance of the LSTM in the target domain. The source dataset is usually different from, but related to, the target domain. Based on the HI constructed in Section 3.1, the source dataset and the target dataset are expressed as follows:

Y^s = {HI^s_1, HI^s_2, ..., HI^s_n}
Y^t = {HI^t_1, HI^t_2, ..., HI^t_m}
Y^t_train = {HI^t_1, ..., HI^t_k}, Y^t_test = {HI^t_{k+1}, ..., HI^t_m}

where Y^s denotes the source HI data, Y^t denotes the target HI data, Y^t_train and Y^t_test are the target training and testing HI data, respectively, HI^s_i denotes the HI value of the source at time i, HI^t_i denotes the HI value of the target at time i, n is the length of the source HI data, k is the length of the target training HI data, and m is the length of the target HI data.
The source dataset Y^s is first used to pre-train five individual LSTM_i (i = 1, 2, ..., 5) models. Subsequently, the knowledge learned from the source dataset is employed to help complete the target task through transfer learning. As shown in Figure 5, the learned parameters θ_i of the trained LSTM_i models are transferred to the LSTM-TL_i models of the target. Finally, the target training data Y^t_train are utilized to fine-tune the parameters of the LSTM-TL_i models to fit the target task.
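The pre-train/transfer/fine-tune flow can be illustrated with a toy model. A one-parameter linear one-step-ahead predictor stands in for the LSTM, and the HI curves and hyperparameters are made up, so only the transfer mechanics carry over:

```python
import numpy as np

rng = np.random.default_rng(4)

def train(y, w, b, lr=0.01, epochs=200):
    """Fit a one-step-ahead predictor y[t+1] ~ w*y[t] + b by gradient descent.
    (A linear model stands in for the LSTM purely to show the transfer flow.)"""
    x, t = y[:-1], y[1:]
    for _ in range(epochs):
        err = (w * x + b) - t
        w -= lr * 2 * np.mean(err * x)
        b -= lr * 2 * np.mean(err)
    return w, b

# Source HI: a long run-to-failure history; target HI: only the first 20%.
src = np.linspace(0, 1, 200) ** 2 + rng.normal(0, 0.01, 200)
tgt = np.linspace(0, 1, 150) ** 1.8 + rng.normal(0, 0.01, 150)
tgt_train = tgt[:30]

# 1) Pre-train on the source domain.
w_s, b_s = train(src, w=0.0, b=0.0)
# 2) Transfer the learned parameters and briefly fine-tune on the target.
w_t, b_t = train(tgt_train, w=w_s, b=b_s, lr=0.005, epochs=50)
print(float(w_t) > 0)  # transferred slope stays positive after fine-tuning
```

The key point is step 2: the target model starts from the source parameters rather than from scratch, so only a short fine-tuning run on the small target set is needed.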

In parameter transfer, the parameters learned from the source domain can be effectively utilized to optimize the target model: the knowledge learned in the source domain improves the learning task in the target domain. This improves the RUL prediction accuracy of LSTM models trained with a small set of data.

Ensemble LSTM-ETL
After parameter transfer, a single LSTM model may still perform poorly. To avoid this problem, ensemble learning is utilized; its main goal is to decrease the risk of relying on a single learning algorithm with poor performance. The framework of the LSTM-ETL model is shown in Figure 6.


In previous research, a widely used optimization method, stochastic gradient descent (SGD) with momentum, has been employed to minimize the expected generalization error between the output and the real value, improving the accuracy and robustness of the algorithm. The cost function of the model is defined as:

C = Σ_t (ŷ_t − y_t)² + λ·θ(w)

where C is the cost function of the model, λ is the L2 regularization factor of the model, θ(w) is the regularization term related to the weights, ŷ_t is the output value of the model at time t, and y_t is the label value of the model at time t.
The learned parameters of the n LSTM models are transferred to construct n LSTM-TL models. Subsequently, the LSTM-ETL model is established by using a BP neural network to integrate the n individual LSTM-TL models. Detailed steps of the ensemble strategy are described as follows:
(1) Construction of the LSTM-ETL model: the outputs of the n LSTM-TL models are integrated by a BP neural network. This layer assigns model weights w^e_i to the individual outputs y^i_t and calculates the ensemble predictive output y^ensemble_t of the n LSTM-TL models based on the weights w^e_i, such that y_t = [HI_{t+1}] and y_{t+1} = [HI_{t+2}], respectively. In each iteration, the input of the model consists of the second HI value to the last HI value of the input in the previous iteration, together with the prediction output from the previous iteration [21].
(2) Updating of the parameters: through the procedure above, the n LSTM-TL models contain a set of adjusted parameters. To further update these parameters and the parameters of the secondary model (BP), SGD with momentum is used to update the weights w and biases b so as to minimize the expected generalization error between the output of LSTM-ETL and the real value. From the predicted value and the real value, the parameters of the feed-forward neural network can be obtained by error back-propagation of the cost function. The network output and the prediction stopping rule can be written as:

y^(k) = w_yh h^(k) + b_y, and the prediction stops once y^(k) ≥ y_threshold

where w_yh is the weight between the hidden layer and the output layer, the vector b_y is the bias parameter of the output layer, h^(k) is the hidden-layer output in step k, y^(k) denotes the predicted value of the model in step k, and y_threshold denotes the predefined failure threshold.

Figure 6. Construction of the proposed LSTM-ETL model.
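The integration step can be sketched with a linear least-squares combiner standing in for the BP integration layer; the per-model predictions and HI values below are synthetic:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical one-step-ahead predictions from n = 5 fine-tuned LSTM-TL
# models on a validation window, plus the true HI values (made-up numbers).
true_hi = np.linspace(0.3, 0.9, 50)
preds = np.stack([true_hi + rng.normal(0, 0.02, 50) for _ in range(5)])  # (5, 50)

# A linear combiner stands in for the paper's BP integration layer:
# find ensemble weights w_e minimizing ||preds.T @ w_e - true_hi||^2.
w_e, *_ = np.linalg.lstsq(preds.T, true_hi, rcond=None)
ensemble = w_e @ preds

mse_single = float(((preds[0] - true_hi) ** 2).mean())
mse_ens = float(((ensemble - true_hi) ** 2).mean())
print(mse_ens <= mse_single)  # combiner is at least as good on the fit window
```

On the fitting window the least-squares combiner can never do worse than any single model, since using one model alone is itself a feasible weight vector; this is the risk-reduction argument behind the ensemble.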

An Experiment System
Experimental data are provided by the Key Laboratory of Education Ministry for Modern Design and Rotor-Bearing System [21]; these datasets have been widely employed in RUL prediction research. In the experiments, two accelerometers are positioned in the horizontal and vertical directions of the tested bearings: one mounted on the horizontal axis and the other on the vertical axis. The data of the tested bearings were acquired on the experimental platform shown in Figure 7. The sampling frequency was set to 25.6 kHz. A total of 32,768 data points (i.e., 1.28 s) was recorded in each sampling, and the sampling period was 1 min. To avoid redundancy, only the horizontal vibration was employed to verify the algorithms in this paper.

In the experiments, there were three different operating conditions: (1) 2100 rpm and 12 kN, (2) 2250 rpm and 11 kN, and (3) 2400 rpm and 10 kN. The full-cycle data of Condition 2 are shown in Figure 8. The last two operating conditions were selected in this paper to demonstrate the effectiveness of the proposed method. Bearing2_1 and bearing2_2 data from Condition 2 were regarded as the training dataset for constructing the HI model of the target domain. Bearing3_1 and bearing3_2 data from Condition 3 were regarded as the training dataset for constructing the HI model of the source domain.

Health Indicator Construction
According to [18], the structure of the DBN is composed of stacked RBMs, with the number of hidden layers set to three. The size of the input layer is 3000, and the sizes of the first, second, and final hidden layers are 1000, 500, and 100, respectively. The input layer of the SOM is 100, and the output layer is 1. After the structures of the DBN and SOM are determined, the HI is constructed following the process shown in Figure 4.
To verify the performance of the proposed construction method, the RMS [23], PCA [24], and DBN-SOM methods are compared on bearing3_1. As shown in Figure 9, compared with the HI curves constructed by the traditional methods, the degradation curve constructed by the DBN-SOM method is smoother and has better monotonicity. To further illustrate the effectiveness of the DBN-SOM method, correlation and monotonicity are used to evaluate the constructed HI. The former represents the linear relationship between the constructed HI and the sampling time; the latter evaluates the increasing or decreasing trend of the constructed HI over time. The two metrics are computed as:

Corr = |Σ_t (F_t − F̄)(l_t − l̄)| / sqrt(Σ_t (F_t − F̄)² · Σ_t (l_t − l̄)²)
Mon = |No. of dF > 0 − No. of dF < 0| / (T − 1)

where F_t represents the HI value of the sample at time t, l_t represents the time value of the sampling point, dF represents the differences of the HI sequence, and T represents the number of sampling points over the whole bearing life cycle.
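Under the standard PHM definitions of these two metrics (assumed here, since the printed formulas did not survive extraction), they can be computed as:

```python
import numpy as np

def correlation(F, l):
    """Absolute linear correlation between HI values F and time stamps l."""
    Fc, lc = F - F.mean(), l - l.mean()
    return float(abs((Fc * lc).sum()) / np.sqrt((Fc**2).sum() * (lc**2).sum()))

def monotonicity(F):
    """|#positive differences - #negative differences| / (T - 1)."""
    dF = np.diff(F)
    return float(abs(int((dF > 0).sum()) - int((dF < 0).sum())) / (len(F) - 1))

t = np.arange(100.0)
hi = 1e-4 * t**2            # a smooth, strictly increasing toy HI curve
corr, mono = correlation(hi, t), monotonicity(hi)
print(mono)  # 1.0 for a strictly increasing sequence
```

A strictly increasing HI scores 1.0 on monotonicity, while its correlation stays below 1.0 unless the curve is exactly linear in time.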

Table 1 shows the monotonicity and correlation results of the HI curves for the RMS, PCA, and DBN-SOM methods. It can be seen that the HI performance of the DBN-SOM method on most bearings is significantly better than that of the two comparison methods. Finally, the constructed HIs are shown in Figure 10. It can be seen that the HI constructed by the DBN-SOM method can effectively reflect the state of the bearing.
It can be seen that the HI performance of the DBN-SOM method on most bearings is significantly better than the two methods used for comparison. Finally, the constructed HIs are shown in Figure 10. It can be seen that the HI constructed by the DBN-SOM method can effectively reflect the state of the bearing.   The process of bearing degradation is usually divided into two stages. In the first stage, the bearing is in a healthy state with no risk of bearing failure. In the second stage, the bearing is at risk of failure. If the training data from the first stage are included to train the model, the irrelevant time-series data will interfere with the construction of the model. Therefore, partial data without the trend of degradation is ignored. In the second stage of the constructed HI value, the first 20% of the data are used for training data, and the last 80% of the data are used for validation data.
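The two HI evaluation metrics, correlation with sampling time and monotonicity of the differenced sequence, can be computed directly from a HI sequence (numpy-based sketch, illustrative function names):

```python
import numpy as np

def correlation(hi, t):
    """Absolute Pearson correlation between the HI sequence and sampling time."""
    hi, t = np.asarray(hi, float), np.asarray(t, float)
    return abs(np.corrcoef(hi, t)[0, 1])

def monotonicity(hi):
    """|#positive diffs - #negative diffs| / (T - 1) for a HI sequence of length T."""
    d = np.diff(np.asarray(hi, float))
    return abs(int((d > 0).sum()) - int((d < 0).sum())) / len(d)
```

A strictly increasing HI scores 1.0 on both metrics, while an oscillating HI with equal rises and falls scores 0 on monotonicity, which is why the smoother DBN-SOM curves score higher in Table 1.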

RUL Prediction
The main objective of the proposed method is to improve the RUL prediction accuracy of LSTM with a small set of training samples. To achieve this, five LSTM models are first pre-trained on the source HI data, and the LSTM-ETL model is then retrained on the target HI data. The model parameter settings are shown in Table 2. To demonstrate the performance of the LSTM-ETL model for RUL prediction, the proposed method is compared with the LSTM, LSTM-TL, and SVM methods. The training set is used to update the RUL model; the RUL value is obtained when the predicted HI reaches the failure threshold, which is set to 1.0.
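The prediction loop described above can be sketched without a deep learning framework. In the paper each base learner is an LSTM pre-trained on a source-domain HI and fine-tuned on target data; in this sketch each LSTM is replaced by a simple least-squares autoregressive model as a stand-in, so the focus is on the ensemble rollout to the failure threshold of 1.0. Function names and the `dt` sampling interval are illustrative assumptions.

```python
import numpy as np

FAILURE_THRESHOLD = 1.0  # HI value at which the bearing is deemed failed

def fit_ar(hi, order=3):
    """Least-squares linear AR model: a stand-in for one pre-trained base learner."""
    hi = np.asarray(hi, float)
    X = np.array([hi[i:i + order] for i in range(len(hi) - order)])
    y = hi[order:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def predict_rul(hi, models, order=3, dt=10.0, max_steps=10000):
    """Roll each base learner forward, average the ensemble, stop at the threshold."""
    window = list(np.asarray(hi, float)[-order:])
    for step in range(1, max_steps + 1):
        preds = [np.dot(c, window[-order:]) for c in models]  # one step per learner
        nxt = float(np.mean(preds))                           # ensemble average
        if nxt >= FAILURE_THRESHOLD:
            return step * dt  # RUL in seconds (dt = sampling interval)
        window.append(nxt)
    return max_steps * dt
```

In the paper's setting, `models` would hold the five fine-tuned LSTMs rather than AR coefficients, but the rollout-to-threshold logic is the same.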
To compare the performance of the LSTM-ETL model, two error metrics are applied to evaluate the RUL predictions, defined as follows:

$$Er_i = \frac{ActRUL_i - PreRUL_i}{ActRUL_i} \times 100\%$$

$$MAE = \frac{1}{N}\sum_{i=1}^{N}\left|Er_i\right|$$

where ActRUL_i denotes the true RUL of the ith bearing, PreRUL_i denotes the predicted RUL of the ith bearing, Er_i denotes the percentage error for the ith bearing, N denotes the number of bearings, and MAE denotes the mean absolute error over all bearings.
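The two evaluation metrics translate directly into code (illustrative function names):

```python
def percent_error(act_rul, pre_rul):
    """Er_i: signed percentage error of one bearing's RUL prediction."""
    return (act_rul - pre_rul) / act_rul * 100.0

def mean_absolute_error(act_ruls, pre_ruls):
    """MAE: mean of |Er_i| over all bearings."""
    errs = [abs(percent_error(a, p)) for a, p in zip(act_ruls, pre_ruls)]
    return sum(errs) / len(errs)
```

Note that Er_i is signed (negative when the model over-predicts the RUL), while the MAE aggregates magnitudes only, so under- and over-predictions cannot cancel out.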

Results and Discussion
The prediction results are shown in Figure 11, where the proposed method is compared with the other methods in terms of accuracy. As shown in Figure 11, for bearing2_5, the RULs predicted by the SVM, LSTM, LSTM-TL, and LSTM-ETL models are 18,100 s, 16,600 s, 13,300 s, and 9990 s, respectively, while the actual RUL is 8570 s. The error of the LSTM-ETL model is smaller than that of the SVM, LSTM, and LSTM-TL models. As mentioned before, a smaller error indicates a more accurate prediction. This demonstrates that the proposed method can improve model performance with a small set of training samples.
In order to further illustrate the performance of the proposed method, five bearings under Condition 2 are selected as the target bearings to be predicted. The LSTM-ETL, LSTM-TL, LSTM, and SVM models are employed to predict RUL. Table 3 shows the results of the RUL predicted by the LSTM, LSTM-TL, LSTM-ETL, and SVM models in terms of both the Er (percentage error) and MAE (mean absolute error). The following two significant conclusions can be drawn from the results listed in Table 3.

(1) Based on the overall MAE, the performance of the LSTM-ETL method is better than that of the LSTM, LSTM-TL, and SVM methods. In terms of Er for individual bearings, the proposed method has the lowest error on four of the five bearings, the exception being bearing2_4. When training data are insufficient, the overall performance of the LSTM-ETL method is therefore better than that of the other methods.

(2) LSTM-ETL achieves a MAE of 31.89%, which shows that the proposed method can accurately predict the RUL with a small set of data. Compared with the MAEs of LSTM (63.39%) and SVM (66.07%), it can be concluded that both transfer learning and ensemble learning contribute to the LSTM model's higher prognostic accuracy. The MAE of the LSTM-ETL model is 13.54 percentage points lower than that of the LSTM-TL model, which shows that ensemble learning can effectively improve the performance of RUL prediction models beyond single-transfer learning.

Figure 11. Results of RUL prediction for bearings2_1, bearings2_2, bearings2_3, bearings2_4, and bearings2_5.

Conclusions
In this study, an HI based on DBN and SOM was proposed to enhance the RUL prediction accuracy of bearings. During construction of the HI, the feature extraction and feature fusion processes required no artificial labels, and there was no need to determine the fault threshold (FT) based on researcher experience. Subsequently, LSTM with ensemble learning and transfer learning (LSTM-ETL) was used to predict the RUL of bearings. The transfer learning and ensemble learning method yielded relatively low MAEs. The results showed that transfer learning can improve prediction performance by applying knowledge learned from the source domain to the target model, and that ensemble learning can improve accuracy by integrating the results of multiple models. In addition, the proposed method performed better than the comparison methods, providing an improved strategy for RUL prediction. Regarding the negative transfer problem in transfer learning and ensemble learning, the evaluation and analysis of bearing-data information loss under different working conditions will be the topic of future work; this key problem may be addressed through deep metric learning.