Transfer Learning with Deep Recurrent Neural Networks for Remaining Useful Life Estimation

Ansi Zhang 1,2, Honglei Wang 3, Shaobo Li 1,4,*, Yuxin Cui 2, Zhonghao Liu 2, Guanci Yang 1 and Jianjun Hu 2,4,*
1 Key Laboratory of Advanced Manufacturing Technology of Ministry of Education, Guizhou University, Guiyang 550025, China; zhangansi@gmail.com (A.Z.); gcyang@gzu.edu.cn (G.Y.)
2 Department of Computer Science and Engineering, University of South Carolina, Columbia, SC 29208, USA; ycui@email.sc.edu (Y.C.); liu338@email.sc.edu (Z.L.)
3 Guizhou Provincial Key Laboratory of Internet Collaborative Intelligent Manufacturing, Guizhou University, Guiyang 550025, China; gzdxhlwang@163.com
4 School of Mechanical Engineering, Guizhou University, Guiyang 550025, China
* Correspondence: lishaobo@gzu.edu.cn (S.L.); jianjunh@cse.sc.edu (J.H.)


Introduction
Recently, fault diagnosis and health management, including diagnosis and prognosis approaches, have been actively researched [1][2][3][4][5]. Fault diagnosis and health management techniques are widely applied in diverse areas such as manufacturing, aerospace, automotive, power generation, and transportation [6][7][8][9]. Diagnostics is the process of identifying a failure. Prognostics is an engineering discipline focused on predicting the time at which a system or a component will fail to perform its intended function. In diagnostics, once degradation is detected, unscheduled maintenance must be performed to prevent the consequences of failure. In prognostics, maintenance can be prepared while the system is up and running, since the time to failure is known early enough. A major application of prognostics is estimating the remaining useful life (RUL) of systems, also called remaining service or residual life estimation [10]. Prognostic methods can be grouped into two major categories: data-driven methods and physics-based methods. The former requires sufficient samples that are run until faults or failures are detected, whereas the latter requires an understanding of the physics of system failure progression [11].
In recent years, data-driven approaches to remaining useful life (RUL) estimation [12][13][14][15][16] have attracted considerable attention because they avoid dependence on a theoretical understanding of complex systems. However, a major challenge in data-driven prognostics is that it is often impossible to obtain a large number of failure progression samples, since collecting them is costly and labor-intensive [11]. This situation can arise for several reasons: (1) industrial systems are not allowed to run until failure because of the consequences, especially for critical systems and failures; (2) most electro-mechanical failures occur slowly and follow a degradation path, so the failure degradation of a system might take months or even years [17]. Several methods have been used to address this challenge. The first approach is accelerated aging: running the system in a lab under extreme loads or increased speed, or using imitations of real components made of vulnerable materials, so that a failure progresses faster than normal [14][15][16]. Another approach is to introduce an artificial failure progression, for example by using an exponential degradation model in place of the regular failure progression [18]. Both methods have their own strengths and weaknesses and can represent the failure degradation only to a certain level. In real-world applications, where the conditions are very different from the lab environment, these methods are difficult to apply and rarely yield good estimation performance.
Recently, deep learning has achieved impressive results in areas such as computer vision, image and video processing, speech, and natural language processing [19]. Deep learning methods have also been applied to fault diagnosis [20][21][22] and to the RUL estimation problem. For RUL estimation, a Convolutional Neural Network (CNN) was applied over sliding windows of the multivariate sensor data in [23]. However, a major limitation of this work is that the sequence information is not fully exploited by the CNN model. To address this issue, a Long Short-Term Memory (LSTM) model was proposed for RUL estimation [24]; it makes full use of the sensor sequence information and performs much better than the CNN model. Following this success, Convolutional Bi-directional Long Short-Term Memory networks (CBLSTM) were designed in [25] for RUL prediction, where a CNN is first used to extract robust and informative local features from the sequential input and a bi-directional LSTM is then introduced to encode temporal information. In addition to these approaches that use the raw input, a Vanilla LSTM neural network has been used in [26] for RUL prediction, together with a proposed dynamic differential technique to extract inter-frame information, achieving high prediction accuracy. Moreover, an ensemble learning-based prognostic method was proposed in [27], which combines the prediction results of multiple learning algorithms to obtain better performance. However, all these modern deep learning approaches to RUL prediction share a major limitation: they require a large amount of training data, while in real-world applications it is often impossible to obtain a large number of failure progression samples.
Therefore, neither these deep neural network methods nor traditional methods have addressed the major challenge in data-driven prognostics: the scarcity of failure progression samples.
In this paper, we propose to use transfer learning-based deep neural networks for RUL prediction. In recent years, transfer learning has made great progress in image, audio, and text processing by addressing data scarcity through datasets from related domains. It is widely adopted when the source data and target data lie in different feature spaces or have different distributions [28,29]. Transfer learning methods work by learning properties from the source data and transferring them to the target data. The source and target data can differ, but they should be at least loosely related. Zhong et al. [30] applied the domain separation framework of transfer learning to automatic speech recognition. Singh et al. [31] applied transfer learning to object detection with improved detection performance. Cao et al. [32] used transfer learning for breast cancer histology image analysis and achieved better performance than popular handcrafted features. These studies showed that transfer learning can exploit both source and target data to obtain better performance. In real-world RUL prediction, the scarcity of target data is one of the major challenges in data-driven prognostics, as it is often not possible to obtain large numbers of failure progression samples. However, we usually have access to a large amount of data from different but approximately related working conditions. In this paper, we propose a transfer learning approach with LSTM deep neural networks for RUL prediction, which provides an effective way to address this major challenge in data-driven prognostics.
Our contributions in this paper include: (1) We developed a bidirectional LSTM recurrent neural network model for RUL prediction; (2) We proposed and demonstrated for the first time that a transfer learning-based prognostic model can boost RUL estimation performance by making full use of different but related datasets; (3) We showed that datasets of mixed working conditions can be used to improve the performance of single working condition RUL prediction, while the opposite is not true. This offers useful guidance for real-world applications where samples of certain working conditions are hard to obtain.
The rest of this paper is organized as follows: Section 2 describes the transfer learning-based prognostic RUL prediction algorithm. Section 3 presents the experiments and results. Section 4 discusses related issues of the results, and Section 5 concludes the paper.

The Turbofan Engine RUL Prediction Problem and the C-MAPSS Datasets
To verify our transfer learning-based algorithm, we selected the turbofan engine RUL prediction problem as the benchmark. The corresponding C-MAPSS datasets (Turbofan Engine Degradation Simulation Datasets) are widely used for RUL estimation [33]. These datasets are provided by the NASA Ames Prognostics Data Repository [34] and contain four sub-datasets, as given in Table 1. Each sub-dataset consists of multiple multivariate time series, which are further divided into training and testing sets. In the training set, the fault grows in magnitude until the system fails. In the testing set, the time series ends some time prior to system failure. Each trajectory is the cycle-by-cycle record of one engine, and each cycle record is a snapshot of data taken during a single operational cycle. A single-cycle datum in the C-MAPSS dataset is a 24-dimensional feature vector consisting of 3 operational settings and 21 sensor values. The operating condition settings are altitude, Mach number, and throttle resolver angle, which together determine the different flight conditions of the aero-engine. In sub-dataset FD001, the engines suffer a high-pressure compressor failure under a single operating condition. In FD002, the engines suffer a high-pressure compressor failure under six operating conditions. In FD003, the engines suffer high-pressure compressor and fan failures under a single operating condition. In FD004, the engines suffer high-pressure compressor and fan failures under six operating conditions.
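To make this data layout concrete, the sketch below parses one raw C-MAPSS row into its fields. The 26-column layout (unit id, cycle index, 3 operational settings, 21 sensor values) follows the dataset's readme; the row used here is a synthetic placeholder for illustration, not real data.

```python
import numpy as np

def parse_cmapss_row(line):
    """Parse one space-separated C-MAPSS row into its fields.

    Each row has 26 values: unit id, cycle index, 3 operational
    settings, and 21 sensor measurements.
    """
    values = [float(v) for v in line.split()]
    assert len(values) == 26, "unexpected column count"
    return {
        "unit": int(values[0]),
        "cycle": int(values[1]),
        "settings": np.array(values[2:5]),
        "sensors": np.array(values[5:26]),
    }

# Hypothetical row for illustration (not real data):
row = " ".join(["1", "1"] + ["0.0"] * 24)
rec = parse_cmapss_row(row)
print(rec["unit"], rec["cycle"], rec["sensors"].shape)  # 1 1 (21,)
```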

Transfer Learning for RUL Prediction
Based on the availability of sample labels, transfer learning can be divided into three categories: inductive transfer learning, transductive transfer learning, and unsupervised transfer learning [28]. Inductive transfer learning requires the availability of the target data labels. If the source data labels are available but the target data labels are not, transductive transfer learning should be used. If both the target and source data labels are unavailable, unsupervised transfer learning should be used. Since both the target and source data labels are available in the C-MAPSS datasets, inductive transfer learning is the most suitable approach, and it is the one we adopted.
Based on what to transfer, transfer learning can be conducted at several levels: instance-transfer, feature-representation-transfer, parameter-transfer, and relational-knowledge-transfer [28]. Instance-transfer re-weights some labeled data from the source domain and then uses them in the target domain. Feature-representation-transfer tries to find good feature representations that reduce the difference between the source and target domains. Parameter-transfer discovers shared parameters or prior knowledge between the source and target domains. Relational-knowledge-transfer maps common relational knowledge, or similar patterns from the inputs to the outputs, between both domains. According to the properties of the C-MAPSS datasets and our experimental conditions, parameter-transfer is selected in our algorithm.
In this research, we apply transfer learning to address one of the major challenges in data-driven prognostics by transferring model parameters, learned from a different but approximately related domain with a large amount of source data, to the target RUL prediction problem with a small amount of data. The parameter-transfer scheme can be defined as follows:

T_s: Y_s = f_s(X_s; W_s),    T_t: Y_t = f_t(X_t; W_t),

where D_s, X_s, T_s, D_t, X_t, T_t respectively represent the source domain, source samples, source task labels, target domain, target samples, and target task labels, and D_s, D_t are related or similar domains.

The relationship between the model weights of the source problem and those of the target problem can be represented as:

W_s = W_0 ∪ W'_s,    W_t = W_0 ∪ W'_t.

Here, W_s and W_t are the parameters of the source and target tasks, respectively; they have a common part W_0 and task-specific parts W'_s and W'_t. Y_s, Y_t are the real outputs, and f_s, f_t denote the learning models that map sample inputs to task labels. The parameter-transfer scheme transfers parameters from W_s to W_t by reusing the common part W_0 and fine-tuning the task-specific parts through further training on the target task.
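As a self-contained illustration of this parameter-transfer scheme, the sketch below uses a toy linear model on synthetic data (not the paper's BLSTM): it trains on a large source task, reuses the learned weights to initialize a small related target task, and compares fine-tuning against training from scratch. All data and weight values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def train(X, y, w_init, lr=0.1, epochs=200):
    """Fit y ~ X @ w by gradient descent on the mean squared error."""
    w = w_init.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

# Source task: plenty of data (hypothetical linear system).
w_src_true = np.array([1.0, -2.0, 0.8])
X_s = rng.normal(size=(500, 3))
y_s = X_s @ w_src_true

# Related target task: only 10 samples, slightly shifted weights.
w_tgt_true = w_src_true + np.array([0.0, 0.0, 0.3])
X_t = rng.normal(size=(10, 3))
y_t = X_t @ w_tgt_true

w_source = train(X_s, y_s, np.zeros(3))              # Step 1: learn W_s on source
w_scratch = train(X_t, y_t, np.zeros(3), epochs=5)   # no transfer
w_transfer = train(X_t, y_t, w_source, epochs=5)     # W_s initializes W_t, then fine-tune

# Evaluate both on held-out target data.
X_test = rng.normal(size=(200, 3))
y_test = X_test @ w_tgt_true
mse = lambda w: float(np.mean((X_test @ w - y_test) ** 2))
print(mse(w_scratch), mse(w_transfer))  # transfer should give the lower error
```

The design point is that the source-trained weights already sit close to the target optimum, so a few fine-tuning steps on scarce target data suffice, whereas training from scratch with the same budget does not converge.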

The Transfer Learning Framework for RUL Prediction
The transfer learning framework is illustrated in Figure 2. It is composed of two Long Short-Term Memory (LSTM) networks. The network on the top is first trained on the large amount of data from the source task. The learned model is then fine-tuned by further training on the small amount of data from the target task, which is a different but related task; in our experiments, it represents degradation failures under different working conditions.


The BLSTM Neural Networks
To take advantage of the sequential nature of the sensor data in the turbofan engine RUL prediction problem, recurrent neural networks (RNNs) are favored for their ability to capture time-dependent relationships. However, conventional RNNs suffer from vanishing and exploding gradients, which makes such models extremely challenging to train. To address these issues, the Long Short-Term Memory (LSTM) network, a gated RNN, was proposed in [35]. Basic LSTM, however, processes sequential data only in the forward direction. To capture both past and future contexts, the Bi-directional LSTM (BLSTM) [36] was proposed to process the sequence in both the forward and backward directions with two separate hidden layers, whose outputs are then fed forward to the same output layer. This model has been shown to achieve good performance in machine health monitoring [25].
In an LSTM/BLSTM network, an LSTM cell contains three gates, namely the input gate i_t, forget gate f_t, and output gate o_t, which determine whether to use the input, whether to update the cell memory state, and whether to produce an output, respectively. At each time step t, the following standard equations define the LSTM cell [25]:

i_t = σ(W_i x_t + V_i h_{t−1} + b_i)
f_t = σ(W_f x_t + V_f h_{t−1} + b_f)
o_t = σ(W_o x_t + V_o h_{t−1} + b_o)
c̃_t = tanh(W_c x_t + V_c h_{t−1} + b_c)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t
h_t = o_t ⊙ tanh(c_t)

Here, the model parameters, including all W_i, V_i and b_i (and their counterparts for the other gates), are shared across all time steps and learned during model training; σ is the sigmoid function, ⊙ is the element-wise product, and c_t is the memory cell. The hidden state h_t is updated from the current input x_t and the hidden state h_{t−1} at the previous time step. A BLSTM runs a forward pass (→) and a backward pass (←) over the sequence and concatenates their hidden states into the complete output:

h_t = [→h_t ; ←h_t]

In our transfer learning framework, two BLSTM neural networks are used. Each has four hidden layers: the first BLSTM layer has 64 nodes with return sequences and a dropout rate of 0.2; the second BLSTM layer has 64 nodes with return sequences and a dropout rate of 0.2; the third is a flatten layer; and the fourth is a dense layer with 128 nodes and a dropout rate of 0.5. Finally, a one-dimensional output layer predicts the RUL. We apply L2 regularization to the four hidden layers and use early stopping during training. In our framework, the top BLSTM network is first trained on the source data and then refined by training on the target dataset, so the source-trained model can be considered the initializer for the bottom BLSTM network trained on the target task data.
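The cell equations above can be sketched in NumPy as follows. This is a shape-checking illustration with randomly initialized weights, not a trained model; the hidden size of 8 is illustrative, not the 64 nodes used in our networks.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, P):
    """One LSTM cell update following the gate equations above."""
    i = sigmoid(P["W_i"] @ x_t + P["V_i"] @ h_prev + P["b_i"])  # input gate
    f = sigmoid(P["W_f"] @ x_t + P["V_f"] @ h_prev + P["b_f"])  # forget gate
    o = sigmoid(P["W_o"] @ x_t + P["V_o"] @ h_prev + P["b_o"])  # output gate
    c_tilde = np.tanh(P["W_c"] @ x_t + P["V_c"] @ h_prev + P["b_c"])
    c = f * c_prev + i * c_tilde        # memory cell update
    h = o * np.tanh(c)                  # new hidden state
    return h, c

def blstm(X, P_fwd, P_bwd, hidden):
    """Run forward and backward passes and concatenate hidden states."""
    T = len(X)
    h_f, h_b = np.zeros((T, hidden)), np.zeros((T, hidden))
    h, c = np.zeros(hidden), np.zeros(hidden)
    for t in range(T):                   # forward direction
        h, c = lstm_step(X[t], h, c, P_fwd)
        h_f[t] = h
    h, c = np.zeros(hidden), np.zeros(hidden)
    for t in reversed(range(T)):         # backward direction
        h, c = lstm_step(X[t], h, c, P_bwd)
        h_b[t] = h
    return np.concatenate([h_f, h_b], axis=1)   # [h_fwd ; h_bwd]

rng = np.random.default_rng(0)
d_in, hidden, T = 24, 8, 30              # 24-dim cycle features, toy hidden size

def init_params():
    P = {}
    for g in "ifoc":
        P[f"W_{g}"] = rng.normal(scale=0.1, size=(hidden, d_in))
        P[f"V_{g}"] = rng.normal(scale=0.1, size=(hidden, hidden))
        P[f"b_{g}"] = np.zeros(hidden)
    return P

H = blstm(rng.normal(size=(T, d_in)), init_params(), init_params(), hidden)
print(H.shape)  # (30, 16): forward and backward states concatenated per step
```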

Input Data and Parameter Settings
All training and test datasets are shown in Table 1. After preprocessing, a single-cycle datum in the C-MAPSS dataset is a 30-dimensional feature vector consisting of the 3 operational settings, the 21 sensor values, and 6 one-hot encoded operating condition values. Each input sample for our networks contains 30 consecutive single-cycle records extracted from each multivariate trajectory, as shown in Figure 3, with a sliding-window step size of 1. The true RUL of a sample is the true RUL at the last single-cycle record of the window. For training, 40% of the trajectories are randomly selected as validation data. For testing, the last sliding window of every trajectory is used as the input data.
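The sliding-window extraction can be sketched as follows; a hypothetical 200-cycle trajectory with a plain linear RUL label is used for illustration.

```python
import numpy as np

def sliding_windows(trajectory, rul, window=30, step=1):
    """Cut one trajectory (cycles x features) into overlapping samples.

    Each sample holds `window` consecutive cycle records; its label is
    the true RUL at the window's last cycle.
    """
    X, y = [], []
    for start in range(0, len(trajectory) - window + 1, step):
        end = start + window
        X.append(trajectory[start:end])
        y.append(rul[end - 1])
    return np.array(X), np.array(y)

# Hypothetical trajectory: 200 cycles of 30-dim features, linear RUL labels.
traj = np.zeros((200, 30))
rul = np.arange(199, -1, -1)             # 199, 198, ..., 0 (fails at the end)
X, y = sliding_windows(traj, rul)
print(X.shape, y[0], y[-1])  # (171, 30, 30) 170 0
```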

Evaluation
To evaluate the performance of our RUL estimation model on the test data, two measures are used: the Scoring Function and the Root Mean Square Error (RMSE).
The scoring function proposed in the PHM 2008 Data Challenge [37] is shown in Equation (10), where n is the total number of trajectories, h_i = RUL_est,i − RUL_i, RUL_est,i is the estimated RUL, and RUL_i is the true RUL:

s_i = exp(−h_i/13) − 1 for h_i < 0,    s_i = exp(h_i/10) − 1 for h_i ≥ 0,    S = Σ_{i=1}^{n} s_i. (10)

This scoring function applies two different penalties. The penalty for h_i < 0 (estimated RUL less than the true RUL) is smaller than the penalty for h_i ≥ 0 (estimated RUL greater than the true RUL). The justification is that when h_i < 0 we still have time to conduct system maintenance, but when h_i ≥ 0 the maintenance would be scheduled later than the required time, which may cause system failure.
RMSE is also widely used as an evaluation metric for RUL estimation, as shown in Equation (11); it assigns the same penalty weight to h_i < 0 and h_i ≥ 0:

RMSE = sqrt((1/n) Σ_{i=1}^{n} h_i²). (11)
We used the mean score as the loss function for training our networks:

Loss = S / n, (12)

where S is calculated from Equation (10) and n is the total number of trajectories.
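Both evaluation measures can be implemented directly; the sketch below uses the PHM 2008 divisors (13 for early predictions, 10 for late ones) and illustrative RUL values.

```python
import numpy as np

def phm08_score(rul_est, rul_true):
    """Asymmetric PHM 2008 scoring function (Equation (10)).

    Late predictions (h >= 0) are penalized more heavily (divisor 10)
    than early ones (h < 0, divisor 13).
    """
    h = np.asarray(rul_est, float) - np.asarray(rul_true, float)
    return float(np.sum(np.where(h < 0, np.exp(-h / 13) - 1, np.exp(h / 10) - 1)))

def rmse(rul_est, rul_true):
    """Symmetric RMSE metric (Equation (11))."""
    h = np.asarray(rul_est, float) - np.asarray(rul_true, float)
    return float(np.sqrt(np.mean(h ** 2)))

true = np.array([100.0, 50.0])
early = true - 10            # h = -10 for both units (predicted too early)
late = true + 10             # h = +10 for both units (predicted too late)
print(phm08_score(early, true), phm08_score(late, true))  # late scores worse
print(rmse(early, true) == rmse(late, true))              # RMSE is symmetric
```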

Data Normalization
There are several ways to normalize the sensor data. Here we used standardization to remove the mean and scale the data to unit variance, as in Equation (13):

x'_i = (x_i − µ_i) / σ_i, (13)

where x_i is the ith sensor's data, x'_i is the normalized data, µ_i is the mean of the ith sensor's data, and σ_i is the corresponding standard deviation.
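A sketch of this per-sensor standardization follows; the small eps term is our own safeguard against constant sensors with zero variance, not part of Equation (13).

```python
import numpy as np

def standardize(X, eps=1e-8):
    """Per-sensor z-score normalization (Equation (13)).

    X is (samples x sensors); statistics are computed per column.
    eps guards against constant sensors whose std is zero.
    """
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / (sigma + eps)

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(1000, 21))   # 21 toy sensor channels
Z = standardize(X)
print(float(np.abs(Z.mean(axis=0)).max()))   # ~0: means removed
print(np.allclose(Z.std(axis=0), 1.0, atol=1e-3))     # unit variance per sensor
```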

Operating Conditions
Previous research [23,24,37] on the C-MAPSS dataset showed that the operating settings in this dataset can be clustered into six distinct groups, each representing a distinct operating condition. Here we used the K-means algorithm to cluster the operating setting values of all datasets into 6 clusters, as shown in Figure 4. Based on this clustering, an operating setting label can be represented as a 6-dimensional one-hot vector.
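The clustering-plus-encoding step can be sketched as below. For self-containedness we use a minimal hand-rolled k-means on hypothetical, well-separated setting values; in practice a library implementation such as scikit-learn's KMeans would be used, and there would be 6 clusters rather than the 3 shown here.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Minimal k-means: returns a cluster label for each row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center.
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        # Move each center to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def one_hot(labels, k):
    """Encode integer cluster labels as one-hot vectors."""
    return np.eye(k)[labels]

# Hypothetical settings: 3 tight clusters in the 3-d operating-setting space.
rng = np.random.default_rng(1)
means = np.array([[0.0, 0.0, 0.0], [10.0, 10.0, 10.0], [20.0, 0.0, 20.0]])
X = np.vstack([m + 0.1 * rng.normal(size=(50, 3)) for m in means])
labels = kmeans(X, k=3)
codes = one_hot(labels, k=3)
print(codes.shape)  # (150, 3): one one-hot condition code per cycle record
```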

RUL Target Function
The traditional way to define the degradation process of a system is a linear model in time. In practical applications, however, the degradation is negligible at the beginning of use and only increases after an anomaly point. It is hard to estimate RUL before the anomaly point, and estimating RUL during this period, while the system works well, is not useful in practice. Hence, for the C-MAPSS datasets, a piece-wise linear degradation model was proposed in [37], which limits the maximum value of the RUL function, as illustrated in Figure 5.
In this paper, we set the maximum RUL limit to 130 time cycles, the same as in [37]. We ignore data whose true RULs are greater than this maximum limit so as to focus on the degradation data.
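The piece-wise linear target can be computed as below, assuming (as in Figure 5) a linear count-down to failure clipped at the maximum RUL of 130.

```python
import numpy as np

def piecewise_rul(n_cycles, max_rul=130):
    """Piece-wise linear RUL target: linear decay capped at max_rul.

    The engine fails at cycle n_cycles; earlier cycles count down
    linearly, but values above max_rul are clipped to max_rul.
    """
    linear = np.arange(n_cycles - 1, -1, -1)   # n-1, n-2, ..., 0
    return np.minimum(linear, max_rul)

rul = piecewise_rul(200)
print(rul[0], rul[69], rul[70], rul[-1])  # 130 130 129 0
```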

Experiments and Results
To verify the performance of our BLSTM-based transfer learning algorithm for RUL estimation, we conducted a series of experiments on the C-MAPSS Datasets. We evaluated the performance of transfer learning over every dataset and studied how different working conditions and the numbers of samples for each pair of source and target datasets affect the performance of the resulting prediction models.
We designed four groups of without-transfer experiments (E1, E4, E7, E10) and eight groups of transfer learning experiments (E2, E3, E5, E6, E8, E9, E11, E12) involving four target datasets (FD001, FD002, FD003, FD004), as shown in Table 2. For each target dataset, we randomly selected 10, 20, ..., 90, 100 trajectories to generate 10 evaluation datasets, which allows us to evaluate how the number of samples in the target dataset affects the performance of transfer learning. For each transfer experiment, we compared the performance of standard BLSTM models trained on the target dataset without transfer learning against that of BLSTM models with transfer learning. We repeated each experiment three times to account for the randomness of the algorithms. In total, we conducted 10 × 3 × (8 + 4) = 360 experiments, where 10 is the number of evaluation datasets (each with a different number of training trajectories), 3 is the number of repeats per experiment, 8 is the number of transfer experiment pairs, and 4 is the number of experiments without transfer. All experiments were evaluated with the score function and RMSE. Note that experiment groups E2 and E3 share the same target data but use different source data, and both are compared with group E1; the same holds for E5 and E6 (compared with E4), E8 and E9 (compared with E7), and E11 and E12 (compared with E10). Table 3 shows the results of all experiments in terms of the mean values of the performance scores and RMSE. IMP is the improvement of the models with transfer learning over those without transfer learning, defined as IMP = (1 − WithTransfer/NoTransfer) × 100.
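The IMP metric is straightforward to compute; since both the score and RMSE are lower-is-better, a positive IMP indicates that transfer helped (the numbers below are illustrative, not from Table 3).

```python
def imp(with_transfer, no_transfer):
    """Improvement of transfer over no-transfer, in percent.

    Metrics are lower-is-better, so a positive IMP means transfer helped
    and a negative IMP indicates negative transfer.
    """
    return (1 - with_transfer / no_transfer) * 100

print(imp(75.0, 100.0))   # 25.0: transfer reduced the score by a quarter
print(imp(125.0, 100.0))  # -25.0: negative transfer
```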
From the IMP values, we can see that transfer learning yields large improvements except for groups E3 and E9 (which transfer from multiple operating conditions to a single condition), as will be explained in Section 4. For example, when the number of trajectories used is less than or equal to 60, transfer learning improved the performance scores by 16.41% to 85.12% for group E2. Similar improvements are observed for groups E5, E6, E8, E11, and E12.
In particular, we found that transfer learning is generally more effective for models trained on small datasets. For example, it improves the model score by 43.19% for group E5 when only 10 trajectories are used to create the training set, whereas the improvement is only 8.29% when 60 trajectories are used for the same group.
We also found that the Mean Score and RMSE follow similar trends. Figure 6 shows box plots of the Mean Scores to illustrate the distribution of every transfer learning experiment; Figure 6a-d respectively illustrate the performance for the four target datasets (FD001, FD002, FD003, and FD004). The blue boxes show the performance without transfer, and the yellow and green boxes the performance with transfer. In Figure 6a, the transfer experiment E2 is more stable and achieves a lower mean score than the without-transfer experiment E1. In Figure 6b, the transfer experiments E5 and E6 are both more stable and achieve lower mean scores than the without-transfer experiment E4. In Figure 6c, the transfer experiment E8 is more stable and achieves a lower mean score than the without-transfer experiment E7. In Figure 6d, the transfer experiments E11 and E12 are both more stable and achieve lower mean scores than the without-transfer experiment E10. Thus, from the box plot distributions, we can observe that the models with transfer learning are more stable and perform markedly better than those without transfer learning, except for the transfer experiments E3 and E9, which will be explained in Section 4. Figure 7 illustrates the RUL estimation results on the test data for experiments in which the training set is generated from 50 randomly selected trajectories, comparing the estimation results on the four test sets (FD001, FD002, FD003, FD004). Figure 7a,d,j show the without-transfer experiments and the other panels show the transfer experiments. In every panel, the X-axis lists the test units sorted by actual RUL value, and the Y-axis gives the actual and predicted RUL values; the blue points are the actual RUL values of the test units and the red points the predicted values.
Comparing transfer and no-transfer performance in these figures, we found that the effect of the transfer experiments is obvious, except for E3 and E9, which will be explained in Section 4. To further examine how the different transfers work, we analyzed the IMP performance of transfer experiments that transfer from different source datasets to the same target dataset. Figure 8a-d respectively illustrate the improvement when transferring to the four target datasets (FD001, FD002, FD003, and FD004). Figure 8a shows the IMP performance when transferring to the FD001 target dataset: E2 is effective, with an IMP greater than 20% when the number of trajectories is less than or equal to 50, whereas E3 has a detrimental effect. Figure 8b shows the IMP performance when transferring to the FD002 target dataset: both E5 and E6 are effective, with E5 achieving an IMP greater than 15% and E6 greater than 45% when the number of trajectories is less than or equal to 50. Figure 8c shows the IMP performance when transferring to the FD003 target dataset: E8 is effective, with an IMP greater than 17% when the number of trajectories is less than or equal to 70, whereas E9 has a detrimental effect. Figure 8d shows the IMP performance when transferring to the FD004 target dataset: both E11 and E12 are effective, with E11 achieving an IMP greater than 15% and E12 greater than 30% when the number of trajectories is less than or equal to 50.
From all the above results, we also found that the working conditions strongly affect transfer learning performance. How they do so is discussed in Section 4.

Discussion
In the prior section, we showed that transfer learning is effective, except for E3 and E9. Here, we discuss how working conditions affect the transfer performance and analyze the cases when transfer learning can bring negative effects.

How Working Conditions Affect Transfer Learning Performance
To further understand how working conditions affect transfer learning performance, we analyzed the IMP performances of transfer experiments grouped by different working conditions. Figure 9 shows the IMP performance of transfer learning with different working conditions, and Table 4 summarizes the experiments shown in Figure 9. In the panel captions of Figure 9, the symbol F denotes a fault condition, O denotes an operating condition, the number after the symbol identifies the type of working condition from Table 2, and A→B means transfer from working condition A to working condition B.
From Figure 9a,b, we found that when the fault conditions are considered, in both the transfer learning from single fault condition to multiple fault conditions (E8, E11) and the transfer learning from multiple fault conditions to single fault condition (E2, E5), the transfer learning scheme improved the prediction models of the target dataset.
From Figure 9c,d, we found that when the operating conditions are considered, transfer learning from a single operating condition to multiple operating conditions (E6, E12) improved the prediction performance of the model on the target dataset. However, transfer learning from a model trained on a multiple operating condition dataset to a single operating condition (E3, E9) had a detrimental effect on the prediction performance of the target model. From all the above results, we found that transfer learning across working conditions yields effective performance improvements, except when transferring from multiple operating conditions to a single operating condition (E3, E9).

Negative Transfer
Negative transfer occurs when the information learned from the source domain has a detrimental effect on the prediction model for the target domain [29]. The more similar the source data is to the target data, the lower the negative transfer effect. We already noted that transfer from multiple operating conditions to a single condition led to a detrimental effect. Here we try to understand this negative transfer by comparing the sensors' monitoring values and explaining why such transfer is harmful. Figure 10 compares three types of trends in the sensor monitoring values across the four datasets (FD001, FD002, FD003, FD004). Figure 10a-d show the sensor-2 values, representing sensors with an ascending trend. Figure 10e-h show the sensor-12 values, representing sensors with a descending trend. Figure 10i-l show the sensor-16 values, representing a trend that is unchanged in FD001 and FD003 but changes in FD002 and FD004. It is important to note that the operating setting values have a great influence on the sensor values and thus strongly influence transfer learning performance.
Comparing Figure 10a,e,i with Figure 10c,g,k, FD001 and FD003 have different fault conditions, yet the distributions of sensor values under these different fault conditions are similar. The same holds for FD002 (Figure 10b,f,j) and FD004 (Figure 10d,h,l). Because the sensor value distributions under different fault conditions are similar, transfer learning from single to multiple fault conditions (E8, E11) and from multiple to single fault conditions (E2, E5) has a positive effect on the prediction model of the target dataset. Comparing Figure 10a,e,i with Figure 10b,f,j, FD001 and FD002 have different operating conditions, and the distributions of sensor values under these different operating conditions differ greatly. The same holds for FD003 (Figure 10c,g,k) and FD004 (Figure 10d,h,l). Consistent with these large distribution differences, our experiments showed that transfer learning from multiple to single operating conditions (E3, E9) had a detrimental effect on prediction performance, while transfer learning from single to multiple operating conditions (E6, E12) had a positive effect on the prediction performance of the target model.
What causes these different effects of transfer learning across operating conditions? Two factors may be involved. First, the sensor monitoring data under multiple operating conditions are more complicated than under a single condition. Second, the distributions of sensor values under multiple operating conditions differ more strongly from one another than those under a single operating condition. Consequently, initializing the model with parameters transferred from the complicated multiple-condition task to the simple single-condition task causes overfitting, and the model is hard to fine-tune with the limited target data. This can be verified from the training loss histories. Figure 11 illustrates the training loss histories for experiments in which the training set is generated from 50 randomly selected trajectories. Comparing Figure 11a with Figure 11b, we found that E6 in Figure 11a fine-tunes well and achieves a validation mean score nearly 10 lower than the without-transfer experiment E4, whereas E3 overfits, is hard to fine-tune, and ends with a validation mean score nearly 20 higher than the without-transfer experiment E1. The same holds for E12 in Figure 11c versus E9 in Figure 11d. This may explain why transfer learning from multiple operating conditions to a single condition has a detrimental effect on prediction performance.
Figure 11. Training loss history for experiments in which the training data set is generated by randomly selecting 50 trajectories.

Conclusions
This paper presents a transfer learning algorithm with bi-directional LSTM neural networks for RUL prediction of turbofan engines. Our algorithm addresses one of the major challenges in RUL prediction: the difficulty of obtaining a sufficient number of samples for data-driven prognostics. Our transfer prognostic model works by exploiting data samples from different but approximately related tasks for remaining useful life estimation. The method was validated on the C-MAPSS datasets with extensive experiments comparing its performance with that of models trained without transfer learning. The experimental results showed that transfer learning is effective in most cases, except when transferring from a dataset with multiple operating conditions to a dataset with a single operating condition, which led to negative transfer. How to prevent negative transfer remains an open problem. In future work, we will develop more advanced transfer learning methods or data normalization schemes to improve multi-condition RUL prediction performance.