Cross-Domain Remaining Useful Life Prediction Based on Adversarial Training

Remaining useful life prediction can assess the time to failure of degrading systems. Numerous neural network-based prediction methods have been proposed by researchers. However, most of this work relies on an implicit assumption: the network training and testing data share the same operating conditions. To address this problem, an adversarial discriminative domain adaptation prediction method based on adversarial training is proposed to improve the accuracy of cross-domain prediction under different working conditions. First, an LSTM feature extraction network is constructed to mine deep feature representations from the source domain data and the target domain data. Subsequently, the parameters of the target domain feature extraction network are adjusted based on the idea of adversarial training to achieve domain-invariant feature mining. The proposed scheme is evaluated on a publicly available dataset and achieves state-of-the-art prediction performance compared to recent unsupervised domain adaptation prediction methods.


Introduction
Remaining useful life (RUL) prediction, an important research component of prognostics and health management (PHM), can estimate the time to failure of complex systems and reduce maintenance costs and risks [1][2][3]. In general, RUL prediction schemes can be divided into two types: physics-based methods and data-driven methods [4].
Physics-based approaches usually use mathematical modeling methods and classical physical models to describe the health state of mechanical systems [5]. Although these methods achieve accurate prediction results and have good interpretability, they require complex domain expertise, which limits their generalizability [6]. In contrast, data-driven methods, especially neural network methods, are able to directly learn the relationship between monitoring data and RUL labels through a rationally designed network structure. Numerous neural network prediction schemes have been proposed and applied, e.g., Convolutional Neural Networks (CNN) [7][8][9][10][11], Gated Recurrent Units (GRU) [12][13][14] and Long Short-Term Memory networks (LSTM) [15][16][17].
However, for data-driven methods, it is challenging (intolerably labor-expensive and time-consuming) to collect sufficient run-to-failure data, which may reduce the prediction performance of the prognostic method. Even when enough historical data are available, a model pre-trained under one specific working condition cannot be generalized to another, even a similar one [2]. To cope with these limitations, models trained under a single operating condition should be adapted to testing data under other operating conditions, i.e., domain adaptation, a sub-problem of transfer learning [18].
Among recent RUL prediction schemes, existing domain adaptation methods can be grouped into two main categories: moment matching methods and adversarial training methods [19,20]. The former align feature distributions across domains by minimizing the distribution discrepancy. Zhu et al. [1] proposed a transfer multilayer perceptron (MLP) with a Maximum Mean Discrepancy (MMD) loss [21]. The source and target domain training data are fed to the MLP simultaneously after feature extraction; by reducing the source domain regression loss and the MMD loss, domain-invariant features are obtained. Yu et al. [22] developed a domain adaptive CNN-LSTM (DACL) model with MMD to align the domain distributions and evaluated the model on the C-MAPSS dataset. The latter methods achieve cross-domain prediction feature extraction by adversarial training [23]. Da Costa et al. [18] proposed an RUL domain adaptation network named LSTM-DANN. Degradation pattern knowledge is first learned in the source domain, and then the source domain regression model is adjusted, via a gradient reversal layer, to adapt to the target domain feature distribution and achieve cross-domain RUL prediction [24].
In this work, an LSTM combined with a domain adaptive scheme based on the adversarial discriminative learning framework, LSTM-ADDA, is proposed. The proposed scheme searches the domain-invariant feature space in the training phase using an adversarial training approach [25] to adjust the network parameters of the LSTM extractor and achieve cross-domain RUL prediction. Compared with the latest adversarial method, LSTM-DANN, the proposed method does not need to repeatedly train the source domain regression model to adapt to different target domains. The extraction of cross-domain RUL prediction features is achieved by reversing the domain feature labels. The implementation is relatively simple and converges faster. To verify the effectiveness of the proposed method, we conduct experiments on the C-MAPSS datasets and compare the aircraft engine RUL prediction results with LSTM-DANN and other adaptation methods.
The contributions can be summarized as follows:
1. A novel unsupervised adversarial learning RUL prediction scheme is proposed, which can adapt to domain data with different fault modes and operating settings.
2. Our experimental results show that the proposed LSTM-ADDA method achieves competitive performance compared with other unsupervised domain adaptation methods.
In Section 2, LSTM and unsupervised domain adaptation methods are reviewed. In Section 3, the proposed LSTM-ADDA architecture and training process are introduced in detail. In Section 4, we briefly describe the dataset preprocessing method and list the hyperparameter selection for the experiments. In Section 5, the superiority of the proposed method is demonstrated through two sets of comparative experimental results. The conclusion and future research are drawn in Section 6.

Background
In this section, we review the problem definition and the generalized adversarial learning framework. Subsequently, the adversarial discriminative domain adaptation (ADDA) method and the detailed LSTM structure are introduced.

Problem Definition
In previous neural network-based prediction methods, training and testing data are usually acquired under the same operating conditions, as shown in Figure 1a. However, it is difficult to obtain data under identical operating conditions in actual data acquisition, which limits the generalizability of such methods. Suppose there are two datasets for remaining useful life prediction under different operating conditions: one with sufficient labeled training and testing data (the source domain dataset), and one with only unlabeled training and testing data (the target domain dataset). We define this application scenario as unsupervised domain adaptation [26], i.e., labeled source domain training data and unlabeled target domain training data are used to predict the RUL of the target testing data. This paper aims to adaptively predict on the target domain testing data by transferring the knowledge of the source domain regression model, as shown in Figure 1b. Suppose the source domain data are denoted as $D_S = \{(x_S^i, y_S^i)\}_{i=1}^{N_S}$, containing $N_S$ training samples, where $x_S^i$ is drawn from the source domain feature space $\chi_S$. Let $T_i$ denote the sequence length and $f_S$ the feature dimension, i.e., $x_S^i = \{x_t^i\}_{t=1}^{T_i} \in \mathbb{R}^{T_i \times f_S}$. Moreover, $y_S^i = \{y_t^i\}_{t=1}^{T_i} \in \mathbb{R}_{\geq 0}^{T_i \times 1}$ represents the RUL vector of $x_S^i$; $x_t^i \in \mathbb{R}^{1 \times f_S}$ and $y_t^i \in \mathbb{R}_{\geq 0}$ denote the observed feature vector and the RUL label at time $t$, respectively. Besides this, we assume a target domain $D_T = \{x_T^i\}_{i=1}^{N_T}$, where $x_T^i \in \chi_T$ and $x_T^i = \{x_t^i\}_{t=1}^{T_i} \in \mathbb{R}^{T_i \times f_T}$. Notably, no RUL labels are available in the target domain. We assume that the deep feature representations of the source and target domain mappings share similar or invariant components. In the model training stage, the source domain data and RUL values, $\{(x_S^i, y_S^i)\}_{i=1}^{N_S}$, are available, while the target domain provides only unlabeled monitoring data, $\{x_T^i\}_{i=1}^{N_T}$. The aim is to adjust the domain feature extraction network $g$ using both source and target training data so as to perform transfer regression on the target testing samples, i.e., $y_{true}^i \approx y_{predict}^i = g(x_T^i)$.

Generalized Adversarial Adaptation
Tzeng et al. [25] presented a general adversarial adaptation framework and pointed out that unsupervised domain adaptation can be achieved by reducing the source domain classification loss while alternately minimizing the domain discriminator loss and the distribution discrepancy between the source and target mapping distributions.
Adversarial adaptive methods aim to minimize the distance between the source and target mapping distributions by regularizing the learning of the source and target mappings, $M_S$ and $M_T$. Once the adversarial training is finished, domain-invariant feature representations are found, and the source classification network can directly receive the target domain feature representation for subsequent RUL prediction, i.e., $C = C_S = C_T$.
The supervised optimization objective of the source classification network is as follows:

$$\min_{M_S, C} \mathcal{L}_{cls}(x_S^i, y_S^i) = -\mathbb{E}_{(x_S, y_S)} \sum_{k=1}^{K} \mathbb{1}_{[k=y_S]} \log C(M_S(x_S))$$

In the adversarial adaptation training part, the domain discriminator $D$ is used to distinguish which domain the deep features come from. Its parameters are optimized using the standard classification loss $\mathcal{L}_{adv_D}(x_S^i, x_T^i, M_S, M_T)$:

$$\mathcal{L}_{adv_D} = -\mathbb{E}_{x_S}\left[\log D(M_S(x_S))\right] - \mathbb{E}_{x_T}\left[\log\left(1 - D(M_T(x_T))\right)\right]$$
Finally, the label of the target domain feature representation is reversed to adjust the weights of the target domain mapping structure, achieving source domain degradation knowledge transfer. The reversed-label training loss, $\mathcal{L}_{adv_M}(x_S^i, x_T^i, D)$, is denoted as follows:

$$\mathcal{L}_{adv_M} = -\mathbb{E}_{x_T}\left[\log D(M_T(x_T))\right]$$
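As a concrete illustration, the two adversarial objectives can be sketched in PyTorch with binary cross-entropy, treating the source label as '1' and the target label as '0'; the function and variable names here are illustrative and not taken from any reference implementation.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(d_src_logits, d_tgt_logits):
    # L_advD: train D to label source features '1' and target features '0'.
    ones = torch.ones_like(d_src_logits)
    zeros = torch.zeros_like(d_tgt_logits)
    return (F.binary_cross_entropy_with_logits(d_src_logits, ones)
            + F.binary_cross_entropy_with_logits(d_tgt_logits, zeros))

def mapping_loss(d_tgt_logits):
    # L_advM: inverted labels -- train the target mapping so that D
    # classifies target features as coming from the source ('1').
    ones = torch.ones_like(d_tgt_logits)
    return F.binary_cross_entropy_with_logits(d_tgt_logits, ones)
```

Minimizing `discriminator_loss` improves the discriminator, while minimizing `mapping_loss` updates only the target mapping, which is the alternating scheme described above.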

Adversarial Discriminative Domain Adaptation
Adversarial discriminative domain adaptation (ADDA), based on the general adversarial adaptation framework, selects a discriminative base model, untied weights and the standard GAN loss.
For the mapping optimization constraints, $M_T$ is initialized with the $M_S$ model architecture and the weights pre-trained on the source domain. In adversarial training, the $M_S$ model is fixed, and the source and target data are input into $M_S$ and $M_T$, respectively, to obtain feature representations for domain discrimination and for adjusting the $M_T$ weights. The asymmetric feature representations obtained by the unshared-weight mappings make the learning paradigm more flexible.
In adversarial adaptation training, the optimization process is split into two independent objectives, the domain discriminator and the target feature mapping, whose loss functions are minimized alternately. The domain discriminator loss $\mathcal{L}_{adv_D}$ remains unchanged, and the target mapping uses the standard loss function with inverted labels [23].
Similar to GAN training, the mapping features of the source domain are labeled '1' and those of the target domain '0', and both are input into the discriminator for domain discrimination. During adversarial training, the mapping feature label of the target is inverted to '1' and input to the discriminator, so that the target mapping is trained in an inverted-label manner [26].
The ADDA method divides the optimization process into two stages. First, $M_S$ and $C$ are trained using labeled source domain data. After that, $M_S$ is fixed, $M_T$ is initialized, and the unlabeled target domain data are used to alternately optimize $\mathcal{L}_{adv_D}$ and $\mathcal{L}_{adv_M}$ to adjust the model parameters of $M_T$.

Long Short-Term Memory Neural Network
The temporal feature extractor in our methodology is the LSTM network [27], a subclass of recurrent neural networks (RNNs). LSTM is one of the most widely used RNN variants because of its capability to capture long- and short-term dependencies while alleviating the vanishing and exploding gradient problems. As shown in Figure 2, the LSTM cell uses an input gate, a forget gate and an output gate to update the current cell state $c_t$ and output the hidden state vector $h_t$.
Figure 2. The detailed structure of the LSTM cell.
Mathematically, the computation of the LSTM cell can be described as follows:

$$g_t = \varphi(W_{gx} x_t + W_{gh} H_{t-1} + B_g)$$
$$i_t = \sigma(W_{ix} x_t + W_{ih} H_{t-1} + B_i)$$
$$f_t = \sigma(W_{fx} x_t + W_{fh} H_{t-1} + B_f)$$
$$o_t = \sigma(W_{ox} x_t + W_{oh} H_{t-1} + B_o)$$
$$c_t = f_t \otimes c_{t-1} + i_t \otimes g_t$$
$$h_t = o_t \otimes \varphi(c_t)$$

where $W_{gx}, W_{ix}, W_{fx}, W_{ox}$ are the weights between the input layer $x_t$ and the hidden layer at time $t$; $W_{gh}, W_{ih}, W_{fh}, W_{oh}$ are the hidden layer weights between time $t$ and $t-1$; $B_g, B_i, B_f, B_o$ are the biases of the input node, input gate, forget gate and output gate, respectively; $H_{t-1}$ is the hidden layer output at the previous time step; $g_t, i_t, f_t, o_t$ are the corresponding output values; $c_t$ represents the current LSTM cell state; $\otimes$ is the pointwise multiplication operator; and $\varphi, \sigma$ represent different activation functions.
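To make the gate equations concrete, a single LSTM cell step can be sketched in plain NumPy, with $\varphi$ as tanh and $\sigma$ as the logistic sigmoid; the dictionary keys for the weights and biases are our own naming convention.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x_t, h_prev, c_prev, W, B):
    # One step of the LSTM cell following the gate equations above.
    # W['gx'] etc. are input-to-hidden weights, W['gh'] etc. are
    # hidden-to-hidden weights, B['g'] etc. are biases (names assumed).
    g = np.tanh(W['gx'] @ x_t + W['gh'] @ h_prev + B['g'])   # candidate state
    i = sigmoid(W['ix'] @ x_t + W['ih'] @ h_prev + B['i'])   # input gate
    f = sigmoid(W['fx'] @ x_t + W['fh'] @ h_prev + B['f'])   # forget gate
    o = sigmoid(W['ox'] @ x_t + W['oh'] @ h_prev + B['o'])   # output gate
    c = f * c_prev + i * g                                   # new cell state
    h = o * np.tanh(c)                                       # new hidden state
    return h, c
```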

Methodology
In this section, we detail the architecture and training processes of the proposed method. Adversarial discriminative domain adaptation based on LSTM, referred to as LSTM-ADDA and depicted in Figure 3, can be decomposed into three modules: a feature extractor, a regressor and a domain discriminator.
The temporal feature extractor, a stack of LSTMs, establishes a mapping between the input data and the hidden state $h_t \in \mathbb{R}^h$. A dense layer then transforms the LSTM hidden vector into the deep feature representation $f_t \in \mathbb{R}^f$, which is used as the temporal feature space for the subsequent regression task. The parameters of the feature extractor are denoted as $\Theta_f$, i.e., $f = g_f(x_S^i; \Theta_f)$. Both the regressor and the domain discriminator consist of simple fully connected layers. We regard RUL prediction as a regression task and select the mean absolute error (MAE) as the loss function to minimize for the source domain regression. The parameters of the regression module are denoted as $\Theta_{\hat{y}}$, i.e., $\hat{y} = g_{\hat{y}}(g_f(x_S^i; \Theta_f); \Theta_{\hat{y}})$. The LSTM-ADDA optimization objective applied in the regression scenario is described as follows:

$$\min_{\Theta_f, \Theta_{\hat{y}}} \mathcal{L}_{reg} = \frac{1}{N_S} \sum_{i=1}^{N_S} \left| y_S^i - g_{\hat{y}}\left(g_f(x_S^i; \Theta_f); \Theta_{\hat{y}}\right) \right|$$

The training process of the proposed method can be summarized into three phases, as shown in Figures 3 and 4.
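The three modules can be sketched in PyTorch as follows; layer sizes and class names are illustrative assumptions rather than the exact hyperparameters, which are listed in Table 2.

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """LSTM mapping: a window of sensor readings -> deep feature f_t.
    Sizes are illustrative, not the paper's exact settings."""
    def __init__(self, n_features, hidden_size=32, feat_size=32):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, feat_size)

    def forward(self, x):                        # x: (batch, T_w, n_features)
        out, _ = self.lstm(x)
        return torch.relu(self.fc(out[:, -1]))   # last time step -> feature

class Regressor(nn.Module):
    """Fully connected head mapping a feature vector to a scalar RUL."""
    def __init__(self, feat_size=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_size, 16), nn.ReLU(),
                                 nn.Linear(16, 1))

    def forward(self, f):
        return self.net(f)

class Discriminator(nn.Module):
    """Domain discriminator producing a source-vs-target logit."""
    def __init__(self, feat_size=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_size, 16), nn.ReLU(),
                                 nn.Linear(16, 1))

    def forward(self, f):
        return self.net(f)
```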
In stage I, the source domain training data and training labels are input to a source domain regression network consisting of the source domain mapping structure and the regressor for standard supervised training, as shown by the blue arrows in Figure 3. After the source domain training is completed, the network parameters of the source domain mapping structure and the regressor are fixed and do not change during subsequent training. The purpose of stage I is to learn the deep feature representation and degradation patterns of the source domain data.
In stage II, the target domain mapping structure first replicates the structure and parameters of the source domain mapping, which serves as the training starting point for the target domain mapping. Then, the source domain training data and the unlabeled target domain training data are input to the source and target domain mapping structures, respectively, to obtain the corresponding deep feature representations. Subsequently, the feature representations of the two domains are input to the discriminator for adversarial training, which is achieved by inverting the domain labels. By alternately optimizing the discriminator and the inverted-label mapping loss, the parameters of the target domain mapping structure are adjusted to automatically adapt to the data distribution of the target domain, as shown by the orange arrows in Figure 3. Stage II aims to adjust the feature mapping of the target domain to obtain an adaptive deep feature representation for subsequent prediction.
In stage III, the trained target domain mapping structure is combined with the regressor to achieve cross-domain RUL prediction for the target domain testing data, as shown by the green arrows in Figure 3.
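The three stages above can be summarized in a compact sketch; the simple linear mappings, tensor sizes and loop lengths below are placeholders for the LSTM extractor and the real training schedule, not the paper's implementation.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal stand-ins for the mapping M, regressor C and discriminator D.
def make_mapping():
    return nn.Sequential(nn.Linear(24, 32), nn.ReLU())

C = nn.Linear(32, 1)
D = nn.Linear(32, 1)

x_src = torch.randn(64, 24); y_src = torch.rand(64, 1)   # labeled source
x_tgt = torch.randn(64, 24)                              # unlabeled target

# Stage I: supervised pre-training of M_S and C on the source domain (MAE).
M_S = make_mapping()
opt = torch.optim.Adam(list(M_S.parameters()) + list(C.parameters()), lr=1e-4)
for _ in range(5):
    opt.zero_grad()
    F.l1_loss(C(M_S(x_src)), y_src).backward()
    opt.step()

# Stage II: freeze M_S, initialize M_T from it, alternate D and M_T updates.
M_T = copy.deepcopy(M_S)
for p in M_S.parameters():
    p.requires_grad_(False)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-4)
opt_M = torch.optim.Adam(M_T.parameters(), lr=1e-4)
ones, zeros = torch.ones(64, 1), torch.zeros(64, 1)
for _ in range(5):
    # Update D: source features labeled '1', target features '0'.
    opt_D.zero_grad()
    loss_D = (F.binary_cross_entropy_with_logits(D(M_S(x_src)), ones)
              + F.binary_cross_entropy_with_logits(D(M_T(x_tgt).detach()), zeros))
    loss_D.backward()
    opt_D.step()
    # Update M_T with the inverted label: fool D into outputting '1'.
    opt_M.zero_grad()
    F.binary_cross_entropy_with_logits(D(M_T(x_tgt)), ones).backward()
    opt_M.step()

# Stage III: predict target RUL with the adapted mapping and fixed regressor.
rul_pred = C(M_T(x_tgt))
```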
To help researchers understand and reproduce the ADDA method, reference code can be found at the following GitHub link: https://github.com/acmllearner/pytorch-adda (accessed on 16 May 2022).

Experiment Pre-Settings
In this section, the dataset description, data preprocessing, evaluation metrics and hyperparameter selection are introduced.

Dataset Description
The Commercial Modular Aero-Propulsion System Simulation (C-MAPSS) dataset contains four subsets, as shown in Table 1. Each subset has different fault modes and operating conditions, and contains a training and a testing set. Each engine starts with various initial wear but is regarded as healthy at the beginning of the recording cycle. Every recording cycle is composed of three operational settings and 21 sensor readings. As the recording cycles increase, each engine gradually degrades until it loses its function [28]. The sliding time window method is adopted to obtain blocks of data with the same time-window length. Suppose the running time of an engine $x^i$ is $T_i$, i.e., $x^i = \{x_t^i\}_{t=1}^{T_i}$. The sequential input $x^i$ divided by the time window $T_w$ is represented as $\{(x_{t-T_w}^i, \ldots, x_{t-1}^i)\}_{t=T_w+1}^{T_i}$. If $T_i < T_w$, zero-padding is applied to fill $x^i$, i.e., $x^i = \{\underbrace{0, 0, \ldots, 0}_{T_w - T_i}, x_1^i, x_2^i, \ldots, x_{T_i}^i\}$, which ensures that each engine unit can generate at least two samples. We use the same time window length to preprocess the source and target domain training sets. To make full use of the data, in the adversarial training stage, the domain training dataset containing the smaller number of samples is oversampled.
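A minimal sketch of the sliding-window preprocessing, assuming windows are taken at every time step and zeros are prepended when a series is shorter than the window (the exact indexing convention may differ from the paper's):

```python
import numpy as np

def sliding_windows(x, t_w):
    """Split one engine's multivariate series x of shape (T_i, n_features)
    into overlapping windows of length t_w; zero-pad at the front if
    T_i < t_w so at least one full window exists."""
    t_i, n_feat = x.shape
    if t_i < t_w:
        x = np.vstack([np.zeros((t_w - t_i, n_feat)), x])
        t_i = t_w
    return np.stack([x[t - t_w:t] for t in range(t_w, t_i + 1)])
```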

Data Normalization
Min-max normalization is used to scale the subset data (including operating settings and sensor data) and the RUL values to the [0, 1] range:

$$\tilde{x}_t^{i,j} = \frac{x_t^{i,j} - x_{\min}^{j}}{x_{\max}^{j} - x_{\min}^{j}}$$

where $x_t^{i,j}$ denotes the original $j$th feature of the $i$th engine unit at record cycle $t$, and $x_{\min}^j$, $x_{\max}^j$ are the minimum and maximum values of the $j$th feature over the subset. To demonstrate the feature shift between different domains, we compute data distribution statistics on the last cycle of each engine unit in each subset of C-MAPSS and select two normalized sensor values for visualization. As shown in Figure 5, the same operating condition yields similar feature distributions across different fault modes. On the contrary, the feature distributions of the same fault mode under different operating conditions show a clear discrepancy.
As in previous work [29,30], the RUL can be treated as a constant value during normal operation. Therefore, a piecewise linear degradation model is used to describe the RUL curve of the observed engine. This paper selects $R_e = 125$ cycles as the constant RUL before failure and applies it to both the training and testing sets.
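The normalization and the piecewise linear RUL target can be sketched as follows (feature-wise min-max scaling and a cap of $R_e = 125$); the small epsilon guarding against constant features is our own addition.

```python
import numpy as np

def min_max_normalize(X):
    """Column-wise min-max scaling to [0, 1]; X has shape (n_cycles, n_features)."""
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    return (X - x_min) / (x_max - x_min + 1e-12)   # eps avoids divide-by-zero

def piecewise_rul(total_cycles, r_e=125):
    """Piecewise linear target: RUL is capped at r_e early in life and
    decreases linearly to 0 at failure."""
    rul = np.arange(total_cycles - 1, -1, -1)       # T-1, ..., 1, 0
    return np.minimum(rul, r_e)
```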

Performance Metrics
We use two metrics to measure the LSTM-ADDA prognostic performance. The root mean squared error (RMSE), shown in (7), evaluates the error between the predicted RUL and the actual RUL:

$$\text{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} d_i^2}, \quad d_i = \hat{y}_i - y_i \qquad (7)$$

We also use the score metric [28], shown in (8), to evaluate our proposed method:

$$\text{Score} = \sum_{i=1}^{N} \begin{cases} e^{-d_i/13} - 1, & d_i < 0 \\ e^{d_i/10} - 1, & d_i \geq 0 \end{cases} \qquad (8)$$

The score metric penalizes positive prediction errors more than negative prediction errors, because for maintenance policies it is more meaningful to predict failures as early as possible.
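The two metrics can be implemented directly from their standard definitions on the C-MAPSS benchmark:

```python
import numpy as np

def rmse(y_true, y_pred):
    d = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.sqrt(np.mean(d ** 2)))

def score(y_true, y_pred):
    """Asymmetric C-MAPSS score: late predictions (d > 0) are penalized
    more heavily than early ones (d < 0)."""
    d = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.sum(np.where(d < 0, np.exp(-d / 13) - 1, np.exp(d / 10) - 1)))
```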

Hyperparameter Selection
To obtain better adversarial learning performance, we minimize the regression loss in the source domain. The learning rate of the Adam optimizer is set to 0.0001 ($\lambda_S$). Besides this, in the adversarial learning process, we reduce the domain classification accuracy of the domain discriminator, i.e., we search for the domain-invariant representation. For a fair comparison with the adversarial learning method LSTM-DANN [18], we select the same network parameters as the LSTM-DANN method, as shown in Table 2.

Experimental Results
In this section, two comparative experiments are conducted to illustrate the prognostic performance of the proposed method. First, the comparison with non-adapted models proves that the proposed method can effectively improve engine RUL prediction under different operating conditions and fault modes. Second, the proposed method is compared with other unsupervised transfer learning methods: CORrelation ALignment (CORAL) [31], Transfer Component Analysis (TCA) [32] and LSTM-DANN [18].
In the experiments, each training subset of the C-MAPSS dataset is regarded as the source data, and the remaining testing data are regarded as the target data. Therefore, 12 different experiments were performed, and the mean and standard deviation of each experiment were recorded. All experiments are performed on the CPU of an 8th-generation Intel Core i7 processor with 24 GB RAM. We implement our method using Python 3.6 and PyTorch 1.4.0.

Non-Adapted Models under Domain Shift
To prove the effectiveness of the proposed adversarial adaptation method, we input the target testing data into the source domain model trained on the source domain training data as the non-adapted baseline (source-only). Moreover, we input the target testing data into a target model trained with target domain training data, which represents the ideal situation in which the target domain has sufficient labeled training data (target-only). We use each subset of C-MAPSS as the source domain and analyze the experimental results separately. The results of the cross-domain prediction experiments are shown in Figure 6.
Source FD001: FD001 has a single fault mode and operating condition, so it is difficult to transfer to target domains with complex operating conditions and fault modes, causing the RMSE and prediction scores of LSTM-ADDA to be higher than in the other experiment pairs. As shown in Figure 6a,c, the prediction curve of the proposed model is similar in shape to the source-only curve but closer to the target-only curve, indicating that the proposed method improves the adaptability of the non-adapted model. For the target domain FD003, which has a low distribution discrepancy with FD001, the prediction curve in Figure 6b is closer at the end stage of the engine.
Source FD002: Compared with FD001, FD002 has six operating conditions. When this more complex dataset serves as the source domain, the proposed model gives better prediction results, as shown in Figure 6d. When the target domains are FD003 and FD004, which contain two fault modes, the prediction for FD004, which has a smaller domain shift, is closer to the target-only curve in Figure 6e. Although the operating conditions and fault modes of FD003 differ from those of FD002, the proposed model still gives better prediction results than the non-adapted models.
Source FD003: Under the same operating conditions, FD003 with two fault modes achieved a lower RMSE when adversarially adapting to the target domain FD001 with one fault mode. Besides this, due to the low distribution discrepancy, Figure 6g shows that the prediction curve obtained by the proposed method fits the RUL curve of the engine well. For FD002 and FD004, which have six operating conditions, LSTM-ADDA also gives better prediction results than the non-adapted models.
Source FD004: FD004 contains six operating conditions and two fault modes, and is the most complex subset of the C-MAPSS dataset. When training with the remaining target domain datasets, lower RMSE and prediction scores are obtained, as shown in Table 3. In Figure 6j,l, we can observe that the source-only curve deviates substantially from the RUL curve. For FD002, which has a low domain shift, Figure 6k shows that the proposed method fits the RUL curve well. In summary, the percentage changes of the average RMSE and score between the proposed method and the non-adapted case (source-only) are listed in Table 3. All the prediction performances of the proposed LSTM-ADDA are better than those of the non-adapted models. Although source-only models can achieve reasonable performance when the domain shift is small, the proposed method can further improve the results. Moreover, when the operating conditions and fault modes of the source domain are complex and those of the target domain are simple, the LSTM-ADDA model obtains a better prediction effect. That is to say, it is easier to transfer from complex situations to simple ones than to learn from simple situations to complex ones.

Domain Adaptation Methods Comparison
In this section, two different unsupervised domain adaptation methods, Transfer Component Analysis (TCA) [32] and CORrelation ALignment (CORAL) [33], are compared with the proposed method. Besides this, the average RMSE and Score are compared with those of the adversarial method LSTM-DANN [18].
In TCA, the distributions of the two domains are mapped together into a high-dimensional Reproducing Kernel Hilbert Space (RKHS). In this space, the MMD between the source and target distributions is minimized while their respective internal attributes are retained.
Different from TCA, CORAL aligns the second-order statistics of the source and target distributions by constructing a differentiable loss function that minimizes the difference between the source and target covariances, the CORAL loss [31].
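As a sketch, the CORAL loss between two feature matrices can be written as the squared Frobenius distance of their covariance matrices, scaled by $1/(4d^2)$ as in the original formulation:

```python
import numpy as np

def coral_loss(Xs, Xt):
    """CORAL loss: squared Frobenius distance between the source and
    target feature covariances, scaled by 1/(4 d^2)."""
    d = Xs.shape[1]

    def cov(X):
        n = X.shape[0]
        Xc = X - X.mean(axis=0, keepdims=True)   # center the features
        return Xc.T @ Xc / (n - 1)

    diff = cov(Xs) - cov(Xt)
    return float(np.sum(diff ** 2) / (4 * d ** 2))
```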
In the LSTM-DANN method, the source domain shares the network weights with the target domain. When predicting a new target domain, the source domain regression model must be retrained to extract the degradation knowledge of the source domain. Unlike LSTM-DANN, the proposed scheme does not need to train the source domain regression model repeatedly; the extraction of cross-domain features is achieved directly by the adversarial training in stage II. Moreover, a significant difference between the proposed scheme and DANN lies in the cross-domain degradation feature training: LSTM-DANN uses a gradient reversal layer, while LSTM-ADDA uses reversed feature domain labels. The proposed scheme is simpler to implement and converges faster.
The comparison of the evaluation metrics is shown in Table 4. We can observe that the mean RMSE is close to that of LSTM-DANN and overall better than those of the TCA and CORAL models. In addition, the score metric improves by nearly 45%, indicating that the proposed method achieves more timely predictions while maintaining a small RMSE, which is more in line with the need for early prediction in practical maintenance.

Conclusions and Future Work
In this paper, a novel domain adaptation RUL prediction scheme is proposed by combining the LSTM network with the adversarial discriminative domain adaptation framework (LSTM-ADDA). The prediction performance of the proposed model is verified on the C-MAPSS dataset. The LSTM feature extractor network parameters are adjusted using the target domain training data to find the domain-invariant feature representation.
The proposed method uses the source domain data to obtain source domain degradation knowledge. After that, the unlabeled target domain data are used for adversarial learning. Finally, the target feature extractor adjusted by adversarial training is combined with the source domain regression module to implement adaptive RUL prediction on the target domain.
To illustrate the effectiveness of the LSTM-ADDA method, we perform two comparative experiments. The experimental results show that the prediction performance of the proposed method is better than that of the other methods. In future work, we will try to remove the target domain data from the training process and use a generalized model combined with multi-source data to directly estimate the RUL of the target domain.

Figure 3 .
Figure 3. The network structure of the proposed LSTM-ADDA method. The data mapping structure consists mainly of the LSTM network, and the output of the last time step is input to a dense layer to obtain the deep domain feature representation. The regressor and the domain discriminator consist of fully connected networks for RUL prediction and domain adversarial training, respectively. The three phases are source pre-training, adversarial adaptation and target testing.

Figure 5 .
Figure 5. Visualization of data distribution discrepancy between C-MAPSS subsets.
Figure 6. Cross-domain RUL prediction results; panels (a)-(l) show each target subset (FD001, FD002, FD003 and FD004) for every source domain.

Table 2 .
Hyperparameter settings for each group of cross-domain prediction.

Table 3 .
Score and RMSE comparison between Source-Only, Target-Only and LSTM-ADDA. (∆% denotes the percentage of evaluation metric improvement; ± denotes the standard deviation.)

Table 4 .
Score and RMSE comparison of unsupervised domain adaptation methods. (∆% denotes the percentage of evaluation metric improvement; ± denotes the standard deviation.)