Remaining Useful Life Prediction Using Dual-Channel LSTM with Time Feature and Its Difference

At present, research on predicting the remaining useful life (RUL) of machinery mainly focuses on extracting features from multiple sensors and then using those features to predict the RUL. Under complex operations and multiple abnormal environments, the impact of noise may increase model complexity and decrease the accuracy of RUL predictions. At the same time, how to exploit the temporal characteristics of the sensors is also a problem. To overcome these issues, this paper proposes a dual-channel long short-term memory (LSTM) neural network model. Compared with existing methods, the advantage of this method is that it adaptively selects the time features, applies first-order differencing to the time feature values, and uses LSTMs to extract the information in both the time features and their first-order differences. As the RUL curve predicted by the neural network is zigzag, we designed a momentum-smoothing module to smooth the predicted RUL curve and improve the prediction accuracy. Experimental verification on the Commercial Modular Aero-Propulsion System Simulation (C-MAPSS) dataset demonstrates the effectiveness and stability of the proposed method.


Introduction
Since entering the age of industrialization, machinery and equipment can be seen everywhere. However, owing to harsh environments and improper human operation, mechanical failures emerge one after another. Mechanical failure has thus become a threat hindering social development, and ordinary people may be affected by it at any time. Therefore, accurately evaluating the RUL of a machine before a failure occurs is of great significance.
The current prediction methods for RUL can be divided into four categories: physical models, statistical models, artificial intelligence models, and hybrid methods [1].
Technology based on physical models describes the degradation process of machinery by establishing a mathematical model through failure mechanisms or damage first principles [2]. Chan and Enright et al. [3] proposed a time-dependent physical crack-propagation method for predicting the RUL of a turbo propulsion system. Using an enhanced risk-analysis tool and material constants calibrated to IN 718 data, the effect of time-dependent crack growth on the risk of fracture in a turbo engine component was demonstrated for a generic rotor design and a realistic mission profile. El-Tawil et al. [4] introduced an analytic prognostic methodology based on nonlinear damage laws, which enables high availability and productivity at lower cost for industrial systems. Kacprzynski et al. [4] developed a gear health prediction model using the physics-of-failure method. The statistical-model approach estimates the RUL of machinery by building a statistical model from empirical knowledge [5]. Barraza-Barraza and Tercero-Gómez et al. [6] constructed three autoregressive models with exogenous variables (ARX) to predict the RUL of aluminum plates and assessed their capability to estimate the remaining useful life.
Despite this progress, several problems remain in existing RUL prediction research:
(1) The influence of time features on the life prediction of machinery may change in different operating environments. Given this phenomenon, how to choose useful time features so as to avoid redundant or invalid features remains an open problem.
(2) For time features, researchers frequently build models that focus on the size of the time features at a certain moment and ignore the difference between the time features at two different moments. In fact, the change speed of the features reflects the internal state of the machine and hence its health state.
(3) Generally, the RUL of a machine is smooth and stable; that is, the remaining life of the machine within a certain period should be similar, with relatively rare fluctuations. However, in harsh environments, the data the sensors send back to the system are commonly unclean. A neural network learns and predicts from the data, so the RUL it predicts fluctuates up and down: the RUL curve is jagged and deviates considerably from the ground RUL.
To solve these problems, this paper proposes a dual-channel LSTM method. The proposed method constructs a direct relationship between the raw data and the ground RUL without using any prior information and improves the learning ability of the neural network by using the time feature values and the feature differences to predict the RUL. First, some features hardly change during the entire time period and carry very little information; if all features are fed directly into the neural network, training takes longer. Therefore, we adopt a method of adaptively selecting features to eliminate those that do not change over the life cycle and thus solve the problem of data redundancy. Second, we extract the first-order differences of the features as an input of the model to reflect the change speed of the features and improve the accuracy of RUL prediction. Third, in this neural network, an LSTM is used to calculate the spatial relationship among the different features at each time, and a convolutional neural network (CNN) is used to model the relationship along the time dimension and fuse the features of multiple time steps into one dimension. To address the jagged shape of the RUL curve predicted by the neural network, a momentum-smoothing method is proposed to post-process the RUL curves and improve the prediction accuracy.
The rest of the paper is organized as follows. Section 2 introduces the structure of the dual-channel LSTM. Section 3 discusses the degradation data of aircraft turbofan engines and presents comparative experiments between the dual-channel LSTM network and other state-of-the-art methods. Section 4 gives conclusions and future work.

Methodology
In this part, a new prediction structure for RUL estimation, called dual-channel LSTM, is proposed. The architecture of the dual-channel LSTM network is presented in Figure 1. The whole algorithm can be divided into four parts: data preprocessing, dual-channel LSTM, RUL prediction, and momentum smoothing. Figure 1. Dual-channel LSTM network architecture.

Data Preprocessing
The sample data we employ are the C-MAPSS datasets from NASA. The C-MAPSS dataset is composed of four diverse sub-datasets, FD001, FD002, FD003, and FD004, which have the following advantages and disadvantages. First, the sample size is sufficient: each sub-dataset contains a training set and a test set, and the number of samples is acceptable. Second, there is a wide variety of samples: the training data cover four sub-datasets whose data distributions are distinct, which is suitable for testing the generalization of the model. However, the samples are generated by simulating the real environment, so there is a certain deviation from real turbine engine samples. At the same time, the sub-datasets contain a lot of noise, which is not conducive to the feature generalization of the model. Therefore, we must first preprocess the data, select the features useful for model prediction, and standardize the data to speed up the convergence of the model. The following is divided into three aspects: feature selection, standardization, and time-window processing.

Feature Selection
The following uses engine unit No. 1 in the FD001 sub-dataset as an example to visualize its operational settings and some sensor measurements.
In Figure 2, the abscissa is time and the ordinate covers the operating settings of the engine and some sensor readings. Figure 2 shows that some time features remain unchanged throughout the entire period and therefore contribute nothing to the prediction of the RUL. It is necessary to choose the features that have a positive influence on the model prediction. Therefore, we use prognosability to measure the variability of the condition indicators and select the time features that change significantly. The prognosability formula is as follows:
prognosability = exp( −std_j( x_j(N_j) ) / mean_j( |x_j(1) − x_j(N_j)| ) ), j = 1, 2, …, M,
where x_j represents the measurement vector of a certain feature on the j-th system, the variable M is the number of systems to be monitored, x_j(1) is the measurement at the first moment of the j-th system, and x_j(N_j) is the measurement at failure. The values of prognosability range from 0 to 1. Features whose prognosability equals 0 or NaN have not changed over the whole prediction cycle and are removed. The features retained for each sub-dataset are as follows:
FD001: op_setting_1, op_setting_2, sensor_2, sensor_3, sensor_4, sensor_6, sensor_7, sensor_8, sensor_9, sensor_11, sensor_12, sensor_13, sensor_14, sensor_15, sensor_17, sensor_20, sensor_21.
FD003: op_setting_1, op_setting_2, sensor_2, sensor_3, sensor_4, sensor_6, sensor_7, sensor_8, sensor_9, sensor_10, sensor_11, sensor_12, sensor_13, sensor_14, sensor_15, sensor_17, sensor_20, sensor_21.
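As a concrete illustration, the prognosability measure above can be sketched in a few lines of numpy. This is a hedged reimplementation following the formula as reconstructed here (the MATLAB-style definition); the toy trajectories are made up for the example.

```python
import numpy as np

def prognosability(trajectories):
    """Prognosability of one feature measured on M run-to-failure systems.

    trajectories: list of 1-D arrays; trajectories[j] is the feature's
    measurement vector on the j-th system (lengths may differ).
    Returns a value in [0, 1]; NaN when the feature never changes.
    """
    finals = np.array([traj[-1] for traj in trajectories])   # x_j(N_j)
    firsts = np.array([traj[0] for traj in trajectories])    # x_j(1)
    denom = np.mean(np.abs(firsts - finals))
    if denom == 0.0:
        return np.nan        # constant feature -> removed during selection
    return float(np.exp(-np.std(finals) / denom))

# A feature that degrades consistently across units scores near 1;
# a constant feature yields NaN and would be discarded.
units = [np.linspace(10.0, 2.0, n) for n in (100, 120, 140)]
print(prognosability(units))
```

Features scoring 0 or NaN are then dropped, which reproduces the selection step described above.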

Normalization
Standardization scales the data into a small, specific interval, which speeds up model convergence. The standardization method used in this article is the z-score. The formula is as follows:
x′ = (x − u) / σ,
where u represents the mean value of a selected feature, σ represents its standard deviation, and x represents the value of the chosen feature.
Considering that the mechanical state can be divided into a health state and a degraded state, the real RUL should be a piecewise linear function [24]. As shown in Figure 3, the ground RUL can be segmented into two parts: a constant portion and a linearly degraded portion. Therefore, an RUL threshold needs to be set.
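The two preprocessing steps above can be sketched as follows. The threshold value of 125 is an assumption for illustration only; the paper states that a threshold is needed but this section does not fix its value.

```python
import numpy as np

def zscore(x, mean=None, std=None):
    """z-score each feature column; reuse training-set statistics for the test set."""
    mean = x.mean(axis=0) if mean is None else mean
    std = x.std(axis=0) if std is None else std
    return (x - mean) / std, mean, std

def piecewise_rul(total_cycles, threshold=125):
    """Ground RUL target: constant at `threshold` while healthy,
    then linear decay to 0 at failure (piecewise linear function)."""
    rul = np.arange(total_cycles - 1, -1, -1)   # linear RUL: N-1, ..., 0
    return np.minimum(rul, threshold)

target = piecewise_rul(200, threshold=125)
print(target[:3], target[-3:])   # capped early in life, linear near failure
```

Reusing the training-set mean and standard deviation on the test set keeps the two sets on the same scale, which is the usual practice with the z-score.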

Time-Window Processing
The change speed of the features reflects the internal state of the machine and thus its health state. Therefore, we use the first-order differences of the features as the change speed of the features to improve the accuracy of RUL prediction. Higher-order differences could also be considered for predicting the RUL, but they would increase the number of model parameters. Only the feature values and the first-order feature differences are therefore considered. Two time windows need to be divided for the network input. The specific operations are as follows. First, divide the first time window. As shown in Figure 4, the window width is denoted as N_f, the window length as N_t, and the sliding stride as s (s = 1); the feature value window is Input_1 = [x_1, x_2, …, x_{N_t}].
The input size of Input_1 is N_t × N_f, where N_f represents the number of features chosen by the prognosability formula. After the first time window is obtained, the time feature differences can be computed; the feature difference window is Input_2 = [x_2 − x_1, x_3 − x_2, …, x_{N_t} − x_{N_t−1}]. The input size of Input_2 is (N_t − 1) × N_f, and each pair of inputs has its corresponding RUL value, which is used for supervised training.
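The windowing above can be sketched in numpy. This is a minimal illustration of the stated shapes, assuming stride s = 1 by default; the random array stands in for a run-to-failure sequence of selected features.

```python
import numpy as np

def make_windows(x, n_t, stride=1):
    """Slice a run-to-failure sequence x of shape (T, N_f) into
    Input1 windows of shape (n_t, N_f) and Input2 first-difference
    windows of shape (n_t - 1, N_f), sliding with the given stride."""
    inputs1, inputs2 = [], []
    for start in range(0, len(x) - n_t + 1, stride):
        win = x[start:start + n_t]            # Input1: raw feature values
        inputs1.append(win)
        inputs2.append(np.diff(win, axis=0))  # Input2: first-order differences
    return np.stack(inputs1), np.stack(inputs2)

x = np.random.randn(100, 17)      # 100 cycles, 17 selected features (FD001)
inp1, inp2 = make_windows(x, n_t=30)
print(inp1.shape, inp2.shape)     # (71, 30, 17) (71, 29, 17)
```

Each Input1/Input2 pair is then labeled with the RUL at the window's last cycle for supervised training.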

Dual-Channel LSTM Predicts Lifetime
When we use recurrent neural networks, in most cases we deal with features at different times and ignore another dimension, the feature difference. The feature difference between unequal times lets the network know the change rate of the feature values, helps the method consider the feature information more comprehensively, reduces the influence of noise on the model, makes the model more robust, and enhances its discriminative ability. Therefore, we use a dual-channel LSTM to process the feature values at different times and the feature differences separately and to calculate the spatial relationship among the different features at each time. This is described in detail below.
As shown in Figure 1 above, after the time-window processing, we obtain two inputs, Input_1 and Input_2. Input_1 contains the value of each time feature within a window; Input_2 contains the differences of the time features within the same window. Two LSTM networks then process Input_1 and Input_2. The output of the first recurrent neural network applied to Input_1 is Output_1; the output of the second recurrent neural network applied to Input_2 is Output_2 = [g_1, g_2, …, g_{hidden_size}].
The first row of Output_1 can be obtained from the second row of Output_1 and the first row of Output_2, so the first row of Output_1 is not considered. Finally, we add the last N_t − 1 rows of Output_1 and Output_2 element-wise to obtain Output.
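The fusion step above can be sketched at the shape level. This is not the trained model: the random arrays stand in for the per-time-step hidden states of the two LSTM channels, and hidden_size = 64 is an illustrative value.

```python
import numpy as np

n_t, hidden_size = 30, 64
output1 = np.random.randn(n_t, hidden_size)       # channel 1: LSTM over Input1 (N_t steps)
output2 = np.random.randn(n_t - 1, hidden_size)   # channel 2: LSTM over Input2 (N_t - 1 steps)

# The first row of Output1 is discarded; the remaining N_t - 1 rows are
# added element-wise to Output2 to form the fused representation.
output = output1[1:] + output2
print(output.shape)   # (29, 64)
```

The fused (N_t − 1) × hidden_size matrix is what the CNN stage then collapses along the time dimension.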

Life Prediction
RUL prediction can be divided into two parts. One part is to use CNN to consider the relationship in the time dimension and fuse the features of multiple time moments into one dimension after dual-channel LSTM processing, and another part is to use a fully connected neural network to predict RUL. The convolution part mainly includes convolution operation, batch normalization [25,26], and activation function. The main features of the convolutional layer are sparse interaction, parameter sharing, and equivariant representation. Among them, sparse interaction can play a regularization [27] role; parameter sharing reduces the number of parameters of the model and significantly increases the size of the network without increasing the training data.
As shown in Figure 1, the filter sizes of the three convolutional layers are selected as 32 × 1, 32 × 1, and 1 × 1. Each convolutional layer uses batch normalization to improve the performance and stability of the neural network, and the activation function is the ReLU function [28], which helps avoid vanishing gradients. The first two convolutional layers mainly extract features, and the last one reduces dimensionality, which suppresses overfitting. In the fully connected part, the neural units use the previously extracted features to predict the RUL. The activation function of the first fully connected layer is also ReLU, and dropout [29,30] is applied to avoid overfitting. The last fully connected layer outputs the predicted RUL value.

Smooth Calibration
Owing to the influence of many factors, such as noise, the collected data contain a certain deviation. The neural network predicts the life from these data, so the predicted remaining-life curve fluctuates up and down throughout the cycle. In practice, the RUL of a machine should be stable: even under the interference of some factors, the life curve fluctuates only in a few local places. Because of this phenomenon, there should be a buffering relationship between the RUL at past moments and the RUL at the current moment. Inspired by momentum gradient descent, this article proposes a momentum-smoothing method for the RUL of the test set. The formula is as follows:
predict_t = k·y_t + (1 − k)·predict_{t−1},
where y_t is the value predicted at time t by the dual-channel LSTM, predict_{t−1} is the smoothed prediction at the previous time, predict_t is the smoothed value at the current time t, and k represents the proportion of y_t in predict_t. The larger k is, the weaker the buffering effect of the previous RUL on the current moment; the opposite holds as k becomes smaller.
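The recursion above amounts to exponential smoothing of the predicted sequence. A minimal sketch follows; the initialization predict_0 = y_0 is an assumption, as the paper does not state the initial condition.

```python
import numpy as np

def momentum_smooth(y, k=0.3):
    """Momentum smoothing of a predicted RUL sequence y:
    predict_t = k * y_t + (1 - k) * predict_{t-1}, with predict_0 = y_0."""
    y = np.asarray(y, dtype=float)
    out = np.empty_like(y)
    out[0] = y[0]
    for t in range(1, len(y)):
        out[t] = k * y[t] + (1.0 - k) * out[t - 1]
    return out

raw = np.array([100, 104, 96, 101, 95], dtype=float)  # zigzag network output
print(momentum_smooth(raw, k=0.3))
```

Smaller k gives the history more weight and a smoother curve, matching the trade-off discussed in the momentum-smoothing experiment later.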

Regularization
If the distribution of the acquired data deviates from the distribution of the actual data, the neural network model may overfit during training, leading to high accuracy on the training dataset but poor performance on the test dataset. To avoid this phenomenon, we can use regularization [31,32] methods. The regularization methods used in this model are dropout, L2 regularization, and early stopping [33] with a held-out validation set. Dropout deactivates each neuron with a certain probability during forward propagation, which makes the model generalize better because it does not rely too heavily on particular local features.
L2 regularization is also called weight decay; it biases the solution of the model towards weights with smaller norms and limits the model space by limiting the size of the weight norms, thereby avoiding overfitting to a certain extent. The formula is as follows:
C = C_0 + (λ / 2n)·Σ_w w²,
where C_0 represents the loss function of the model (in the proposed method, C_0 is the MSE), w represents a weight parameter of the model, n represents the number of parameters, λ is the regularization coefficient, and C represents the loss function after adding L2 regularization.
In the process of training the model, the loss function keeps getting smaller and the parameters keep approaching the optimal solution, but it is possible that, at a certain gradient contour, the model has already reached the optimal solution within a fixed spatial range. If we continue training at this point, the model may linger near the optimal solution or even overfit. Considering this phenomenon, we divide the training set into a training set and a validation set with a ratio of p1:p2. When the loss on the validation set has not decreased for p training cycles, training is halted.
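The early-stopping rule described above can be sketched as a patience loop. The callbacks, the patience value of 20, and the toy loss curve are all illustrative assumptions, not values from the paper.

```python
def train_with_early_stopping(train_step, val_loss_fn, max_epochs=180, patience=20):
    """Stop when the validation loss has not improved for `patience` epochs.
    `train_step(epoch)` runs one training epoch; `val_loss_fn()` returns
    the current validation loss."""
    best, best_epoch = float("inf"), 0
    for epoch in range(max_epochs):
        train_step(epoch)
        loss = val_loss_fn()
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break                 # no improvement for `patience` epochs
    return best, epoch

# Toy run: a loss that plateaus after epoch 10 triggers the stop well before 180.
losses = [max(1.0 - 0.1 * e, 0.05) for e in range(180)]
best, stopped_at = train_with_early_stopping(lambda e: None,
                                             iter(losses).__next__,
                                             patience=20)
print(best, stopped_at)
```

This mirrors the p-value experiment later in the paper: a larger patience tolerates the randomness of mini-batch gradient descent at the cost of extra training time.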

Dual-Channel LSTM Model Construction Process
The flow chart of the dual-channel LSTM model construction is described step by step below.
The predictions are evaluated with the root mean square error (RMSE) and the score function:
d_i = predict_i − RUL_i,
RMSE = sqrt( (1/N)·Σ_{i=1}^{N} d_i² ),
score = Σ_{i=1}^{N} s_i, with s_i = e^(−d_i/13) − 1 for d_i < 0 and s_i = e^(d_i/10) − 1 for d_i ≥ 0.
Among them, predict_i represents the predicted value, RUL_i represents the ground RUL, and N represents the number of all sample data. Figure 5 visualizes the evaluation index functions: the score evaluation index is exponential in d_i. Compared with RMSE, the penalty for high deviation is raised and the penalty for low deviation is reduced in the score function.
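The two evaluation indices can be sketched directly in numpy. This uses the standard asymmetric C-MAPSS scoring constants (13 for early, 10 for late predictions); the toy arrays are made up for the example.

```python
import numpy as np

def rmse(pred, true):
    """Root mean square error between predicted and ground RUL."""
    return float(np.sqrt(np.mean((pred - true) ** 2)))

def score(pred, true):
    """Asymmetric scoring function over d_i = predict_i - RUL_i:
    late predictions (d_i >= 0) are penalized more heavily than early ones."""
    d = pred - true
    return float(np.sum(np.where(d < 0,
                                 np.exp(-d / 13.0) - 1.0,
                                 np.exp(d / 10.0) - 1.0)))

pred = np.array([48.0, 52.0, 110.0])
true = np.array([50.0, 50.0, 100.0])
print(rmse(pred, true), score(pred, true))
```

Note the asymmetry: overestimating the RUL by 10 cycles costs more than underestimating it by 10, because a late maintenance decision is riskier.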

LSTMs perform parallel operations with difficulty, so training an LSTM takes more time. To speed up training as much as possible, the number of LSTM layers is set to 1. At the same time, to ensure that the LSTM effectively extracts the information in the sequence data, the hidden-layer dimension hidden_size of the LSTM is tuned according to the prediction accuracy of the model on the dataset. The batch size and learning rate are adjusted in the same way. Figure 6 depicts the flowchart of the dual-channel LSTM. Figure 6. Flowchart of dual-channel LSTM.

Experiment and Result Analysis
In order to verify the effectiveness of the dual-channel LSTM method, it is applied to predict the RUL of a turbine engine.

Description of C-MAPSS
C-MAPSS simulates a 90,000-pound-thrust engine model [34]. The built-in control system includes a fan speed controller and a set of regulators and limiters. Figure 7 shows the main components of the engine model. Table 1 lists the 21 sensors that monitor engine conditions.
Table 1. Sensors monitoring engine conditions.
No.  Description                        Units
1    Total temperature at fan inlet     °R
2    Total temperature at LPC outlet    °R
3    Total temperature at HPC outlet    °R
4    Total temperature at LPT outlet    °R
5    Pressure at fan inlet              psia
6    Total pressure in bypass-duct      psia
7    Total pressure at HPC outlet       psia
8    Physical fan speed                 rpm
9    Physical core speed                rpm
10   Engine pressure ratio (P50/P2)     -
11   Static pressure at HPC outlet      psia
12   Ratio of fuel flow to Ps30         pps/psi
13   Corrected fan speed                rpm
14   Corrected core speed               rpm
15   Bypass ratio                       -
16   Burner fuel-air ratio              -
17   Bleed enthalpy                     -
18   Demanded fan speed                 rpm
19   Demanded corrected fan speed       rpm
20   HPT coolant bleed                  lbm/s
21   LPT coolant bleed                  lbm/s
This dataset is simulated using the Commercial Modular Aero-Propulsion System Simulation. Details about the dataset are as follows.
From Table 2, we can see that the C-MAPSS dataset is composed of four diverse sub-datasets: FD001, FD002, FD003, and FD004. The number of engine units in each sub-dataset is different, as are the numbers of failure modes and operating settings. Each sub-dataset is divided into a training dataset and a test dataset; at each moment, the three operational settings, the readings of the 21 sensors, and the true RUL of the turbofan engine unit from a healthy state to a degraded state are recorded. The FD001 and FD002 sub-datasets contain one failure mode (HPC degradation), while FD003 and FD004 contain two degradation modes (HPC degradation and fan degradation). There is only one operating condition for FD001 and FD003, and there are six for FD002 and FD004. Because of the complex and changeable operating environment of the FD002 and FD004 engine units, estimating the RUL on these two sub-datasets is more difficult.
Table 3 shows the detailed parameters of the proposed dual-channel LSTM. The length of the test set is determined by the size of the time window and the window moving step [12]. The model parameters are initialized with a fixed random seed and then determined by training the model and tuning, yielding the values in Table 3. There are 16,324 parameters in the model, with a total size of 28.92 M. The experiment is run on a PC with an Intel Core i5-5200U CPU, 12 GB RAM, and an NVIDIA GeForce 910M GPU (PyTorch 1.2.0). After training the model, we take the FD001 sub-dataset as an example to test the method. The following shows the RUL curves of engine units 24, 34, 76, and 100 in the FD001 test set.

Life Prediction on C-MAPSS Datasets
It can be seen from Figure 8 that the results predicted by the model are constant in the early stage. In the later stage, the predicted outcome shows a linear decline, which is in line with the overall trend of the ground RUL curve.

Compared with State-of-the-Art Methods
In this part, we compare the proposed dual-channel LSTM method with several state-of-the-art methods.
As shown in Table 4, our proposed dual-channel LSTM method outperforms most methods in terms of RMSE and score. Compared with the other methods, the dual-channel LSTM adds a time-feature-difference input channel, which extracts key information from the change speed of the time features and reduces the error in RUL prediction. On the sub-datasets FD002 and FD004, which have multiple faults and complex operating settings, the dual-channel LSTM reduces the RMSE values to 17.63 and 17.41, respectively, and the scores to 1773.47 and 2617.45, respectively. This shows that the proposed method still performs well when the environment deteriorates. The main reason is that the dual-channel recurrent neural network considers not only the values of the time features but also their differences between different times, which reduces the impact of the environment on the model. Generally speaking, the dual-channel LSTM method has clear advantages over the other methods in the table.


Model Analysis
In this section, we use the test set of the sub-dataset FD001 as an example to analyze the model.

Mini-Batch Size
In the training process, the training set is divided into small batches. Table 5 shows the effect of the mini-batch size on the prediction results of the model. In Table 5, when the mini-batch size equals 64, the RMSE and score on the FD001 test set are the lowest. Therefore, the mini-batch size in this method is set to 64. The results are depicted in Figure 9.

Figure 9. The effect of mini-batch size.


Momentum Smoothing
This experiment studies the influence of momentum smoothing on the predicted results under the same network structure. Table 6 describes the results of the experiment. When k = 0.3, the RMSE of the test set of FD001 dropped from 11.79 to 11.34, and the score dropped from 280.2 to 267.1. It shows that the use of momentum smoothing can predict the RUL more accurately. At the same time, as k decreases, RMSE and score first decrease and then increase. This shows that when k is greater than or equal to 0.5, reducing the value of k (that is, increasing the buffering effect of the RUL of the previous period on the current) can lower the RMSE and score. However, when the value of k is small, continuing to reduce the value of k makes the proportion of the current RUL predicted by the model too low, resulting in the RUL being determined by the previous RUL after smoothing. So, the score rose sharply. The results are visualized in Figure 10.
To see the effect of momentum smoothing, take engine unit 100 of the test set in sub-dataset FD001 as an example; the results after using momentum smoothing are visualized in Figure 11. Because the momentum-smoothing module lets the historically predicted RUL affect the currently predicted RUL, the randomness of the predictions is reduced, so the residual curve is relatively smooth.
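The paper does not spell out the exact update rule, but a smoothing form consistent with the description (current prediction weighted by k, previous smoothed RUL by 1 − k) can be sketched as:

```python
def momentum_smooth(preds, k=0.3):
    """Smooth a zigzag RUL prediction sequence.

    Each smoothed value mixes the current network prediction with the
    previous smoothed value: s_t = k * p_t + (1 - k) * s_{t-1}.
    A smaller k gives the historical RUL a stronger buffering effect;
    this exact formula is an assumption based on the text.
    """
    smoothed = [preds[0]]  # the first prediction has no history
    for p in preds[1:]:
        smoothed.append(k * p + (1 - k) * smoothed[-1])
    return smoothed

# A jagged predicted RUL curve becomes visibly smoother.
raw = [120, 131, 110, 118, 102, 95, 101, 84]
print(momentum_smooth(raw, k=0.3))
```

With k = 1 the output equals the raw predictions; as k shrinks, the curve increasingly lags behind the network output, matching the trade-off discussed above.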

Early Stopping
To verify that early stopping helps prevent the model from overfitting, we set p to different values within 180 epochs: if the loss on the validation set does not decrease within p epochs, training stops, and the test set is used to evaluate the model. Setting p to None is equivalent to training for all 180 epochs, i.e., not using early stopping. The results are visualized in Figure 12. Table 7 shows that as p increases, both evaluation indicators, RMSE and score, decrease. This is because mini-batch gradient descent is stochastic, so the validation loss may not decrease continuously within a fixed number of epochs; p therefore needs to be enlarged to reduce this randomness. However, once p reaches a certain value, the RMSE and score no longer decrease significantly as p continues to grow, because the randomness has already been reduced to a negligible level. When model performance is similar, training time should be as short as possible, so p should be set to a reasonable value rather than as large as possible. The model trained for the full 180 epochs performs almost the same as the model trained with early stopping, indicating that the model reaches its optimum before 180 epochs; with early stopping, however, training normally stops before 80 epochs, saving more than half of the training time.

Optimizer
Table 8 shows that the SGD optimizer converges to a local optimum: its RMSE only drops to 37.5, whereas the other optimizers bring the RMSE below 15. With the RMSprop optimizer, the RMSE can be reduced to 11.32, but the score is higher than with the Adam optimizer, so the Adam optimizer is finally adopted for gradient descent in this method. The results are visualized in Figure 13.
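The early-stopping criterion above (stop when the validation loss fails to improve for p consecutive epochs) can be sketched as follows; the class name and structure are illustrative, not the paper's code.

```python
class EarlyStopping:
    """Stop training when validation loss has not improved for p epochs.

    p=None disables early stopping (train the full schedule).
    Illustrative helper; names are assumptions, not the paper's code.
    """
    def __init__(self, p=10):
        self.p = p
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True to stop."""
        if self.p is None:
            return False
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.p

stopper = EarlyStopping(p=3)
losses = [1.0, 0.8, 0.81, 0.82, 0.83]  # validation loss stalls after epoch 1
for epoch, vl in enumerate(losses):
    if stopper.step(vl):
        break  # training stops once the loss has stalled for p epochs
```

In the experiments described above this check would run once per epoch, with the best-so-far weights typically restored before evaluation.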

Model Generalization
To verify the generalization of the model, we take the logarithm of the RUL, letting RUL = ln(RUL + 1), and train the model again. The results are as follows.
In Table 9, after taking the logarithm of the RUL, the RMSE on the test set of sub-dataset FD001 is 0.15 and the score is 1.3. Figure 14 shows that the model still fits the new RUL.
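The target transform used in this generalization test, and its inverse for recovering the RUL scale after prediction, amount to:

```python
import numpy as np

# Forward transform applied to the training targets: RUL = ln(RUL + 1).
# The +1 keeps the transform defined at RUL = 0 (end of life).
rul = np.array([0.0, 9.0, 120.0])
log_rul = np.log(rul + 1.0)

# Inverse transform to map predictions back to the original RUL scale
# (the paper reports metrics on the log scale; inverting is our addition).
restored = np.exp(log_rul) - 1.0
```

Note that the RMSE of 0.15 quoted above is measured on the log scale, so it is not directly comparable with the RMSE values of the untransformed experiments.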


Discussion and Summary
This paper proposes a new network model, the dual-channel LSTM. In this neural network, the dual-channel LSTM handles the time feature and its first-order difference while mitigating the gradient vanishing and gradient explosion problems of long-sequence training; the CNN considers relationships along the time dimension and integrates the features of multiple time steps into one dimension, acting as a dimension-reduction step. To address the jagged shape of the RUL curve predicted by the neural network, a momentum-smoothing method is proposed to smooth the RUL curve and improve prediction accuracy. The dual-channel LSTM improves the learning ability of the network because it learns information from both dimensions of the time features, which suits real industrial environments well. However, some questions remain open. First, in momentum smoothing, the ratio k is tuned empirically; whether a precise method exists for determining k is unknown. Second, applying deep learning to industry requires a large amount of data, yet acquiring industrial data is complicated, so how to apply deep learning with little data is also a current problem. Third, industrial data come in many types, and the same model may not perform well across all of them; whether a data-fusion method can be proposed to improve the universality of the model remains a difficult point, and this problem will be explored in future research.