Remaining Useful Life Prediction of Aeroengines Based on Multi-Head Attention Mechanism

Abstract: Aeroengines are the core components of an aircraft; therefore, their health determines flight safety. Owing to their complex structure and the large number of monitoring parameters involved, predicting the remaining useful life (RUL) of aeroengines is very important to ensure their safety and reliability. In this paper, we propose a new hybrid method based on convolutional neural networks (CNN), temporal convolutional networks (TCN), and the multi-head attention mechanism. First, a CNN-TCN model is established for multi-dimensional features, in which two CNN layers extract features from the multi-dimensional input data and the TCN processes the temporal features. Subsequently, the outputs of multiple CNN-TCNs are weighted using the multi-head attention mechanism, and the results are concatenated. Next, we compare the root mean square error (RMSE) and Score of various RUL prediction methods to show the superiority of the proposed method. Compared with the best previously reported results, the RMSE and Score of FD001 decreased by 10.87% and 42.57%, respectively, whereas those of FD003 decreased by 14.13% and 58.15%, respectively.


Introduction
As a core component of the aircraft, the health of the aeroengine determines the flight safety [1]. Therefore, the prediction of the remaining useful life (RUL) of aeroengines is crucial, as it could help engineers in making reasonable maintenance decisions, reducing the cost of airline operations, and improving flight quality [2][3][4].
Currently, according to the prediction principle, there are three main categories of RUL prediction: physical failure models, data-driven models [5,6], and hybrid models [7,8]. The physical failure model-based approach combines a priori knowledge of the composition, mechanical dynamics principles, and degradation mechanisms of the equipment with sensor monitoring data to construct a physical RUL prediction model. Although this method can achieve high prediction accuracy, it is less versatile, and the modelling process is complex. The data-driven approach extracts useful information from sensor monitoring parameters and uses data analysis to mine valid features that characterize the health of an aeroengine. This technique achieves less accurate results than the physical approach but is easier to use and more flexible. The hybrid approach does not avoid the problem that physical failure models are difficult to obtain. Therefore, the data-driven RUL prediction method has been extensively studied for modelling complex devices.
The data-driven RUL prediction method commonly applies one of two prediction schemes: (1) data fusion is used to map multi-sensor monitoring data to a one-dimensional (1D) health indicator (HI), and then the HI is used for RUL prediction; (2) multi-sensor monitoring data are directly used to predict RUL. Zhou et al. [9] extracted a new HI from the operating parameters of lithium-ion batteries for degradation modelling and RUL prediction. Yang et al. [10] proposed a dynamic HI smoothing approach to smooth the current HI value against the previously predicted value. Lee et al. [11] defined an HI for the filter and then used a recurrent neural network (RNN) algorithm to predict the HI value from the degradation point to the end-of-life to generate the RUL. Gou et al. [12] proposed an RNN-based HI (RNN-HI) for RUL prediction of bearings. These methods fuse data from multiple sensors to construct a composite HI. A single-channel network model then predicts RUL using the constructed HI to characterize the degradation process of mechanical equipment.
Based on the abovementioned theory, an attempt can be made to construct a multidimensional HI using multi-dimensional features to characterize the degradation process of equipment. Ansari et al. [13] constructed a multi-channel artificial neural network (ANN) for extracting multiple features of batteries, and their proposed model showed strong versatility. Peng et al. [14] proposed a prediction model based on the idea of classification and parallel processing. Zhao et al. [15] used a two-channel hybrid model to predict the RUL of aeroengines, which demonstrated better performance than the conventional prediction models. Li et al. [16] experimentally concluded that a dual-path directed acyclic graph predicts better than a single-path convolutional neural network (CNN) or a long short-term memory (LSTM) network. Based on these findings and approaches, we treat multidimensional features as a multidimensional HI and use the nonlinear mapping capability of a multichannel network structure to establish relationships between the multidimensional features and RUL.
However, due to the high dimensionality of currently used monitoring parameters and prediction models that do not adequately extract valid information from monitoring data, predicting the health and safety of an aeroengine is difficult. We propose a new RUL prediction method that combines CNN-TCN and a multi-head attention mechanism based on research related to network structures with multiple channels. The method uses CNNs to mine temporal features, a TCN to improve the computing efficiency of the network while ensuring the integrity of long time sequences, and a self-attention mechanism to focus on useful information. Different sensor monitoring parameters were modelled separately to enable parallel processing of different sensor data, maximizing data integrity while improving the computational efficiency of the network. The proposed method was validated using NASA's C-MAPSS data, and the experimental results show a significant improvement in the RUL prediction of aeroengines.

Convolutional Neural Network
CNN is a deep learning method with a strong generalization ability, and it has achieved favorable results in processing multi-array signals such as image, time series, and audio signals [16][17][18][19]. In this study, CNN was selected to uncover channel and spatial features that can effectively characterize the degradation process of aeroengines.
As shown in Figure 1, a CNN usually alternates convolution and pooling layers. In this paper, we used two CNN layers to process the multisource sensor signals of an aeroengine. Each CNN layer has several convolution kernels of a consistent size that traverse the input multidimensional features in chronological order to create a high-dimensional feature space. Then, the different feature spaces are combined to produce the inputs to the next network [20].
The convolution layer is the core of the CNN. It consists of convolution kernels, whose main role is feature extraction, and it realizes local sensing and weight sharing, which reduces model complexity and computation cost. If $x_{n,l}$ represents the $n$th feature map of layer $l$, the output $z_{n,l}$ of layer $l$ can be calculated as follows:
$$z_{n,l} = \sum_{c=1}^{C} x_{c,l-1} * k_{n,l} + b_{n,l}$$
where $*$ is the convolution operator, $k_{n,l}$ is the $n$th weight of layer $l$, $b_{n,l}$ is the bias, and $C$ is the number of input channels.
The activation function ReLU is used to nonlinearly transform the output of the convolution layer to improve the applicability of the network, and it is calculated as follows:
$$S_{n,l} = \max(0, z_{n,l})$$
where $S_{n,l}$ represents the output of the activation function.
As a common layer after the convolution layer, the pooling layer reduces network parameters to improve the computational efficiency. In this study, we selected the max pooling layer, which can be calculated as
$$p_{n,l}(i) = \max_{(i-1)V < j \le iV} S_{n,l}(j)$$
where $V$ is the size of the pooled area, $i$ and $j$ index positions in the output and input of the pooling layer, and $p_{n,l}$ is the output of the pooling layer.
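The convolution, ReLU, and max-pooling steps described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the "valid" (no padding) convolution, and the non-overlapping pooling stride are assumptions made only for illustration, and the convolution is computed as cross-correlation, as is common in deep learning libraries.

```python
def conv1d(x, kernels, biases):
    """Valid 1D convolution over time, summed across input channels.

    x:       list of C input channels, each a list of T values
    kernels: list of N kernels, each a list of C lists of K weights
    biases:  list of N bias values
    returns: list of N feature maps, each of length T - K + 1
    """
    T = len(x[0])
    out = []
    for kernel, b in zip(kernels, biases):
        K = len(kernel[0])
        fmap = []
        for t in range(T - K + 1):
            acc = b
            for c, channel in enumerate(x):
                for j in range(K):
                    acc += channel[t + j] * kernel[c][j]
            fmap.append(acc)
        out.append(fmap)
    return out


def relu(fmaps):
    """Element-wise ReLU: negative activations are set to zero."""
    return [[max(0.0, v) for v in fmap] for fmap in fmaps]


def max_pool(fmaps, size):
    """Non-overlapping max pooling with a pooled area of `size` steps."""
    return [
        [max(fmap[i:i + size]) for i in range(0, len(fmap) - size + 1, size)]
        for fmap in fmaps
    ]


# One input channel, one kernel of width 2 (illustrative values only).
signal = [[1, 2, 3, 4, 5]]
z = conv1d(signal, kernels=[[[1, 1]]], biases=[-4])
s = relu(z)
p = max_pool(s, size=2)
```

In this toy example the convolution shortens the sequence from 5 to 4 steps and the pooling halves it again, which is the parameter reduction the text attributes to the pooling layer.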

Temporal Convolutional Network
For the long multi-dimensional sensor time series of an aeroengine, conventional CNNs are limited by network depth and cannot effectively process temporal dependencies. RNNs can capture latent temporal patterns but have difficulty avoiding vanishing or exploding gradients. To address these problems, a temporal convolutional network (TCN) is used, which is a sequence prediction model characterized by layered stacks of dilated causal convolution (DCC) with residual connections (RC) [21,22]. The RC is a constituent unit of the TCN and is illustrated in Figure 2.
Causal convolution in the TCN prevents information leakage from the future and enhances the network's memory of past information. It ensures that, when processing time-series data, the output at time t depends only on the elements at time t and earlier in the previous layer. However, with causal convolution alone, memorizing more past information requires a deeper network, and increasing the network depth reduces the efficiency of model training. To solve this problem, we introduce dilated convolution.
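Dilated convolution samples the input at fixed intervals during convolution. While ensuring that the TCN has a wider receptive field and receives more historical data, dilated convolution avoids the problems caused by extremely deep networks. The dilated convolution operation $F$ on element $s$ of the sequence is defined as:
$$F(s) = (X *_d f)(s) = \sum_{i=0}^{k-1} f(i) \cdot X_{s - d \cdot i}$$
where $d$ is the dilation factor, $k$ is the filter size, $f : \{0, 1, \ldots, k-1\} \to \mathbb{R}$ is the filter, $s - d \cdot i$ accounts for the direction of the past, $X$ is the input, and $F$ is the output. A residual block is a key structure of the TCN, which is defined as follows:
$$o = \mathrm{Activation}(x + \mathcal{F}(x))$$
where $x$ is the input of the block, $\mathcal{F}(x)$ is the output of the dilated causal convolution layers in the block, and $o$ is the output of the block.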
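As a rough illustration of dilated causal convolution, the sketch below computes F(s) = sum over i of f(i) * X[s - d*i] for a single-channel sequence, treating positions before the start of the sequence as zero (an assumed zero-padding convention). The `residual_block` helper is a deliberately simplified stand-in: it applies one dilated causal convolution and a ReLU to x + F(x), whereas a practical TCN block usually stacks several convolutions with normalization and dropout. All names are hypothetical.

```python
def dilated_causal_conv(x, f, d):
    """Dilated causal convolution of sequence x with filter f and dilation d.

    out[s] depends only on x[s], x[s - d], x[s - 2d], ...; indices that fall
    before the start of the sequence contribute zero.
    """
    out = []
    for s in range(len(x)):
        acc = 0.0
        for i in range(len(f)):
            j = s - d * i
            if j >= 0:
                acc += f[i] * x[j]
        out.append(acc)
    return out


def receptive_field(k, dilations):
    """Number of past steps (including the current one) seen by a stack of layers."""
    return 1 + sum((k - 1) * d for d in dilations)


def residual_block(x, f, d):
    """Simplified residual unit: ReLU(x + F(x)) with a single dilated causal conv."""
    fx = dilated_causal_conv(x, f, d)
    return [max(0.0, xi + fi) for xi, fi in zip(x, fx)]


series = [1, 2, 3, 4, 5]
y = dilated_causal_conv(series, f=[1, 1], d=2)   # y[s] = x[s] + x[s - 2]
o = residual_block(series, f=[1, 1], d=2)
rf = receptive_field(k=3, dilations=[1, 2, 4])   # three stacked layers
```

Doubling the dilation at each layer lets the receptive field grow quickly with depth, which is why a wide history can be covered without an extremely deep network.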

Multi-Head Attention
In the multi-dimensional long-time-series prediction of aeroengines, some features are independent of each other. Therefore, we utilized a multi-headed attention mechanism to separately process different sensor monitoring parameters.
Like the TCN, the self-attention mechanism supports parallel computation. It filters essential information from the input features and assigns different weights according to importance [23]. Hence, the model focuses on the information with larger weights, so that the degradation signal of the aeroengine can be captured quickly. We used the commonly used scaled dot-product attention mechanism, which first obtains the weights from the query and key matrices through a dot product and then normalizes them with the Softmax function. Finally, the attention is obtained by weighted summation, as follows:
$$Q = A_f W_Q, \quad K = A_f W_K, \quad V = A_f W_V$$
$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V$$
where $Q$ is the query matrix, $K$ is the key matrix, $V$ is the value matrix, and $A_f$ is the input matrix. Further, $W_Q$, $W_K$, and $W_V$ represent the weight matrices of $Q$, $K$, and $V$, respectively, and $d_k$ is the dimension of $Q$, $K$, and $V$.
The self-attention mechanism focuses on the details of the input according to the target. The multi-head attention mechanism combines several self-attention mechanisms. For the multiple sensor signals of aircraft engines, the multi-head attention mechanism is used to attend to different parameters simultaneously, and the results are finally concatenated to obtain the final attention, as follows:
$$\mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q}, K W_i^{K}, V W_i^{V})$$
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W$$
where $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ represent the weight matrices of $Q$, $K$, and $V$ in the $i$th attention head, respectively, $h$ is the number of heads, $W$ represents the weight matrix of the multi-head attention mechanism, and the outputs of the attention heads are concatenated by the merge layer.
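The scaled dot-product and multi-head attention described above can be sketched in plain Python as follows. This is a small illustrative version under stated assumptions, not the paper's network: matrices are nested lists, the per-head projections are applied directly to the input matrix, and the helper names (`matmul`, `attention`, `multi_head`) and the toy weights are hypothetical.

```python
import math


def matmul(a, b):
    """Matrix product of two nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [v / total for v in exps]


def attention(q, k, v):
    """Softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)


def multi_head(x, heads, w_out):
    """Run each head on its own projections of x, concatenate, then project.

    heads: list of (W_q, W_k, W_v) tuples, one per attention head
    w_out: output weight matrix applied to the concatenated heads
    """
    outs = [attention(matmul(x, wq), matmul(x, wk), matmul(x, wv))
            for wq, wk, wv in heads]
    concat = [sum((o[i] for o in outs), []) for i in range(len(x))]
    return matmul(concat, w_out)


# Toy input: 2 time steps, 2 features; identity projections for both heads.
eye = [[1.0, 0.0], [0.0, 1.0]]
single = attention(eye, eye, eye)
combined = multi_head(eye, heads=[(eye, eye, eye), (eye, eye, eye)],
                      w_out=[[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
```

With identity values, each output row of `attention` equals the row of attention weights, so it is easy to see that each step attends most strongly to itself while the weights still sum to one.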

Proposed Methodology
Existing data-driven RUL prediction studies on aeroengines use both CNNs and RNNs. However, CNNs cannot effectively process time-series signals, and RNNs cannot avoid problems related to long-term dependence. Based on the related research, we propose a multi-head attention model based on CNN-TCN to predict the RUL of aeroengines; it contains two CNN layers and a TCN layer. The two CNN layers extract features from the multi-source sensor signals, and the extracted features are then input into the TCN for processing. Subsequently, the multi-dimensional features of the aeroengines are processed separately using the multi-head attention mechanism, which ensures the integrity of the input data and focuses on the information with the largest weights. The RUL prediction process of aeroengines based on the CNN-TCN and multi-head attention mechanism is shown in Figure 3.
Figure 3. RUL prediction process of aeroengines.

1. Data preprocessing: Following existing research [24][25][26], the 14 sensors that show large changes are selected from among the 21 sensors. The 14 selected sensors are processed by exponential smoothing (ES) to remove environmental noise while retaining the original degradation information. Then, the sensor signals used as features are normalized to remove the influence of differing scales. A sliding window is introduced for secondary processing of the preprocessed data. As the length of the sliding window increases, more information is collected; however, short-term state changes may then be ignored. Therefore, sliding windows with lengths of 30 and 40 are selected in this article.
2. Model construction: The training dataset with its RUL labels is used to train the prediction model. We constructed a 14-channel CNN-TCN network that models the different features separately to enable parallel processing of different data. Subsequently, we chose the concatenate function in the Merge layer to stitch the multidimensional data, and two dense layers were used for regression.
3. RUL prediction: The trained model is applied to the test set, and the prediction results are compared with the true values.

Dataset Description
The experimental data are derived from the degradation data of NASA C-MAPSS turbofan engines, including FD001~FD004 subsets [27]. Each dataset contains three files: the training set, test set, and RUL true values. The proposed method was evaluated on the FD001 and FD003 in the C-MAPSS datasets. The detailed description of the datasets is presented in Tables 1 and 2.

Sensor Selection
Some sensor parameters of aeroengines that are not related to the degradation process during operation should be rejected. The accurate selection of the number of features that are highly correlated with the lifecycle can improve the efficiency of network training. For subsets FD001 and FD003, a few of the 21 sensor parameters remain essentially constant throughout the lifecycle. Hence, sensors 1, 5, 6, 10, 16, 18, and 19 were discarded [24][25][26].
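The selection step can be written down compactly. The sketch below assumes that each record holds the 21 sensor readings in sensor order (sensor 1 first); the names `DISCARDED`, `SELECTED`, and `select_sensors` are illustrative only.

```python
# Sensors reported as essentially constant for FD001 and FD003 (1-based indices).
DISCARDED = {1, 5, 6, 10, 16, 18, 19}

# The remaining sensors are kept as input features.
SELECTED = [i for i in range(1, 22) if i not in DISCARDED]


def select_sensors(readings):
    """Keep only the selected sensors from one record of 21 sensor readings."""
    if len(readings) != 21:
        raise ValueError("expected 21 sensor readings")
    return [readings[i - 1] for i in SELECTED]


# Using the sensor index itself as a dummy reading makes the result easy to inspect.
kept = select_sensors(list(range(1, 22)))
```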

Exponential Smoothing
Exponential smoothing is a time-series forecasting method that evolved from the moving average method. It gives recent observations more influence on the predicted value than commonly used methods such as the moving average or spline smoothing. In addition, the weights on the observations are adjustable. In the C-MAPSS dataset, the correlation between the current value and the surrounding values decreases with the number of cycle steps. Therefore, it is feasible to use the ES method to smooth the original sensor parameters of 100 aeroengines. This method is a special weighted moving average method, where the current smoothed value can be regarded as a weighted average of the current actual value and the smoothed value at the previous moment [28]. The calculation is as follows:
$$S_t = \alpha y_t + (1 - \alpha) S_{t-1}$$
where $S_t$ is the smoothed value at $t$, $S_{t-1}$ is the smoothed value at $t - 1$, $y_t$ is the actual value at $t$, and $\alpha$ represents the smoothing constant, which ranges from 0 to 1.
The value of α in ES determines the degree of smoothing. The higher the value of α, the greater the impact of recent data on the forecast; conversely, a lower α yields a flatter curve. When the sensor monitoring data fluctuate but do not significantly change over time, α can be set between 0.1 and 0.5. Therefore, the 14 sensor monitoring parameters of 100 aeroengines were preprocessed with α values of 0.1, 0.3, and 0.5. Two sensor signals were randomly selected from FD001 and FD003, and the processed data are shown in Figure 4.
As shown in Figure 4, when α = 0.1, the curve fluctuation after ES is small; however, the trend in the second half of the cycle period is not fitted well. When α = 0.5, the curve after ES fits the cycle period well; however, the ambient noise is not completely rejected. In contrast, when α = 0.3, the degradation curve over the cycle period is fitted well and the interference of environmental noise is largely avoided. In summary, the 14 sensor monitoring parameters of 100 aeroengines in FD001 and FD003 were smoothed using α = 0.3.
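A minimal sketch of simple exponential smoothing, using the recursion S_t = α·y_t + (1 − α)·S_{t−1}, is given below. Initializing the first smoothed value with the first observation is an assumption (the text does not state the initialization), and the function name and sample values are illustrative.

```python
def exponential_smoothing(y, alpha=0.3):
    """Simple exponential smoothing of a sequence of sensor readings.

    alpha close to 1 follows the raw data closely; alpha close to 0 gives a
    flatter, more heavily smoothed curve.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    if not y:
        return []
    smoothed = [y[0]]  # assumed initialization: S_0 = y_0
    for value in y[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


# Compare the three smoothing constants examined in the text on a noisy toy signal.
raw = [10.0, 12.0, 9.0, 13.0, 11.0, 15.0]
curves = {alpha: exponential_smoothing(raw, alpha) for alpha in (0.1, 0.3, 0.5)}
```

On the toy signal, the α = 0.1 curve stays closest to the initial value while the α = 0.5 curve tracks the raw data most closely, which mirrors the trade-off between noise rejection and trend fitting discussed for Figure 4.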

Data Normalization
The efficient use of data is important for improving the training efficiency and RUL prediction accuracy. As the data collected come from different types of sensors with different scales, the data must be normalized. For the datasets FD001 and FD003, the MinMaxScaler is used to scale the data of each sensor signal. Given $n$ as the number of time cycles, the raw data of a sensor are denoted as $X = [x_1, x_2, x_3, \ldots, x_n]$, with each sensor value scaled as
$$x_i' = \frac{x_i - \min(X)}{\max(X) - \min(X)}$$
where $x_i'$ is the normalized value of $x_i$.
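Min-max scaling of one sensor signal to the range [0, 1] can be sketched as follows. This standalone function only mimics the default behaviour of a min-max scaler for a single feature; returning zeros for a constant signal is an assumption made to avoid division by zero, and the function name is hypothetical.

```python
def min_max_scale(x):
    """Scale one sensor signal linearly so that its minimum maps to 0 and its maximum to 1."""
    lo, hi = min(x), max(x)
    if hi == lo:
        # Constant signal: no spread to scale (assumed convention).
        return [0.0 for _ in x]
    return [(v - lo) / (hi - lo) for v in x]


scaled = min_max_scale([2.0, 4.0, 6.0])
```

In practice the minimum and maximum would typically be estimated on the training set and reused for the test set, so that both are scaled consistently.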

RUL Target Function
Owing to fatigue damage, friction damage, or fracture during operation, the RUL of the components inevitably decreases with time. In the early stages of aircraft engine operation, the wear on mechanical parts is negligible. Hence, the aircraft engine is assumed to be in a healthy state. As the working time increases, the wear of components can no longer be ignored; once the critical degradation value, R, is reached, the aircraft engine enters the degradation state. According to several studies, the commonly used R values for multivariate sensor data of aeroengines are 120, 125, 130, and 135 [29,30]; in this study, we set the R value as 130. The RUL target function for aeroengines is defined as follows:
$$RUL(x) = \begin{cases} R, & x \le a - R \\ a - x, & x > a - R \end{cases}$$
where $x$ is the number of cycles at the measured point, and $a$ is the maximum number of cycles.
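A piecewise-linear RUL target of this kind is commonly implemented by capping the linear RUL at R. The sketch below assumes that form, that is, a constant target of R while the engine is considered healthy and a linear decrease to zero afterwards; the function name is illustrative.

```python
def rul_target(x, a, r=130):
    """Piecewise-linear RUL label.

    x: current cycle number
    a: maximum number of cycles (cycle at which the engine fails)
    r: critical degradation value; the label is capped at r
    """
    return min(r, a - x)


# Labels for an engine that runs for 200 cycles.
labels = [rul_target(x, a=200) for x in range(1, 201)]
```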

Metrics
The score [31][32][33] and root mean square error (RMSE) are the two commonly used evaluation metrics for the C-MAPSS dataset; these are defined as follows:
$$Score = \sum_{i=1}^{N} s_i, \quad s_i = \begin{cases} e^{-d_i/13} - 1, & d_i < 0 \\ e^{d_i/10} - 1, & d_i \ge 0 \end{cases}$$
$$RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} d_i^2}$$
where $d_i$ is the difference between the predicted and true RUL values, and $N$ is the number of test samples. The RMSE measures the deviation of the predicted value from the true value. The smaller the RMSE value, the closer the predicted value is to the true value. In actual conditions, the sign of the difference between the predicted and true values has a significant impact on the subsequent maintenance and work guarantee. An early prediction (predicted RUL below the true RUL) allows for timely repair before failure, although an excessively early prediction leads to unnecessary waste. A late prediction leads to more serious consequences, because not predicting the failure in time creates a safety hazard. As the RMSE treats early and late predictions equally, the score is introduced to increase the penalty for late predictions, with lower scores indicating better predictions.
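The two metrics can be computed as in the sketch below. The scoring function uses the asymmetric exponential form that is widely used with C-MAPSS, with constants 13 for early and 10 for late predictions; these constants and the function names are stated here as assumptions, since the formulas are only described in words above.

```python
import math


def rmse(predicted, true):
    """Root mean square error between predicted and true RUL values."""
    diffs = [p - t for p, t in zip(predicted, true)]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


def score(predicted, true):
    """Asymmetric score: late predictions (d > 0) are penalized more than early ones."""
    total = 0.0
    for p, t in zip(predicted, true):
        d = p - t
        if d < 0:
            total += math.exp(-d / 13) - 1   # early prediction
        else:
            total += math.exp(d / 10) - 1    # late prediction
    return total


# Two predictions with the same absolute error but opposite signs.
early = score([90], [100])
late = score([110], [100])
error = rmse([110, 90], [100, 100])
```

The two predictions contribute equally to the RMSE, but the late one receives the larger score, which is the asymmetry the text motivates.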

Time Window
For long time sequences, the window length is an important parameter, which is directly related to the final accuracy of the deep learning model [34,35]. We introduced a time window to reconstruct data for 14-dimensional sensor data and RUL labels. For an extremely small time window, the correlation between features and time cannot be captured. In contrast, an extremely large time window contains more useful information but tends to cause the network to ignore short-term changes in features. The operational flow of the time window is shown in Figure 5.
To determine the optimal time-window length for FD001 and FD003, we experimented with different time-window lengths (time-window length is the number of cycles per interception) and their effects on the prediction results, as shown in Table 3. The results show that the choice of the time-window length plays a crucial role in the final accuracy of the deep learning model. The prediction effect improves with the increasing length of the time window at the beginning of the experiment, and the prediction performance mostly decreases gradually after the time-window length of Ltw > 30. Hence, for FD001 and FD003, we recommend Ltw = 30 and 40, respectively.
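The time-window reconstruction can be sketched as a sliding window over the cycles of one engine. Pairing each window with the RUL label of its last cycle and using a stride of one cycle are assumptions made for illustration; the function name is hypothetical.

```python
def sliding_windows(features, rul, length):
    """Cut one engine's history into overlapping windows of `length` cycles.

    features: list of T rows, each a list of sensor values for one cycle
    rul:      list of T RUL labels, one per cycle
    returns:  (windows, labels), where each window is paired with the RUL
              of its last cycle (assumed labelling convention)
    """
    windows, labels = [], []
    for end in range(length, len(features) + 1):
        windows.append(features[end - length:end])
        labels.append(rul[end - 1])
    return windows, labels


# Toy engine with 5 cycles and a single sensor, windowed with length 3.
toy_features = [[0.0], [1.0], [2.0], [3.0], [4.0]]
toy_rul = [4, 3, 2, 1, 0]
toy_windows, toy_labels = sliding_windows(toy_features, toy_rul, length=3)
```

An engine with T cycles yields T − L + 1 windows of length L, so a longer window gives each sample more context but fewer samples per engine.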

Model Features
To explore the structure of the optimal model, this section describes the effect of different layers and hyperparameters on the prediction performance. Separate tests were performed on FD001 and FD003, and the results of the study are listed in Tables 4 and 5.
The comparison of different layer configurations shows that the model combining a 2-layer CNN with a 1-layer TCN achieves the best prediction performance. Compared with other structures, the optimal structure reduces the RMSE and Score of FD001 by 21.15% and 35.42% on average, and those of FD003 by 18.45% and 58.87% on average, respectively. Moreover, the effects of different network parameters were evaluated. The results show that the best prediction is achieved when filters = 32 and kernel size = 3. The final optimal model parameters obtained are shown in Table 6.

Comparison with the State-of-the-Art Models
The proposed network was trained with FD001 and FD003 in the C-MAPSS dataset and tested on all test aeroengines. To further analyze the proposed method and demonstrate its superiority, the predicted results for the different sub-datasets are shown in Figure 6. In addition, Figure 7 shows the predicted and actual degradation processes for two randomly selected engines from FD001 and FD003.
The C-MAPSS dataset is a widely used public dataset in this research field. To verify the effectiveness of the proposed method, we compared it with the current mainstream machine learning methods and composite methods proposed in previous studies. The comparison results are shown in Table 7. As observed, the different methods all obtained favorable prediction results on this dataset, and the proposed prediction model showed a significant improvement in prediction performance over the state-of-the-art methods. Compared with the best previous method, the RMSE and Score of FD001 decreased by 10.87% and 42.57%, whereas those of FD003 decreased by 14.13% and 58.15%, respectively. Of the two performance evaluation metrics, the Score improved more markedly.

Conclusions
This study introduced a novel model for predicting RUL based on CNN, TCN, and a multi-head attention mechanism; the model was shown to be suitable for long time series. The proposed method uses a two-layer CNN to process the input long time series and uncover features that effectively characterize the degradation process of aeroengines. The introduction of the TCN improves the gradient propagation capability and computational efficiency of the network while ensuring the extraction of time-series feature information. The multi-head attention mechanism is used to increase the depth of the network in both vertical and horizontal directions. The multi-head structure maximizes the retention of useful information from the original sensor parameters. The proposed method was evaluated on FD001 and FD003 of the C-MAPSS dataset, with the results showing improved accuracy. In this study, we tested the experimental data of aeroengines under a single operating condition; future studies must include RUL prediction tests under complex operating conditions.
Author Contributions: Conceptualization, L.N.; methodology, L.N. and S.X.; investigation, L.Z. and Y.Y.; software, S.X.; validation, Y.Y.; writing-original draft preparation, S.X.; writing-review and editing, L.N., S.X., L.Z., Y.Y., Z.D. and X.Z. All authors have read and agreed to the published version of the manuscript.