SACGNet: A Remaining Useful Life Prediction of Bearing with Self-Attention Augmented Convolution GRU Network

Abstract: In recent years, the development of deep learning-based remaining useful life (RUL) prediction methods for bearings has flourished because of their high accuracy, easy implementation, and lack of reliance on a priori knowledge. However, two challenging issues limit the prediction accuracy of existing methods. First, run-to-failure sequential data and the corresponding RUL labels are almost inaccessible in real-world scenarios. Second, existing models usually capture the general degradation trend of bearings while ignoring local information, which restricts model performance. To tackle these problems, we propose a novel health indicator derived from the original vibration signals by combining principal component analysis with the Euclidean distance metric, which removes the dependency on RUL labels. Then, we design a novel self-attention augmented convolution GRU network (SACGNet) to predict the RUL. Combining a self-attention mechanism with a convolution framework can both adaptively assign greater weights to more important information and focus on local information. Furthermore, Gated Recurrent Units are used to parse the long-term dependencies in the weighted features, so that SACGNet can utilize the important weighted features and focus on local features to improve the prognostic accuracy. The experimental results on the PHM 2012 Challenge dataset and the XJTU-SY bearing dataset demonstrate that our proposed method is superior to the state of the art.


Introduction
Bearings are one of the key components in a rotating machinery system. The remaining useful life (RUL) of a bearing is often defined as the length of time from the current time to failure [1]. If the time of damage or the trend of the vibration signal can be predicted from the collected vibration signal of a bearing, adverse running conditions can be identified in time to avoid sudden bearing failure. Thus, the RUL of a bearing is essential for the maintenance and management of mechanical systems [1,2].
In general, RUL prediction methods for bearings can be divided into two categories: physics-based methods and data-driven methods. Physics-based methods focus on physical and mathematical models, e.g., partial differential equations and state-space models, which require extensive prior knowledge [3][4][5][6].
Data-driven RUL methods directly use historical data to model the degradation process of bearings without any prior knowledge.
Deep learning is a popular approach among data-driven methods, which can directly build a deep neural network to model the degradation process as a functional relationship between health states and original sensory data [7].

The main contributions of this paper are as follows:

1. We combine PCA with the Euclidean distance metric to construct a health indicator (HI) that tackles the lack of RUL labels. For high-dimensional, long time-series data, PCA reduces the data dimensionality while retaining sufficient useful features. The Euclidean distance is used to measure the similarity between data points in order to distinguish the different degradation stages. Compared with the existing linear RUL labels, our HI not only represents the general degradation trend of bearings but also retains more local features from the original vibration signal, which benefits the corresponding model's learning and calculations.

2. We design a novel self-attention augmented convolution GRU network (SACGNet) to predict the RUL. Combining the self-attention mechanism with a convolution framework can both adaptively assign greater weights to more important information and focus on local information. Furthermore, Gated Recurrent Units (GRU) are used to parse the long-term dependencies in the weighted features, so that SACGNet can utilize the important weighted features and focus on local features to improve the prognostic accuracy.

3. Based on the designed HI and SACGNet, a novel remaining useful life prediction approach is proposed. We conduct ablation experiments and comparison experiments on the PHM 2012 Challenge dataset and the XJTU-SY bearing dataset. The experimental results demonstrate the superiority of our proposed method.
The remaining part of this paper is organized as follows: In Section 2, we introduce related works in the field of RUL prediction. We describe our proposed method in detail in Section 3. The experimental results are discussed in Section 4. Finally, we conclude the paper in Section 5.

Health Indicator Construction
In deep learning-based RUL methods, HI construction currently has two general branches. One approach extracts a simple physical fault characterization from the original vibration signals as the HI, using statistical or signal processing methods. Examples include the root mean square (RMS) of the original vibration signal [10] and the percentage of useful life (the current life divided by the total useful life) [11][12][13][14]. However, such HIs cannot represent enough of the useful degradation information in the original data. Consequently, a model that takes such HIs as input fails to accurately capture the degradation trend for RUL prediction.
The other branch constructs a virtual HI by fusing multiple physical characteristics or multi-sensor signals. These HIs can filter out abnormal trends in the early degradation stages, which makes them more suitable for model learning [15][16][17][18]. Guo et al. selected six related-similarity features and combined them with eight time-frequency features to form an original feature set that contains rich degradation signatures of bearings. Then, the selected features were fused into an HI through an RNN [19]. Li et al. used kernel PCA (KPCA) to integrate multiple features and introduced the exponentially weighted moving average (EWMA) to reduce the fluctuations of the constructed HI [20]. Li et al. designed a generative adversarial network to learn the data distribution of the healthy state of a machine, using the output of the discriminator as the HI [21]. Liang et al. proposed a novel index by calculating the offset distance and offset angle between the current state and the normal state of devices [22].
In summary, the existing HIs can only represent the global degradation process of the vibration signal and fail to retain local features. In order to extract more representative features from the vibration signal and to facilitate model learning, a more effective HI construction method is proposed in this paper.

Prediction Model
With respect to regression model design, LSTM, GRU, CNN, and the attention mechanism have been successively introduced into the field of RUL prediction.
LSTM uses input gates, forget gates, and output gates to regulate the information of the input sequence, which enables the network to learn the long-term dependencies of the data and achieve favorable results. Hinchi et al. used convolutional layers to directly extract local features from sensor data, combined them with LSTM layers to capture the degradation process of the bearing, and finally output the prediction values [23]. Although LSTM alleviates the vanishing gradient problem of the traditional RNN to some extent, deliberately designing an LSTM for RUL prediction is very time consuming [24,25].
GRU is a variant of LSTM with a further simplified structure, and it can perform better than LSTM on smaller datasets [26,27]. Cao et al. used the BiGRU model to address the problem of distribution discrepancy [28]. However, LSTM and GRU only use the features learned in the previous time step for regression prediction and often do not pay attention to local features in long time series [29].
CNN can extract features with less computational effort because of the parameter sharing of the convolutional kernels and the sparsity of inter-layer connections. More importantly, CNN focuses on the local features in the original vibration signal, which is suitable for RUL prediction [30][31][32][33]. Wang et al. proposed a multi-scale convolutional network to improve the domain adaptation capability of the RUL prediction model [34].
The original vibration signal often contains different features with different levels of importance, and the features that carry more important information should receive more attention. Hence, the attention mechanism has been introduced to the RUL prediction of bearings to adaptively weight input features [35]. The self-attention mechanism aims to correlate different positions of a sequence, which reduces the dependence on external information and is more suitable for capturing the internal relevance of data or features [36,37]. Chen et al. constructed an encoder-decoder model based on the attention mechanism to mine useful degradation information from a long historical vibration signal [38]. Chen et al. proposed an attention-based deep learning framework for RUL prediction, which adopted LSTM to extract features and then used an attention layer to fuse the LSTM-extracted features with manually extracted features [39].

Proposed Method
Without loss of generality, given the vibration signals of a bearing, $V = \{v_1, v_2, \dots, v_m\}$, we input $V$ to the health indicator construction module and obtain $HI = \{h_1, h_2, \dots, h_m\}$. We expect the model to predict the value $h_{t+1}$ after receiving $h_1, h_2, \dots, h_t$ as input.
Then, our proposed SACGNet learns the deep features in the HI:

$$\hat{h}_{t+1} = F(h_1, h_2, \dots, h_t; \theta),$$

where $\theta$ denotes the parameters of the model. There is also a testing dataset of vibration signals $V = \{v_1, v_2, \dots, v_n\}$, which after HI construction yields $H = \{h_1, h_2, \dots, h_n\}$.
Finally, when $H$ is input to the trained SACGNet, the model outputs the prediction:

$$\hat{y}_t = F(h_1, h_2, \dots, h_t),$$

where $\hat{y}_t$ is the model's prediction value.
The whole structure of SACGNet is shown in Figure 1, including the health indicator construction module and remaining useful life prediction module. First, input the original signal of the bearings to the health indicator construction module to obtain the HI. After data normalization and sliding window processing, input it to SACGNet for training. In the testing stage, the predicted values are output by autoregression.
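The autoregressive testing stage mentioned above can be illustrated with a minimal sketch. It assumes a one-step predictor that maps the last `window` HI values to the next value; `linear_stub` is a hypothetical stand-in for the trained SACGNet, and the toy HI series is illustrative only.

```python
import numpy as np

def autoregressive_forecast(history, predict_next, window=20, steps=5):
    """Roll a one-step predictor forward, feeding each prediction back as input."""
    buf = list(history[-window:])
    preds = []
    for _ in range(steps):
        y = float(predict_next(np.asarray(buf[-window:])))
        preds.append(y)
        buf.append(y)  # the prediction becomes part of the next input window
    return np.asarray(preds)

def linear_stub(w):
    """Hypothetical stand-in for the trained model: linear extrapolation."""
    return 2.0 * w[-1] - w[-2]

hi = np.arange(30, dtype=float) * 0.1  # toy HI that rises linearly (assumption)
forecast = autoregressive_forecast(hi, linear_stub, window=20, steps=3)
```

Because each predicted value re-enters the input window, errors can accumulate over long horizons, so a rollout like this is usually checked against held-out run-to-failure data.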

Health Indicator Construction Module
If the dimension of the original signal is set to $d$, the original vibration signal can be written in matrix form as $V = \{v_1, v_2, \dots, v_m\}$, where $v_i$ denotes one acquired vibration sample that can be written as:

$$v_i = [v_{i,1}, v_{i,2}, \dots, v_{i,d}].$$

Principal component analysis (PCA) linearly transforms the data into a new coordinate system such that the greatest variance of any projection of the data lies on the first coordinate (called the first principal component), the second greatest variance lies on the second coordinate, and so on.
Let $V$ denote the de-averaged data and $V^{T}$ its transpose. The singular value decomposition of $V$ is:

$$V = W \Sigma H^{T},$$

where the matrix $W$ is the eigenvector matrix of $VV^{T}$, $\Sigma$ is a non-negative rectangular diagonal matrix, and $H$ is the eigenvector matrix of $V^{T}V$.
Assuming zero empirical mean, the first principal component $w_{(1)}$ of the dataset $V$ can be defined as:

$$w_{(1)} = \arg\max_{\|w\|=1} \|Vw\|^{2} = \arg\max_{\|w\|=1} w^{T} V^{T} V w.$$

To obtain the $k$-th principal component, the previous $k-1$ principal components must first be subtracted from $V$:

$$\hat{V}_{k-1} = V - \sum_{s=1}^{k-1} V w_{(s)} w_{(s)}^{T}.$$

Then, the $k$-th principal component is obtained from this updated dataset, and the search for principal components continues:

$$w_{(k)} = \arg\max_{\|w\|=1} \|\hat{V}_{k-1} w\|^{2}.$$

Through PCA, we can reduce the dimensionality of the original vibration signal from $d$ to $k$. We retain the $k$ principal components of the original signals; thus, the dimension of $V_{pca}$ is $k$, and $V_{pca}$ can be abbreviated as $\{v_{pca_1}, v_{pca_2}, v_{pca_3}, \dots, v_{pca_n}\}$.
Standard PCA alone only retains the principal components of the original vibration data. In this paper, after using PCA to reduce the dimensionality of the vibration data, we use the Euclidean distance between the low-dimensional data points to construct the HI. The Euclidean distance measures the similarity between each data point in the time series and its neighboring points, which better reflects the trend of the neighboring data in the original vibration signal. This is what we mean by "capturing local features".
By calculating the average Euclidean distance from each point in $V_{pca}$ to its sequential neighboring points, we obtain the HI value $h_i$ corresponding to each point.
The HI is then input to the constructed SACGNet so that the model learns the relationships within the HI sequence. In order to make the HI meet the dimensionality requirements of the model input, a sliding window of size 20 is used to process the data into the shape required by the model. We thereby obtain $X = \{x_1, x_2, \dots, x_n\}$.
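The HI construction pipeline described in this subsection (PCA, averaged Euclidean distance to sequential neighbors, sliding window of size 20) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the number of neighbors (`n_neighbors=3`), the use of forward neighbors only, the min-max normalization, the choice `k=2`, and the synthetic data are all assumptions.

```python
import numpy as np

def pca_reduce(V, k):
    """Project mean-centered rows of V onto the top-k principal directions via SVD."""
    Vc = V - V.mean(axis=0, keepdims=True)
    _, _, Ht = np.linalg.svd(Vc, full_matrices=False)
    return Vc @ Ht[:k].T

def health_indicator(V_pca, n_neighbors=3):
    """Average Euclidean distance from each point to its next n_neighbors points."""
    m = len(V_pca)
    hi = np.zeros(m)
    for i in range(m):
        nbrs = V_pca[i + 1 : i + 1 + n_neighbors]
        if len(nbrs) == 0:  # last point has no successor: reuse previous value
            hi[i] = hi[i - 1] if i > 0 else 0.0
        else:
            hi[i] = np.linalg.norm(nbrs - V_pca[i], axis=1).mean()
    return hi

def sliding_windows(series, window=20):
    """Overlapping windows as inputs X, with the value after each window as target y."""
    X = np.stack([series[i : i + window] for i in range(len(series) - window)])
    y = series[window:]
    return X, y

rng = np.random.default_rng(0)
# Synthetic stand-in for m=200 samples of dimension d=16 with growing amplitude.
V = rng.normal(size=(200, 16)) * np.linspace(1.0, 3.0, 200)[:, None]
V_pca = pca_reduce(V, k=2)
hi = health_indicator(V_pca)
hi = (hi - hi.min()) / (hi.max() - hi.min())  # min-max normalization (assumption)
X, y = sliding_windows(hi, window=20)
```

Each row of `X` holds 20 consecutive HI values and the matching entry of `y` is the value that follows them, which matches the one-step prediction task stated in the problem formulation.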

Remaining Useful Life Prediction Module
In this section, we describe our SACGNet in detail, as shown in Table 1. We combine a 1D convolution (Conv1d) block with self-attention mechanisms to extract deep features from the input data. The Conv1d block focuses more on local features, and the self-attention mechanism can extract global features of the data. GRU can identify long-term features in the input data, which helps the model adapt to bearings under different operating conditions, thereby improving the prediction accuracy of our model [40,41]. For convenience, we use C, P, FC, D, MHA, and GRU to denote the Conv1d layer, the pooling layer, the fully connected layer, the dropout layer, the Multi-Head attention layer, and the GRU layer, respectively.
In the convolution layer, the calculation on the input data $x$ can be written as follows:

$$c_i = \mathrm{ReLU}(x * f_i + b_i),$$

where $*$ represents the convolution operation, $f_i$ represents the $i$-th convolution filter, and $b_i$ is the bias. The convolution layers use ReLU as the activation function. Unlike images, the vibration signal is time-series data; hence, a one-dimensional convolution (Conv1d) neural network is used to perform the convolutional operations. The number of filters of the Conv1d layer is set to 80, the kernel size is set to 4, and the stride is set to 1. For the pooling layer, we select average pooling.
The calculation of the self-attention mechanism can be summarized in two steps: calculating the weight coefficients based on the Query and Key, and computing a weighted sum of the Values based on those weight coefficients. The first step can be further divided: first, calculate the similarity or relevance between the Query and the Key, and then normalize the resulting relevance scores. The attention function can be described as mapping a Query and a set of Key-Value pairs to an output, where the Queries, Keys, and Values are vectors. The output is computed as a weighted sum of the Values, where the weight assigned to each Value is computed by a compatibility function of the Query with the corresponding Key.
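The Conv1d and average pooling computations described above can be sketched in NumPy. The filter count (80), kernel size (4), and stride (1) follow the text; the pooling size of 2, the single input channel, the random weights, and the use of cross-correlation (the usual deep learning convention for "convolution") are assumptions.

```python
import numpy as np

def conv1d_relu(x, filters, bias, stride=1):
    """Valid 1D convolution (cross-correlation) followed by ReLU.

    x: (length,), filters: (n_filters, kernel_size), bias: (n_filters,)
    """
    k = filters.shape[1]
    starts = range(0, len(x) - k + 1, stride)
    patches = np.stack([x[s : s + k] for s in starts])   # (out_len, k)
    return np.maximum(patches @ filters.T + bias, 0.0)    # (out_len, n_filters)

def average_pool(feat, pool=2):
    """Non-overlapping average pooling along the time axis."""
    out_len = feat.shape[0] // pool
    return feat[: out_len * pool].reshape(out_len, pool, -1).mean(axis=1)

rng = np.random.default_rng(0)
x = rng.normal(size=20)                    # one sliding window of HI values
filters = rng.normal(size=(80, 4)) * 0.1   # 80 filters, kernel size 4
bias = np.zeros(80)
feat = conv1d_relu(x, filters, bias)       # 20 - 4 + 1 = 17 time steps
pooled = average_pool(feat, pool=2)        # 17 // 2 = 8 time steps
```

With a window of 20 and a kernel of 4, the valid convolution yields 17 time steps of 80 features each, which is the kind of feature map a subsequent attention layer would consume.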
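In order to learn multiple representations, the input data $X$ is linearly transformed, where $W^{Q}$, $W^{K}$, and $W^{V}$ are the learned weight matrices. Self-attention attends to the input itself, so the Queries, Keys, and Values are all computed from the same input:

$$Q = XW^{Q}, \quad K = XW^{K}, \quad V = XW^{V}.$$

The output matrix of self-attention is expressed as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V,$$

where $d_k$ is the dimension of $K$. Dividing by $\sqrt{d_k}$ keeps the variance of the dot products close to one, which prevents the softmax from saturating.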
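As a minimal sketch of the scaled dot-product self-attention just described, the following NumPy code computes the attention weights and output for a single head. The sequence length (8), feature width (80), per-head width (10), and random weight matrices are assumptions for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over the rows of X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # (seq_len, seq_len), rows sum to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 80))  # 8 time steps, 80 features (assumed shapes)
W_q, W_k, W_v = (rng.normal(size=(80, 10)) * 0.1 for _ in range(3))
out, weights = self_attention(X, W_q, W_k, W_v)
```

Each row of `weights` is a probability distribution over the time steps, so every output row is a weighted average of the Value vectors.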
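Multi-head attention allows the model to attend to information from different representation subspaces. The output of the self-attention layer is a three-dimensional tensor, written as $X_a$:

$$X_a = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)W^{O}, \qquad \mathrm{head}_j = \mathrm{Attention}(XW_j^{Q}, XW_j^{K}, XW_j^{V}),$$

where $W^{O}$ is the output projection matrix. In this paper, we choose $h = 8$.
Gated Recurrent Units (GRUs) are a gating mechanism in recurrent neural networks. The calculation of the GRU can be written as follows:

$$z_t = \sigma_g(W_z a_t + U_z h_{t-1} + b_z),$$
$$r_t = \sigma_g(W_r a_t + U_r h_{t-1} + b_r),$$
$$\hat{h}_t = \phi_h(W_h a_t + U_h (r_t \odot h_{t-1}) + b_h),$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \hat{h}_t,$$

where $a_t$ is the input vector taken from $X_a$ at time $t$, $h_t$ is the output vector, $\hat{h}_t$ is the candidate activation vector, $z_t$ is the update gate vector, $r_t$ is the reset gate vector, $W$, $U$, and $b$ are the parameter matrices and vectors, $\sigma_g$ is the sigmoid function, $\phi_h$ is the hyperbolic tangent, and $\odot$ denotes element-wise multiplication. The GRU layer receives the features extracted by the Conv1d layer and the Multi-Head attention layer and then outputs the prediction value. The number of GRU units is set to 80.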
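A single GRU update can be sketched in NumPy to make the gating explicit. The 80 hidden units follow the text; the input width, the random parameters, and the gate convention (the candidate state is weighted by the update gate `z`) are assumptions, since references differ on which term `z` multiplies.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(a_t, h_prev, p):
    """One GRU update: update gate z, reset gate r, candidate state, new state."""
    z = sigmoid(p["W_z"] @ a_t + p["U_z"] @ h_prev + p["b_z"])
    r = sigmoid(p["W_r"] @ a_t + p["U_r"] @ h_prev + p["b_r"])
    h_cand = np.tanh(p["W_h"] @ a_t + p["U_h"] @ (r * h_prev) + p["b_h"])
    return (1.0 - z) * h_prev + z * h_cand  # assumed convention for z

def init_params(n_in, n_hidden, rng):
    """Small random parameters for the three gates (illustrative only)."""
    p = {}
    for g in "zrh":
        p[f"W_{g}"] = rng.normal(size=(n_hidden, n_in)) * 0.1
        p[f"U_{g}"] = rng.normal(size=(n_hidden, n_hidden)) * 0.1
        p[f"b_{g}"] = np.zeros(n_hidden)
    return p

rng = np.random.default_rng(0)
params = init_params(n_in=80, n_hidden=80, rng=rng)
X_a = rng.normal(size=(8, 80))  # stand-in for attention output, 8 time steps
h = np.zeros(80)
for a_t in X_a:                 # unroll the recurrence over time
    h = gru_step(a_t, h, params)
```

Because the new state is a convex combination of the previous state and a tanh-bounded candidate, every component of `h` stays strictly inside (-1, 1) when the initial state is zero.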
Finally, the fully connected layer maps the GRU output to the final prediction $\hat{y}$. SACGNet is trained using the error back-propagation algorithm and the gradient descent method. The loss function of the training process is the mean square error function:

$$L = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^{2},$$

where $y_i$ is the true value, $\hat{y}_i$ is the prediction value, and $n$ is the total number of samples. In addition, Adam is chosen as the optimizer, and the learning rate is set to $10^{-3}$ [42].
Dropout is added to our SACGNet with the parameter set to 0.5 in order to reduce overfitting by preventing complex co-adaptations on training data. The algorithm pseudocode is shown in Table 2.
[Table 2: algorithm pseudocode. Output: trained SACGNet model for prediction.]

Experiments and Results
In this section, we use the IEEE PHM Challenge 2012 bearing dataset and the XJTU-SY Bearing dataset to validate the effectiveness of our method.

Dataset Description
The IEEE PHM 2012 Challenge dataset was collected from the PRONOSTIA testbed, as shown in Figure 2. The PRONOSTIA test platform contains a rotating part, a load part, and a data collection part. The motor power of the rotating part is 250 W, and the power is transferred to the bearing by the axis of rotation. The load part provides a load of 4000 N to make the bearing degrade quickly. Acceleration sensors are placed on the bearing seat in the horizontal and vertical directions to collect the vibration signals. The sampling frequency of the acceleration sensors is 25.6 kHz. When the test platform starts to work, the vibration signal is recorded every 10 s, and the sampling time is 0.1 s [43].
The data provided by IEEE PHM Challenge 2012 include three different operating conditions. Seven bearings (bearings 1-1 to 1-7) work in the first condition, the motor speed is 1800 rpm, and the load is 4000 N. Seven bearings (bearings 2-1 to 2-7) work in the second condition, the motor speed is 1650 rpm, and the load is 4200 N. Three bearings (bearings 3-1 to 3-3) work in the third condition, the motor speed is 1500 rpm, and the load is 5000 N. Table 3 illustrates the details of the PHM 2012 dataset.
In this paper, the vibration data of bearings 1-1, 1-2, 2-1, 2-2, and 3-1 are selected as the training set, while the rest of the bearings are selected as the testing set.
The XJTU-SY bearing dataset is provided by the Institute of Design Science and Fundamental Research of Xi'an Jiaotong University and contains the run-to-failure vibration data from 15 rolling bearings [44].
As shown in Figure 3, the bearing testbed is composed of an alternating current (AC) induction motor, a motor speed controller, a support shaft, two support bearings (heavy duty roller bearings), and a hydraulic loading system. This testbed is designed to conduct accelerated degradation tests of the tested bearings under different operating conditions (i.e., different radial forces and rotating speeds). The radial force is generated by the hydraulic loading system and applied to the housing of the tested bearings, and the rotating speed is set and kept by the speed controller of the AC induction motor [44]. Three different operating conditions are set in the accelerated degradation experiments, and five bearings are used under each operating condition. The sampling frequency is 25.6 kHz, and the sampling period is 1 min. Table 4 illustrates the details of the XJTU-SY dataset.
In this paper, the mean square error (MSE), root mean square error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE) are used to evaluate the prediction accuracy. They are respectively computed as follows:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^{2} \tag{18}$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^{2}} \tag{19}$$
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \tag{20}$$
$$\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right| \tag{21}$$

In Equations (18)-(21), $y_i$ is the label, $\hat{y}_i$ is the model's prediction value, and $n$ is the total number of samples.
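The four evaluation metrics can be computed with a few lines of NumPy. This sketch uses the standard definitions, with MAPE expressed in percent (an assumption about the scaling used in the result tables); the toy labels and predictions are illustrative only.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE, and MAPE (in percent) for two equal-length arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    return {
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "MAE": np.mean(np.abs(err)),
        # MAPE divides by the label, so it is undefined when a label is zero.
        "MAPE": 100.0 * np.mean(np.abs(err / y_true)),
    }

m = regression_metrics([1.0, 2.0, 4.0], [1.5, 2.0, 3.0])
```

Since MAPE divides by the label, labels at or near zero (as can occur with a normalized HI) inflate it sharply, which is worth keeping in mind when comparing MAPE values across bearings.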

Different HIs Results
In this section, we compare different HI construction methods to validate the superiority of our proposed HI construction method. Figure 4 shows the results of different HI construction methods on bearing 1-3 of the PHM dataset. It can be clearly observed from Figure 4a that the original vibration signal of bearing 1-3 is very smooth, with little fluctuation, when the bearing has just begun to work. In the degradation state, the vibration signal usually fluctuates slightly, while the overall trend is upward. The signal fluctuation increases sharply when the bearing finally degrades completely.
As shown in Figure 4b-h, the HIs constructed by combining t-SNE with the various distance metrics show essentially no regular trend. Meanwhile, combining SAE with the Euclidean distance metric retains the trend of the original vibration signal, but the early degradation stage and the complete degradation stage of the bearing cannot be completely distinguished. For the vibration signal of the bearing, PCA is a linear transformation method: each principal component is a linear combination of the original dimensions, so the global trend of the original signal is retained after dimensionality reduction. In contrast, SAE is a nonlinear learning model that requires a large amount of training data to achieve satisfactory performance. Our proposed method, shown in Figure 4i, better reflects the trend of the original signal and can distinguish the early degradation stage from the complete degradation stage of the bearing, which helps SACGNet improve its prediction accuracy.
In order to further illustrate the superiority of our HI construction method, we use the percentage of useful life as the HI and compare the resulting RUL predictions with those of our proposed method on the PHM dataset. In the corresponding plots, the red lines are the true values of the HI, and the blue lines are the prediction values of the model. It can be seen that when the true remaining useful life percentage of the bearing is used as the HI label, the model fails to learn the degradation trend of the bearing vibration signal, and the final prediction results are not well fitted. In contrast, when our proposed HI is used for prediction, the degradation trend of the vibration data is depicted more accurately, and the prediction accuracy of the model is improved. Note that our model can properly predict the RUL in the stages of rapid degradation, which is of great value for actual industrial scenarios.

Ablation Experiments
In order to observe the effects of the different layers in the proposed model, we conduct ablation experiments on the PHM dataset and the XJTU-SY dataset. We keep the rest of the model structure unchanged and remove the Multi-Head Attention layer and the Conv1d layer, respectively, to obtain the comparison models. We call them the NoAttention model and the NoConv1d model, respectively.
As seen in Table 5, with respect to the PHM dataset, our method achieved the best results on 10 of the 11 bearings for the MSE, RMSE, MAE, and MAPE metrics. Using the MSE metric, our model did not achieve the best result for bearing 2-5; the discrepancy with respect to the best result (i.e., the NoAttention model) is 0.002. For the RMSE, MAE, and MAPE metrics, our model did not achieve the best results for bearing 2-7; the discrepancies with respect to the best results (i.e., the NoAttention model) are 0.148, 0.124, and 177.228, respectively.
Furthermore, we also conducted the ablation experiment on the XJTU-SY dataset, and the results are shown in Table 6. Using the MAE and MAPE metrics, our model did not achieve the best results for bearings 1-5, 2-3, and 2-4, but the discrepancy from the best results was only 1.9%. Using the MSE and RMSE metrics, our model did not achieve the best results for bearings 2-3 and 2-4, but the discrepancy from the best results (i.e., the NoAttention model) was only 3.03%. Except for the aforementioned results, the performance of our model is superior to that of the comparison models.
Table 6. Ablation experiments in the XJTU-SY dataset.
The results of the ablation experiments conducted on both bearing datasets show that our proposed model achieves the best results on the largest number of testing bearings. Overall, the degradation features differ across bearings and operating conditions; Conv1d extracts the local features of the original vibration signals, and the self-attention mechanism focuses on the global features, and integrating the two achieves better results.

Results of Different Models
In this section, we compare the overall prediction accuracy of our proposed model with state-of-the-art methods on the PHM dataset and XJTU-SY dataset. The compared models include CNN, RNN, LSTM, and GRU.
As shown in Table 7, with respect to the PHM dataset, our model achieved the best results in nine out of 11 bearings for the MSE and RMSE metrics, and in nine out of 11 bearings for the MAE and MAPE metrics. Using the MSE and RMSE metrics, our model did not achieve the best results for bearings 2-5 and 2-7. Using the MAE and MAPE metrics, our model did not achieve the best values for bearings 2-7. In order to illustrate the performance of our model, we plot the prediction results of all the comparison models on the PHM dataset. Without loss of generality, we visualize the prediction results for bearings 1-4, 1-5, and 1-6 in order to compare them with the previous prediction results, as shown in Figure 6.
From Figure 6a,d,g,j,m, it can be seen that CNN has the worst fitting results on the testing data. The difference between the CNN predicted values and the original signal is very obvious, because the single CNN models are unsuitable for processing the time-series data. The prediction results of RNN and GRU are slightly superior to those of CNN, but there is still a visible difference from the original vibration signal. Furthermore, the RNN performance is significantly inferior to LSTM on long-term series.
The prediction results of SACGNet and LSTM are similar on bearing 1-4, and the fitting results of SACGNet are more favorable when combined with the four evaluation metrics in Table 7. From Figure 6b,c,e,f,h,i,k,l,n,o, it can be seen that on bearing 1-5 and 1-6, SACGNet has superior prediction results, which is consistent with the comparison results of the four evaluation metrics in Table 7. Upon observation of the original vibration signal of bearing 2-7, the early degradation states of bearing 2-7 show very sharp fluctuations, and the amplitude difference between the early degradation and the complete degradation stage is very small. The vibration fluctuation of the early degradation is even more severe than that of the complete degradation stage. That may be the reason why our model did not achieve optimal results on bearing 2-7.
As shown in Table 8, with respect to the XJTU-SY dataset, our model achieved the best results on most of the testing bearings, except for the MSE of bearings 2-3 and 3-5, the RMSE of bearings 2-3, 2-5, and 3-5, the MAE of bearings 1-5, 2-3, 2-5, and 3-5, and the MAPE of bearings 1-5, 2-3, and 2-5. These results may arise because the experimental environment of bearings 2-3, 2-5, and 3-5 is slightly different from the conditions of the bearings we chose as the training set, and because these bearings produce fluctuations in the degradation stage that are equal to or even higher than those at final complete damage. Specifically, the CNN model achieved the best results on bearing 2-3 for all four metrics, which may be because the degradation process of bearing 2-3 is filled with many small-scale local fluctuations, allowing the CNN to fit this process more directly. Bearings 2-5 and 3-5 do not show many sharp fluctuations and contain a small amount of data compared to the other bearings, so RNN and LSTM are able to parse the long-term sequential relationships in these two datasets.
In summary, from the comparison experiments on the two datasets, in most of the cases, our model achieves the best prediction accuracy compared to the other existing models, which demonstrates that our proposed model is applicable for RUL prediction.

Conclusions
In this paper, we explored a health indicator-based remaining useful life prediction method. First, we combined principal component analysis (PCA) with the Euclidean distance metric to construct the health indicator, which tackles the dependency on RUL labels. Then, we designed a self-attention augmented convolution GRU network (SACGNet) to predict the RUL, which utilizes the global features as well as important local features to improve the prognostic accuracy. To verify the effectiveness of the model, we conducted extensive experiments on two bearing datasets, and the results demonstrate that SACGNet is superior to the existing models under several evaluation criteria. Meanwhile, the model has excellent generalization performance across multiple bearings.
Author Contributions: J.X. contributed to the conception of the study; S.D. performed the experiment and wrote the manuscript; W.C. contributed significantly to analysis; D.W. performed the data analyses; Y.F. helped perform the analysis with constructive discussions.