Remaining Useful Life Prognostics of Bearings Based on a Novel Spatial Graph-Temporal Convolution Network

As key equipment in modern industry, it is important to diagnose and predict the health status of bearings. Data-driven methods for remaining useful life (RUL) prognostics have achieved excellent performance in recent years compared to traditional methods based on physical models. In this paper, we propose a novel data-driven method for predicting the remaining useful life of bearings based on a deep graph convolutional neural network with spatiotemporal domain convolution. This network uses the average sliding root mean square (ASRMS) as the health factor to identify the healthy and degraded states, and then uses correlation coefficient analysis on the hybrid features of the degraded data to construct a spatial graph according to the strength of the correlation between the obtained features. In the time domain, we introduce historical data as the input to the temporal convolution. After the data are processed by the spatial map and the temporal dimension, we perform the prediction of the remaining useful life. The experimental results show the accuracy of the method.


Introduction
Prognostic and health management (PHM) is a critical technology to ensure the safety and reliability of equipment, which has achieved fruitful theoretical results in the past decades and has been widely applied [1][2][3][4][5]. RUL prediction has a guiding value for maintenance decisions and equipment ordering, and has long been the basis and core of PHM technology. Rolling bearings are key equipment of modern industry, the prediction and diagnosis of their condition has vital significance. This kind of equipment under the combined effect of internal and external factors, performance, and health status will inevitably show a trend of decline. When the degradation reaches a certain level, they are unable to perform its normal tasks and functions [6][7][8][9]. A prediction method is needed to determine the health status of rolling bearings during operation, which can be used to indicate impending failures and provide more time for maintenance of the equipment [10][11][12].
In recent years, RUL prediction methods have been characterized by diversity, hybrid, and complexity. Zhang [13] et al. classified the research methods for RUL prediction of rotating machinery into three main categories: failure mechanism-based methods [14], data-driven methods [15], and a combination of both [16]. The failure mechanism-based approach can predict the remaining life of the equipment more accurately, but it requires a lot of physical knowledge, and this approach is difficult to be applied in practice. On the other hand, the complexity of vibration signals makes it too costly to perform accurate failure mechanism modeling. In contrast, data-driven approaches can infer correlations and cause-effect relationships hidden in the data, as well as learning potential trends from the data. Consequently, data-driven modeling-based approaches have seen a large increase in their number of applications as a result of their accurate predictions. Methods that causes relatively large errors. In view of this, this paper proposes a new spatiotemporal composition method for bearing data, which introduces graph convolution with temporal convolution [40,41] for the prediction of RUL of bearings and proposes a method to confirm the first degradation point. The proposed model uses the authoritative bearing degradation dataset for performance testing.
In Section 2, this paper describes the preliminaries, including the confirmation of the first degradation point, graph convolution, temporal convolution, and the graph structure of bearings. The proposed method is demonstrated in Section 3, and the experimental validation is presented in Section 4. The whole paper is summarized in Section 5.

Average Sliding Root Mean Square Value
The international standard ISO 20816-1-2016 [42] provides clear criteria for the degradation of vibrating equipment, and the degradation process of bearings is divided into four stages: stage A is the stage of healthy operation of newly deployed equipment; stage B is the stage in which vibrating equipment can operate for a long time; stage C is usually considered to be the stage in which equipment cannot operate for a long time; Stage D is considered to be the degradation stage, where the equipment is considered to be operating in a very dangerous condition. In view of this, root mean square (RMS) can be a good indicator of the degradation of the vibration signal [43] According to ISO 20816-2016 standard, for a small motor drive bearing, its operating state switches from state A to state B when the RMS value is larger than 0.71. Therefore, we choose when the RMS value is larger than 0.71 as the beginning of bearing degradation. The RMS curves of the bearings are shown in Figure 1.

Spatial Graph-Temporal Convolution
The graph convolutional network (GCN) demonstrates good performance in solving data with graph structure. GCN is generally divided into spatial domain convolution and  We have selected four typical bearing degradation processes, and from the figure, we can see that not all bearings have a monotonic and relatively smooth degradation state. Some bearings may have RMS values exceeding 0.71 at the beginning of operation due to other reasons such as noise, and individual outlier points [44] may be perturbed in the middle of operation and greater than 0.71. It is obviously inappropriate to take such RMS values as the first degradation point (FDP), which will have a huge impact on the prediction of RUL. To address the effect of individual outliers on the first degradation point, we propose the average sliding root mean square (ASRMS) approach to determine the FDP where x i denotes the original vibration value and N represents the number of sample points.
The RMS values after the smoothing operation are shown in Figure 2. By observation, we find that the curve of ASRMS is smoother than that of RMS, and it can well avoid the interference of outlier points to the judgment of the FDP. For example, the RMS curves of bearings 2-4 and 3-1 have individual sampling points greater than 0.71 in the early and middle periods, and it is not correct to judge such outliers as the FDP, but after using the smoothing operation, the effect of outliers is eliminated. In addition, the ASRMS curve has good monotonicity, which can describe the degradation process of the bearing well.
In summary, the ASRMS method proposed in this paper can well identify the health and degradation states of bearings.

Temporal Convolution
The bearing signal collected by the sensor is a time sequential signal, we not only need to extract the features of the signal in the spatial dimension, but also in the temporal dimension. In view of this, we introduce temporal convolution to make the data correlated

Spatial Graph Convolution
The graph convolutional network (GCN) demonstrates good performance in solving data with graph structure. GCN is generally divided into spatial domain convolution and spectral domain convolution. The spatial domain convolution method generally operates directly on the central node and the adjacent nodes for feature extraction by certain rules [45]. Unlike spatial domain convolution, spectral convolution uses the eigenvalues and eigenvectors of the Laplacian matrix of the graph, and this method performs the convolution operation in the frequency domain using the Fourier transform of the graph. Based on the way the bearing data are constructed, we choose spatial convolution as our way of extracting features. The spatial convolution operation is shown in Equation (2) where F i represents the ith output of the feature mapping; K indicates the size of the convolution kernel; k i imply the ith convolution kernel; θ k i is the weight parameter matrix, similar to the original convolution operation, given a weight vector for the input data; L means the input of the previous node or the original data, and A k i is the diagonalized matrix of original adjacency matrix A. The primitive adjacency matrix A is an O × O semi-positive definite matrix, where O are the number of nodes. Each element represents the presence or absence of a node connection, with nodes that do not have a connection having a value of 0 and nodes that have a connection having a value of 1.

Temporal Convolution
The bearing signal collected by the sensor is a time sequential signal, we not only need to extract the features of the signal in the spatial dimension, but also in the temporal dimension. In view of this, we introduce temporal convolution to make the data correlated with each other in time dimension. After the spatial features of the bearings are extracted by the graph convolutional network, we use 2D convolution to extract the current temporal information. Equation (3) describes the temporal convolution formula for layer, l at time t where O (k,l) t represents the output of the lth layer and the kth convolutional block at moment t; a denotes the non-linear activation function; W k indicate the parameters of each convolution kernel f k ; x t serves as the input sequence after the graph convolution, k means the number of convolution kernels; d is the expansion rate of the convolution kernels; and (K − k)d shows the size of the perceptual field.

Spatial Domain Construction of Bearings
The complexity of the signal of vibration entails it difficult to fully characterize the degradation process of the bearing using a single time-domain or frequency-domain feature, which creates a great problem for the prediction of the bearing RUL. On the other hand, the feature extraction of traditional CNNs will destroy the intrinsic topology of complex data, while graph convolution will extract the features of the data also preserving the topology of high-dimensional data. Therefore, in this paper, we propose a novel construction of graphs for vibration signals to extract features to be able to predict RUL well.
In this paper, the original signal is extracted by time and frequency domain features, and max, min, peak to peak value, var, std, mean, rms, skew, kurtosis, mean-abs, 10 features are selected as input nodes, as shown in Figure 3. Figure 3 shows the tendency of different features, in order to analyze the strength of correlation between each feature, that is, to Sensors 2021, 21, 4217 6 of 16 check the connection of each node in the graph, Pearson correlation coefficient analysis is used in this paper, as shown in Equation (4).
where the X, Y indicate extracted feature respectively.
hand, the feature extraction of traditional CNNs will destroy the intrinsic topology of complex data, while graph convolution will extract the features of the data also preserving the topology of high-dimensional data. Therefore, in this paper, we propose a novel construction of graphs for vibration signals to extract features to be able to predict RUL well. In this paper, the original signal is extracted by time and frequency domain features, and max, min, peak to peak value, var, std, mean, rms, skew, kurtosis, mean-abs, 10 features are selected as input nodes, as shown in Figure 3. Figure 3 shows the tendency of different features, in order to analyze the strength of correlation between each feature, that is, to check the connection of each node in the graph, Pearson correlation coefficient analysis is used in this paper, as shown in Equation (4).
where the X, Y indicate extracted feature respectively. We use the heat map to represent the correlation coefficients to obtain an O O correlation matrix, where O means the number of extracted features. The two nodes with correlation coefficient greater than 0.75 are considered to be connected and the corresponding value in the adjacency matrix A is set to 1. The two nodes with coefficients less than 0.71 are unconnected and the corresponding value in is set to 0. Thus, we obtain an O O original adjacency matrix.
The input dimension of the final network is M C T, where M means the number of sample points, C is the number of nodes, and T denotes the time length of each sample point. Figure 4 shows the example of heat map of the correlation analysis, and the correlation matrix between the obtained features becomes the bearing graph structure shown in Figure 5 after the spatial feature mapping. The mean node has less correlation with other nodes, thus there are only nine nodes in Figure 5. The spatial convolution operation of the bearings is shown in Figure 6. We use the heat map to represent the correlation coefficients to obtain an O × O correlation matrix, where O means the number of extracted features. The two nodes with correlation coefficient greater than 0.75 are considered to be connected and the corresponding value in the adjacency matrix A is set to 1. The two nodes with coefficients less than 0.71 are unconnected and the corresponding value in A is set to 0. Thus, we obtain an O × O original adjacency matrix.
The input dimension of the final network is M × C × T, where M means the number of sample points, C is the number of nodes, and T denotes the time length of each sample point. Figure 4 shows the example of heat map of the correlation analysis, and the correlation matrix between the obtained features becomes the bearing graph structure shown in Figure 5 after the spatial feature mapping. The mean node has less correlation with other nodes, thus there are only nine nodes in Figure 5. The spatial convolution operation of the bearings is shown in Figure 6.     Figure 7 shows the structure of spatial graph-temporal convolution network (SG-TCN). First, the training data and test data are processed by spatial composition and used as the input of the deep neural network. A spatial graph convolution layer and a temporal convolution layer are included in each layer, and the original data are convolved by the graph to effectively extract information related to other nodes at each node, which avoids    Figure 7 shows the structure of spatial graph-temporal convolution network (SG-TCN). First, the training data and test data are processed by spatial composition and used as the input of the deep neural network. A spatial graph convolution layer and a temporal convolution layer are included in each layer, and the original data are convolved by the graph to effectively extract information related to other nodes at each node, which avoids  Figure 7 shows the structure of spatial graph-temporal convolution network (SG-TCN). First, the training data and test data are processed by spatial composition and used as the input of the deep neural network. A spatial graph convolution layer and a temporal convolution layer are included in each layer, and the original data are convolved by the graph to effectively extract information related to other nodes at each node, which avoids the defects of using a single feature and failing to characterize the whole degradation process. the defects of using a single feature and failing to characterize the whole degradation process. Secondly, the data after graph convolution is subjected to the operation of temporal convolution. The information contained in the time sequences can have a considerable impact on the prediction of RUL, so the feature extraction on the temporal dimension of the time sequences signals is performed using temporal convolution network (TCN).

Structure of the Proposed Spatial Graph-Temporal Convolution Network
After that, the data that have passed through SG-TCN use the Dropout technique [46], in order to keep the output from overfitting. Generally, the Dropout is set to 0.5. There are five SG-TCN convolution layers, and each layer of graph convolution uses a convolution kernel of convolution size 3 × 1, and the temporal convolution uses a convolutional kernel of size 5 × 1 with a convolution step of 1. After each SG-TCN operation, using the residual module to connect the input to the output to ensure that the necessary features are not lost.
The features extracted after the multilayer network then perform a maximum pooling operation [47], which is used to retain the most salient features. The maximally pooled data is passed through a flatten layer and a fully connected layer containing 256 neurons. The fully connected layer is connected to an output neuron, which executes RUL prediction.
The activation function used for each convolutional layer is rectified linear units (ReLU) [48], which is well suited to avoid the gradient disappearance problem. This study also uses the all-0 padding technique [49] to ensure that the data dimension does not change during the convolution operation.  Figure 8 shows the flow chart of the proposed method. Recent studies have shown that bearings do not degrade from the beginning, so prediction of the entire life cycle is not necessary. Therefore, the ASRMS method is used to identify the degradation characteristics and health characteristics of the bearing's life cycle. The prediction starts when Secondly, the data after graph convolution is subjected to the operation of temporal convolution. The information contained in the time sequences can have a considerable impact on the prediction of RUL, so the feature extraction on the temporal dimension of the time sequences signals is performed using temporal convolution network (TCN).

Flow Chart of the Rul Prognostics Method
After that, the data that have passed through SG-TCN use the Dropout technique [46], in order to keep the output from overfitting. Generally, the Dropout is set to 0.5. There are five SG-TCN convolution layers, and each layer of graph convolution uses a convolution kernel of convolution size 3 × 1, and the temporal convolution uses a convolutional kernel of size 5 × 1 with a convolution step of 1. After each SG-TCN operation, using the residual module to connect the input to the output to ensure that the necessary features are not lost.
The features extracted after the multilayer network then perform a maximum pooling operation [47], which is used to retain the most salient features. The maximally pooled data is passed through a flatten layer and a fully connected layer containing 256 neurons. The fully connected layer is connected to an output neuron, which executes RUL prediction.
The activation function used for each convolutional layer is rectified linear units (ReLU) [48], which is well suited to avoid the gradient disappearance problem. This study also uses the all-0 padding technique [49] to ensure that the data dimension does not change during the convolution operation. Figure 8 shows the flow chart of the proposed method. Recent studies have shown that bearings do not degrade from the beginning, so prediction of the entire life cycle is not necessary. Therefore, the ASRMS method is used to identify the degradation characteristics and health characteristics of the bearing's life cycle. The prediction starts when the bearing operates to the first degradation point (FDP).

Flow Chart of the Rul Prognostics Method
The training sets and test sets are prepared next. This study focuses on the degree of degradation of the machinery and the remaining useful life corresponding to different degrees of degradation. The training labels were used to train the training data sets in the form of RUL percentages. The RUL was defined as the value of the remaining life left from the FDP point and the RUL percentage was defined as the percentage of the RUL value to the full RUL value at the current moment.
After that, the training data are trained with the established spatiotemporal graph convolutional depth network. The bearing time sequences are extracted using temporal convolution, and the output values are matched with RUL labels. When the training sets are completed, the test data can be put into the network for testing.

Training Details and Evaluation Indexes
The PyTorch deep learning framework is used in this experiment. The backpropagation algorithm [50] was used for training with stochastic gradient descent optimizer (SGD) and Nesterov momentum was set to 0.1. Batch size was set to 32, weight decay was set to 0.0001, and learning rate was set to 0.0001. Mean squared error (MSE) is used as the loss function for back propagation, as shown in Equation (5). A total of 100 epochs are trained in this experiment.
where N is the total number of samples, x i represents the actual RUL value, andx i means the RUL value predicted by the deep learning network. In this paper, the mean absolute error (MAE), root mean squared error (RMSE), and the remaining lifetime percentage error are used to evaluate the network performance as shown in Equations (6)- (8). The meaning of the parameters in Equations (6)-(8) are as same as in Equation (5)

Data Sets Description
The experimental data provided by the PRONOSTIA [51] experimental platform was used to describe the degradation of the bearing throughout its operational life, as shown in Figure 9. The measurement part of the platform contains two types of sensors, vibration, and temperature. The vibration sensor consists of two accelerometers, which are placed on the horizontal and vertical axes, respectively. In this paper, the data generated by the horizontal axis accelerometer is used for the experiments. The sampling frequency of the accelerometers is 25.6 KHz, and the sampling is executed every 0.1 s. Each samplei.e., each acc file-contains 2560 sampling points, and the sampling is conducted every 10 s.

Data Sets Description
The experimental data provided by the PRONOSTIA [51] experimental platform was used to describe the degradation of the bearing throughout its operational life, as shown in Figure 9. The measurement part of the platform contains two types of sensors, vibration, and temperature. The vibration sensor consists of two accelerometers, which are placed on the horizontal and vertical axes, respectively. In this paper, the data generated by the horizontal axis accelerometer is used for the experiments. The sampling frequency of the accelerometers is 25.6 KHz, and the sampling is executed every 0.1 s. Each sample-i.e., each acc file-contains 2560 sampling points, and the sampling is conducted every 10 s. During the operation of the bearings, a pressure of 4KN-5KN was applied to the loading part of the RONOSTIA experimental platform in order to be able to accelerate the degradation process of the bearings. After degradation acceleration, the life cycle of the bearing varies from 1-7 h, so it is very competitive to predict its RUL. According to the pressure and speed of the load, the operating conditions of the bearings are divided into three types, as shown in Table 1. In order to avoid causing damage to the experimental platform, the experiment was stopped when the amplitude of the vibration signal exceeded 20 g. In this study, when one bearing is used as the test set, the rest of the bearings are used as the training set.

Experimental Results
This section focuses on the prediction of the RUL values of the tested bearings using the proposed method. The superiority of the proposed method is shown by comparing it with other methods. The RUL prediction curves for the test bearing are shown in Figure  10. Since in the prediction curve usually contains local fluctuations, which have a relatively large effect on the degradation estimation of RUL. Accordingly, we take a smoothing operation for each RUL curve, as shown by the bolded green line in Figure 10. In addition, the smoothed curves for bearings 1-2 are not bolded because the predicted fluctuations for bearings 1-2 are not as large as those for the other bearings, so there is no particular need to highlight the degradation trend. During the operation of the bearings, a pressure of 4KN-5KN was applied to the loading part of the RONOSTIA experimental platform in order to be able to accelerate the degradation process of the bearings. After degradation acceleration, the life cycle of the bearing varies from 1-7 h, so it is very competitive to predict its RUL. According to the pressure and speed of the load, the operating conditions of the bearings are divided into three types, as shown in Table 1. In order to avoid causing damage to the experimental platform, the experiment was stopped when the amplitude of the vibration signal exceeded 20 g. In this study, when one bearing is used as the test set, the rest of the bearings are used as the training set.

Experimental Results
This section focuses on the prediction of the RUL values of the tested bearings using the proposed method. The superiority of the proposed method is shown by comparing it with other methods. The RUL prediction curves for the test bearing are shown in Figure 10. Since in the prediction curve usually contains local fluctuations, which have a relatively large effect on the degradation estimation of RUL. Accordingly, we take a smoothing operation for each RUL curve, as shown by the bolded green line in Figure 10. In addition, the smoothed curves for bearings 1-2 are not bolded because the predicted fluctuations for bearings 1-2 are not as large as those for the other bearings, so there is no particular need to highlight the degradation trend. Sensors 2021, 21,4217 12 of 17 Figure 10. Test bearing RUL prediction results. Where (a-h) represents RUL prediction curves for each of the eight bearings. Figure 10. Test bearing RUL prediction results. Where (a-h) represents RUL prediction curves for each of the eight bearings.
By observing the RUL prediction curve, we can find that the prediction curve can fit the RUL label well with obvious monotonic decline despite the slight local fluctuations. In addition, the prediction of the later stage of bearing degradation state is more important than the prediction of the earlier stage. The RUL prediction curves in Figure 10 converge sharply when the degradation is close to failure to avoid delayed prediction. Accurate prediction at the end of degradation can ensure the reliability and safety of equipment operation.
We can observe from the figure that the RUL curves do not exactly fit the training labels, which is due to the fact that most of the bearings run non-linearly in the degradation phase. On the other hand, the training with linearly degraded labels does not match well with the input data, therefore, the curves do not always lean into the degraded labels. Table 2 shows the prediction percentage errors of the proposed method. The number in each row in Table 2 implies the value of actual RUL and predicted RUL for each moment. In Table 2, it is observed that each bearing has a relatively small prediction percentage error at the beginning of the prediction and a relatively large error at the end of the prediction. This is because the RUL values in the late prediction period are smaller, thus causing a larger percentage of error. of the network, and finally a fully connected layer for RUL prediction. The results are compared with those of the proposed method. All comparison methods are trained using the same training method, where the default parameters of the comparison methods (1), (2) are the same as the proposed method. Figure 11 shows the degradation curves of the different methods on the same bearing. It is found that the proposed method is able to fit the degradation trend of the bearing much better than the other methods. However, the method without temporal convolution could not converge quickly at the end of the operation, and the error between the predicted values and label was large. Although the multi-CNN method can fit well at the end segment, the error is large at the beginning segment. The method without judging the FPD points has the disadvantages of both methods, which can neither fit the RUL labels in the beginning segment nor converge quickly at the end of the operation. ASRMS to determine FDP points, (2) a method that does not involve temporal convolution, and (3) a multi-scale CNN method [31]. The method uses a short-time Fourier transform to pre-process the data, followed by a multi-scale feature extraction method in the middle of the network, and finally a fully connected layer for RUL prediction. The results are compared with those of the proposed method. All comparison methods are trained using the same training method, where the default parameters of the comparison methods (1), (2) are the same as the proposed method. Figure 11 shows the degradation curves of the different methods on the same bearing. It is found that the proposed method is able to fit the degradation trend of the bearing much better than the other methods. However, the method without temporal convolution could not converge quickly at the end of the operation, and the error between the predicted values and label was large. Although the multi-CNN method can fit well at the end segment, the error is large at the beginning segment. The method without judging the FPD points has the disadvantages of both methods, which can neither fit the RUL labels in the beginning segment nor converge quickly at the end of the operation. The results of the comparison with other methods are shown in Table 3. It can be observed that based on different evaluation metrics such as MAE, RMSE, the values of the proposed methods are lower than the methods used for comparison. This result shows that the proposed deep SG-TCN approach is able to extract the deep-level features of the high-dimensional data based on the correlation between multiple features, and thus can predict the degradation state of the bearings successfully. The results of the comparison with other methods are shown in Table 3. It can be observed that based on different evaluation metrics such as MAE, RMSE, the values of the proposed methods are lower than the methods used for comparison. This result shows that the proposed deep SG-TCN approach is able to extract the deep-level features of the high-dimensional data based on the correlation between multiple features, and thus can predict the degradation state of the bearings successfully.

Conclusions
In this paper, a novel spatial graph convolution-based model is proposed to predict the degradation process of bearings. Firstly, the first degradation point of the operating state is identified. Secondly, hybrid features are extracted from the original vibration signal, and then the extracted hybrid features are mapped to the spatial structure of the graph to compose the spatial graph structure of the bearing. Thirdly, the TCN is introduced to extract features for the temporal dimension of the spatial nodes of the bearing. Finally, the prediction of RUL is performed. The data of the PRONOSTIA platform is used to validate the proposed method, and the results show that the method has good performance in prediction.
On top of that, we should also note that the network structure of deep SG-TCN requires more time for training. Therefore, better hardware conditions are needed. Some degeneration states are abruptly changed bearings, which tend to fail at the end. Consequently, the identification of FDPs using ASRMS often can only get a relatively small number of training samples, which is unfavorable for deep learning. Hence, the method of depth-based learning needs to be explored further in the future.