Multi-Sensor Vibration Signal Based Three-Stage Fault Prediction for Rotating Mechanical Equipment

In order to reduce maintenance costs and avoid safety accidents, it is of great significance to carry out fault prediction to reasonably arrange maintenance plans for rotating mechanical equipment. At present, the relevant research mainly focuses on fault diagnosis and remaining useful life (RUL) predictions, which cannot provide information on the specific health condition and fault types of rotating mechanical equipment in advance. In this paper, a novel three-stage fault prediction method is presented to realize the identification of the degradation period and the type of failure simultaneously. Firstly, based on the vibration signals from multiple sensors, a convolutional neural network (CNN) and long short-term memory (LSTM) network are combined to extract the spatiotemporal features of the degradation period and fault type by means of the cross-entropy loss function. Then, to predict the degradation trend and the type of failure, the attention-bidirectional (Bi)-LSTM network is used as the regression model to predict the future trend of features. Furthermore, the predicted features are given to the support vector classification (SVC) model to identify the specific degradation period and fault type, which can eventually realize a comprehensive fault prediction. Finally, the NSF I/UCR Center for Intelligent Maintenance Systems (IMS) dataset is used to verify the feasibility and efficiency of the proposed fault prediction method.


Introduction
In the production processes of modern industries, the performance of rotating mechanical equipment may degrade over time, even resulting in failure due to long-term operation under severe conditions such as high speed, high temperature, high pressure, and heavy loads. To ensure the safety and efficiency of the operation, health monitoring and the establishment of a maintenance strategy have become active research focus in both industry and academia [1][2][3][4]. Initially, the maintenance strategy was implemented after fault diagnosis or preventive maintenance. We know that different types of equipment faults have specific vibration frequency characteristics. Therefore, traditional signal processing methods, such as Fourier transform (FT) [5], short-time Fourier transform (STFT) [6], and wavelet transform (WT) [7], have been proposed to obtain useful features from the vibration signal to reflect the operating status of the rotating mechanical equipment. However, the above fault diagnosis methods rely heavily on expert experience. In order to solve this problem, deep learning methods, such as CNN [8], LSTM [9], and the combined CNN and LSTM [10], have been used in fault diagnosis more recently, displaying the ability of deep feature self-learning without relying on manual intervention and prior knowledge. Although these methods can achieve excellent results in fault diagnosis, they cannot provide early warnings and take recovery measures before the fault occurs. Therefore, fault prediction is gradually emerging as a preventive maintenance method.
In [11], Peng Y et al. pointed out that fault prediction involves determining the RUL or working time of the diagnostic component based on the historical condition of the component. The research on RUL prediction can be divided into two categories: modelbased methods and data-driven methods [12]. Lei Y et al. used maximum likelihood estimation and a particle filter algorithm to predict the RUL of bearings [13]. The proposed method is not well applicable to abrupt degeneration trends. Model-based methods rely on prior knowledge and specific conditions, whereas the data-driven approach attempts to use deep learning to deduce the degradation of equipment based on a large amount of historical data. In [14], a spectrum-principal-energy-vector method was used to extract features first, and then a deep CNN was formulated to obtain the RUL of bearings according to the features. For the first time, Xia Mei et al. divided the monitoring data into different health stages [15]. Based on this approach, the RUL of equipment was predicted using de-noising auto-encoder-based deep neural networks (DNNs). Although the above RUL methods can estimate how long it is until a fault will occur based on historical information, they are unable to provide the exact degradation period and fault type. To solve this problem, based on gray relational analysis, Wei X U et al. used a neural network model to predict the future state of a rolling bearing [16]. In [17], Xu H et al. used two models: a regression model and a classification model, which could not only predict the stage of degradation, but also classify the type of fault that would occur. However, the traditional wavelet packet transform (WPT) method was used to extract the time-frequency domain features of the original vibration signal, which requires expertise to select the appropriate basis function. Moreover, deep learning methods rely heavily on data information. Recent studies have shown that using multi-sensor data with sensor fusion technology can improve the accuracy and robustness of fault diagnosis models [18,19]. Therefore, in this paper, CNN, LSTM, and support vector classification (SVC) are combined to establish a novel three-stage fault prediction model for rotating mechanical equipment based on vibration signals from multiple sensors. Compared with the existing fault prediction results, the main contributions of the paper are as follows: 1.
More formative vibration signals, used for training the fault prediction model, are collected from multiple sensors to improve the accuracy of the prediction method; 2.
Deep features of varies degradation periods and fault types can be extracted by CNN and LSTM automatically without relying on manual intervention and professional knowledge; 3.
The degradation period and fault type can be predicted simultaneously in advance with high accuracy.
The rest of this paper is structured as follows. In Section 2, the proposed threestage fault prediction framework is generally introduced. Section 3 presents the details of the combined CNN and LSTM feature extraction, the attention-bidirectional (Bi)-LSTM regression model and the SVC classification mode. In Section 4, the superiority of the proposed fault prediction method is verified by applying it to the Intelligent Maintenance Systems (IMS) dataset. Section 5 concludes this paper.

Problem Formulation and Main Fault Prediction Framework
To accomplish the tasks of predicting the degradation period and the type of failure, a novel three-stage fault prediction framework is presented based on the multi-sensor vibration signal, using deep learning and machine learning. The proposed architecture is illustrated in Figure 1, and can be divided into three parts: CNN-LSTM-based feature extraction, attention-Bi-LSTM-based prediction, and SVC-based classification.
As shown in Figure 1, the design of three-stage architecture is related to three objectives: 1. In the feature extraction stage, the original vibration signals collected by multiple sensors are sent to the CNN-LSTM network for the extraction of spatiotemporal features, which contain operating status information; 2. In the prediction stage, the attention-Bi-LSTM is trained to predict the trend of the features; 3. In the classification stage, based on the spatiotemporal features and their trends, the SVC model is formulated to identify the degradation period and the future fault type.  The original vibration signal from multiple sensors is collected first. In order to extract the spatiotemporal information of the obtained vibration signal, a CNN with a convolutionpooling-convolution structure and an LSTM network are combined to formulate the feature extraction model. Then, an attention-Bi-LSTM network is used to predict the feature trends, which can reflect the future health state of the rotating mechanical equipment. Finally, the predicted features are sent into the SVC for classification, which can achieve the purpose of predicting the degradation period and fault type simultaneously. The emtire process of the proposed three-stage fault prediction method is detailed in the following section.

Feature Extraction Stage
Vibration signals are collected by multiple sensors is in the form of time series with noise. This is susceptible to the amplitude. Therefore, it is necessary to perform feature extraction on the time-domain vibration signals. The CNN can extract the features of the original vibration signal, reducing the amount of data and diminishing the noise. However, it only can capture the spatial features, and the temporal features are ignored. Therefore, the CNN-LSTM is used in this paper to extract the spatiotemporal features of the original vibration signal. The framework of the CNN-LSTM model is illustrated in Figure 2. First, a CNN with a convolution-pooling-convolution structure is used to extract the spatial features of the original vibration signal. Then, the deep abstract of temporal features can be obtained by adding the LSTM network after the CNN.  We can denote the originally collected vibration signal in the form of a time series from the mth sensor as V m = {V m t } m=1,2,··· ,n;t=1,2,··· . In order to deal with the problem of sample imbalance and obtain more detailed characteristics of the vibration signal, the data are divided into windows . At the same time, it is necessary to ensure that the window size is properly chosen. For the data in each window, 1D convolution is first used to extract the shallow features. The convolution operation formula of the τth window is as follows: where C 1 i (τ) is named the feature map, and it denotes the ith channel output of the convolution operation; L is the total number of channels; V m (τ) denotes the vibration signals in the τth window; * denotes the dot product; f (·) is the activation function; N represents the number of convolution kernels; ω 1 i is the ith weight parameters of the 1st layer; b 1 i represents the ith bias of the 1st layer; and ω and b are undetermined parameters to be trained.
After the 1D convolution operations, 1D maxpooling operations for the processing results C 1 i i=1,2,...,L are performed, which can reduce the dimension of the feature maps and further extract key features of the vibration signal: where C 1 i,k represents the kth value of the ith channel of the 1st layer, d is the number of samples participating in the maxpooling operation, and C 2 i denotes the output of the pooling layer.
Another convolution operation is performed on the processing result of the pooling operation in order to extract the deeper features of the vibration signal: where the symbols in (3) are the same as those in (1).
Considering the time characteristics of the vibration signal, the LSTM neural network is used to extract the temporal features, processing the series data with the forgetting memory scheme while avoiding the gradient disappearance and gradient explosion problems. LSTM realizes the function of long-term and short-term state transfer through the structure of an input gate, forget gate, and output gate. After obtaining the spatial features from the CNN C 3 l l=1,2,... , two LSTM networks are employed to extract the temporal features.The operation principle is as follows: the update of the hidden state h l at time l is based on the joint action of the input gate i l , forgetting gate f l , output gate o l , cell state c l , and hidden state h l−1 at time l − 1. The specific formula is as follows: where W denotes weight and b is bias. σ denotes the sigmoid activation function, and is Hadamard product.
We can input the LSTM-processed features h l l=1,2,... into the two fully connected layers to obtain the eventual features of the original vibration signal, which can be denoted as FT l l=1,2,... .
where ω and ω denote weights, b and b are biases, and f (·) is the activation function. We can iput the obtained feature FT l l=1,2,...
into the Softmax layer to obtain the probability that the current feature belongs to various categoriesp. Then, one can input the probabilityp into the cross-entropy loss function to complete the entire back-propagation process. The cross-entropy loss function is defined as follows: where p (i) denotes the true probability that V m = {V m t } m=1,2,··· ,n;t=1,2,··· belongs to the ith category; andp (i) denotes the corresponding probability obtained from the deep networks' classification result. The Softmax layer and the cross-entropy loss function are used here to provide a target for the back-propagation process of the deep network. It can endow classification attributes to the extracted features.

Prediction Stage
In order to achieve the purpose of failure prediction, after obtaining the operating state characteristics of the rotating machinery, the feature trend should be predicted. Since these features have a time series correlation, attention Bi-LSTM is used as the prediction model here, as shown in Figure 3. Bi-LSTM consists of a forward LSTM and a backward LSTM, which are able to capture features of both past and upcoming time series. In addition, in order to enhance the correlation between the output results of Bi-LSTM, an attention mechanism is added, which can achieve the purpose of redistributing the weights between the output results.
where the subscripts F and B denote forward and backward, respectively, and h l represents the concatenation of the forward output h l F and backward output h l B of Bi-LSTM. The basic idea of adding the attention mechanism in Bi-LSTM is to calculate the correlation between the target hidden state h tar and the output hidden states h l of Bi-LSTM, and then to output the attention vector [20]. The principle formulas are listed as follows: where score(·) is a function, α ts is the attention weight, c p denotes the context vector, a p represents the attention vector, and W and W c denote weights. We input the a p p=1,2,... into the two fully connected layers to obtain the result of the feature prediction stage, which we denote as { f t p } p=1,2,... .
where ω and ω denote weights, b and b are biases, and f (·) is the activation function.

Classification Stage
After obtaining the future spatiotemporal features of the original vibration signal using the attention-Bi-LSTM prediction model, the degradation period and fault type should be identified by formulating a classification model. In fact, the feature types are divided into several categories by training a Softmax classifier in the feature extraction stage. However, since the prediction step is added after the feature extraction step, the classification model and the prediction model cannot perform the same backpropagation. Therefore, considering the errors of the prediction results, the more robust SVC is used as the classifier instead of the Softmax with rigorous function mapping [21]. The basic idea of the SVC model is to find a classifier that maximizes the classification interval between the hyperplane and the support vector. The principle of the SVC can be addressed as follows. For the predicted features { f t p } p=1,2,... , we denote their failure modes as {y p } p=1,2,... , y ∈ {−1, 1}. The SVC problem can be converted into the following quadratic optimization problem: is the kernel function, which can map the linearly inseparable samples in the initial space to the linearly separable samples in the high-dimensional space. In this paper, we use the radial basis function (RBF) as the kernel function.
where γ is the width of the RBF. The output function of the category can be obtained using the following formula: where b * is the classification threshold, which is obtained by substituting support vectors.
The SVC fault diagnosis model in this paper adopts the One vs. Rest (OvR) approach to realize the multi-classification of faults. The main ideas of OvR are as follows: if N categories need to be classified, N binary classifiers should be constructed. In the training process, select one category as positive and the others as negative, and then classify them in turn. In the test process, the test samples are sequentially brought into the trained N classifiers for calculation, and then the final classification result can be calculated.

Implementing the Proposed Fault Prediction Strategy
At last, it is worth summarizing the pseudo-code of the entire fault prediction algorithm, which is shown in Algorithm 1: Calculate 1 st 1 one-dimensional convolution layer for sensor S 1 as C 1 k (S 1 ) based on ω 1 1 and b 1 1 ;. . . ;Calculate 1 st n one-dimensional convolution layer for sensor S n as C 1 k (S n ) based on ω 1 n and b 1 n 9: Calculate 2 nd 1 one-dimensional maxpooling layer for C 1 k (S 1 ) as C 2 k (S 1 );. . . ; Calculate 2 nd n one-dimensional maxpooling layer for C 1 k (S n ) as C 2 k (S n ) 10: Calculate 3 rd 1 one-dimensional convolution layer for C 2 k (S 1 ) as C 3 k (S 1 ) based on ω 3 1 and b 3 1 ;. . . ;Calculate 3 rd n one-dimensional convolution layer for C 2 k (S n ) as C 3 k (S n ) based on ω 3 n and b 3 n 11: Calculate 4 th 1 and 5 th 1 LSTM layers for C 3 k (S 1 ) as C 5 k (S 1 ) based on ω 4,5 1 and b 4,5 1 ;. . .;Calculate 4 th n and 5 th n LSTM layers for C 3 k (S n ) as C 5 k (S n ) based on ω 4,5 n and b 4,5 n Algorithm 1 Cont.

Validating the Proposed Method
In order to verify the fault prediction method proposed in this paper, the IMS dataset was used [22]. The IMS dataset contains three data sets: dataset 1, with two acceleration sensors on each bearing, and dataset 2 and dataset 3, with one accelerometer on each bearing, respectively. In view of the fact that this experiment needs to predict the failure modes through multiple sensors, dataset 1 is used to verify the proposed method. The sampling frequency of dataset 1 is 20 kHz, and the sampling duration is 1 s. The sampling interval of the first 43 rounds of each acceleration sensor was used to collect data every 5 min, then to collect data every 10 min and generate a data file containing 20,480 sampling points. The Python programming environment was used, based on the Keras framework of version 2.3.1. All experiments were performed on Intel Xeon ES-2620 CPU.

The Description of the Dataset
At the end of the test-to-failure experiment, in dataset 1, bearing 3 displayed an inner race defect and bearing 4 displayed a roller element defect [23]. The purpose of this experiment was to classify the bearing degradation period and identify the early fault type of the bearing based on the vibration signals from multiple sensors. Therefore, bearings 1-3 and 1-4 were selected for research, as shown in Table 1. The entire life of the bearing was divided into four stages: the norm period, slight period, severe period, and failure period. In addition, there were two types of faults: the inner race defect and the roller element defect. Therefore, there were seven fault modes in the experiment, which were the norm period, the slight inner race defect, the severe inner race defect, the inner race failure, the slight roller element defect, the severe roller element defect, and the roller element failure. The IMS dataset does not have a detailed true label of the degradation period and specific failure of the bearing. Therefore, according to the labeling method in [17,24], the threshold of each stage should be set according to actual needs.In this simulation, the root mean square (RMS) features of the 20,480 vibration signals collected per second from each sensor were first extracted. Then, expertise was involved in labeling the degradation period, which is shown in Figure 4 and Table 2. In our simulation, 20,480 samples of vibration signals were collected every 10 min, and these samples were saved in one file. There were 2156 sampling files in the whole life cycle. For example, the samples from the 2120th file to the 2151th file belonged to the severe period for bearing 1-3 H. The data need to be processed appropriately in order to extract the more detailed characteristics of the vibration signal and train the better performance of fault prediction model. Using the trial and error method, the samples in each file with 20,480 samples were divided into 80 windows.Therefore, there were 256 samples in every window.
In the feature extraction stage, the task of the CNN-LSTM network is to effectively extract the spatiotemporal features of the degradation period and fault type simultaneously. The framework of the CNN-LSTM network for feature extraction in Figure 2 was used here. The network settings are shown in Table 3. The inputs of Sensor H and Sensor V were both N × 256 × 1, with N being the number of windows. The batch size was 512. The number of epochs was 25, and the optimizer used was Adam. The cross-entropy in (7) is employed here as the loss function.   -N  7  Concatenate  ---N  8  Dense1  128  --N  9 Dense2 Cross-entropy ---N

Verifying the Validity of the Feature Extraction Model
In order to verify the effectiveness of the proposed multi-sensor CNN-LSTM feature extraction model, a CNN-LSTM network with a single sensor was used as the comparative model. The evaluation equation was chosen as follows: where TP is the number of true positives and FP is the number of false positives. The comparison results between our model and other models are shown in Table 4: From Table 4, it can be seen that the classification accuracy based on a single-sensor CNN-LSTM feature extraction model was lower than that based on the multi-sensor model. The reason is that the large fluctuation and noise of the vibration signals from a single sensor may mislead the identification of the degradation period, such as the fluctuation of 1-4V shown in Figure 4. However, the vibration signals from multiple sensors can provide more comprehensive information, which could improve the classification accuracy.

Training Process
The input of the prediction model is the features obtained from the previous section. The characteristics of the normal period were not used to predict degradation trend, because the data in that period often show no trend information. In this simulation, the severe and failure periods for the inner race and the roller element were predicted, respectively. In order to illustrate the effectiveness of the proposed Attention Bi-LSTM model, only the prediction results of the failure period for roller defects are presented in the following as a representation of the model's performance.The input of our training process was set at 1994 × window width × 10, and the input of the testing process was 400 × window width × 10. The parameter settings of the prediction model are shown in Table 5. Sliding time window technology was used to segment the dataset. When the window moves backward, a series of sample data covering each other will be formed. In the selection of the sliding window width, we tried three, six, and nine sample points to predict the next sample point, and used the root mean square error (RMSE) as the scoring criterion for the prediction error, which is defined in (21). The results are shown in Table 6. It was finally determined that the width of the input window with the smallest prediction error was six. Therefore, six sample points were used to predict the next sample point.
where y i is the true value andŷ i is the predicted value.

Testing Results
During model training and evaluation stages, the RMSE was used as the loss function of the prediction model, and the prediction results are shown in Figure 5. From Figure 5, it can be seen that the general feature trends can be predicted with certain errors. The reason for this is that the IMS dataset we used was designed for RUL prediction instead of degradation period prediction [25]. The final task of this paper was to classify the predicted features, which allows for an acceptable prediction error.

Comparison with Other Models
In order to verify the prediction effect of the proposed attention-Bi-LSTM model, LSTM, Bi-LSTM, and attention-LSTM were applied to the IMS dataset for comparison. The comparative results are shown in Table 7. Since the attention mechanism can redistribute the proportions between sequences according to the predicted target, the use of a prediction model with an attention mechanism was able to improve the accuracy. Moreover, Bi-LSTM was able to capture features of both past and upcoming time series, and the performance of Bi-LSTM was also better than that of a single LSTM. We can see that the best prediction model was attention-Bi-LSTM (1.818/1.686).

Classification
After obtaining the spatiotemporal features and the degradation trends of the bearing, the SVC was used to identify the degradation period and fault type. In this step, the modes of classification of the severe period and failure period for inner race and roller element faults are presented, which is more significant than normal mode identification. In this simulation, the data were collected every ten minutes, and our task was to realize the identification of the degradation period and fault type for the future 50 min. The classification results are shown in Figure 6. Figure 6. Confusion matrix of failure prediction results("in" and "rl" denote inner and roller element defects, respectively; "sl", "se" and "fa" represent slight, severe, and failure periods).
As seen in the confusion matrix in Figure 6, the classification accuracy of the proposed SVC model combined with the attention-Bi-LSTM prediction model can reach 0.944. Furthermore, the classification accuracy of the failure mode can even reach 0.985. According to the background running data, it can be found that samples with classification errors are distributed at the connection of two adjacent periods, whereas the other samples are almost classified correctly. In summary, we can achieve short-term predictions of failure types and degradation periods through the use of our proposed fault prediction method.

Remark 1.
The time-series signals in each window have one class label. The window size is determined using trial and error with a simulation method. The final accuracy of the prediction results with different window sizes is listed in Table 8. Table 8 shows a comparison of the test accuracy, sampling time, and test time of the proposed fault prediction method with three different window sizes. From Table 8, it can be seen that under the window size of 256, the test accuracy of the proposed method was higher than that of 2048 and 4096. This is because when the window size is larger, the number of windows is fewer. This directly leads to a lack of training set data in the feature prediction stage, especially for severe and failure periods, and the key information cannot be captured during feature prediction. As a result, the final classification accuracy of the predicted features would be lower. On the other hand, although the test time with a window size of 256 was about 0.01 s more than the other two cases, the fault prediction accuracy was about 0.1 higher. Therefore, the window size was chosen to be 256 in this simulation. When the programming environment and CPU change, the fault prediction time will also change accordingly.

Conclusions
In this study, we divided the fault prediction task into three stages: feature extraction, feature prediction, and fault mode classification. In the first stage, the spatiotemporal features of the degradation period and fault mode are extracted through CNN-LSTM, based on vibration signals from multiple sensors. In the second stage, the features are sent to a Bi-LSTM network with an added attention mechanism to predict the feature trends. The Bi-LSTM method can take into account two-way sequences, and the attention mechanism can make the Bi-LSTM network work more efficiently by adjusting the weights. Finally, an SVC is used to classify the predicted spatiotemporal features of the deterioration period and fault type simultaneously. The IMS dataset was used to illustrate the effectiveness of the proposed fault prediction method. The simulation results show that a short-term prediction of deterioration period failure modes was achieved using the established fault prediction model. This would be helpful in arranging a maintenance plan in an industrial production setting.
The efficiency of the proposed three-stage fault prediction method is based on the premise that the training set and test set obey the same distribution, with plenty of samples. However, in real engineering scenarios, rolling bearings usually work in normal conditions under different work conditions, which leads to different distributions and few fault data. Therefore, fault prediction strategies should be investigated considering these problems in the future work.