A Novel Hybrid Deep Learning Method for Fault Diagnosis of Rotating Machinery Based on Extended WDCNN and Long Short-Term Memory

Deep learning (DL) plays a very important role in the fault diagnosis of rotating machinery. To enhance the self-learning capacity and improve the intelligent diagnosis accuracy of DL for rotating machinery, a novel hybrid deep learning method (NHDLM) based on Extended Deep Convolutional Neural Networks with Wide First-layer Kernels (EWDCNN) and long short-term memory (LSTM) is proposed for complex environments. First, the EWDCNN method is presented by extending the convolution layers of WDCNN, which further improves automatic feature extraction. LSTM layers are then incorporated into the EWDCNN architecture to produce the novel hybrid method (NHDLM), which further improves feature classification performance. Compared with CNN, WDCNN, and EWDCNN, the proposed NHDLM method achieves the highest identification accuracy for the fault diagnosis of rotating machinery.


Introduction
In modern industries, rotating machinery is widely used in complex mechanical systems that work under severe conditions. Therefore, various failures can occur over a long running time. Fault diagnosis uses various methods to ensure safe operation and reduce losses for rotating machinery, which is very important [1,2]. Traditionally, fault diagnosis comprises signal acquisition, feature extraction, and fault classification, which can detect most intrinsic mechanical information and identify fault features [3][4][5]. There are a variety of fault diagnosis methods. Time-frequency analysis can reveal the characteristic frequencies of a vibration signal, but it is time-consuming for large amounts of data. The wavelet transform (WT) was presented to filter noise and analyze features of vibration signals; even so, WT still suffers some drawbacks when processing large amounts of nonlinear vibration signals [6]. Empirical Mode Decomposition (EMD) was successfully applied to the fault diagnosis of vibration signals; however, after extracting impulsive features from the vibration signals, the mode mixing problem remained [7].
In recent years, many methods based on machine learning have been developed for fault diagnosis. Support vector machine (SVM) has the advantage of a global optimal solution and can be widely applied to classification in fault diagnosis; however, SVM is a shallow model and cannot directly achieve feature self-learning from vibration signals [8,9]. The k-nearest neighbor (KNN) method can achieve good classification performance; however, it still has difficulties with high-dimensional data [10]. Machine learning methods have achieved good classification performance for rotating machinery; however, their architectures still lack multi-layer nonlinear mapping ability. To address these limitations, this paper proposes a novel hybrid DL method. The main contributions are as follows: (1) The novel hybrid DL method can be applied under different bearing health conditions, as it fully uses previous information for feature extraction and has a stronger self-learning capacity, achieving high accuracy for fault classification. (2) The proposed method integrates the advantages of the improved WDCNN method and the LSTM method, and uses various neural networks for stronger feature self-learning from large vibration signals. (3) t-distributed Stochastic Neighbor Embedding (t-SNE) can show the different mapping abilities of the neural networks for fault classification. This visualization displays how the different layers capture information step by step, and the ten fault styles of vibration bearing signals were easily recognized by the novel hybrid DL method.
The rest of the paper is organized as follows. In Sections 2 and 3 the basic theory of the WDCNN method and LSTM method are introduced, respectively. In Section 4, the improved WDCNN method and proposed framework of the novel hybrid deep learning method (NHDLM) are described in detail. In Section 5, experiments are presented to demonstrate the different performances of the traditional CNN method, WDCNN method, EWDCNN method, and the proposed method (NHDLM). The conclusions are presented in Section 6.

Architecture of the WDCNN Model
To address the problems of the traditional CNN method, the WDCNN method was proposed for 1-D vibration signals under different conditions. The overall architecture contains convolutional layers, pooling layers, and one classification stage. Compared with the traditional CNN method, this architecture takes advantage of wide kernels in the first layer to capture useful information from high-frequency noisy signals, while the subsequent small kernels acquire low-frequency features. The multi-layer convolutional kernels make the networks deeper.

Convolutional Layer
In the convolutional layer, a convolution operation is conducted on local regions of the input with filter kernels (weights) to generate output features from the input signals: the input x^l(j), the convolution kernel K_i^l, and the bias b_i^l of the i-th filter kernel in layer l produce the output feature map of the j-th local region in layer l. The convolution process is described as follows:

y_i^{l+1}(j) = K_i^l ∗ x^l(j) + b_i^l

where the ∗ notation computes the dot product of the kernel and the local region.
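The convolution step above can be illustrated with a minimal pure-Python sketch (illustrative only; the signal and kernel values are made up):

```python
# Illustrative 1-D "valid" convolution: slide the kernel over the signal and
# take the dot product of each local region with the kernel, plus a bias.
def conv1d(x, kernel, bias=0.0, stride=1):
    """Return the feature map produced by `kernel` over signal `x`."""
    k = len(kernel)
    out = []
    for start in range(0, len(x) - k + 1, stride):
        region = x[start:start + k]  # the j-th local region x^l(j)
        out.append(sum(r * w for r, w in zip(region, kernel)) + bias)
    return out

signal = [1.0, 2.0, 3.0, 4.0, 5.0]
print(conv1d(signal, [1.0, 0.0, -1.0]))  # → [-2.0, -2.0, -2.0]
```

The `stride` parameter mirrors how a wide first-layer kernel (64 × 1 in this paper) can also move in larger steps to compress a long vibration signal quickly.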

Activation Layer
After the convolution operation, a Rectified Linear Unit (ReLU) is used to enhance the representation ability and help the convolutional networks learn features. As a useful activation unit, ReLU adjusts the parameters of the training layers by weights, where y_i^{l+1}(j) is the output value of the convolution operation. The formula is described as follows:

a_i^{l+1}(j) = max{0, y_i^{l+1}(j)}

where a_i^{l+1}(j) is the activation.
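A one-line sketch of the ReLU activation applied element-wise to a feature map:

```python
# ReLU keeps positive activations and zeroes out negative ones.
def relu(v):
    return max(0.0, v)

feature_map = [-2.0, -0.5, 0.0, 1.5]
print([relu(v) for v in feature_map])  # → [0.0, 0.0, 0.0, 1.5]
```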

Pooling Layer
After the convolutional layer in the architecture, a max-pooling layer is used to reduce the spatial size of the features and the number of parameters, performing a local max operation over the input feature map x_i to produce the output feature map y_i. The max-pooling layer is described as follows:

y_i(j) = max_{(j−1)W < t ≤ jW} x_i(t)

where W is the pooling width and y_i(j) denotes the corresponding value of the neuron in the j-th pooling region.
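A minimal sketch of non-overlapping 1-D max pooling (the default stride equal to the window width matches the halving behavior described here):

```python
# Non-overlapping max pooling: take the maximum of each window of `width` values.
def max_pool1d(x, width=2, stride=None):
    stride = stride or width  # non-overlapping windows by default
    return [max(x[i:i + width]) for i in range(0, len(x) - width + 1, stride)]

print(max_pool1d([1.0, 3.0, 2.0, 5.0, 4.0, 0.0], width=2))  # → [3.0, 5.0, 4.0]
```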

Batch Normalization
After a convolutional layer or fully connected layer, a batch normalization (BN) layer is applied to reduce internal covariate shift. For an input batch x = (x_1, · · · , x_p), the BN layer is described as:

μ = (1/p) Σ_{i=1}^{p} x_i,  σ² = (1/p) Σ_{i=1}^{p} (x_i − μ)²

x̂_i = (x_i − μ) / √(σ² + ε),  y_i = γ_i x̂_i + β_i

where ε is a small constant for numerical stability, γ_i is the scale parameter, β_i is the shift parameter, and y_i is the output.
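The four BN steps (batch mean, variance, normalization, scale-and-shift) in a minimal sketch, with ε assumed to be 1e-5:

```python
import math

# Batch normalization: normalize a batch to zero mean / unit variance,
# then apply the learned scale (gamma) and shift (beta).
def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    mu = sum(x) / len(x)                                 # batch mean
    var = sum((v - mu) ** 2 for v in x) / len(x)         # batch variance
    xhat = [(v - mu) / math.sqrt(var + eps) for v in x]  # normalize
    return [gamma * v + beta for v in xhat]              # scale and shift

out = batch_norm([1.0, 2.0, 3.0, 4.0])
print([round(v, 3) for v in out])  # ≈ [-1.342, -0.447, 0.447, 1.342]
```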

Architecture of the WDCNN Model
The overall architecture of WDCNN has filter stages and a classification stage, like a traditional CNN model, as shown in Figure 1. However, the WDCNN model takes advantage of the first wide convolutional layer and multi-stage convolutional layers for stronger feature extraction from input vibration signals. The WDCNN is applied to 1-D input vibration signals without any other transformation, and the multi-layer small convolutional kernels make the networks deeper to extract good representations. Batch normalization and a fully connected layer are then used to accelerate the training process. After extracting the features, the classification stage uses the fully connected layers for fault style classification. For the output layer, the SoftMax function converts the logits of ten neurons into the probability distribution over the ten bearing health conditions, described as:

p_j = exp(z_j) / Σ_{k=1}^{10} exp(z_k)

where z_j is the logit of the j-th output neuron.
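A minimal sketch of the classification stage's SoftMax (plain Python, not the paper's TensorFlow code); here three logits are used for brevity, but the same function maps the ten bearing-condition logits to probabilities:

```python
import math

# SoftMax: convert raw logits into a probability distribution.
# Shifting by the maximum logit avoids overflow without changing the result.
def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # ≈ [0.659, 0.242, 0.099]
```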


Long Short-Term Memory (LSTM)
As a modified version of a recurrent neural network (RNN), LSTM can take advantage of current and previous information of the current task, and can address the drawback of traditional RNN for long-term memory. The LSTM network contains four gates, the forget gate, update gate, input gate, and output gate. The LSTM method, as shown in Figure 2, can flexibly keep the long-term memory of previous learning information, which means that the architecture of LSTM is more suitable for processing time series than RNNs.

The forget gate is applied to capture important features from the previous neuron state in a one-layer neural network. The forget gate is described as follows:

f_t = σ(W_f^T x_t + U_f^T h_{t−1} + b_f)

where σ(·) is the sigmoid activation function, x_t is the input vector at time t, h_{t−1} is the output of the memory block at time t − 1, W_f^T and U_f^T are the weight vectors, and b_f is the bias vector. The input gate i_t determines how much new information should be added to the current state of the neuron:

i_t = σ(W_i^T x_t + U_i^T h_{t−1} + b_i)

The new memory content is calculated and updated as follows:

C̃_t = tanh(W_C^T x_t + U_C^T h_{t−1} + b_C)

C_t = f_t ⊙ C_{t−1} + i_t ⊙ C̃_t

where C_t is the state information, C̃_t is the candidate state information, tanh is the hyperbolic tangent activation function, and ⊙ denotes element-wise multiplication. The output gate o_t determines how much information should be used in the next time step:

o_t = σ(W_o^T x_t + U_o^T h_{t−1} + b_o)

The output of the memory block is calculated as follows:

h_t = o_t ⊙ tanh(C_t)

In the architecture of LSTM, the memory unit C_t depends on the forget gate output f_t, the candidate memory information C̃_t, and the long-term memory information C_{t−1}. The output h_t of the memory block depends on the output o_t of the output gate and the value of the memory unit C_t.
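The gate equations can be sketched as one time step of a toy LSTM cell with scalar state (the weight values in W, U, and b below are arbitrary illustrative numbers, not trained parameters):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step with scalar state; W, U, b are dicts keyed by gate."""
    f = sigmoid(W["f"] * x_t + U["f"] * h_prev + b["f"])          # forget gate
    i = sigmoid(W["i"] * x_t + U["i"] * h_prev + b["i"])          # input gate
    c_tilde = math.tanh(W["c"] * x_t + U["c"] * h_prev + b["c"])  # candidate state
    c = f * c_prev + i * c_tilde                                  # new memory unit
    o = sigmoid(W["o"] * x_t + U["o"] * h_prev + b["o"])          # output gate
    h = o * math.tanh(c)                                          # block output
    return h, c

# Toy parameters shared by all four gates (purely illustrative).
W = {g: 0.5 for g in "fico"}
U = {g: 0.1 for g in "fico"}
b = {g: 0.0 for g in "fico"}

h, c = 0.0, 0.0
for x_t in [1.0, -1.0, 0.5]:  # a tiny input sequence
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h, c)
```

Running the step over a sequence shows how the memory unit c carries information forward while the gates decide what to keep, add, and emit at each step.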


The Proposed Novel Hybrid Deep Learning Method
Before developing the proposed NHDLM method, the EWDCNN method was developed by extending the convolution layers of WDCNN: additional convolutional layers were added to the architecture. Compared with the WDCNN model, the EWDCNN has six convolutional layers, which results in a stronger deep learning capacity for signals. The EWDCNN architecture was then combined with LSTM to develop the novel hybrid deep learning method (NHDLM), as shown in Figure 3. The architecture mainly consists of six convolutional layers with pooling layers, two LSTM layers, and one fully connected layer.
The novel hybrid deep learning method is designed to extract the spatial and temporal variation features of 1-D vibration bearing signals. The multiple convolutional layers can remove the noise from the vibration signals and extract the special features step by step. The proposed method, based on two LSTMs, can also take advantage of keeping the long short-term memory for extracting the time variation features of vibration signals and improving prediction. Finally, the fully connected layer is used to integrate the feature information, which is convenient for fault classification, realizing the nonlinear mapping, as shown in Figure 4.
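As a rough sketch of how a 1024-point sample shrinks through this pipeline, assuming 'same'-padded convolutions and non-overlapping pooling of width 2 (the actual strides and padding are specified in Table 2, so these are illustrative assumptions):

```python
# Hypothetical length walk-through of the NHDLM feature extractor:
# each of the six conv+pool stages keeps the length ('same' padding)
# and then halves it (max pooling of width 2).
def nhdlm_shapes(length=1024, n_stages=6, pool=2):
    shapes = [length]
    for _ in range(n_stages):
        shapes.append(shapes[-1] // pool)
    return shapes

# The final sequence is fed to the two LSTM layers, then the FC layer.
print(nhdlm_shapes())  # → [1024, 512, 256, 128, 64, 32, 16]
```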



Experiment
To validate the proposed method, public experimental data from Case Western Reserve University (CWRU) were applied for fault classification, as shown in Figure 5. The experiment platform includes a transducer, dynamometer, and induction motor. The sampling frequency was 12 kHz, and in addition to the Normal Condition (NC), there were three fault types of the vibration bearing: Ball Fault (BF), Inner Race Fault (IF), and Outer Race Fault (OF). Each fault type has three levels of severity, with fault diameters of 0.007, 0.014, and 0.021 inches, respectively. There were ten data styles of bearing health for training and testing: NC, BF7, BF14, BF21, IF7, IF14, IF21, OF7, OF14, and OF21. Each sample has 1024 points. Datasets A, B, C, and D, respectively, contain 700 training samples and 100 testing samples, 1400 training samples and 200 testing samples, 2100 training samples and 300 testing samples, and 2800 training samples and 400 testing samples of ten different fault conditions under loads of 0, 1, 2, and 3 hp. More details of the datasets are described in Table 1.
We also show the vibration signatures of the bearing faults in Figure 6. The Ball Fault (BF) is very difficult to diagnose, as many signals have impulsive content and the ball fault engages with the races only intermittently and seemingly at random. The Inner Race Fault (IF) exhibits strong harmonics, with very clear impulsive modulation at shaft speed. The Outer Race Fault (OF) exhibits abnormal characteristic symptoms in the envelope spectra and has the most modulation at shaft speed.

Parameters of the Proposed Novel Hybrid Deep Learning Method
In this experiment, the architecture of the proposed novel hybrid DL method has 6 convolutional and pooling layers, 2 LSTM networks, fully connected hidden layers, and a SoftMax layer. In this architecture, the size of the first convolutional kernel is 64 × 1 and the remaining kernel sizes are 3 × 1. Max pooling is added after each convolutional layer, and batch normalization is then used to improve the performance of the method by adaptively selecting the neuron size. The parameters of the convolutional layers and the LSTM network are detailed in Table 2; Google TensorFlow and Python 3.7 were used for the experiment.


Feature Visualization Analysis
According to feature visualization by mapping, t-distributed Stochastic Neighbor Embedding (t-SNE) was applied to verify the self-learning feature capacity of the novel hybrid deep learning method at different neuron layers, as shown in Figure 7, which means that the ten fault styles of the vibration signals were easier to recognize. This visualization displays how different layers capture information step by step using nonlinear mapping. The fault signals do not separate well in the early layers; however, as the layers go deeper, the model makes full use of its self-learning capability and all of the fault signals become separable.
As shown in Figure 8, there are four confusion matrices for the four predictive models (traditional CNN, WDCNN, improved EWDCNN, and the proposed NHDLM) trained on the 2800/400 train-test split. In each confusion matrix, a blue rectangle means that the fault signals were correctly classified, a green rectangle means that the fault type was not classified correctly, and the number in a rectangle denotes the number of tests. For the first case (predictive model 1, traditional CNN), (400 − 43) out of 400 tests were correctly classified, which means that the traditional CNN reached an accuracy of 89% on the test samples. For predictive model 2 (WDCNN), (400 − 31) out of 400 tests were correctly classified, for an accuracy of 92%. For predictive model 3 (EWDCNN), (400 − 11) out of 400 were correctly classified, for a prediction accuracy of 97%. For the proposed method, (400 − 3) out of 400 were correctly classified, for a classification accuracy of 99%, so most fault types were classified correctly.
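The accuracies quoted above follow directly from the misclassification counts read off the confusion matrices; a quick check:

```python
# Accuracy from misclassification counts on the 400-sample test set.
def accuracy(total, misclassified):
    return (total - misclassified) / total

miscounts = {"CNN": 43, "WDCNN": 31, "EWDCNN": 11, "NHDLM": 3}
for name, m in miscounts.items():
    print(f"{name}: {accuracy(400, m):.0%}")
# → CNN: 89%, WDCNN: 92%, EWDCNN: 97%, NHDLM: 99%
```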
To further verify the effectiveness of the proposed method on vibration signals, the A, B, C, and D datasets were applied to compare the CNN, WDCNN, and EWDCNN methods with the proposed NHDLM method. The simulation results are presented in Figure 9 and described below.
(a) Dataset A: For 700 training samples and 100 testing samples, the CNN approach achieved 50% prediction accuracy, the WDCNN method was 50%, the EWDCNN method was 51%, and the proposed NHDLM method was 58%. For 1400 training samples and 200 testing samples, the CNN approach achieved 73% prediction accuracy, the WDCNN method was 70%, the EWDCNN method was 71%, and the proposed NHDLM method was 81%. For 2100 training samples and 300 testing samples, the CNN approach achieved 76% prediction accuracy, the WDCNN method was 89%, the EWDCNN method was 92%, and the proposed NHDLM method was 96%. For 2800 training samples and 400 testing samples, the CNN approach achieved 89% prediction accuracy, the WDCNN method was 92%, the EWDCNN method was 97%, and the proposed NHDLM method was 99%.
(b) Dataset B: For 700 training samples and 100 testing samples, the CNN approach achieved 43% prediction accuracy, the WDCNN method was 50%, the EWDCNN method was 54%, and the proposed NHDLM method was 55%. For 1400 training samples and 200 testing samples, the CNN approach achieved 59% prediction accuracy, the WDCNN method was 64%, the EWDCNN method was 70%, and the proposed NHDLM method was 72%. For 2100 training samples and 300 testing samples, the CNN approach achieved 72% prediction accuracy, the WDCNN method was 92%, the EWDCNN method was 94%, and the proposed NHDLM method was 95%. For 2800 training samples and 400 testing samples, the CNN approach achieved 87% prediction accuracy, the WDCNN method was 94%, the EWDCNN method was 96%, and the proposed NHDLM method was 98%.
(c) Dataset C: For 700 training samples and 100 testing samples, the CNN approach achieved 54% prediction accuracy, the WDCNN method was 62%, the EWDCNN method was 63%, and the proposed NHDLM method was 76%. For 1400 training samples and 200 testing samples, the CNN approach achieved 61% prediction accuracy, the WDCNN method was 83%, the EWDCNN method was 85%, and the proposed NHDLM method was 90%. For 2100 training samples and 300 testing samples, the CNN approach achieved 66% prediction accuracy; the remaining results for this and the larger splits are listed in Table 3.
The traditional CNN method still has some drawbacks for larger datasets. Based on CNN, the WDCNN method takes advantage of the first wide convolutional layer and multi-stage convolutional layers for stronger feature extraction. To enhance the self-learning capacity of the WDCNN method, the EWDCNN method extends the convolution layers of WDCNN, which further improves automatic feature extraction. The LSTM then changes the geometric architecture of the EWDCNN to produce the novel hybrid method (NHDLM), which further improves the performance of feature classification. Thus, the proposed NHDLM method had the greatest identification accuracy on the bearing datasets.
More information is shown in Table 3 to further illustrate the feasibility of the proposed method for different training and testing sets.
When the training samples increased from 700 to 2800 and the testing samples increased from 100 to 400, the accuracy of these methods increased markedly with the number of samples. With 2800 training samples, the accuracy of the proposed method reached 99% and exceeded the other methods. The proposed method had the greatest recognition accuracy across the different datasets and methods, indicating promising feature self-learning performance for vibration signals.
To further verify the effectiveness of the proposed method in terms of training time, the datasets were applied to compare the CNN, WDCNN, and EWDCNN methods with the proposed NHDLM method. The training times are presented in Table 4. These experiments demonstrate the robustness of the proposed method's prediction accuracy across the datasets.

Conclusions
In this paper, a novel hybrid DL method was proposed for the fault classification of rotating machinery under complex working conditions. The health states cover different loads, speeds, fault types, fault diameters, and training/testing sample sizes. The proposed method is suitable for processing large datasets of vibration bearing signals and achieves good prediction performance.
Based on the WDCNN method, the EWDCNN method was developed by extending the convolution layers, and LSTM changed the EWDCNN's geometric architecture to develop the proposed NHDLM method. In the proposed NHDLM, the architecture combines multiple convolutional layers and LSTM networks. The LSTM networks effectively increase the self-learning capability of the convolutional layers for the ten fault styles of vibration signals, and the proposed model effectively integrates the layers. The experiments proved that the proposed NHDLM method exhibits better performance than the existing CNN, WDCNN, and EWDCNN methods, and that NHDLM achieves greater prediction accuracy on different fault styles of vibration signals.
In this work, no denoising preprocessing was applied before the deep learning methods processed the datasets. Therefore, denoising methods could be used in future work to preprocess the signals and further improve the performance of the deep learning methods.