Bearing Fault Diagnosis via Improved One-Dimensional Multi-Scale Dilated CNN

Bearings are key components of rotating machinery. Effective bearing fault diagnosis can ensure operational safety and reduce maintenance costs. This paper develops a novel bearing fault diagnosis method based on an improved multi-scale convolutional neural network (IMSCNN). In a traditional convolutional neural network (CNN), a convolutional kernel of fixed size is often employed in each convolutional layer, so informative features cannot be fully extracted for fault diagnosis. In the proposed IMSCNN, a one-dimensional (1D) convolutional layer is first used to mitigate the effect of noise contained in vibration signals. Then, four dilated convolutional kernels with different dilation rates are integrated through the inception structure to extract multi-scale features. Experimental results on the popular CWRU and PU datasets show the superiority of the proposed method over other related methods.


Introduction
Bearings are regarded as critical components in rotating machinery. However, bearings often fail because they usually operate in harsh environments [1,2]. Early and effective bearing fault diagnosis plays an important role in avoiding unforeseen downtime of rotating machinery.
Compared to current signals [3] and acoustic emission signals [4], vibration signals [5,6] contain abundant information that reflects the health state of bearings. Thus, vibration signals are widely used in bearing fault diagnosis. Generally, fault diagnosis techniques can be categorized into two types: signal analysis methods and data-driven methods. In signal analysis methods, vibration signals are first processed with signal processing techniques such as time-domain analysis [7], frequency-domain analysis [8] and time-frequency domain analysis [9,10]. Then, based on expert knowledge, features extracted from the different domains are used to detect changes in bearing health and assess health states. A major limitation of signal analysis methods is that extensive expert knowledge is required to determine the health states and fault types of bearings from the extracted features.
Different from signal analysis methods, data-driven methods rely only on the collected vibration data for fault diagnosis. In data-driven methods, labeled vibration data are first collected. Then, features are extracted from different domains, as in signal analysis methods. Finally, classifiers are trained on these extracted features using machine learning methods such as Support Vector Machine (SVM) [11,12], Random Forest (RF) [13,14] and Multi-Layer Perceptron (MLP) [15].
Recently, deep learning methods have gained considerable attention in the field of data-driven fault diagnosis. A major advantage of deep learning is that deep features can be extracted automatically. Generally, deep features carry more useful information for fault diagnosis than the shallow features used by traditional machine learning methods. It has been shown that better diagnostic performance can be achieved using deep learning methods [16]. As representative deep learning methods, the auto-encoder (AE), deep belief network (DBN), and convolutional neural network (CNN) have shown their superiority in bearing fault diagnosis. For example, Chen and Li fed time-domain and frequency-domain features extracted from different sensor signals into multiple two-layer sparse autoencoder (SAE) neural networks for fault classification [17]. Gan et al. designed a two-layer hierarchical diagnosis network (HDN) to identify fault types and rank fault severity by employing DBNs to provide representative features [18].
As one of the most effective deep learning methods, the convolutional neural network (CNN) has also been applied to fault diagnosis. Common CNN-based methods can be categorized into one-dimensional (1D) and two-dimensional (2D) CNN-based methods. For 1D CNN methods, the raw 1D time-domain vibration signals are directly fed into the 1D CNN model [19]. For 2D CNN methods, the raw vibration signals are usually transformed into 2D time-frequency domain data, which are then processed by a 2D CNN [20]. Levent et al. [19] took the raw vibration data as the input and used a compact adaptive 1D CNN to diagnose bearing faults. Gao et al. [21] proposed a novel hybrid deep learning method (NHDLM) based on Extended Deep Convolutional Neural Networks with Wide First-layer Kernels (EWDCNN) and long short-term memory (LSTM) to enhance diagnosis accuracy for rotating machinery in complex environments. Han et al. [22] presented a novel diagnosis framework that combines the spatio-temporal pattern network (STPN) approach with CNN and applied it to fault diagnosis of complex systems. Wang et al. [23] fused multi-sensor vibration signals and transformed them into images to obtain more informative features, which were then fed to a bottleneck CNN for fault diagnosis. However, in CNN-based fault diagnosis methods, each convolutional operation often uses convolutional kernels of the same size. To extract more informative features from vibration data, convolutional kernels of different sizes can be selected, inspired by the inception structure, to improve the performance of fault diagnosis. To extract multi-scale features, Qiao et al. [24] employed convolutional kernels with different widths as filters with different frequency-domain resolutions to simultaneously extract features from different frequency bands of the vibration signal. Further, Wang et al. [25] combined dilated convolution with multi-scale convolution for remaining useful life prediction. Compared with a standard convolutional layer, a dilated convolutional layer has a larger receptive field for the same kernel size. Due to this advantage, dilated convolution can ignore redundant information in vibration signals.
Motivated by the above discussions, an improved multi-scale convolutional neural network is developed for bearing fault diagnosis in this paper. To extract more informative features, we employ four dilated convolutional kernels with different dilation rates in the multi-scale CNN. Among these four dilated convolutional kernels, the dilation rates of two kernels are set to 1, so these two kernels reduce to standard convolutional kernels. Moreover, different from the multi-scale CNN in [24], a 1D convolutional layer is placed before the multi-scale CNN to mitigate the effect of noise on bearing fault diagnosis. In summary, the main contributions of the proposed method are as follows:
(1) To enlarge the receptive field of the multi-scale CNN, four dilated convolutional kernels with different dilation rates are designed. Thus, more informative features can be extracted for fault diagnosis.
(2) To reduce the noise in vibration signals, an additional 1D convolutional layer is adopted to extract features before the dilated convolutional layer.
(3) Two widely used datasets, CWRU and PU, are employed to evaluate the performance of the proposed method in comparison with other related methods. Results show the superiority of the proposed method.
The rest of this paper is organized as follows. Section 2 offers a brief review of CNN and its inception structure. In Section 3, the improved multi-scale dilated CNN is developed for bearing fault diagnosis. Two widely used experimental cases are carried out to evaluate the performance of the proposed method compared with other related methods in Section 4. In the final section, the conclusions are drawn.

Convolutional Neural Networks (CNN)
CNN is one of the most popular deep neural networks in recent years [26]. It has been widely used in computer vision [27], natural language processing [28] and other fields [29]. CNN usually consists of three parts: (1) convolutional layer; (2) pooling layer; (3) fully connected (FC) layer.

Convolutional Layer
Owing to the sparse interactions and parameter sharing of the convolutional operation, the number of weights in a convolutional layer is determined only by the size and number of the convolutional kernels. Each convolutional kernel deals with a part of the input data. Thus, the number of weights is significantly reduced. Meanwhile, the convolutional operation has the characteristic of equivariant representations since the same convolutional kernel is applied at every position. The convolutional operation is defined as

$$x_j^l = f\left(\sum_{i=1}^{M_j} x_i^{l-1} * k_j^l + b_j^l\right),$$

where $x_j^l$ represents the jth output of the lth layer, $M_j$ is the number of inputs, $k_j^l$, $b_j^l$ and $*$ represent the convolutional kernel, the bias and the convolutional operation, respectively, and $f(\cdot)$ is the activation function.
Compared to the commonly used sigmoid function, the ReLU function has become a ubiquitous activation function in deep neural networks, due to its computational efficiency and its ability to reduce gradient vanishing. The ReLU function is defined as

$$Z_j^l = \max\left(0, x_j^l\right),$$

where $x_j^l$ is the input and $Z_j^l$ is the output.
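As an illustration of the convolution and ReLU operations described above, the following minimal Python sketch computes one output feature map from a 1D signal. It is not the authors' implementation: the function names `conv1d` and `relu`, the toy signal, and the kernel values are assumptions chosen for illustration. The sketch uses one kernel per input channel and, as in most deep learning libraries, applies the kernel without flipping (cross-correlation).

```python
def relu(values):
    """Element-wise ReLU: max(0, x)."""
    return [max(0.0, v) for v in values]


def conv1d(channels, kernels, bias):
    """Valid 1D convolution of several input channels, summed into one map.

    channels: list of input channels (each a list of floats)
    kernels:  one kernel per input channel, all of the same length
    bias:     scalar bias added to every output position
    """
    k = len(kernels[0])
    out_len = len(channels[0]) - k + 1
    out = []
    for t in range(out_len):
        total = bias
        for x, w in zip(channels, kernels):
            total += sum(x[t + n] * w[n] for n in range(k))
        out.append(total)
    return out


# Toy example (hypothetical values): one input channel, one kernel of size 3.
signal = [[1.0, 2.0, 0.0, -1.0, 3.0]]
kernel = [[1.0, 0.0, -1.0]]
feature_map = relu(conv1d(signal, kernel, bias=0.5))
print(feature_map)  # [1.5, 3.5, 0.0]
```

The last position is negative before activation (-2.5), so ReLU sets it to zero.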

Pooling Layer
To reduce overfitting, pooling layers are often adopted in CNN. The commonly used pooling methods include average pooling and max pooling. In this study, max pooling is adopted. It can be defined as

$$P_{j,t}^l = \max\left\{W_t\left(x_j^l\right)\right\},$$

where $P_{j,t}^l$ is the tth output in the jth feature map of the lth layer, and $W_t(\cdot)$ represents the tth sliding-window operation, with a pooling window of size n × n, applied to the input feature map $x_j^l$.
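A minimal sketch of max pooling for a 1D feature map is given below. It assumes a 1D window of length `window` (rather than the n × n window used for images) and a stride equal to the window length by default; both choices, and the function name `max_pool1d`, are illustrative assumptions.

```python
def max_pool1d(feature_map, window, stride=None):
    """Slide a window over the feature map and keep the maximum of each window."""
    stride = stride or window  # non-overlapping windows by default
    return [
        max(feature_map[start:start + window])
        for start in range(0, len(feature_map) - window + 1, stride)
    ]


pooled = max_pool1d([1.5, 3.5, 0.0, 2.0, 0.5, 4.0], window=2)
print(pooled)  # [3.5, 2.0, 4.0]
```

With a window of 2 the feature map length is halved, which is where the loss of resolution discussed for pooling layers comes from.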

Fully Connected Layer
For classification, a fully connected (FC) layer is usually employed as the classifier, where all the inputs from one layer are connected to every activation unit of the next layer. The FC layer is defined as

$$x_j^l = f\left(\sum_{i} w_{ij}^l x_i^{l-1} + b_j^l\right),$$

where $x_j^l$ is the output of the lth FC layer, $x_i^{l-1}$ is the output of the (l − 1)th layer, and $w_{ij}^l$ and $b_j^l$ are the weights and bias of the lth FC layer.
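The FC layer can be sketched in a few lines of Python. The weights, biases, and the helper `predict_class` (which simply picks the unit with the largest score) are hypothetical and serve only to show how every input contributes to every output unit.

```python
def fully_connected(inputs, weights, biases, activation=None):
    """Each output unit is a weighted sum of all inputs plus a bias."""
    outputs = []
    for row, b in zip(weights, biases):
        z = sum(w * x for w, x in zip(row, inputs)) + b
        outputs.append(activation(z) if activation else z)
    return outputs


def predict_class(scores):
    """Index of the largest score (assumed decision rule for classification)."""
    return max(range(len(scores)), key=lambda j: scores[j])


# Toy example (hypothetical values): 2 input features, 3 output classes.
features = [1.0, 2.0]
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
biases = [0.0, 0.0, 0.5]
scores = fully_connected(features, weights, biases)
print(scores, predict_class(scores))  # [1.0, 2.0, 3.5] 2
```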

Dilated Convolutional Neural Networks
Although the pooling layer is widely used to maintain invariance and control overfitting, it reduces the spatial resolution, so spatial information of the feature map is lost. Dilated convolution was developed in the field of image segmentation to address this issue [30]. It can expand the receptive field without increasing the number of parameters or the amount of computation. Dilated convolution can be written as

$$x_j^l = f\left(\sum_{i=1}^{M_j} D_r\left(x_i^{l-1} * k_j^l\right) + b_j^l\right),$$

where $D_r$ represents the dilation operation with dilation rate r.
In the dilated convolutional operation, the dilation rate defines the spacing between the values in a kernel. Figure 1 shows the 3 × 3 convolutional process with dilation rates of 1 and 2, respectively. As displayed in Figure 1, if the dilation rate is set to 1, dilated convolution reduces to traditional convolution. In Figure 1a, a receptive field of 3 × 3 is obtained (r = 1), while in Figure 1b a receptive field of 5 × 5 is obtained (r = 2).
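The relation between dilation rate and receptive field can be checked with a short sketch. For a kernel of size k and dilation rate r, the kernel spans r(k − 1) + 1 input positions along each axis, which gives 3 for r = 1 and 5 for r = 2 when k = 3. The 1D functions below are an illustrative assumption (the figure shows the 2D case), and the kernel weights are arbitrary.

```python
def effective_kernel_size(kernel_size, dilation):
    """Number of input positions spanned by a dilated kernel along one axis."""
    return dilation * (kernel_size - 1) + 1


def dilated_conv1d(signal, kernel, dilation=1):
    """Valid 1D dilated convolution: kernel taps are `dilation` samples apart."""
    span = effective_kernel_size(len(kernel), dilation)
    return [
        sum(w * signal[t + n * dilation] for n, w in enumerate(kernel))
        for t in range(len(signal) - span + 1)
    ]


print(effective_kernel_size(3, 1), effective_kernel_size(3, 2))  # 3 5

signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(dilated_conv1d(signal, [1.0, 1.0, 1.0], dilation=1))  # [6.0, 9.0, 12.0, 15.0]
print(dilated_conv1d(signal, [1.0, 1.0, 1.0], dilation=2))  # [9.0, 12.0]
```

Both calls use the same three weights; only the spacing between taps changes, which is why the receptive field grows without adding parameters.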

Inception Architecture
To improve the performance of networks, inception architecture was introduced in CNN. In GoogLeNet [31], the inception V1 structure was developed as shown in Figure 2. By employing the concept of inception architecture, the depth and width of the networks are increased. Additionally, the advantage of inception V1 architecture is that computational costs can be reduced.
As shown in Figure 2, there are three convolutional kernels with different sizes and a max-pooling layer in the inception V1 architecture. The inception operation can be expressed as

$$F^l = \left[F_{C1}^l, F_{C2}^l, F_{C3}^l, F_P^l\right],$$

where $F_{C1}^l$, $F_{C2}^l$, $F_{C3}^l$ and $F_P^l$ represent the feature maps after the three convolutional layers with different convolutional kernel sizes and the pooling layer, respectively. $F^l$ represents the feature map of the lth layer, which combines $F_{C1}^l$, $F_{C2}^l$, $F_{C3}^l$ and $F_P^l$. Since $F^l$ contains features computed over different scales, the subsequent network can adaptively select the more useful features.
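The parallel-branch idea can be illustrated with a simplified 1D sketch: three convolutions with kernel sizes 1, 3 and 5 and a max-pooling branch are applied to the same input, padded so that every branch keeps the input length, and the results are stacked as channels. The averaging kernel weights, the pooling window of 3, and all function names are assumptions for illustration; the 1 × 1 reduction layers of the original inception V1 block are omitted.

```python
def same_conv1d(signal, kernel):
    """Zero-padded convolution (odd kernel size) that preserves signal length."""
    pad = (len(kernel) - 1) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(w * padded[t + n] for n, w in enumerate(kernel))
        for t in range(len(signal))
    ]


def same_max_pool1d(signal, window=3):
    """Stride-1 max pooling, padded with -inf so the length is preserved."""
    pad = (window - 1) // 2
    padded = [float("-inf")] * pad + list(signal) + [float("-inf")] * pad
    return [max(padded[t:t + window]) for t in range(len(signal))]


def inception_block(signal):
    """Run four parallel branches and stack their outputs as channels."""
    f_c1 = same_conv1d(signal, [1.0])          # kernel size 1
    f_c2 = same_conv1d(signal, [1.0 / 3] * 3)  # kernel size 3
    f_c3 = same_conv1d(signal, [0.2] * 5)      # kernel size 5
    f_p = same_max_pool1d(signal, window=3)    # pooling branch
    return [f_c1, f_c2, f_c3, f_p]


features = inception_block([1.0, 2.0, 3.0, 4.0])
print(len(features), [len(f) for f in features])  # 4 [4, 4, 4, 4]
```

Because all branches output the same length, they can be concatenated along the channel axis without any cropping.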

The Architecture of the Proposed IMSCNN
In the proposed method, the raw vibration data is used as the input of the neural network. The raw vibration data is divided into a number of groups. To facilitate subsequent processing and accelerate the convergence of the neural network, maximum-minimum normalization is applied to each group of input data:

$$\widehat{in}_i = \frac{in_i - in_i^{\min}}{in_i^{\max} - in_i^{\min}}, \quad i = 1, 2, \ldots, N,$$

where $in_i$ is the ith sample, $in_i^{\min}$ and $in_i^{\max}$ are the smallest and largest values in the group, $\widehat{in}_i$ is the normalized sample, and N is the number of samples in the group.
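A minimal sketch of the per-group maximum-minimum normalization follows. The function name `min_max_normalize` is hypothetical, and mapping a constant group to zeros is an assumption made here to avoid division by zero; the source does not discuss that case.

```python
def min_max_normalize(group):
    """Scale the samples of one group linearly into the range [0, 1]."""
    lo, hi = min(group), max(group)
    if hi == lo:
        # Assumed handling of a constant group (not specified in the text).
        return [0.0 for _ in group]
    return [(v - lo) / (hi - lo) for v in group]


print(min_max_normalize([2.0, 4.0, 6.0]))  # [0.0, 0.5, 1.0]
```

Since the minimum and maximum are taken per group, every group fed to the network covers the same numeric range regardless of the original signal amplitude.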
The structure of the proposed IMSCNN is shown in Figure 3. In practice, vibration signals are often contaminated by noise. To alleviate this problem, a 1D convolutional layer is first employed in the proposed method to filter the noise contained in the raw vibration signals. To enhance the ability of feature extraction, a dilated multi-scale convolutional (DMSConv) layer with a larger kernel size is then employed to extract multi-scale features. Inside the DMSConv layer, there are four multi-scale convolutions, as shown in Figure 4: four dilated convolutional kernels with different dilation rates are integrated through the inception structure. The details of the DMSConv layer are shown in Table 1, where KS and NC represent the kernel size and number of channels, respectively. From Table 1, it can be seen that the kernel size of each convolutional layer in the multi-scale convolution is odd. This makes it convenient to use same padding so that all branches output feature maps of the same size. With the four branches concatenated, the number of output channels of the DMSConv layer is 4 × NC.

Table 1. Architecture-related hyperparameters of DMSConv layer (columns: No., Layer Name, Layer Size, Dilation Rate).

The multi-scale feature map (MSFM) is defined as the concatenation $\mathrm{MSFM} = [F_T, F_D]$, where $F_T$ and $F_D$ represent the feature maps after the traditional and dilated convolutions, respectively. To increase the robustness and reduce the computational effort, the max-pooling operation is performed on the MSFM obtained after the first DMSConv layer. To extract deeper features, a second DMSConv layer with a smaller kernel size is utilized. Additionally, global average pooling (GAP) is used to compress the features of each channel into four features. Finally, these features are fed into an FC layer for classification.
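The channel bookkeeping of a DMSConv-style layer can be sketched as follows. This is a simplified, single-input-channel illustration and not the authors' implementation: the kernel sizes (3, 5, 3, 3), the dilation rates (1, 1, 2, 3), the constant kernel weights, and the value NC = 2 are hypothetical, chosen only so that two branches have dilation rate 1 as stated in the introduction. The global average pooling here reduces each channel to a single value, which is also an assumption.

```python
def dilated_same_conv1d(signal, kernel, dilation):
    """Dilated convolution with zero padding that preserves signal length."""
    total_pad = dilation * (len(kernel) - 1)
    left = total_pad // 2
    padded = [0.0] * left + list(signal) + [0.0] * (total_pad - left)
    return [
        sum(w * padded[t + n * dilation] for n, w in enumerate(kernel))
        for t in range(len(signal))
    ]


def dms_conv(signal, branches):
    """Apply every branch to the same input and concatenate along channels.

    branches: list of (kernels, dilation); each branch adds len(kernels) channels.
    """
    feature_maps = []
    for kernels, dilation in branches:
        for kernel in kernels:
            feature_maps.append(dilated_same_conv1d(signal, kernel, dilation))
    return feature_maps


def global_average_pool(feature_maps):
    """Average each channel over its length."""
    return [sum(fm) / len(fm) for fm in feature_maps]


NC = 2  # hypothetical number of channels per branch
branches = [
    ([[0.2] * 3] * NC, 1),  # traditional convolution (dilation rate 1)
    ([[0.2] * 5] * NC, 1),  # traditional convolution (dilation rate 1)
    ([[0.2] * 3] * NC, 2),  # dilated convolution (assumed rate)
    ([[0.2] * 3] * NC, 3),  # dilated convolution (assumed rate)
]
signal = [float(v % 7) for v in range(32)]
msfm = dms_conv(signal, branches)
pooled = global_average_pool(msfm)
print(len(msfm), len(msfm[0]), len(pooled))  # 8 32 8
```

The output has 4 × NC channels, each of the same length as the input, which matches the channel count given for the DMSConv layer.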
The structure of the proposed IMSCNN model is shown in Table 2. Vibration signals are often collected against a background of high-frequency noise. Thus, a relatively large kernel size and a relatively small kernel size are selected for the first and second DMSConv layers, respectively, to suppress the high-frequency noise. Following [32], the kernel sizes of the two DMSConv layers are set to 32 and 2, respectively. To train the proposed IMSCNN model, the cross-entropy loss function is adopted, and the widely used Adam optimizer [33] is employed. Batch normalization (BN) [34] is used to regularize the model and reduce the need for Dropout:

$$\hat{x}_i = \frac{x_i - E[x_i]}{\sqrt{\mathrm{Var}[x_i]}},$$

where $x_i$ represents the output of the ith layer, and $E[\cdot]$ and $\mathrm{Var}[\cdot]$ denote the mean and variance computed over the mini-batch.
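For reference, batch normalization and the cross-entropy loss can be sketched in plain Python. The sketch follows the standard formulations: BN with a small constant `eps` for numerical stability and a learnable scale `gamma` and shift `beta`, and cross-entropy computed from raw class scores through a log-sum-exp. The parameter names and default values are assumptions and are not taken from the paper.

```python
import math


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a mini-batch of activations to zero mean and unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


def cross_entropy(logits, target):
    """Negative log-probability of the target class under a softmax."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]


normalized = batch_norm([1.0, 2.0, 3.0])
loss_good = cross_entropy([5.0, 0.0], target=0)  # confident and correct
loss_bad = cross_entropy([0.0, 5.0], target=0)   # confident and wrong
```

A confident correct prediction gives a loss close to zero, while a confident wrong prediction is penalized heavily, which is the behaviour the optimizer exploits during training.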

Experiments and Results
To verify the performance of the proposed IMSCNN method, two case studies on the CWRU and PU datasets are carried out. For comparison, widely used neural networks including MLP, CNN and MSCNN are employed. The details of these neural networks are described as follows:
• MLP: it is composed of five FC layers. The details are shown in Table 3.
Figure 3 is ignored.
For all comparative methods, the batch size is set to 64, Adam is used as the optimizer, and the maximum number of epochs is 100. Each group of data input to the neural network contains 1024 data points. The experiments are run on an Intel Core i7-8750H CPU @ 2.20 GHz with 24.0 GB RAM and a GeForce GTX 2070 GPU under the Windows 10 operating system. All methods are implemented in Python 3.6.12 and PyTorch 1.7.1.
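The grouping and batching settings above can be made concrete with a small sketch. It assumes non-overlapping groups and drops any incomplete group at the end of a record; neither detail is stated in the text, and the function names are hypothetical. The record length of 256,000 samples corresponds to a 4 s recording at 64 kHz.

```python
def split_into_groups(signal, group_size=1024):
    """Cut a record into consecutive, non-overlapping groups of equal size."""
    n_groups = len(signal) // group_size  # incomplete tail is discarded
    return [signal[g * group_size:(g + 1) * group_size] for g in range(n_groups)]


def iter_batches(groups, batch_size=64):
    """Yield consecutive batches; the last batch may be smaller."""
    for start in range(0, len(groups), batch_size):
        yield groups[start:start + batch_size]


record = [0.0] * 256_000  # placeholder signal: 4 s at 64 kHz
groups = split_into_groups(record)
batch_sizes = [len(batch) for batch in iter_batches(groups)]
print(len(groups), batch_sizes)  # 250 [64, 64, 64, 58]
```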

Case 1: CWRU
The CWRU dataset was provided by the Case Western Reserve University bearing data center [35]. The vibration data were collected under one normal condition and three faulty conditions. Each faulty condition has three fault diameters, so there are a total of 9 kinds of faults to be classified. In this study, the data with a sampling frequency of 12 kHz are selected. The details of the faults are shown in Table 5. Table 5 shows that, in addition to the normal bearing, there are three different fault locations: Ball (B), Inner Race (IR), and Outer Race (OR). Each fault location contains three fault diameters of 0.007 inches, 0.014 inches, and 0.021 inches, respectively. All faults were created manually by electro-discharge machining (EDM). In the experiment, 80% of the collected data from each condition is used for training and the other 20% for testing. The accuracy results are shown in Table 6, and the confusion matrices are shown in Figure 5. It can be seen from Table 6 that both CNN and MLP offer satisfactory performance, with accuracies of 99.77% and 94.63%, respectively. By extracting multi-scale features, MSCNN, SimpleIMSCNN, and IMSCNN all achieve 100% accuracy.
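The 80%/20% split per condition can be sketched as below. Shuffling the samples of each condition with a fixed seed before splitting is an assumption, since the text does not say how the samples are assigned, and the function name `split_per_condition` is hypothetical.

```python
import random


def split_per_condition(samples_by_condition, train_ratio=0.8, seed=0):
    """Split every condition separately so each class keeps the same ratio."""
    rng = random.Random(seed)
    train, test = [], []
    for label, samples in samples_by_condition.items():
        shuffled = list(samples)
        rng.shuffle(shuffled)  # assumed: random assignment within a condition
        cut = int(round(train_ratio * len(shuffled)))
        train += [(sample, label) for sample in shuffled[:cut]]
        test += [(sample, label) for sample in shuffled[cut:]]
    return train, test


# Toy data: 10 conditions (1 normal + 9 faults) with 50 placeholder samples each.
data = {label: [(label, k) for k in range(50)] for label in range(10)}
train_set, test_set = split_per_condition(data)
print(len(train_set), len(test_set))  # 400 100
```

Splitting within each condition keeps every class represented in both sets, which matters when some fault classes have fewer recordings than others.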
To further compare the ability of feature extraction, t-SNE [36] is used to visualize the features extracted by all methods. As shown in Figure 6, the features extracted by MLP are close to each other for classes 1, 4, 7, and 8, while the features extracted by CNN are close for classes 5 and 8. Thus, some samples are misclassified by MLP and CNN, which is also confirmed by the confusion matrices in Figure 5. In contrast, the features of different classes extracted by MSCNN, SimpleIMSCNN, and IMSCNN are relatively far apart.

Case 2: PU
The PU dataset was provided by Paderborn University [37]. In the PU dataset, there are 14 faulty conditions. In this study, the vibration data were collected under the working conditions of a rotating speed of 1500 rpm, a load torque of 0.7 N·m, and a radial force of 1000 N. The descriptions of the 14 faults are listed in Table 7, where the fault location is represented by the fault mode. Since the fault type of No. 13 (KI04) is the same as that of No. 8 (KI14), we only consider No. 8 (KI14). Thus, our goal is to classify 13 faulty conditions. All data are collected on the test rig through the transducer. The sampling frequency of the vibration data is 64 kHz and the sampling time is 4 s. The bearings with real damage used in this experiment were obtained from accelerated lifetime tests. Low-viscosity oil was also used during the experiments, which promotes the appearance of damage. Most of the damage is fatigue damage in the form of pitting. The remaining damage is mainly plastic deformation in the form of indentations caused by debris. We use 80% of the data from each condition for training and 20% for testing. The confusion matrices are displayed in Figure 7. As shown in Figure 7, many faulty samples are misclassified by MLP and CNN. For MLP, the accuracy is only 58.89% for fault 7. For CNN, the accuracy for fault 11 is only 67.9%, and 13.58% of the samples of fault 11 are misclassified as fault 8.
From the confusion matrices of SimpleIMSCNN, MSCNN, and IMSCNN, it can be seen that there are far fewer misclassified samples. Similarly, t-SNE is used to visualize the features extracted by all comparative methods. The visualization results are plotted in Figure 8. As displayed in Figure 8, the features extracted by MLP and CNN cannot be well separated. Compared to MSCNN and SimpleIMSCNN, the features extracted by IMSCNN are more distinguishable. Table 8 lists the accuracy results. From Table 8, the accuracies of MLP and 1D CNN are 69.69% and 85.64%, respectively. By extracting multi-scale features, MSCNN reaches an accuracy of 95.53%, while the accuracy of SimpleIMSCNN is 92.1%. The proposed IMSCNN provides the best performance among the comparative methods, with an accuracy of 96.55%. This indicates that the noise contained in vibration signals can be filtered by the first 1D convolutional layer of the proposed IMSCNN, which improves its diagnostic performance.
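The per-fault and overall accuracies discussed above are read from confusion matrices. A minimal sketch of how such figures are computed is given below, assuming rows index the true class and columns the predicted class; this convention, the function names, and the toy labels are illustrative and not taken from the paper's results.

```python
def confusion_matrix(true_labels, predicted_labels, num_classes):
    """Count samples by (true class, predicted class)."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix


def per_class_accuracy(matrix):
    """Fraction of each true class that was predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]


def overall_accuracy(matrix):
    """Fraction of all samples on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Toy labels for three classes (hypothetical, for illustration only).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, num_classes=3)
print(cm)                      # [[1, 1, 0], [0, 2, 0], [1, 0, 1]]
print(per_class_accuracy(cm))  # [0.5, 1.0, 0.5]
```

Off-diagonal entries show which faults are confused with each other, which is the kind of information used above when noting that samples of one fault are misclassified as another.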

Conclusions
A novel CNN-based bearing fault diagnosis method called IMSCNN is developed in this paper. In the proposed IMSCNN method, the original one-dimensional vibration signal is preprocessed by a one-dimensional convolutional layer to alleviate the influence of noise. To extract more informative features, a multi-scale feature extraction layer called the DMSConv layer, which consists of four dilated convolutional operations with different dilation rates, is employed for fault diagnosis. The widely used CWRU and PU datasets are utilized to verify the superiority of the proposed IMSCNN in comparison with other related methods.