## 1. Introduction

Rolling element bearings are core components of rotating machinery, and their health conditions, for example, fault diameters at different locations under different loads, can have an enormous impact on the performance, stability and life span of the machinery. The most common way to prevent possible damage is to monitor vibration in real time while the rotating machinery is in operation. With the vibration signals under different conditions collected by sensors, intelligent fault diagnosis methods are applied to recognize the fault types [1,2,3]. Common intelligent fault diagnosis methods can be divided into two steps, namely, feature extraction and classification [4,5]. The vibration signals collected from machines are raw temporal signals which contain useful information about the machine as well as useless noise. Therefore, it is necessary to find a way to extract features that represent the intrinsic information of the machine. Common signal processing techniques used to extract representative features from the raw signal include time-domain statistical analysis [6], wavelet transformation [7], and Fourier spectral analysis [8]. After feature extraction, a feature selection step is usually implemented to discard useless and insensitive features and to reduce dimensionality for the sake of computational efficiency. Common dimension reduction methods include principal component analysis (PCA) [9], independent component analysis (ICA) [10], and feature discriminant analysis. With the useful features extracted and selected from raw signals, the last step is to train classifiers such as k-nearest neighbor (KNN) [11], artificial neural networks (ANN), also known as multi-layer perceptrons (MLP) [12,13], or support vector machines (SVM) [14] on these features. After training, the classifiers should be tested on test samples to see whether they generalize well to unseen signal samples.

In recent years, Huang et al. proposed a genetic algorithm-based SVM (GA-SVM) model that can determine the optimal parameters of SVM with high accuracy and generalization ability [15]. In [16], the continuous wavelet transform was used to overcome shortcomings of the traditionally used Fourier transform, such as its inability to tell when a particular event took place, and an SVM was then used as the classifier to analyze frame vibrations. The MLP, known for its capability to learn complex and nonlinear patterns, has also been a very common classifier in fault diagnosis. Amar proposed a fault diagnosis method that uses a preprocessed FFT spectrum image as the input of an ANN: the FFT spectrum image generated from the raw vibration signal is first smoothed with a 2D averaging filter and then converted to a binary image with appropriate threshold selection [17]. In [18], the discrete wavelet transform is used for feature extraction and an artificial neural network for classification.

In recent years, with the surging popularity of deep learning as a computational framework in various research fields, some papers have tried to use convolutional neural networks (CNNs) [19] to diagnose faults of mechanical parts. CNNs have two main features, weight sharing and spatial pooling, which make them very suitable for computer vision applications whose inputs are usually 2D data, but they have also been used to address natural language processing and speech recognition tasks whose inputs are 1D data [20,21]. Therefore, in the fault diagnosis problem, the input of a CNN can be either 2D, e.g., a frequency spectrum image, or 1D, e.g., a time-series signal or spectral coefficients. Janssens et al. proposed a CNN model for rotating machinery condition recognition whose input is the DFT of two lines of signals collected from two sensors placed perpendicular to each other [22]. The model has one convolutional layer and one fully connected layer, topped by a softmax layer that classifies into four categories. In [23], a hierarchical adaptive deep convolutional neural network was proposed. The model has two hierarchically arranged components: a fault determination layer and a fault size evaluation layer. In [24], the input of the CNN model for motor fault detection is 1D raw time-series data, which successfully avoids the time-consuming feature extraction process. In [25], the proposed model also uses the 1D vibration signal as input; it can perform real-time damage detection and localization, and with raw vibration signals fed directly into the network, the optimal damage-sensitive features are learned automatically. In [26], we proposed a CNN model with two convolutional layers to diagnose bearing faults using a huge number of training data.

Though many of the works mentioned above have achieved good results, there is still plenty of room for improvement. For example, in many studies the classifier was trained with a very specific type of data, which means it may achieve high accuracy on similar data while performing poorly on another type. This may be caused by unrepresentative features extracted from the raw signals. Besides, when analyzing a highly complex system, the choice of suitable feature functions requires considerable machinery expertise and abundant mathematical knowledge. On the other hand, because of the manual feature extraction and selection, the accuracy of the diagnostic result is not stable when dealing with different data. Therefore, some studies proposed that the classifier should be able to classify the data from the raw signal directly, without feature extraction or manual selection [24,27]. In other words, the classifier should be able to process the raw signal automatically and adaptively and extract the representative features more precisely. To sum up, there are three main problems in intelligent fault diagnosis.

First, although many methods achieve good results in fault diagnosis, few of them work directly on raw temporal signals. Most of the algorithms share the same classifiers, such as SVM and MLP; these papers mainly focus on improving feature representation and extraction.

Second, many diagnosis methods have poor domain adaptation ability. It is not uncommon to find that a classifier trained with data from one working load fails to classify samples obtained from another working load properly.

Third, few algorithms perform well under noisy environment conditions. Various pre-processing methods are used to remove noise and improve classification accuracy, but few methods can classify raw noisy signals directly with high accuracy.

In order to address the problems above, in this paper we propose a method named Deep Convolutional Neural Networks with Wide first-layer kernels (WDCNN). The contributions of this paper are summarized below:

- (1)
We propose a novel and simple learning framework, which works directly on raw temporal signals. A comparison with traditional methods that require extra feature extraction is shown in Figure 1.

- (2)
This algorithm itself has strong domain adaptation capacity, and the performance can be easily improved by a simple domain adaptation method named AdaBN.

- (3)
This algorithm performs well in noisy environments, even when working directly on raw noisy signals without any denoising pre-processing.

- (4)
We try to explore the inner mechanism of WDCNN model in mechanical feature learning and classification by visualizing the feature maps learned by WDCNN.

The remainder of this paper is organized as follows: a brief introduction of CNN is provided in Section 2. The intelligent diagnosis method based on WDCNN is introduced in Section 3. Some experiments are conducted to evaluate our method against some other common methods, and a discussion of the experimental results is presented in Section 4. We draw conclusions and present future work in Section 5.

## 3. Proposed WDCNN Intelligent Diagnosis Method

As mentioned in Section 1, CNN has already been applied to fault diagnosis. However, these models fail to achieve higher performance than traditional methods. Most of the models are not deep enough; for example, the model used in [24] has only three convolutional layers, which makes it hard to obtain a highly nonlinear representation of the input signal. In order to give the kernels in the third layer a large enough receptive field to capture low frequency features, e.g., periodical changes in the signal, the size of the convolutional kernels cannot be too small. On the other hand, in order to preserve local features, the convolutional kernels cannot be too large. As a compromise, these models use middle-size kernels.

Besides, 1D vibration signals are different from 2D images. For a 224 × 224 image in ImageNet, VGGNet [29] performs well with all-small 3 × 3 convolutional kernels. However, for a 2048 × 1 vibration signal, designing a model with all-small 3 × 1 kernels is unrealistic: it would result in a very deep network that is very hard to train. In addition, small kernels in the first layer are easily disturbed by the high frequency noise common in industrial environments. Therefore, to capture the useful information of vibration signals in the intermediate and low frequency bands, we first use wide kernels to extract features, and then use successive small 3 × 1 kernels to acquire better feature representations; hence the model is deeper than the former CNN methods. That is why we name our model WDCNN, with W denoting the wide kernels in the first layer and D denoting the deep structure. The overall framework of the proposed WDCNN with AdaBN domain adaptation is shown in Figure 2. Details of each part are elaborated in the following subsections.

#### 3.1. Architecture of the Proposed WDCNN Model

As shown in Figure 3, the input of the CNN is a segment of normalized bearing fault vibration temporal signal. The first convolutional layer extracts features from the raw input signal without any other transformation. The overall architecture of the proposed WDCNN model is the same as that of normal CNN models: it is composed of several filter stages and one classification stage. The major difference is that, in the filter stages, the first convolutional kernels are wide, and the following convolutional kernels are small (specifically, 3 × 1). The wide kernels in the first convolutional layer can better suppress high frequency noise compared with small kernels. Multiple layers of small convolutional kernels make the network deeper, which helps to acquire good representations of the input signals and improves the performance of the network. Batch normalization is implemented right after the convolutional layers and the fully-connected layer to accelerate the training process.

The classification stage is composed of two fully-connected layers for classification. In the output layer, the softmax function is used to transform the logits of the ten neurons into a probability distribution over the ten different bearing health conditions. The softmax function is described as:

$$ q\left({z}_{j}\right)=\frac{{e}^{{z}_{j}}}{{\sum}_{k=1}^{10}{e}^{{z}_{k}}} $$

where ${z}_{j}$ denotes the logit of the j-th output neuron.
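For reference, a minimal, numerically stable implementation of the softmax function:

```python
import numpy as np

def softmax(z):
    """Map logits z to a probability distribution; subtracting the max
    before exponentiating avoids overflow without changing the result."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))
# probs is non-negative, sums to 1, and preserves the ordering of the logits
```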

#### 3.2. Training of the WDCNN

The architecture of WDCNN is designed to take advantage of the 1D structure of the input signals. Details of the architecture can be found in Section 4.2. The major structural differences between a traditional 2D CNN and a 1D CNN like the proposed WDCNN are the use of 1D kernels and 1D feature maps. Therefore, instead of the 2D convolution (conv2D) and 180° rotation (rot180) used during backpropagation in 2D CNNs, here we have 1D convolution (conv1D) and a simple reverse. In this part, we elaborate the training process of WDCNN using the back propagation algorithm.

The loss function of our CNN model is the cross-entropy between the estimated softmax output probability distribution and the target class probability distribution. No regularization term is added to the loss function, considering that Batch Normalization already has a regularization-like effect. Let $p\left(x\right)$ denote the target distribution and $q\left(x\right)$ denote the estimated distribution; the cross-entropy between $p\left(x\right)$ and $q\left(x\right)$ is:

$$ H\left(p,q\right)=-{\sum}_{x}p\left(x\right)\mathrm{log}\text{ }q\left(x\right) $$

The fully connected layers are identical to the layers in a standard multilayer ANN. Specifically, let ${\delta}^{l+1}$ be the error of the $l+1$ layer in the fully connected network with cost function $H$, where $\left(\mathbf{W},\mathbf{b}\right)$ are the parameters. Then the error of the $l$ layer is computed as:

$$ {\delta}^{l}={\left({\mathbf{W}}^{l}\right)}^{T}{\delta}^{l+1}\bullet {f}^{\prime}\left({z}^{l}\right) $$

where “●” denotes the element-wise product operator and ${f}^{\prime}\left({z}^{l}\right)$ is the derivative of the activation function at the pre-activations ${z}^{l}$.

The iteration of gradient descent updates the parameters as follows:

$$ {\mathbf{W}}^{l}\leftarrow {\mathbf{W}}^{l}-\alpha \frac{\partial H}{\partial {\mathbf{W}}^{l}},\text{ }\text{ }{\mathbf{b}}^{l}\leftarrow {\mathbf{b}}^{l}-\alpha \frac{\partial H}{\partial {\mathbf{b}}^{l}} $$

where $\alpha$ is the learning rate.
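The error propagation and update rules for a fully connected layer can be sketched in a few lines of numpy; the layer sizes and the sigmoid activation here are arbitrary choices for illustration, not the activations WDCNN actually uses.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
# Hypothetical sizes: layer l has 3 neurons, layer l+1 has 4.
W = rng.normal(size=(4, 3))          # weights mapping a^l to layer l+1
b = np.zeros(4)
z_l = rng.normal(size=3)             # pre-activations of layer l (assumed given)
a_l = sigmoid(z_l)                   # activations of layer l
delta_next = rng.normal(size=4)      # error already computed for layer l+1

# delta^l = (W^T delta^{l+1}) . f'(z^l); for sigmoid, f'(z) = f(z)(1 - f(z))
delta_l = (W.T @ delta_next) * (a_l * (1.0 - a_l))

# One gradient-descent step with learning rate alpha:
alpha = 0.01
W_new = W - alpha * np.outer(delta_next, a_l)  # dH/dW^l = delta^{l+1} (a^l)^T
b_new = b - alpha * delta_next                 # dH/db^l = delta^{l+1}
```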

The pooling layer down-samples its input to obtain summary statistics over local regions. Down-sampling is an operation similar to convolution; however, the function g is applied to non-overlapping regions.

Let $m$ be the size of the pooling region, $x$ be the input, and $y$ be the output of the pooling layer. The term $\mathrm{downsample}\left(f,g\right)\left[n\right]$ denotes the $n$-th element of $\mathrm{downsample}\left(f,g\right)$:

$$ y\left[n\right]=\mathrm{downsample}\left(x,g\right)\left[n\right]=g\left(x\left[\left(n-1\right)m+1\right],\dots ,x\left[nm\right]\right) $$

Here we use max pooling, so $g\left(x\right)=\mathrm{max}\left(x\right)$.

Backpropagation in the pooling layer reverses the above equation, which means the error signals are computed by up-sampling. In max pooling, the unit that was the maximum in the forward pass receives all the error during backward propagation:

$$ {\delta}^{l}\left[\left(n-1\right)m+k\right]={\delta}^{l+1}\left[n\right]\cdot {g}_{n}^{\prime}\left[k\right] $$

where ${g}_{n}^{\prime}$ depends on the pooling region $n$:

$$ {g}_{n}^{\prime}\left[k\right]=\left\{\begin{array}{ll}1, & \text{if }x\left[\left(n-1\right)m+k\right]=y\left[n\right]\\ 0, & \text{otherwise}\end{array}\right. $$
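This forward/backward pair for non-overlapping max pooling can be sketched in numpy (the pooling size m = 2 below is just an example):

```python
import numpy as np

def max_pool_1d(x, m):
    """Non-overlapping max pooling; len(x) must be divisible by m."""
    regions = x.reshape(-1, m)
    return regions.max(axis=1), regions.argmax(axis=1)  # output + max positions

def max_pool_backward(delta_out, argmax, m):
    """Up-sample the error: only the unit that was the max gets the gradient."""
    delta_in = np.zeros(len(delta_out) * m)
    for n, (d, k) in enumerate(zip(delta_out, argmax)):
        delta_in[n * m + k] = d
    return delta_in

x = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 0.0])
y, idx = max_pool_1d(x, 2)                                # y = [3., 5., 4.]
d = max_pool_backward(np.array([0.1, 0.2, 0.3]), idx, 2)
# d = [0., 0.1, 0., 0.2, 0.3, 0.]
```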

Finally, to calculate the gradient of the filter maps, we rely on the convolution operation again and flip the error matrix ${\delta}_{k}^{l}$ in the same way as we flip the filters in the convolutional layer:

$$ \frac{\partial H}{\partial {\mathbf{W}}_{k}^{l}}={\sum}_{i}{a}_{i}^{l}\ast \mathrm{reverse}\left({\delta}_{k}^{l}\right) $$

where ${a}^{l}$ is the input to the $l$-th layer. The operation “$\ast$” computes the valid convolution between the $i$-th input in the $l$-th layer and the error of the $k$-th kernel. The flip results from the derivation of the delta error in a convolutional neural network.
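This identity can be checked numerically for a single 1D kernel: valid convolution of the input with the reversed error is the same as a plain cross-correlation of the input with the error (sizes below are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(size=16)                   # input a^l to the convolutional layer
W = rng.normal(size=5)                    # one 1D kernel
z = np.correlate(a, W, mode='valid')      # forward pass (cross-correlation)
delta = rng.normal(size=z.shape)          # error delta_k^l at this layer's output

# Kernel gradient as valid convolution with the reversed error ...
dW = np.convolve(a, delta[::-1], mode='valid')
# ... which is identical to a plain cross-correlation with the error:
dW_corr = np.correlate(a, delta, mode='valid')
```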

#### 3.3. Domain Adaptation Framework for WDCNN

Domain adaptation is a realistic and challenging problem in fault diagnosis. It is hard to classify samples from one working environment when the classifier has been trained on samples collected in another. A working environment can be considered a domain: the domain in which we acquire labeled data and train our model is called the source domain, and the domain in which we only obtain unlabeled data and test our model is called the target domain. The problem above can then be regarded as a domain adaptation problem.

In 2016, Li et al. proposed a simple method named Adaptive Batch Normalization (AdaBN) [30] that uses BN to endow neural networks with good domain adaptation capacity. This algorithm can be easily combined with our WDCNN model because of the heavy use of BN in our model. The main problem in domain adaptation is the divergence between the distributions of the target domain and the source domain. AdaBN standardizes each sample by the statistics of the domain it belongs to, instead of always using the statistics of the source domain. Its purpose is to ensure that each layer receives data complying with a similar distribution, regardless of whether it comes from the source or the target domain. The framework of AdaBN is shown in Figure 4, and the details of AdaBN for WDCNN are described in Algorithm 1.

**Algorithm 1** AdaBN for WDCNN

**Input:** The input of neuron $i$ in the BN layers of WDCNN for each unlabeled target signal $p$, ${x}_{t}^{\left(i\right)}\left(p\right)\in {x}_{t}^{\left(i\right)}$, where ${x}_{t}^{\left(i\right)}=\{{x}_{t}^{\left(i\right)}\left(1\right),\dots ,{x}_{t}^{\left(i\right)}\left(n\right)\}$; the trained scale and shift parameters ${\gamma}_{s}^{\left(i\right)}$ and ${\beta}_{s}^{\left(i\right)}$ for neuron $i$, learned from the labeled source signals.

**Output:** Adjusted BN statistics of WDCNN.

**For** each neuron $i$ and each signal $p$ in the target domain:

1. Calculate the mean and variance over all the samples in the target domain: ${\mu}_{t}^{\left(i\right)}\leftarrow E[{x}_{t}^{\left(i\right)}]$, ${\sigma}_{t}^{\left(i\right)}{}^{2}\leftarrow Var[{x}_{t}^{\left(i\right)}]$
2. Calculate the BN output: ${\hat{x}}_{t}^{\left(i\right)}\left(p\right)=\frac{{x}_{t}^{\left(i\right)}\left(p\right)-{\mu}_{t}^{\left(i\right)}}{{\sigma}_{t}^{\left(i\right)}}$, ${\hat{y}}_{t}^{\left(i\right)}\left(p\right)={\gamma}_{s}^{\left(i\right)}{\hat{x}}_{t}^{\left(i\right)}\left(p\right)+{\beta}_{s}^{\left(i\right)}$

**End for**
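A toy numpy sketch of Algorithm 1 for a single BN neuron; the source parameters and the target distribution below are made up for illustration, and the small epsilon of standard BN is omitted for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma_s, beta_s = 1.5, 0.2        # scale/shift learned on the source domain
                                  # (hypothetical values)

# Unlabeled target-domain inputs of one BN neuron, simulated with a shifted
# and rescaled distribution to mimic a domain gap:
x_t = rng.normal(loc=3.0, scale=2.0, size=10_000)

# AdaBN: swap in the target-domain statistics, keep the learned gamma/beta
mu_t = x_t.mean()
sigma_t = x_t.std()
x_hat = (x_t - mu_t) / sigma_t
y_t = gamma_s * x_hat + beta_s

# The layer output again matches the distribution the next layer was trained
# on: mean(y_t) ~ beta_s and std(y_t) ~ gamma_s, despite the domain shift.
```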

#### 3.4. Data Augmentation

To acquire strong feature representations of the raw input signals, the WDCNN model is deep and its first layer is wide, but such a structure can easily overfit without sufficient training samples. In computer vision, data augmentation is frequently used to increase the number of training samples and enhance the generalization performance of CNNs [31]; horizontal flips, random crops/scales, and color jitter are widely used to augment training samples. In fault diagnosis, data augmentation is also necessary for a convolutional neural network to achieve high classification precision, and it is much easier here to obtain huge amounts of data by slicing the training samples with overlap. This process is shown in Figure 5: the training samples are prepared with overlap. For example, a vibration signal with 60,000 points can provide the WDCNN with at most 57,953 training samples, each with a length of 2048, when the shift size is 1. Many papers overlook the effects of this simple operation, and most of these works use hundreds of training samples without any overlap [23,24]. In Section 4.3, we validate the necessity of data augmentation.
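The overlapped slicing can be sketched as follows; the window length of 2048 and shift of 1 match the example in the text, while `slice_with_overlap` is just an illustrative helper name.

```python
import numpy as np

def num_samples(signal_len, length=2048, shift=1):
    """How many windows a record yields; overlap between windows = length - shift."""
    return (signal_len - length) // shift + 1

def slice_with_overlap(signal, length, shift):
    """Cut a 1D signal into overlapping windows of the given length."""
    n = num_samples(len(signal), length, shift)
    return np.stack([signal[i * shift : i * shift + length] for i in range(n)])

# A 60,000-point record with 2048-point windows and shift 1, as in the text:
print(num_samples(60_000))                     # 57953

# Small demonstration of the slicing itself:
windows = slice_with_overlap(np.arange(10), length=4, shift=2)
# windows = [[0 1 2 3], [2 3 4 5], [4 5 6 7], [6 7 8 9]]
```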

## 5. Conclusions

This paper proposes a new model, named WDCNN, to address the fault diagnosis problem. WDCNN works directly on raw vibration signals without any time-consuming hand-crafted feature extraction. WDCNN has two main features: a wide first-layer convolutional kernel and a deep network structure with small convolutional kernels. With the help of data augmentation, the proposed WDCNN model can easily achieve 100% accuracy on the public CWRU bearing data set.

Results in Section 4 show that, although a state-of-the-art DNN model can achieve quite high accuracy on normal datasets, its performance degrades rapidly under noisy environment conditions or when the working load changes. WDCNN, by contrast, combines high classification accuracy with strong robustness to working load changes and noise.

When trained with data from one load domain, WDCNN can diagnose data from another load domain with high accuracy, and this performance can be further improved by a simple domain adaptation method named AdaBN. Besides, the model performs well under noisy environment conditions without any denoising pre-processing. In addition, network visualizations are used to investigate the inner mechanism of the proposed WDCNN model.

As mentioned in Section 4.3, the dataset used in this paper is completely balanced, while in practice it is possible to encounter unbalanced datasets. Therefore, in future work we will investigate the performance of WDCNN on unbalanced datasets to expand the range of application of the algorithm.

It is worth noting that, even though AdaBN still leaves room for improving WDCNN’s performance, AdaBN requires statistical knowledge of the whole test data. In practice, only part of this knowledge is available, so the statistics of the whole test data need to be estimated from a subset of the data. Addressing this would make our algorithm more practical for real-world applications.

Compared with traditional features, the features extracted by WDCNN are not easily influenced by environmental changes. Therefore, in future work, we can try using the first few layers as a feature extractor and then train a classifier targeted at a specific working environment, which may further improve the performance of the algorithm under different working environments.