Bearing Fault Diagnosis Based on Shallow Multi-Scale Convolutional Neural Network with Attention

Recently, deep learning technology has been successfully applied to mechanical fault diagnosis. The convolutional neural network (CNN), as a prevalent deep learning model, occupies a place in intelligent fault diagnosis because it reduces the need for human feature extraction and prior knowledge, thereby enabling an end-to-end intelligent fault diagnosis model. However, the data available for mechanical fault diagnosis in practical applications are limited; a CNN model that is too deep and too complex is prone to overfitting, while a model with too simple a structure and shallow layers cannot fully learn the effective features of the data. Moreover, convolutional filters with fixed window sizes are widely used in existing CNN models, which cannot flexibly select variable pivotal features, and the model may be interfered with by redundant information in feature maps during training. Therefore, in this paper, a novel shallow multi-scale convolutional neural network with attention is proposed for bearing fault diagnosis. The shallow multi-scale structure can fully learn the feature information of the input data without overfitting. For the first time, a feature attention mechanism is developed for fault diagnosis to adaptively select features for classification more effectively, whereby pivotal features are emphasized and redundant features are weakened. The time-frequency representations used as the input of the model were obtained from the vibration time-domain signals, and contain the complete time-domain and frequency-domain information of the vibration signals. Compared with current popular diagnostic methods, the results show that the proposed diagnostic method achieves fairly high accuracy, and its performance is superior to the existing methods. The average recognition accuracy was 99.86%, and the weak recognition rate of the I-07 and I-14 labels was improved.


Introduction
The rolling element bearing, an essential component of rotating machinery, is one of the most common fault sources of equipment. Mechanical failure of bearings results in significant property losses. However, the practical application environments of bearings are diverse and complex; thus, systematic identification of fault types and fault degrees without human intervention is still a significant challenge. The traditional engineering approaches include many data-driven methods, among which signal processing methods are widely used [1]. Because of the periodicity of the faulty bearing signal, this approach usually contains three parts: data acquisition, feature extraction [2,3], and fault location. In feature extraction, the collected bearing signals are analyzed, and the useful features containing fault information are selected according to prior knowledge. However, a model with too simple a structure and shallow layers cannot fully learn the effective features of the data. Two aspects need to be improved in the existing CNN-based models for bearing fault diagnosis. Firstly, in the structure of the traditional CNN, only the feature maps in the last convolutional layer are provided for classification, and these feature maps are more constant and robust at the cost of losing pivotal information. Secondly, convolutional filters with fixed window sizes are widely adopted in most existing CNN models, which cannot flexibly select variable pivotal features in bearing fault diagnosis, and the model may be disturbed by redundant information in feature maps during training.
To overcome these limitations, this paper proposes a shallow multi-scale (MS) CNN with a multi-attention mechanism for bearing fault diagnosis. The time-frequency representation (TFR), as the input of the model, was generated from the original vibration data of the bearing; the TFR can effectively represent complex and nonstationary bearing degradation signals. Zhu et al. [22] proposed an effective deep feature learning approach for remaining useful life prediction of bearings, which relied on the TFR and CNN. The TFR was applied to analyze transients, including rapid changes in amplitude or phase during an event relative to post-event conditions [23]. The TFR worked better than the vibration image used by Hoang and Kang in CNN-based bearing fault diagnosis, as it retained more comprehensive information in the image data [24]. Wang et al. [25] compared eight time-frequency analysis methods for creating images, and the results indicated that the continuous wavelet transform and fast Stockwell transform were the best methods for bearing diagnosis. Because of the low visual complexity of TFR images, the shallow CNN structure effectively avoids the deep-network training problems of non-convergence and overfitting. MS convolutional networks were studied by Sermanet and LeCun [26] and used for recognizing traffic signs. Their research showed that multi-scale features combined with precise details were more robust and invariant than the deep features of the traditional CNN structure. A study of the recent literature shows that most deep learning-based fault diagnosis methods increased the depth of the network structure and the amount of training data, while neglecting the utilization efficiency of the features during model training. In the study of Sun et al. [27], before generating a multi-scale layer, the pooling layer and the last convolutional layer were combined, and favorable performance was obtained in a face identification task.
Therefore, it can be predicted that, by keeping the global and local information synchronously, more identifiable features can be obtained across bearing health statuses and modes. Using the MS layer as the last convolutional layer in this study, the global and local features were sustained to increase the network capacity, allowing more scale features to be extracted for classification. Furthermore, a multi-attention mechanism was proposed to adaptively select vital features to obtain superior recognition results. Attention is an effective mechanism for selecting vital information to achieve excellent results. There are several effective attention mechanisms for image captioning and machine translation, such as soft and hard attention [28], and global and local attention [29]. In this study, a deep learning model combined with the attention mechanism was adopted. The attention mechanism was used to focus on the features most sensitive to specific labels during training, in order to improve the performance of the model. Deep neural networks, including CNNs and recurrent neural networks, can achieve better results when equipped with an attention mechanism. In this paper, a multi-scale convolutional neural network (MSCNN) was combined with the multi-attention mechanism to propose a novel method for bearing fault diagnosis. The MSCNN differs from the multi-scale information in Reference [14], which was obtained from the signal before the input of the CNN; here, the multi-scale features were obtained during the training process and combined with the multi-attention mechanism. The proposed method achieved excellent results in simultaneously identifying the fault type and fault degree of bearings. Furthermore, the identification of specific bearing conditions was improved using the multi-attention mechanism.

Proposed Method
As discussed above, a shallow multi-scale convolutional neural network with multi-attention (MA-MSCNN) is proposed for bearing fault diagnosis. The TFR, as the input of the model, contains time and frequency domain information of bearing vibration data, and the TFR can effectively represent the complex and nonstationary information of bearing degradation signals [22]. Compared with the existing methods which only use the TFR to extract features manually, MSCNN can more fully mine the multi-scale information in the data for classification. Because of the low visual complexity of TFR images, the shallow structure of MA-MSCNN effectively avoids overfitting while ensuring fault diagnosis accuracy. More importantly, the proposed multi-attention mechanism allows the model to pay more attention to features which are valuable for classification, and, based on the experimental results in Section 3, the effectiveness of the model for fault identification was further improved.
Firstly, the samples were generated by enhanced sampling [30]. Then, 1D vibration data were converted to two-dimensional TFR image data via continuous wavelet transform. Because the frequency range of the vibration data was broad, and the size of the generated TFR image was huge, the bilinear interpolation method was used to reduce the size of the TFR. The resized TFR image was used as the input of MA-MSCNN. During the training process, the parameters were updated using the Adam optimizer. The procedure of the proposed method is illustrated in Figure 1.


Time-Frequency Representation
Firstly, samples were generated by enhanced sampling, a data augmentation technique shown in Figure 2. Samples were drawn by enhanced sampling from the vibration data of each health condition and each damage degree. Then, each sample was converted to 2D image data. Compared with the time-frequency image obtained by the short-time Fourier transform, the time-frequency image obtained by the wavelet transform is better [17], because the resolution of the wavelet transform at high frequency can be adjusted automatically, and the resolution of the resulting TFR image is higher. Unlike the Fourier transform, whose basis functions are sinusoids, the wavelet transform decomposes the signal into different resolutions at different time and space scales by translating and scaling the wavelet basis function. This operation is possible because the wavelet basis function has limited width in both the time domain and the frequency domain. Monitoring rotating machinery conditions is one of its main application fields [31]. The concept of the wavelet was first proposed, and then the continuous wavelet transform was applied as shown below.
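The continuous wavelet transform referred to here follows the standard definitions; since the paper's own Equations (1)-(3) did not survive in this text, a standard reconstruction is given below (the real-valued Morlet form is shown only as one common choice of mother wavelet):

```latex
\Psi_{\alpha,\beta}(t) = \frac{1}{\sqrt{\alpha}}\,\Psi\!\left(\frac{t-\beta}{\alpha}\right),
\qquad
U(\alpha,\beta) = \int_{-\infty}^{+\infty} x(t)\,\frac{1}{\sqrt{\alpha}}\,
\Psi^{*}\!\left(\frac{t-\beta}{\alpha}\right)\mathrm{d}t,
\qquad
\Psi(t) = e^{-t^{2}/2}\cos(5t)
```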
where α is the scaling parameter, β is the translating parameter, and Ψ(t) is a continuous wavelet, the shape and displacement of which are determined by α and β, respectively. The Morlet wavelet, similar to the impulse signal of rotating machinery, was chosen as the mother wavelet, due to the lack of a standard or general method to select mother wavelets [32]. U(·) is the wavelet 2D coefficient of the 1D degradation signal x(t), which is the time-frequency representation (TFR).
The TFR images were generated from all the samples by continuous wavelet transform through Equations (1)-(3). In order to illustrate the variation of frequency energy with different health conditions and with time, TFR images of each label are given in Figure 3. When the bearing was in a normal condition, the rotation frequency was apparent in the TFR image, and the frequency fluctuation was not evident. However, with faults at the inner raceway, outer raceway, or ball of the testing bearings, the bearings under defect conditions showed periodic impact phenomena, causing a modulation effect.


Dimensionality Reduction (Image Resize)
The vibration data of the testing bearing were collected by an accelerometer with a sampling frequency of 12 kHz, and the vibration signal included a frequency range of 0-6000 Hz. The size of the originally generated TFR image was 1000 × 6000, and we needed to reduce the dimensions due to the high-dimensional features of these TFRs. Instead of using general approaches such as principal component analysis (PCA) and nearest-neighbor interpolation, a simple and effective method was introduced, named bilinear interpolation, which has been effectively applied in image processing [33]. Bilinear interpolation performs a linear interpolation operation in two directions, making full use of the actual pixel values in the source image to determine the pixel value in the target image. Therefore, it has a much better scaling effect than simple nearest-neighbor interpolation. Figure 4 shows that Q11, Q12, Q21, and Q22 were the four pixels in the original image, with corresponding positions (x1, y1), (x1, y2), (x2, y1), and (x2, y2); P(x, y) was the resized pixel result, as shown below.
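The two-direction interpolation described above can be sketched as follows (a minimal illustration of standard bilinear interpolation for one target pixel; the function and variable names are illustrative, not from the paper):

```python
def bilinear(x, y, x1, y1, x2, y2, q11, q12, q21, q22):
    """Interpolate f(x, y) from the four corner values:
    q11 = f(x1, y1), q12 = f(x1, y2), q21 = f(x2, y1), q22 = f(x2, y2)."""
    # Linear interpolation along x, at y = y1 and at y = y2
    fxy1 = q11 * (x2 - x) / (x2 - x1) + q21 * (x - x1) / (x2 - x1)
    fxy2 = q12 * (x2 - x) / (x2 - x1) + q22 * (x - x1) / (x2 - x1)
    # Linear interpolation along y between the two intermediate values
    return fxy1 * (y2 - y) / (y2 - y1) + fxy2 * (y - y1) / (y2 - y1)
```

For example, with corner values 0, 1, 1, 2 on the unit square, the center point interpolates to 1.0, the average of the four corners.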

These TFR images were resized to 28 × 28 to train the MA-MSCNN model using bilinear interpolation through Equations (4)-(6); the resized image of one TFR image is shown in Figure 5.


Multi-Scale Convolutional Neural Network (MSCNN)
Classification in related CNN-based fault diagnosis methods uses only the features of the last layer. Thus, many detailed pieces of information in the intermediate layers are lost in these feature flows. Inspired by Sermanet and LeCun [26], the MS convolutional layer was proposed to keep the global and local information and increase the network capacity. The MS layer combined multiple convolution kernels of different sizes in the same convolutional layer. The resulting MS convolutional feature maps were mixed, then passed to the next pooling layer for subsampling and fed into the fully connected layer; finally, the result was obtained through the output layer by softmax.
After the convolution of the kernels with the input image, local features were formed in a convolutional layer, and then a nonlinear activation function was applied. A three-dimensional tensor, containing a stack of matrices called feature maps, was the output of the convolutional layer. The representation of the output feature map in the convolutional layer is shown below.
where the * operator denotes the 2D convolution over the channel. In this convolution operation, Y T−1 j is the j-th input tensor of layer T − 1, and Y T j is the j-th output tensor of layer T; b T j is the convolution bias, while ω T ij is the weight of the convolution kernel; ϕ(·) is a nonlinear activation function, with which the final output is achieved.
In the traditional neural network, the high-level feature obtained by the last convolutional layer after layer-by-layer convolution is used for the final task fitting, and the high-level feature tends to be stable after multi-layer convolution. In some cases, some detailed low-level features may be overlooked. The MS layer of MA-MSCNN combined convolution kernels on different scales in the second convolutional layer before the formation of a mixed layer. The mixed layer helped the net learn both higher-level and low-level features, in order to represent the image features with fewer neurons. The MS layer is illustrated in Figure 6. The output of the mixed layer can be written as shown below, where x 4 i and ω 4 ij denote the neurons and weights from the kernels of the multi-scale convolution layer, whereas n i means there are n filters in Conv2_i. Y, as the output of the mixed layer, is sent into the next layer. The size of all feature maps remains the same, as they are fed into the attention module, which requires the same size for multi-scale features.
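The mixed layer above can be sketched in numpy: kernels of several sizes convolve the same input with "same" zero padding, so all feature maps keep the input's spatial size and can be concatenated along the channel axis. The kernel sizes and filter counts here are illustrative placeholders with random weights, not the paper's trained configuration:

```python
import numpy as np

def conv2d_same(img, kernel):
    """2D convolution with zero 'same' padding (single channel, odd kernel)."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)))
    out = np.zeros_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

def ms_layer(img, kernel_sizes=(3, 5, 7, 9), filters_per_scale=2):
    """Concatenate same-sized feature maps from kernels of several scales."""
    rng = np.random.default_rng(0)
    maps = []
    for k in kernel_sizes:
        for _ in range(filters_per_scale):
            maps.append(conv2d_same(img, rng.standard_normal((k, k))))
    return np.stack(maps)  # shape: (n_scales * filters_per_scale, H, W)
```

Because every scale uses "same" padding, the stacked output preserves the spatial size of the input, which is exactly the property the attention module downstream relies on.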



Spatial Attention
According to the TFR image of the bearing vibration data, partial regions of the image corresponded to the different fault types, which were captured by the spatial attention mechanism. Y W×H represents the output of a feature map from the MS layer. The width (W) and height (H) were unfolded to reshape Y W×H into a vector (y 1 , y 2 , . . . , y m ) in which m = W × H. Here, y i was regarded as the feature of the i-th location. A single-layer neural network and a softmax function were applied successively on the image area to form the attention distribution. The attention distribution α was generated by a multi-layer perceptron and a softmax function. The spatial attention model φ s can be defined as shown below, where ω i and b i are the weight and bias of the model. After the spatial attention weights (α i ) were generated through Equation (9), the feature was represented as the vector (α 1 y 1 , α 2 y 2 , . . . , α m y m ) by element-wise multiplication. Finally, the obtained feature vector was reshaped back into Y W×H.
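The unfold-score-reweight-reshape sequence above can be sketched as follows (a minimal illustration; the tanh scoring function and the scalar parameters are assumptions standing in for the paper's trained single-layer network):

```python
import numpy as np

def spatial_attention(fmap, w, b):
    """fmap: (W, H) feature map; w, b: illustrative scoring weight and bias."""
    W, H = fmap.shape
    y = fmap.reshape(-1)                           # unfold to (W*H,) vector
    scores = np.tanh(w * y + b)                    # single-layer scoring (assumed tanh)
    alpha = np.exp(scores) / np.exp(scores).sum()  # softmax over locations
    return (alpha * y).reshape(W, H), alpha        # reweight, reshape back to W x H
```

The softmax guarantees the location weights α i sum to one, so the reweighting redistributes emphasis across locations without changing the feature map's size.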

Channel-Based Attention
Spatial location is not the only basis for attending to the visual feature V. The feature V in this study was further analyzed by introducing a channel-based attention mechanism. Notably, mode detectors are provided by the CNN filters, and the response of the corresponding convolution filter activates the feature map channel in the CNN. Therefore, applying the attention mechanism in channel mode can be regarded as a way to select the feature channels most sensitive for fault recognition. As shown in Figure 6, Y = [y 1 , y 2 , . . . , y m ] were the feature maps generated from the previous layer, where y j is the j-th channel of the feature maps Y, and m is the total number of channels. The channel attention model φ c can be defined after the definition of the spatial attention model, which is shown below.
β j = softmax(ϕ(ω j y j + b j )), where β j is the channel-wise attention weight, and ω j and b j are weight and bias terms. The final representation Y atten and the channel attention weights β j can be defined as follows. In addition to using channel attention and spatial attention separately, two kinds of models combining the two mechanisms were built, according to the order in which the two attention mechanisms are applied. The first type, SC-Attention, applied spatial attention before channel-wise attention. The second type, denoted as CS-Attention, applied channel-wise attention first. All training objectives were to minimize the cross-entropy loss. The effects of the two models on the results of fault diagnosis are also compared in the experimental analysis.
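The channel-wise counterpart can be sketched the same way (again a minimal illustration: the mean-pooled channel summary and the tanh scoring are assumed stand-ins for the paper's trained model φ c):

```python
import numpy as np

def channel_attention(feats, w, b):
    """feats: (m, W, H) feature maps; w, b: per-channel scoring parameters."""
    m = feats.shape[0]
    # Score each channel from its mean activation (an assumed pooling choice)
    scores = np.tanh(w * feats.reshape(m, -1).mean(axis=1) + b)
    beta = np.exp(scores) / np.exp(scores).sum()   # softmax over channels
    return feats * beta[:, None, None], beta       # reweight each channel by beta_j
```

Channels with larger β j dominate the reweighted representation, which is the "selecting the most sensitive feature channel" behavior described above.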

MA-MSCNN Training
The topology of the MA-MSCNN is shown in Figure 7. The classification layer after multi-attention was a two-layer fully connected multi-layer perceptron with a rectified linear unit (ReLU) activation function and softmax output, and a dropout layer with a 50% rate was set between the two fully connected layers to prevent overfitting. Features can be aggregated at different locations of the feature mapping through the pooling layer following the convolution layer. Pooling also reduces the dimensions of the convolutional features; as the size of the feature map decreases, computational efficiency improves. Max-pooling was employed, given as shown below, where Y T−1 j and Y T j are the j-th tensors of layer T − 1 and layer T, and the pooling size is s × s. Finally, the fully connected layer expands the output of the last pooling layer into the input of the softmax layer for diagnosis. In this process, the cross-entropy between the true label and the estimated softmax output is the loss function of MA-MSCNN, defined as shown below, where f is the expanded feature in the last layer, and ρ is the desired output for diagnosis; h θ is the regression function for result prediction, and θ = {ω, b} denotes the parameters of the function. There are many similarities between the traditional CNN and the other parts of MA-MSCNN. The weights and biases of all layers were initialized first in the training of MA-MSCNN. To minimize the loss function, the Adam optimizer with a learning rate of 0.001 was used to optimize the parameter set θ of MA-MSCNN. A learning rate that is too high or too low may result in training divergence or slow convergence, respectively, neither of which is favorable. Generally, in order to ensure the speed and stability of training, repeated experiments were used to determine an appropriate learning rate.
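The s × s max-pooling subsampling step can be sketched as follows (a minimal single-channel illustration; the reshape trick assumes non-overlapping windows and dimensions divisible by s):

```python
import numpy as np

def max_pool(fmap, s):
    """Non-overlapping s x s max-pooling; fmap dims must be divisible by s."""
    H, W = fmap.shape
    # Group pixels into s x s blocks, then take the max inside each block
    return fmap.reshape(H // s, s, W // s, s).max(axis=(1, 3))
```

For a 4 × 4 map with s = 2, each output element is the maximum of one 2 × 2 block, halving both spatial dimensions.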
In each epoch, the training samples were randomly divided into several small batches of 16 samples each and then fed into the network. The details of the architecture of MA-MSCNN are shown in Table 1.
The multi-scale features [26] and deep hidden identity features [27] showed that the last convolutional layer has more constant and robust deep features (global information) than the low-level layers, making it suitable for "large data" in complex operating conditions. Many precise details (local information) contained in low-level features, which are sensitive to interference, would be partially lost in high-level layers, which is unfavorable for the dissemination of information in the network. In this case, by inputting the multi-scale convolution result of the last convolution layer into the fully connected layer, the global and local features could be preserved simultaneously, making the classification more accurate.
The topology of the proposed MA-MSCNN is shown in Figure 7. C1 is the first convolution layer, with 32 filters with a size of 7 × 7; the feature maps of C1 were the outputs of this layer. The MS layer, as the second convolution layer, includes four convolution layers of 64 filters with a size of 3 × 3, 64 filters with a size of 5 × 5, 64 filters with a size of 7 × 7, and 64 filters with a size of 9 × 9. The multi-scale feature maps concatenated conv2_1, conv2_2, conv2_3, and conv2_4 generated by the MS layer, as shown in Figure 6. The classification layer includes two fully connected layers with 1024 units, and the dropout layer between the two fully connected layers avoids overfitting with a 50% dropout rate.
The attention feature maps were generated using the attention mechanism described in Section 2.4. In the channel attention feature maps shown in Figure 8, when the color of a feature channel is darker, the channel attention weight β j in Equation (10) is larger, which means that the feature channel is more sensitive for fault recognition. Lighter colors have the opposite effect. In the spatial attention feature maps, the darker part represents a larger spatial attention weight α i in Equation (9), which means that, in each feature map, the information in this area is more relevant to the true label of the sample. The feature maps reweighted by the attention mechanism pay more attention to the features related to the bearing health condition.

MA-MSCNN Fault Diagnosis
The original vibration data were cut into samples of the same size by enhanced sampling, and the samples were then labeled and divided into a training set and a test set. Each vibration data sample was converted into a cropped TFR to construct the input of the MA-MSCNN. The diagnosed fault condition was the output of the model.

Experimental Verification
A series of experiments was carried out to evaluate the effectiveness of the proposed bearing fault diagnosis method. The Bearing Data Center of Case Western Reserve University provided the experimental data for multiple faults [34]. Single-point faults of 0.007, 0.014, and 0.021 inches were seeded on the rolling elements and the inner and outer rings of the drive-end bearings, respectively. In the experiment, there were four load conditions: 0, 1, 2, and 3 hp. The vibration generated in the test was measured at a 12-kHz sampling frequency. With the proposed method, fault type and fault degree could be separated simultaneously. The damage degree of the bearing was indicated by the fault sizes of 0.007, 0.014, 0.021, and 0.028 inches. Twelve kinds of bearing health conditions under four kinds of loads were included in the dataset, among which the same health conditions under different loads were divided equally. Table 2 shows the 12 data labels from the different fault types and fault degrees. Samples were obtained from the vibration data of each health condition by enhanced sampling. Each sample contained 1000 points, and the stride of the enhanced sampling was 100. Therefore, each dataset contained 4360 samples. In total, 70% of the 17,440 samples were used as training samples and 30% were used as test samples.

Figure 9 shows that the loss value and accuracy of the proposed MA-MSCNN tended to stabilize between training steps 8000 and 10,000. Thus, in the experiment, the number of training steps was set to 10,000. The model was built with TensorFlow (1.8.0) and trained only on a central processing unit (CPU). The training time was nearly six hours on a laptop computer (64-bit, i7-7700HQ 2.8-GHz CPU, 16 GB of random-access memory (RAM)).
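The enhanced sampling and the train/test split sizes can be checked with a short sketch. The signal length below is a hypothetical value chosen so that a 1000-point window with a stride of 100 yields exactly 4360 samples per health condition; the paper does not state the raw record length.

```python
import numpy as np

def enhanced_sampling(signal, sample_len=1000, stride=100):
    """Cut a 1-D vibration signal into overlapping samples: each sample
    has `sample_len` points, and consecutive samples are offset by `stride`."""
    n = (len(signal) - sample_len) // stride + 1
    return np.stack([signal[i * stride:i * stride + sample_len] for i in range(n)])

# Hypothetical record length: (436_900 - 1000) // 100 + 1 = 4360 samples.
signal = np.zeros(436_900)
samples = enhanced_sampling(signal)
print(samples.shape)  # (4360, 1000)

# 4 datasets of 4360 samples -> 17,440 in total; integer arithmetic for the 70/30 split.
total = 4360 * 4
print(total * 7 // 10, total * 3 // 10)  # 12208 5232
```

With a stride one tenth of the window length, consecutive samples overlap by 90%, which is what multiplies the number of training samples available from a fixed-length recording.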

Evaluations of Single Attention
In this section, single attention mechanisms combined with the multi-scale CNN are evaluated. S_1 was a pure spatial attention mechanism following the first convolution layer (C1); after obtaining the spatial attention weights from the attention mechanism, we combined them with the feature maps of the C1 layer through element-wise multiplication and fed the result into the next layer. S_2 was a pure spatial attention mechanism following the MS layer; the spatially weighted feature maps were then fed into the next layer for classification. C_1 was a pure channel-based attention mechanism following the first convolution layer (C1); after obtaining the channel attention weights from the attention mechanism, we combined them with the feature maps of the C1 layer using Equations (11) and (12) and fed the result into the next layer. C_2 was a pure channel-based attention mechanism following the MS layer; the channel-weighted feature maps were then fed into the next layer for classification. N_0 was the MSCNN without an attention mechanism; its architecture was that of Figure 7 with the multi-attention layer removed.

According to the statistics of the experiment, a total of 51,124 valid samples participated in this part of the validation, of which 35,787 were used for training and 15,337 for testing. All the results are reported in Figure 10, where the identification of each fault label by each method is represented in the form of a confusion matrix. From these results, we can make a few observations. Firstly, the average recognition accuracy shown in Figure 10f, obtained by identifying all the test samples, shows that the fault recognition ability of the model was improved by placing a single attention mechanism after the first convolution layer (C1). Secondly, the confusion matrices in Figure 10a-e show that a single attention mechanism affected the recognition of specific bearing conditions, and the recognition accuracy for the normal condition (N_0) was significantly improved. Thirdly, the identification of the label I-14 samples was not improved by a single attention mechanism, and the identification accuracy for the label I-07 samples was even reduced.
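The spatial and channel reweighting used in the S and C variants can be sketched generically in numpy. Equations (9)-(12) are not reproduced in this excerpt, so the softmax-normalized mean scores below are assumed stand-ins chosen for illustration; only the shapes of the weights (one per spatial position, one per channel) and the multiplicative application follow the text.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def spatial_attention(feature_maps):
    """Spatial weights alpha_i: one weight per spatial position, shared across
    channels, applied by element-wise multiplication (as in S_1/S_2)."""
    h, w, _ = feature_maps.shape
    scores = feature_maps.mean(axis=-1)            # (H, W) per-position scores (assumed)
    alpha = softmax(scores.ravel()).reshape(h, w)  # normalized spatial weights
    return feature_maps * alpha[..., None]

def channel_attention(feature_maps):
    """Channel weights beta_j: one weight per feature channel, applied by
    rescaling each channel (as in C_1/C_2)."""
    scores = feature_maps.mean(axis=(0, 1))        # (C,) per-channel scores (assumed)
    beta = softmax(scores)
    return feature_maps * beta[None, None, :]

v = np.random.default_rng(0).standard_normal((8, 8, 4))
print(spatial_attention(v).shape, channel_attention(v).shape)  # (8, 8, 4) (8, 8, 4)
```

Both mechanisms leave the feature-map shape unchanged; they only rescale it, which is why the weighted maps can be fed directly into the next layer.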

Evaluations of Multi-Attention
Depending on the order in which channel-based attention and spatial attention are implemented, there were two types of model combining the two attention mechanisms; the distinction between them is shown in Figure 11. The first type, named the SC-Attention mechanism, applied spatial attention before channel-based attention. Firstly, given the initial feature map V_l, the spatial attention weight α_l was obtained using the spatial attention φ_s. The spatially weighted feature maps were obtained by the linear combination of α_l and V_l. Then, the spatially weighted feature maps were input into the channel-based attention model φ_c to obtain the channel attention weight β_l. Finally, the channel attention weights β_l and the feature maps after spatial attention were multiplied in the channel dimension to get the final features X_l. The second type, denoted the CS-Attention mechanism, implemented the channel-based attention first. For the CS-Attention mechanism, given the initial feature map V_l, the channel attention weight β_l was first obtained using the channel-based attention φ_c. Then, the channel-weighted feature maps were input into the spatial attention model φ_s to obtain the spatial attention weight α_l. Finally, the feature maps X_l were obtained by multiplying the spatial weights α_l and the feature maps after channel attention in the spatial dimension.

Figure 11. Two types of multi-attention mechanism.
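The two orderings can be contrasted in a few lines. As before, φ_s and φ_c are generic stand-ins (softmax-normalized mean scores assumed for illustration, since Equations (9)-(12) are not reproduced here); the sketch shows only that SC-Attention composes the channel step over the spatially weighted maps, while CS-Attention does the reverse.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def phi_s(v):
    """Spatial attention: one weight per position, applied across channels."""
    a = softmax(v.mean(axis=-1).ravel()).reshape(v.shape[:2])
    return v * a[..., None]

def phi_c(v):
    """Channel-based attention: one weight per channel."""
    b = softmax(v.mean(axis=(0, 1)))
    return v * b[None, None, :]

def sc_attention(v):
    """SC-Attention: spatial attention first, then channel-based attention."""
    return phi_c(phi_s(v))

def cs_attention(v):
    """CS-Attention: channel-based attention first, then spatial attention."""
    return phi_s(phi_c(v))

v = np.random.default_rng(0).standard_normal((8, 8, 4))
x_sc, x_cs = sc_attention(v), cs_attention(v)
print(x_sc.shape, x_cs.shape)  # (8, 8, 4) (8, 8, 4)
```

Since the second attention stage computes its weights from already-reweighted maps, the two orderings generally yield different final features X_l, which is why the paper compares them empirically.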
In this section, MA-MSCNN variants with different kinds of multi-attention mechanism are evaluated. SC_1 was the SC-Attention mechanism following the first convolution layer (C1); the feature maps produced by the multi-attention mechanism were fed into the next layer. CS_1 was the CS-Attention mechanism following the first convolution layer (C1). SC_2 was the SC-Attention mechanism following the MS layer; the attention-weighted feature maps were then fed into the next layer for classification. CS_2 was the CS-Attention mechanism following the MS layer. The results of these four comparisons are shown in Figure 12, from which we can make a few observations. Firstly, the average recognition accuracy shown in Figure 12e, obtained by identifying all the test samples, shows that a multi-attention mechanism after the multi-scale convolutional layer (MS layer) performs better than a single attention mechanism. This shows that the proposed MA-MSCNN model, which combines the multi-scale convolutional layer with the multi-attention mechanism, is more effective for bearing fault diagnosis. Secondly, the multi-attention mechanism also improved the ability of the model to identify individual fault types. The experimental results show that the identification of fault labels I-07 and I-14 was not satisfactory with a single attention mechanism, while the recognition accuracy of these two fault labels was significantly improved by the multi-attention mechanism. Tables 3 and 4 show the identification results for the different situations using the single attention mechanism and the multi-attention mechanism, respectively. Comparing Tables 3 and 4, the results clearly show that the model with the multi-attention mechanism achieved better diagnosis accuracies for labels I-07 and I-14 than the model with a single attention mechanism. Table 5 shows the recognition results for the different fault degrees under each fault type.
It can be seen that the diagnosis method proposed in this paper can also perform well in the case of a small difference in fault degree.

Comparison with Related Works
The rolling bearing dataset used in this investigation is a very popular benchmark in the study of mechanical fault diagnosis. Many excellent classification results were reported in recent years, with testing accuracies of 95% and higher achieved in References [9,14,15,21]. However, a study of the existing methods for this dataset shows that the accuracy of current intelligent diagnosis has reached a ceiling. Most studies focused on the input data and the structure of the model, whereas very limited work could be found on efficiently mining multi-scale features using an attention mechanism.
Among these, a testing accuracy of 97.91% was obtained in Reference [9] using optimized support vector machines; the model was trained with 880 samples, and the number of test samples was 1320. Then, 1000 test samples were divided into 11 classes with different fault types and degrees. An improved multi-scale cascade CNN (MC-CNN) was proposed in Reference [14] to mine multi-scale features from the input signals. This study focused on the input data of the CNN, obtaining multi-scale information in the input layer to improve the performance of the CNN. Only four bearing conditions were classified, and a testing accuracy of 99.61% was obtained over 50 experiments. The 800 samples used in the experiment were divided into training and testing sets in three proportions, and the model achieved its best testing accuracy when the training set contained the most samples. Ten bearing conditions were considered in Reference [15], and 20,000 samples were obtained via data augmentation; a classification accuracy as high as 98.8% was achieved on 2500 testing samples. In Reference [21], a two-stage machine learning method based on unsupervised feature learning and sparse filtering was proposed; the experimental dataset contained 4000 samples, and a fairly high identification accuracy of 99.66% was obtained when only 10% of the samples were used for training.
The method proposed in this paper achieved a bearing fault diagnosis accuracy as high as 99.86%. The MSCNN model with the multi-attention mechanism provided higher recognition accuracy, as shown in Figure 12e. This study carried out a more detailed condition segmentation using the same dataset from Case Western Reserve University: considering 12 bearing health conditions, the trained model identified 15,337 testing samples. A detailed comparison of classification accuracy with other studies on the same bearing dataset that report diagnosis accuracies higher than 95% is shown in Table 6.

Conclusions
In this paper, a novel multi-scale convolutional neural network with a multi-attention model, dubbed MA-MSCNN, was proposed for bearing fault diagnosis. MA-MSCNN combines multi-scale convolutional layers with a multi-attention mechanism to optimize the model's use of multi-scale information, achieving good performance in intelligent bearing fault diagnosis. Since the MA-MSCNN contains an MS layer, both global and local features can be preserved. The attentive features generated by the multi-attention mechanism then allow better utilization of label-related information in classification. Comprehensive experiments were carried out to evaluate the value of the attention mechanism, comparing different kinds of single attention mechanisms and multi-attention mechanisms. We found that the multi-attention mechanism could effectively improve the diagnosis accuracy by paying more attention to the features valuable for classification. A comparison with related methods and studies was provided to verify the superiority of the proposed method. The results showed that samples of different fault types and degrees were well distinguished by this method. Furthermore, the proposed method effectively improved the identification accuracy for the I-14 label, which was a problem in Reference [9]. This reveals that the proposed method can diagnose bearing faults more effectively under varying loads.
In future work, two points need to be addressed. Firstly, the experimental data were collected from a stable environment, and actual working environments are more complicated. Therefore, in future work, we will collect fault data in actual work environments and further evaluate the performance of the proposed model. Secondly, we will develop a real-time fault diagnosis system based on the method proposed in this paper.