Intelligent Fault Diagnosis Method through ACCC-Based Improved Convolutional Neural Network

Abstract: Fault diagnosis plays an important role in improving the safety and reliability of complex equipment. Convolutional neural networks (CNNs) have been widely used to diagnose faults due to their powerful feature extraction and learning capabilities. In practical industrial applications, the obtained signals are always disturbed by strong and highly non-stationary noise, so the timing relationships of the signals should be highlighted more. However, most CNN-based fault diagnosis methods directly use a pooling layer, which can easily corrupt the timing relationship of the signals. More importantly, due to the lack of an attention mechanism, it is difficult to extract deep informative features from noisy signals. To overcome these shortcomings, this paper proposes an intelligent fault diagnosis method using an improved convolutional neural network (ICNN) model. Three innovations are developed. Firstly, the receptive field is used as a guideline to design the diagnosis network structure, and the receptive field of the last layer is made close to the length of the original signal, which enables the network to fully learn each sample. Secondly, dilated convolution is adopted instead of standard convolution to obtain larger-scale information and to preserve the internal structure and temporal relation of the signal when down-sampling. Thirdly, an attention mechanism block named advanced convolution and channel calibration (ACCC) is presented to calibrate the feature channels, so that deep informative features are assigned larger weights while noise-related features are effectively suppressed. Finally, two experiments show that the ICNN-based fault diagnosis method can not only process strongly noisy signals but also diagnose early and minor faults. Compared with other methods, it achieves the highest average accuracies of 94.78% and 90.26%, which are 6.53% and 7.70% higher than the CNN methods, respectively.
In complex machine bearing failure conditions, this method can be used to better diagnose the type of failure; in voice calls, it can be used to better distinguish speech from noisy background sounds and thus improve call quality.


Introduction
In recent years, deep learning (DL) methods have been increasingly used in fault diagnosis and prediction [1][2][3]. Deep learning is a class of machine learning algorithms based on representation learning from data. Intelligent diagnosis using deep learning frees traditional fault diagnosis from its heavy reliance on diagnostic experts and professional technicians, and eases the imbalance between the large amount of diagnostic data generated by mechanical equipment and the relatively few available diagnostic experts. Traditional machine learning techniques are limited in their ability to process natural data in its raw form. The most obvious difference between deep learning models and traditional models is that DL can automatically learn abstract representation features from the raw data [4].
Several DL methods, such as the deep belief network (DBN), deep auto-encoder (DAE), and convolutional neural network (CNN), have been applied to fault diagnosis [5]. These methods have achieved good results, but two shortcomings remain. (1) In traditional deep neural networks, the data are down-sampled by repeated use of pooling layers. The pooling layer reduces the number of training parameters, which lowers the computing cost and improves computational efficiency. However, the pooling operation loses the positional information between data points while achieving a certain degree of translation invariance. Positional information is an extremely important feature of time-series signals: it reflects the overall trend of the signal. Pooling operations may change the local trend of the signal and thereby lead to misjudgment. (2) In traditional convolutional networks, every feature channel is treated equally. Among them, some features may be important, while others are redundant or even irrelevant. The research above does not consider the weight of each feature map channel, which may lead to feature redundancy to a certain extent.
In recent years, the achievements of the attention mechanism in the field of computer vision have attracted wide attention from researchers. It can selectively enhance useful features and weaken redundant ones. Jie Hu et al. [27] designed squeeze-and-excitation networks (SENets), which learn channel attention for each convolution block, bringing clear performance gains to various deep CNN architectures. Zilin Gao et al. [28] improved the SE block by capturing more sophisticated channel-wise dependencies or by combining it with additional spatial attention. Jun Fu et al. [29] proposed a dual attention network (DANet) to adaptively integrate local features with their global dependencies. Chen et al. [30] proposed a transferable convolutional neural network to improve the learning of target tasks. Wang et al. [31] proposed a novel multi-task attention convolutional neural network (MTA-CNN) that can automatically give feature-level attention to specific tasks. The MTA-CNN consists of a global feature shared network (GFS network) for learning globally shared features and K task-specific networks with a feature-level attention module (FLA module). This architecture allows the FLA module to automatically learn the features of specific tasks from globally shared features, thereby sharing information among different tasks. Although these methods achieve higher accuracy, they also bring higher model complexity and more computation. Fang et al. [32] extended an efficient feature extraction method based on CNN and used a lightweight network to complete high-precision fault diagnosis tasks; a spatial attention mechanism (SAM) is used to adjust the weights of the output feature maps, giving the method good anti-noise ability and domain adaptability. Wang Hui et al. [33] proposed a new intelligent bearing fault diagnosis method, which combines the symmetrized dot pattern (SDP) representation with a squeeze-and-excitation network (SE-CNN) model.
This method can assign a weight to each feature extraction channel, further strengthening the bearing diagnosis model around the main features and reducing redundant information.
Inspired by the analyses mentioned above, this paper proposes a novel improved convolutional neural network (ICNN) fault diagnosis method. This paper has the following three contributions: (1) The receptive field is used as a guiding principle for the design of the network model.
In this paper, the model is always designed so that the receptive field of the last layer is close to the length of the original signal, which ensures that each feature extracted by the model covers the complete sample. (2) ACCC blocks are used to obtain features at a suitable scale while avoiding pooling layers, which can damage the timing relationship of the signal. Moreover, this block calibrates the feature channels so that informative features are significantly enhanced and irrelevant features are effectively suppressed.
(3) After being tested on two data sets, the proposed method is better than the other nine methods and achieves the highest average accuracy rate. The results show that the proposed method has good performance.
The rest of this paper is organized as follows. Standard CNN theory and the proposed method are given in Section 2. In Section 3, the experimental results are discussed and verified on the CWRU and NASA data sets. In Section 4, the conclusions are given.

Materials and Methods
CNN is a feed-forward neural network that includes several layers of data processing, as shown in Figure 1. A convolutional neural network consists of convolution layers and pooling layers, followed by a fully connected layer to obtain the final result. To avoid overfitting, a dropout layer can also be added between the fully connected layers.


Convolutional Layer
Convolution is one of the most important operations in convolutional neural networks. In the convolutional layer, the activation values of the upper layer are convolved with multiple convolution kernels, a bias is added, and the activation values of the next layer are then obtained by means of an activation function. The process is shown in Equation (1):

$$A_j^{l+1} = f\Big(\sum_{i \in M_j} A_i^l * W_{ij}^{l+1} + b_j^{l+1}\Big) \tag{1}$$

In the formula, $A_j^{l+1}$ is the activation value of the next layer; $M_j$ is the $j$-th region of the upper layer that needs to be convolved; $A_i^l$ is an element of $M_j$; $W_{ij}^{l+1}$ is the weight matrix; $b_j^{l+1}$ is the bias coefficient; and $f$ is the activation function, which is responsible for introducing nonlinear characteristics into the network.
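As an illustration, Equation (1) can be sketched for a single-channel 1D signal as follows; the kernel and bias values here are hypothetical, chosen only for demonstration.

```python
import numpy as np

def conv1d_forward(x, w, b, f=lambda z: np.maximum(z, 0.0)):
    """Valid 1D convolution (cross-correlation) of a single-channel signal,
    plus a bias and an activation function f, mirroring Equation (1)."""
    k = len(w)
    # Each output element is the dot product of one k-wide region with the kernel.
    z = np.array([np.dot(x[i:i + k], w) + b for i in range(len(x) - k + 1)])
    return f(z)

x = np.array([1.0, 2.0, -1.0, 0.5, 3.0])   # toy input signal
w = np.array([0.5, -0.5, 1.0])             # hypothetical 3-tap kernel
y = conv1d_forward(x, w, b=0.1)
print(y.shape)  # (3,)
```

With a kernel of size 3 and stride 1, a length-5 input yields 3 output activations, each covering one region $M_j$ of the input.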

Pooling Layer
The pooling layer is responsible for reducing the dimension of the activation values output by the upper convolution layer, which not only reduces the amount of data but also keeps the characteristic scale of the activation values invariant. The main pooling methods are maximum pooling and average pooling. Maximum pooling takes the maximum value of a pooling window, and average pooling takes the average value of the input pooling window. The mathematical descriptions are

$$p^{l(i,t)} = \max_{(t-1)W < j \le tW} \big\{ a^{l(i,j)} \big\}$$

$$p^{l(i,t)} = \frac{1}{W} \sum_{j=(t-1)W+1}^{tW} a^{l(i,j)}$$

where $a^{l(i,j)}$ is the activation value of the $j$-th neuron in frame $i$ of layer $l$, and $W$ is the width of the pooling region.
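A minimal sketch of the two pooling operations on a 1D activation map, using non-overlapping windows of width W = 2 as in the formulas above:

```python
import numpy as np

def pool1d(a, width, mode="max"):
    """Non-overlapping 1D pooling: split `a` into windows of size `width`
    and take the max (maximum pooling) or mean (average pooling) of each."""
    a = a[: len(a) // width * width].reshape(-1, width)  # drop any tail remainder
    return a.max(axis=1) if mode == "max" else a.mean(axis=1)

a = np.array([1.0, 3.0, 2.0, 2.0, 5.0, 1.0])
print(pool1d(a, 2, "max"))   # [3. 2. 5.]
print(pool1d(a, 2, "mean"))  # [2. 2. 3.]
```

Note how both variants halve the length but discard where inside each window the peak occurred, which is exactly the loss of positional information discussed in the Introduction.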

Fully Connected Layer
After the input data are propagated alternately through convolutional and pooling layers, the extracted features must be classified by the fully connected layer. This layer flattens the matrix output by the upper layer into a 1 × n vector and outputs, through the activation function, the probability that the sample belongs to each of the n classes. The forward propagation formula of the fully connected layer is

$$z^{l+1(j)} = \sum_{i} W_{ij}^{l}\, a^{l(i)} + b_j^{l}$$

where $W_{ij}^{l}$ is the weight between the $i$-th neuron of layer $l$ and the $j$-th neuron of layer $l+1$; $z^{l+1(j)}$ is the logit value of the $j$-th output neuron in layer $l+1$; and $b_j^{l}$ is the offset from the neurons in layer $l$ to the $j$-th neuron in layer $l+1$.
When layer $l+1$ is a hidden layer, the activation function is ReLU:

$$a^{l+1(j)} = \max\big(0,\, z^{l+1(j)}\big)$$

When layer $l+1$ is the output layer, the activation function is softmax:

$$a^{l+1(j)} = \frac{e^{z^{l+1(j)}}}{\sum_{k} e^{z^{l+1(k)}}}$$
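The fully connected layer and its two activation functions can be sketched together as follows; the weight and input values are hypothetical.

```python
import numpy as np

def dense(a, W, b):
    """Fully connected layer: z_j = sum_i W_ij * a_i + b_j."""
    return a @ W + b

def softmax(z):
    """Softmax over the logits, shifted by max(z) for numerical stability."""
    e = np.exp(z - z.max())
    return e / e.sum()

a = np.array([1.0, -0.5, 2.0])   # flattened features from the upper layer
W = np.ones((3, 2)) * 0.1        # hypothetical weights for a 2-class output
b = np.zeros(2)
p = softmax(dense(a, W, b))      # class probabilities, summing to 1
print(p, p.sum())
```

Because both output neurons here receive identical weights, the two logits are equal and softmax returns equal probabilities; in a trained network the learned weights break this symmetry.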

ICNN Structures
In a CNN architecture, the majority of layers are composed of convolution and pooling, which are the two key parts of a CNN. Generally speaking, for image classification tasks, convolutional and pooling layers can be used to extract the best features. The convolution layer performs feature extraction and the pooling layer performs feature aggregation; the latter provides some degree of translation invariance and also reduces the computational load of subsequent convolution layers. Finally, the fully connected layer produces the classification result [34].
Dilated convolution is a method for expanding the receptive field by inserting gaps between the elements of a standard convolution kernel. The higher the dilation rate, the larger the receptive field. Since the original vibration signal of a bearing has weak coupling properties and is often drowned in noise, a convolution with a large receptive field is necessary [35].
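The growth of the receptive field with dilation can be checked with a short helper (a standard receptive-field recurrence, not code from the paper): each layer with kernel size k, stride s, and dilation d enlarges the receptive field by (k − 1) · d times the cumulative stride of the layers before it.

```python
def receptive_field(layers):
    """Receptive field of the last layer of a stack of 1D convolutions.
    Each layer is a tuple (kernel_size, stride, dilation)."""
    rf, jump = 1, 1                 # jump = cumulative stride so far
    for k, s, d in layers:
        rf += (k - 1) * d * jump    # gap between taps scales with dilation
        jump *= s
    return rf

# Two stacked 3-tap convolutions with stride 1: dilation 1 vs. dilation 2
print(receptive_field([(3, 1, 1), (3, 1, 1)]))  # 5
print(receptive_field([(3, 1, 2), (3, 1, 2)]))  # 9
```

This is the guideline used for the ICNN design: layers are stacked until the receptive field of the last layer approaches the sample length, and dilation reaches that target with fewer layers and parameters than standard convolution.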
Assume that the filter size of the convolution kernel is k. A standard convolution kernel scans k adjacent elements of the feature map at each step, while a dilated convolution also scans k elements but with an interval between them; the step length between the elements is called the dilation factor. The dilation factor of a standard convolution is one. Figure 2 shows the difference between dilated convolution and standard convolution. Both convolutions in the figure have a kernel size of 3 and a stride of 1; the dilation factor is 1 for the standard convolution and 2 for the dilated convolution. In Figure 2, the input signal has a length of 7, and the receptive field of the neurons in the third layer of the dilated network equals that of the two-layer standard convolution, while the dilated convolution uses only 6 parameters, 33% fewer than the standard convolution. The structure of the proposed approach is illustrated in Figure 3.

An ACCC block is a computational unit built upon a transformation $F_{tr}$ mapping an input $X \in \mathbb{R}^{L' \times C'}$ to feature maps $U \in \mathbb{R}^{L \times C}$, where $L$ is the length of the feature map and $C$ is the number of filters.
Here $F_{tr}$ is taken to be a dilated convolutional operator, and $V = [v_1, v_2, \ldots, v_C]$ denotes the learned set of filter kernels, where $v_c$ refers to the parameters of the $c$-th filter. The outputs can then be written as $U = [u_1, u_2, \ldots, u_C]$, where

$$u_c = v_c * X = \sum_{s=1}^{C'} v_c^s * x^s$$

with $X = [x^1, x^2, \ldots, x^{C'}]$, $v_c = [v_c^1, v_c^2, \ldots, v_c^{C'}]$, and $u_c \in \mathbb{R}^{L}$. To obtain channel information without increasing the number of model parameters, an operation is needed to compress the global information of each channel into a channel descriptor. Global average pooling is used in the proposed method. Formally, a statistic $z \in \mathbb{R}^{C}$ is generated by shrinking $U$ through its spatial dimension $L$; the $c$-th element of $z$ is calculated by

$$z_c = F_{sq}(u_c) = \frac{1}{L} \sum_{i=1}^{L} u_c(i)$$

The next step is the excitation operation, which uses the channel information obtained from the squeeze operation to capture channel dependencies. Two fully connected layers are used to obtain the weights; the first activation function is ReLU and the second is sigmoid. This structure can not only fully learn the dependencies between channels but also strengthen or inhibit individual channels. To meet these criteria, a simple gating mechanism with sigmoid activation is chosen:

$$s = F_{ex}(z, W) = \sigma\big(W_2\, \delta(W_1 z)\big)$$

where $\delta$ refers to the ReLU function, $\sigma$ is the sigmoid function, $W_1 \in \mathbb{R}^{\frac{C}{r} \times C}$, and $W_2 \in \mathbb{R}^{C \times \frac{C}{r}}$. To limit model complexity and aid generalization, the gating mechanism is parameterized by forming a bottleneck of two fully-connected (FC) layers around the non-linearity: a dimensionality-reduction layer with reduction ratio $r$ (the choice of $r$ is discussed later), a ReLU, and then a dimensionality-increasing layer returning to the channel dimension of the transformation output $U$. The final output of the block is obtained by rescaling $U$ with the activations $s$:

$$\tilde{x}_c = F_{scale}(u_c, s_c) = s_c\, u_c$$

where $\tilde{X} = [\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_C]$ and $F_{scale}(u_c, s_c)$ refers to channel-wise multiplication between the scalar $s_c$ and the feature map $u_c \in \mathbb{R}^{L}$. After stacking a certain number of ACCC blocks, the ICNN is formed.
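The squeeze–excite–rescale pipeline above can be sketched in NumPy. This is a simplified single-sample version with randomly initialized, untrained weights; in the real block, W1 and W2 are learned during training.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_calibrate(U, W1, W2):
    """Squeeze-and-excitation style channel recalibration used in the ACCC block.
    U has shape (L, C): length L, C feature channels."""
    z = U.mean(axis=0)              # squeeze: global average pool -> (C,)
    h = np.maximum(W1 @ z, 0.0)     # excitation: bottleneck FC + ReLU -> (C/r,)
    s = sigmoid(W2 @ h)             # second FC + sigmoid -> per-channel weights in (0, 1)
    return U * s                    # rescale: channel-wise multiplication

L, C, r = 8, 4, 2                   # toy sizes; reduction ratio r = 2
rng = np.random.default_rng(0)
U = rng.standard_normal((L, C))     # stand-in for dilated-convolution output
W1 = rng.standard_normal((C // r, C))   # hypothetical (untrained) weights
W2 = rng.standard_normal((C, C // r))
X_tilde = channel_calibrate(U, W1, W2)
print(X_tilde.shape)  # (8, 4)
```

Because the sigmoid keeps every channel weight strictly between 0 and 1, each channel is attenuated in proportion to how informative the gating network judges it to be, rather than being hard-pruned.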
This block has the following advantages: (1) The pooling layer of a traditional convolutional network is replaced by the ACCC block. The number of layers required by the network is determined from the receptive field, with the only requirement being that the receptive field of the last layer approximate the length of the original signal, which eliminates complicated network design steps. (2) The attention mechanism performs feature calibration on the feature maps after dilated convolution: key features are reinforced and irrelevant features are suppressed. Through the stacking of the network, key features are sifted layer by layer while irrelevant features are suppressed early.

General Procedure of the Proposed Method
In this paper, a novel ICNN is developed for bearing fault diagnosis. The framework is shown in Figure 4. The general procedures, as shown in Figure 5, are summarized as follows:

• Step 1: Collect bearing vibration signal data.
• Step 2: Divide the signal into a training part and a test part. Both parts use data augmentation to segment the samples with a sliding window, where the length of each movement is the step. In this paper, the step is calculated automatically. Suppose the total length of the record is $L$, the signal length of each sample is $l$, the number of samples is $n$, and the step is $s$, where $[\cdot]$ means rounding. Then $s$ is calculated as

$$s = \left[ \frac{L - l}{n - 1} \right]$$

• Step 3: Design a neural network and input the processed data set into the network for training.
• Step 4: Use test sets or other data sets under different loads to verify the accuracy of the model.
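The step calculation in Step 2 can be sketched as follows; the record length of 120,000 points is a hypothetical value chosen only for illustration.

```python
def segment_signal(total_len, sample_len, n_samples):
    """Sliding-window segmentation: step s = floor((L - l) / (n - 1)), as in
    Step 2, so that n overlapping samples of length l fit inside the record."""
    s = (total_len - sample_len) // (n_samples - 1)
    starts = [i * s for i in range(n_samples)]  # start index of each sample
    return s, starts

s, starts = segment_signal(total_len=120_000, sample_len=1024, n_samples=300)
print(s)           # 397
print(starts[-1])  # 118703 -> the last sample ends within the record
```

Flooring the step guarantees the final window never runs past the end of the record, since (n − 1) · s + l ≤ L.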

Validation on the CWRU Bearing Dataset
In this study, the data set comes from the bearing data set of Case Western Reserve University, shown in Figure 6, which is divided into four types of conditions: normal, ball fault, inner race fault, and outer race fault. Each type of fault can be further divided according to the depth or location of the fault. There are three data sets under different loads, HP1, HP2, and HP3, and each of them has 16 conditions. Each condition contains 400 samples, of which 300 are training samples and 100 are test samples. Each sample is a collected vibration signal containing 1024 data points. The condition descriptions are shown in Table 1. In the table of network structures, the parameters of the Conv1D function are filter number, kernel size, and dilation rate; the default dilation rate is 1. In the pooling layer, the pool size and strides are both 2. The four types of CNN network structures are shown in Table 2. White noise is added to the samples to make the SNR 0 dB, and each method is then trained 10 times to obtain the average accuracy. The average accuracy of the CNN methods is much higher than that of BPNN and SVM with raw data, which reach only 52.29% and 66.25%, respectively. After feature extraction, the accuracies of BPNN and SVM increase greatly; however, they are still inferior to the proposed method. Among the four CNN networks, the traditional CNN has the lowest accuracy, 93.16%. Networks with dilated convolution or an attention block improve on the traditional CNN, and the network using both techniques has the highest accuracy, 97.11%. The diagnosis results are shown in Table 3 and Figure 7. The influence of sample size on the performance of the proposed method is investigated in Table 4, and the confusion matrices and t-SNE visualizations of the four CNN networks are shown in Figure 8. From the results, the accuracy decreases when the proportion of training samples is relatively small.
On balance, 300 samples are taken for training and 100 for testing.
In the ACCC block, one of the most important hyperparameters is the compression ratio. Ratios of 4, 8, 16, 32, and 64 are chosen to test the accuracy. CNN, SECNN, DCNN, and ICNN are each tested, and the accuracy and training time are recorded; the results are shown in Table 5.
It can be concluded that: (1) After adding the ACCC block, the accuracy under every compression ratio improves compared with the traditional CNN. (2) In 1D-CNN, introducing the ACCC block adds little more than one second of training time while increasing the accuracy by more than 2%, a very satisfactory result. In 1D-DCNN, introducing the ACCC block takes six seconds longer and increases the accuracy by nearly 1%, which is still a good result. A slight increase in training time is worthwhile in exchange for accuracy.
Therefore, considering both accuracy and time, the ratio is set to 16 for the next experiment. In this experiment, noise of different intensities is added to the samples, and the ten methods above are trained separately. The accuracies they obtain are shown in Table 6 and Figure 9. It can be concluded that traditional methods need manually extracted features, otherwise their accuracy is very low, whereas the CNN needs only the original signal as input. Even with feature extraction, the accuracy of the traditional methods is not comparable to that of the CNN. Compared with the traditional CNN, dilated convolution and the attention block each improve the accuracy to a certain extent. The proposed method is inferior to 1D-SECNN when the SNR is above zero, but it achieves the highest accuracy in high-noise environments. This may indicate that the proposed method has strong anti-interference ability and domain adaptability.
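The noise injection used in these experiments, adding white Gaussian noise scaled to a target SNR, can be sketched as follows; the fixed seed and the sine test signal are only for reproducibility of the sketch.

```python
import numpy as np

def add_noise(signal, snr_db):
    """Add white Gaussian noise so the result has the target SNR in dB:
    SNR_dB = 10 * log10(P_signal / P_noise)."""
    p_signal = np.mean(signal ** 2)
    p_noise = p_signal / (10 ** (snr_db / 10))   # required noise power
    noise = np.random.default_rng(0).normal(0.0, np.sqrt(p_noise), signal.shape)
    return signal + noise

x = np.sin(np.linspace(0, 8 * np.pi, 1024))      # clean stand-in signal
x_noisy = add_noise(x, snr_db=0)                 # at 0 dB, noise power equals signal power
print(x_noisy.shape)
```

At SNR = 0 dB the noise carries as much power as the signal itself, which is why the raw-signal traditional methods degrade so sharply under this setting.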

Validation on the NASA Bearing Dataset
In this section, the NASA bearing data set is used to further demonstrate the superiority of the proposed method. Data from day 25 to 35 of bearing No. 3 in data set No. 1 are extracted for the experiment. A total of 9600 training samples and 9600 test samples are collected. Three kinds of bearing operation conditions are created, which are in health condition, slight degradation condition, and severe degradation condition. The sample length is 1024. The data set description is shown in Table 7. The ten methods mentioned above have been tested, and the results are shown in Table 8. Experimental results show that although the training time of the proposed method is slightly increased compared with other methods, it can more accurately distinguish the life cycle stages of bearings.

Conclusions
In this paper, a new ICNN for bearing fault diagnosis is proposed. The network can not only retain the timing relation of the original signal to the maximum extent, but also strengthen the important features and suppress the irrelevant features. The proposed algorithm is validated on two data sets. The results show that the proposed algorithm achieves the highest average accuracy, which is better than traditional methods and ordinary deep learning methods.