Article

A Bearing Fault Diagnosis Method Based on Dilated Convolution and Multi-Head Self-Attention Mechanism

1 College of Software, Xinjiang University, Urumqi 830091, China
2 College of Mechanical Engineering, Xinjiang University, Urumqi 830017, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(23), 12770; https://doi.org/10.3390/app132312770
Submission received: 26 October 2023 / Revised: 15 November 2023 / Accepted: 22 November 2023 / Published: 28 November 2023

Abstract

Rolling bearings serve as the fundamental components of rotating machinery. Failure to detect damage early in these components can result in equipment shutdown, leading not only to economic losses but also to threats to worker safety. Given the diverse range of rotating parts, it is crucial to promptly identify and accurately diagnose early bearing failures during the maintenance of large-scale machinery. To achieve quick and precise fault diagnosis, this study proposes a method based on dilated convolution, a Bidirectional Gated Recurrent Unit (BiGRU), and a multi-head self-attention mechanism. A key advantage lies in its ability to process raw 1D sampled data directly, without requiring complex time–frequency domain conversion. To validate the model's accuracy and stability, we conducted empirical studies using both the HUST bearing dataset proposed by Thuan and Hong and the CWRU bearing dataset from Case Western Reserve University. The results demonstrate that our model achieves both an accuracy and an F1-score of 99.94% on the test set when dealing with multiple operating conditions for all five bearing types in the HUST dataset. When applied to the CWRU dataset, these two metrics even reach 99.95%. Furthermore, the proposed model achieves a prediction accuracy of more than 98.5% on two datasets containing different types of noise and different levels of white Gaussian noise, highlighting its great potential in practical applications of early bearing fault diagnosis.

1. Introduction

With the progress of industry, rotating machinery has been widely applied in fields such as aviation and agriculture. Bearings are an integral component of any rotating equipment, and healthy bearings are crucial for its unhindered operation. According to surveys, roughly 40–50% of motor faults in rotating machinery originate from rolling bearings [1,2]. Rolling bearing faults can cause economic losses of up to hundreds of millions of yuan and, at crucial moments, may cause loss of life. Research on the state diagnosis of rolling bearings has therefore attracted scholarly attention, and scholars have promoted industrial innovation by studying advanced technologies and analyzing data.
Bearing diagnosis appeared as early as the beginning of industrialization in the 1950s, when it relied mainly on intuitive human observation and auditory judgment. As industry progressed, vibration analysis technology came into use for diagnosing bearing faults [3]. This technique assesses a bearing's condition by analyzing its vibration signal. However, it requires professional equipment and expertise, places high demands on operators, and cannot accurately and efficiently locate and identify faults, fault types, and fault severities; in particular, it lacks early warning capability [4].
With the advent of computer technology, digital signal processing has become integral to bearing fault diagnosis. Traditional data-driven fault diagnosis methods encompass three key steps: feature extraction, feature selection, and classification. Feature extraction typically converts the collected time domain data into frequency domain or time–frequency domain data through mathematical transforms to obtain valuable features. The Fast Fourier Transform (FFT) [5] is a typical example of such a transform, while the wavelet transform (WT) [6] and the Hilbert transform (HT) and its variants [7] are commonly used to generate time–frequency domain data from time domain data. The feature selection phase preprocesses the dataset before sending it to the classifier, further narrowing the feature set. Classical algorithms include Binary Grey Wolf Optimization (BGWO) [8], Ant Colony Optimization (ACO) [9], and Binary Particle Swarm Optimization, among others [10]. In the classification stage, classical algorithms include machine learning (ML) [11] methods such as the Support Vector Machine (SVM) [12], Random Forest (RF) [11,13], and K-Nearest Neighbor (KNN) [14].
Traditional fault diagnosis methodologies often grapple with limited model generalization, challenging human interaction, and extensive workload when confronted with voluminous data. In response to these challenges and spurred by the big data revolution, scholars are increasingly adopting deep learning (DL) [15] techniques for rotating machinery fault diagnosis. Deep learning excels in robust feature extraction, automates data representation, and handles vast datasets with notable flexibility. For example, Liu et al. converted the original vibration signal into a two-dimensional time–frequency image (TF-image) through a differential continuous wavelet transform [16] and used this image as the network input to realize gear fault diagnosis. The advantages of graph convolutional neural networks are also apparent: images provide rich spatial and frequency information [17] that supports more in-depth research. However, converting the original signal into an image adds preprocessing and computation cost and can introduce errors in the preprocessing step. Compared with image-based processing, a one-dimensional deep learning model can be trained directly on the original data without additional transformation; it is computationally cheaper and more lightweight [18]. These advantages also make one-dimensional models better suited to practical fault diagnosis applications.
Recently, extensive research has been conducted on one-dimensional deep learning for fault diagnosis. Q. Cheng et al. developed a model combining a Wavelet Convolutional Neural Network (WCNN) with a Bidirectional Gated Recurrent Unit (BiGRU) named WCNN-BiGRU, which achieved over 99% diagnostic accuracy [19]. However, the model’s resistance to noise has yet to be tested. L. Guo et al. proposed a new intelligent fault diagnosis model (IDCNN-GRU) combining improved Deep Convolutional Neural Networks and Gated Recurrent Units [20], and the fault recognition accuracy of this model reached 97.9% under the interference of multiple loads and intense noise. Furthermore, researchers have successfully integrated the attention mechanism into the traditional deep learning framework to address model limitations and enhance holistic feature extraction. For instance, Cheng et al. introduced an enhanced CBAM-1DCNN model for rolling bearing fault diagnosis, which achieved optimal fitting with fewer training iterations and exhibited superior fault identification accuracy and generalizability [21]. Zhang et al. designed a model that combines a one-dimensional Convolutional Neural Network with a self-attention mechanism for detecting natural gas pipeline leakage faults [22], significantly outperforming traditional methods by emphasizing critical fault information. Moreover, Ge et al. presented a deep learning model that leverages a temporal convolutional network and an attention mechanism for real-time motor fault diagnosis, which outperformed both 1D-CNN and Temporal Convolutional Networks (TCN) in multi-motion dataset experiments [23].
While deep learning has advanced fault diagnosis significantly, key challenges persist. First, datasets such as CWRU [24] provide data for only two types of bearings, restricting the breadth of research. There is a notable gap in studies on mixed faults across multiple bearing types, and it remains to be determined whether models can retain their accuracy in such complex scenarios. Second, traditional CNNs and RNNs often struggle with long-sequence data dependencies. Third, despite improvements in accuracy, speed, and noise immunity, most models still lack holistic optimization across all three.
In response to these challenges, this paper introduces several advancements:
(1) Utilization of Thuan Nguyen et al.’s practical bearing datasets from HUST [25], demonstrating our model’s high accuracy in diagnosing hybrid faults across five distinct bearing types.
(2) A fault diagnosis model that combines ACN (our dilated-convolution CNN module), BiGRU, and a multi-head self-attention mechanism to manage long-distance dependencies in sequence data effectively.
(3) Our model exhibits rapid convergence, superior accuracy, and noise immunity performance compared to recent state-of-the-art models.
These contributions signify a substantial leap forward in deep learning applications for fault diagnosis, showcasing the potential for more reliable and efficient predictive maintenance solutions.
The remainder of this paper is structured as follows. Section 2 provides a detailed introduction to Convolutional Neural Networks (CNNs), Gated Recurrent Units (GRU), and the self-attention mechanism. Section 3 presents our experimental methodology. Section 4 introduces the two datasets used in the study, details the experimental protocol and setup, and presents our model's experimental results. Section 5 discusses the experiments. Section 6 concludes the paper with a summary of the experiments.

2. Relevant Methodology

Given that the suggested model pertains to Convolutional Neural Networks and the self-attention mechanism, there is a need to elucidate associated methodologies before delineating the model’s structure.

2.1. Convolutional Neural Networks (CNNs)

Convolutional Neural Networks (CNNs) [26,27] are increasingly utilized in diverse fields such as image recognition, language interpretation, and speech recognition. Compared to traditional Multi-Layer Perceptron (MLP) [28], CNNs reduce the number of trainable parameters by using a specific number of convolutional kernels, effectively diminishing the complexity of the model. This mechanism of weight sharing not only mitigates the risk of overfitting but also enhances the generalization capability of the model. Incorporating pooling operations within the CNN architecture further reduces the number of neurons, bolstering the robustness of the model against variations in the input space. Additionally, the scalability of CNNs enables the construction of deep network structures. Such deep models possess a greater expressive capacity to handle more complex classification challenges.
Shao et al. [29] proposed an adaptive 1D-CNN autoencoder that adeptly extracts rich and robust features from raw vibration signals, a critical aspect of fault diagnosis. This method effectively minimizes the shift between different data domains when combined with CORAL. It allows the model to accurately and promptly detect cross-domain faults without needing target domain label samples. Extensive experimental testing has validated the efficacy and practicality of this method. In summary, structural optimizations in CNNs make these models more efficient and robust. Meanwhile, the novel autoencoder approaches demonstrate formidable performance in specific applications, such as fault diagnosis.

2.2. GRU

The Gated Recurrent Unit (GRU), a variant of the Recurrent Neural Network (RNN), was introduced by Cho et al. in 2014 to address the long-term dependency challenges of RNNs [30,31]. GRUs employ a gating mechanism to retain information effectively. In a GRU, the weight matrices and bias vectors are initialized and then fine-tuned during training. The critical components of a GRU are the update gate, which determines how much old hidden state information is retained when constructing the new state, and the reset gate, which similarly controls how much old hidden state information enters the candidate hidden state. The candidate hidden state incorporates the new input information, and the final new hidden state is a blend of the old hidden state and the candidate hidden state, modulated by the update gate.
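For reference, the standard GRU update equations make these gates concrete, where $\sigma$ is the logistic sigmoid and $\odot$ denotes element-wise multiplication:

$$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z) \qquad \text{(update gate)}$$
$$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r) \qquad \text{(reset gate)}$$
$$\tilde{h}_t = \tanh\big(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\big) \qquad \text{(candidate hidden state)}$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \qquad \text{(new hidden state)}$$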
In a practical application, Liu et al. [32] demonstrated the efficacy of an end-to-end bearing diagnostic classification model based on a 1D Convolutional Neural Network (CNN) and a GRU. This model leverages the synergy of GRUs and 1DCNNs to extract spatial features from vibration signals and learn time series characteristics, enhancing data representation. This approach also reduces the complexity of the diagnostic process by diminishing the reliance on specialized knowledge typically required in traditional feature extraction methods.
The GRU model structure is shown in Figure 1.
The Bidirectional Gated Recurrent Unit (BiGRU) is a variant derived from the GRU. In a BiGRU, two separate GRU units perform the forward and backward computations, respectively. Each unit has independent parameters: forward and backward update gates, reset gates, and the weights and biases of the candidate hidden states. A BiGRU therefore contains two independent sets of GRU parameters. At the end, the forward and backward hidden states are concatenated to obtain bidirectional hidden states, which better capture the information in the sequence.
The basic structure of a BiGRU is shown in Figure 2.
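As a minimal illustration (the dimensions here are assumptions, not the authors' exact settings), a BiGRU with concatenated forward and backward hidden states can be instantiated in PyTorch as follows:

```python
import torch
import torch.nn as nn

# Bidirectional GRU: two independent GRUs traverse the sequence forward and backward.
bigru = nn.GRU(input_size=256, hidden_size=128, batch_first=True, bidirectional=True)

x = torch.randn(64, 100, 256)  # (batch, sequence length, features)
out, _ = bigru(x)              # (64, 100, 256): forward and backward hidden states
                               # are concatenated, giving 2 * 128 = 256 features
```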

2.3. Self-Attention Mechanism

The attention mechanism mainly allocates resources by assigning weights to the state information sequence output by the upper network, automatically filtering important information and discarding interference so as to fully grasp the core of the input content. In the design of a deep learning network, an attention mechanism can be employed to adaptively assign weights to diverse fault signal characteristics, thereby emphasizing crucial data and suppressing irrelevant features [33].
The self-attention mechanism [34] is a variant of the attention mechanism specifically designed to handle sequence data. It calculates the correlation between each element and all other elements in the sequence, thus creating a contextual representation for each one. This approach allows the model to consider the entire sequence's information when generating the representation of each element, and it also enables the handling of input sequences of varying lengths. The mechanism can thus capture the interaction between two different positions in the same sequence, pay more attention to the characteristics of the data itself and the internal interactions within the data, reduce dependence on external information, and improve information utilization. This helps the model capture global dependencies, which traditional models such as RNNs find hard to do.
We use $x_1, x_2, x_3, \ldots, x_t$ to denote the data processed by the BiGRU layer. Then, we multiply them with three matrices $W_q$, $W_k$, $W_v$ to obtain $q_i$, $k_i$, $v_i$, $i \in \{1, 2, 3, \ldots, t\}$. Next, the $b_i$ corresponding to each $x_i$ is calculated. Figure 3 displays the calculation process, and Equation (1) illustrates the principle of the calculation:
$$b_i = \sum_{j=1}^{t} \mathrm{softmax}\!\left(\frac{q_i^{\top} k_j}{\sqrt{d_{q,k}}}\right) v_j \qquad (1)$$
For the input sequence $x_1, x_2, x_3, \ldots, x_t$, the self-attention mechanism can compute the outputs $b_1, b_2, b_3, \ldots, b_t$ in parallel, which greatly improves the speed of feature extraction for the input sequence. Figure 4 presents the calculation process.
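To make Equation (1) concrete, the following sketch computes scaled dot-product self-attention for all positions at once (dimensions are chosen only for illustration):

```python
import torch
import torch.nn.functional as F

def self_attention(x, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence x of shape (t, d)."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v                    # queries, keys, values
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5  # (t, t) pairwise scores
    weights = F.softmax(scores, dim=-1)                    # attention weights per position
    return weights @ v                                     # b_1 ... b_t, computed in parallel

t, d = 16, 32
x = torch.randn(t, d)                                      # stand-in for the BiGRU output
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))
b = self_attention(x, W_q, W_k, W_v)                       # shape (16, 32)
```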

3. Detailed Methods

This section details the proposed fault diagnosis method based on ACN_BM, whose overall architecture is shown in Figure 5.
This model integrates an enhanced ACN module, a BiGRU module, and a multi-head self-attention mechanism. ACN focuses on extracting relevant short-term features and filtering out non-essential elements. The BiGRU module tackles long-term dependencies in data, while the multi-head self-attention mechanism robustly captures inter-sequence relations, bolstering noise resistance. To address potential overfitting, especially with the smaller CWRU dataset, data augmentation and preventive measures like L2 regularization and early stopping were employed.

3.1. ACN

ACN, as an advanced 1D CNN model, features novel input/output designs and incorporates dilated convolutions and altered activation functions in its convolutional layers.
Dilated convolution is a special convolution operation in Convolutional Neural Networks whose kernel structure differs from that of traditional convolution. Compared to regular convolution, dilated convolution introduces a parameter called the "dilation rate", which determines the spacing between the elements of the convolutional kernel. Dilated convolution is typically used to enlarge the receptive field while keeping the number of parameters small.
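As an illustration (layer sizes are assumptions, not the paper's exact configuration), a 1D convolution with a dilation rate of 2 covers a five-sample span with only three weights per channel:

```python
import torch
import torch.nn as nn

# kernel_size=3 with dilation=2 skips every other input sample,
# widening the receptive field without adding parameters.
dilated = nn.Conv1d(in_channels=1, out_channels=64, kernel_size=3,
                    dilation=2, padding=2)  # padding preserves the sequence length

x = torch.randn(8, 1, 2048)  # (batch, channels, samples)
y = dilated(x)               # (8, 64, 2048)
```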
Popular two-dimensional neural networks, such as VGGNet and ResNet, stack 3 × 3 convolutional kernels. This design deepens the network and achieves a larger receptive field with fewer parameters, thereby suppressing overfitting. However, for one-dimensional vibration signals, a stack of two 3 × 1 convolution layers achieves only a 5 × 1 receptive field at the cost of six weights, turning the aforementioned advantage into a disadvantage. Network structures designed for computer vision are therefore unsuitable for the bearing fault diagnosis field.
The first convolutional kernel of our ACN model is 7 × 1, aiming to extract short-term features in a manner similar to the short-time Fourier transform. The difference is that the short-time Fourier transform uses a fixed sinusoidal window function, whereas the large first-layer kernel of ACN is learned through optimization. Its advantage is that it automatically learns diagnosis-oriented features and automatically excludes features that do not help in diagnosis.
To enhance the expressive power of ACN, the convolutional kernel size for all layers except for the first layer is 3 × 1. Due to the small number of kernel parameters, this is beneficial for deepening the network and suppressing overfitting. After each convolution operation, batch normalization is performed, followed by 2 × 2 max-pooling to remove unnecessary features.

3.2. Improved ReLU Functions

Bearing fault signal samples contain negative values. The regular ReLU function sets inputs less than zero to zero while keeping positive values unchanged; this addresses the vanishing gradient problem and speeds up training, but it discards the information carried by negative inputs. In a PReLU (Parametric Rectified Linear Unit), α is a learnable parameter: inputs less than zero are scaled by a small learned slope rather than clipped to a fixed zero. This enables the neural network to fit the data better, as shown in Equation (2).
$$\mathrm{PReLU}(x) = \begin{cases} x, & x > 0 \\ \alpha x, & x \le 0 \end{cases} \qquad (2)$$
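In PyTorch, for example, this corresponds to nn.PReLU, whose slope α is learned jointly with the other network weights:

```python
import torch
import torch.nn as nn

prelu = nn.PReLU(init=0.25)  # alpha starts at 0.25 and is updated by backpropagation
x = torch.tensor([-2.0, -0.5, 0.0, 1.0, 3.0])
print(prelu(x))              # negative inputs are scaled by alpha instead of zeroed
```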

3.3. Multi-Head Self-Attention Mechanism

The multi-head self-attention mechanism [35] employs numerous attention heads to compute a variety of weighted associations. We have carefully calibrated the mechanism in our design to maintain a consistent output dimension to input dimension ratio across each head. This is achieved by dividing the input dimension by the number of heads, ensuring that the total output dimension, when multiplied by the number of heads, equals the input dimension. Increasing the number of heads while reducing the dimensionality of each holds the total number of parameters steady but may raise computational complexity due to the increased overhead from parallel processing. Nevertheless, this complexity is justified as it boosts the representational power of the model by enabling it to capture information across a broader range of subspaces. We have chosen to configure our mechanism with four attention heads.
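To make the head-dimension arithmetic concrete (the sizes here are assumptions for illustration): with an input dimension of 256 and four heads, each head attends within a 64-dimensional subspace, and the concatenated outputs recover the input dimension:

```python
import torch
import torch.nn as nn

embed_dim, num_heads = 256, 4  # per-head dimension = 256 / 4 = 64
mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

x = torch.randn(64, 128, embed_dim)  # (batch, sequence, features)
out, _ = mha(x, x, x)                # self-attention: query = key = value = x
print(out.shape)                     # torch.Size([64, 128, 256])
```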
The structure of the multi-head self-attention mechanism is shown in Figure 6. By adopting such a structure, the multi-head self-attention mechanism outperforms the traditional self-attention mechanism by capturing more nuanced inter-positional relationships and offering a richer informational expression. It excels by processing all sequence elements in parallel and capturing diverse features across various representational subspaces. This leads to a more comprehensive capture of associative information within the sequence, resulting in superior performance in more complex processing tasks.

3.4. ACN_BM Model

This paper used four convolutional layers with a BiGRU module and a multi-head self-attention mechanism to build our model. Among them, the convolutional layer mainly extracts short-time features and removes features that are not helpful for diagnosis. After the data are processed in the convolutional layer, the BiGRU module can fully solve the long dependencies within the input data. Then, the multi-head self-attention mechanism is introduced, which can capture the correlation information between the input sequences more comprehensively to have better anti-interference abilities. To reduce the possibility of overfitting, the model adds two dropout layers at the end of the last convolutional layer and after the BiGRU layer, respectively, and uses L2 regularization and dynamically adjusts the learning rate during training. Finally, the number of parameters was reduced using global average pooling to achieve accurate classification. See Table 1 for a detailed description of the model parameters.
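A simplified skeleton of such an architecture is sketched below. Layer sizes follow Table 1 only loosely, and details such as padding and dilation rates are assumptions for illustration rather than the authors' exact implementation:

```python
import torch
import torch.nn as nn

class ACN_BM_Sketch(nn.Module):
    """Illustrative skeleton: dilated convolutions + BiGRU + multi-head self-attention."""
    def __init__(self, n_classes=7):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=7, dilation=2, padding=6),  # wide first kernel
            nn.BatchNorm1d(64), nn.PReLU(), nn.MaxPool1d(2),
            nn.Conv1d(64, 128, kernel_size=3, dilation=2, padding=2),
            nn.BatchNorm1d(128), nn.PReLU(), nn.MaxPool1d(2),
            nn.Dropout(0.3),
        )
        self.bigru = nn.GRU(128, 64, batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(0.5)
        self.attn = nn.MultiheadAttention(embed_dim=128, num_heads=4, batch_first=True)
        self.fc = nn.Linear(128, n_classes)

    def forward(self, x):            # x: (batch, 1, 2048) raw vibration windows
        x = self.conv(x)             # (batch, 128, 512) short-term features
        x = x.transpose(1, 2)        # (batch, 512, 128) for the recurrent layer
        x, _ = self.bigru(x)         # (batch, 512, 128): 2 * 64 concatenated states
        x = self.dropout(x)
        x, _ = self.attn(x, x, x)    # multi-head self-attention over the sequence
        x = x.mean(dim=1)            # global average pooling over time
        return self.fc(x)

model = ACN_BM_Sketch()
logits = model(torch.randn(8, 1, 2048))  # (8, 7) class scores
```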

4. Experimental Demonstration

In this paper, the performance of the model under different working conditions of five types of bearings is verified on the HUST dataset, and the CWRU dataset is used for comparative verification of each model. Once the comparative verification is complete, experimental results are obtained for all aspects and visualized to convey the model's performance more intuitively. The experimental platforms of the two datasets are shown in Figure 7.

4.1. Experimental Dataset

4.1.1. CWRU Dataset

The CWRU dataset [24,36] is a public dataset from Case Western Reserve University in the United States and a world-recognized standard dataset for bearing fault diagnosis. Much of the existing research is based on this dataset, which is the main reason we chose it for comparing the current model against others.
The CWRU dataset contains one normal condition and three fault conditions affecting the bearing's inner ring, outer ring, and ball. The faults were introduced artificially: single-point faults were seeded into the test bearings using electrical discharge machining (EDM), with fault diameters of 7, 14, 21, and 28 mils. SKF bearings were used for the 7, 14, and 21 mil faults, and NTN equivalent bearings for the 28 mil faults. In this paper, to verify the training effect of the model on different bearings under complex working conditions, we divided the data into 12 categories in total and used the data sampled at 12 kHz as the comparison dataset for this experiment.

4.1.2. HUST Dataset

The HUST [23] dataset is a practical dataset for ball bearing fault diagnosis proposed by Nguyen et al. in 2023. The bearing defects were generated manually: a wire-cutting method was used to create microcracks 0.2 mm wide (see Figure 8) to simulate early faults. The data were collected from a motor operating at low load, with a sampling rate of 51,200 samples per second. A distinctive feature of this dataset is its combined defects: it accounts for the fact that damage at one location of a bearing often induces damage elsewhere through mechanical interaction. The dataset therefore contains 90 raw vibration records covering six defect types (inner cracks, outer cracks, ball cracks, and their pairwise combinations) across five bearing types under three working conditions.

4.2. Test Scheme

Since the HUST utility bearing dataset is newly available, we first let the model perform preliminary experiments on the CWRU rolling bearing dataset (Experiment I). We then used the HUST bearing dataset to investigate whether our model retains high accuracy when facing mixed failures across different bearing types (Experiment II). The third experiment is the noise immunity experiment. The reasons for these three experiments are as follows.
  • In the initial validation of the model's performance on the CWRU dataset, we performed data augmentation because the CWRU dataset contains less data than the other dataset. We used sliding window sampling to increase each sample type to 800 (a sketch of this augmentation follows this list). To verify the model's training effect on different bearings under complex working conditions, we divided the data into 12 classes. Subsequently, we compared various models to demonstrate the model's efficiency and included ablation experiments to corroborate the indispensability of each component of the model.
  • The HUST dataset was sufficiently sampled, so no data augmentation was performed. Here, the data were divided into seven categories in total, and for experimental rigor we kept all other variables the same. Finally, we compared five models to show that our model makes a modest contribution to fault recognition across the five bearing types.
  • In actual operational settings, a fault diagnosis model for rotating machinery must contend with the noise produced by mutual oscillation and abrasion among the machine's components. Such conditions make the vibration signals detected by the sensors susceptible to noise pollution, blurring the fault details contained in these signals. According to surveys, the noise in diagnostic signals is generally additive white Gaussian noise [36]. Therefore, in this paper, different levels of additive white Gaussian noise were added to the test set's signals to test the trained model's anti-noise performance.
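As noted in the first item of the list above, a minimal sketch of the sliding-window augmentation follows; the stride here is derived from the target of 800 windows per class, as the paper does not state its exact stride:

```python
import numpy as np

def sliding_window_augment(signal, window=2048, n_samples=800):
    """Cut n_samples overlapping fixed-length windows from one long vibration record."""
    stride = max(1, (len(signal) - window) // (n_samples - 1))
    return np.stack([signal[i * stride: i * stride + window] for i in range(n_samples)])

record = np.random.randn(121_000)         # stand-in for one raw CWRU record
samples = sliding_window_augment(record)  # shape (800, 2048)
```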

4.3. Test Setting

The model proposed in this paper was trained on a computer running Windows 10, equipped with an AMD Ryzen 7 5800H CPU, 16 GB of memory, and an RTX 3060 6 GB graphics card.
The ratio of the training, validation, and test sets in the experiment is 7:1:2. Since the authors of the HUST dataset used 2048 sampling points per sample, we also set the number of sampling points per sample to 2048 when comparing with other models. The batch size is 64. We adopted dynamic learning rate adjustment: when the validation loss did not decrease for five consecutive epochs, the current learning rate was multiplied by 0.1 for further learning. If the learning rate had already been reduced to its minimum and the validation loss showed no improvement over the subsequent ten epochs, training ended early. This approach helps prevent overfitting and speeds up training convergence to some extent.
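This schedule can be reproduced, for example, with PyTorch's ReduceLROnPlateau plus a simple early-stopping counter; the sketch below is an illustration under those assumptions, not the authors' exact training loop:

```python
import torch
import torch.nn as nn

model = nn.Linear(2048, 7)  # stand-in; any nn.Module works here

def validation_loss():      # placeholder for a real validation pass
    return 1.0

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=5, min_lr=1e-6)

best_loss, epochs_no_improve = float("inf"), 0
for epoch in range(100):
    val_loss = validation_loss()
    scheduler.step(val_loss)  # multiply lr by 0.1 after 5 epochs without improvement
    if val_loss < best_loss:
        best_loss, epochs_no_improve = val_loss, 0
    else:
        epochs_no_improve += 1
    # stop once the lr has hit its floor and 10 further epochs bring no improvement
    if epochs_no_improve >= 10 and optimizer.param_groups[0]["lr"] <= 1e-6:
        break
```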

4.4. Results

4.4.1. Experiment I: Preliminary Experiments on the CWRU Dataset

We conducted experiments comparing five models: the traditional MLP model, the FCN model, the 1DCNN+LSTM model, the 1DCNN+BiGRU model without the self-attention mechanism, and the improved 1DCNN+GRU model released by L. Guo et al. last year [20].
Figure 9 shows the accuracy of each model on the test set. Although the traditional MLP model still had room for growth, its results were not ideal after 20 iterations. The FCN model produced better results than the MLP, but its training was not stable overall. The 1DCNN+LSTM model performed well compared with the FCN and MLP models. However, the improved 1DCNN+GRU model by L. Guo et al. [20] performed better still, reaching an accuracy of more than 95% by the 10th epoch and continuing to grow slowly. The red solid line represents our model, which reached 98% accuracy by its fourth epoch; compared with the 1DCNN_BiGRU model, from which the attention mechanism has been removed, our model converges more quickly. After the eighth iteration, however, the two curves almost overlap, with accuracy fluctuating within a small range around 99.9%. Here, we can see the excellent performance of the 1DCNN_BiGRU_Att model.
To achieve a better view of the data distribution across the models at each iteration, we grouped the box plots of the models. The four parameters in the box plots are also the indicators by which we evaluate the models. Since the MLP and FCN models performed poorly in the accuracy plots above, only the box plots of the remaining four models are compared here; the details are shown in Figure 10. The accuracy, precision, recall, and F1-score of each model are evident in these four box plots, along with the outliers, median, upper quartile, and lower quartile of these metrics. Our model performed very well overall, with the fewest outliers of all the models; its upper quartile, lower quartile, and median almost coincide and are close to 1.0. Although this internationally recognized bearing dataset covers only two types of bearings and is not balanced, these results preliminarily demonstrate that the model diagnoses faults well across multiple bearing types under complex working conditions.

4.4.2. Experiment II: To Verify the Performance of the Model on Different Bearings

The accuracy curves of our models running on the HUST dataset are shown in Figure 11. In these models, the ACN_BM_2 model is a version of the ACN_BM model with the improved ReLU function removed, also known as an ablated model. The ACN_BM_1 model, on the other hand, is an ablated model of ACN_BM that has removed dilated convolutions, dilation rates, and the improved ReLU function. The 1DCNN_BM model is a variant of the ACN_BM model, where a regular CNN convolution module has replaced the ACN module. The 1DCNN_BiGRU model has removed the multi-head self-attention mechanism module from the 1DCNN_BM model.
The figure shows that the ACN_BM (represented by a red line) converged very quickly, stabilizing by the sixth iteration, and its accuracy had already exceeded 99.8%. When comparing the ACN_BM, ACN_BM_1, and ACN_BM_2 models, we find that the ACN_BM_2 model (represented by a light green dotted line) does not have as high prediction accuracy as the ACN_BM model in the first two iterations of the process, indicating that the improved ReLU function learned more features during training. However, compared to the ACN_BM_1 model (represented by a blue dotted line), the ACN_BM_2 model performed better in the first two iterations, indicating that dilated convolution helped the model capture more short-term features at the beginning of the training.
When comparing the ACN_BM model with the 1DCNN_BM model, we can see that the ACN convolution module we designed had a much higher prediction accuracy in the first two iterations of training than the regular CNN convolution, indicating that the ACN convolution module is more efficient at learning features. L. Guo et al.'s [20] 1DCNN+GRU model also performed well under the influence of a large 64 × 1 convolution kernel, but it still lagged slightly behind our model.
These models all performed well in terms of prediction accuracy on the HUST dataset, and there is not much difference compared with the training results on the CWRU dataset. This also effectively demonstrates that these models maintain a high degree of accuracy in diagnosing different types of bearings.

4.4.3. Experiment III: Noise Immunity Experiment of the Model

Since most of the noise is white Gaussian noise, the standard for evaluating the strength of noise in a signal is the signal-to-noise ratio (SNR). Let $P_s$ and $P_n$ denote the energies of the signal and noise, respectively; the SNR formula is given in Equation (3). The more noise the signal contains, the smaller the SNR value. When the SNR is 0 dB, the signal and noise contain equal energy.
$$\mathrm{SNR_{dB}} = 10 \log_{10}\!\left(\frac{P_s}{P_n}\right) \qquad (3)$$
In addition to Gaussian noise, practical applications may encounter other types of noise. We have also incorporated non-stationary signals and harmonic interference into our analysis. To simulate non-stationary signals, we added a component to the original signal that changes linearly over time. This addition alters the statistical properties, such as mean and variance, making them time-variant. Furthermore, we generated harmonic interference by defining specific frequencies, amplitudes, and phases to disrupt the periodicity of the original signal. This process mimics the variety of interferences encountered in real-world settings.
In the noise immunity experiment, we first normalized the two datasets. We measured the noise using the signal-to-noise ratio, setting the SNR in the range of −10 dB to 6 dB at 2 dB intervals, and added the corresponding levels of white Gaussian noise to the normalized data. We then added the non-stationary components and harmonic interference. Finally, the ratio of the training, validation, and test sets was 7:1:2, the number of training iterations was 30, and the learning rate was set to 0.001.
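A sketch of this noise-injection step is shown below; the drift slope and harmonic parameters are assumptions for illustration, as the paper does not list exact values:

```python
import numpy as np

def corrupt(signal, snr_db, fs=51_200, rng=None):
    """Add AWGN at a target SNR plus a linear drift and a harmonic interferer."""
    rng = np.random.default_rng(0) if rng is None else rng
    p_signal = np.mean(signal ** 2)
    p_noise = p_signal / 10 ** (snr_db / 10)           # from SNR_dB = 10*log10(Ps/Pn)
    noise = rng.normal(0.0, np.sqrt(p_noise), signal.shape)
    t = np.arange(len(signal)) / fs
    drift = 0.05 * t                                   # non-stationary linear component
    harmonic = 0.1 * np.sin(2 * np.pi * 50 * t + 0.3)  # fixed-frequency interference
    return signal + noise + drift + harmonic

noisy = corrupt(np.random.randn(2048), snr_db=-8)
```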
Figure 12a shows the t-SNE dimensionality reduction visualization of the HUST data after preprocessing, which included the addition of the various noises described above. Figure 12b shows the t-SNE visualization after the model has processed and classified the data.
The training sets of the two datasets were used to train the model, and the prediction accuracy of the four models on the noisy test sets was then compared. The comprehensive average accuracy and standard deviation over five experiments are shown in Table 2.
As illustrated in Table 2, the traditional models take a long time to run and their diagnostic accuracy is low. Although the diagnostic results of the 1DCNN-based LSTM and GRU models are close to 90%, such results may cause trouble when applied in industrial settings. Our model performs well on both datasets, with prediction accuracies above 98%, while the variant without the multi-head self-attention mechanism reaches roughly 95%. The gap between the two indicates that the multi-head self-attention mechanism contributes substantially to noise immunity. In practice, because the model learns to handle diverse noisy data, its robustness and adaptability in real applications improve.
After conducting basic anti-noise experiments, we evaluated the performance of the multi-head attention mechanism. We are now set to further investigate the effect of varying the number of heads within the multi-head self-attention framework.
Our internal configuration of the multi-head attention mechanism keeps the number of parameters constant at 686,023 regardless of the number of heads, occupying 5359.5 KB of memory. The experimental results in Table 3 show that, under identical noise conditions, the multi-head self-attention mechanism outperforms its single-head counterpart owing to its capacity to learn a broader range of representations. At an SNR of 0, increasing the number of heads yields a slight but incremental improvement in test-set prediction accuracy, while at an SNR of −8 it produces a substantial increase. However, the gain from expanding from four to eight heads is marginal. Although the parameter count is unchanged, we therefore set the number of heads to four, taking into account the computational overhead associated with the parallel processing of multiple heads.

5. Discussion

During our research, we adjusted the parameters of various models and selected an optimal set of parameters, thus training a decent model. Despite this, several issues remain that warrant further exploration:
Data collection: How can we enable sensors to gather the most complete signals from numerous mechanical components? Additionally, the details concerning data transmission are worth examining. These aspects are critical to improving the data quality used in our models.
Unknown factors: Given the diverse and complex industrial fault modes, there might still be unidentified factors affecting the accuracy of our models in real-world applications. Exploring these unknown elements is necessary to improve our diagnostic tools.
Fault diagnosis techniques: After years of development, most fault diagnosis models have achieved high accuracy. However, we need to revisit weak fault feature extraction methods based on advanced signal processing techniques as they are crucial for intelligent fault diagnosis. One example is the noise-enhanced weak signal detection of fractional nonlinear systems and its applications in mechanical fault diagnosis. However, this area requires specialized domain knowledge to support its development.
In summary, despite the progress we have made, many issues warrant deeper research and discussion. Through continuous effort and exploration in the future, we will gain a deeper understanding of these issues and further enhance the performance of our model and its applications.

6. Conclusions

Our experiments show that multiple models have achieved high accuracy when predicting the seven faults of five bearing types on the HUST dataset. Most models can diagnose faults of various bearing types accurately. The proposed model achieves more than 99.9% prediction accuracy when predicting bearing faults. Compared to other models, our method exhibits faster convergence and higher accuracy, stabilizing above 99% in only five to six cycles without introducing any noise. In addition, when facing different types of noise and Gaussian white noise data containing different signal-to-noise ratios, the model’s prediction accuracy reaches more than 98.5%, and it still has good robustness in the face of complex environments. These results highlight the great potential of the proposed model for early bearing fault diagnosis applications.

Author Contributions

Conceptualization, P.H. and J.Z.; methodology, P.H.; validation, P.H. and J.Z.; formal analysis, Y.T.; investigation, Y.L.; resources, P.H.; data curation, P.H.; writing—original draft preparation, P.H.; writing—review and editing, P.H.; visualization, P.H.; supervision, Z.J. and Y.T.; project administration, J.Z.; funding acquisition, J.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Key Research and Development Program of Xinjiang Uygur Autonomous Region, grant number 2022B02038.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are available on request from the authors. These data were derived from a dataset that is already publicly available in the relevant field, hence we did not provide them again in our manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Nandi, S.; Toliyat, H.A.; Li, X. Condition Monitoring and Fault Diagnosis of Electrical Motors—A Review. IEEE Trans. Energy Convers. 2005, 20, 719–729. [Google Scholar] [CrossRef]
  2. Mohammad-Alikhani, A.; Pradhan, S.; Dhale, S.; Mobarakeh, B.N. A Variable Speed Fault Detection Approach for Electric Motors in EV Applications based on STFT and RegNet. In Proceedings of the 2023 IEEE Transportation Electrification Conference & Expo (ITEC), Detroit, MI, USA, 21–23 June 2023; pp. 1–5. [Google Scholar]
  3. Alonso-González, M.; Díaz, V.G.; Pérez, B.L.; G-Bustelo, B.C.P.; Anzola, J.P. Bearing Fault Diagnosis with Envelope Analysis and Machine Learning Approaches Using CWRU Dataset. IEEE Access 2023, 11, 57796–57805. [Google Scholar] [CrossRef]
  4. Zou, X.-L.; Han, K.-X.; Chien, W.; Gan, X.-Y.; Shi, L.-Y. Overview of Bearing Fault Diagnosis Based on Deep Learning. In Proceedings of the 2023 IEEE 3rd International Conference on Electronic Communications, Internet of Things and Big Data (ICEIB), Taichung, Taiwan, 14–16 April 2023; pp. 324–326. [Google Scholar]
  5. Brigham, E.O.; Morrow, R.E. The fast Fourier transform. IEEE Spectr. 1967, 4, 63–70. [Google Scholar] [CrossRef]
  6. Donnelly, D. The Fast Fourier and Hilbert-Huang Transforms: A Comparison. In Proceedings of the Multiconference on Computational Engineering in Systems Applications, Beijing, China, 4–6 October 2006; pp. 84–88. [Google Scholar]
  7. Yin, P.; Nie, J.; Liang, X.; Yu, S.; Wang, C.; Nie, W.; Ding, X. A Multiscale Graph Convolutional Neural Network Framework for Fault Diagnosis of Rolling Bearing. IEEE Trans. Instrum. Meas. 2023, 72, 1–13. [Google Scholar] [CrossRef]
  8. Al-Tashi, Q.; Kadir, S.J.A.; Rais, H.M.; Mirjalili, S.; Alhussian, H. Binary Optimization Using Hybrid Grey Wolf Optimization for Feature Selection. IEEE Access 2019, 7, 39496–39508. [Google Scholar] [CrossRef]
  9. Dahan, F.; El Hindi, K.; Ghoneim, A.; Alsalman, H. An Enhanced Ant Colony Optimization Based Algorithm to Solve QoS-Aware Web Service Composition. IEEE Access 2021, 9, 34098–34111. [Google Scholar] [CrossRef]
  10. Jin, N.; Rahmat-Samii, Y. Advances in Particle Swarm Optimization for Antenna Designs: Real-Number, Binary, Single-Objective and Multiobjective Implementations. IEEE Trans. Antennas Propag. 2007, 55, 556–567. [Google Scholar] [CrossRef]
  11. Lee, C.-Y.; Le, T.-A.; Lin, Y.-T. A feature selection approach hybrid grey wolf and heap-based optimizer applied in bearing fault diagnosis. IEEE Access 2022, 10, 56691–56705. [Google Scholar] [CrossRef]
  12. Kwon, B.; Kim, J.; Lee, K.; Lee, Y.K.; Park, S.; Lee, S. Implementation of a virtual training simulator based on 360° multi-view human action recognition. IEEE Access 2017, 5, 12496–12511. [Google Scholar] [CrossRef]
  13. Mohamed, N.; Baskaran, N.K.; Patil, P.P.; Alatba, S.R.; Aich, S.C. Thermal Images Captured and Classifier-based Fault Detection System for Electric Motors Through ML Based Model. In Proceedings of the 2023 3rd International Conference on Advance Computing and Innovative Technologies in Engineering (ICACITE), Greater Noida, India, 12–13 May 2023; pp. 649–654. [Google Scholar]
  14. Tian, J.; Morillo, C.; Azarian, M.H.; Pecht, M. Motor Bearing Fault Detection Using Spectral Kurtosis-Based Feature Extraction Coupled With K-Nearest Neighbor Distance Analysis. IEEE Trans. Ind. Electron. 2016, 63, 1793–1803. [Google Scholar] [CrossRef]
  15. Chen, J.; Hu, W.; Cao, D.; Zhang, M.; Huang, Q.; Chen, Z.; Blaabjerg, F. Novel data-driven approach based on capsule network for intelligent multi-fault detection in electric motors. IEEE Trans. Energy Convers. 2020, 36, 2173–2184. [Google Scholar] [CrossRef]
  16. Liu, Y.; Wen, W.; Bai, Y.; Meng, Q. Self-supervised feature extraction via time–frequency contrast for intelligent fault diagnosis of rotating machinery. Measurement 2023, 210, 112551. [Google Scholar] [CrossRef]
  17. Chen, Y.; Xiao, L.; Li, Z. Partial Domain Fault Diagnosis of Bearings under Cross-Speed Conditions Based on 1D-CNN. In Proceedings of the 2022 Global Reliability and Prognostics and Health Management (PHM-Yantai), Yantai, China, 13–16 October 2022; pp. 1–8. [Google Scholar]
  18. Cheng, Q.; Peng, B.; Li, Q.; Liu, S. A rolling bearing fault diagnosis model based on WCNN-BiGRU. In Proceedings of the 2021 China Automation Congress (CAC), Beijing, China, 22–24 October 2021; pp. 3368–3372. [Google Scholar]
  19. Guo, L.; Zhang, S.; Huang, Q. Rolling bearing fault diagnosis based on the combination of improved deep convolution network and gated recurrent unit. In Proceedings of the 2022 34th Chinese Control and Decision Conference (CCDC), Hefei, China, 22–24 October 2022; pp. 1473–1478. [Google Scholar]
  20. Xu, P.; Zhang, L. A Fault Diagnosis Method for Rolling Bearing Based on 1D-ViT Model. IEEE Access 2023, 11, 39664–39674. [Google Scholar] [CrossRef]
  21. Cheng, L.; Dong, Z.; Wang, S.; Zhang, J.; Chen, J. Based on improved one-dimensional convolutional neural network analysis of the rolling bearing fault diagnosis. J. Mech. Des. Res. 2023, 6, 126–130. [Google Scholar]
  22. Thuan, N.D.; Hong, H.S. HUST bearing: A practical dataset for ball bearing fault diagnosis. BMC Res. Notes 2023, 16, 138. [Google Scholar] [CrossRef]
  23. Smith, W.A.; Randall, R.B. Rolling element bearing diagnostics using the Case Western Reserve University data: A benchmark study. Mech. Syst. Signal Process. 2015, 64–65, 100–131. [Google Scholar] [CrossRef]
  24. Zhang, Y.; Yao, L.; Zhang, L.; Luo, H. Fault diagnosis of natural gas pipeline leakage based on 1D-CNN and self-attention mechanism. In Proceedings of the 2022 IEEE 6th Advanced Information Technology, Electronic and Automation Control Conference (IAEAC), Beijing, China, 3–5 October 2022; pp. 1282–1286. [Google Scholar]
  25. Li, X.; Zhang, W.; Ding, Q. Understanding and improving deep learning-based rolling bearing fault diagnosis with attention mechanism. Signal Process. 2019, 161, 136–154. [Google Scholar] [CrossRef]
  26. Fukushima, K. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol. Cybern. 1980, 36, 193–202. [Google Scholar] [CrossRef]
  27. Kayed, M.; Anter, A.; Mohamed, H. Classification of Garments from Fashion MNIST Dataset Using CNN LeNet-5 Architecture. In Proceedings of the 2020 International Conference on Innovative Trends in Communication and Computer Engineering (ITCE), Aswan, Egypt, 8–9 February 2020; pp. 238–243. [Google Scholar]
  28. Zhang, J.; Sun, Y.; Guo, L.; Gao, H.; Hong, X.; Song, H. A new bearing fault diagnosis method based on modified convolutional neural networks. Chin. J. Aeronaut. 2020, 33, 439–447. [Google Scholar] [CrossRef]
  29. Shao, X.; Kim, C.-S. Unsupervised Domain Adaptive 1D-CNN for Fault Diagnosis of Bearing. Sensors 2022, 22, 4156. [Google Scholar] [CrossRef]
  30. Graves, A. Long short-term memory. In Supervised Sequence Labeling with Recurrent Neural Networks; Springer: Berlin/Heidelberg, Germany, 2012; pp. 37–45. [Google Scholar]
  31. Choe, D.E.; Kim, H.C.; Kim, M.H. Sequence-based modeling of deep learning with LSTM and GRU networks for structural damage detection of floating offshore wind turbine blades. Renew. Energy 2021, 174, 218–235. [Google Scholar] [CrossRef]
  32. Liu, Z. Bearing Fault Diagnosis of End-to-End Model Design Based on 1DCNN-GRU Network. Discret. Dyn. Nat. Soc. 2022, 2022, 7167821. [Google Scholar]
  33. Yang, Y.C.; Liu, T.; Liu, X.Q. One dimensional convolution neural network fault diagnosis of planetary gearbox based on attention mechanism. Mach. Electron. 2021, 39, 3–8. [Google Scholar]
  34. Shao, S.; McAleer, S.; Yan, R.; Baldi, P. Highly accurate machine fault diagnosis using deep transfer learning. IEEE Trans. Ind. Informat. 2019, 15, 2446–2455. [Google Scholar] [CrossRef]
  35. Yang, X.; Xiao, Y. Named Entity Recognition Based on BERT-MBiGRU-CRF and Multi-head Self-attention Mechanism. In Proceedings of the 2022 4th International Conference on Natural Language Processing (ICNLP), Xi’an, China, 25–27 March 2022; pp. 178–183. [Google Scholar]
  36. Amar, M.; Gondal, I.; Wilson, C. Vibration spectrum imaging: A novel bearing fault classification approach. IEEE Trans. Ind. Electron. 2015, 62, 494–502. [Google Scholar] [CrossRef]
Figure 1. The GRU model.
Figure 2. The BiGRU model.
Figure 3. The $b_i$ calculation process.
Figure 4. Self-attention mechanism.
Figure 5. ACN_BM mechanism.
Figure 6. Multi-head self-attention mechanism.
Figure 7. Experimental platforms: (a) CWRU; (b) HUST.
Figure 8. Illustrations of defects on actual test bearings: (a) inner crack; (b) outer crack; (c) ball crack; (d) inner and outer cracks; (e) inner and ball cracks; (f) outer and ball cracks. The red circle marks the fault location on the bearing.
Figure 9. Accuracy of the individual models on CWRU.
Figure 10. Box plot of the CWRU test.
Figure 11. Accuracy of the individual models on HUST.
Figure 12. (a) The noisy data. (b) The classified data.
Table 1. ACN_BM network structure.

| Network Layer | Kernel Size/Stride | Parameter | Input Size | Output Size |
|---|---|---|---|---|
| Conv1d | 7/1 | 2 | (1, 2048) | (64, 2048) |
| Conv1d | 3/1 | 2 | (64, 2048) | (128, 2048) |
| Conv1d | 3/1 | 2 | (128, 1024) | (256, 1024) |
| Conv1d | 3/1 | 2 | (256, 512) | (128, 512) |
| Dropout | - | 0.3 | (128, 256) | (128, 256) |
| BiGRU | - | 128 | (128, 256) | (128, 256) |
| Dropout | - | 0.5 | (128, 256) | (128, 256) |
| Multi-Head Self-attention | - | 4 | (128, 256) | (128, 256) |
| Avg_pool | - | - | (256, 128) | (256) |
| Fc | - | - | (256) | (7) |
Table 2. Noise immunity experimental data.

| Dataset | Accuracy % | 1DCNN+LSTM | 1DCNN+GRU | 1DCNN+BiGRU | ACN_BM |
|---|---|---|---|---|---|
| HUST | Training set | 87.70 ± 0.23 | 97.68 ± 0.14 | 98.60 ± 0.09 | 99.89 ± 0.06 |
| HUST | Test set | 86.76 ± 0.35 | 96.28 ± 0.36 | 96.13 ± 0.21 | 98.78 ± 0.15 |
| CWRU | Training set | 86.87 ± 0.33 | 97.96 ± 0.26 | 96.24 ± 0.11 | 97.82 ± 0.21 |
| CWRU | Test set | 86.64 ± 0.29 | 97.38 ± 0.49 | 94.57 ± 0.57 | 98.64 ± 0.16 |
Table 3. Analysis of multi-head self-attention performance influenced by varying head numbers.

| Head numbers | SNR (dB) | Average precision | Final learning rate |
|---|---|---|---|
| 1 | 0 | 98.70% | 0.0001 |
| 2 | 0 | 98.76% | 0.000001 |
| 4 | 0 | 98.83% | 0.000001 |
| 8 | 0 | 98.94% | 0.000001 |
| 1 | −8 | 86.87% | 0.0001 |
| 2 | −8 | 88.32% | 0.000001 |
| 4 | −8 | 89.62% | 0.000001 |
| 8 | −8 | 89.92% | 0.000001 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
