Bearing Fault Diagnosis Grounded in the Multi-Modal Fusion and Attention Mechanism

Yang, Jianjian; Han, Haifeng; Dong, Xuan; Wang, Guoyong; Zhang, Shaocong

doi:10.3390/app15031531

by Jianjian Yang ^1,2,3, Haifeng Han ^1,2,3,*, Xuan Dong ^1,2,3, Guoyong Wang ^2 and Shaocong Zhang ^1,2,3

^1 School of Mechanical and Electrical Engineering, China University of Mining and Technology (Beijing), Beijing 100083, China

^2 Inner Mongolia Research Institute, University of Mining and Technology (Beijing), Ordos 017004, China

^3 Key Laboratory of Intelligent Mining and Robotics, Ministry of Emergency Management, Beijing 100083, China

^* Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(3), 1531; https://doi.org/10.3390/app15031531

Submission received: 19 December 2024 / Revised: 21 January 2025 / Accepted: 29 January 2025 / Published: 3 February 2025

(This article belongs to the Collection Bearing Fault Detection and Diagnosis)

Abstract

This paper proposes a novel method, the Fusion Attention Network for Bearing Diagnosis (FAN-BD), to address the difficulty that traditional methods have in effectively extracting and fusing key information from current and vibration signals. The method is validated on the public dataset Vibration, Acoustic, Temperature, and Motor Current Dataset of Rotating Machine under Varying Operating Conditions for Fault Diagnosis. It first converts current and vibration signals into two-dimensional grayscale images, extracts local features through multi-layer convolutional neural networks, and captures global information using the self-attention mechanism of the Vision Transformer (ViT). It then introduces the Channel-Based Multi-Head Attention (CBMA) mechanism to fuse features from the two modalities and exploit the complementarity between the signals. The experimental results show that FAN-BD achieves higher accuracy and robustness in fault diagnosis tasks than ViT, Swin Transformer, ConvNeXt, and CBMA-ViT, reaching a classification accuracy of 97.5% and providing an efficient and reliable solution for bearing fault diagnosis.

Keywords:

bearing fault diagnosis; vision transformer; convolutional neural network; multi-modal recognition; channel-based multi-head attention

1. Introduction

In mining operations, mechanical equipment such as cranes, conveyors, and ventilation fans is widely used for processing ore and materials, with bearings playing a crucial role in ensuring the stability of the rotating components [1]. However, when bearings fail, they may cause vibration, overheating, or even equipment shutdown, resulting in serious economic losses and safety risks [2,3]. Therefore, the timely diagnosis of bearing faults and replacement of faulty equipment can effectively prevent these problems.

Traditional fault diagnosis methods grounded in vibration signal analysis typically include data feature extraction and fault classification [4]. Feature extraction relies on time domain, frequency domain, and time–frequency domain analysis, while fault classification uses shallow models such as support vector machines [5], neural networks [6], and K-nearest neighbors [7]. Sun Yawei et al. proposed an autoregressive data generation method based on wavelet packet transform and cascaded stochastic quantization. This method can generate high-quality pseudo-samples to address the unbalanced sample problem in bearing fault diagnosis, improving the diagnosis performance [8]. However, these methods often depend on complex signal processing techniques and experience, and manual feature extraction has drawbacks such as strong subjectivity and poor universality. Moreover, shallow models perform poorly in handling complex nonlinear problems, limiting the effectiveness and applicability of traditional diagnostic methods.

Therefore, it is necessary to overcome the limitations of traditional methods and shallow models while effectively integrating and fusing vibration and current signals for fault diagnosis.

In recent years, deep learning has played an important role in solving these problems, especially regarding the successful application of deep convolutional neural networks (CNN) in image processing, which has improved research progress in the field of fault diagnosis [9]. CNN can effectively extract key features from rolling bearing fault data, thereby improving the accuracy and efficiency of fault diagnosis. For example, Guo et al. [10] combined Attention CNN (ACNN) and BiLSTM for rolling bearing fault diagnosis. They processed raw vibration signals using a wide one-dimensional CNN to extract their spatial features, which are enhanced by the Convolutional Block Attention Module (CBAM), which prioritizes important features. These features are then passed to BiLSTM for temporal feature extraction. The model is robust against noise interference and does not require preprocessing. While it achieves high accuracy even in noisy environments, its drawbacks include longer training times and potential instability under extreme noise conditions. Additionally, it does not address data imbalances. Lin et al. [6] conducted detailed fault diagnosis experiments based on wavelet analysis and neural networks to extract feature vectors from rolling bearing signals. They compared fault diagnosis methods based on BP neural networks and radial basis function neural networks, and analyzed the diagnostic results based on the relaxation neural networks and closed neural networks. Tian et al. proposed a rolling bearing fault diagnosis method combining a power spectrum analysis and support vector machine (SVM). The experimental results showed that this method could be effectively applied to rolling bearing fault diagnosis.

Furthermore, the Vision Transformer (ViT) [11,12], a vision model architecture based on the attention mechanism, can effectively integrate the global information of vibration signals and has a strong parallel computing ability and storage mechanism. Although ViT is inferior to CNNs in local information extraction and parameter efficiency, it has clear advantages in global information integration and computational speed. Therefore, ViT provides a new approach to rolling bearing fault diagnosis and has great application potential.

Guo Haike et al. [13] adopted data-level fusion, directly combining the raw data of vibration signals and current signals. Although this method can increase input information in some cases, due to differences in physical characteristics, sampling frequencies, and value ranges between vibration and current signals, this direct fusion may lead to information loss or noise interference, increasing model complexity and training difficulties. Hongfeng Tao et al. proposed a semi-supervised planetary gearbox fault diagnosis model based on GAT using fuzzy distance nearest neighbor clustering. This can better extract fault features with few labeled samples, providing a new means of gearbox fault diagnosis [14]. Hongfeng Tao et al. also developed an indirect iterative learning control scheme for batch processes with time-varying uncertainties, input delays, and disturbances. This scheme is based on the repetitive process stability theory and can effectively control batch processes [15]. Zhongmei Wang et al. [16] adopted decision-level fusion, performing feature extraction and classification separately for each modality before decision fusion. While this method is simple to implement, it cannot fully explore the deep associations between different modalities and tends to ignore complementary information between modalities. Therefore, following a comparison of different fusion strategies, this paper chose to perform fusion at the feature level [17], which can avoid heterogeneity problems in data-level fusion while fully extracting the deep features of each modality through convolutional networks and the Vision Transformer (ViT) [18] self-attention mechanism. It used the Channel-Based Multi-Head Attention (CBMA) mechanism [19] to efficiently fuse features from different modalities, maximizing the complementary advantages of multi-modal data to achieve more accurate fault diagnosis results. 
This strategy is relatively mature and has obvious advantages, effectively improving the accuracy and robustness of fault diagnosis.

This study introduces a novel approach that converts the current and vibration signals into 2D images, which facilitates enhanced feature extraction through advanced image-processing techniques. Multi-layer convolutional networks are used to extract local features, while the self-attention mechanism of the Vision Transformer (ViT) improves the global feature representation. The Channel-Based Multi-Head Attention (CBMA) mechanism effectively fuses the two signals, improving the overall signal analysis process. These contributions represent a significant advancement in signal processing for fault diagnosis.

The remainder of the paper is organized as follows: Section 2 describes the data processing and the FAN-BD model; Section 3 introduces the multi-sensor, multi-modal public dataset; Section 4 presents the experiments, in which the proposed model is compared with CBMA-ViT, ViT, Swin Transformer, and ConvNeXt in terms of classification accuracy; and Section 5 concludes the paper.

2. Methods

2.1. Data Processing

First, the vibration and current signals underwent data preprocessing, where they were divided into data groups of size 256 × 4 and 256 × 3, respectively, and normalized according to Formula (1):

X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}} \times 255

(1)

where X_{max} is the maximum value of the sample and X_{min} is the minimum value of the sample. After normalization, the data values were confined to between 0 and 255. In grayscale images, 0 represents black, 255 represents white, and the values in between represent different shades of gray. Therefore, the data can be converted into grayscale image groups of 16 × 16 × 4 and 16 × 16 × 3.
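The normalization and reshaping step can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `to_grayscale_group` is hypothetical, the minimum and maximum are taken over the whole 256-sample group (the text does not say whether scaling is per channel), the reshape is row-major, and the input data are synthetic.

```python
import numpy as np

def to_grayscale_group(segment: np.ndarray, side: int = 16) -> np.ndarray:
    """Apply Formula (1) to a (256, C) segment and reshape it to (16, 16, C)."""
    x_min, x_max = segment.min(), segment.max()
    if x_max == x_min:
        # Constant segment: Formula (1) is undefined, so return an all-black image.
        scaled = np.zeros_like(segment, dtype=np.float64)
    else:
        scaled = (segment - x_min) / (x_max - x_min) * 255.0
    # Assumption: consecutive samples fill each image row (row-major order).
    return scaled.reshape(side, side, segment.shape[1])

rng = np.random.default_rng(0)
vibration = rng.normal(size=(256, 4))  # synthetic stand-in for 4 accelerometer channels
current = rng.normal(size=(256, 3))    # synthetic stand-in for 3 phase currents

vib_img = to_grayscale_group(vibration)
cur_img = to_grayscale_group(current)
print(vib_img.shape, cur_img.shape)
```

Scaling each channel separately would be an equally plausible reading of Formula (1); it would only require computing the minimum and maximum along axis 0.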

Next, the feature information of the grayscale image groups was integrated through convolutional neural networks, converting them into single-channel 16 × 16 grayscale images. The neural network structure was designed as shown in Figure 1.

The model processed an input vibration signal grayscale image of shape (16, 16, 4) through three convolutional layers, each followed by an activation layer, ultimately outputting a single-channel grayscale image of shape (16, 16, 1). Convolutional layers 1 and 2 used 3 × 3 kernels with “same” padding to increase the number of channels from 4 to 16 and then from 16 to 32, extracting features within the same time dimension. Convolutional layer 3 then reduced the number of channels to 1 using 1 × 1 kernels, generating a grayscale image containing the key information. ReLU activation functions were used in activation layers 1 and 2 to introduce non-linearity, while activation layer 3 used linear activation to preserve features. Normalization helped to reduce the impact of outliers, enabling these grayscale images to simultaneously reflect the characteristics of both vibration and current signals. After labels were added, these grayscale images were used to train and test the classification models.
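To make the channel progression (4 to 16 to 32 to 1) concrete, the following NumPy sketch traces the tensor shapes through the three layers. It is not the authors' PyTorch implementation: the helper names are hypothetical, the weights are random and untrained, and the convolution is a plain stride-1 cross-correlation with zero padding.

```python
import numpy as np

def conv2d_same(x, weight, bias):
    """Stride-1 'same' convolution. x: (H, W, C_in); weight: (k, k, C_in, C_out)."""
    k = weight.shape[0]
    pad = k // 2
    h, w, _ = x.shape
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros((h, w, weight.shape[3]))
    for i in range(k):
        for j in range(k):
            # Add the contribution of kernel offset (i, j), summed over input channels.
            out += xp[i:i + h, j:j + w, :] @ weight[i, j]
    return out + bias

def relu(x):
    return np.maximum(x, 0.0)

def merge_channels(img, params):
    """Conv 3x3 -> ReLU -> Conv 3x3 -> ReLU -> Conv 1x1 with linear activation."""
    (w1, b1), (w2, b2), (w3, b3) = params
    h = relu(conv2d_same(img, w1, b1))
    h = relu(conv2d_same(h, w2, b2))
    return conv2d_same(h, w3, b3)

rng = np.random.default_rng(0)

def init(k, c_in, c_out):
    # Random placeholder weights; a trained model would learn these.
    return rng.normal(scale=0.1, size=(k, k, c_in, c_out)), np.zeros(c_out)

params = [init(3, 4, 16), init(3, 16, 32), init(1, 32, 1)]
vib_img = rng.uniform(0, 255, size=(16, 16, 4))  # synthetic grayscale group
single = merge_channels(vib_img, params)
print(single.shape)  # (16, 16, 1)
```

Because every layer uses stride 1 and "same" padding, the 16 × 16 spatial size is preserved and only the channel count changes.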

2.2. FAN-BD

The overall structure of the proposed convolutional vision neural network method is shown in Figure 2. This method aims to achieve precise bearing information recognition through the effective processing of current and vibration signals.

Since the original ViT feature extraction method focuses on the global characteristics of targets while neglecting local characteristics, we used small convolution kernels and small strides to extract more detailed local information, and used the CBMA attention mechanism for feature calibration and fusion. The specific steps are as follows.

This process first extracts features from the input vibration signal grayscale image through convolutional and activation layers, then adds position encoding information, as shown in Equation (2), for both vibration and current signal two-dimensional grayscale images. Next, the Vision Transformer (ViT) is used to extract features from current signals (3) and vibration signals (4). These two types of signal features are then fused through the CBMA module (5), generating queries, keys, and values (6), and calculating channel attention through the attention mechanism (7). The multi-head attention mechanism further processes these features, enhancing feature representation through concatenation and linear transformation (8) and (9). Subsequently, the fused features undergo nonlinear transformation through a multilayer perceptron (MLP) (10), and finally, classification results are obtained through layer normalization and linear layers (11). This model structure can effectively process and analyze multi-modal data, extract key information, and achieve accurate classification.

z_{0} = [x_{class}; x_{p}^{1} E; x_{p}^{2} E; \dots; x_{p}^{n} E] + E_{pos}

(2)

X_{current} = ViT(I_{current})

(3)

X_{vibration} = ViT(I_{vibration})

(4)

X_{combined} = Concat(X_{current}, X_{vibration})

(5)

Q = X_{combined} W_{Q}, K = X_{combined} W_{K}, V = X_{combined} W_{V}

(6)

Attention(Q, K, V) = softmax(\frac{Q K^{T}}{\sqrt{d_{k}}}) V

(7)

MultiHead(Q, K, V) = Concat(head_{1}, \dots, head_{h}) W^{O}

(8)

head_{i} = Attention(Q W_{i}^{Q}, K W_{i}^{K}, V W_{i}^{V})

(9)

z' = MLP(LN(MultiHead))

(10)

y = LN(z')

(11)
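Equations (5) to (9) can be sketched with NumPy as follows. This is a generic multi-head attention sketch under stated assumptions, not the authors' CBMA module: the concatenation in Equation (5) is assumed to be along the token axis, the per-head projections are realized by slicing the columns of the combined Q, K, and V matrices, and all dimensions and weights are small random placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention, Equation (7)."""
    d_k = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ v

def split_heads(t, num_heads):
    """(n, d) -> (num_heads, n, d // num_heads)."""
    n, d = t.shape
    return t.reshape(n, num_heads, d // num_heads).transpose(1, 0, 2)

def multi_head_fusion(x_current, x_vibration, w_q, w_k, w_v, w_o, num_heads):
    # Equation (5): the concatenation axis (tokens) is an assumption.
    x = np.concatenate([x_current, x_vibration], axis=0)
    q, k, v = x @ w_q, x @ w_k, x @ w_v                   # Equation (6)
    heads = attention(split_heads(q, num_heads),
                      split_heads(k, num_heads),
                      split_heads(v, num_heads))          # Equations (7) and (9)
    n, d = q.shape
    concat = heads.transpose(1, 0, 2).reshape(n, d)       # Concat(head_1, ..., head_h)
    return concat @ w_o                                   # Equation (8)

rng = np.random.default_rng(0)
n_tokens, dim = 6, 8                       # toy sizes, not the paper's 180 and 256
x_cur = rng.normal(size=(n_tokens, dim))   # stand-in for ViT(I_current)
x_vib = rng.normal(size=(n_tokens, dim))   # stand-in for ViT(I_vibration)
w_q, w_k, w_v, w_o = (rng.normal(scale=0.1, size=(dim, dim)) for _ in range(4))

fused = multi_head_fusion(x_cur, x_vib, w_q, w_k, w_v, w_o, num_heads=2)
print(fused.shape)  # (12, 8)
```

Concatenating along the feature axis instead would double the embedding width rather than the token count; the equations as written do not settle which is intended.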

The detailed layer information is shown in Table 1.

3. Data

The research data used in this paper derived from the public dataset Vibration, Acoustic, Temperature, and Motor Current Dataset of Rotating Machine under Varying Operating Conditions for Fault Diagnosis, published by Jung Wonhee, Kim Seonghun, Yun Sunghyuk, et al. [20]. The dataset is divided into two parts, and this paper uses the second part, which is a fault diagnosis dataset containing vibration signals and current signals. Furthermore, the simultaneous acquisition of both vibration and current signals allows for the extraction of more effective features, which enhances the model’s ability to capture diverse fault characteristics. The large size of the dataset also improves the generalization and reliability of the model. The data information is organized in Table 2.

The experimental equipment structure is shown in Figure 3.

This paper extracted vibration signals for various fault types. Figure 4 and Figure 5 show the time-domain waveforms of vibration signals and current signals, respectively. It can be observed that, under normal conditions, the vibration signal waveform is relatively stable, while under rolling element fault and inner-ring fault conditions, obvious periodic impact features appear in the signals. The three-phase current waveforms are basically symmetrical under normal conditions, but obvious current imbalance phenomena appear under outer-ring fault conditions, reflecting the impact of faults on motor performance.

4. Experimental Design and Results

4.1. Experimental Environment

The experiment used the PyTorch framework and Python programming, with Python version 3.11. The computer parameters used for network training and testing included an Intel(R) Core(TM) i7-8250U CPU and NVIDIA GeForce MX150 GPU, running on the Windows operating system.

4.2. Dataset Classification

The public dataset includes the following fault types: normal, inner-race fault, outer-race fault, and ball fault. After data preprocessing, the experimental data were processed in units of 256 data points, normalized, and converted through convolution into grayscale images. After adding labels, the data were randomly divided into training and testing sets in a 7:3 ratio. The algorithm parameters were adjusted using the training set, and the performance metrics, such as accuracy, precision, recall, and F1 score, along with the confusion matrices and classification results, were calculated using the test set.
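A random 7:3 split of the labeled images can be sketched with the standard library alone. The function name `split_7_3`, the seed, and the sample count of 2000 are illustrative assumptions; the paper does not state the total number of samples or how the shuffle was performed.

```python
import random

def split_7_3(samples, labels, train_fraction=0.7, seed=0):
    """Shuffle indices once, then cut them into training and testing parts."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    cut = int(round(train_fraction * len(indices)))
    train = [(samples[i], labels[i]) for i in indices[:cut]]
    test = [(samples[i], labels[i]) for i in indices[cut:]]
    return train, test

# Hypothetical data: integer IDs stand in for grayscale images, labels 0-3 for
# normal, inner-race fault, outer-race fault, and ball fault.
samples = list(range(2000))
labels = [i % 4 for i in samples]

train_set, test_set = split_7_3(samples, labels)
print(len(train_set), len(test_set))  # 1400 600
```

A stratified split, which keeps the class proportions equal in both parts, is a common alternative when classes are imbalanced.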

4.3. Experimental Results

In this study, the cross-entropy loss function (12) was employed. After multiple training iterations on the training set, the algorithm’s performance was evaluated on the test set, yielding the confusion matrix shown in Figure 6 and an accuracy of 97.5%. A t-SNE (t-distributed stochastic neighbor embedding) visualization of the classification results is shown in Figure 7.

L = -\frac{1}{N} \sum_{i=1}^{N} [y_{i} \log(\hat{y}_{i}) + (1 - y_{i}) \log(1 - \hat{y}_{i})]

(12)
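As a small worked example, Equation (12) can be evaluated directly in Python. Equation (12) is the two-class form of cross-entropy; since the task has four classes, the sketch also includes the usual categorical generalization, which is an assumption about what a four-class classifier would use and is not stated in the text. The clipping constant `eps` is likewise an implementation detail added here to avoid log(0).

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Equation (12): mean binary cross-entropy over N samples."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

def categorical_cross_entropy(y_onehot, y_prob, eps=1e-12):
    """Multi-class generalization (assumed): -mean of sum_c y_c * log(p_c)."""
    total = 0.0
    for targets, probs in zip(y_onehot, y_prob):
        total += sum(t * math.log(max(p, eps)) for t, p in zip(targets, probs))
    return -total / len(y_onehot)

# A maximally uncertain binary prediction costs log(2) per sample.
print(binary_cross_entropy([1, 0], [0.5, 0.5]))
# A uniform prediction over four classes costs log(4) per sample.
print(categorical_cross_entropy([[0, 0, 1, 0]], [[0.25, 0.25, 0.25, 0.25]]))
```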

To further validate the feasibility of the proposed algorithm, comparisons were made with mainstream algorithms, including CBMA-ViT, ViT, Swin Transformer, and ConvNeXt. Four evaluation metrics were employed in this study: accuracy, precision, recall, and F1 score [21]. They are defined in terms of the confusion-matrix entries listed in Table 3.

(1) Accuracy: the proportion of correctly classified samples among all samples

Accuracy = \frac{TP + TN}{TP + FN + FP + TN}

(13)

(2) Precision: the proportion of actual class i samples among the samples predicted as class i

Precision_{i} = \frac{TP_{i}}{TP_{i} + FP_{i}}

(14)

(3) Recall: the proportion of correctly predicted class i samples among the actual class i samples

Recall_{i} = \frac{TP_{i}}{TP_{i} + FN_{i}}

(15)

(4) F1 score: the harmonic mean of the precision and recall for class i

F1\text{-}score_{i} = \frac{2 \times Precision_{i} \times Recall_{i}}{Precision_{i} + Recall_{i}}

(16)
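The four metrics can be computed from a multi-class confusion matrix as sketched below. The matrix values are invented purely for illustration and are not the paper's results; rows are assumed to be actual classes and columns predicted classes, and overall accuracy is taken as the diagonal sum divided by the total, which is the multi-class counterpart of Equation (13).

```python
def per_class_metrics(cm):
    """Return (accuracy, [(precision, recall, f1), ...]) from a square confusion matrix."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(n)) / total
    results = []
    for i in range(n):
        tp = cm[i][i]
        fp = sum(cm[r][i] for r in range(n)) - tp  # predicted as i, actually another class
        fn = sum(cm[i]) - tp                       # actually i, predicted as another class
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append((precision, recall, f1))
    return accuracy, results

# Hypothetical 4-class matrix (normal, inner-race, outer-race, ball), 50 samples per class.
cm = [
    [48, 1, 0, 1],
    [2, 47, 1, 0],
    [0, 1, 49, 0],
    [1, 0, 1, 48],
]
accuracy, metrics = per_class_metrics(cm)
print(f"accuracy = {accuracy:.3f}")
for name, (p, r, f1) in zip(["normal", "inner", "outer", "ball"], metrics):
    print(f"{name}: precision = {p:.3f}, recall = {r:.3f}, F1 = {f1:.3f}")
```

For the first class in this toy matrix, TP = 48, FP = 3, and FN = 2, so the precision is 48/51, the recall is 48/50, and the F1 score is 96/101.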

To compare the models, confusion matrices were computed on the test set for the four baseline algorithms, as shown in Figure 8. The accuracies derived from these confusion matrices are as follows: CBMA-ViT (0.929), ViT (0.907), Swin Transformer (0.931), and ConvNeXt (0.915), all below the 0.975 achieved by the proposed FAN-BD algorithm. As seen in Table 4, FAN-BD also achieves the highest precision, recall, and F1 score for every class, indicating that it classifies all four bearing states reliably.

The training behavior of FAN-BD was also compared with that of CBMA-ViT, ViT, Swin Transformer, and ConvNeXt. All models were trained with the cross-entropy loss function, chosen for its ability to measure the difference between predicted probabilities and actual labels. The accuracy and loss across training iterations are illustrated in Figure 9.

The changes in loss functions during the training process for FAN-BD, CBMA-ViT, ViT Model, Swin Transformer, and ConvNeXt algorithms are shown through line graphs. From this figure, we can see that the FAN-BD algorithm demonstrated the fastest convergence speed and the lowest loss value throughout the training cycle, indicating its superior performance in this experiment. While other algorithms also showed a downward trend in loss values, their final loss values were all higher than those of FAN-BD, with ConvNeXt and CBMA-ViT following closely behind, while ViT Model and Swin Transformer performed relatively worse. FAN-BD is an improved version of CBMA-ViT, which performs classification tasks by extracting all feature information. Compared to the multimodal feature fusion approach of FAN-BD, CBMA-ViT is less effective in fault diagnosis based on vibration and current signals.

Figure 10 visualizes the classification results of CBMA-ViT, ViT, Swin Transformer, and ConvNeXt when using the testing sets by mapping them into a two-dimensional space using t-SNE. Each subplot in Figure 10 corresponds to one algorithm, allowing for a direct comparison of their performance with the FAN-BD results shown in Figure 7. The t-SNE algorithm reduces the dimensionality of the data, enabling an intuitive observation of the classification results through the color and position of the points. As shown in the figure, FAN-BD achieved the highest accuracy, with fewer than ten misclassifications of six hundred samples. In contrast, CBMA-ViT and ConvNeXt misclassified more than thirty points, while both the ViT model and Swin Transformer misclassified at least fifty points each. This highlights the superior performance of FAN-BD compared to the other algorithms.

5. Conclusions

In conclusion, the multi-modal fusion algorithm proposed in this paper, which combines convolutional networks with self-attention mechanisms, provides an efficient and precise solution for the intelligent fault diagnosis of rotating machinery. The experimental results demonstrate that our method achieved an impressive classification accuracy of 97.5%, outperforming other mainstream methods, such as ViT, Swin Transformer, and ConvNeXt. In addition to its accuracy, our method has several advantages: it enhances the adaptability to complex operating conditions, exhibits strong robustness, and performs well across various fault types, ensuring reliable fault diagnosis under diverse scenarios. Additionally, the introduction of the Channel-Based Multi-Head Attention (CBMA) mechanism effectively leverages the complementary information from vibration and current signals, improving the fusion of multi-modal features.

However, there are some potential limitations to consider. The method’s performance may be influenced by the quality and consistency of the input data, and the increased complexity of the model may lead to increased computational costs in certain real-time applications. Despite these challenges, the proposed approach presents a significant advancement in the field of fault diagnosis and opens up new directions for future research. With further refinement, this method holds promise for broader applications in intelligent fault diagnosis systems, particularly in industrial environments with diverse operating conditions.

Future research will focus on several key areas. First, we plan to optimize the method for real-time applications by improving its computational efficiency and reducing the model’s complexity. Second, we aim to enhance the robustness of the model by investigating strategies to handle noisy or incomplete data. Additionally, exploring the fusion of other data modalities, such as temperature and acoustic signals, will be a promising avenue to expand the model’s diagnostic capabilities. Finally, further refinements to the attention mechanisms, such as introducing more advanced attention strategies, could lead to improved performance in complex fault diagnosis tasks.

Author Contributions

Investigation, X.D.; Resources, S.Z.; Writing—original draft, H.H.; Writing—review & editing, J.Y.; Supervision, G.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Key Research and Development Program of China under Grant 2022YFB4703703, the Theory and Method of Excavation-Support-Anchor Parallel Control for Intelligent Excavation Complex System under Grant 52104169, and the Green, Intelligent, and Safe Mining of Coal Resources under Grant 52121003.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are unavailable due to privacy restrictions. No public involvement in any aspect of this research. No specific reporting guidelines were used in drafting this manuscript. AI or AI-assisted tools were not used in drafting any aspect of this manuscript.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

1. Daniel, R.V.; Siddhappa, S.A.; Gajanan, S.B.; Philip, S.V.; Paul, P.S. Effect of bearings on vibration in rotating machinery. In Proceedings of the IOP Conference Series: Materials Science and Engineering, Busan, Republic of Korea, 25–27 August 2017; p. 012264.
2. Ahn, G.; Lee, H.; Park, J.; Hur, S. Development of indicator of data sufficiency for feature-based early time series classification with applications of bearing fault diagnosis. Processes 2020, 8, 790.
3. Tang, X.; He, Q.; Gu, X.; Li, C.; Zhang, H.; Lu, J. A novel bearing fault diagnosis method based on GL-mRMR-SVM. Processes 2020, 8, 784.
4. Liu, J. Detrended fluctuation analysis of vibration signals for bearing fault detection. In Proceedings of the 2011 IEEE Conference on Prognostics and Health Management, Denver, CO, USA, 20–23 June 2011; pp. 1–5.
5. Guishuang, T.; Wang, S.; Zhang, C. A method for rolling bearing fault diagnosis based on the power spectrum analysis and support vector machine. In Proceedings of the IEEE 10th International Conference on Industrial Informatics, Beijing, China, 25–27 July 2012; pp. 546–549.
6. Wu, S.L.; Liu, J.X.; Li, L. Fault Diagnosis of Rolling Bearing on the Basis of Wavelet Neural Network. Appl. Mech. Mater. 2014, 598, 244–249.
7. Dong, S.; Luo, T.; Zhong, L.; Chen, L.; Xu, X. Fault diagnosis of bearing based on the kernel principal component analysis and optimized k-nearest neighbour model. J. Low Freq. Noise Vib. Act. Control 2017, 36, 354–365.
8. Sun, Y.; Tao, H.; Stojanovic, V. Autoregressive data generation method based on wavelet packet transform and cascaded stochastic quantization for bearing fault diagnosis under unbalanced samples. Eng. Appl. Artif. Intell. 2024, 138, 109402.
9. Wen, L.; Li, X.; Gao, L.; Zhang, Y. A new convolutional neural network-based data-driven fault diagnosis method. IEEE Trans. Ind. Electron. 2017, 65, 5990–5998.
10. Guo, Y.; Mao, J.; Zhao, M. Rolling bearing fault diagnosis method based on attention CNN and BiLSTM network. Neural Process. Lett. 2023, 55, 3377–3410.
11. Mangalam, K.; Fan, H.; Li, Y.; Wu, C.-Y.; Xiong, B.; Feichtenhofer, C.; Malik, J. Reversible vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10830–10840.
12. Han, K.; Wang, Y.; Chen, H.; Chen, X.; Guo, J.; Liu, Z.; Tang, Y.; Xiao, A.; Xu, C.; Xu, Y. A survey on vision transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 87–110.
13. Guo, H.; Zhao, X. Intelligent Diagnosis of Dual-channel Parallel Rolling Bearings Based on Feature Fusion. IEEE Sens. J. 2024, 24, 10640–10655.
14. Tao, H.; Shi, H.; Qiu, J.; Jin, G.; Stojanovic, V. Planetary gearbox fault diagnosis based on FDKNN-DGAT with few labeled data. Meas. Sci. Technol. 2023, 35, 025036.
15. Tao, H.; Zheng, J.; Wei, J.; Paszke, W.; Rogers, E.; Stojanovic, V. Repetitive process based indirect-type iterative learning control for batch processes with model uncertainty and input delay. J. Process Control 2023, 132, 103112.
16. Wang, Z.; Nie, P.; Liu, J.; He, J.; Wu, H.; Guo, P. Bearing fault diagnosis based on a Multiple-Constraint Modal-invariant Graph Convolutional Fusion Network. High-Speed Railw. 2024, 2, 92–100.
17. Gao, J.; Li, P.; Chen, Z.; Zhang, J. A survey on deep learning for multimodal data fusion. Neural Comput. 2020, 32, 829–864.
18. Dosovitskiy, A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
19. Yang, J.; Zhang, Y.; Wang, K.; Tong, Y.; Liu, J.; Wang, G. Coal–Rock Data Recognition Method Based on Spectral Dimension Transform and CBAM-VIT. Appl. Sci. 2024, 14, 593.
20. Jung, W.; Kim, S.-H.; Yun, S.-H.; Bae, J.; Park, Y.-H. Vibration, acoustic, temperature, and motor current dataset of rotating machine under varying operating conditions for fault diagnosis. Data Brief 2023, 48, 109049.
21. Yacouby, R.; Axman, D. Probabilistic extension of precision, recall, and F1 score for more thorough evaluation of classification models. In Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems, Online, 20 November 2020; pp. 79–91.

Figure 1. Flow chart of vibration signal and current signal data preprocessing.

Figure 2. Overall structure of the proposed FAN-BD method.

Figure 3. Data collection device [20].

Figure 4. Vibration signals under normal, ball fault, inner-race fault, and outer-race fault states.

Figure 5. Current signals under normal, ball fault, inner-race fault, and outer-race fault states.

Figure 6. Confusion matrix diagram.

Figure 7. FAN-BD Classification results diagram.

Figure 8. Confusion Matrix Diagram of CBMA-ViT, ViT, Swin Transformer, and ConvNeXt.

Figure 9. Comparison of loss functions between FAN-BD, ViT, Swin Transformer, and ConvNeXt.

Figure 10. Comparison of Classification results of CBMA-ViT, ViT, Swin Transformer, and ConvNeXt.

Table 1. FAN-BD model layer information.

Layer Name	Output Tensor Shape	Layer Description
Conv1	[2, 64, 120, 100]	Convolutional layer: input channels: 1; output channels: 64; kernel size: 3 × 3; padding = 1. Extracts low-level features.
ReLU1	[2, 64, 120, 100]	Activation function: applies ReLU activation to Conv1 output
Conv2	[2, 128, 120, 100]	Convolutional layer: input channels: 64; output channels: 128; kernel size: 3 × 3; padding = 1. Extracts higher-level features.
ReLU2	[2, 128, 120, 100]	Activation function: applies ReLU activation to Conv2 output.
Padding	[2, 128, 120, 100]	Padding operation: ensures image dimensions are divisible by patch_size = 8.
Unfold (Patch Division)	[2, 15, 12, 128, 8, 8]	Divides image into patches of size 8 × 8 and flattens each patch, resulting in (batch_size, num_patches_h, num_patches_w, channels, patch_size, patch_size).
Patch Embedding	[2, 180, 256]	Linear projection layer, flattens each patch and maps to embed_dim = 256 dimension. Results in [batch_size, num_patches, embed_dim].
Position Embedding	[1, 180, 256]	Dynamically generates position encoding and adds to patch_embeddings.
Transformer Encoder	[2, 180, 256]	Transformer encoder section with 4 TransformerEncoderLayers, with each layer containing num_heads = 4. Performs encoding on input patches.
CBMA Attention	[2, 180, 256]	Multi-head self-attention mechanism (CBMA): input is patch embeddings after Transformer encoding; dimensions remain unchanged
MLP	[2, 180, 256]	Multi-layer perceptron, including GELU activation.
Mean Pooling	[2, 256]	Pooling operation: computes mean along num_patches dimension, resulting in [batch_size, embed_dim].
Fully Connected	[2, 4]	Classification layer: outputs num_classes = 4 categories, representing the classification results.
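
The patch counts in Table 1 follow from simple shape arithmetic on the 120 × 100 feature map and patch_size = 8. The sketch below is only an illustration of that arithmetic; the function name `patch_grid` and the `pad` flag are assumptions introduced here, not part of the authors' code. It shows that rounding each side up to a multiple of 8 (as the Padding row describes) would give 15 × 13 = 195 patches, whereas the 15 × 12 = 180 patches listed in the table correspond to dropping the incomplete border patches. Which behaviour the original implementation uses is not stated in the table.

```python
def patch_grid(height, width, patch_size=8, pad=True):
    """Return (patches_h, patches_w, total) for a feature map cut into square patches.

    pad=True  rounds each side up to a multiple of patch_size.
    pad=False drops incomplete border patches (floor division).
    Both the function and the flag are illustrative assumptions.
    """
    if pad:
        # Integer ceiling division without going through floats.
        patches_h = -(-height // patch_size)
        patches_w = -(-width // patch_size)
    else:
        patches_h = height // patch_size
        patches_w = width // patch_size
    return patches_h, patches_w, patches_h * patches_w


padded = patch_grid(120, 100, patch_size=8, pad=True)
truncated = patch_grid(120, 100, patch_size=8, pad=False)
print(padded)     # (15, 13, 195)
print(truncated)  # (15, 12, 180), the sequence length listed in Table 1
```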

Table 2. Data collection equipment and parameter information.

Data Type	Collection Location/Sensors	Collection Equipment	Sampling Frequency	Unit	Included Columns
Vibration Data	Two bearing housings (A and B) in x and y directions, using 4 accelerometers (PCB352C34)	Siemens SCADAS Mobile 5PM50	25.6 kHz	Gravity constant (g)	Timestamp, x-direction (Bearing A), y-direction (Bearing A), x-direction (Bearing B), y-direction (Bearing B)
Current Data	Current sensors (3 CT sensors, Hioki CT6700)	NI9775	100 kHz	Ampere (A)	Timestamp, R-phase current, S-phase current, T-phase current
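
To make the link between these sampling rates and the grayscale inputs concrete, the sketch below converts a one-dimensional signal window into a two-dimensional image. It is a minimal illustration under stated assumptions: one sample per pixel, min-max scaling to the range 0 to 255, row-by-row filling, and a 120 × 100 image (the spatial size of the Conv1 output in Table 1). The function names `signal_to_grayscale` and `window_duration` are hypothetical, and the authors' actual preprocessing may differ.

```python
def signal_to_grayscale(samples, height, width):
    """Min-max scale the first height*width samples to 0..255 and reshape row by row.

    Assumes one signal sample per pixel; this mapping is an illustrative choice.
    """
    needed = height * width
    if len(samples) < needed:
        raise ValueError(f"need {needed} samples, got {len(samples)}")
    window = samples[:needed]
    lo, hi = min(window), max(window)
    span = (hi - lo) or 1.0  # avoid division by zero for a constant signal
    pixels = [round(255 * (x - lo) / span) for x in window]
    return [pixels[r * width:(r + 1) * width] for r in range(height)]


def window_duration(height, width, sampling_hz):
    """Seconds of signal consumed by one image under the one-sample-per-pixel assumption."""
    return height * width / sampling_hz


image = signal_to_grayscale([0, 1, 2, 3, 4, 5], height=2, width=3)
print(image)  # [[0, 51, 102], [153, 204, 255]]

print(window_duration(120, 100, 25_600))   # vibration channel: 0.46875 s
print(window_duration(120, 100, 100_000))  # current channel: 0.12 s
```

Under these assumptions a single image spans a different time interval for each modality, which is one reason the two signal types are encoded separately before fusion.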

Table 3. Evaluation indicators.

Evaluation Indicators	Predicted Positive	Predicted Negative
Actual Positive	True Positive (TP)	False Negative (FN)
Actual Negative	False Positive (FP)	True Negative (TN)
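
The cells of Table 3 are the inputs to the per-class precision, recall, and F1-score values. The sketch below uses the standard definitions (precision = TP / (TP + FP), recall = TP / (TP + FN), F1 = harmonic mean of the two) and a one-vs-rest reading of a multi-class confusion matrix. The function names and the small 3 × 3 matrix are made up for illustration and are not results from the paper.

```python
def class_counts(matrix, k):
    """Return (TP, FP, FN) for class k, where matrix[i][j] counts actual i predicted as j."""
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                   # actual k, predicted as something else
    fp = sum(row[k] for row in matrix) - tp    # predicted k, actually something else
    return tp, fp, fn


def precision_recall_f1(tp, fp, fn):
    """Standard definitions; returns 0.0 where a denominator would be zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy confusion matrix with invented counts (rows: actual, columns: predicted).
toy = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 2, 8],
]

for k in range(len(toy)):
    p, r, f1 = precision_recall_f1(*class_counts(toy, k))
    print(f"class {k}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```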

Table 4. Performance comparison of algorithms used in test set.

Metric	Algorithm	Normal	Inner Fault	Outer Fault	Ball Fault
Precision	FAN-BD	97.0%	97.5%	97.5%	97.0%
Precision	CBMA-ViT	91.3%	92.2%	95.0%	91.9%
Precision	ViT	93.0%	90.4%	91.0%	89.5%
Precision	Swin Transformer	93.9%	93.5%	93.9%	92.1%
Precision	ConvNeXt	92.0%	90.5%	91.1%	91.5%
Recall	FAN-BD	97.5%	97.5%	97.5%	98.0%
Recall	CBMA-ViT	90.0%	94.5%	95.0%	90.6%
Recall	ViT	93.0%	89.5%	91.0%	89.5%
Recall	Swin Transformer	93.0%	93.5%	93.0%	93.0%
Recall	ConvNeXt	92.0%	90.5%	92.9%	91.5%
F1-score	FAN-BD	97.0%	97.5%	97.5%	97.5%
F1-score	CBMA-ViT	90.6%	93.3%	95.0%	91.2%
F1-score	ViT	93.0%	89.9%	91.0%	89.5%
F1-score	Swin Transformer	93.4%	93.5%	93.4%	92.5%
F1-score	ConvNeXt	92.0%	90.5%	92.0%	91.5%
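
The per-class F1-scores in Table 4 can be condensed into one number per algorithm by taking an unweighted (macro) average over the four classes. The sketch below does this with the values copied from the table; the macro average is a derived summary added here for convenience and is not a figure reported by the authors.

```python
# F1-scores (%) from Table 4, in the order Normal, Inner Fault, Outer Fault, Ball Fault.
f1_scores = {
    "FAN-BD": [97.0, 97.5, 97.5, 97.5],
    "CBMA-ViT": [90.6, 93.3, 95.0, 91.2],
    "ViT": [93.0, 89.9, 91.0, 89.5],
    "Swin Transformer": [93.4, 93.5, 93.4, 92.5],
    "ConvNeXt": [92.0, 90.5, 92.0, 91.5],
}


def macro_average(per_class):
    """Unweighted mean over classes; every class counts equally."""
    return sum(per_class) / len(per_class)


macro_f1 = {name: macro_average(values) for name, values in f1_scores.items()}
best = max(macro_f1, key=macro_f1.get)

# List the algorithms from highest to lowest macro-averaged F1-score.
for name, score in sorted(macro_f1.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.2f}%")
print("highest macro F1:", best)
```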

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Yang, J.; Han, H.; Dong, X.; Wang, G.; Zhang, S. Bearing Fault Diagnosis Grounded in the Multi-Modal Fusion and Attention Mechanism. Appl. Sci. 2025, 15, 1531. https://doi.org/10.3390/app15031531
