A Multimodal Three-Channel Bearing Fault Diagnosis Method Based on CNN Fusion Attention Mechanism Under Strong Noise Conditions

Zou, Yingyong; Li, Chunfang; Zhang, Yu; Si, Zhiqiang; Li, Long

doi:10.3390/a19020144

by Yingyong Zou *, Chunfang Li, Yu Zhang, Zhiqiang Si and Long Li

College of Mechanical and Vehicular Engineering, Changchun University, Changchun 130022, China

* Author to whom correspondence should be addressed.

Algorithms 2026, 19(2), 144; https://doi.org/10.3390/a19020144

Submission received: 8 January 2026 / Revised: 31 January 2026 / Accepted: 4 February 2026 / Published: 10 February 2026

Abstract

Bearings, as core components of mechanical equipment, play a critical role in ensuring equipment safety and reliability. Early fault detection holds significant importance. Addressing the challenges of insufficient robustness in bearing fault diagnosis under industrial high-noise conditions and the difficulty of extracting fault features from a single modality, this study proposes a three-channel multimodal fault diagnosis method that integrates a Convolutional Auto-Encoder (CAE) with a dual attention mechanism (M-CNNBiAM). This approach provides an effective technical solution for the precise diagnosis of bearing faults in high-noise environments. To suppress substantial noise interference, a CAE denoising module was designed to filter out intense noise, providing high-quality input for subsequent diagnostic networks. To address the limitations of single-modal feature extraction and restricted generalization capabilities, a three-channel time–frequency signal joint diagnosis model combining the Continuous Wavelet Transform (CWT) with an attention mechanism was proposed. This approach enables deep mining and efficient fusion of multi-domain features, thereby enhancing fault diagnosis accuracy and generalization capabilities. Experimental results demonstrate that the designed CAE module maintains excellent noise reduction performance even under −10 dB strong noise conditions. When combined with the proposed diagnostic model, it achieves an average diagnostic accuracy of 98% across both the CWRU and self-test datasets, demonstrating outstanding diagnostic precision. Furthermore, under −4 dB noise conditions, it achieves a 94% diagnostic accuracy even without relying on the CAE denoising module. With a single training cycle taking only 6.8 s, it balances training efficiency and diagnostic performance, making it well-suited for real-time, reliable bearing fault diagnosis in industrial environments with high noise levels.

Keywords: multimodal; BiGRU; CNN; CBAM; SW-MSA; W-MSA; CAE

1. Introduction

Bearings are critical components in mechanical systems [1], including high-speed trains [2], aircraft [3], and CNC machine tools [4]. During operation, bearings often endure alternating loads and operate in complex environments, making them prone to failure. These failures frequently exhibit subtle early warning signs. If not detected promptly, they may lead to equipment shutdowns, structural damage, and in severe cases, major safety incidents. Consequently, early fault detection is critically important. Implementing condition monitoring and intelligent diagnostics for rolling bearings has become a key direction in ensuring stable equipment operation [5,6].

In bearing fault diagnosis technology, traditional diagnostic methods, centered on experience-driven approaches, primarily analyze the statistical characteristics of time-domain vibration signals or convert time-domain signals to the frequency domain using the Fast Fourier Transform (FFT) [7], followed by fault identification and judgment by specialists. This approach exhibits significant limitations: On one hand, diagnostic outcomes heavily rely on operator expertise and experience, leading to high subjectivity, limited generalization, and suboptimal real-time performance, making it challenging to meet diagnostic demands in high-noise environments. On the other hand, traditional signal processing methods, such as the FFT, struggle to handle non-stationary and nonlinear signals, making it challenging to capture early, subtle fault characteristics. With the advancement of modern industry, traditional experience-driven methods cannot meet contemporary industrial demands. A bearing fault diagnosis method that balances real-time performance and diagnostic accuracy in high-noise environments has become a research hotspot.

In recent years, advancements in artificial intelligence technology have introduced novel approaches to bearing fault diagnosis. Data-driven deep learning methods, leveraging their robust feature adaptation extraction and nonlinear fitting capabilities, have emerged as a research hotspot [8]. Among these, fusion techniques combining time-domain and frequency-domain signals have garnered significant attention. Convolutional Neural Networks (CNNs) are widely applied across various fields due to their excellent feature extraction capabilities and their adaptability to dual-input modes for both one-dimensional signals and two-dimensional images. For instance, Yin et al. proposed an improved fault diagnosis method combining integrated noise-reconstruction empirical mode decomposition (IENEMD) with a parallel multi-scale CNN to extract fault features under noisy conditions [9]. Jiang et al. proposed a highly interpretable CNN fault diagnosis method by converting it into binary grayscale image analysis via singular-value decomposition. They introduced an Average Score Decrease (ASD) metric to enhance the interpretability of the CNN decision-making process through quantitative analysis [10]. Despite its robust feature extraction capabilities, CNN still faces limitations. Deep CNN models are prone to gradient vanishing issues. To mitigate this, He et al. introduced the residual concept, enabling direct connections between feature maps and outputs to provide a fast path for gradients [11]. For instance, Wang et al. proposed a fault diagnosis method combining GRU with Residual Networks (Res-Net) for time-varying operating conditions, achieving high diagnostic accuracy under such conditions [12]. CNNs exhibit strong short-term memory but suffer from forgetting long-term signals. Long-term prediction models offer solutions, such as the Bidirectional Gated Recurrent Unit (BiGRU). 
As both a recurrent neural network and a gating mechanism, BiGRU balances long-term memory and computational efficiency, often integrated with other models for fault diagnosis tasks [13]. For instance, Jie Man et al. proposed a novel organizational format for shaft temperature data, structuring measurement points into a graph based on their spatial positions. Subsequently, they employed a Graph Convolutional Network (GCN) and a GRU model to extract features and predict shaft temperature [14].

Single-modality approaches often struggle to extract complete fault features in noisy environments. Therefore, time–frequency joint methods are frequently employed for diagnostic tasks. The Continuous Wavelet Transform (CWT) maps one-dimensional time-domain signals into two-dimensional time–frequency matrices by convolving the original signal with adjustable-scale wavelet basis functions, thereby precisely describing the frequency distribution at different time points. It is often the preferred choice for obtaining frequency-domain signals [15]. For instance, Cui et al. combined Sparse Representation (SR) and CWT theory to propose a novel shared-frequency approach for bearing fault diagnosis [16]. When an image is input to the network, it is segmented into numerous regions, with the size of each region determined by the convolutional kernel size used in the network (in CNNs). This requires stacking enough layers for the model to capture global information. Increasing the number of layers deepens the network, potentially smoothing out local details in the original input image while also increasing computational complexity. Although deep networks can learn more abstract and advanced features, this does not guarantee that these advanced features are suitable for the current task. Achieving comparable performance with fewer layers has become a research hotspot. The emergence of attention mechanisms offers a solution. By assigning higher weights to critical fault-feature locations in the current feature map via attention, the model focuses on identifying fault patterns. For example, the Convolutional Block Attention Module (CBAM) is a lightweight attention mechanism composed of a channel attention mechanism (CAM) and a spatial attention mechanism (SAM) connected in series. This dual-attention coordination mechanism guides the model to adaptively focus on critical feature regions related to bearing faults. 
By assigning higher weight coefficients to fault-sensitive information, it suppresses interference from irrelevant background noise and redundant features, thereby significantly enhancing the model’s ability to detect subtle fault characteristics and improving the precision and effectiveness of feature extraction [17]. Qin et al. proposed a rolling bearing fault diagnosis method that integrates CBAM with ResNet. By embedding the CBAM into the residual block structure of ResNet, the attention mechanism enables adaptive focusing on critical fault features, effectively enhancing the model’s feature extraction capability and fault identification specificity [18]. Window-restricted attention operates only within the current window and cannot access global information. The most common approach is to apply attention after feature extraction is completed at each layer. However, calculating attention weights across the entire feature map is computationally intensive, especially for larger images, leading to significant overhead. How to compute attention weights within the current window while simultaneously capturing global information has attracted widespread attention. The Swin-Transformer addresses the challenge of cross-window connections. Liu et al. proposed the Swin-Transformer framework, which employs Window-based Multi-head Self-Attention (W-MSA) and Shifted Window-based Multi-head Self-Attention (SW-MSA) to enable cross-window connections while performing computations within the current window, thereby reducing resource consumption [19]. For instance, Zhou et al. designed a novel multi-source information fusion network (FEV-Swin) based on the Swin-Transformer framework. This network effectively fuses and diagnoses multi-source fault information using an embedded feature pyramid fusion module and a domain adaptation module [20].

To address the issues of insufficient single-modal feature extraction and inefficient multimodal information fusion under strong noise conditions, this paper proposes a three-channel multimodal fault diagnosis method that integrates a Convolutional Auto-Encoder (CAE) with a dual attention mechanism (M-CNNBiAM). A dedicated CAE denoising module is designed to recover periodic fault features from signals contaminated by intense noise. By integrating the CWT time–frequency transform with attention mechanisms, a multi-channel joint diagnostic model is constructed to fuse multidimensional features efficiently. This ultimately enables precise, rapid diagnosis of bearing faults in noisy environments. The main contributions of this paper are as follows:

(1): CNN Receptive Field Constraints: Incorporating SW-MSA and W-MSA into CNN layers restricts attention calculations to the current window while enabling cross-window connections to integrate global information.
(2): Feature Enhancement: Introducing CBAM after the window mechanism module compels the model to focus more intently on the fault location, thereby enhancing its ability to extract fault features.
(3): Multimodal Fusion: Integrating fault features extracted from two-dimensional frequency-domain images and one-dimensional time-series signals to leverage the complementary nature of these two data types fully. Dual-channel data fusion enables the model to simultaneously leverage both temporal and spectral information, thereby significantly improving fault classification accuracy.
(4): Noise Reduction: A CAE is employed to denoise the signals. Experimental results demonstrate that, in high-noise environments, the CAE not only exhibits strong noise reduction capability but also achieves high noise reduction efficiency.

This paper is organized as follows: Section 2 introduces the fundamental theory; Section 3 elaborates on the proposed model; Section 4 conducts experiments on the CWRU and self-test datasets; Section 5 summarizes the work and outlines future research directions.

2. Theoretical Basis

2.1. Convolutional Neural Networks

CNNs feature local connectivity, parameter sharing, and translation invariance. The same set of convolutional parameters is shared throughout the entire feature extraction process. This parameter-sharing mechanism significantly reduces the network’s parameter count, effectively alleviating computational complexity during model training and enhancing generalization [21]. Figure 1 illustrates the structure of a CNN, where the convolution operation can be expressed as

x_{i}^{l} = f \left( \sum_{j \in M_{i}} x_{j}^{l - 1} * w_{j i}^{l} + B_{i}^{l} \right)

(1)

where $x_{i}^{l}$ represents the i-th feature output of layer l; $M_{i}$ represents the i-th convolutional region of the (l−1)-th layer’s feature map; $w_{j i}^{l}$ and $B_{i}^{l}$ are the weight and bias matrices for feature extraction, which are optimized during model training; and $f$ represents the activation function, which introduces nonlinearity to enhance the model’s ability to fit complex features.
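As a minimal sketch of Equation (1), the following pure-Python function computes one convolutional layer on one-dimensional inputs, summing over the input channels before adding the bias and applying the activation. The function and variable names (`conv_layer`, `kernels`, `biases`) are illustrative, ReLU is assumed for $f$, and the sliding product is the cross-correlation form used by most deep learning frameworks; none of this is taken from the authors' implementation.

```python
def relu(v):
    """Assumed activation f in Eq. (1)."""
    return v if v > 0.0 else 0.0


def conv_layer(inputs, kernels, biases):
    """Apply Eq. (1) to 1-D feature maps.

    inputs     : list of input channels x_j^{l-1}, each a list of floats
    kernels[i][j]: kernel w_{ji}^l linking input channel j to output channel i
    biases[i]  : scalar bias B_i^l of output channel i
    """
    outputs = []
    for i, bias in enumerate(biases):
        k_len = len(kernels[i][0])
        out_len = len(inputs[0]) - k_len + 1  # stride 1, no padding
        channel = []
        for pos in range(out_len):
            acc = bias
            for j, x in enumerate(inputs):  # sum over the input channels j
                w = kernels[i][j]
                acc += sum(x[pos + t] * w[t] for t in range(k_len))
            channel.append(relu(acc))
        outputs.append(channel)
    return outputs


# Two input channels, one output channel, kernel length 2 (toy values).
x_prev = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]]
w = [[[-1.0, 1.0], [0.5, 0.5]]]
b = [0.1]
x_next = conv_layer(x_prev, w, b)
```

The same kernel is reused at every position, which is the parameter sharing described above.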

2.2. Gated Recurrent Unit (GRU)

Gated Recurrent Units (GRUs) are a core variant of recurrent neural networks (RNNs), achieving long-term memory of temporal signals solely through update gates and reset gates. Figure 2 illustrates the gating mechanism, with the GRU gating principle proceeding as follows [22]:

(1): Calculate the update gate $z_{t}$

z_{t} = σ (W_{z} x_{t} + U_{z} h_{t - 1} + b_{z})

(2)

where σ represents the sigmoid function, with an output range of (0, 1), used for gated “switch” control; $W_{z}$ and $U_{z}$ represent the parameter matrices, where $W_{z}$ corresponds to the weights of the input and $U_{z}$ corresponds to the weights of the hidden state; and $b_{z}$ represents the bias vector, used for adjusting the offset in linear transformations.

(2): Calculate the reset gate $r_{t}$

r_{t} = σ (W_{r} x_{t} + U_{r} h_{t - 1} + b_{r})

(3)

(3): Calculate the candidate hidden state ${\tilde{h}}_{t}$

{\tilde{h}}_{t} = \tanh (W_{h} \cdot [r_{t} ⊙ h_{t - 1}, x_{t}] + b_{h})

(4)

where $\tanh$ represents the hyperbolic tangent activation function, with an output range of (−1, 1), used for feature mapping of candidate states; ⊙ denotes element-wise multiplication, enabling per-element weight adjustment; and ${\tilde{h}}_{t}$ represents the newly generated candidate state, computed from the current input and the reset-gated historical information.

(4): Update the hidden state $h_{t}$

h_{t} = (1 - z_{t}) ⊙ h_{t - 1} + z_{t} ⊙ {\tilde{h}}_{t}

(5)

where $h_{t}$ represents the integration of historical information and the current candidate state, outputting the temporal feature representation for the current time step.
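The four steps can be sketched as a single GRU update in plain Python. This is only an illustration of Equations (2)–(5): the parameter dictionary, its key names, and the toy sizes (one input feature, two hidden units) are assumptions, and a bidirectional GRU would run a second copy of the same step over the reversed sequence.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(m, v):
    """Matrix-vector product for nested lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def gru_step(x_t, h_prev, p):
    """One GRU update following Eqs. (2)-(5); p holds the parameters."""
    # Eq. (2): update gate
    z = [sigmoid(a + b + c) for a, b, c in
         zip(matvec(p["W_z"], x_t), matvec(p["U_z"], h_prev), p["b_z"])]
    # Eq. (3): reset gate
    r = [sigmoid(a + b + c) for a, b, c in
         zip(matvec(p["W_r"], x_t), matvec(p["U_r"], h_prev), p["b_r"])]
    # Eq. (4): candidate state from the concatenation [r ⊙ h_prev, x_t]
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    cand = [math.tanh(a + b) for a, b in
            zip(matvec(p["W_h"], gated + x_t), p["b_h"])]
    # Eq. (5): blend the previous state and the candidate state
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, cand)]


# Toy parameters (hypothetical values): 1 input feature, 2 hidden units.
params = {
    "W_z": [[0.5], [-0.3]], "U_z": [[0.1, 0.0], [0.0, 0.1]], "b_z": [0.0, 0.0],
    "W_r": [[0.2], [0.4]], "U_r": [[0.0, 0.1], [0.1, 0.0]], "b_r": [0.0, 0.0],
    "W_h": [[0.3, -0.2, 0.7], [0.1, 0.4, -0.5]], "b_h": [0.0, 0.0],
}
h = [0.0, 0.0]
for x in ([1.0], [0.5], [-1.0]):
    h = gru_step(x, h, params)
```

When the update gate is close to zero the previous state passes through almost unchanged, which is how the unit retains long-term information.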

2.3. Collaborative Attention Mechanism

Figure 3 illustrates the structure of the collaborative attention mechanism. The input feature map is layer-normalized before attention scores are calculated within the current window (W-MSA). After further normalization, the features pass through an MLP and then through the shifted-window attention (SW-MSA), which obtains global information via cross-window connections [23]. The output features, now enriched with global information, are fed into the CBAM. Here, channel-dimensional enhancement first down-weights redundant channels, followed by spatial-dimensional feature enhancement that strengthens global feature interactions, culminating in the final feature output [24]. The overall feature output is represented by Equation (6).

\begin{cases} F_{input} = z^{l - 1} \\ {\hat{z}}^{l} = \text{W-MSA} (\text{LN} (z^{l - 1})) + z^{l - 1}, \quad z^{l} = \text{MLP} (\text{LN} ({\hat{z}}^{l})) + {\hat{z}}^{l} \\ {\hat{z}}^{l + 1} = \text{SW-MSA} (\text{LN} (z^{l})) + z^{l}, \quad z^{l + 1} = \text{MLP} (\text{LN} ({\hat{z}}^{l + 1})) + {\hat{z}}^{l + 1} \\ M_{c} (z^{l + 1}) = σ (\text{MLP} (F_{avg}^{c}) + \text{MLP} (F_{\max}^{c})) \\ F_{c}^{'} = M_{c} (z^{l + 1}) \otimes z^{l + 1} \\ M_{s} (F_{c}^{'}) = σ (\text{Conv}_{7 \times 7} ([F_{avg}^{s}, F_{\max}^{s}])) \\ F_{s}^{'} = M_{s} (F_{c}^{'}) \otimes F_{c}^{'} \\ F_{output} = F_{s}^{'} \end{cases}

(6)

where $F_{input}$ represents the input feature map; LN denotes layer normalization; ${\hat{z}}^{l}$ represents the W-MSA output feature and $z^{l}$ the output feature of the MLP that follows it; ${\hat{z}}^{l + 1}$ represents the SW-MSA output feature and $z^{l + 1}$ the output feature of the MLP that follows it; $M_{c}$ and $M_{s}$ represent the channel and spatial attention weights; σ denotes the sigmoid function; $F_{avg}^{c}$ and $F_{\max}^{c}$ denote the mean-pooled and max-pooled features used for channel attention, and $F_{avg}^{s}$ and $F_{\max}^{s}$ the corresponding pooled features used for spatial attention; $F_{c}^{'}$ and $F_{s}^{'}$ represent the channel-enhanced and spatially enhanced feature maps; and $\otimes$ denotes element-wise multiplication.
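To make the channel-attention part of Equation (6) concrete, the sketch below computes $M_{c}$ and the reweighted map $F_{c}^{'}$ for a tiny feature map in plain Python. It covers only the channel half of CBAM; the window attention and the $7 \times 7$ spatial convolution are omitted. The shared-MLP weights `w1` and `w2`, the ReLU inside the MLP, and the toy values are assumptions chosen for illustration, not parameters from the paper.

```python
import math


def shared_mlp(v, w1, w2):
    """Two-layer MLP (ReLU in between) shared by both pooled vectors."""
    hidden = [max(0.0, sum(a * b for a, b in zip(row, v))) for row in w1]
    return [sum(a * b for a, b in zip(row, hidden)) for row in w2]


def channel_attention(fmap, w1, w2):
    """fmap[c][y][x] -> (channel weights M_c, reweighted map F'_c)."""
    # Global average and max pooling over the spatial dimensions.
    avg = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in fmap]
    mx = [max(map(max, ch)) for ch in fmap]
    logits = [a + b for a, b in
              zip(shared_mlp(avg, w1, w2), shared_mlp(mx, w1, w2))]
    weights = [1.0 / (1.0 + math.exp(-v)) for v in logits]  # sigmoid
    scaled = [[[weights[c] * v for v in row] for row in ch]
              for c, ch in enumerate(fmap)]
    return weights, scaled


# Two 2x2 channels: one with strong activations, one nearly empty.
fmap = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 1.0]]]
w1 = [[1.0, -1.0]]      # 2 channels -> 1 hidden unit (hypothetical weights)
w2 = [[1.0], [-1.0]]    # 1 hidden unit -> 2 channels
weights, scaled = channel_attention(fmap, w1, w2)
```

With these toy weights the strongly activated channel receives a weight near 1 and the nearly empty channel a weight near 0, which is the down-weighting of redundant channels described above.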

2.4. Wavelet Transform

CWT [25] calculates the similarity between the original signal and wavelet basis functions at different scales and time-domain positions by adjusting the scale factor and shift amount, thereby constructing a two-dimensional time–frequency feature matrix. Through the synergistic interaction between the scale factor and shift amount, CWT provides a joint representation of the signal’s local time-domain features and frequency-domain distribution patterns, enabling adaptive analysis of non-stationary signals. The principle and steps of the Continuous Wavelet Transform are as follows:

(1): Wavelet basis function construction

ψ_{a, b} (t) = (\frac{1}{\sqrt{| a |}}) ψ (\frac{t - b}{a})

(7)

In the formula, $a$ denotes the scaling factor, which controls the degree of stretching of the wavelet basis; $b$ denotes the translation factor, which controls the position of the wavelet basis in the time domain; and $ψ_{a, b} (t)$ represents the transformed wavelet basis function.

(2): Calculate the wavelet coefficients

W_{f} (a, b) = \int_{- \infty}^{+ \infty} f (t) \cdot ψ_{a, b} (t) d t = \frac{1}{\sqrt{| a |}} \int_{- \infty}^{+ \infty} f (t) \cdot ψ (\frac{t - b}{a}) d t

(8)

(3): Construction of time–frequency images

The wavelet coefficients calculated are arranged into a two-dimensional matrix according to scale factors and translation factors to construct the time–frequency image (Figure 4).
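The three steps can be sketched with a direct discretization of Equations (7) and (8). The code below is an illustrative, unoptimized version: it assumes the complex Morlet wavelet in its common "cmor" form, uses the complex conjugate of the wavelet inside the sum (the usual convention for complex wavelets), and picks a bandwidth of 1.5 purely to keep the toy example short. The function names and the test signal are hypothetical.

```python
import cmath
import math


def cmor(t, bandwidth, center_freq):
    """Complex Morlet wavelet: Gaussian envelope times complex exponential."""
    envelope = math.exp(-t * t / bandwidth) / math.sqrt(math.pi * bandwidth)
    return envelope * cmath.exp(2j * math.pi * center_freq * t)


def cwt_coefficient(signal, scale, shift,
                    bandwidth=1.5, center_freq=1.0, half_width=5.0):
    """Discretized Eq. (8) at one (a, b) pair, in units of samples."""
    lo = max(0, int(shift - half_width * scale))
    hi = min(len(signal) - 1, int(shift + half_width * scale))
    total = 0j
    for n in range(lo, hi + 1):
        psi = cmor((n - shift) / scale, bandwidth, center_freq)
        total += signal[n] * psi.conjugate()
    return total / math.sqrt(abs(scale))


def scalogram(signal, scales, shifts, **kwargs):
    """Rows are scales, columns are shifts: the time-frequency matrix."""
    return [[abs(cwt_coefficient(signal, a, b, **kwargs)) for b in shifts]
            for a in scales]


# A cosine at 0.05 cycles/sample should respond most strongly at the
# scale a = center_freq / 0.05 = 20.
signal = [math.cos(2 * math.pi * 0.05 * n) for n in range(512)]
S = scalogram(signal, scales=[10, 20, 40], shifts=[200, 256, 300])
```

Each row of `S` corresponds to one scale and each column to one shift, so rendering `S` as an image gives the kind of time–frequency map described in step (3).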

2.5. Convolutional Auto-Encoder

CAE compresses the input signal through an encoder to map it into a low-dimensional space. The decoder reconstructs the signal from this low-dimensional space back to its original dimension, aiming to maintain consistency between the reconstructed and original signals [26]. Since noise exhibits disordered characteristics and is distributed across the entire frequency band, while fault signals typically follow periodic impact patterns, CAE is particularly well-suited for noise reduction tasks in bearing vibration signals. The CAE architecture is shown in Figure 5. The noisy signal first enters the encoder, where it passes sequentially through three convolutional layers that compress it into a low-dimensional space. A flattening operation and a linear layer for dimensional adjustment follow. The decoder uses three layers of transposed convolutions to restore the signal’s original dimensionality from the low-dimensional space, achieving the noise reduction objective. The following equations describe the signal’s encoding and decoding.

Encoding operation:

F_{encoder} = \text{Conv}_{3 \times 3} (\text{Conv}_{5 \times 5} (\text{Conv}_{7 \times 7} (\text{Signal}_{noise})))

(9)

Decoding operation:

F_{decoder} = \text{Conv}_{7 \times 7} (\text{Conv}_{5 \times 5} (\text{Conv}_{3 \times 3} (F_{encoder})))

(10)
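A quick way to check that an encoder and decoder of this shape are symmetric is to trace the signal length through the layers. The sketch below uses the standard output-length formulas for strided and transposed one-dimensional convolutions (dilation 1). The stride of 2, padding of `kernel // 2`, and output padding of `stride - 1` are assumptions made for illustration; the paper gives only the kernel sizes, and the flatten and linear bottleneck is left out.

```python
def conv_out_len(length, kernel, stride, padding):
    """Output length of a strided 1-D convolution (dilation 1)."""
    return (length + 2 * padding - kernel) // stride + 1


def conv_transpose_out_len(length, kernel, stride, padding, output_padding):
    """Output length of a 1-D transposed convolution (dilation 1)."""
    return (length - 1) * stride - 2 * padding + kernel + output_padding


def cae_lengths(length, enc_kernels=(7, 5, 3), dec_kernels=(3, 5, 7), stride=2):
    """Trace the signal length through the encoder and then the decoder."""
    trace = [length]
    for k in enc_kernels:   # Eq. (9): kernels 7 -> 5 -> 3
        length = conv_out_len(length, k, stride, k // 2)
        trace.append(length)
    for k in dec_kernels:   # Eq. (10): kernels 3 -> 5 -> 7
        length = conv_transpose_out_len(length, k, stride, k // 2, stride - 1)
        trace.append(length)
    return trace


trace = cae_lengths(1024)
# trace == [1024, 512, 256, 128, 256, 512, 1024]
```

Under these assumed settings each encoder layer halves the length and each decoder layer doubles it, so a 1024-point input is restored to 1024 points.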

3. M-CNNBiAM Model

This section introduces the M-CNNBiAM model, providing its specific parameters, structural diagram, data dimension transformation table, and model diagnostic workflow.

The M-CNNBiAM architecture is shown in Figure 6, with the process divided into four steps: noise addition, construction of time–frequency data, feature extraction, and fault classification. Table 1 and Table 2 list the module parameters, Table 3 gives the image feature extraction, and Table 4 gives the vibration signal feature extraction.

(1): Noise Addition

This paper added Gaussian noise with SNR values of [−10, −8, −6, −4, −2, 0, 2, None] dB to the original vibration signal and verified the accuracy of the noise addition.

(2): Constructing Time–Frequency Data

The noisy signal first enters the CAE for denoising, as shown in Figure 6. The denoised signal is converted to a time–frequency image via CWT using the complex Morlet wavelet with a bandwidth parameter of 100 and a center frequency parameter of 1. The CWT images and the vibration signals together form the multimodal dataset.

(3): Feature Extraction

The data first undergoes channel expansion via a convolutional layer with a kernel size of $7 \times 7$ and a stride of 2, followed by image size reduction to reduce computational complexity. The preprocessed signal enters a dual-parallel image branch. Data entering image branch 1 passes sequentially through three CNN layers (ResNetBlock) with kernel sizes $7 \times 7 \to 7 \times 7 \to 1 \times 1$ and a stride of 1. This is followed by a window attention module and CBAM, with the output features re-entering a ResNetBlock to generate the final features. Branch 2 shares the same architecture as branch 1, except that its first three CNN layers use kernel sizes of $5 \times 5 \to 5 \times 5 \to 1 \times 1$, while all other parameters remain identical. Finally, the features of the two branches are concatenated to produce the frequency-domain modal features. The vibration signal first passes through a convolutional layer with a kernel size of $7 \times 7$ and a stride of 2 for dimensionality reduction and channel expansion. It then passes sequentially through convolutional layers with kernel sizes $5 \times 5 \to 5 \times 5 \to 1 \times 1$, a ResNetBlock, and finally a BiGRU with [32, 64] neurons to output the time-domain modal feature. All CNN layers are followed by batch normalization (BN), a ReLU activation function, and a dropout layer. Finally, the time-domain feature (F3) and frequency-domain features (F1, F2) are concatenated along the channel dimension to obtain the multimodal feature.

(4): Fault Classification

Multimodal features are concatenated along the channel dimension before entering the classification layer, enabling fault diagnosis.

In Figure 6, the noisy signal first passes through a three-layer convolutional layer with kernel sizes [7, 5, 3]. It is then deconvolved to complete the encoder output. Subsequently, the signal undergoes deconvolution before entering the decoder, which processes it through three layers of transposed convolutions with kernel sizes [3, 5, 7]. Finally, after deconvolution, the decoder outputs the denoised signal. The main hyperparameter settings for CAE are as follows: iteration count: 320, loss function: MSE loss, learning rate: 0.0001.

Figure 6 shows that the wavelet signals are fed into the preprocessing CNN in parallel to adjust the input dimensions for subsequent CNN processing. They then pass through three ResNetBlocks. The first branch employs large kernels with expanded receptive fields, enabling coverage of broader image regions. This configuration excels at capturing global structural features while offering superior suppression of large-scale noise. Residual connections prevent gradient vanishing. The features then enter the Swin-Transformer block. W-MSA restricts computations to the current window, reducing computational load, while SW-MSA enables cross-window connections to capture global information. Subsequently, the features are processed by CBAM to further focus on the fault location before entering the final ResNetBlock for dimension adjustment and feature output. The small-kernel branch has a smaller receptive field, excelling at capturing local detail features while preserving finer information. Its heightened sensitivity to small-scale noise facilitates precise filtering. This combination enables coverage of multi-scale features from local to global, avoiding the one-sided feature capture inherent in single-scale convolutional kernels. The time-domain branch employs five ResNetBlocks with stride control. The first three layers suppress noise while capturing long-term dependencies, while the latter two layers extract global features, covering the full spectrum from macro trends to micro-transients. Each convolutional layer utilizes residual connections. The temporal features are finally fed into a BiGRU to prevent long-term sequence forgetting. The gating design of BiGRU inherently provides noise resistance, further mitigating residual noise. The three-channel features are concatenated along the channel dimension before being fed into the classification layer. 
To mitigate overfitting in complex models, a dropout layer with Dropout2d = 0.1 is applied after each convolutional layer in the image branch. Dropout1d = 0.1 is applied after the five convolutional layers in the time-domain branch, and Dropout1d = 0.2 is applied after the BiGRU. The final fusion layer employs Dropout1d = 0.5.

4. Experiments and Results

This section validates the proposed model’s generalization capability and diagnostic accuracy through two sets of comparative experiments. Testing was conducted using both the Case Western Reserve University (CWRU) bearing failure benchmark dataset and a simulated operating condition dataset, ensuring coverage of both real-world failure scenarios and controlled simulations to enhance the reliability of the conclusions. To comprehensively evaluate model performance across multiple dimensions while quantifying the impact of hyperparameters on diagnostic effectiveness, the following core evaluation metrics were selected:

First, we employed accuracy, precision, recall, and F1-score. Accuracy reflects the model’s overall classification accuracy; precision measures the reliability of predictions; recall indicates the ability to identify target faults and reduce missed detections; and the F1-score, as the harmonic mean of the two, balances comprehensive classification performance in scenarios with imbalanced samples. Additionally, we included runtime and parameter count metrics. Runtime is the total time for one training cycle and reflects training efficiency. Parameter count represents the total number of learnable parameters in the model, measuring its complexity and storage overhead. Finally, a confusion matrix visually presents the classification results. Rows correspond to the actual fault categories of samples, columns represent the model’s predicted categories, and element values indicate the number of classifications for each category. This clearly identifies the correct classification status and misclassification types for various faults, providing a clear direction for model optimization.
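As a brief illustration of how these metrics follow from a confusion matrix with the row and column convention stated above, the sketch below computes accuracy together with macro-averaged precision, recall, and F1-score. Macro averaging (an unweighted mean over classes) is an assumption here, since the averaging scheme is not stated, and the 2 × 2 matrix is a made-up example.

```python
def classification_metrics(cm):
    """cm[i][j] = number of samples of actual class i predicted as class j."""
    n = len(cm)
    total = sum(map(sum, cm))
    accuracy = sum(cm[i][i] for i in range(n)) / total
    precisions, recalls, f1s = [], [], []
    for k in range(n):
        tp = cm[k][k]
        predicted = sum(cm[i][k] for i in range(n))  # column sum
        actual = sum(cm[k])                          # row sum
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    # Macro average: unweighted mean over the classes.
    return (accuracy, sum(precisions) / n, sum(recalls) / n, sum(f1s) / n)


cm = [[8, 2],
      [1, 9]]
acc, prec, rec, f1 = classification_metrics(cm)
```

For this toy matrix the accuracy is 17/20 = 0.85, and the per-class recalls of 0.8 and 0.9 average to 0.85 as well.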

The hyperparameters for this model are set as follows: learning rate of 0.0001; AdamW optimizer; weight decay of 1 × 10⁻⁴; training for 50 epochs.

The experimental platform configuration is as follows: CPU: Intel Core i5-13450HX, RAM: 16 GB, GPU: NVIDIA GeForce RTX 4060. The PyTorch deep learning framework (Python version 3.11.7) was employed for model training and experimental validation.

4.1. Noise Addition and Noise Reduction Effect Verification

To better simulate the high noise levels present in real-world working environments, Gaussian noise is added to the original signal. The signal-to-noise ratio (SNR) can be expressed as

SNR_{dB} = 10 \log_{10} \left( \frac{P_{signal}}{P_{noise}} \right)

(11)

where $P_{signal}$ represents the power of the original signal and $P_{noise}$ represents the power of the added noise.
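Equation (11) can be inverted to obtain the noise power needed for a target SNR, which is one straightforward way to add noise and then verify the result. The sketch below is a minimal standard-library version; the function names, the sinusoidal test signal, and the fixed random seed are illustrative assumptions rather than the authors' procedure.

```python
import math
import random


def add_gaussian_noise(signal, snr_db, rng):
    """Add zero-mean Gaussian noise so that the result has the target SNR."""
    p_signal = sum(v * v for v in signal) / len(signal)
    p_noise = p_signal / (10.0 ** (snr_db / 10.0))  # Eq. (11) solved for P_noise
    sigma = math.sqrt(p_noise)
    return [v + rng.gauss(0.0, sigma) for v in signal]


def measured_snr_db(clean, noisy):
    """Recompute Eq. (11) from the clean signal and the realized noise."""
    p_signal = sum(v * v for v in clean) / len(clean)
    p_noise = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10.0 * math.log10(p_signal / p_noise)


rng = random.Random(0)  # fixed seed so the example is repeatable
clean = [math.sin(2 * math.pi * 0.01 * n) for n in range(20000)]
noisy = add_gaussian_noise(clean, -6.0, rng)
snr_check = measured_snr_db(clean, noisy)
```

At −6 dB the noise power is roughly four times the signal power, and `snr_check` should land close to the requested value, with a small deviation due to the finite number of samples.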

The noise added to the experiment is Gaussian. To verify the effectiveness of noise addition, the spectrum, time-domain vibration, and Continuous Wavelet Transform diagrams are provided, taking a Gaussian noise signal with SNR = −6 dB as an example.

Figure 7 shows the frequency-domain signal comparison. The frequency domain provides a more intuitive view: the original signal exhibits distinct peaks at specific frequencies due to the fault’s impact, while fluctuations at other frequencies are relatively uniform. After adding −6 dB noise, the fault peaks are significantly masked, and fluctuations across the entire frequency band become irregular. Traditional models are prone to getting stuck in local optima under such high-noise conditions, failing to learn the fault information or instead learning the noise. This leads to high accuracy during training but near chance-level accuracy during validation.

Figure 8 compares time-domain vibration signals. The original time-domain signal exhibits a distinct periodic pattern. After noise is added, the noise completely obscures this pattern, making it impossible to identify fault characteristics.

As shown in Figure 9, the time-domain periodicity of the original vibration signal is preserved after CWT transformation, appearing as banded stripes. Peak values in the frequency domain appear as red dots at specific locations, indicating where the CWT assigned larger wavelet coefficients. Through time–frequency complementarity, this approach enables the model to capture comprehensive fault characteristics, enhancing its generalization capability and diagnostic effectiveness.

To validate the auto-encoder’s noise reduction effectiveness, we visualized the signal’s envelope spectrum. As shown in Figure 10, the red-boxed section represents noise, which has completely obscured the signal’s fault frequency. After noise reduction using the auto-encoder designed in this paper, we observe that the fault features are perfectly restored. Compared to the envelope spectrum of the original signal, only very faint noise remains. To verify the gap between the denoised and original signals, we use the MSE metric, as shown in Figure 11. The MSE of the original signal is 0.08, while that of the denoised signal is 0.11, fully validating the proposed AE’s noise reduction capability.

4.2. CWRU Experimental Data

The CWRU dataset includes three types of bearing faults: inner race, outer race, and rolling element. Each fault type has three fault sizes (7, 14, and 21 mils, i.e., 0.007, 0.014, and 0.021 inches), which together with the healthy state form a total of ten classification categories. Bearing vibration signals are acquired at a sampling frequency of 12 kHz. For the experimental sample data, the length of each sample is set to 1024 points, and the overlap rate is set to 0.5 to fully utilize the data while avoiding feature redundancy. The dataset is randomly divided into training, validation, and test sets in a 7:3:1 ratio. Specifically, the training set is used for parameter learning, the validation set for hyperparameter tuning and model overfitting monitoring, and the test set for evaluating the model’s generalization performance. Detailed information about the experimental data is shown in Table 5.
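The sample construction described here (length 1024, overlap rate 0.5) amounts to a sliding window with a step of 512 points. A minimal sketch, with a hypothetical function name and a dummy record standing in for a real vibration signal:

```python
def segment(signal, length=1024, overlap=0.5):
    """Cut a record into fixed-length samples with the given overlap rate."""
    step = int(length * (1.0 - overlap))  # 512 points for a 0.5 overlap
    last_start = len(signal) - length
    return [signal[s:s + length] for s in range(0, last_start + 1, step)]


record = list(range(4096))  # placeholder for one vibration record
samples = segment(record)
```

A 4096-point record yields (4096 − 1024) / 512 + 1 = 7 samples, and consecutive samples share half of their points.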

To validate the fault diagnosis performance of the proposed model, experiments were conducted using the CWRU bearing dataset. Table 6 lists the model’s core evaluation metrics. As shown in Table 6, even under extremely high-noise conditions (SNR = −10 dB), the model maintains an accuracy of 98.57%, attributed to CAE’s powerful noise reduction capability. As SNR gradually increases, the model’s feature extraction capability progressively improves, further raising diagnostic accuracy. When SNR = None, the accuracy reaches 100%. Visualizing the confusion matrix facilitates better analysis of the model’s classification capability for fault categories. As shown in Figure 12, the misclassified data points in the confusion matrix are generally dispersed across categories rather than concentrated in a single one. This indicates that the model possesses good diagnostic capability, is not biased toward particular fault types, and exhibits acceptable robustness. As shown in Figure 13, the t-SNE visualization demonstrates strong overall clustering, with distinct separation between categories, confirming that the model learned universal features. To further analyze the training process, we visualized the model’s training curve. As shown in Figure 14, the training process shows minimal overall fluctuations. The model achieves a high accuracy rate by the 10th epoch, with the training and validation curves converging by the 20th epoch, indicating that training has converged. The subsequent 30 epochs proceed smoothly without overfitting. This stability stems from residual connections ensuring stable gradient propagation, coupled with a cosine annealing strategy for learning rate scheduling, which facilitates a more stable and efficient search for the optimal parameter combination.
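For reference, the closed-form cosine annealing schedule can be written in a few lines. The sketch below uses the base learning rate (0.0001) and epoch count (50) given in the hyperparameter settings; the minimum learning rate of 0 and the choice of one full half-cosine over all 50 epochs are assumptions, since the scheduler settings are not reported.

```python
import math


def cosine_annealing_lr(epoch, base_lr=1e-4, total_epochs=50, min_lr=0.0):
    """Learning rate at a given epoch under closed-form cosine annealing."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cosine)


schedule = [cosine_annealing_lr(e) for e in range(51)]
```

The rate starts at the base value, passes through half of it at epoch 25, and decays smoothly toward the minimum by epoch 50, which gives large steps early in training and small steps once the curves have converged.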

4.2.1. Comparative Experiments

To validate the superiority and stability of the proposed model, we compared it with other advanced models. Table 7 presents the evaluation metrics for the comparative models. Qiao et al. proposed a dual-input neural network model combining CNN and LSTM. This model takes time–frequency signals as input and uses mini-batch training with batch normalization, achieving high fault recognition rates under noisy conditions along with excellent noise immunity and load adaptability on the CWRU dataset [27]. Zhang et al. proposed a fault diagnosis model integrating wavelet denoising with KANTransformer. The front end filters redundant noise from raw signals via a wavelet denoising module, providing high-quality data for feature extraction. The core innovation is KANTransformer’s introduction of learnable activation functions within its linear layer, which overcomes the expressive limitations of fixed activation functions and enhances the model’s ability to capture nonlinear and non-stationary features in fault signals. Experimental validation shows that this model resists interference under complex noise conditions, effectively distinguishes between noise and fault features, and delivers high fault diagnosis accuracy and robustness [28]. He et al. proposed the EWSNet network, which features wavelet weight initialization and a balanced dynamic adaptive thresholding algorithm, and demonstrated its effectiveness and reliability across four datasets [29]. Zhang et al. proposed a small-sample fault diagnosis method that combines dual-path convolutional attention (DCA) with Bidirectional Gated Recurrent Units (BiGRUs) for noisy, variable operating conditions. This model integrates spatio-temporal features using a regularization-based training strategy coupled with BiGRU, and results demonstrate its strong generalization capability and robustness [30].
All models were evaluated under −4 dB noise conditions. Figure 15 presents the accuracy bar chart for comparative experiments.

As shown in Figure 15, both model (a) and model (b) exhibit diagnostic accuracy exceeding 99% under strong noise conditions with SNR = −4 dB. This is attributed to the pre-denoising modules in both models, which denoise the signal before network processing. For instance, method (b) employs wavelet denoising, thereby maintaining high diagnostic precision. Although models (c) and (d) exhibit lower accuracy under noisy conditions, these methods were designed for and tested on small datasets. The results reported for them were obtained using 150 samples, the maximum sample size in those studies. Accuracy generally improves to some extent as the sample size increases, so models (c) and (d) may perform better when tested on datasets of comparable size.
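To make the −4 dB condition concrete, the sketch below adds white Gaussian noise at a target SNR using the usual definition SNR = 10·log10(P_signal / P_noise). It assumes additive white Gaussian noise and a toy sinusoid in place of a real bearing record; the function names are illustrative and are not taken from the source.

```python
import numpy as np


def add_awgn(signal, snr_db, rng=None):
    """Add white Gaussian noise so that the result has the requested SNR (in dB)."""
    rng = np.random.default_rng() if rng is None else rng
    signal = np.asarray(signal, dtype=float)
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise


def measured_snr_db(clean, noisy):
    """Empirical SNR of a noisy signal relative to its clean reference."""
    noise = noisy - clean
    return 10.0 * np.log10(np.mean(clean ** 2) / np.mean(noise ** 2))


# Toy 1024-point tone sampled at 12 kHz (a stand-in for a vibration sample).
t = np.arange(1024) / 12_000.0
clean = np.sin(2.0 * np.pi * 100.0 * t)
noisy = add_awgn(clean, snr_db=-4.0, rng=np.random.default_rng(0))
print(f"measured SNR: {measured_snr_db(clean, noisy):.2f} dB")
```

At −4 dB the noise carries roughly 2.5 times the power of the signal, and at −10 dB ten times, which is why the fault signatures are visually masked before denoising.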

Each of the models above has distinct advantages; however, most noise-robust diagnosis models rely heavily on data preprocessing and neglect the network’s inherent noise resistance. Furthermore, when a neural network is used for denoising under extremely high-noise conditions, training becomes unstable and takes significantly longer. A model that is itself noise-resistant under moderate noise is therefore more advantageous for practical deployment. Consequently, this paper also tested the core architecture alone, without the CAE, at SNRs of −4, −2, 0, and 2 dB, with results shown in Table 8. The model achieved 94.42% accuracy even at −4 dB, which is attributable to its structural design. The residual CNN blocks first suppress noise while ensuring proper gradient propagation, and they work with the attention mechanisms to enhance key features along the spatial and channel dimensions. The temporal branch introduces a BiGRU, which further suppresses noise through its gating mechanisms while preserving long-term memory. The three feature streams are then deeply fused. As observed in Figure 16, the training curve at −4 dB fluctuates only moderately, indicating that the model learns adequately under noisy conditions. This model can therefore be deployed readily in practical engineering applications with moderate noise intensity, and when noise becomes excessive, CAE pre-denoising maintains high diagnostic accuracy.
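The fusion of the three feature streams can be illustrated at the level of tensor shapes, following the sizes listed in Table 3 and Table 4. This is only a shape sketch with random arrays: the time-averaging of the BiGRU output, the random projection weights, and the pooled linear classifier are assumptions made for the example, because the source does not specify how those steps are implemented.

```python
import numpy as np

rng = np.random.default_rng(0)
B, C, H, W = 32, 128, 32, 32  # batch, channels, height, width (Table 3)

# Stand-ins for the two image-branch outputs F1 and F2.
f1 = rng.standard_normal((B, C, H, W), dtype=np.float32)
f2 = rng.standard_normal((B, C, H, W), dtype=np.float32)

# Stand-in for the BiGRU output of the vibration branch (Table 4).
gru_out = rng.standard_normal((B, 512, 256), dtype=np.float32)

# Assumption: collapse the time axis by averaging to obtain [32, 256].
pooled = gru_out.mean(axis=1)

# Linear projection 256 -> 128 with hypothetical random weights.
w_proj = rng.standard_normal((256, C), dtype=np.float32) * 0.01
f3_vec = pooled @ w_proj                                      # [32, 128]

# Broadcast the vector over the spatial grid: [32, 128, 1, 1] -> [32, 128, 32, 32].
f3 = np.broadcast_to(f3_vec[:, :, None, None], (B, C, H, W))

# Channel-wise concatenation of the three streams.
fused = np.concatenate([f1, f2, f3], axis=1)                  # [32, 384, 32, 32]

# Assumption: global average pooling plus a linear layer gives the 10 class scores.
w_cls = rng.standard_normal((3 * C, 10), dtype=np.float32) * 0.01
logits = fused.mean(axis=(2, 3)) @ w_cls                      # [32, 10]
print(fused.shape, logits.shape)
```

Concatenating along the channel axis keeps each stream's features intact and leaves the weighting between streams to the classification layer.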

4.2.2. Ablation Experiment

To demonstrate the contribution of each module more clearly, ablation experiments were conducted on the CWRU dataset at SNR = −4 dB. The ablation models are listed in Table 9. Model-1 uses only the first image branch; Model-2 uses only the second image branch; Model-3 combines both image branches; Model-4 omits the window attention mechanism from Model-3; Model-5 omits the CBAM from Model-3; and Model-6 includes only the vibration signal channel. Table 10 lists the evaluation metrics, and Figure 17 displays the confusion matrices.

In Table 10, Model-1 achieves an accuracy of 90.99%, while Model-2 reaches 96.14%, an improvement of 5.15 percentage points. This is attributed to Model-2’s larger convolutional kernels, which capture a broader receptive field and exhibit stronger noise suppression, thereby yielding higher accuracy. Conversely, Model-1 better preserves fine details, complementing Model-2’s strengths. Model-3, which fuses the two branches, achieves an accuracy of 98.71%, which is 7.72 percentage points higher than Model-1 and 2.57 percentage points higher than Model-2, demonstrating the advantages of dual-channel fusion. Model-4, which lacks the window attention mechanism, achieves only 93.13% accuracy, 5.58 percentage points below Model-3, highlighting the importance of window attention. Model-5, without CBAM, is 1.29 percentage points below Model-3, underscoring the significance of CBAM’s final pruning of redundant channels. Model-6 achieves 98.28% accuracy, attributed to the BiGRU, which suppresses redundant noise while providing long-term memory; its accuracy is only 0.43 percentage points below that of Model-3. The three-channel fusion ultimately attains 99.57% accuracy. The ablation results indicate that incorporating the dual attention mechanisms significantly improves model accuracy and that the proposed three-channel feature fusion method is better suited for fault diagnosis tasks.
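The differences quoted above are differences in percentage points between the accuracies in Table 10, and they can be recomputed directly. The short script below uses only the Table 10 accuracies; the helper name `gain` is introduced here for illustration.

```python
# Average accuracies (%) copied from Table 10.
accuracy = {
    "Model-1": 90.99,
    "Model-2": 96.14,
    "Model-3": 98.71,
    "Model-4": 93.13,
    "Model-5": 97.42,
    "Model-6": 98.28,
    "Proposed": 99.57,
}


def gain(model, baseline):
    """Accuracy difference in percentage points, rounded to two decimals."""
    return round(accuracy[model] - accuracy[baseline], 2)


comparisons = [
    ("Model-2", "Model-1"),
    ("Model-3", "Model-1"),
    ("Model-3", "Model-2"),
    ("Model-4", "Model-3"),
    ("Model-5", "Model-3"),
    ("Model-6", "Model-3"),
    ("Proposed", "Model-3"),
    ("Proposed", "Model-6"),
]
for model, baseline in comparisons:
    print(f"{model} vs {baseline}: {gain(model, baseline):+.2f} points")
```

The last two rows show that the full three-channel model is 0.86 points above the dual-branch image model (Model-3) and 1.29 points above the vibration-only model (Model-6).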

4.3. Bearing Data Under Simulated Conditions

The acquisition parameters and operating conditions for the simulated dataset used in this section are as follows. The sampling frequency is 20 kHz to ensure accurate capture of high-frequency fault-feature signals during bearing operation. A 6007 deep groove ball bearing was selected as the research subject, with a rotational speed of 1500 rpm to simulate typical medium-to-low-speed operating conditions in industrial applications. The dataset covers four bearing operating states (normal operation, inner-ring fault, outer-ring fault, and rolling element fault), constituting a four-class fault diagnosis task. Detailed information on the simulated dataset is presented in Table 11, while Figure 18 shows the physical layout of the test bench used for data acquisition.

On the simulated dataset, the proposed model maintains high classification accuracy and good generalization capability. Figure 19 shows the confusion matrix, and Table 12 lists the model’s core evaluation metrics.
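The four metrics reported in Table 6 and Table 12 can all be derived from a confusion matrix. The sketch below assumes macro averaging (an unweighted mean over classes, which is one reading of "Avg" in the table headers) and the convention that rows are true labels and columns are predictions; the four-class matrix is a made-up example, not data from the paper.

```python
import numpy as np


def macro_metrics(cm):
    """Accuracy and macro-averaged recall, precision, and F1 from a confusion matrix.

    Convention assumed here: cm[i, j] counts samples of true class i predicted as j.
    """
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    true_totals = cm.sum(axis=1)   # samples per true class
    pred_totals = cm.sum(axis=0)   # samples per predicted class
    recall = np.divide(tp, true_totals, out=np.zeros_like(tp), where=true_totals > 0)
    precision = np.divide(tp, pred_totals, out=np.zeros_like(tp), where=pred_totals > 0)
    denom = precision + recall
    f1 = np.divide(2.0 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return {
        "accuracy": tp.sum() / cm.sum(),
        "recall": recall.mean(),
        "precision": precision.mean(),
        "f1": f1.mean(),
    }


# Hypothetical four-class result: one inner-ring sample is predicted as a ball fault.
toy_cm = [
    [24, 0, 0, 0],
    [0, 23, 1, 0],
    [0, 0, 24, 0],
    [0, 0, 0, 24],
]
for name, value in macro_metrics(toy_cm).items():
    print(f"{name}: {value:.4f}")
```

Because macro averaging weights every class equally, a single error lowers recall for the true class and precision for the predicted class, which is why the four columns of Table 12 differ slightly from one another.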

4.4. Model Parameters and Timeliness

In industrial deployment scenarios, the parameter efficiency and real-time response capability of a fault diagnosis model determine its engineering applicability, directly affecting the timeliness of diagnosis and the feasibility of hardware deployment. To assess the engineering potential of the proposed model, the CWRU bearing dataset was used for testing. The dataset comprises 2330 valid samples, each spanning 1024 points. Timeliness was assessed by measuring the runtime of a single epoch. The parameter counts are shown in Table 13 and Table 14. The overall parameter count of the proposed model (about 1.73 million) is moderate, without excessive redundancy, which provides the basic conditions for lightweight deployment. As shown in Table 15, the average runtime per epoch is 6.8 s, which is mainly due to the high computational complexity of the window attention block. Nevertheless, a 6.8 s runtime remains practical. On one hand, the fault evolution process of rolling bearings typically unfolds over longer time scales, so a 6.8 s delay still allows a rapid response to fault warnings and supports subsequent maintenance decisions. On the other hand, in industrial settings, runtime can be further reduced through hardware upgrades or algorithm optimization. Therefore, this model retains significant practical value for industrial deployment.
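As a quick consistency check, the per-module counts in Table 13 and Table 14 can be summed and compared with the reported total, and the per-epoch time can be recovered from the 50-epoch figure in Table 15. The numbers below are copied from those tables; the memory estimate assumes 32-bit floating-point weights, which the source does not state.

```python
# Parameter counts copied from Table 13 (image branch).
image_branch = {
    "CNN0": 9_408,
    "Res-Block1": 73_984, "Res-Block2": 73_984,
    "Res-Block3": 73_984, "Res-Block4": 73_984,
    "(S)W-Attention": 50_033,
    "CBAM": 678,
    "Res-Block5": 205_056, "Res-Block6": 205_056,
    "Res-Block7": 205_056, "Res-Block8": 205_056,
}

# Parameter counts copied from Table 14 (vibration signal branch).
vibration_branch = {
    "CNN0": 9_408,
    "Res-Block1": 24_832, "Res-Block2": 24_832,
    "Res-Block3": 24_832, "Res-Block4": 24_832,
    "BiGRU": 445_440,
}

total_params = sum(image_branch.values()) + sum(vibration_branch.values())
print(f"total parameters: {total_params:,}")

# Assumption: weights stored as float32 (4 bytes each).
print(f"approximate weight size: {total_params * 4 / 1e6:.2f} MB")

# Table 15: 340.42 s for 50 epochs.
seconds_per_epoch = 340.42 / 50
print(f"time per epoch: {seconds_per_epoch:.2f} s")
```

The sum reproduces the 1,730,455 parameters given in Table 14, and the 50-epoch runtime averages to about 6.8 s per epoch, matching the single-epoch figure.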

4.5. Model Interpretability

Model interpretability is crucial as it helps us understand how models make decisions, enhances their credibility, and facilitates targeted improvements. To explain how CNNs extract fault features, we visualized the weight maps of the five convolutional layers in the vibration signal feature extraction pipeline. These five layers comprise the preprocessing convolutional layer and the final convolutional layers within the four residual blocks. As shown in Figure 20, darker colors indicate larger weights. The figure reveals that the first three convolutional layers assign greater weights to the fault location, demonstrating that the model has learned fault patterns and effectively identifies fault features. The subsequent two convolutional layers incorporate global features, fully validating the model’s feature extraction capability and global temporal modeling.

Understanding how attention mechanisms enable models to capture fault features accurately is a core issue in bearing fault diagnosis. To visually validate the working principles and model interpretability of channel and spatial attention mechanisms, this paper uses feature response heatmaps for both mechanisms. The bar chart in Figure 21 reveals significant variations in attention weights across feature channels. Channel importance distribution differs across distinct model modules, reflecting the model’s dynamic adjustment of attention levels to each channel based on task requirements during feature processing. This confirms the core capability of channel attention: the model automatically learns and highlights channels more relevant to the task. This mechanism enhances fault-feature extraction accuracy and diagnostic reliability through adaptive feature-channel filtering. The spatial attention heatmap in Figure 22 indicates that the model automatically assigns high weights to regions in the Continuous Wavelet Transform (CWT) feature map that are directly related to faults. This demonstrates that spatial attention precisely localizes and amplifies fault-specific features in the spatial domain, thereby improving the distinguishability between fault and non-fault features, highlighting its value within the model.
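The channel weights visualized in Figure 21 come from a channel attention module. The sketch below follows the general CBAM formulation (average- and max-pooled descriptors passed through a shared two-layer MLP, summed, and squashed by a sigmoid); it is not the authors' code. The reduction ratio of 16 and the random weights are assumptions. With biased MLP layers this configuration has 580 parameters, which together with a bias-free 7 × 7 two-channel spatial convolution (98 weights) would be consistent with the 678 parameters listed for CBAM in Table 13, although the source does not state the configuration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(x, w1, b1, w2, b2):
    """CBAM-style channel attention for a feature map x of shape [B, C, H, W]."""
    avg_desc = x.mean(axis=(2, 3))  # [B, C] global average pooling
    max_desc = x.max(axis=(2, 3))   # [B, C] global max pooling

    def shared_mlp(v):
        # Bottleneck MLP shared by both descriptors: C -> C/r -> C with ReLU.
        return np.maximum(v @ w1 + b1, 0.0) @ w2 + b2

    weights = sigmoid(shared_mlp(avg_desc) + shared_mlp(max_desc))  # [B, C]
    return x * weights[:, :, None, None], weights


rng = np.random.default_rng(0)
B, C, H, W, r = 2, 64, 32, 32, 16  # r is an assumed reduction ratio
x = rng.standard_normal((B, C, H, W))
w1 = rng.standard_normal((C, C // r)) * 0.1
b1 = np.zeros(C // r)
w2 = rng.standard_normal((C // r, C)) * 0.1
b2 = np.zeros(C)

refined, weights = channel_attention(x, w1, b1, w2, b2)
n_params = w1.size + b1.size + w2.size + b2.size
print(refined.shape, weights.shape, n_params)
```

Each entry of `weights` lies between 0 and 1 and rescales one channel, which is the quantity a bar chart such as Figure 21 would display per channel.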

5. Conclusions

This paper addresses the challenges of difficult feature extraction and limited generalization in high-noise environments by proposing a three-channel M-CNNBiAM fault diagnosis method that integrates image and vibration signals. Noisy signals are first denoised by the CAE. The denoised signals are then converted into frequency-domain images via CWT, constructing joint time–frequency data. The frequency-domain images are fed simultaneously into a dual-branch residual CNN to suppress residual noise, followed by a collaborative attention mechanism that forces the model to focus on fault locations. The features of the two branches are then concatenated along the channel dimension to obtain frequency-domain features. In parallel, the denoised vibration signal is fed directly into a five-layer residual CNN combined with a BiGRU network, which extracts long-term dynamic features from the vibration signal and outputs temporal features. Finally, the time-domain and frequency-domain features are fused along the channel dimension and fed into a classification layer for fault diagnosis. Experimental results demonstrate that under SNR = −10 dB conditions, the MSE of the CAE-denoised signal is only 0.03 higher than that of the original signal, which addresses the challenge of extracting fault features under intense noise. The large-kernel image branch provides a larger receptive field, while the small-kernel branch preserves more details. At SNR = −4 dB, the dual-branch design improves diagnostic accuracy by 7.72 and 2.57 percentage points compared to the single small-kernel and single large-kernel branches, respectively. Adding the vibration signal branch raises accuracy to 99.57%, which is 0.86 percentage points above the dual-branch image model (98.71%) and 1.29 percentage points above the vibration-only model (98.28%), validating the complementary advantages of time–frequency multimodal information fusion within the three-channel architecture.
Moreover, at SNR = −4 dB, the model achieves 94.42% accuracy without the CAE denoising module, relying only on its core architecture. These results demonstrate the advantages of the proposed method. In practical industrial settings, acquiring high-quality labeled data is challenging, so future research will focus on fault diagnosis under few-label conditions.

Author Contributions

Conceptualization, Y.Z. (Yingyong Zou) and C.L.; methodology, Y.Z. (Yingyong Zou), C.L. and Y.Z. (Yu Zhang); software, C.L.; validation, C.L., Y.Z. (Yu Zhang) and Z.S.; formal analysis, C.L.; investigation, C.L.; resources, C.L.; data curation, L.L. and Z.S.; writing—original draft preparation, C.L.; writing—review and editing, C.L., Z.S., Y.Z. (Yu Zhang) and L.L.; visualization, Z.S.; supervision, Y.Z. (Yingyong Zou); project administration, Y.Z. (Yingyong Zou); funding acquisition, Y.Z. (Yingyong Zou). All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Jilin Provincial Department of Science and Technology, grant number 20230101208JC.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Jin, L.; Wang, S.; Zhou, J.; Ding, B.; Chen, X. Fast Sparse Morphological Decomposition with Controllable Sparsity for High-Speed Bearing Fault Diagnosis. Mech. Syst. Signal Process. 2025, 226, 112330.
2. Wang, C.; Qi, H.; Hou, D.; Han, D.; Yang, J. Ensefgram: An Optimal Demodulation Band Selection Method for the Early Fault Diagnosis of High-Speed Train Bearings. Mech. Syst. Signal Process. 2024, 213, 111346.
3. Luan, X.; Xia, A.; Gao, X.; Zhang, Z.; Yang, J.; Sha, Y. Aviation Gas Turbine Engine Bearings Faults Diagnosis Method Based on Multi-Parameter Fusion Criterion Judgment and AO-PNN. Struct. Health Monit. 2025, 14759217251329080.
4. Li, X.; Chen, J.; Wang, J.; Wang, J.; Li, X.; Kan, Y. Research on Fault Diagnosis Method of Bearings in the Spindle System for CNC Machine Tools Based on DRSN-Transformer. IEEE Access 2024, 12, 74586–74595.
5. Mao, W.; Ding, L.; Tian, S.; Liang, X. Online Detection for Bearing Incipient Fault Based on Deep Transfer Learning. Measurement 2020, 152, 107278.
6. Xu, Y.; Zhen, D.; Gu, J.X.; Rabeyee, K.; Chu, F.; Gu, F.; Ball, A.D. Autocorrelated Envelopes for Early Fault Detection of Rolling Bearings. Mech. Syst. Signal Process. 2021, 146, 106990.
7. Luo, X.; Wang, H.; Han, T.; Zhang, Y. FFT-Trans: Enhancing Robustness in Mechanical Fault Diagnosis with Fourier Transform-Based Transformer under Noisy Conditions. IEEE Trans. Instrum. Meas. 2024, 73, 2515112.
8. Ding, P.; Xu, Y.; Qin, P.; Sun, X.-M. A Novel Deep Learning Approach for Intelligent Bearing Fault Diagnosis under Extremely Small Samples. Appl. Intell. 2024, 54, 5306–5316.
9. Yin, C.; Lee, H.P.; Ko, J.H.; Wang, Y. Intelligent Fault Diagnosis of Rolling Bearings in Strong Noise Environment: An Attention-Driven Hybrid Model Based on IENEMD and Parallel Multiscale CNN. Int. J. Precis. Eng. Manuf.-Green Technol. 2025, 12, 1091–1116.
10. Jiang, K.; Yang, Z.; Jin, T.; Chen, C.; Liu, Z.; Zhang, B. CNN-Based Rolling Bearing Fault Diagnosis Method with Quantifiable Interpretability. IEEE Trans. Instrum. Meas. 2025, 74, 1.
11. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778.
12. Wang, Z.; Xu, X.; Zhang, Y.; Wang, Z.; Li, Y.; Liu, Z.; Zhang, Y. A Bearing Fault Diagnosis Method Based on a Residual Network and a Gated Recurrent Unit under Time-Varying Working Conditions. Sensors 2023, 23, 6730.
13. Yin, S.; Chen, Z. Research on Compound Fault Diagnosis of Bearings Using an Improved DRSN-GRU Dual-Channel Model. IEEE Sens. J. 2024, 24, 35304–35311.
14. Man, J.; Dong, H.; Yang, X.; Meng, Z.; Jia, L.; Qin, Y.; Xin, G. GCG: Graph Convolutional Network and Gated Recurrent Unit Method for High-Speed Train Axle Temperature Forecasting. Mech. Syst. Signal Process. 2022, 163, 108102.
15. Chen, X.; Fan, F.; Zhou, K.; He, Z. Wheel-Bearing Fault Diagnosis of Trains Using Empirical Wavelet Transform. Measurement 2016, 82, 439–449.
16. Cui, H.; Qiao, Y.; Yin, Y.; Hong, M. An Investigation on Early Bearing Fault Diagnosis Based on Wavelet Transform and Sparse Component Analysis. Struct. Health Monit. 2016, 16, 39–49.
17. Cui, W.; Meng, G.; Gou, T.; Wang, A.; Xiao, R.; Zhang, X. Intelligent Rolling Bearing Fault Diagnosis Method Using Symmetrized Dot Pattern Images and CBAM-DRN. Sensors 2022, 22, 9954.
18. Qin, H.; Pan, J.; Li, J.; Huang, F. Fault Diagnosis Method of Rolling Bearing Based on CBAM_ResNet and ACON Activation Function. Appl. Sci. 2023, 13, 7593.
19. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021.
20. Zhou, K.; Lu, N.; Jiang, B.; Ye, Z. FEV-Swin: Multi-Source Heterogeneous Information Fusion under a Variant Swin Transformer Framework for Intelligent Cross-Domain Fault Diagnosis. Knowl. Based Syst. 2025, 310, 112982.
21. Liu, H.; Zhou, J.; Zheng, Y.; Jiang, W.; Zhang, Y. Fault diagnosis of rolling bearings with recurrent neural network-based autoencoders. ISA Trans. 2018, 77, 167–178.
22. Wen, L.; Su, S.; Li, X.; Ding, W.; Feng, K. GRU-AE-wiener: A generative adversarial network assisted hybrid gated recurrent unit with Wiener model for bearing remaining useful life estimation. Mech. Syst. Signal Process. 2024, 220, 111663.
23. Tao, L.; Liu, H.; Ning, G.; Cao, W.; Huang, B.; Lu, C. LLM-based framework for bearing fault diagnosis. Mech. Syst. Signal Process. 2024, 224, 112127.
24. Jiang, K.; Zhang, C.; Wei, B.; Li, Z.; Kochan, O. Fault diagnosis of RV reducer based on denoising time–frequency attention neural network. Expert Syst. Appl. 2024, 238, 121762.
25. Melluso, F.; Spirto, M.; Nicolella, A.; Malfi, P.; Tordela, C.; Cosenza, C.; Savino, S.; Niola, V. Torque fault signal extraction in hybrid electric powertrains through a wavelet-supported processing of residuals. Mech. Syst. Signal Process. 2026, 242, 113652.
26. Zhang, C.; Geng, Y.; Han, Z.; Liu, Y.; Fu, H.; Hu, Q. Autoencoder in Autoencoder Networks. In IEEE Transactions on Neural Networks and Learning Systems; IEEE: New York, NY, USA, 2022; pp. 1–13.
27. Qiao, M.; Yan, S.; Tang, X.; Xu, C. Deep Convolutional and LSTM Recurrent Neural Networks for Rolling Bearing Fault Diagnosis Under Strong Noises and Variable Loads. IEEE Access 2020, 8, 66257–66269.
28. Zhang, Y.; Zhao, X.; Peng, Z.; Xu, R.; Chen, P. WD-KANTF: An interpretable intelligent fault diagnosis framework for rotating machinery under noise environments and small sample conditions. Adv. Eng. Inform. 2025, 66, 103452.
29. He, C.; Shi, H.; Si, J.; Li, J. Physics-informed interpretable wavelet weight initialization and balanced dynamic adaptive threshold for intelligent fault diagnosis of rolling bearings. J. Manuf. Syst. 2023, 70, 579–592.
30. Zhang, X.; He, C.; Lu, Y.; Chen, B.; Zhu, L.; Zhang, L. Fault diagnosis for small samples based on attention mechanism. Measurement 2022, 187, 110242.

Figure 1. CNN structure.

Figure 2. GRU gating mechanism. * indicates independent weights that are not shared with the other gates.

Figure 3. Structure of the collaborative attention mechanism.

Figure 4. Wavelet time–frequency image for bearing fault diagnosis.

Figure 5. CAE structural diagram.

Figure 6. M-CNNBiAM model.

Figure 7. Frequency-domain signal comparison diagram.

Figure 8. Vibration signal comparison chart.

Figure 9. Comparison of Continuous Wavelet Transform.

Figure 10. Signal envelope comparison. To validate the noise reduction effectiveness of the autoencoder, we performed a visual analysis of the signal envelope spectrum. Panels (a–c) show the envelope spectra of the original fault signals; (d–f) show the envelope spectra after −6 dB noise is added, where the red-boxed regions mark noise that completely masks the signal’s fault frequencies; (g–i) show the envelope spectra after denoising by the autoencoder designed in this paper, where the fault features are restored and only faint noise remains compared with the original spectra. To quantify the difference between the denoised signal and the original signal, we use the MSE metric (as shown in Figure 11). The MSE of the original signal is 0.08, while the MSE of the denoised signal is 0.11, supporting the denoising capability of the proposed autoencoder.

Figure 11. MSE metric.

Figure 12. Confusion matrix.

Figure 13. t-SNE visualization.

Figure 14. Training visualization.

Figure 15. Accuracy of comparative tests.

Figure 16. The −4 dB training curve.

Figure 17. Confusion matrix. (a–f) correspond to elements a–f in Table 9 and Table 10, representing the corresponding model labels.

Figure 18. Bearing test bench.

Figure 19. Confusion matrix.

Figure 20. Visualization of convolution layer weights.

Figure 21. Channel attention weights.

Figure 22. Spatial attention weights.

Table 1. Module parameter list.

Module	Kernel Size	Stride	Input Channels	Output Channels
Frequency domain	—	—	—	—
CNN0	7 × 7	2	3	32
Res-Block1	7 × 7	1	64	64
Res-Block2	7 × 7	1	64	64
Res-Block3	7 × 7	1	64	64
Conv1-1	1 × 1	1	64	128
(S)W-Attention	4 × 4	1	64	64
Res-Block4	7 × 7	1	64	64
Res-Block5	5 × 5	1	64	64
Res-Block6	5 × 5	1	64	64
Res-Block7	5 × 5	1	64	64
Res-Block8	5 × 5	1	64	64
Conv1-1	1 × 1	1	64	128

Table 2. Module parameter list (time domain).

Module	Kernel Size	Stride	Input	Output
Time domain	—	—	—	—
Res-Block1	5-5	1	64	64
Res-Block2	5-5	1	64	64
Res-Block3	5-5	1	64	64
Res-Block4	5-5	1	64	64
BiGRU	Neuron	—	—	—
	[32, 64]	—	64	256

Table 3. Image feature extraction.

Module	Input	Output
CNN0	[32, 3, 128, 128]	[32, 32, 64, 64]
MAX-Pool	[32, 32, 64, 64]	[32, 64, 32, 32]
Branch 1	—	—
Res-Block1	[32, 64, 32, 32]	[32, 64, 32, 32]
Res-Block2	[32, 64, 32, 32]	[32, 64, 32, 32]
Res-Block3	[32, 64, 32, 32]	[32, 64, 32, 32]
(S)W-Block	[32, 64, 32, 32]	[32, 64, 32, 32]
CBAM	[32, 64, 32, 32]	[32, 64, 32, 32]
Res-Block4	[32, 64, 32, 32]	[32, 64, 32, 32]
Conv1-1	[32, 64, 32, 32]	[32, 128, 32, 32]
Branch 2	—	—
Res-Block5	[32, 64, 32, 32]	[32, 64, 32, 32]
Res-Block6	[32, 64, 32, 32]	[32, 64, 32, 32]
Res-Block7	[32, 64, 32, 32]	[32, 64, 32, 32]
Res-Block8	[32, 64, 32, 32]	[32, 64, 32, 32]
Conv1-1	[32, 64, 32, 32]	[32, 128, 32, 32]
Feature fusion (F1 + F2)	[32, 128, 32, 32] × 2	[32, 256, 32, 32]

Table 4. Vibration signal feature extraction.

Module	Input	Output
CNN0	[32, 1, 1024]	[32, 64, 512]
Res-Block1	[32, 64, 512]	[32, 64, 512]
Res-Block2	[32, 64, 512]	[32, 64, 512]
Res-Block3	[32, 64, 512]	[32, 64, 512]
Res-Block4	[32, 64, 512]	[32, 64, 512]
Permute	[32, 64, 512]	[32, 512, 64]
BiGRU	[32, 512, 64]	[32, 512, 256]
Linear projection	[32, 256]	[32, 128]
Linear projection	[32, 128, 1, 1]	[32, 128, 32, 32]
Feature fusion (F1 + F2 + F3)	[32, 128, 32, 32] × 3	[32, 384, 32, 32]
Feature project	[32, 384, 32, 32]	[32, 10]

Table 5. Introduction to the CWRU dataset.

RPM	State of Health	Fault Diameter	Sample Count	Tag
1797 rpm	Normal	None	233	0
	Inner	0.007	233	1
		0.014	233	2
		0.021	233	3
	Ball	0.007	233	4
		0.014	233	5
		0.021	233	6
	Outer	0.007	233	7
		0.014	233	8
		0.021	233	9

Table 6. Evaluation indicators.

SNR	Accuracy (Avg)	Recall (Avg)	F1-Score (Avg)	Precision (Avg)
−10 dB	98.71%	98.90%	98.84%	98.79%
−8 dB	98.71%	98.90%	98.89%	98.89%
−6 dB	99.14%	99.23%	99.20%	99.22%
−4 dB	99.57%	99.62%	99.63%	99.66%
−2 dB	99.57%	99.64%	99.63%	99.63%
0 dB	99.14%	99.19%	99.16%	99.18%
2 dB	99.14%	99.29%	99.21%	99.18%
None	100%	100%	100%	100%

Table 7. CWRU model comparison table.

Model		Accuracy (Avg)
CNN-LSTM	a	99.48%
KANTransformer	b	99.07%
EWSNet	c	83.88%
DCA-BiGRU	d	84.40%

Table 8. Diagnosis results under noise without the CAE module.

SNR	Accuracy	Recall	F1-Score	Precision
−4 dB	94.42%	94.83%	94.70%	94.63%
−2 dB	97.42%	97.66%	97.58%	97.64%
0 dB	97.51%	95.91%	95.77%	96.19%
2 dB	97.42%	97.48%	97.53%	97.76%

Table 9. Ablation experiment model.

Model		Description
Model-1	a	CNN + Windows–Attention + CBAM-7
Model-2	b	CNN + Windows–Attention + CBAM-5
Model-3	c	CNN + Windows–Attention + CBAM-[7, 5]
Model-4	d	CNN–CBAM-[7, 5]
Model-5	e	CNN + Windows–Attention-[7, 5]
Model-6	f	CNN–BiGRU
Proposed model	g	CNN + BiGRU–Windows–Attention + CBAM

Table 10. Ablation model evaluation metrics.

Model		Accuracy (Avg)	Recall (Avg)	F1-Score (Avg)	Precision (Avg)
Model-1	a	90.99%	90.88%	89.18%	91.48%
Model-2	b	96.14%	95.83%	95.35%	95.47%
Model-3	c	98.71%	98.36%	98.38%	98.43%
Model-4	d	93.13%	92.55%	91.89%	92.12%
Model-5	e	97.42%	97.03%	96.91%	96.98%
Model-6	f	98.28%	97.88%	97.79%	97.77%
Proposed Model	g	99.57%	99.62%	99.63%	99.66%

Table 11. Description of the simulated experimental dataset.

RPM	State of Health	Sample Count	Tag
1500 rpm	Normal	234	0
	Inner	234	1
	Ball	234	2
	Outer	234	3

Table 12. Evaluation indicators.

SNR	Accuracy (Avg)	Recall (Avg)	F1-Score (Avg)	Precision (Avg)
−10 dB	98.71%	98.87%	98.73%	98.62%
−8 dB	98.71%	98.87%	98.73%	98.62%
−6 dB	99.57%	99.68%	99.61%	99.55%
−4 dB	99.57%	99.68%	99.61%	99.55%
−2 dB	99.57%	99.52%	99.50%	99.50%
0 dB	100%	100%	100%	100%
2 dB	100%	100%	100%	100%
None	100%	100%	100%	100%

Table 13. Image signal branch parameters.

Module	Dimension/Kernel	Parameter
Image branch	—	—
CNN0	in = 3, out = 32, kernel = 7	9408
Res-Block1	in = 32, out = 64, kernel = 7	73,984
Res-Block2	in = 64, out = 64, kernel = 7	73,984
Res-Block3	in = 64, out = 64, kernel = 7	73,984
Res-Block4	in = 64, out = 32, kernel = 7	73,984
(S)W-Attention	in = 64, out = 64, window = 4	50,033
CBAM	in = 64, out = 64	678
Res-Block5	in = 64, out = 64, kernel = 5	205,056
Res-Block6	in = 64, out = 64, kernel = 5	205,056
Res-Block7	in = 64, out = 64, kernel = 5	205,056
Res-Block8	in = 64, out = 64, kernel = 5	205,056

Table 14. Vibration signal branch parameters.

Module	Dimension/Kernel	Parameter
Time-domain branch	—	—
CNN0	in = 1, out = 32, kernel = 5	9408
Res-Block1	in = 64, out = 64, kernel = 5	24,832
Res-Block2	in = 64, out = 64, kernel = 5	24,832
Res-Block3	in = 64, out = 64, kernel = 5	24,832
Res-Block4	in = 64, out = 64, kernel = 5	24,832
BiGRU	in = 64, out = 256, N = [32, 64]	445,440
Total number of parameters	—	1,730,455

Table 15. Model runtime.

Model/Epochs	Time/s
1	6.8
50	340.42

Share and Cite

MDPI and ACS Style

Zou, Y.; Li, C.; Zhang, Y.; Si, Z.; Li, L. A Multimodal Three-Channel Bearing Fault Diagnosis Method Based on CNN Fusion Attention Mechanism Under Strong Noise Conditions. Algorithms 2026, 19, 144. https://doi.org/10.3390/a19020144
