1. Introduction
Bearings are critical components in mechanical systems [1], including high-speed trains [2], aircraft [3], and CNC machine tools [4]. During operation, bearings often endure alternating loads and operate in complex environments, making them prone to failure. These failures frequently exhibit subtle early warning signs. If not detected promptly, they may lead to equipment shutdowns, structural damage, and in severe cases, major safety incidents. Consequently, early fault detection is critically important. Implementing condition monitoring and intelligent diagnostics for rolling bearings has become a key direction in ensuring stable equipment operation [5,6].
In bearing fault diagnosis, traditional experience-driven methods primarily analyze the statistical characteristics of time-domain vibration signals or convert the signals to the frequency domain using the Fast Fourier Transform (FFT) [7], after which specialists identify and judge the fault. This approach has significant limitations. On the one hand, diagnostic outcomes rely heavily on operator expertise and experience, leading to high subjectivity, limited generalization, and poor real-time performance, which makes it difficult to meet diagnostic demands in high-noise environments. On the other hand, traditional signal processing methods such as the FFT struggle with non-stationary and nonlinear signals and therefore often miss early, subtle fault characteristics. As a result, traditional experience-driven methods can no longer meet contemporary industrial demands, and bearing fault diagnosis methods that balance real-time performance and diagnostic accuracy in high-noise environments have become a research hotspot.
In recent years, advances in artificial intelligence have introduced new approaches to bearing fault diagnosis. Data-driven deep learning methods, with their adaptive feature extraction and nonlinear fitting capabilities, have emerged as a research hotspot [8]. Among these, fusion techniques combining time-domain and frequency-domain signals have garnered significant attention. Convolutional Neural Networks (CNNs) are widely applied because of their excellent feature extraction capabilities and their ability to accept both one-dimensional signals and two-dimensional images as input. For instance, Yin et al. proposed an improved fault diagnosis method combining integrated noise-reconstruction empirical mode decomposition (IENEMD) with a parallel multi-scale CNN to extract fault features under noisy conditions [9]. Jiang et al. proposed a highly interpretable CNN fault diagnosis method that converts the diagnosis problem into binary grayscale image analysis via singular-value decomposition, and introduced an Average Score Decrease (ASD) metric to enhance the interpretability of the CNN decision-making process through quantitative analysis [10]. Despite these robust feature extraction capabilities, CNNs still face limitations. Deep CNN models are prone to vanishing gradients. To mitigate this, He et al. introduced residual learning, in which shortcut connections between a block's input and output provide a fast path for gradients [11]. Building on this idea, Wang et al. proposed a fault diagnosis method combining GRU with Residual Networks (ResNet), achieving high diagnostic accuracy under time-varying operating conditions [12]. CNNs capture short-range patterns well but tend to lose long-range temporal information. Recurrent models with long-term memory, such as the Bidirectional Gated Recurrent Unit (BiGRU), offer a solution. As a gated recurrent neural network, BiGRU balances long-term memory and computational efficiency and is often integrated with other models for fault diagnosis tasks [13]. For instance, Jie Man et al. proposed a novel organizational format for shaft temperature data, structuring measurement points into a graph based on their spatial positions, and then employed a Graph Convolutional Network (GCN) and a GRU model to extract features and predict shaft temperature [14].
Single-modality approaches often struggle to extract complete fault features in noisy environments. Therefore, joint time–frequency methods are frequently employed for diagnostic tasks. The Continuous Wavelet Transform (CWT) maps a one-dimensional time-domain signal into a two-dimensional time–frequency matrix by convolving the signal with wavelet basis functions of adjustable scale, thereby describing the frequency distribution at different time points. It is often the preferred choice for obtaining time–frequency representations [15]. For instance, Cui et al. combined Sparse Representation (SR) and CWT theory to propose a novel shared-frequency approach for bearing fault diagnosis [16]. When an image is input to a CNN, each convolution only sees a local region whose size is determined by the convolutional kernel size. The model must therefore stack enough layers to capture global information. Increasing the number of layers deepens the network, potentially smoothing out local details in the original input image while also increasing computational complexity. Although deep networks can learn more abstract, higher-level features, there is no guarantee that these features suit the current task. Achieving comparable performance with fewer layers has therefore become a research hotspot, and attention mechanisms offer a solution. By assigning higher weights to critical fault-feature locations in the current feature map, attention lets the model focus on identifying fault patterns. For example, the Convolutional Block Attention Module (CBAM) is a lightweight attention mechanism composed of a channel attention module (CAM) and a spatial attention module (SAM) connected in series. This dual-attention mechanism guides the model to adaptively focus on critical feature regions related to bearing faults. By assigning higher weight coefficients to fault-sensitive information, it suppresses interference from irrelevant background noise and redundant features, thereby enhancing the model’s ability to detect subtle fault characteristics and improving the precision of feature extraction [17]. Qin et al. proposed a rolling bearing fault diagnosis method that integrates CBAM with ResNet. By embedding CBAM into the residual block structure of ResNet, the model adaptively focuses on critical fault features, which enhances its feature extraction capability and fault identification specificity [18]. Attention is usually applied after feature extraction is completed at each layer. However, calculating attention weights across the entire feature map is computationally intensive, especially for larger images, whereas attention restricted to a local window cannot access global information. How to compute attention weights within the current window while still capturing global information has therefore attracted widespread attention. The Swin-Transformer addresses this challenge through cross-window connections. Liu et al. proposed the Swin-Transformer framework, which employs Window-based Multi-head Self-Attention (W-MSA) and Shifted Window-based Multi-head Self-Attention (SW-MSA) to enable cross-window connections while performing computations within the current window, thereby reducing resource consumption [19]. For instance, Zhou et al. designed a novel multi-source information fusion network (FEV-Swin) based on the Swin-Transformer framework. This network fuses and diagnoses multi-source fault information using an embedded feature pyramid fusion module and a domain adaptation module [20].
To address the issues of insufficient single-modal feature extraction and inefficient multimodal information fusion under strong noise conditions, this paper proposes a three-channel multimodal fault diagnosis method that integrates a Convolutional Auto-Encoder (CAE) with a dual attention mechanism (M-CNNBiAM). A dedicated CAE denoising module is designed to recover periodic fault features from signals contaminated by intense noise. By integrating the CWT time–frequency transform with attention mechanisms, a multi-channel joint diagnostic model is constructed to fuse multidimensional features efficiently. This ultimately enables precise, rapid diagnosis of bearing faults in noisy environments. The main contributions of this paper are as follows:
- (1)
CNN Receptive Field Constraints: Incorporating SW-MSA and W-MSA into CNN layers restricts attention calculations to the current window while enabling cross-window connections to integrate global information.
- (2)
Feature Enhancement: Introducing CBAM after the window mechanism module compels the model to focus more intently on the fault location, thereby enhancing its ability to extract fault features.
- (3)
Multimodal Fusion: Integrating fault features extracted from two-dimensional frequency-domain images and one-dimensional time-series signals to leverage the complementary nature of these two data types fully. Dual-channel data fusion enables the model to simultaneously leverage both temporal and spectral information, thereby significantly improving fault classification accuracy.
- (4)
Denoising: A CAE is employed to denoise the signals. Experimental results demonstrate that, in high-noise environments, the CAE exhibits both strong noise reduction capability and high noise reduction efficiency.
This paper is organized as follows: Section 2 introduces the fundamental theory; Section 3 elaborates on the proposed model; Section 4 conducts experiments on the CWRU and self-test datasets; Section 5 summarizes the work and outlines future research directions.
2. Theoretical Basis
2.1. Convolutional Neural Networks
CNNs feature local connectivity, parameter sharing, and translation invariance. The same set of convolutional parameters is shared throughout the entire feature extraction process. This parameter-sharing mechanism significantly reduces the network’s parameter count, effectively alleviating computational complexity during model training and enhancing generalization [21].
Figure 1 illustrates the structure of a CNN, where the convolution operation can be expressed as

$$x_{i}^{l} = f\left(W^{l} * x_{i}^{l-1} + b^{l}\right) \quad (1)$$

where $x_{i}^{l}$ represents the feature output of layer $l$; $x_{i}^{l-1}$ represents the i-th convolutional region of the (l−1)-th layer’s feature map; $W^{l}$ and $b^{l}$ are the parameters to be optimized during model training, namely the weight and bias matrices for feature extraction; and $f(\cdot)$ represents the activation function, which introduces nonlinearity to enhance the model’s ability to fit complex features.
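The convolution described above can be illustrated with a minimal NumPy sketch. It assumes a one-dimensional input, no padding, and ReLU as the activation; the function name `conv1d_relu` and the toy weights are hypothetical and are not taken from the paper's implementation.

```python
import numpy as np

def conv1d_relu(x, weights, bias, stride=1):
    """Valid 1D convolution (cross-correlation form) followed by ReLU.

    x:       (in_channels, length) input feature map
    weights: (out_channels, in_channels, kernel_size), shared by all positions
    bias:    (out_channels,)
    """
    out_channels, _, k = weights.shape
    n_out = (x.shape[1] - k) // stride + 1
    y = np.empty((out_channels, n_out))
    for i in range(n_out):
        # i-th convolutional region of the previous layer's feature map
        region = x[:, i * stride : i * stride + k]
        # The same weights are reused for every region (parameter sharing)
        y[:, i] = np.tensordot(weights, region, axes=([1, 2], [0, 1])) + bias
    return np.maximum(y, 0.0)  # ReLU activation

x = np.array([[1.0, 2.0, 3.0, 4.0, 5.0]])   # one channel, five samples
w = np.array([[[-1.0, 0.0, 1.0]]])          # one output channel, kernel size 3
b = np.array([0.5])
out = conv1d_relu(x, w, b)
```

Because the three kernel weights are reused at every position, the layer has four learnable parameters regardless of the input length, which is the parameter-sharing property noted above.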
2.2. Gated Recurrent Unit (GRU)
Gated Recurrent Units (GRUs) are a core variant of recurrent neural networks (RNNs), achieving long-term memory of temporal signals solely through update gates and reset gates.
Figure 2 illustrates the gating mechanism, with the GRU gating principle proceeding as follows [22]:
- (1)
Calculate the update gate

$$z_{t} = \sigma\left(W_{z} x_{t} + U_{z} h_{t-1} + b_{z}\right) \quad (2)$$

where $\sigma$ represents the sigmoid function, with an output range of [0, 1], used for gated “switch” control; $W_{z}$ and $U_{z}$ represent the parameter matrices, where $W_{z}$ corresponds to the weights of the input $x_{t}$ and $U_{z}$ corresponds to the weights of the hidden state $h_{t-1}$; $b_{z}$ represents the bias vector, used for adjusting the offset in the linear transformation.
- (2)
Calculate the reset gate

$$r_{t} = \sigma\left(W_{r} x_{t} + U_{r} h_{t-1} + b_{r}\right) \quad (3)$$

- (3)
Calculate the candidate hidden state

$$\tilde{h}_{t} = \tanh\left(W_{h} x_{t} + U_{h}\left(r_{t} \odot h_{t-1}\right) + b_{h}\right) \quad (4)$$

where $\tanh$ represents the hyperbolic tangent activation function, with an output range of [−1, 1], used for feature mapping of candidate states; ⊙ denotes element-wise multiplication, enabling per-element weight adjustment; $\tilde{h}_{t}$ represents the newly generated candidate state value based on the current input and the reset historical information.
- (4)
Update the hidden state

$$h_{t} = \left(1 - z_{t}\right) \odot h_{t-1} + z_{t} \odot \tilde{h}_{t} \quad (5)$$

where $h_{t}$ represents the integration of historical information and the current candidate state, outputting the temporal feature representation for the current time step.
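The four gating steps can be sketched as a single GRU time step in NumPy. This is an illustrative sketch only: the parameter names (`W_*`, `U_*`, `b_*`), the random initialization, and the convention that the update gate weights the candidate state are assumptions, and some libraries swap the roles of the gate and its complement.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, p):
    """One GRU step: update gate, reset gate, candidate state, new hidden state."""
    z = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev + p["b_z"])            # update gate
    r = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev + p["b_r"])            # reset gate
    h_cand = np.tanh(p["W_h"] @ x_t + p["U_h"] @ (r * h_prev) + p["b_h"]) # candidate state
    # Convex combination of the previous state and the candidate state
    return (1.0 - z) * h_prev + z * h_cand

rng = np.random.default_rng(0)
n_in, n_hid = 4, 3                      # hypothetical sizes for illustration
params = {}
for gate in "zrh":
    params[f"W_{gate}"] = rng.normal(scale=0.5, size=(n_hid, n_in))
    params[f"U_{gate}"] = rng.normal(scale=0.5, size=(n_hid, n_hid))
    params[f"b_{gate}"] = np.zeros(n_hid)

h = np.zeros(n_hid)
for x_t in rng.normal(size=(10, n_in)):  # a toy sequence of 10 time steps
    h = gru_step(x_t, h, params)
```

A bidirectional GRU runs one such recurrence forward and another backward over the sequence and concatenates the two hidden states.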
2.3. Collaborative Attention Mechanism
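Figure 3 illustrates the structure of the collaborative attention mechanism. In Figure 3, the input feature map is first normalized, and attention scores are computed within the current window (W-MSA). After further normalization, the features pass through an MLP and then through the shifted-window attention (SW-MSA), which obtains global information via cross-window connections [23]. The output features, now enriched with global information, are fed into the CBAM. Here, spatial-dimensional feature enhancement strengthens global feature interactions, followed by channel-dimensional enhancement that suppresses redundant channels, culminating in the final feature output [24]. The overall feature output is represented by Equation (6).

$$
\begin{aligned}
\hat{z}^{l} &= \text{W-MSA}\left(\text{LN}\left(z^{l-1}\right)\right) + z^{l-1} \\
z^{l} &= \text{MLP}\left(\text{LN}\left(\hat{z}^{l}\right)\right) + \hat{z}^{l} \\
\hat{z}^{l+1} &= \text{SW-MSA}\left(\text{LN}\left(z^{l}\right)\right) + z^{l} \\
z^{l+1} &= \text{MLP}\left(\text{LN}\left(\hat{z}^{l+1}\right)\right) + \hat{z}^{l+1} \\
F_{s} &= \sigma\left(\text{Conv}\left(\left[\text{AvgPool}(F);\ \text{MaxPool}(F)\right]\right)\right) \otimes F \\
M_{c} &= \sigma\left(\text{MLP}\left(\text{AvgPool}(F_{s})\right) + \text{MLP}\left(\text{MaxPool}(F_{s})\right)\right) \\
F_{c} &= M_{c} \otimes F_{s}
\end{aligned} \quad (6)
$$

where $\hat{z}^{l}$ represents the W-MSA output feature; $z^{l}$ represents the MLP output feature; $\hat{z}^{l+1}$ represents the SW-MSA output feature; $z^{l+1}$ represents the output feature of the second MLP; LN denotes layer normalization; $M_{c}$ represents the channel attention weights; $F = z^{l+1}$ represents the input feature map of the CBAM; $\sigma$ denotes the Sigmoid function; AvgPool and MaxPool denote mean pooling and max pooling, respectively; $F_{s}$ and $F_{c}$ represent the enhanced spatial and channel feature maps; ⊗ denotes element-wise multiplication.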
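To make the difference between regular and shifted windows concrete, the following NumPy sketch partitions a small feature map into non-overlapping windows and then repeats the partition after a cyclic shift of half a window. The function names and the 8 × 8 toy input are hypothetical; the attention computation itself and the masking that the Swin-Transformer applies to wrapped-around regions are omitted.

```python
import numpy as np

def window_partition(x, ws):
    """Split an (H, W, C) feature map into non-overlapping (ws, ws, C) windows."""
    H, W, C = x.shape
    x = x.reshape(H // ws, ws, W // ws, ws, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, ws, ws, C)

def shifted_window_partition(x, ws):
    """Cyclically shift the map by ws // 2 in both directions, then partition.

    Each shifted window straddles up to four of the regular windows, which is
    what lets information flow across window borders in the next attention step.
    """
    shift = ws // 2
    shifted = np.roll(x, shift=(-shift, -shift), axis=(0, 1))
    return window_partition(shifted, ws)

feat = np.arange(8 * 8).reshape(8, 8, 1)   # toy feature map, values = flat index
regular = window_partition(feat, 4)        # windows used by W-MSA
shifted = shifted_window_partition(feat, 4)  # windows used by SW-MSA
```

In this toy case the first regular window covers rows 0 to 3 and columns 0 to 3, whereas the first shifted window covers rows 2 to 5 and columns 2 to 5, so it overlaps all four regular windows.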
2.4. Wavelet Transform
CWT [25] calculates the similarity between the original signal and wavelet basis functions at different scales and time-domain positions by adjusting the scale factor and shift amount, thereby constructing a two-dimensional time–frequency feature matrix. Through the synergistic interaction between the scale factor and shift amount, CWT provides a joint representation of the signal’s local time-domain features and frequency-domain distribution patterns, enabling adaptive analysis of non-stationary signals. The principle and steps of the Continuous Wavelet Transform are as follows:
- (1)
Wavelet basis function construction

$$\psi_{a,b}(t) = \frac{1}{\sqrt{a}}\, \psi\left(\frac{t-b}{a}\right) \quad (7)$$

In the formula, $a$ denotes the scaling factor, which controls the degree of stretching of the wavelet basis; $\psi_{a,b}(t)$ represents the transformed wavelet basis function; $b$ denotes the translation factor, which controls the position of the wavelet basis in the time domain.
- (2)
Calculate the wavelet coefficients

$$W(a,b) = \int_{-\infty}^{+\infty} x(t)\, \psi_{a,b}^{*}(t)\, dt \quad (8)$$

where $x(t)$ is the original signal and $^{*}$ denotes complex conjugation.
- (3)
Construction of time–frequency images
The calculated wavelet coefficients are arranged into a two-dimensional matrix according to scale factors and translation factors to construct the time–frequency image (Figure 4).
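The three steps above can be sketched directly in NumPy. The sketch assumes a complex Morlet wavelet of the form exp(j2πCt)·exp(−t²/B)/√(πB), scales expressed in samples, and a truncated wavelet support; the default bandwidth B = 1.5 and the test signal are illustrative choices, not the paper's settings.

```python
import numpy as np

def cmor(t, bandwidth=1.5, center=1.0):
    """Complex Morlet mother wavelet (assumed form, with bandwidth B and center frequency C)."""
    return (np.exp(2j * np.pi * center * t) * np.exp(-t**2 / bandwidth)
            / np.sqrt(np.pi * bandwidth))

def cwt(x, scales, bandwidth=1.5, center=1.0):
    """Wavelet coefficients W(a, b) for each scale a (in samples) and shift b."""
    x = np.asarray(x, dtype=float)
    coeffs = np.empty((len(scales), len(x)), dtype=complex)
    for i, a in enumerate(scales):
        # Step 1: stretched wavelet, truncated where the Gaussian envelope is negligible
        half = int(np.ceil(5 * np.sqrt(bandwidth / 2) * a))
        k = np.arange(-half, half + 1)
        psi = cmor(k / a, bandwidth, center) / np.sqrt(a)
        # Step 2: correlate the signal with the conjugate wavelet at every shift b
        kernel = np.conj(psi[::-1])
        full = np.convolve(x, kernel, mode="full")
        coeffs[i] = full[half : half + len(x)]
    return coeffs

fs = 1000                                    # hypothetical sampling rate in Hz
t = np.arange(fs) / fs
signal = np.sin(2 * np.pi * 50 * t)          # 50 Hz test tone
scales = np.array([5, 10, 20, 40, 80])       # pseudo-frequency = center * fs / scale
coeffs = cwt(signal, scales)
# Step 3: the magnitude matrix (scales x time) is the time-frequency image
tf_image = np.abs(coeffs)
best_scale = scales[np.argmax(tf_image.mean(axis=1))]
```

For the 50 Hz tone, the largest average magnitude falls on the scale whose pseudo-frequency is 50 Hz (scale 20 at this sampling rate), which is the band-like structure the time–frequency image is meant to expose.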
2.5. Convolutional Auto-Encoder
CAE compresses the input signal through an encoder that maps it into a low-dimensional space. The decoder reconstructs the signal from this low-dimensional space back to its original dimension, aiming to keep the reconstructed signal consistent with the original [26]. Since noise is disordered and distributed across the entire frequency band, while fault signals typically follow periodic impact patterns, CAE is well-suited to noise reduction tasks in bearing vibration signals. The CAE architecture is shown in Figure 5. The noisy signal first enters the encoder, where three convolutional layers in sequence compress it into a low-dimensional space, followed by a flattening operation and a linear layer for dimensional adjustment. The decoder uses three layers of transposed convolutions to restore the signal from the low-dimensional space to its input dimensionality, achieving the noise reduction objective. The following equations describe the signal’s encoding and decoding.

$$z = f_{\text{enc}}\left(\tilde{x}\right)$$

$$\hat{x} = f_{\text{dec}}\left(z\right)$$

where $\tilde{x}$ is the noisy input signal, $z$ is its low-dimensional representation, $\hat{x}$ is the reconstructed (denoised) signal, and $f_{\text{enc}}$ and $f_{\text{dec}}$ denote the encoder and decoder mappings.
3. M-CNNBiAM Model
This section introduces the M-CNNBiAM model, providing its specific parameters, structural diagram, data dimension transformation table, and model diagnostic workflow.
The M-CNNBiAM architecture is shown in Figure 6, with the process divided into four steps: 1. Noise addition; 2. Construction of time–frequency data; 3. Feature extraction; 4. Fault classification. Table 1 and Table 2 list the module parameters, Table 3 gives the data dimensions for image feature extraction, and Table 4 gives those for vibration signal feature extraction.
- 1.
Noise Addition
This paper adds Gaussian noise at SNR values of [−10, −8, −6, −4, −2, 0, 2] dB to the original vibration signal, plus a noise-free case (None), and verifies the accuracy of the noise addition.
- 2.
Constructing Time–Frequency Data
The noisy signal first enters the CAE for denoising, as shown in Figure 6. The denoised signal is converted to a time–frequency image via CWT using the complex Morlet wavelet with a bandwidth parameter of 100 and a center frequency parameter of 1. The CWT image and the vibration signal together form a multimodal dataset.
- 3.
Feature Extraction
The data first undergoes channel expansion via a convolutional layer with kernel size and stride of 2, followed by image size reduction to reduce computational complexity. The preprocessed signal enters a dual-parallel image branch. Data entering image branch 1 is sequentially passed through three layers of a CNN (ResNetBlock) with kernel size and stride of 1. This is followed by a window attention module and CBAM, with the output features re-entering ResNetBlock to generate final features. Branch 2 shares the same architecture as branch 1, except its first three CNN layers use a kernel size of , while all other parameters remain identical. Finally, features are concatenated to produce frequency-domain modal features. The vibration signal first passes through a convolutional layer with a kernel size and a stride of 2 for dimensionality reduction and channel expansion. It then sequentially passes through a convolutional layer with a kernel size , a ResNetBlock, and finally enters a BiGRU with [32, 64] neurons to output the time-domain modal feature. All CNN layers are followed by batch normalization (BN), a ReLU activation function, and a dropout layer. Finally, the time-domain feature (F3) and frequency-domain features (F1, F2) are concatenated along the channel dimension to obtain the multimodal feature.
- 4.
Fault Classification
Multimodal features are concatenated along the channel dimension before entering the classification layer, enabling fault diagnosis.
In Figure 6, the noisy signal first passes through three convolutional layers with kernel sizes [7, 5, 3]. It is then flattened and passed through a linear layer to complete the encoder output. Subsequently, the encoded signal is reshaped before entering the decoder, which processes it through three layers of transposed convolutions with kernel sizes [3, 5, 7] and outputs the denoised signal. The main hyperparameter settings for CAE are as follows: iteration count: 320; loss function: MSE loss; learning rate: 0.0001.
Figure 6 shows that the wavelet images are fed in parallel into the preprocessing CNN, which adjusts the input dimensions for subsequent CNN processing. They then pass through three ResNetBlocks. The first branch employs large kernels with expanded receptive fields, enabling coverage of broader image regions. This configuration excels at capturing global structural features while offering superior suppression of large-scale noise. Residual connections prevent gradient vanishing. The features then enter the Swin-Transformer block. W-MSA restricts computations to the current window, reducing computational load, while SW-MSA enables cross-window connections to capture global information. Subsequently, the features are processed by CBAM to further focus on the fault location before entering the final ResNetBlock for dimension adjustment and feature output. The small-kernel branch has a smaller receptive field and excels at capturing local detail features while preserving finer information. Its heightened sensitivity to small-scale noise facilitates precise filtering. This combination covers multi-scale features from local to global, avoiding the one-sided feature capture inherent in single-scale convolutional kernels. The time-domain branch employs five ResNetBlocks with stride control. The first three layers suppress noise while capturing long-term dependencies, and the latter two layers extract global features, covering the full spectrum from macro trends to micro-transients. Each convolutional layer utilizes residual connections. The temporal features are finally fed into a BiGRU to prevent forgetting over long sequences. The gating design of BiGRU inherently provides noise resistance, further mitigating residual noise. The three-channel features are concatenated along the channel dimension before being fed into the classification layer.
To mitigate overfitting in complex models, a Dropout2d layer with p = 0.1 is applied after each convolutional layer in the image branch. Dropout1d with p = 0.1 is applied after the five convolutional layers in the time-domain branch, and Dropout1d with p = 0.2 is applied after the BiGRU. The final fusion layer employs Dropout1d with p = 0.5.
4. Experiments and Results
This section validates the proposed model’s generalization capability and diagnostic accuracy through two sets of comparative experiments. Testing was conducted using both the Case Western Reserve University (CWRU) bearing failure benchmark dataset and a simulated operating condition dataset, ensuring coverage of both real-world failure scenarios and controlled simulations to enhance the reliability of the conclusions. To comprehensively evaluate model performance across multiple dimensions while quantifying the impact of hyperparameters on diagnostic effectiveness, the following core evaluation metrics were selected:
First, we employed accuracy, precision, recall, and F1-score. Accuracy reflects the model’s overall classification accuracy; precision measures the reliability of predictions; recall indicates the ability to identify target faults and reduce missed detections; and the F1-score, as the harmonic mean of the two, balances comprehensive classification performance in scenarios with imbalanced samples. Additionally, we included runtime and parameter count metrics. Runtime is the total time for one training cycle and reflects training efficiency. Parameter count represents the total number of learnable parameters in the model, measuring its complexity and storage overhead. Finally, a confusion matrix visually presents the classification results. Rows correspond to the actual fault categories of samples, columns represent the model’s predicted categories, and element values indicate the number of classifications for each category. This clearly identifies the correct classification status and misclassification types for various faults, providing a clear direction for model optimization.
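As an illustration of how these metrics follow from a confusion matrix with rows as actual classes and columns as predicted classes, the sketch below computes accuracy and macro-averaged precision, recall, and F1-score. Macro averaging is an assumption made here for illustration, and the 3 × 3 matrix is made-up example data, not a result from this paper.

```python
import numpy as np

def classification_metrics(cm):
    """Accuracy and macro-averaged precision/recall/F1 from a confusion matrix.

    cm[i, j] = number of samples of actual class i predicted as class j.
    """
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    predicted = cm.sum(axis=0)   # column sums: samples predicted as each class
    actual = cm.sum(axis=1)      # row sums: samples that truly belong to each class
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return {
        "accuracy": tp.sum() / cm.sum(),
        "precision": precision.mean(),
        "recall": recall.mean(),
        "f1": f1.mean(),
    }

# Made-up example: 3 classes, 50 samples each
cm = [[48, 2, 0],
      [1, 49, 0],
      [0, 3, 47]]
metrics = classification_metrics(cm)
```

Off-diagonal entries show which fault types are confused with each other; here, for example, three samples of the third class are predicted as the second.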
The hyperparameters for this model are set as follows: learning rate of 0.0001; AdamW optimizer; weight decay of $1 \times 10^{-4}$; training for 50 epochs.
The experimental platform configuration is as follows: CPU: Intel Core i5-13450HX, RAM: 16 GB, GPU: NVIDIA GeForce RTX 4060. The PyTorch deep learning framework (Python version 3.11.7) was employed for model training and experimental validation.
4.1. Noise Addition and Noise Reduction Effect Verification
To better simulate the high noise levels present in real-world working environments, Gaussian noise is added to the original signal. The signal-to-noise ratio (SNR) can be expressed as

$$\mathrm{SNR} = 10 \log_{10}\left(\frac{P_{s}}{P_{n}}\right)$$

where $P_{s}$ represents the power of the original signal and $P_{n}$ represents the power of the added noise.
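A minimal sketch of noise addition at a target SNR is given below, assuming the SNR is defined in decibels as ten times the base-10 logarithm of the ratio of signal power to noise power. The function names and the sinusoidal test signal are hypothetical; the measured SNR is recomputed from the generated noise as a check on the noise addition.

```python
import numpy as np

def add_gaussian_noise(signal, snr_db, rng=None):
    """Add zero-mean Gaussian noise so that the result has the requested SNR (dB)."""
    rng = np.random.default_rng() if rng is None else rng
    signal = np.asarray(signal, dtype=float)
    p_signal = np.mean(signal**2)                 # signal power
    p_noise = p_signal / (10 ** (snr_db / 10))    # noise power implied by the SNR
    noise = rng.normal(0.0, np.sqrt(p_noise), size=signal.shape)
    return signal + noise, noise

def measured_snr_db(signal, noise):
    """SNR actually realized by a given noise sample."""
    return 10 * np.log10(np.mean(signal**2) / np.mean(noise**2))

t = np.arange(120_000) / 12_000                   # 10 s at 12 kHz
clean = np.sin(2 * np.pi * 100 * t)               # stand-in for a vibration signal
noisy, noise = add_gaussian_noise(clean, snr_db=-6, rng=np.random.default_rng(0))
check = measured_snr_db(clean, noise)
```

At −6 dB the noise power is roughly four times the signal power, which is consistent with the fault peaks being masked in the spectrum.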
To verify the effectiveness of noise addition, the spectrum, time-domain waveform, and Continuous Wavelet Transform diagrams are provided, taking a signal with Gaussian noise at SNR = −6 dB as an example.
Figure 7 shows the frequency-domain signal comparison. The frequency domain provides a more intuitive view: the original signal exhibits distinct peaks at specific frequencies due to the fault’s impact, while fluctuations at other frequencies are relatively uniform. After adding noise at SNR = −6 dB, the fault peaks are significantly masked, and fluctuations across the entire frequency band become irregular. Traditional models are prone to getting stuck in local optima under such high-noise conditions, failing to learn the fault information or instead learning the noise. This leads to high accuracy during training but chance-level accuracy during validation.
Figure 8 compares time-domain vibration signals. The original time-domain signal exhibits a distinct periodic pattern; after noise is added, the noise completely obscures this pattern, making it impossible to identify fault characteristics.
As shown in
Figure 9, the time-domain periodicity of the original vibration signal is preserved after CWT transformation, appearing as banded stripes. Peak values in the frequency domain appear as red dots at specific locations, indicating where the CWT assigned larger wavelet coefficients. Through time–frequency complementarity, this approach enables the model to capture comprehensive fault characteristics, enhancing its generalization capability and diagnostic effectiveness.
To validate the auto-encoder’s noise reduction effectiveness, we visualized the signal’s envelope spectrum. As shown in Figure 10, the red-boxed section represents noise, which has completely obscured the signal’s fault frequency. After noise reduction using the auto-encoder designed in this paper, the fault features are clearly restored; compared to the envelope spectrum of the original signal, only very faint noise remains. To quantify the gap between the denoised and original signals, we use the MSE metric, as shown in Figure 11. The MSE of the original signal is 0.08, while that of the denoised signal is 0.11, which supports the proposed CAE’s noise reduction capability.
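For readers unfamiliar with the envelope spectrum used in this check, the sketch below computes it with an FFT-based Hilbert transform in NumPy. The amplitude-modulated test signal, with a hypothetical 100 Hz fault frequency riding on a 3 kHz resonance, is illustrative only and is not the paper's data.

```python
import numpy as np

def envelope_spectrum(x, fs):
    """Amplitude spectrum of the signal envelope obtained from the analytic signal."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    spectrum = np.fft.fft(x)
    # Analytic signal: keep DC (and Nyquist), double positive frequencies, zero negatives
    gain = np.zeros(n)
    gain[0] = 1.0
    if n % 2 == 0:
        gain[n // 2] = 1.0
        gain[1 : n // 2] = 2.0
    else:
        gain[1 : (n + 1) // 2] = 2.0
    analytic = np.fft.ifft(spectrum * gain)
    envelope = np.abs(analytic)
    envelope = envelope - envelope.mean()          # remove the DC component
    amplitude = np.abs(np.fft.rfft(envelope)) / n
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    return freqs, amplitude

fs = 12_000
t = np.arange(fs) / fs                             # 1 s of data
carrier = np.cos(2 * np.pi * 3000 * t)             # hypothetical structural resonance
modulated = (1.0 + 0.5 * np.cos(2 * np.pi * 100 * t)) * carrier
freqs, amplitude = envelope_spectrum(modulated, fs)
peak_freq = freqs[np.argmax(amplitude)]
```

The envelope spectrum peaks at the modulation frequency rather than at the carrier, which is why periodic fault impacts that are hidden in the raw spectrum can become visible after demodulation.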
4.2. CWRU Experimental Data
The CWRU dataset includes three types of bearing faults: inner race, outer race, and rolling element. Each fault type has three fault sizes (7, 14, and 21 mils, i.e., 0.007, 0.014, and 0.021 inches), which together with the healthy state form ten classification categories. Bearing vibration signals are acquired at a sampling frequency of 12 kHz. Each sample contains 1024 points, and the overlap rate between adjacent samples is set to 0.5 to make full use of the data while limiting feature redundancy. The dataset is randomly divided into training, validation, and test sets in a 7:3:1 ratio. Specifically, the training set is used for parameter learning, the validation set for hyperparameter tuning and model overfitting monitoring, and the test set for evaluating the model’s generalization performance. Detailed information about the experimental data is shown in Table 5.
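The sample construction can be sketched as a sliding window over each vibration record. The sketch assumes a window of 1024 points and an overlap rate of 0.5 (a step of 512 points); the function name and the synthetic record are hypothetical.

```python
import numpy as np

def segment_signal(signal, window=1024, overlap=0.5):
    """Cut a 1D record into fixed-length samples with the given overlap rate."""
    signal = np.asarray(signal)
    step = int(window * (1 - overlap))     # 512 points for window=1024, overlap=0.5
    if step < 1:
        raise ValueError("overlap must be smaller than 1")
    n = (len(signal) - window) // step + 1
    if n < 1:
        return np.empty((0, window), dtype=signal.dtype)
    return np.stack([signal[i * step : i * step + window] for i in range(n)])

record = np.arange(4096)                   # stand-in for one vibration record
samples = segment_signal(record)
```

With these settings, a record of 4096 points yields seven samples, and consecutive samples share half of their points.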
To validate the fault diagnosis performance of the proposed model, experiments were conducted using the CWRU bearing dataset. Table 6 lists the model’s core evaluation metrics. As shown in Table 6, even under extremely high-noise conditions (SNR = −10 dB), the model maintains an accuracy of 98.57%, attributed to CAE’s powerful noise reduction capability. As the SNR increases, the model’s feature extraction capability progressively improves, further improving diagnostic accuracy. When SNR = None (no added noise), the accuracy reaches 100%. Visualizing the confusion matrix facilitates better analysis of the model’s classification capability for fault categories. As shown in Figure 12, the misclassified samples in the confusion matrix are generally dispersed across categories rather than concentrated in a single one. This indicates that the model possesses good diagnostic capability, is not biased against particular fault types, and exhibits acceptable robustness. As shown in Figure 13, the t-SNE visualization demonstrates strong overall clustering, with distinct separation between categories, indicating that the model learned generalizable features. To further analyze the training process, we visualized the model’s training curve. As shown in Figure 14, the model’s training process shows minimal overall fluctuations. It achieves a high accuracy by the 10th epoch, with the training and validation curves converging by the 20th epoch, indicating optimal diagnostic performance. The subsequent 30 epochs proceed smoothly without overfitting. This stability stems from residual connections ensuring stable gradient propagation, coupled with a cosine annealing strategy for learning rate scheduling, which facilitates a more stable and efficient search for the optimal parameter combination.
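The cosine annealing schedule mentioned above can be written in closed form. The sketch below uses the stated initial learning rate of 0.0001 and 50 epochs; the minimum learning rate of 0 and a single annealing period spanning all epochs are assumptions, since these settings are not reported.

```python
import math

def cosine_annealing_lr(epoch, base_lr=1e-4, total_epochs=50, min_lr=0.0):
    """Learning rate at a given epoch under a single cosine annealing period."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + cosine)

# Learning rate at the start of each epoch, plus the value reached after the last one
schedule = [cosine_annealing_lr(epoch) for epoch in range(51)]
```

The rate falls slowly at first, fastest around the midpoint, and flattens near the end, which is consistent with large early updates followed by small, stable refinements in the later epochs.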
4.2.1. Comparative Experiments
To validate the superiority and stability of the proposed model, we compared it with other advanced models.
Table 7 presents the evaluation metrics for the comparative models. Qiao et al. proposed a dual-input neural network model combining CNN and LSTM. This model incorporates time–frequency signals as input using mini-batch and batch normalization methods, achieving high fault recognition rates under noisy conditions along with excellent noise immunity and load adaptability on the CWRU dataset [
27]. Zhang et al. proposed a fault diagnosis model integrating wavelet denoising with KANTransformer. The front end filters redundant noise from raw signals via a wavelet denoising module, providing high-quality data for feature extraction. The core innovation lies in KANTransformer’s introduction of learnable activation functions within its linear layer, overcoming the expressive limitations of traditional fixed activation functions and significantly enhancing the model’s ability to capture nonlinear and non-stationary features in fault signals. Experimental validation demonstrates that this model exhibits outstanding interference resistance under complex noise conditions, effectively distinguishing between noise and fault features, and delivering excellent fault diagnosis accuracy and robustness [
28]. Shi et al. proposed an EWSNet network algorithm that features wavelet weight initialization and a balanced dynamic adaptive thresholding algorithm, demonstrating the network’s effectiveness and reliability across four datasets [
29]. Zhang et al. proposed a small-sample fault diagnosis method that combines dual-path convolutional attention (DCA) with Bidirectional Gated Recurrent Units (BiGRUs) for noisy, variable operating conditions. This model achieves fault diagnosis by integrating spatio-temporal features using a regularization-based training strategy coupled with BiGRU. Results demonstrate its strong generalization capability and robustness [
30]. All models were evaluated under −4 dB noise conditions.
Figure 15 presents the accuracy bar chart for comparative experiments.
As shown in Figure 16, both model (a) and model (b) achieve diagnostic accuracy exceeding 99% under strong noise conditions with SNR = −4 dB. This is attributed to the inclusion of pre-denoising modules in both models, which denoise the signal before network processing. For instance, method (b) employs wavelet denoising, thereby maintaining high diagnostic precision. Although models (c) and (d) exhibit slightly lower accuracy under noisy conditions, these methods were tested on small datasets. The results reported in those papers were obtained using 150 samples, the maximum sample size in those studies. Generally, accuracy improves to some extent as the sample size increases. Therefore, models (c) and (d) may demonstrate better diagnostic performance when tested on datasets of comparable size.
The aforementioned models each possess distinct advantages; however, most denoising network models rely excessively on data preprocessing while neglecting the model’s inherent denoising capabilities. Furthermore, when a neural network is relied on for denoising under extremely high-noise conditions, training becomes unstable and takes significantly longer. A model that itself exhibits a degree of noise resistance under low-noise conditions is therefore more advantageous for practical deployment. Consequently, this paper tested the core architecture alone, without CAE pre-denoising, at SNRs of −4, −2, 0, and 2 dB, with results shown in
Table 8. The model achieved 94.42% accuracy even at −4 dB, which is attributable to its structural design. The residual CNN block first suppresses noise while ensuring proper gradient propagation, and it works with the attention mechanism to enhance key features along the spatial and channel dimensions. The temporal branch introduces BiGRU, which further suppresses noise through its gating mechanisms while preserving long-term memory. The three feature streams are then deeply fused. As observed in
Figure 16, the training curve remains relatively stable at −4 dB, indicating that the model learns adequately under noisy conditions. Consequently, the model can be deployed readily in practical engineering applications with moderate noise intensity, while CAE pre-denoising can be added to maintain high diagnostic accuracy when the noise becomes excessive.
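The gating behavior credited above for noise suppression can be illustrated with a scalar GRU step. This is a sketch under stated assumptions, not the paper's network: it uses the common GRU formulation in which the update gate `z` interpolates between the previous state and a candidate state, and every weight in the parameter dictionaries is a hypothetical hand-picked value rather than a trained one.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gru_step(x, h, p):
    """One scalar GRU update; `p` holds hypothetical weights, not trained values."""
    r = sigmoid(p["w_r"] * x + p["u_r"] * h + p["b_r"])  # reset gate
    z = sigmoid(p["w_z"] * x + p["u_z"] * h + p["b_z"])  # update gate
    n = math.tanh(p["w_n"] * x + r * p["u_n"] * h + p["b_n"])  # candidate state
    # z near 1 keeps the old state (input largely ignored); z near 0 adopts the candidate.
    return (1.0 - z) * n + z * h


def run_gru(sequence, p):
    """Run the cell over a sequence, starting from a zero state."""
    h, states = 0.0, []
    for x in sequence:
        h = gru_step(x, h, p)
        states.append(h)
    return states


def bigru(sequence, p_forward, p_backward):
    """Pair each time step's forward state with its backward state."""
    forward = run_gru(sequence, p_forward)
    backward = run_gru(sequence[::-1], p_backward)[::-1]
    return list(zip(forward, backward))


base = {"w_r": 0.0, "u_r": 0.0, "b_r": 0.0, "w_z": 0.0, "u_z": 0.0,
        "w_n": 1.0, "u_n": 0.0, "b_n": 0.0}
hold = {**base, "b_z": 6.0}     # update gate ~1: state barely reacts to the input
follow = {**base, "b_z": -6.0}  # update gate ~0: state tracks the input

sequence = [0.0, 0.1, 5.0, 0.1, 0.0]  # toy signal with one impulsive outlier
states = bigru(sequence, hold, follow)
print(states[2])  # (forward state, backward state) at the outlier
```

With the `hold` parameters the impulsive sample leaves the forward state almost unchanged, whereas the `follow` parameters let it through; a trained network would learn where to sit between these two extremes.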
4.2.2. Ablation Experiment
To more clearly demonstrate the contribution of each model module, ablation experiments were conducted on the CWRU dataset at SNR = −4 dB. Specific ablation models are listed in
Table 9. Model-1 uses only the first image channel; Model-2 uses only the second image channel; Model-3 fuses both image channels; Model-4 is Model-3 without the shift attention mechanism; Model-5 is Model-3 without the CBAM; and Model-6 includes only the vibration signal channel.
Table 10 lists the evaluation metrics, and
Figure 17 displays the confusion matrix.
In
Table 10, Model-1 achieves an accuracy of 90.99%, while Model-2 reaches 96.14%, an improvement of 5.15 percentage points. This is attributed to Model-2’s larger convolutional kernels, which provide a broader receptive field and stronger noise suppression. Model-1, in turn, better preserves fine details, complementing Model-2’s strengths. Model-3, which fuses the two branches, achieves 98.71% accuracy, 7.72 percentage points above Model-1 and 2.57 percentage points above Model-2, demonstrating the advantage of dual-channel fusion. Model-4, lacking both the sliding-window attention mechanism and the positional attention mechanism, achieves only 93.13% accuracy, 5.58 percentage points below Model-3, highlighting the importance of the sliding-window attention mechanism. Model-5, without CBAM, is 1.29 percentage points below Model-3, underscoring the role of CBAM in suppressing redundant channels at the final stage. Model-6 achieves 98.28% accuracy, only 0.43 percentage points below Model-3, which is attributed to the BiGRU suppressing redundant noise while providing long-term memory. The full three-channel fusion ultimately attains 99.57% accuracy. The ablation results indicate that the dual attention mechanisms significantly improve accuracy and that the proposed three-channel feature fusion is well suited to fault diagnosis tasks.
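The accuracy differences discussed above are absolute differences between percentages. The short sketch below recomputes them from the quoted accuracies so the comparisons can be checked at a glance; the label `Full` for the three-channel model and the helper `gain` are names introduced here for illustration.

```python
# Accuracies (%) quoted in the ablation discussion. "Full" is this sketch's
# label for the three-channel model; Model-5 is derived from the stated
# 1.29-point drop relative to Model-3 rather than quoted directly.
accuracy = {
    "Model-1": 90.99,
    "Model-2": 96.14,
    "Model-3": 98.71,
    "Model-4": 93.13,
    "Model-6": 98.28,
    "Full": 99.57,
}
accuracy["Model-5"] = round(accuracy["Model-3"] - 1.29, 2)


def gain(model, baseline):
    """Accuracy of `model` minus accuracy of `baseline`, in percentage points."""
    return round(accuracy[model] - accuracy[baseline], 2)


for model, baseline in [
    ("Model-2", "Model-1"),
    ("Model-3", "Model-1"),
    ("Model-3", "Model-2"),
    ("Model-3", "Model-4"),
    ("Model-3", "Model-6"),
    ("Full", "Model-3"),
    ("Full", "Model-6"),
]:
    print(f"{model} vs {baseline}: {gain(model, baseline):+.2f} points")
```

From these figures, the full model exceeds the fused image channels (Model-3) by 0.86 points and the vibration-only channel (Model-6) by 1.29 points.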
4.3. Bearing Data Under Simulated Conditions
The acquisition parameters and operating conditions for the simulated dataset used in this section are as follows. The sampling frequency is 20 kHz, which ensures that high-frequency fault-feature signals are captured during bearing operation. A 6007 deep groove ball bearing was selected as the research subject and run at 1500 rpm to simulate typical medium-to-low-speed industrial operating conditions. The dataset covers four bearing operating states (normal operation, inner-ring fault, outer-ring fault, and rolling-element fault), constituting a four-class fault diagnosis task. Detailed information on the simulated dataset is presented in
Table 11, while
Figure 18 shows the physical layout of the simulation test bench used for data acquisition.
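As a quick check on these acquisition settings, the sketch below derives the shaft frequency, the number of samples per shaft revolution, and the Nyquist limit from the stated 20 kHz sampling rate and 1500 rpm speed. The 1024-point window in the last line is an assumption borrowed from the CWRU sample length; the window length used for the simulated data is not stated here.

```python
def shaft_frequency_hz(speed_rpm):
    """Shaft rotation frequency in Hz."""
    return speed_rpm / 60.0


def samples_per_revolution(sampling_rate_hz, speed_rpm):
    """Number of samples recorded during one shaft revolution."""
    return sampling_rate_hz / shaft_frequency_hz(speed_rpm)


def revolutions_in_window(window_length, sampling_rate_hz, speed_rpm):
    """Number of shaft revolutions covered by a window of `window_length` samples."""
    return window_length / samples_per_revolution(sampling_rate_hz, speed_rpm)


SAMPLING_RATE_HZ = 20_000
SPEED_RPM = 1500

print(shaft_frequency_hz(SPEED_RPM))                             # 25.0
print(samples_per_revolution(SAMPLING_RATE_HZ, SPEED_RPM))       # 800.0
print(SAMPLING_RATE_HZ / 2)                                      # 10000.0 (Nyquist limit, Hz)
print(revolutions_in_window(1024, SAMPLING_RATE_HZ, SPEED_RPM))  # 1.28 (assumed window)
```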
On the simulated data, the proposed model maintains high classification accuracy, indicating good generalization capability.
Figure 19 shows the confusion matrix, and
Table 12 lists the model’s core evaluation metrics.
4.4. Model Parameters and Timeliness
In industrial deployment scenarios, the parameter efficiency and real-time response capability of a fault diagnosis model largely determine its engineering applicability, directly affecting the timeliness of diagnosis and the feasibility of hardware deployment. To assess the engineering potential of the proposed model, the CWRU bearing dataset was used for testing. The dataset comprises 2330 valid samples, each a vibration signal of 1024 points. The model’s timeliness was assessed by measuring its single-run processing time. The model’s parameter counts are shown in
Table 13 and
Table 14. It can be seen that the overall parameter count of the proposed model is at an intermediate level within the industry, without excessive redundancy, providing the fundamental conditions for lightweight deployment. However, as shown in
Table 15, the model’s average single-round runtime is 6.8 s, mainly because of the high computational complexity of the block’s window attention mechanism. This runtime remains acceptable in practical applications. On one hand, bearing faults typically evolve over time scales longer than a few seconds, so a 6.8 s diagnostic delay still permits timely fault warnings and subsequent maintenance decisions. On the other hand, in industrial settings the runtime can be further reduced through hardware upgrades or algorithm optimization. Therefore, the model retains practical value for industrial deployment.
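A runtime figure of this kind can be obtained by timing repeated runs and averaging. The following is a minimal sketch of such a measurement, not the procedure actually used in the study: `placeholder_inference` is a stand-in workload, and `per_sample_latency_ms` rests on the assumption (not stated in the text) that one run processes all 2330 samples.

```python
import statistics
import time


def time_runs(run_once, repeats=5, warmup=1):
    """Return (mean, standard deviation) of wall-clock seconds over `repeats` calls."""
    for _ in range(warmup):  # warm-up calls are excluded from the statistics
        run_once()
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations), statistics.stdev(durations)


def per_sample_latency_ms(total_seconds, n_samples):
    """Average latency per sample, ASSUMING one run processes all n_samples."""
    return total_seconds / n_samples * 1000.0


def placeholder_inference():
    # Stand-in workload (assumption); replace with the real model's forward pass.
    return sum(i * i for i in range(200_000))


mean_s, std_s = time_runs(placeholder_inference)
print(f"mean {mean_s:.4f} s, std {std_s:.4f} s")
print(f"{per_sample_latency_ms(6.8, 2330):.2f} ms per sample")
```

Under that assumption, 6.8 s over 2330 samples corresponds to roughly 2.9 ms per sample; if the 6.8 s instead refers to a single sample, the per-sample figure is simply 6.8 s.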
4.5. Model Interpretability
Model interpretability is crucial as it helps us understand how models make decisions, enhances their credibility, and facilitates targeted improvements. To explain how CNNs extract fault features, we visualized the weight maps of the five convolutional layers in the vibration signal feature extraction pipeline. These five layers comprise the preprocessing convolutional layer and the final convolutional layers within the four residual blocks. As shown in
Figure 20, darker colors indicate larger weights. The figure shows that the first three convolutional layers assign greater weights to the fault location, indicating that the model has learned fault patterns and identifies fault features effectively. The subsequent two convolutional layers incorporate global features, which supports the model’s capacity for both local feature extraction and global temporal modeling.
Understanding how attention mechanisms enable models to capture fault features accurately is a core issue in bearing fault diagnosis. To visually validate the working principles and model interpretability of channel and spatial attention mechanisms, this paper uses feature response heatmaps for both mechanisms. The bar chart in
Figure 21 reveals significant variations in attention weights across feature channels. Channel importance distribution differs across distinct model modules, reflecting the model’s dynamic adjustment of attention levels to each channel based on task requirements during feature processing. This confirms the core capability of channel attention: the model automatically learns and highlights channels more relevant to the task. This mechanism enhances fault-feature extraction accuracy and diagnostic reliability through adaptive feature-channel filtering. The spatial attention heatmap in
Figure 22 indicates that the model automatically assigns high weights to regions in the Continuous Wavelet Transform (CWT) feature map that are directly related to faults. This demonstrates that spatial attention precisely localizes and amplifies fault-specific features in the spatial domain, thereby improving the distinguishability between fault and non-fault features, highlighting its value within the model.
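The per-channel weights discussed above can be sketched with the channel-attention step of the standard CBAM formulation: average-pooled and max-pooled channel descriptors pass through a shared two-layer MLP, are summed, and are squashed by a sigmoid. This is an illustrative sketch only; the feature map and MLP weights are random placeholders, the reduction ratio of 2 is an assumption, and the paper's exact layer configuration may differ.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def shared_mlp(descriptor, w_reduce, w_expand):
    """Two-layer bottleneck MLP (ReLU in between), shared by both pooled descriptors."""
    hidden = [max(0.0, sum(w * d for w, d in zip(row, descriptor))) for row in w_reduce]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_expand]


def channel_attention(feature_map, w_reduce, w_expand):
    """feature_map is a list of C channels, each an H x W nested list of floats."""
    flat = [[v for row in channel for v in row] for channel in feature_map]
    avg_desc = [sum(values) / len(values) for values in flat]  # global average pooling
    max_desc = [max(values) for values in flat]                # global max pooling
    logits = [a + m for a, m in zip(shared_mlp(avg_desc, w_reduce, w_expand),
                                    shared_mlp(max_desc, w_reduce, w_expand))]
    return [sigmoid(v) for v in logits]  # one weight in (0, 1) per channel


def reweight(feature_map, weights):
    """Scale every value in each channel by that channel's attention weight."""
    return [[[w * v for v in row] for row in channel]
            for channel, w in zip(feature_map, weights)]


# Random placeholders (assumptions): 4 channels of 3 x 3 values, reduction ratio 2.
rng = random.Random(0)
channels, height, width, reduction = 4, 3, 3, 2
feature_map = [[[rng.random() for _ in range(width)] for _ in range(height)]
               for _ in range(channels)]
w_reduce = [[rng.uniform(-1, 1) for _ in range(channels)]
            for _ in range(channels // reduction)]
w_expand = [[rng.uniform(-1, 1) for _ in range(channels // reduction)]
            for _ in range(channels)]

weights = channel_attention(feature_map, w_reduce, w_expand)
refined = reweight(feature_map, weights)
print([round(w, 3) for w in weights])
```

Plotting `weights` as a bar chart gives the kind of per-channel view described for the channel attention; in a trained model the MLP weights are learned rather than random.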
5. Conclusions
This paper addresses the challenges of feature extraction and low generalization in noisy environments by proposing a three-channel M-CNNBiAM fault diagnosis method that integrates image and vibration signals. Noisy signals are first denoised by the CAE. The denoised signals are then converted into frequency-domain images via CWT, constructing joint time–frequency data. The frequency-domain images are fed simultaneously into a dual-branch residual-connected CNN that suppresses residual noise, followed by a collaborative attention mechanism that directs the model’s focus to fault locations. The features of the two branches are then concatenated along the channel dimension to obtain frequency-domain features. In parallel, the denoised vibration signal is fed directly into a five-layer residual-connected CNN combined with a BiGRU network, which extracts long-term dynamic features from the vibration signal and outputs temporal features. Finally, the frequency-domain and temporal features are fused along the channel dimension and fed into a classification layer for fault diagnosis. Experimental results show that at SNR = −10 dB, the MSE after CAE denoising is only 0.03 higher than the original-signal MSE, which eases the extraction of fault features under intense noise. Within the image channels, the large-kernel branch provides a larger receptive field, while the small-kernel branch preserves more detail. At SNR = −4 dB, the dual-branch design improves diagnostic accuracy by 7.72 and 2.57 percentage points over the single small-kernel and single large-kernel branches, respectively. Adding the vibration signal branch raises accuracy by a further 0.86 percentage points over the image branches alone (from 98.71% to 99.57%), supporting the complementary value of time–frequency multimodal information fusion within the three-channel architecture.
Moreover, at SNR = −4 dB, the model achieves 94.42% accuracy without CAE pre-denoising, which indicates inherent noise robustness. In practical industrial settings, acquiring high-quality labeled data is challenging, so future research will focus on fault diagnosis under few-label conditions.