1. Introduction
Rolling bearings are fundamental mechanical components in rotating machinery systems, widely deployed across critical industrial applications including wind turbines, aircraft engines, railway vehicles, marine propulsion systems, and manufacturing equipment. As the primary load-bearing and motion-transmission elements, bearings operate under harsh conditions involving high-speed rotation, heavy loads, and severe environmental stress, making them among the most failure-prone components in mechanical systems. Bearing faults account for approximately 30% of all machinery failures, and undetected bearing degradation can lead to catastrophic equipment breakdowns, unplanned production downtime, substantial economic losses, and severe safety hazards [1,2,3]. Consequently, accurate and timely bearing fault diagnosis has become a critical enabler for predictive maintenance strategies, condition-based monitoring systems, and intelligent manufacturing paradigms, ensuring operational reliability, minimizing maintenance costs, and preventing potential accidents in industrial settings.
Deep learning has revolutionized bearing fault diagnosis by enabling end-to-end automatic feature learning from raw vibration signals, eliminating the need for manual feature engineering that relies heavily on domain expertise. Convolutional neural networks have demonstrated remarkable capabilities in extracting hierarchical spatial representations from time-domain, frequency-domain, and time-frequency domain signals [4,5,6]. Recurrent architectures, including LSTM and GRU, effectively capture temporal dependencies in sequential vibration data, while hybrid CNN-RNN models leverage complementary strengths of spatial feature extraction and temporal modeling [1,7,8]. Transfer learning and domain adaptation techniques address cross-domain diagnosis challenges by transferring knowledge across different operating conditions, equipment types, and fault scenarios [9,10,11]. Despite these advances, existing deep learning methods face critical limitations. First, most architectures lack systematic integration of multi-scale feature extraction mechanisms to simultaneously capture both fine-grained transient fault impulses and long-range temporal patterns with varying characteristic frequencies. Second, attention mechanisms are typically applied globally without dedicated architectural components for adaptive channel and spatial feature emphasis tailored to bearing fault characteristics. Third, industrial bearing vibration signals are severely contaminated by measurement noise, background vibrations, and interference from adjacent rotating components, yet most methods lack explicit denoising modules designed to suppress noise while preserving fault-relevant transient information.
Multi-scale feature extraction is essential for comprehensive bearing fault diagnosis because fault signatures manifest across different temporal resolutions and frequency scales. Early-stage bearing defects generate weak impulsive transients with short durations requiring fine-scale receptive fields, while advanced degradation produces periodic impact patterns with longer intervals demanding large-scale temporal modeling. Existing approaches address multi-scale learning through signal decomposition preprocessing [7,12] or multi-branch parallel architectures [13], but these methods either operate independently from deep feature learning or lack adaptive mechanisms to dynamically adjust receptive fields based on input characteristics. Multi-dilated convolutions offer an elegant solution by employing parallel convolutional layers with different dilation rates to simultaneously capture multi-scale temporal patterns within a unified architecture, enabling the network to attend to both local transient impulses and global periodic structures. However, existing multi-scale convolutional designs rarely incorporate systematic dilated convolutions specifically optimized for bearing vibration signal characteristics and lack integration with attention mechanisms for selective feature emphasis.
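To make the scale argument concrete, the following minimal sketch computes how many input samples a single dilated kernel spans. The kernel size of 3 and the dilation rates (1, 2, 4, 8) are illustrative assumptions, not the configuration of MDCAD-Net; the 12 kHz sampling rate is the one used for the CWRU data in Section 4.1.

```python
def dilated_span(kernel_size: int, dilation: int) -> int:
    """Number of input samples covered by one dilated 1D convolution kernel."""
    return dilation * (kernel_size - 1) + 1


SAMPLING_RATE_HZ = 12_000        # CWRU drive-end sampling rate (Section 4.1)
KERNEL_SIZE = 3                  # assumed for illustration
DILATION_RATES = (1, 2, 4, 8)    # assumed for illustration

# Map each dilation rate to the span of its kernel in samples.
spans = {d: dilated_span(KERNEL_SIZE, d) for d in DILATION_RATES}

for dilation, span in spans.items():
    duration_ms = 1000.0 * span / SAMPLING_RATE_HZ
    print(f"dilation={dilation}: {span} samples ({duration_ms:.3f} ms)")
```

With the same number of weights per kernel, larger dilation rates widen the temporal context seen by a branch, which is why parallel branches with different rates can cover both short impulses and longer periodic structure.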
Attention mechanisms have emerged as powerful tools to emphasize discriminative features and suppress irrelevant information in bearing fault diagnosis. Channel attention, exemplified by Squeeze-and-Excitation Networks (SENet), adaptively recalibrates feature channel weights to highlight frequency bands containing fault-relevant information while attenuating noise-dominated channels [6,14]. Spatial attention focuses on informative temporal regions where fault impulses concentrate, reducing the influence of steady-state vibrations and random noise [13]. The Convolutional Block Attention Module (CBAM) combines channel and spatial attention for complementary feature emphasis. Despite their proven effectiveness, most attention-based bearing diagnosis methods apply these mechanisms globally across entire feature maps without considering the unique characteristics of industrial noise interference. Furthermore, limited research exists on systematically integrating dual attention mechanisms (channel and spatial) with multi-scale feature extraction and dedicated denoising modules within a unified architectural framework specifically designed for noisy bearing vibration signal analysis.
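The channel recalibration idea can be sketched in a few lines of NumPy. This is a generic squeeze-and-excitation step on a (channels, time) feature map, not the authors' implementation; the channel count, reduction ratio, and random weights are placeholders chosen only for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def se_channel_attention(features, w_reduce, w_expand):
    """Squeeze-and-excitation on a (channels, time) feature map."""
    squeezed = features.mean(axis=1)                  # squeeze: one value per channel
    hidden = np.maximum(0.0, w_reduce @ squeezed)     # bottleneck layer with ReLU
    weights = sigmoid(w_expand @ hidden)              # per-channel weights in (0, 1)
    return features * weights[:, None], weights       # rescale every channel


rng = np.random.default_rng(0)
channels, time_steps, reduction = 8, 64, 4            # assumed sizes
features = rng.standard_normal((channels, time_steps))
w_reduce = 0.1 * rng.standard_normal((channels // reduction, channels))
w_expand = 0.1 * rng.standard_normal((channels, channels // reduction))

recalibrated, weights = se_channel_attention(features, w_reduce, w_expand)
print(recalibrated.shape, weights.round(3))
```

In a trained network the two weight matrices are learned, so channels that carry fault-relevant content can receive weights close to 1 while noise-dominated channels are attenuated.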
Industrial bearing fault diagnosis faces severe challenges from measurement noise and environmental interference that significantly degrade diagnostic accuracy in real-world applications. Vibration sensors inevitably capture background noise from adjacent machinery, electromagnetic interference, sensor mounting vibrations, and measurement quantization errors that mask weak early-stage fault signatures [3,15,16]. Most existing denoising approaches operate as preprocessing steps through signal filtering, wavelet denoising, or variational mode decomposition, which risk removing fault-relevant transient impulses along with noise. Alternative methods incorporate denoising objectives as regularization terms during training but lack explicit architectural components dedicated to noise suppression. Recent work has begun exploring learnable denoising modules integrated within deep networks [15], yet these approaches typically employ simple filtering operations without gated attention mechanisms to selectively preserve fault features while suppressing noise, and they lack residual connections to prevent information loss during denoising operations.
To address these fundamental limitations and systematically integrate multi-scale feature extraction, dual attention mechanisms, and adaptive denoising within a unified framework, we propose MDCAD-Net (Multi-Dilated Convolution Attention Denoising Network), a novel deep learning architecture specifically designed for robust bearing fault diagnosis under noisy industrial conditions. MDCAD-Net introduces three key architectural innovations. First, a multi-dilated convolution module employs parallel dilated convolutional layers with complementary dilation rates to simultaneously extract fault features across multiple temporal scales, capturing both short-duration transient impulses from early-stage defects and long-range periodic patterns from advanced degradation. Second, dual attention mechanisms synergistically combine Squeeze-and-Excitation Networks for channel-wise feature recalibration and Convolutional Block Attention Module for spatial-temporal feature emphasis, enabling the network to adaptively focus on discriminative frequency bands and informative temporal regions containing fault signatures. Third, a dedicated denoising block incorporates gated attention mechanisms and residual connections to explicitly suppress measurement noise and background interference while preserving fault-relevant transient impulses through selective information propagation, ensuring robust feature learning under severe noise contamination without requiring preprocessing or manual signal filtering. 
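As a rough illustration of how gating and a residual path can work together, the sketch below applies a hypothetical gated residual update to a (channels, time) feature map. The specific form (a sigmoid gate multiplying a tanh candidate, added back to the input) is an assumption made for this example and is not taken from the MDCAD-Net definition; all weights are random placeholders.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gated_residual_block(features, w_gate, b_gate, w_feat):
    """Hypothetical gated residual update on a (channels, time) feature map."""
    gate = sigmoid(w_gate @ features + b_gate[:, None])   # values in (0, 1)
    candidate = np.tanh(w_feat @ features)                 # bounded correction term
    return features + gate * candidate                     # residual path keeps the input


rng = np.random.default_rng(0)
channels, time_steps = 4, 32                               # assumed sizes
features = rng.standard_normal((channels, time_steps))
w_gate = 0.5 * rng.standard_normal((channels, channels))
w_feat = 0.5 * rng.standard_normal((channels, channels))

# A strongly negative gate bias closes the gate, so the input passes through unchanged.
closed = gated_residual_block(features, w_gate, np.full(channels, -50.0), w_feat)
# A zero bias leaves the gate partly open, so a bounded correction is applied.
opened = gated_residual_block(features, w_gate, np.zeros(channels), w_feat)
print(np.abs(closed - features).max(), np.abs(opened - features).max())
```

The point of the residual path is visible in the closed-gate case: whatever the gate decides to suppress, the original features are still propagated, so the block cannot erase information outright.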
Unlike existing architectures that address these challenges in isolation (CNN-LSTM models lack explicit multi-scale receptive fields and noise suppression, ResNet-based models employ fixed-scale convolutions without adaptive feature emphasis, and attention-based CNNs apply global attention without dedicated denoising modules), MDCAD-Net is designed so that each module enhances the effectiveness of subsequent modules: channel recalibration (SENet) pre-emphasizes fault-relevant frequency bands before multi-scale extraction (MDC), spatial-temporal attention (CBAM) refines the multi-scale representations, and the denoising block operates on the most refined features to suppress residual noise. As demonstrated by the combined ablation experiments in Section 4.6, this sequential synergy produces performance gains that exceed the sum of individual module contributions, indicating that the architectural integration provides diagnostic capability that the individual components do not achieve alone.
The main contributions of this work are:
We propose MDCAD-Net, a novel multi-dilated convolution attention denoising network that systematically integrates multi-scale temporal feature extraction, dual attention mechanisms, and adaptive denoising within a unified end-to-end learnable architecture specifically designed for robust bearing fault diagnosis under noisy industrial conditions, addressing the critical limitations of existing methods that treat these components in isolation.
We design a multi-dilated convolution module with parallel dilated convolutional layers employing complementary dilation rates to simultaneously capture fault signatures across multiple temporal scales, enabling comprehensive feature extraction of both short-duration transient impulses from early-stage defects and long-range periodic patterns from advanced bearing degradation, surpassing conventional multi-scale approaches that rely on signal decomposition preprocessing or fixed receptive field designs.
We develop a synergistic dual attention framework combining Squeeze-and-Excitation Networks for adaptive channel-wise feature recalibration and Convolutional Block Attention Module for spatial-temporal feature emphasis, enabling the network to selectively focus on discriminative frequency bands and informative temporal regions containing fault-relevant information while suppressing noise-dominated channels and steady-state vibration segments, with comprehensive ablation studies validating the complementary benefits of channel and spatial attention for bearing fault diagnosis.
We introduce a dedicated denoising block incorporating gated attention mechanisms and residual connections to explicitly suppress measurement noise and background interference during feature learning, achieving robust fault diagnosis under severe noise contamination without requiring preprocessing or manual signal filtering. Comprehensive experiments on the CWRU bearing dataset show that MDCAD-Net achieves an accuracy of 0.9893, outperforming eight competitive baseline models including ResNet-18, Inception-v3, VGG-16, and Transformer-based architectures.
The remainder of this paper is organized as follows.
Section 2 reviews related work on deep learning approaches, multi-scale feature extraction, and attention mechanisms for bearing fault diagnosis.
Section 3 presents the MDCAD-Net methodology with detailed descriptions of the multi-dilated convolution module, dual attention mechanisms, and denoising block architecture.
Section 4 provides comprehensive experimental validation on the CWRU dataset, including performance comparisons with baseline models, ablation studies, and sensitivity analysis.
Section 5 concludes with key findings and future research directions.
2. Related Work
Bearing fault diagnosis has been a critical research area in predictive maintenance and condition monitoring. Recent advances in deep learning have significantly improved diagnostic accuracy, while multi-scale feature extraction and attention mechanisms have emerged as key enablers for handling complex vibration signals. This section reviews existing methods across three key aspects aligned with the MDCAD-Net framework: deep learning approaches, multi-scale feature extraction techniques, and attention mechanisms with denoising strategies.
2.1. Deep Learning Methods for Bearing Fault Diagnosis
Deep neural networks have revolutionized bearing fault diagnosis by enabling automatic feature learning from raw vibration signals, and existing architectures can be broadly categorized into CNN-based, recurrent, and hybrid approaches.
Among CNN-based methods, Chen et al. [
4] proposed a recurrence plot fusion neural network that converts vibration signals into color images and integrates bidirectional GRUs with multi-head attention for bearing fault diagnosis under noisy conditions, achieving 92% accuracy on noise-contaminated data. Ma et al. [
5] designed a CNN for bearing diagnosis using weak magnetic signals with optimized input size and kernel configurations, achieving over 99% accuracy on the CWRU dataset. To reduce computational overhead for edge deployment, Luu and Huynh [
6] developed a depthwise separable multi-scale CNN combined with CBAM and spatial pyramid pooling for lightweight bearing diagnosis, demonstrating over 99% accuracy with significantly fewer parameters on both CWRU and HUST datasets. Similarly, Hu et al. [
17] introduced multiscale depthwise separable convolution with network pruning for bearing diagnosis under variable operating conditions, reducing model size by over 90% while maintaining accuracy above 98%.
Recurrent architectures capture temporal dependencies that CNNs may overlook. Wei and Yuan [
7] proposed a VMD-DCA-BiGRU method combining variational mode decomposition with dual-channel convolutional attention and bidirectional GRUs for bearing diagnosis under variable operating conditions, demonstrating improved robustness across multiple load settings on the CWRU dataset. Wu et al. [
8] introduced a physics-informed attention LSTM framework that embeds bearing dynamic mechanisms into attention-equipped networks for small-sample fault diagnosis, achieving 99.20–99.89% accuracy with enhanced interpretability.
Hybrid architectures that integrate CNNs with Transformers have emerged to combine spatial feature extraction with self-attention capabilities. Chen et al. [
18] proposed a multi-feature fusion dual-channel CNN-Transformer-CAM framework that cross-fuses GADF and GST feature images for gearbox bearing diagnosis under noise interference, achieving over 98% accuracy across multiple operating states. Shi et al. [
19] developed a digital twin-based multi-scale CNN-attention-BiGRU approach for bearing fault identification through virtual-physical data fusion, achieving 99.5% accuracy on simulated-real hybrid datasets.
Transfer learning and domain adaptation address the practical challenges of data scarcity and distribution shift that limit supervised approaches. Domain adaptation methods aim to align feature distributions across different operating conditions or machines. Su et al. [
9] proposed a physics-informed blind wavelet deconvolution transfer network for cross-machine bearing diagnosis, embedding physical priors into wavelet filters to enhance interpretability and achieving over 97% cross-domain accuracy. Deng et al. [
10] introduced a dual-space multi-node collaborative transfer learning method with dynamic MK-MMD and attention LMMD loss functions for multi-source bearing fault diagnosis, effectively aligning heterogeneous source domains and achieving over 95% accuracy on target tasks. Chen et al. [
20] proposed a knowledge-sharing multi-task model for bearing diagnosis that automatically transfers useful features from high sampling frequency data to low sampling frequency tasks, achieving 96.8% accuracy for resolution-degraded signals.
Few-shot and zero-shot methods reduce the dependence on large labeled datasets. Huang et al. [
11] developed an inter-domain similarity-guided meta-learning framework for cross-scenario few-shot bearing diagnosis, achieving over 95% accuracy with only five labeled samples per class. Xia et al. [
21] proposed a cross-scale hybrid contrast network for generalized zero-shot fault diagnosis, generating features for inaccessible fault classes through fault description guidance and achieving over 85% accuracy on unseen categories. Lei et al. [
22] introduced a deep transfer learning approach based on joint generalized sliced Wasserstein distances for semi-supervised bearing fault diagnosis with small samples, achieving 97.56% accuracy through dynamic domain alignment.
Multi-task learning and collaborative frameworks extend diagnostic capability beyond single-task classification. Zhang et al. [
23] proposed an integrated multitasking scheme based on representation learning for bearing fault detection, classification, and unknown fault identification under imbalanced sample conditions, effectively handling class imbalance through modified denoising autoencoders. Xu et al. [
24] developed a federated open-set fault diagnosis method enabling collaborative model training across distributed clients without centralized data sharing, maintaining over 90% accuracy while preserving data privacy. Jia et al. [
25] introduced simulation-reality domain mixup adaptation for bearing fault diagnosis by leveraging bearing simulation models to generate synthetic training data, bridging the sim-to-real gap with over 93% transfer accuracy.
Generative adversarial networks have been applied to address imbalanced and limited-sample challenges. Liu et al. [
26] introduced a condition multidomain GAN framework for bearing diagnosis under limited raw samples, fusing two-domain information to capture sample distributions and improving classification accuracy by over 5% compared to non-augmented baselines. Luo et al. [
27] proposed a Wasserstein GAN meta-learning algorithm for early motor bearing fault diagnosis, achieving feature generation through adversarial learning and demonstrating 96.5% accuracy for incipient faults. Wu et al. [
28] developed a GA-optimized wavelet transform combined with Wasserstein conditional GAN for shearer rocker arm bearing diagnosis, achieving 98.2% accuracy under small-sample conditions through targeted data augmentation. Despite the effectiveness of these deep learning methods, most require extensive labeled data and substantial computational resources while lacking explicit mechanisms to handle industrial noise and extract multi-scale fault features simultaneously.
2.2. Multi-Scale Feature Extraction and Signal Processing
Multi-scale feature learning captures fault signatures across different temporal and frequency resolutions, which is critical for detecting transient fault impulses whose characteristics vary with fault type and severity. Among the available approaches, time-frequency decomposition methods and multi-scale convolutional architectures represent two complementary strategies.
Time-frequency decomposition methods provide adaptive multi-resolution representations of non-stationary bearing signals. Variational mode decomposition (VMD) and its adaptive variants have been particularly effective for separating multi-component signals. Liu et al. [
12] developed an AOVMD-ScaleShift-Net framework that uses an enhanced crested porcupine optimizer to co-optimize spectral kurtosis and time-frequency entropy for bearing fault diagnosis, achieving 99.52% accuracy on the CWRU dataset. Liu et al. [
29] proposed a hybrid denoising model integrating improved whale optimization-based VMD with dataset-specific wavelet thresholding for rolling bearing early fault signal preprocessing, achieving RMSE as low as 0.00013–0.00041 and NCC of 0.9689–0.9798. Shi et al. [
30] proposed a generalized envelope nonlinear Gini index–gram guided two-stage chirp mode decomposition for shield machine main bearing diagnosis, effectively separating fault impulses from strong noise interference and achieving over 95% identification accuracy. Guo et al. [
31] introduced an improved red-billed blue magpie optimizer to enhance maximum second-order cyclostationary blind deconvolution for bearing fault diagnosis in metro train bogies, autonomously optimizing filter length and cyclic frequency parameters to achieve 100% classification accuracy.
Multi-scale convolutional architectures extract features at different receptive field scales through learned filters. Chen et al. [
13] developed a time-frequency-aware feature disentanglement framework with dynamic convolutions for bearing diagnosis under variable speed conditions, adaptively adjusting kernel responses to non-stationary features and achieving over 97% accuracy across multiple speed settings. Du et al. [
1] proposed a PSR-CNN-DLSTM model integrating phase space reconstruction with deep LSTM for bearing fault diagnosis, effectively capturing nonlinear dynamics through multi-scale temporal modeling and achieving 98.6% accuracy. Dynamic modeling approaches provide complementary physical insights: E et al. [
32] established a lumped parameter dynamic model for cycloid gear bearings in rotate vector reducers, revealing fault vibration characteristics through time-varying mesh stiffness and Hertz contact force analysis, which informs feature selection for data-driven methods.
Beyond decomposition-based approaches, advanced feature extraction techniques characterize fault patterns through mathematical transforms and complexity measures. Envelope spectrum analysis and resonance demodulation are widely used to extract fault characteristic frequencies from modulated vibration signals. Chen et al. [
33] developed a clustering weighted envelope spectrum method for compound bearing fault diagnosis that automatically identifies potential fault frequencies without prior knowledge, achieving over 95% identification accuracy on signals with multiple co-existing faults. Fu et al. [
34] proposed resonance demodulation based on dynamic bearing models with improved empirical Fourier decomposition for bearing fault detection, effectively isolating fault-induced resonance bands and improving diagnostic signal-to-noise ratio by over 10 dB. Cao et al. [
35] introduced sparse Bayesian learning with a categorical probabilistic model for compound bearing fault diagnosis, reducing energy leakage effects and achieving accurate separation of co-existing inner race and outer race faults. Zhang et al. [
36] proposed a time-domain sparsity-based method using pulse signal-to-noise ratio for fast bearing fault diagnosis, enabling real-time detection by reducing computational time by 80% compared to frequency-domain approaches while maintaining over 96% accuracy.
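The envelope-spectrum principle referred to above can be illustrated on a synthetic signal. In this sketch a 3 kHz carrier (standing in for a structural resonance) is amplitude-modulated at 100 Hz (standing in for a fault repetition rate); both frequencies are hypothetical values chosen for the example, and the analytic signal is computed with a plain FFT-based Hilbert transform.

```python
import numpy as np


def analytic_signal(x):
    """FFT-based analytic signal (the signal plus j times its Hilbert transform)."""
    n = len(x)
    spectrum = np.fft.fft(x)
    weights = np.zeros(n)
    weights[0] = 1.0
    if n % 2 == 0:
        weights[n // 2] = 1.0
        weights[1:n // 2] = 2.0
    else:
        weights[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spectrum * weights)


def envelope_spectrum(x, fs):
    """Magnitude spectrum of the mean-removed signal envelope."""
    envelope = np.abs(analytic_signal(x))
    envelope = envelope - envelope.mean()
    magnitude = np.abs(np.fft.rfft(envelope)) / len(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    return freqs, magnitude


fs = 12_000                                   # sampling rate in Hz
t = np.arange(fs) / fs                        # one second of samples
fault_rate_hz, resonance_hz = 100.0, 3000.0   # hypothetical example values
x = (1.0 + 0.5 * np.cos(2 * np.pi * fault_rate_hz * t)) * np.cos(2 * np.pi * resonance_hz * t)

raw_peak_hz = np.fft.rfftfreq(len(x), d=1.0 / fs)[np.argmax(np.abs(np.fft.rfft(x)))]
freqs, magnitude = envelope_spectrum(x, fs)
envelope_peak_hz = freqs[np.argmax(magnitude)]
print(raw_peak_hz, envelope_peak_hz)
```

The raw spectrum is dominated by the carrier, whereas the envelope spectrum peaks at the modulation rate, which is the quantity of diagnostic interest.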
Entropy-based measures quantify signal complexity to distinguish fault-induced patterns from healthy-state vibrations. Li and Ding [
37] introduced geometric entropy capturing phase space geometric properties for bearing fault classification, achieving 96–97% accuracy with superior robustness to noise compared to conventional entropy measures. Liu et al. [
38] proposed quadratic Manhattan entropy combined with random forest classifiers for weak-fault bearing diagnosis under strong background interference, demonstrating a 4–8% accuracy improvement over traditional entropy features. Liu et al. [
39] presented analytical vibration signal models for rolling element bearings by deriving closed-form spectral equations for different defect types, providing a theoretical foundation for feature selection in bearing diagnosis. However, traditional signal processing methods typically require manual parameter tuning and domain expertise, while most learning-based approaches extract features at a single fixed scale rather than systematically integrating features across multiple dilation rates with adaptive receptive fields for comprehensive temporal pattern modeling.
2.3. Attention Mechanisms and Noise Robustness
Attention mechanisms have demonstrated significant advantages in emphasizing fault-relevant features while suppressing background interference and can be categorized by the dimension along which attention is applied: channel, spatial, or multi-modal.
Channel attention adaptively recalibrates feature channel weights to highlight discriminative frequency bands. Luu and Huynh [
6] integrated CBAM with multi-scale depthwise separable convolutions for bearing diagnosis across mechanical domains, achieving fault-frequency invariant feature extraction and over 99% accuracy on both CWRU and HUST datasets. Cheng et al. [
14] proposed symmetric positive definite manifold deep metric learning with channel-wise feature selection through denoising autoencoders for bearing fault classification, demonstrating improved inter-class separability and achieving over 97% accuracy under domain shift conditions. Spatial and temporal attention mechanisms focus on informative regions within feature maps that contain fault impulses. Chen et al. [
4] combined CNNs with multi-head attention and bidirectional GRUs for bearing diagnosis from recurrence plot representations, achieving 92% accuracy by selectively attending to temporally salient fault patterns in noisy signals.
Multi-modal attention and knowledge-driven approaches extend beyond single-signal analysis to exploit complementary information sources. Peng et al. [
40] developed a multimodal knowledge graph construction method for bearing fault diagnosis that integrates time series vibration signals, spectrum data, and textual descriptions with a relation cascade graph attention network, demonstrating robust performance with over 96% accuracy across seven bearing datasets. Wang et al. [
41] proposed a multi-electrical signal analysis method for permanent magnet synchronous machine bearing fault diagnosis, using variational mode decomposition to extract fault harmonic components from electrical signals under various controller bandwidths, achieving over 94% accuracy without requiring additional vibration sensors. While dual attention mechanisms combining channel and spatial attention provide complementary benefits, most existing methods apply attention globally without explicit architectural components dedicated to noise suppression.
Noise robustness is critical for industrial bearing fault diagnosis, where measurement interference and background vibrations severely degrade diagnostic accuracy. Existing denoising strategies can be broadly divided into signal-level preprocessing approaches and network-integrated methods.
Signal-level denoising methods suppress noise before or during feature extraction. Zou et al. [
15] proposed a multimodal-enhanced hybrid denoising network for marine gearbox diagnosis under extreme noise, combining cross-dimensional fusion with adaptive time-frequency masking and achieving over 93% accuracy at SNR levels as low as
dB. Liu et al. [
16] introduced multikernel correntropy transfer robust dictionary learning for bearing diagnosis under non-Gaussian noise and outlier contamination, minimizing outlier impacts through correntropy-based optimization and achieving over 95% accuracy in heavy-tailed noise environments. Sheng et al. [
3] developed an adaptive filtering network combining frequency-domain Butterworth filters with deep neural networks for cross-domain bearing diagnosis under noise interference, achieving 96.3% accuracy through learned filter parameter optimization. Liu et al. [
42] constructed wavelet phase space reconstruction for online diagnosis of railway bogie bearing unbalance impact faults, combining oversampling techniques with wavelet analysis to achieve real-time fault detection under operational vibration interference.
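Several of the results above are reported at a given signal-to-noise ratio. As a reminder of how such test conditions are commonly constructed, the sketch below adds white Gaussian noise to a clean signal at a target SNR, using SNR_dB = 10 log10(P_signal / P_noise). The test signal and the SNR values are arbitrary illustrations and are not taken from any of the cited studies.

```python
import numpy as np


def add_noise_at_snr(signal, snr_db, rng):
    """Return signal plus white Gaussian noise scaled to the requested SNR in dB."""
    noise = rng.standard_normal(signal.shape)
    signal_power = np.mean(signal ** 2)
    target_noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    # Rescale so the empirical noise power matches the target exactly.
    noise = noise * np.sqrt(target_noise_power / np.mean(noise ** 2))
    return signal + noise


def measured_snr_db(clean, noisy):
    """Empirical SNR in dB of a noisy signal relative to its clean version."""
    noise = noisy - clean
    return 10.0 * np.log10(np.mean(clean ** 2) / np.mean(noise ** 2))


rng = np.random.default_rng(0)
t = np.arange(1024) / 12_000
clean = np.sin(2 * np.pi * 160.0 * t)          # arbitrary test tone

measured = {}
for snr_db in (-4.0, 0.0, 10.0):               # arbitrary example levels
    noisy = add_noise_at_snr(clean, snr_db, rng)
    measured[snr_db] = measured_snr_db(clean, noisy)
    print(f"target {snr_db:+.1f} dB, measured {measured[snr_db]:+.3f} dB")
```

At 0 dB the noise carries as much power as the signal, and at negative SNR it carries more, which is the regime where weak early-stage fault impulses are most easily masked.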
Self-powered monitoring systems represent an emerging direction for vibration measurement under resource-constrained industrial conditions. Jiang et al. [
43] developed a weak vibration energy-powered acceleration monitoring system using triboelectric nanogenerator arrays for self-powered bearing fault diagnosis, achieving measurement sensitivity of 0.38 V/(m/s²) without external power. Dong et al. [
44] designed an embedded triboelectric bearing integrating sensors directly into conventional bearing structures for condition monitoring, demonstrating continuous fault detection capability through triboelectrification-based signal generation. However, existing denoising methods typically operate as preprocessing steps or regularization terms, lacking dedicated architectural modules that integrate noise suppression with feature learning through gated attention mechanisms and residual connections.
Despite these advances, existing methods typically address multi-scale extraction, attention mechanisms, and denoising in isolation rather than through unified architectural design. Most approaches lack systematic integration of these components or fail to provide explicit denoising blocks specifically designed for industrial bearing vibration signals. Moreover, limited analysis exists on how different attention types contribute synergistically to fault diagnosis performance. Our MDCAD-Net addresses these limitations by proposing a unified architecture that systematically combines multi-dilated convolutions for adaptive multi-scale temporal feature extraction, dual attention mechanisms (SENet for channel recalibration and CBAM for spatial-temporal attention) for complementary feature emphasis, and a dedicated denoising block with gated attention and residual connections specifically designed for suppressing measurement noise while preserving fault-relevant transient impulses through selective information propagation.
4. Experimental Results and Analysis
4.1. Datasets
We conduct experiments on the Case Western Reserve University (CWRU) bearing dataset [47], a widely recognized benchmark for bearing fault diagnosis research. The dataset contains vibration signals collected from rolling element bearings under various fault conditions using accelerometers mounted at different locations on the motor housing.
The CWRU bearing dataset comprises vibration signals collected at two sampling frequencies: 12 kHz for drive end (DE) and fan end (FE) bearings and 48 kHz for drive end bearings. In this study, we focus on the 12 kHz drive end accelerometer data, which includes measurements from both healthy bearings and bearings with single-point faults introduced using electro-discharge machining. The faults are categorized into three types based on their locations: inner race (IR), outer race (OR), and ball (rolling element) faults. Additionally, three fault severity levels are considered, corresponding to fault diameters of 0.007 inches (7 mils), 0.014 inches (14 mils), and 0.021 inches (21 mils). Together with the healthy condition, the three fault types at three severity levels give the 10 classes used in our experiments, as shown in Table 1.
Figure 3 shows the CWRU bearing test rig, which consists of an electric motor, a torque transducer/encoder, a dynamometer, and bearings at the drive end and fan end positions.
Data Preprocessing and Segmentation. The continuous vibration signals are segmented into overlapping samples using a sliding window with 50% overlap, i.e., a stride of 512 points. Each sample contains 1024 data points, corresponding to approximately 0.085 s of vibration data at the 12 kHz sampling frequency. This window length is selected to capture sufficient temporal information for fault characteristic extraction while keeping the input length manageable for efficient model training. The overlap increases the number of training samples and helps the model learn robust features from different temporal positions within the fault signals.
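The windowing described above can be sketched as follows, using the stated 1024-point window and 50% overlap. The function name and the synthetic one-second input are placeholders; a real pipeline would load a CWRU recording instead.

```python
import numpy as np


def segment_signal(signal, window=1024, overlap=0.5):
    """Cut a 1D signal into fixed-length windows with fractional overlap."""
    step = int(window * (1.0 - overlap))               # 512 samples for 50% overlap
    n_segments = (len(signal) - window) // step + 1
    if n_segments <= 0:                                # signal shorter than one window
        return np.empty((0, window), dtype=signal.dtype)
    starts = np.arange(n_segments) * step
    return np.stack([signal[s:s + window] for s in starts])


SAMPLING_RATE_HZ = 12_000
signal = np.arange(SAMPLING_RATE_HZ, dtype=np.float64)  # stand-in for 1 s of data
segments = segment_signal(signal)

window_seconds = segments.shape[1] / SAMPLING_RATE_HZ
print(segments.shape, round(window_seconds, 4))
```

Because consecutive windows share half of their samples, adjacent segments are strongly correlated; splitting recordings before windowing is one way to keep such overlapping segments from landing in different subsets.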
The dataset is split into training (70%), validation (15%), and testing (15%) sets following a stratified random split to ensure balanced class distribution across all subsets. All vibration signals are normalized to zero mean and unit variance using standardization before being fed into the network, ensuring consistent input scale across different fault conditions and facilitating stable gradient-based optimization during training.
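A minimal sketch of the split and normalization steps is given below. It assumes per-sample standardization (the text does not state whether statistics are computed per sample or over the whole training set), and the helper names, class sizes, and seed are illustrative.

```python
import numpy as np


def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=42):
    """Return train/val/test index arrays with the same class balance in each."""
    rng = np.random.default_rng(seed)
    train, val, test = [], [], []
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        n_train = int(round(fractions[0] * len(idx)))
        n_val = int(round(fractions[1] * len(idx)))
        train.extend(idx[:n_train])
        val.extend(idx[n_train:n_train + n_val])
        test.extend(idx[n_train + n_val:])               # remainder goes to the test set
    return np.array(train), np.array(val), np.array(test)


def standardize(segments, eps=1e-8):
    """Scale each segment (row) to zero mean and unit variance."""
    mean = segments.mean(axis=1, keepdims=True)
    std = segments.std(axis=1, keepdims=True)
    return (segments - mean) / (std + eps)


rng = np.random.default_rng(0)
labels = np.repeat(np.arange(10), 100)                   # 10 classes, 100 samples each (assumed)
segments = 3.0 * rng.standard_normal((labels.size, 1024)) + 2.0

train_idx, val_idx, test_idx = stratified_split(labels)
normalized = standardize(segments)
print(len(train_idx), len(val_idx), len(test_idx))
```

Splitting class by class is what guarantees the 70/15/15 proportions hold within every class rather than only on average.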
Figure 4 presents a multi-domain visualization of CWRU bearing vibration signals, demonstrating the discriminative characteristics of different fault conditions. The time-domain waveforms (
Figure 4a–c) show that normal bearings exhibit relatively smooth oscillations, while fault conditions display significant amplitude modulations and impulsive features with peak amplitudes reaching 1.5 times the normal values for inner race faults. The frequency-domain spectra (
Figure 4d–f) reveal that fault conditions produce characteristic frequency peaks at bearing defect frequencies (e.g., BPFI around 500–800 Hz) and increased high-frequency content above 3 kHz, distinguishing them from the low-frequency concentration observed in normal bearings. The time-frequency spectrograms (
Figure 4g,h) further illustrate temporal variations, with fault conditions exhibiting periodic modulation patterns and intermittent high-frequency components absent in normal operation. These multi-domain signal characteristics validate the feasibility of deep learning-based approaches for automated bearing fault diagnosis.
4.2. Experimental Setting
All experiments are conducted using PyTorch 1.13.0 on a workstation equipped with an NVIDIA RTX 3090 GPU (24 GB memory), Intel Core i9-10900K CPU, and 64 GB RAM. The input to MDCAD-Net consists of 1D vibration signals with a fixed length of 1024 time steps, reshaped to match the network’s input requirements. All models are trained for 30 epochs with a batch size of 32 for both training and validation. We employ the Adam optimizer with an initial learning rate of 0.001 and implement a ReduceLROnPlateau learning rate scheduler that reduces the learning rate by a factor of 0.5 when the validation loss plateaus for 5 consecutive epochs. Cross-entropy loss is used as the objective function, and early stopping with a patience of 10 epochs is applied to prevent overfitting. To ensure reproducible results, we fix the random seed to 42 for all experiments.
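The interaction between the learning rate scheduler and early stopping can be illustrated with a small framework-free simulation. This is a simplified stand-in rather than PyTorch's ReduceLROnPlateau: it treats any strict decrease in validation loss as an improvement (no relative threshold or cooldown) and, following the PyTorch convention, reduces the rate once the number of non-improving epochs exceeds the patience. The function name and the loss sequence are hypothetical.

```python
def simulate_schedule(val_losses, lr=1e-3, factor=0.5,
                      lr_patience=5, stop_patience=10):
    """Replay validation losses; return (learning rate per epoch, stop epoch or None)."""
    best = float("inf")
    bad_lr = 0    # non-improving epochs since the last improvement or LR reduction
    bad_stop = 0  # non-improving epochs since the best validation loss
    history = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, bad_lr, bad_stop = loss, 0, 0
        else:
            bad_lr += 1
            bad_stop += 1
        if bad_lr > lr_patience:  # more than `lr_patience` bad epochs in a row
            lr *= factor
            bad_lr = 0
        history.append(lr)
        if bad_stop >= stop_patience:  # early stopping
            return history, epoch
    return history, None


# Hypothetical validation losses: three improving epochs, then a plateau.
losses = [1.0, 0.8, 0.7] + [0.7] * 12
lrs, stop_epoch = simulate_schedule(losses)
print(lrs[7], lrs[8], stop_epoch)
```

In this trace the rate is halved once during the plateau before early stopping ends the run, which shows why the stopping patience (10) is set longer than the scheduler patience (5).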
Table 2 summarizes the key hyperparameters and architectural configurations of MDCAD-Net. These settings are carefully tuned through preliminary experiments to balance model performance and computational efficiency.
4.3. Evaluation Metrics
To comprehensively evaluate the performance of MDCAD-Net and baseline methods for bearing fault diagnosis, we employ four standard classification metrics widely adopted in the machine learning and fault diagnosis literature: accuracy, precision, recall, and F1-score. These metrics provide complementary perspectives on model performance, capturing both overall correctness and class-specific diagnostic capabilities.
Accuracy measures the overall proportion of correctly classified samples across all fault classes:

$$\mathrm{Accuracy} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{I}(y_i = \hat{y}_i)$$

where $N$ is the total number of test samples, $y_i$ is the ground truth label for the $i$-th sample, $\hat{y}_i$ is the predicted label, and $\mathbb{I}(\cdot)$ is the indicator function that equals 1 when the condition is true and 0 otherwise.
For multi-class classification, we compute precision and recall for each class and report the macro-averaged values. Precision for class $k$ measures the proportion of correctly identified samples among all samples predicted as class $k$:

$$\mathrm{Precision}_k = \frac{TP_k}{TP_k + FP_k}$$

where $TP_k$ (true positives) is the number of samples correctly classified as class $k$, and $FP_k$ (false positives) is the number of samples incorrectly classified as class $k$. The macro-averaged precision is computed as

$$\mathrm{Precision} = \frac{1}{K}\sum_{k=1}^{K} \mathrm{Precision}_k$$

where $K$ is the total number of fault classes.
Recall for class $k$ measures the proportion of correctly identified samples among all samples that truly belong to class $k$:

$$\mathrm{Recall}_k = \frac{TP_k}{TP_k + FN_k}$$

where $FN_k$ (false negatives) is the number of samples of class $k$ that are incorrectly classified as other classes. The macro-averaged recall is

$$\mathrm{Recall} = \frac{1}{K}\sum_{k=1}^{K} \mathrm{Recall}_k$$

The F1-score provides a harmonic mean of precision and recall, offering a balanced measure of diagnostic performance:

$$\mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
All metrics are reported as decimal values in the range [0, 1], where higher values indicate better performance. The use of macro-averaging ensures that each fault class contributes equally to the overall evaluation, which is particularly important for assessing model performance across different fault types and severities.
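The four metrics can be computed directly from label lists, as in the sketch below. The function name `macro_metrics` and the toy labels are hypothetical. The macro F1 here averages per-class F1 scores, which is one common convention; taking the harmonic mean of the macro-averaged precision and recall is another, and which of the two underlies the reported values is an assumption.

```python
def macro_metrics(y_true, y_pred, num_classes):
    """Return accuracy and macro-averaged precision, recall, and F1."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precisions, recalls, f1s = [], [], []
    for k in range(num_classes):
        tp = sum(t == k and p == k for t, p in pairs)  # correctly predicted as k
        fp = sum(t != k and p == k for t, p in pairs)  # wrongly predicted as k
        fn = sum(t == k and p != k for t, p in pairs)  # class k predicted as other
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / num_classes,
        "recall": sum(recalls) / num_classes,
        "f1": sum(f1s) / num_classes,  # mean of per-class F1 (assumed convention)
    }


# Toy three-class example (not data from the paper).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
metrics = macro_metrics(y_true, y_pred, num_classes=3)
print(metrics)
```

Classes with no predicted or no true samples are assigned a score of 0 here to avoid division by zero; other tools may handle that case differently.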
4.4. Baselines
To comprehensively evaluate MDCAD-Net, we compare it against eight representative baseline models that span different architectural paradigms. We select five convolutional neural networks: ResNet-18 [48] with residual connections, Inception-v3 [49] with multi-scale feature extraction, VGG-16 [50] with a very deep architecture, as well as WenCNN [51] and MA1DCNN [52], which are specifically designed for bearing fault diagnosis with specialized 1D convolutional architectures. Additionally, we include SequentialLSTM [53], which captures temporal dependencies through long short-term memory units, and two Transformer-based models: Autoformer [54], which incorporates decomposition with auto-correlation, and TARNet [55], which employs task-aware reconstruction mechanisms. To ensure a fair comparison, all baseline models are reimplemented by the authors in PyTorch and trained from scratch under identical experimental conditions: the same data preprocessing pipeline (z-score normalization, 1024-point windowing with 50% overlap), the same stratified 70%/15%/15% train/validation/test split with random seed 42, the same Adam optimizer (learning rate 0.001, batch size 32, 30 epochs), and the same ReduceLROnPlateau scheduler. For the 2D architectures (ResNet-18, Inception-v3, VGG-16), the 1D vibration signals are reshaped into single-channel 2D inputs following standard practice. No hyperparameter tuning is performed individually for any baseline; all share the same training protocol to eliminate confounding factors. Implementation details follow the original papers with necessary adaptations for 1D vibration signal input.
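One common way to feed a 1024-point sample to a 2D architecture is to fold it row by row into a 32 × 32 single-channel image. The sketch below assumes this layout; the exact reshaping used for the 2D baselines is not stated, and the function name `to_square_image` is hypothetical.

```python
def to_square_image(sample, side=32):
    """Fold a 1D sample of length side*side into a row-major 2D grid."""
    if len(sample) != side * side:
        raise ValueError(f"expected {side * side} points, got {len(sample)}")
    # Row r holds points r*side .. (r+1)*side - 1 of the original signal.
    return [sample[r * side:(r + 1) * side] for r in range(side)]


# Hypothetical 1024-point sample.
sample = [float(i) for i in range(1024)]
image = to_square_image(sample)
print(len(image), len(image[0]))  # rows, columns
```

In this layout, horizontally adjacent pixels are consecutive time steps, while vertically adjacent pixels are 32 time steps apart.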
4.5. Main Results
Table 3 presents the comprehensive performance comparison of MDCAD-Net against eight baseline models on the CWRU bearing fault diagnosis task. MDCAD-Net achieves the best overall performance with 0.9893 accuracy and maintains balanced performance across all evaluation metrics, including precision (0.9894), recall (0.9893), and F1-score (0.9892). The consistent performance across all metrics indicates that the model achieves reliable fault detection without bias toward specific fault types or classes, demonstrating the effectiveness of integrating multi-dilated convolutions with dual attention mechanisms and denoising capabilities.
Among the baseline models, convolutional neural network architectures demonstrate strong performance, with ResNet-18 achieving competitive results at 0.9854 accuracy due to its deep residual connections that facilitate gradient flow and feature reuse across multiple layers. Inception-v3 obtains 0.9821 accuracy through multi-scale convolutional kernels that capture features at different temporal resolutions, while the specialized bearing fault diagnosis models MA1DCNN and WenCNN achieve 0.9765 and 0.9612 accuracy, respectively, validating the effectiveness of domain-specific architectural designs for vibration signal analysis. Transformer-based architectures show varying performance, with TARNet demonstrating 0.9687 accuracy through temporal attention mechanisms, while Autoformer exhibits relatively lower performance at 0.9401 accuracy, suggesting that general-purpose time series models may require additional adaptations for fault diagnosis tasks. The recurrent architecture SequentialLSTM achieves 0.9487 accuracy, indicating that temporal sequential modeling alone may be insufficient for capturing the complex spectral patterns in bearing vibration signals without explicit multi-scale feature extraction and attention-based feature selection mechanisms.
To provide deeper insight into the model’s classification behavior, Figure 5 presents the confusion matrix for MDCAD-Net on the CWRU dataset. The matrix demonstrates excellent diagonal dominance, indicating strong discriminative capability across all 10 classes. Seven classes achieve perfect classification (100%): Normal, 7-Inner, 7-Outer, 14-Inner, 14-Outer, 21-Inner, and 21-Outer, demonstrating the model’s ability to distinguish between different fault locations and damage severities. Only the three ball fault classes show minor misclassifications: 7-Ball exhibits 3.2% confusion with 21-Ball; 14-Ball shows 4.8% misclassification distributed across Normal (1.6%), 7-Ball (0.5%), 21-Inner (1.1%), and 21-Outer (1.6%); and 21-Ball exhibits 2.7% confusion with 14-Inner. All of these errors originate from ball fault samples. Some are confusions between ball faults of different damage sizes, which share similar frequency characteristics but differ in amplitude modulation patterns; the remainder are confusions with the Normal, inner race, or outer race classes. Ball faults therefore represent the most challenging discrimination task in this evaluation. The model’s ability to correctly classify over 98% of samples across all fault severities demonstrates the effectiveness of the multi-scale feature extraction and dual attention mechanisms in capturing subtle differences between fault signatures.
Figure 6 further breaks down the performance across individual fault classes, presenting accuracy, precision, recall, and F1-score for each of the 10 classes through a comprehensive four-panel visualization. The results reveal distinct performance patterns across different fault locations. Normal bearing state achieves perfect classification with all metrics at 1.00, providing a reliable baseline for fault detection. Outer race faults (7-Outer, 14-Outer, 21-Outer) consistently achieve perfect performance across all damage severities, with accuracy, precision, recall, and F1-score all at 1.00, indicating that outer race defects generate highly distinctive vibration patterns that are easily separable from other fault types. Inner race faults (7-Inner, 14-Inner, 21-Inner) also demonstrate excellent performance with all metrics above 0.97, showing that the model effectively captures the characteristic modulation patterns associated with inner raceway damage. Ball faults represent the most challenging category, with 14-Ball showing the lowest accuracy at 0.9520 and precision at 0.9524, yet maintaining robust recall (0.9520) and F1-score (0.9520) above 0.95. The relatively lower performance on ball faults is expected, as rolling element defects produce more complex and variable vibration signatures that depend on ball position, load zone, and contact angle dynamics. This class-wise analysis confirms that MDCAD-Net maintains balanced and robust performance across all fault types and damage severities, which is essential for reliable industrial fault diagnosis systems where both false alarms and missed detections carry significant operational costs.
The superior performance of MDCAD-Net compared to baseline models can be attributed to three key architectural innovations. First, the multi-dilated convolution module enables simultaneous capture of both fine-grained transient impulses and long-range periodic patterns in vibration signals through parallel branches with different receptive fields, providing richer temporal representations than single-scale convolutional approaches. Second, the dual attention mechanism combining SENet and CBAM selectively emphasizes discriminative features across both channel and spatial dimensions, suppressing noise-contaminated channels while focusing on informative temporal regions that carry fault-specific signatures. Third, the denoising block with gated attention effectively filters background noise and measurement interference that are prevalent in real-world industrial environments where sensor signals are subject to multiple sources of contamination. The synergistic combination of these components enables MDCAD-Net to extract robust fault signatures that generalize well across different fault types, damage severities, and operating conditions, as evidenced by the consistent high performance across all classes and the minimal confusion between fault categories.
4.6. Ablation Study
To systematically validate the contribution of each component in MDCAD-Net, we conduct ablation studies by removing individual modules while keeping other components unchanged.
Table 4 presents the performance degradation when removing specific modules, demonstrating the importance of each architectural component for achieving optimal fault diagnosis performance.
The full MDCAD-Net model achieves the best performance with 0.9893 accuracy. Removing the denoising block reduces accuracy to 0.9812 (a decrease of 0.81 percentage points), indicating that the gated attention mechanism provides meaningful noise suppression for real-world bearing vibration signals. The CBAM module has a larger impact: its removal decreases accuracy to 0.9754 (1.39 points), validating the effectiveness of sequential channel and spatial attention for emphasizing discriminative temporal regions. The MDC module contributes substantially, with its absence reducing accuracy to 0.9631 (2.62 points), confirming the importance of multi-scale temporal feature extraction through parallel dilated convolutions with different receptive fields. The SENet module has the largest impact among individual components, with its removal decreasing accuracy to 0.9487 (4.06 points), highlighting the essential role of channel-wise feature recalibration in learning discriminative representations from vibration signals. These results validate our design choices and demonstrate that each component contributes meaningfully to the overall performance, with SENet and MDC being the most critical modules.
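The degradations quoted above are absolute differences in accuracy expressed in percentage points, which can be checked from the reported accuracies. The dictionary keys below are illustrative labels rather than the exact row names of Table 4.

```python
FULL_ACCURACY = 0.9893

# Accuracy after removing one module, as reported in the text.
ablation_accuracy = {
    "w/o Denoise": 0.9812,
    "w/o CBAM": 0.9754,
    "w/o MDC": 0.9631,
    "w/o SENet": 0.9487,
}

# Absolute drop in percentage points relative to the full model.
drops = {name: round((FULL_ACCURACY - acc) * 100, 2)
         for name, acc in ablation_accuracy.items()}

for name, drop in sorted(drops.items(), key=lambda item: item[1]):
    print(f"{name}: -{drop:.2f} points")
```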
To further investigate inter-module interactions and determine whether any modules are redundant, we conduct combined ablation experiments by removing two modules simultaneously and comparing the observed degradation with the sum of individual removals. If the combined drop exceeds the additive sum, the modules exhibit synergy (each enhances the other’s contribution); if it equals the sum, the modules are independent; if it falls below the sum, partial functional overlap exists.
Table 5 presents the results.
All four module pairs exhibit synergistic interactions, with the combined degradation exceeding the additive sum of the individual degradations in every case (Table 5), confirming that each module enhances the effectiveness of its counterpart rather than providing redundant functionality. The strongest synergy appears between SENet and MDC, consistent with the architectural design rationale that channel recalibration before multi-scale extraction pre-emphasizes informative frequency bands, amplifying the discriminative power of subsequent dilated convolutions. The SENet–CBAM pair also shows notable synergy, indicating that channel and spatial attention serve genuinely complementary rather than overlapping roles. The minimal baseline (Conv only) achieves 0.9078, so the four modules collectively contribute an improvement of 8.15 percentage points (from 0.9078 to 0.9893) that cannot be attributed to any single component.
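The synergy criterion used in the combined ablation can be written as a short comparison of the observed pairwise drop against the sum of the two single-module drops. In the sketch below the single-module accuracies come from the ablation results, whereas the pairwise accuracy of 0.9150 is a purely hypothetical placeholder and not a value from Table 5. The function name `interaction` and the tolerance are also assumptions.

```python
def interaction(full, without_a, without_b, without_both, tol=1e-4):
    """Classify a module pair by comparing combined and additive accuracy drops."""
    additive = (full - without_a) + (full - without_b)  # expected if independent
    combined = full - without_both                      # observed joint drop
    excess = combined - additive
    if excess > tol:
        kind = "synergy"        # joint removal hurts more than the sum
    elif excess < -tol:
        kind = "overlap"        # joint removal hurts less than the sum
    else:
        kind = "independent"
    return kind, excess


# SENet and MDC single-removal accuracies from the ablation study;
# 0.9150 is a hypothetical joint-removal accuracy for illustration only.
kind, excess = interaction(full=0.9893, without_a=0.9487,
                           without_b=0.9631, without_both=0.9150)
print(kind, round(excess * 100, 2))
```

A small tolerance is used for the "independent" case because accuracies reported to four decimals rarely match the additive sum exactly.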
4.7. Parameter Sensitivity Analysis
Understanding how key hyperparameters affect model performance is critical for practical deployment and reproducibility. We conduct a comprehensive sensitivity analysis on six key parameters: learning rate, batch size, dilation rates in the MDC module, SENet reduction ratio, CBAM reduction ratio, and kernel size. As illustrated in
Figure 7, each parameter exhibits distinct sensitivity patterns across the four evaluation metrics, providing valuable insights into the robustness and optimization requirements of MDCAD-Net for bearing fault diagnosis applications.
Training hyperparameters demonstrate the most pronounced impact on model performance. Learning rate exhibits the highest sensitivity, with the optimal value (0.001, the setting adopted in Section 4.2) achieving peak performance across all metrics. Excessively large learning rates cause severe training instability, as evidenced by sharply declining performance curves, particularly for recall, which shows the steepest descent. Batch size analysis reveals that a size of 32 provides optimal performance, while smaller batches introduce severe degradation due to excessive gradient noise. Dilation rates in the MDC module prove critical for multi-scale feature extraction, with configuration [1, 2, 4] achieving superior performance by effectively capturing both fine-grained transient impulses and long-range temporal dependencies. Alternative configurations show substantial degradation, demonstrating that proper receptive field design is essential for detecting early-stage bearing faults.
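The effect of the dilation rates on temporal coverage follows from the standard receptive field formula for a single dilated convolution, d(k − 1) + 1 input samples for kernel size k and dilation d. The sketch below assumes one convolution per branch with kernel size 3 (the best-performing size in this analysis); the actual depth of each MDC branch may differ.

```python
def receptive_field(kernel_size, dilation):
    """Input samples spanned by one dilated 1D convolution layer."""
    return dilation * (kernel_size - 1) + 1


SAMPLING_RATE_HZ = 12_000
KERNEL_SIZE = 3        # assumed kernel size for every branch
DILATIONS = [1, 2, 4]  # configuration reported as best

fields = [receptive_field(KERNEL_SIZE, d) for d in DILATIONS]
spans_ms = [1000 * f / SAMPLING_RATE_HZ for f in fields]

for d, f, ms in zip(DILATIONS, fields, spans_ms):
    print(f"dilation {d}: {f} samples, {ms:.3f} ms")
```

Under these assumptions the three branches span 3, 5, and 9 samples without adding parameters, which is the sense in which the parallel branches observe the signal at different temporal scales.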
Architectural parameters exhibit more stable but still substantial sensitivity patterns. SENet and CBAM reduction ratios both achieve optimal performance at ratio 16, with the curves showing symmetric degradation toward larger ratios that create information bottlenecks. CBAM demonstrates higher sensitivity than SENet, suggesting that spatial attention is more critical than channel attention for bearing fault diagnosis. Kernel size analysis presents a monotonic degradation pattern where smaller kernels consistently outperform larger ones, with size 3 capturing appropriate temporal resolution while larger kernels obscure transient fault impulses through excessive smoothing. Across all parameters, recall consistently exhibits the largest performance variations, highlighting that maintaining robust fault detectability is the primary challenge in hyperparameter optimization for diverse bearing operating conditions.
To further investigate whether the module arrangement affects performance, we evaluate five different placement orders while keeping all components unchanged.
Table 6 presents the results.
The proposed order (SENet→MDC→CBAM→Denoise) achieves the best performance. Applying SENet before MDC allows channel recalibration to pre-emphasize fault-relevant frequency bands before multi-scale extraction. Placing CBAM after MDC enables spatial-temporal attention to refine the multi-scale representations. Positioning the denoising block at the end ensures that noise suppression operates on the most refined features. Placing the denoising block first (last row of Table 6) causes the largest degradation, suggesting that early-stage denoising removes potentially useful signal components before the network has learned discriminative features.
4.8. Feature Visualization
To provide intuitive insights into the learned representations, we employ t-distributed Stochastic Neighbor Embedding (t-SNE) to visualize the high-dimensional feature embeddings extracted by MDCAD-Net from the test set. As shown in
Figure 8, the model successfully learns discriminative feature representations that form well-separated clusters in the two-dimensional embedding space, with each cluster corresponding to a specific bearing health condition.
The visualization reveals several key characteristics of the learned feature space. First, the normal bearing state forms a compact and isolated cluster that is clearly separated from all fault conditions, demonstrating the model’s robust capability to distinguish healthy bearings from faulty ones. Second, fault types from different locations (inner race, ball, and outer race) form distinct regional groupings, indicating that MDCAD-Net effectively captures the unique vibration signatures associated with different fault mechanisms. Third, within each fault type, samples with different damage severities (0.007, 0.014, and 0.021 inches) exhibit systematic spatial progression, suggesting that the learned features encode damage severity information in a continuous manner. Notably, the spatial arrangement shows that ball fault samples are positioned between inner race and outer race faults, which aligns with the confusion matrix analysis showing that ball faults are the most challenging to classify. The clear inter-class separability combined with intra-class compactness validates the effectiveness of the multi-scale feature extraction and dual attention mechanisms in learning discriminative representations for bearing fault diagnosis.
To provide quantitative evidence of feature clustering quality beyond visual inspection, we compute three clustering metrics on the learned feature embeddings for each ablation variant.
Table 7 presents the results.
The full MDCAD-Net achieves the highest Silhouette Score (0.8234) and Calinski-Harabasz Index (2845.6) and the lowest Davies-Bouldin Index (0.3567), confirming superior inter-class separability and intra-class compactness. The progressive improvement from the bottom row (w/o SENet) to the full model mirrors the single-module ablation results, with SENet contributing the most to clustering quality by recalibrating channel weights that emphasize fault-discriminative frequency bands.
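For reference, the Silhouette Score averages, over all samples, the quantity (b − a) / max(a, b), where a is the mean distance to the other points of the same cluster and b is the smallest mean distance to the points of any other cluster. The sketch below is a minimal pure-Python version with Euclidean distance and toy 2D points; it is not the implementation used for Table 7, and the feature vectors are hypothetical.

```python
import math
from collections import defaultdict


def silhouette_score(points, labels):
    """Mean silhouette coefficient with Euclidean distance."""
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")
    total = 0.0
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # singleton clusters contribute a coefficient of 0
        # Mean distance to the other members of the same cluster.
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # Smallest mean distance to the members of any other cluster.
        b = min(sum(math.dist(point, q) for q in other) / len(other)
                for key, other in clusters.items() if key != label)
        if max(a, b) > 0.0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Toy embeddings: two compact, well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
separated = silhouette_score(points, [0, 0, 0, 1, 1, 1])
mixed = silhouette_score(points, [0, 1, 0, 1, 0, 1])
print(round(separated, 3), round(mixed, 3))
```

Values close to 1 indicate compact, well-separated clusters, while values near or below 0 indicate overlapping classes, as the deliberately mixed labeling shows.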
Beyond quantitative clustering quality, the spatial structure of the t-SNE feature space is consistent with known bearing vibration mechanisms. Outer race fault clusters are tightly grouped and clearly separated from other fault types, which aligns with the physical mechanism that outer race defects produce stationary-source impacts at a fixed angular position relative to the sensor, generating highly repeatable impulse patterns. Inner race fault clusters exhibit slightly larger intra-class spread, reflecting the amplitude modulation at shaft frequency caused by the rotating defect location. Ball fault clusters show the greatest dispersion and are positioned between inner and outer race clusters in the embedding space, consistent with the physical mechanism that rolling element defects produce impacts whose amplitude and frequency depend on the instantaneous ball position within the load zone and the varying contact angle during rotation. This correspondence between the learned feature space geometry and known vibration mechanics provides evidence that MDCAD-Net captures physically meaningful fault characteristics rather than merely exploiting superficial statistical correlations in the training data.
4.9. Noise Robustness Comparison
To evaluate the denoising block from a signal processing perspective, we compare MDCAD-Net against classical denoising preprocessing methods under various signal-to-noise ratio (SNR) conditions. Gaussian white noise is added to the test signals at the SNR levels listed in Table 8.
Table 8 presents the classification accuracy of MDCAD-Net compared with three classical denoising methods applied as preprocessing before the same CNN backbone (MDCAD-Net without the denoising block): wavelet thresholding (Daubechies-4, soft thresholding), empirical mode decomposition (EMD, first 5 IMFs reconstructed), and spectral filtering (4th-order Butterworth, cutoff at 5 kHz).
MDCAD-Net maintains classification accuracy above 87% even under heavy noise (Table 8), outperforming the best classical method (CNN+Wavelet) by 5.22 percentage points. The advantage grows as the noise becomes more severe: in the same condition, MDCAD-Net surpasses the no-denoising baseline by 9.22 points, whereas wavelet preprocessing provides only a 4.00-point improvement, and the 5.22-point margin is the difference between these two gains. This indicates that the learnable denoising block provides genuine noise suppression by jointly optimizing noise reduction and fault classification in an end-to-end manner, preserving fault-discriminative features that signal-agnostic preprocessing methods inadvertently remove.
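The noise injection used in this experiment can be sketched as follows: for a target SNR in dB, the noise power is set to the signal power divided by 10^(SNR/10). The function names, the synthetic sine signal, and the example SNR values are hypothetical. Rescaling the drawn noise so that the realized SNR matches the target exactly is a design choice of this sketch, not a detail given in the text.

```python
import math
import random


def add_awgn(signal, snr_db, seed=0):
    """Add white Gaussian noise so that the realized SNR equals snr_db."""
    rng = random.Random(seed)
    signal_power = sum(x * x for x in signal) / len(signal)
    target_noise_power = signal_power / (10 ** (snr_db / 10))
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    drawn_power = sum(n * n for n in noise) / len(noise)
    # Rescale the drawn noise to hit the target power exactly (assumption).
    scale = math.sqrt(target_noise_power / drawn_power)
    return [x + scale * n for x, n in zip(signal, noise)]


def measured_snr_db(clean, noisy):
    """SNR in dB computed from a clean signal and its noisy version."""
    signal_energy = sum(x * x for x in clean)
    noise_energy = sum((y - x) ** 2 for x, y in zip(clean, noisy))
    return 10 * math.log10(signal_energy / noise_energy)


# Hypothetical clean sample: a 1024-point sine wave.
clean = [math.sin(0.05 * i) for i in range(1024)]
for snr in (10.0, 0.0):
    noisy = add_awgn(clean, snr)
    print(snr, round(measured_snr_db(clean, noisy), 6))
```

At 0 dB the noise carries as much power as the signal itself, and negative SNR values correspond to noise that is stronger than the signal.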
4.10. Cross-Load Validation
To evaluate generalization under different operating conditions, we conduct cross-load validation experiments using the CWRU dataset collected under four motor loads: 0 HP (1797 RPM), 1 HP (1772 RPM), 2 HP (1750 RPM), and 3 HP (1730 RPM). In each experiment, the model is trained exclusively on data from one load condition and tested on a different load condition, with no overlap between training and testing data.
Table 9 presents the results for MDCAD-Net and three representative baselines.
MDCAD-Net achieves an average cross-load accuracy of 95.62%, outperforming ResNet-18 by 1.91 percentage points, MA1DCNN by 4.15 points, and WenCNN by 5.69 points. The performance advantage over baselines is more pronounced in cross-load scenarios than in standard evaluation (
Table 3), indicating that the multi-scale feature extraction and adaptive denoising mechanisms contribute to learning load-invariant fault representations that generalize across varying operating conditions. MDCAD-Net maintains accuracy above 94% across all six transfer scenarios, demonstrating robust cross-condition generalization.
5. Conclusions
This paper presents MDCAD-Net, a multi-dilated convolution attention denoising network that integrates multi-scale temporal feature extraction, dual attention mechanisms, and adaptive denoising within a unified end-to-end framework for bearing fault diagnosis. On the CWRU benchmark, MDCAD-Net achieves 98.93% accuracy, outperforming eight baseline models. Combined ablation studies indicate inter-module synergy among all four components. Cross-load validation yields an average accuracy of 95.62%, surpassing the best baseline by 1.91 percentage points, while noise robustness experiments demonstrate that the learnable denoising block maintains accuracy above 87% under heavy injected noise, outperforming classical denoising methods across all tested conditions. Quantitative clustering metrics further confirm that the learned representations are well-separated and consistent with known bearing vibration mechanisms.
The results demonstrate the effectiveness of the proposed framework under both standard and simulated noisy conditions. Although the current evaluation relies on artificially injected Gaussian noise rather than real industrial measurements, the consistent performance advantage across a wide range of SNR levels suggests promising potential for practical deployment. Future research will focus on extending MDCAD-Net to other rotating machinery diagnosis tasks, investigating transfer learning strategies for limited-data scenarios, incorporating multi-modal sensor information, and validating the approach on industrial datasets with authentic measurement noise.