MDCAD-Net: A Multi-Dilated Convolution Attention Denoising Network for Bearing Fault Diagnosis

Duan, Ran; Yan, Ruopeng; Jin, Guangyin

doi:10.3390/vibration9020030


by Ran Duan ¹, Ruopeng Yan ¹,* and Guangyin Jin ²

¹ Changjiang Institute of Survey, Planning, Design and Research Corporation, Wuhan 430010, China

² School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China

* Author to whom correspondence should be addressed.

Vibration 2026, 9(2), 30; https://doi.org/10.3390/vibration9020030

Submission received: 3 March 2026 / Revised: 27 March 2026 / Accepted: 21 April 2026 / Published: 24 April 2026


Abstract

Bearing fault diagnosis is an important task for condition monitoring and predictive maintenance of rotating machinery. Nevertheless, many existing deep learning-based methods have difficulty in jointly modeling multi-scale fault characteristics, adaptively highlighting informative features, and maintaining robustness under noisy measurement conditions. To address these issues, this study presents MDCAD-Net, a multi-dilated convolution attention denoising network that integrates multi-scale temporal feature extraction, attention-based feature refinement, and explicit noise suppression within an end-to-end learning framework. Parallel dilated convolutions with different dilation rates are employed to capture short-duration transient impulses as well as long-range periodic patterns in vibration signals. Channel-wise feature recalibration using squeeze-and-excitation networks and spatial-temporal attention via a convolutional block attention module are combined to enhance informative representations. In addition, a denoising block with gated attention and residual connections is introduced to reduce noise interference while retaining fault-related signal components. Experiments conducted on the Case Western Reserve University bearing dataset show that the proposed method achieves a classification accuracy of 98.93% and yields competitive performance compared with several commonly used deep learning models. Ablation studies and feature visualization results further illustrate the contributions of the individual components and the separability of the learned feature representations under noisy conditions. The results indicate the potential of the proposed framework for practical bearing fault diagnosis under noisy operating conditions.

Keywords: bearing fault diagnosis; multi-dilated convolution; attention mechanism; denoising network; deep learning; vibration signal analysis

1. Introduction

Rolling bearings are fundamental mechanical components in rotating machinery systems, widely deployed across critical industrial applications including wind turbines, aircraft engines, railway vehicles, marine propulsion systems, and manufacturing equipment. As the primary load-bearing and motion-transmission elements, bearings operate under harsh conditions involving high-speed rotation, heavy loads, and severe environmental stress, making them among the most failure-prone components in mechanical systems. Bearing faults account for approximately 30% of all machinery failures, and undetected bearing degradation can lead to catastrophic equipment breakdowns, unplanned production downtime, substantial economic losses, and severe safety hazards [1,2,3]. Consequently, accurate and timely bearing fault diagnosis has become a critical enabler for predictive maintenance strategies, condition-based monitoring systems, and intelligent manufacturing paradigms, ensuring operational reliability, minimizing maintenance costs, and preventing potential accidents in industrial settings.

Deep learning has revolutionized bearing fault diagnosis by enabling end-to-end automatic feature learning from raw vibration signals, eliminating the need for manual feature engineering that relies heavily on domain expertise. Convolutional neural networks have demonstrated remarkable capabilities in extracting hierarchical spatial representations from time-domain, frequency-domain, and time-frequency domain signals [4,5,6]. Recurrent architectures, including LSTM and GRU, effectively capture temporal dependencies in sequential vibration data, while hybrid CNN-RNN models leverage complementary strengths of spatial feature extraction and temporal modeling [1,7,8]. Transfer learning and domain adaptation techniques address cross-domain diagnosis challenges by transferring knowledge across different operating conditions, equipment types, and fault scenarios [9,10,11]. Despite these advances, existing deep learning methods face critical limitations. First, most architectures lack systematic integration of multi-scale feature extraction mechanisms to simultaneously capture both fine-grained transient fault impulses and long-range temporal patterns with varying characteristic frequencies. Second, attention mechanisms are typically applied globally without dedicated architectural components for adaptive channel and spatial feature emphasis tailored to bearing fault characteristics. Third, industrial bearing vibration signals are severely contaminated by measurement noise, background vibrations, and interference from adjacent rotating components, yet most methods lack explicit denoising modules designed to suppress noise while preserving fault-relevant transient information.

Multi-scale feature extraction is essential for comprehensive bearing fault diagnosis because fault signatures manifest across different temporal resolutions and frequency scales. Early-stage bearing defects generate weak impulsive transients with short durations requiring fine-scale receptive fields, while advanced degradation produces periodic impact patterns with longer intervals demanding large-scale temporal modeling. Existing approaches address multi-scale learning through signal decomposition preprocessing [7,12] or multi-branch parallel architectures [13], but these methods either operate independently from deep feature learning or lack adaptive mechanisms to dynamically adjust receptive fields based on input characteristics. Multi-dilated convolutions offer an elegant solution by employing parallel convolutional layers with different dilation rates to simultaneously capture multi-scale temporal patterns within a unified architecture, enabling the network to attend to both local transient impulses and global periodic structures. However, existing multi-scale convolutional designs rarely incorporate systematic dilated convolutions specifically optimized for bearing vibration signal characteristics and lack integration with attention mechanisms for selective feature emphasis.
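As a rough illustration of why different dilation rates correspond to different temporal scales, the sketch below computes the span covered by a single dilated kernel, d·(k−1)+1 samples, and applies parallel branches to a toy impulse. The kernel size of 3, the dilation rates (1, 2, 4), the all-ones kernel, and the function names are assumptions chosen for illustration; they are not the settings reported for MDCAD-Net.

```python
def receptive_field(kernel_size, dilation):
    """Number of input samples spanned by one dilated 1D kernel."""
    return dilation * (kernel_size - 1) + 1


def dilated_conv1d(x, w, dilation):
    """'Same'-padded 1D dilated cross-correlation of list x with kernel w."""
    k = len(w)
    pad = dilation * (k - 1) // 2  # symmetric zero padding (odd k assumed)
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [
        sum(w[i] * padded[t + i * dilation] for i in range(k))
        for t in range(len(x))
    ]


def multi_dilated_branches(x, w, dilations=(1, 2, 4)):
    """Apply the same kernel at several dilation rates, one output per branch."""
    return {d: dilated_conv1d(x, w, d) for d in dilations}


# Toy signal: a single impulse in the middle of 9 samples.
signal = [0.0] * 4 + [1.0] + [0.0] * 4
branches = multi_dilated_branches(signal, w=[1.0, 1.0, 1.0])
for d, out in branches.items():
    print(d, receptive_field(3, d), out)
```

With the same three-tap kernel, the branch with dilation 1 responds only to immediate neighbours of the impulse, while the branch with dilation 4 responds four samples away on either side, which is the sense in which larger dilation rates reach longer-range structure without adding parameters.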

Attention mechanisms have emerged as powerful tools to emphasize discriminative features and suppress irrelevant information in bearing fault diagnosis. Channel attention, exemplified by Squeeze-and-Excitation Networks, adaptively recalibrates feature channel weights to highlight frequency bands containing fault-relevant information while attenuating noise-dominated channels [6,14]. Spatial attention focuses on informative temporal regions where fault impulses concentrate, reducing the influence of steady-state vibrations and random noise [13]. Convolutional Block Attention Module synergistically combines channel and spatial attention for complementary feature emphasis. Despite their proven effectiveness, most attention-based bearing diagnosis methods apply these mechanisms globally across entire feature maps without considering the unique characteristics of industrial noise interference. Furthermore, limited research exists on systematically integrating dual attention mechanisms (channel and spatial) with multi-scale feature extraction and dedicated denoising modules within a unified architectural framework specifically designed for noisy bearing vibration signal analysis.
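The squeeze, excitation, and scale steps of channel attention can be sketched in plain Python as follows. This is a minimal illustration of the general squeeze-and-excitation idea, not the configuration used in MDCAD-Net: the function name, the omission of bias terms, and the toy weights `w1` and `w2` are assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def se_channel_attention(features, w1, w2):
    """Squeeze-and-excitation style recalibration for 1D feature maps.

    features: list of C channels, each a list of T samples.
    w1: reduction weights, shape (C_reduced, C).
    w2: expansion weights, shape (C, C_reduced).
    """
    # Squeeze: global average pooling over time, one scalar per channel.
    squeezed = [sum(ch) / len(ch) for ch in features]
    # Excitation: bottleneck with ReLU, then expansion with sigmoid gates.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w2]
    # Scale: multiply every sample of a channel by that channel's gate.
    scaled = [[g * v for v in ch] for g, ch in zip(gates, features)]
    return scaled, gates


# Toy example with two channels and hand-picked (hypothetical) weights.
features = [[2.0, 2.0, 2.0, 2.0], [0.0, 1.0, 0.0, 1.0]]
w1 = [[1.0, -1.0]]    # 2 channels -> 1 hidden unit
w2 = [[2.0], [-2.0]]  # 1 hidden unit -> 2 channel gates
scaled, gates = se_channel_attention(features, w1, w2)
print(gates)
```

In this toy setting the first channel receives a gate close to 1 and the second a gate close to 0, so the second channel is attenuated; in a trained network the weights would be learned rather than fixed by hand.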

Industrial bearing fault diagnosis faces severe challenges from measurement noise and environmental interference that significantly degrade diagnostic accuracy in real-world applications. Vibration sensors inevitably capture background noise from adjacent machinery, electromagnetic interference, sensor mounting vibrations, and measurement quantization errors that mask weak early-stage fault signatures [3,15,16]. Most existing denoising approaches operate as preprocessing steps through signal filtering, wavelet denoising, or variational mode decomposition, which risk removing fault-relevant transient impulses along with noise. Alternative methods incorporate denoising objectives as regularization terms during training but lack explicit architectural components dedicated to noise suppression. Recent work has begun exploring learnable denoising modules integrated within deep networks [15], yet these approaches typically employ simple filtering operations without gated attention mechanisms to selectively preserve fault features while suppressing noise and lack residual connections to prevent information loss during denoising operations.

To address these fundamental limitations and systematically integrate multi-scale feature extraction, dual attention mechanisms, and adaptive denoising within a unified framework, we propose MDCAD-Net (Multi-Dilated Convolution Attention Denoising Network), a novel deep learning architecture specifically designed for robust bearing fault diagnosis under noisy industrial conditions. MDCAD-Net introduces three key architectural innovations. First, a multi-dilated convolution module employs parallel dilated convolutional layers with complementary dilation rates to simultaneously extract fault features across multiple temporal scales, capturing both short-duration transient impulses from early-stage defects and long-range periodic patterns from advanced degradation. Second, dual attention mechanisms synergistically combine Squeeze-and-Excitation Networks for channel-wise feature recalibration and Convolutional Block Attention Module for spatial-temporal feature emphasis, enabling the network to adaptively focus on discriminative frequency bands and informative temporal regions containing fault signatures. Third, a dedicated denoising block incorporates gated attention mechanisms and residual connections to explicitly suppress measurement noise and background interference while preserving fault-relevant transient impulses through selective information propagation, ensuring robust feature learning under severe noise contamination without requiring preprocessing or manual signal filtering. 
Existing architectures tend to address these challenges in isolation: CNN-LSTM models lack explicit multi-scale receptive fields and noise suppression, ResNet-based models employ fixed-scale convolutions without adaptive feature emphasis, and attention-based CNNs apply global attention without dedicated denoising modules. In contrast, MDCAD-Net is designed so that each module enhances the effectiveness of subsequent modules: channel recalibration (SENet) pre-emphasizes fault-relevant frequency bands before multi-scale extraction (MDC), spatial-temporal attention (CBAM) refines the multi-scale representations, and the denoising block operates on the most refined features to suppress residual noise. As demonstrated by the combined ablation experiments in Section 4.6, this sequential synergy produces performance gains that exceed the sum of individual module contributions, indicating that the architectural integration provides diagnostic capability that none of the individual components achieves alone.

The main contributions of this work are:

We propose MDCAD-Net, a novel multi-dilated convolution attention denoising network that systematically integrates multi-scale temporal feature extraction, dual attention mechanisms, and adaptive denoising within a unified end-to-end learnable architecture specifically designed for robust bearing fault diagnosis under noisy industrial conditions, addressing the critical limitations of existing methods that treat these components in isolation.
We design a multi-dilated convolution module with parallel dilated convolutional layers employing complementary dilation rates to simultaneously capture fault signatures across multiple temporal scales, enabling comprehensive feature extraction of both short-duration transient impulses from early-stage defects and long-range periodic patterns from advanced bearing degradation, surpassing conventional multi-scale approaches that rely on signal decomposition preprocessing or fixed receptive field designs.
We develop a synergistic dual attention framework combining Squeeze-and-Excitation Networks for adaptive channel-wise feature recalibration and Convolutional Block Attention Module for spatial-temporal feature emphasis, enabling the network to selectively focus on discriminative frequency bands and informative temporal regions containing fault-relevant information while suppressing noise-dominated channels and steady-state vibration segments, with comprehensive ablation studies validating the complementary benefits of channel and spatial attention for bearing fault diagnosis.
We introduce a dedicated denoising block incorporating gated attention mechanisms and residual connections to explicitly suppress measurement noise and background interference during feature learning, achieving robust fault diagnosis under severe noise contamination without requiring preprocessing or manual signal filtering, and conduct comprehensive experiments on the CWRU bearing dataset demonstrating that MDCAD-Net achieves 98.93% accuracy, outperforming eight competitive baseline models including ResNet-18, Inception-v3, VGG-16, and Transformer-based architectures.

The remainder of this paper is organized as follows. Section 2 reviews related work on deep learning approaches, multi-scale feature extraction, and attention mechanisms for bearing fault diagnosis. Section 3 presents the MDCAD-Net methodology with detailed descriptions of the multi-dilated convolution module, dual attention mechanisms, and denoising block architecture. Section 4 provides comprehensive experimental validation on the CWRU dataset, including performance comparisons with baseline models, ablation studies, and sensitivity analysis. Section 5 concludes with key findings and future research directions.

2. Related Work

Bearing fault diagnosis has been a critical research area in predictive maintenance and condition monitoring. Recent advances in deep learning have significantly improved diagnostic accuracy, while multi-scale feature extraction and attention mechanisms have emerged as key enablers for handling complex vibration signals. This section reviews existing methods across three key aspects aligned with the MDCAD-Net framework: deep learning approaches, multi-scale feature extraction techniques, and attention mechanisms with denoising strategies.

2.1. Deep Learning Methods for Bearing Fault Diagnosis

Deep neural networks have revolutionized bearing fault diagnosis by enabling automatic feature learning from raw vibration signals, and existing architectures can be broadly categorized into CNN-based, recurrent, and hybrid approaches.

Among CNN-based methods, Chen et al. [4] proposed a recurrence plot fusion neural network that converts vibration signals into color images and integrates bidirectional GRUs with multi-head attention for bearing fault diagnosis under noisy conditions, achieving 92% accuracy on noise-contaminated data. Ma et al. [5] designed a CNN for bearing diagnosis using weak magnetic signals with optimized input size and kernel configurations, achieving over 99% accuracy on the CWRU dataset. To reduce computational overhead for edge deployment, Luu and Huynh [6] developed a depthwise separable multi-scale CNN combined with CBAM and spatial pyramid pooling for lightweight bearing diagnosis, demonstrating over 99% accuracy with significantly fewer parameters on both CWRU and HUST datasets. Similarly, Hu et al. [17] introduced multiscale depthwise separable convolution with network pruning for bearing diagnosis under variable operating conditions, reducing model size by over 90% while maintaining accuracy above 98%.

Recurrent architectures capture temporal dependencies that CNNs may overlook. Wei and Yuan [7] proposed a VMD-DCA-BiGRU method combining variational mode decomposition with dual-channel convolutional attention and bidirectional GRUs for bearing diagnosis under variable operating conditions, demonstrating improved robustness across multiple load settings on the CWRU dataset. Wu et al. [8] introduced a physics-informed attention LSTM framework that embeds bearing dynamic mechanisms into attention-equipped networks for small-sample fault diagnosis, achieving 99.20–99.89% accuracy with enhanced interpretability.

Hybrid architectures that integrate CNNs with Transformers have emerged to combine spatial feature extraction with self-attention capabilities. Chen et al. [18] proposed a multi-feature fusion dual-channel CNN-Transformer-CAM framework that cross-fuses GADF and GST feature images for gearbox bearing diagnosis under noise interference, achieving over 98% accuracy across multiple operating states. Shi et al. [19] developed a digital twin-based multi-scale CNN-attention-BiGRU approach for bearing fault identification through virtual-physical data fusion, achieving 99.5% accuracy on simulated-real hybrid datasets.

Transfer learning and domain adaptation address the practical challenges of data scarcity and distribution shift that limit supervised approaches. Domain adaptation methods aim to align feature distributions across different operating conditions or machines. Su et al. [9] proposed a physics-informed blind wavelet deconvolution transfer network for cross-machine bearing diagnosis, embedding physical priors into wavelet filters to enhance interpretability and achieving over 97% cross-domain accuracy. Deng et al. [10] introduced a dual-space multi-node collaborative transfer learning method with dynamic MK-MMD and attention LMMD loss functions for multi-source bearing fault diagnosis, effectively aligning heterogeneous source domains and achieving over 95% accuracy on target tasks. Chen et al. [20] proposed a knowledge-sharing multi-task model for bearing diagnosis that automatically transfers useful features from high sampling frequency data to low sampling frequency tasks, achieving 96.8% accuracy for resolution-degraded signals.

Few-shot and zero-shot methods reduce the dependence on large labeled datasets. Huang et al. [11] developed an inter-domain similarity-guided meta-learning framework for cross-scenario few-shot bearing diagnosis, achieving over 95% accuracy with only five labeled samples per class. Xia et al. [21] proposed a cross-scale hybrid contrast network for generalized zero-shot fault diagnosis, generating features for inaccessible fault classes through fault description guidance and achieving over 85% accuracy on unseen categories. Lei et al. [22] introduced a deep transfer learning approach based on joint generalized sliced Wasserstein distances for semi-supervised bearing fault diagnosis with small samples, achieving 97.56% accuracy through dynamic domain alignment.

Multi-task learning and collaborative frameworks extend diagnostic capability beyond single-task classification. Zhang et al. [23] proposed an integrated multitasking scheme based on representation learning for bearing fault detection, classification, and unknown fault identification under imbalanced sample conditions, effectively handling class imbalance through modified denoising autoencoders. Xu et al. [24] developed a federated open-set fault diagnosis method enabling collaborative model training across distributed clients without centralized data sharing, maintaining over 90% accuracy while preserving data privacy. Jia et al. [25] introduced simulation-reality domain mixup adaptation for bearing fault diagnosis by leveraging bearing simulation models to generate synthetic training data, bridging the sim-to-real gap with over 93% transfer accuracy.

Generative adversarial networks have been applied to address imbalanced and limited-sample challenges. Liu et al. [26] introduced a condition multidomain GAN framework for bearing diagnosis under limited raw samples, fusing two-domain information to capture sample distributions and improving classification accuracy by over 5% compared to non-augmented baselines. Luo et al. [27] proposed a Wasserstein GAN meta-learning algorithm for early motor bearing fault diagnosis, achieving feature generation through adversarial learning and demonstrating 96.5% accuracy for incipient faults. Wu et al. [28] developed a GA-optimized wavelet transform combined with Wasserstein conditional GAN for shearer rocker arm bearing diagnosis, achieving 98.2% accuracy under small-sample conditions through targeted data augmentation. Despite the effectiveness of these deep learning methods, most require extensive labeled data and substantial computational resources while lacking explicit mechanisms to handle industrial noise and extract multi-scale fault features simultaneously.

2.2. Multi-Scale Feature Extraction and Signal Processing

Multi-scale feature learning captures fault signatures across different temporal and frequency resolutions, which is critical for detecting transient fault impulses whose characteristics vary with fault type and severity. Among the available approaches, time-frequency decomposition methods and multi-scale convolutional architectures represent two complementary strategies.

Time-frequency decomposition methods provide adaptive multi-resolution representations of non-stationary bearing signals. Variational mode decomposition (VMD) and its adaptive variants have been particularly effective for separating multi-component signals. Liu et al. [12] developed an AOVMD-ScaleShift-Net framework that uses an enhanced crested porcupine optimizer to co-optimize spectral kurtosis and time-frequency entropy for bearing fault diagnosis, achieving 99.52% accuracy on the CWRU dataset. Liu et al. [29] proposed a hybrid denoising model integrating improved whale optimization-based VMD with dataset-specific wavelet thresholding for rolling bearing early fault signal preprocessing, achieving RMSE as low as 0.00013–0.00041 and NCC of 0.9689–0.9798. Shi et al. [30] proposed a generalized envelope nonlinear Gini index–gram guided two-stage chirp mode decomposition for shield machine main bearing diagnosis, effectively separating fault impulses from strong noise interference and achieving over 95% identification accuracy. Guo et al. [31] introduced an improved red-billed blue magpie optimizer to enhance maximum second-order cyclostationary blind deconvolution for bearing fault diagnosis in metro train bogies, autonomously optimizing filter length and cyclic frequency parameters to achieve 100% classification accuracy.

Multi-scale convolutional architectures extract features at different receptive field scales through learned filters. Chen et al. [13] developed a time-frequency-aware feature disentanglement framework with dynamic convolutions for bearing diagnosis under variable speed conditions, adaptively adjusting kernel responses to non-stationary features and achieving over 97% accuracy across multiple speed settings. Du et al. [1] proposed a PSR-CNN-DLSTM model integrating phase space reconstruction with deep LSTM for bearing fault diagnosis, effectively capturing nonlinear dynamics through multi-scale temporal modeling and achieving 98.6% accuracy. Dynamic modeling approaches provide complementary physical insights: E et al. [32] established a lumped parameter dynamic model for cycloid gear bearings in rotate vector reducers, revealing fault vibration characteristics through time-varying mesh stiffness and Hertz contact force analysis, which informs feature selection for data-driven methods.

Beyond decomposition-based approaches, advanced feature extraction techniques characterize fault patterns through mathematical transforms and complexity measures. Envelope spectrum analysis and resonance demodulation are widely used to extract fault characteristic frequencies from modulated vibration signals. Chen et al. [33] developed a clustering weighted envelope spectrum method for compound bearing fault diagnosis that automatically identifies potential fault frequencies without prior knowledge, achieving over 95% identification accuracy on signals with multiple co-existing faults. Fu et al. [34] proposed resonance demodulation based on dynamic bearing models with improved empirical Fourier decomposition for bearing fault detection, effectively isolating fault-induced resonance bands and improving diagnostic signal-to-noise ratio by over 10 dB. Cao et al. [35] introduced sparse Bayesian learning with a categorical probabilistic model for compound bearing fault diagnosis, reducing energy leakage effects and achieving accurate separation of co-existing inner race and outer race faults. Zhang et al. [36] proposed a time-domain sparsity-based method using pulse signal-to-noise ratio for fast bearing fault diagnosis, enabling real-time detection by reducing computational time by 80% compared to frequency-domain approaches while maintaining over 96% accuracy.

Entropy-based measures quantify signal complexity to distinguish fault-induced patterns from healthy-state vibrations. Li and Ding [37] introduced geometric entropy capturing phase space geometric properties for bearing fault classification, achieving 96–97% accuracy with superior robustness to noise compared to conventional entropy measures. Liu et al. [38] proposed quadratic Manhattan entropy combined with random forest classifiers for weak-fault bearing diagnosis under strong background interference, demonstrating a 4–8% accuracy improvement over traditional entropy features. Liu et al. [39] presented analytical vibration signal models for rolling element bearings by deriving closed-form spectral equations for different defect types, providing a theoretical foundation for feature selection in bearing diagnosis. However, traditional signal processing methods typically require manual parameter tuning and domain expertise, while most learning-based approaches extract features at a single scale or a few fixed scales, without systematically integrating features across multiple dilation rates with adaptive receptive fields for comprehensive temporal pattern modeling.

2.3. Attention Mechanisms and Noise Robustness

Attention mechanisms have demonstrated significant advantages in emphasizing fault-relevant features while suppressing background interference and can be categorized by the dimension along which attention is applied: channel, spatial, or multi-modal.

Channel attention adaptively recalibrates feature channel weights to highlight discriminative frequency bands. Luu and Huynh [6] integrated CBAM with multi-scale depthwise separable convolutions for bearing diagnosis across mechanical domains, achieving fault-frequency invariant feature extraction and over 99% accuracy on both CWRU and HUST datasets. Cheng et al. [14] proposed symmetric positive definite manifold deep metric learning with channel-wise feature selection through denoising autoencoders for bearing fault classification, demonstrating improved inter-class separability and achieving over 97% accuracy under domain shift conditions. Spatial and temporal attention mechanisms focus on informative regions within feature maps that contain fault impulses. Chen et al. [4] combined CNNs with multi-head attention and bidirectional GRUs for bearing diagnosis from recurrence plot representations, achieving 92% accuracy by selectively attending to temporally salient fault patterns in noisy signals.

Multi-modal attention and knowledge-driven approaches extend beyond single-signal analysis to exploit complementary information sources. Peng et al. [40] developed a multimodal knowledge graph construction method for bearing fault diagnosis that integrates time series vibration signals, spectrum data, and textual descriptions with a relation cascade graph attention network, demonstrating robust performance with over 96% accuracy across seven bearing datasets. Wang et al. [41] proposed a multi-electrical signal analysis method for permanent magnet synchronous machine bearing fault diagnosis, using variational mode decomposition to extract fault harmonic components from electrical signals under various controller bandwidths, achieving over 94% accuracy without requiring additional vibration sensors. While dual attention mechanisms combining channel and spatial attention provide complementary benefits, most existing methods apply attention globally without explicit architectural components dedicated to noise suppression.

Noise robustness is critical for industrial bearing fault diagnosis, where measurement interference and background vibrations severely degrade diagnostic accuracy. Existing denoising strategies can be broadly divided into signal-level preprocessing approaches and network-integrated methods.

Signal-level denoising methods suppress noise before or during feature extraction. Zou et al. [15] proposed a multimodal-enhanced hybrid denoising network for marine gearbox diagnosis under extreme noise, combining cross-dimensional fusion with adaptive time-frequency masking and achieving over 93% accuracy at SNR levels as low as −4 dB. Liu et al. [16] introduced multikernel correntropy transfer robust dictionary learning for bearing diagnosis under non-Gaussian noise and outlier contamination, minimizing outlier impacts through correntropy-based optimization and achieving over 95% accuracy in heavy-tailed noise environments. Sheng et al. [3] developed an adaptive filtering network combining frequency-domain Butterworth filters with deep neural networks for cross-domain bearing diagnosis under noise interference, achieving 96.3% accuracy through learned filter parameter optimization. Liu et al. [42] constructed wavelet phase space reconstruction for online diagnosis of railway bogie bearing unbalance impact faults, combining oversampling techniques with wavelet analysis to achieve real-time fault detection under operational vibration interference.
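For reference, a signal-to-noise ratio quoted in decibels can be converted to a power ratio using the standard definition of SNR. The symbols $P_{\text{signal}}$ and $P_{\text{noise}}$ are introduced here only for illustration:

```latex
\mathrm{SNR}_{\mathrm{dB}} = 10 \log_{10} \frac{P_{\text{signal}}}{P_{\text{noise}}}
\quad\Longrightarrow\quad
\frac{P_{\text{noise}}}{P_{\text{signal}}} = 10^{-\mathrm{SNR}_{\mathrm{dB}}/10} = 10^{0.4} \approx 2.51
\quad \text{at } \mathrm{SNR}_{\mathrm{dB}} = -4 .
```

In other words, at an SNR of −4 dB the noise carries roughly two and a half times the power of the signal.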

Self-powered monitoring systems represent an emerging direction for vibration measurement under resource-constrained industrial conditions. Jiang et al. [43] developed a weak vibration energy-powered acceleration monitoring system using triboelectric nanogenerator arrays for self-powered bearing fault diagnosis, achieving measurement sensitivity of 0.38 V/(m/s²) without external power. Dong et al. [44] designed an embedded triboelectric bearing integrating sensors directly into conventional bearing structures for condition monitoring, demonstrating continuous fault detection capability through triboelectrification-based signal generation. However, existing denoising methods typically operate as preprocessing steps or regularization terms, lacking dedicated architectural modules that integrate noise suppression with feature learning through gated attention mechanisms and residual connections.

Despite these advances, existing methods typically address multi-scale extraction, attention mechanisms, and denoising in isolation rather than through unified architectural design. Most approaches lack systematic integration of these components or fail to provide explicit denoising blocks specifically designed for industrial bearing vibration signals. Moreover, limited analysis exists on how different attention types contribute synergistically to fault diagnosis performance. Our MDCAD-Net addresses these limitations by proposing a unified architecture that systematically combines multi-dilated convolutions for adaptive multi-scale temporal feature extraction, dual attention mechanisms (SENet for channel recalibration and CBAM for spatial-temporal attention) for complementary feature emphasis, and a dedicated denoising block with gated attention and residual connections specifically designed for suppressing measurement noise while preserving fault-relevant transient impulses through selective information propagation.

3. Methodology

This section presents the technical details of MDCAD-Net (Multi-Dilated Convolution Attention Denoising Network), the proposed deep learning framework for bearing fault diagnosis. We first provide an overview of the overall architecture, followed by detailed descriptions of each key module.

3.1. Network Architecture

MDCAD-Net is designed to effectively extract discriminative features from raw vibration signals for accurate bearing fault diagnosis under noisy conditions. As illustrated in Figure 1, the network comprises five key modules sequentially connected to progressively refine feature representations. The initial feature extraction module employs a standard convolutional block to extract preliminary features. Subsequently, squeeze-and-excitation networks (SENet) adaptively recalibrate channel-wise feature responses by explicitly modeling interdependencies between channels. Multi-dilated convolutions (MDC) capture multi-scale temporal patterns through parallel dilated convolutions with different kernel sizes. The convolutional block attention module (CBAM) sequentially applies channel and spatial attention to emphasize discriminative regions. Finally, a denoising block with a gated attention mechanism suppresses noise and enhances fault-relevant features before classification.

3.2. Feature Extraction with Channel Attention

Given the input vibration signal

X \in R^{1 \times L}

, where L denotes the signal length, the initial feature extraction module applies a standard convolutional block to obtain preliminary feature representations. The 1D convolution operation can be formulated as:

Y_{c} (t) = \sum_{i = 0}^{k_{1} - 1} W_{c} (i) \cdot X (t + i) + b_{c},

(1)

where

Y_{c} (t)

represents the output at position t for the c-th channel,

W_{c} \in R^{k_{1}}

is the convolutional kernel for channel c,

k_{1}

is the kernel size, and

b_{c}

is the bias term. Following convolution, the standard batch normalization technique standardizes the feature maps:

{\hat{Y}}_{c} = γ \cdot \frac{Y_{c} - μ_{B}}{\sqrt{σ_{B}^{2} + ϵ}} + β,

(2)

where

μ_{B}

and

σ_{B}^{2}

are the mean and variance computed over the mini-batch

B

,

γ

and

β

are learnable parameters, and

ϵ

is a small constant for numerical stability. The complete feature extraction process is expressed as:

H_{1} = MaxPool (ReLU (\hat{Y})),

(3)

where

ReLU (x) = max (0, x)

introduces non-linearity, and

MaxPool (\cdot)

reduces the temporal dimension while preserving salient features. The output

H_{1} \in R^{C_{1} \times L_{1}}

, where

L_{1} = ⌊ \frac{L - k_{1} + 1}{s} ⌋

with stride s, serves as the input to the channel attention mechanism.
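As a minimal sketch of Equations (1) and (3), the snippet below runs a single-channel valid convolution, ReLU, and max pooling on a toy signal and checks the output length against the formula for L_1. Batch normalization (Equation (2)) is omitted for brevity, and the kernel values, signal values, and the choice of a pooling window equal to the stride s are illustrative assumptions, not settings taken from the paper.

```python
def conv1d_valid(x, w, b=0.0):
    """Eq. (1): Y(t) = sum_i W(i) * X(t + i) + b, with no padding."""
    k = len(w)
    return [sum(w[i] * x[t + i] for i in range(k)) + b
            for t in range(len(x) - k + 1)]


def relu(v):
    """ReLU(x) = max(0, x), applied element-wise."""
    return [max(0.0, a) for a in v]


def max_pool(v, s):
    """Max pooling with window and stride both equal to s (an assumption)."""
    return [max(v[i:i + s]) for i in range(0, len(v) - s + 1, s)]


x = [0.0, 1.0, -2.0, 3.0, 0.5, -1.0, 2.0, 1.0]  # toy signal, L = 8
w = [1.0, 0.0, -1.0]                             # toy kernel, k_1 = 3
h1 = max_pool(relu(conv1d_valid(x, w)), 2)       # stride s = 2
```

With L = 8, k_1 = 3, and s = 2, the output has floor((8 - 3 + 1) / 2) = 3 time steps.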

To adaptively recalibrate channel-wise feature responses, we employ Squeeze-and-Excitation Networks (SENet) [45] that explicitly model channel interdependencies. As shown in Figure 1c, the SENet module consists of three sequential operations. First, the squeeze operation aggregates spatial information across each channel through global average pooling:

z_{c} = F_{s q} (u_{c}) = \frac{1}{L_{1}} \sum_{i = 1}^{L_{1}} u_{c} (i),

(4)

where

u_{c}

denotes the c-th channel of the input feature map

U \in R^{C \times L}

, and

z_{c}

represents the channel descriptor. The squeeze operation produces a global channel descriptor vector

z \in R^{C}

. Second, the excitation operation learns channel-wise dependencies through a two-layer fully connected network with a bottleneck structure:

h = δ (W_{1} z + b_{1}),

(5)

s = σ (W_{2} h + b_{2}),

(6)

where

h \in R^{\frac{C}{r}}

is the intermediate representation,

σ (x) = \frac{1}{1 + e^{- x}}

denotes the sigmoid activation function,

δ (x) = max (0, x)

represents the ReLU activation,

W_{1} \in R^{\frac{C}{r} \times C}

and

W_{2} \in R^{C \times \frac{C}{r}}

are the weights of the two fully connected layers with bias terms

b_{1}

and

b_{2}

, and r is the reduction ratio that controls the bottleneck size. Finally, the scale operation uses the channel-wise attention weights

s

to recalibrate the input feature map:

{\tilde{u}}_{c} = s_{c} \cdot u_{c},

(7)

where

s_{c}

is the attention weight for the c-th channel, and

{\tilde{u}}_{c}

is the recalibrated feature map. The SENet output

\tilde{U} \in R^{C \times L}

contains enhanced discriminative features with suppressed irrelevant information.
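The squeeze, excitation, and scale steps of Equations (4)–(7) can be sketched in plain Python as follows. The function name `se_block`, the toy weights, and the reduction ratio r = 2 are hypothetical choices for illustration only; the sketch is not the authors' implementation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, v, b):
    """Fully connected layer: W v + b."""
    return [sum(wij * vj for wij, vj in zip(row, v)) + bi
            for row, bi in zip(W, b)]


def se_block(U, W1, b1, W2, b2):
    """U is a list of C channels, each a list of L samples."""
    z = [sum(ch) / len(ch) for ch in U]               # squeeze, Eq. (4)
    h = [max(0.0, a) for a in affine(W1, z, b1)]      # ReLU bottleneck, Eq. (5)
    s = [sigmoid(a) for a in affine(W2, h, b2)]       # channel weights, Eq. (6)
    scaled = [[sc * u for u in ch] for sc, ch in zip(s, U)]  # scale, Eq. (7)
    return scaled, s


# Toy example: C = 2 channels, r = 2, so the bottleneck has C / r = 1 unit.
U = [[1.0, 3.0], [0.0, 2.0]]
W1, b1 = [[1.0, -1.0]], [0.0]
W2, b2 = [[2.0], [-2.0]], [0.0, 0.0]
U_tilde, s = se_block(U, W1, b1, W2, b2)
```

Each channel is multiplied by a single weight in (0, 1), so the output keeps the shape of the input while rescaling channels relative to one another.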

3.3. Multi-Scale Feature Learning with Dual Attention

To capture multi-scale temporal patterns and refine features through dual attention mechanisms, we propose a two-stage architecture that integrates multi-dilated convolutions with convolutional block attention.

The Multi-Dilated Convolutions (MDC) module captures multi-scale temporal patterns by employing parallel dilated convolutions with different receptive fields. As illustrated in Figure 1b, the MDC module consists of two parallel branches with different kernel sizes and the same dilation rate. Each branch applies dilated convolution to extract features at different scales:

F_{1} = {DilatedConv}_{3 \times 3} (\tilde{U}; d = 2), F_{2} = {DilatedConv}_{5 \times 5} (\tilde{U}; d = 2),

(8)

where

{DilatedConv}_{k \times k} (\cdot; d)

denotes dilated convolution with kernel size k and dilation rate d. The dilated convolution operation is defined as:

{(F * w)}_{i} = \sum_{j = 1}^{k} w_{j} \cdot F_{i + d \cdot j},

(9)

where ∗ represents the convolution operation,

w

is the convolutional kernel, and d is the dilation rate that controls the spacing between kernel elements. The effective receptive field of a dilated convolution is computed as:

R F = k + (k - 1) (d - 1),

(10)

where

R F

denotes the receptive field size. For the

3 \times 3

and

5 \times 5

dilated convolutions with

d = 2

, the receptive fields are

R F_{1} = 3 + (3 - 1) (2 - 1) = 5

and

R F_{2} = 5 + (5 - 1) (2 - 1) = 9

, respectively, enabling the network to capture both local and broader temporal patterns.
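The receptive field values can be checked with a short sketch of Equations (9) and (10). The helper names are hypothetical, and the convolution uses 0-based indexing without padding, which is an assumption made for this illustration.

```python
def receptive_field(k, d):
    """Eq. (10): RF = k + (k - 1)(d - 1)."""
    return k + (k - 1) * (d - 1)


def dilated_conv1d(f, w, d):
    """Valid dilated convolution, 0-based: out[i] = sum_j w[j] * f[i + d * j]."""
    k = len(w)
    span = receptive_field(k, d)  # number of input samples covered per output
    return [sum(w[j] * f[i + d * j] for j in range(k))
            for i in range(len(f) - span + 1)]


rf_branch_1 = receptive_field(3, 2)
rf_branch_2 = receptive_field(5, 2)
out = dilated_conv1d(list(range(10)), [1, 1, 1], 2)
```

For the ramp input 0, 1, ..., 9 and a kernel of ones, each output sums samples i, i + 2, and i + 4, so consecutive outputs differ by 3.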

The selection of dilation rate

d = 2

and kernel sizes (

k = 3, 5

) is motivated by the physical time scales of bearing defect signatures. For the CWRU 6205-2RS bearing operating at 1797 RPM (shaft frequency

f_{r} = 29.95

Hz), the characteristic defect frequencies are: ball pass frequency outer race (BPFO)

\approx 107.4

Hz, ball pass frequency inner race (BPFI)

\approx 162.2

Hz, ball spin frequency (BSF)

\approx 70.6

Hz, and fundamental train frequency (FTF)

\approx 11.9

Hz. At the 12 kHz sampling rate, each individual fault impulse typically spans 2–15 samples (<1.25 ms), while the inter-impulse periods range from approximately 74 samples (BPFI) to 1006 samples (FTF). The two MDC branches with effective receptive fields of 5 and 9 samples are designed to match the transient impulse durations of individual fault impacts at this time scale. Meanwhile, the progressive receptive field expansion through the full network pipeline—initial convolution (

k_{1}

), max-pooling (stride s), SENet, and the MDC module itself—enables the deeper layers to cover the inter-impulse periods of higher-frequency fault signatures such as BPFI and BPFO. This multi-level temporal coverage ensures that the network simultaneously captures the shape of individual impulses and the periodicity of repetitive fault events. The parameter sensitivity analysis in Section 4.7 further validates this design by confirming that

d = 2

achieves the best trade-off among the tested dilation rate configurations.
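The time scales quoted above follow from simple arithmetic, sketched below. The defect frequencies are the rounded values given in the text, so the computed periods can differ from the quoted ones by a few samples (for example, about 1008 instead of 1006 for FTF).

```python
FS_HZ = 12_000        # sampling rate
SHAFT_RPM = 1797      # motor speed

shaft_freq_hz = SHAFT_RPM / 60.0

# Rounded characteristic frequencies as stated in the text (Hz).
DEFECT_FREQ_HZ = {"BPFO": 107.4, "BPFI": 162.2, "BSF": 70.6, "FTF": 11.9}

# Inter-impulse period of each fault type, in samples.
period_samples = {name: FS_HZ / f for name, f in DEFECT_FREQ_HZ.items()}

# Duration of the longest quoted impulse (15 samples), in milliseconds.
impulse_ms = 15 / FS_HZ * 1000.0
```

The two MDC receptive fields (5 and 9 samples) sit inside the 2–15 sample impulse range, while even the shortest period (BPFI, about 74 samples) is far longer, which is why periodicity has to be covered by the deeper layers.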

The outputs from both branches are concatenated and processed through a

1 \times 1

convolution for feature fusion:

H_{M D C} = {Conv}_{1 \times 1} ([F_{1}; F_{2}]),

(11)

where

[\cdot; \cdot]

denotes concatenation along the channel dimension, and

{Conv}_{1 \times 1} (\cdot)

applies

1 \times 1

convolution to integrate features from different scales. The output

H_{M D C} \in R^{C_{2} \times L_{2}}

contains multi-scale temporal representations that capture both fine-grained and coarse-grained patterns.

Subsequently, the Convolutional Block Attention Module (CBAM) [46] sequentially applies channel attention and spatial attention to refine features along both dimensions. As shown in Figure 1d, CBAM consists of two sub-modules that work in cascade. The channel attention module computes attention weights for each channel by aggregating spatial information through both average pooling and max pooling. The shared multi-layer perceptron is formulated as

MLP (x) = W_{1} (δ (W_{0} (x) + b_{0})) + b_{1},

(12)

where

W_{0} \in R^{\frac{C}{r} \times C}

and

W_{1} \in R^{C \times \frac{C}{r}}

are weight matrices with bias terms

b_{0}

and

b_{1}

. The channel attention map is computed as:

M_{c} = σ (MLP (F^{a v g}) + MLP (F^{m a x})),

(13)

where

F^{a v g} = \frac{1}{L} \sum_{i = 1}^{L} F (:, i)

and

F^{m a x} = {max}_{i = 1}^{L} F (:, i)

represent the average-pooled and max-pooled features, respectively. The channel-refined features are computed as

F^{'} = M_{c} \otimes F,

(14)

where ⊗ denotes element-wise multiplication and

M_{c} \in R^{C \times 1}

is the channel attention map. The spatial attention module then generates a spatial attention map by applying pooling operations along the channel dimension. The spatial descriptors are computed as

F_{s}^{a v g} (i) = \frac{1}{C} \sum_{c = 1}^{C} F^{'} (c, i), F_{s}^{m a x} (i) = {max}_{c = 1}^{C} F^{'} (c, i),

(15)

where

F_{s}^{a v g}, F_{s}^{m a x} \in R^{1 \times L}

represent the average-pooled and max-pooled spatial features along the channel dimension. These descriptors are concatenated and processed through a convolutional layer

M_{s} = σ ({Conv}_{7 \times 7} ([F_{s}^{a v g}; F_{s}^{m a x}])),

(16)

where

[\cdot; \cdot]

denotes concatenation, and

{Conv}_{7 \times 7} (\cdot)

applies a

7 \times 7

convolution with a single output channel. The final refined features are obtained by

H_{C B A M} = M_{s} \otimes F^{'},

(17)

where

M_{s} \in R^{1 \times L}

is the spatial attention map, and

H_{C B A M} \in R^{C \times L}

represents the attention-refined features that emphasize informative channels and spatial locations.
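A compact sketch of Equations (12)–(17) for a one-dimensional feature map is given below. It assumes zero "same" padding and no bias in the spatial convolution so that the output keeps the C × L shape stated above; the function name `cbam_1d` and all weights are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def mlp(v, W0, b0, W1, b1):
    """Shared MLP of Eq. (12) with a ReLU hidden layer."""
    h = [max(0.0, sum(w * a for w, a in zip(row, v)) + b)
         for row, b in zip(W0, b0)]
    return [sum(w * a for w, a in zip(row, h)) + b for row, b in zip(W1, b1)]


def cbam_1d(F, W0, b0, W1, b1, conv_w):
    """F is C x L; conv_w is a 2 x k kernel (k odd) for the spatial branch."""
    C, L = len(F), len(F[0])
    f_avg = [sum(ch) / L for ch in F]
    f_max = [max(ch) for ch in F]
    m_c = [sigmoid(a + m) for a, m in zip(mlp(f_avg, W0, b0, W1, b1),
                                          mlp(f_max, W0, b0, W1, b1))]  # Eq. (13)
    Fp = [[mc * v for v in ch] for mc, ch in zip(m_c, F)]                # Eq. (14)
    s_avg = [sum(Fp[c][i] for c in range(C)) / C for i in range(L)]      # Eq. (15)
    s_max = [max(Fp[c][i] for c in range(C)) for i in range(L)]
    k = len(conv_w[0])
    pad = k // 2

    def at(seq, i):
        return seq[i] if 0 <= i < L else 0.0  # zero padding (assumption)

    m_s = [sigmoid(sum(conv_w[0][j] * at(s_avg, i + j - pad)
                       + conv_w[1][j] * at(s_max, i + j - pad)
                       for j in range(k)))
           for i in range(L)]                                            # Eq. (16)
    return [[ms * v for ms, v in zip(m_s, ch)] for ch in Fp]             # Eq. (17)


F = [[1.0, -2.0, 3.0, 0.0], [0.5, 0.5, -1.0, 2.0]]
out = cbam_1d(F, [[1.0, 1.0]], [0.0], [[1.0], [-1.0]], [0.0, 0.0],
              [[0.1] * 7, [0.1] * 7])
```

Because both attention maps come from a sigmoid, every output value is a down-weighted copy of the corresponding input value.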

3.4. Denoising and Fault Classification

To achieve robust fault diagnosis under noisy conditions, we propose a denoising block that employs a gated attention mechanism with residual connections to suppress noise and enhance fault-relevant features. Unlike the adopted modules above, the following formulations represent the novel architectural contribution of this work. As illustrated in Figure 2, this module consists of two parallel paths that work cooperatively to refine the feature representations.

The main feature extraction path processes input features

H_{i n}

from the previous layer through a standard convolutional block

F_{m a i n} = MaxPool (ReLU (BN (Conv (H_{i n})))),

(18)

where

F_{m a i n}

represents the extracted features. Simultaneously, the attention gating path generates attention weights to modulate the main features

G = σ (BN (FC (H_{i n}))),

(19)

where

FC (\cdot)

denotes a fully connected layer (a learned affine mapping; the non-linearity is supplied by the subsequent sigmoid),

BN (\cdot)

applies batch normalization, and

σ (\cdot)

is the sigmoid activation function that produces attention weights in the range (0, 1). The gating mechanism modulates the main features and combines them with the residual path

F_{g a t e d} = G \otimes F_{m a i n}, F_{r e s i d u a l} = Conv (BN (ReLU (MaxPool (H_{i n})))),

(20)

H_{d e n o i s e} = F_{g a t e d} \oplus F_{r e s i d u a l},

(21)

where ⊗ denotes element-wise multiplication, ⊕ represents element-wise addition, and

H_{d e n o i s e}

is the denoised output. The gating mechanism adaptively suppresses noisy components while the residual connection preserves important features, resulting in robust representations under noisy conditions.
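The gating and residual combination of Equations (20) and (21) reduces to an element-wise operation, sketched below. The sketch assumes that the gate logits (the output of BN(FC(·)) before the sigmoid) have already been shaped to match F_main, which the text does not specify; the function name is hypothetical.

```python
import math


def gated_residual(f_main, gate_logits, f_residual):
    """H = sigmoid(g) * F_main + F_residual, element-wise (Eqs. (20)-(21))."""
    out = []
    for m_row, g_row, r_row in zip(f_main, gate_logits, f_residual):
        out.append([m / (1.0 + math.exp(-g)) + r
                    for m, g, r in zip(m_row, g_row, r_row)])
    return out


# A logit of 0 passes half of the main feature; a very negative logit
# suppresses it almost entirely, leaving only the residual path.
h = gated_residual([[2.0, 4.0]], [[0.0, -50.0]], [[1.0, 1.0]])
```

The example shows the intended behaviour: where the gate closes, the output falls back to the residual path rather than to zero.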

Following the denoising block, an additional convolutional layer is applied to further refine the features, which are then flattened and fed into a fully connected layer for classification

H_{f i n a l} = Conv (H_{d e n o i s e}), h_{f l a t} = Flatten (H_{f i n a l}),

(22)

where

H_{f i n a l} \in R^{C_{3} \times L_{3}}

represents the final feature representation, and

h_{f l a t} \in R^{C_{3} \cdot L_{3}}

is the flattened feature vector. The fully connected layer computes the logit scores

z = W_{f c} h_{f l a t} + b_{f c},

(23)

where

W_{f c} \in R^{K \times (C_{3} \cdot L_{3})}

and

b_{f c} \in R^{K}

are the weight matrix and bias vector, respectively. The Softmax function converts the logits into a probability distribution

{\hat{y}}_{k} = \frac{e^{z_{k}}}{\sum_{j = 1}^{K} e^{z_{j}}}, k = 1, 2, \dots, K,

(24)

where

{\hat{y}}_{k}

represents the predicted probability for the k-th fault class, ensuring that

\sum_{k = 1}^{K} {\hat{y}}_{k} = 1

. The output

\hat{y} \in R^{K}

represents the predicted probability distribution over the K bearing fault types. The softmax function and cross-entropy loss used for training (Section 3.5) are standard formulations in multi-class classification.
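Equation (24) is usually evaluated with the maximum logit subtracted first, which leaves the result unchanged but avoids overflow. A minimal sketch of that common variant (not necessarily what the authors' code does):

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]


probs = softmax([2.0, 1.0, 0.1])
large = softmax([1000.0, 1000.0])  # would overflow without the shift
```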

3.5. Training Strategy and Objective Function

MDCAD-Net is trained end-to-end using supervised learning with cross-entropy loss. Following best practices in deep learning for fault diagnosis, we employ batch normalization after each convolutional layer to stabilize training and accelerate convergence. Dropout regularization is applied before the final classification layer to prevent overfitting.

The training objective function is formulated as

L_{t o t a l} = L_{C E} + λ L_{r e g},

(25)

where

L_{C E}

is the cross-entropy loss,

L_{r e g}

is the regularization term, and

λ

is a hyperparameter that balances classification accuracy and model complexity.

The cross-entropy loss for multi-class classification is defined as

L_{C E} = - \frac{1}{B} \sum_{i = 1}^{B} \sum_{k = 1}^{K} y_{i, k} log ({\hat{y}}_{i, k}),

(26)

where B is the batch size, K is the number of fault classes,

y_{i, k}

is the ground truth label (one-hot encoded), and

{\hat{y}}_{i, k}

is the predicted probability for the k-th class of the i-th sample.

The regularization term applies

L_{2}

regularization to all trainable parameters to prevent overfitting

L_{r e g} = \sum_{θ \in Θ} {∥ θ ∥}_{2}^{2},

(27)

where

Θ

represents the set of all trainable parameters in the network, including convolutional kernels, fully connected layer weights, and batch normalization parameters.
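Equations (25)–(27) can be sketched as below. Integer class labels stand in for the one-hot vectors, the small `eps` guards against log(0), and the value of λ in the example is a placeholder, since the text does not report it.

```python
import math


def cross_entropy(probs, labels, eps=1e-12):
    """Eq. (26) with integer labels: mean of -log(predicted prob of true class)."""
    return -sum(math.log(p[y] + eps) for p, y in zip(probs, labels)) / len(labels)


def l2_penalty(params):
    """Eq. (27): sum of squared parameter values."""
    return sum(t * t for t in params)


def total_loss(probs, labels, params, lam):
    """Eq. (25): L_total = L_CE + lambda * L_reg."""
    return cross_entropy(probs, labels) + lam * l2_penalty(params)


# One sample with a uniform two-class prediction, two toy parameters.
loss = total_loss([[0.5, 0.5]], [0], [3.0, 4.0], lam=0.01)
```

Here the cross-entropy term is ln 2 and the penalty term is 0.01 × 25 = 0.25.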

The network is optimized using the Adam optimizer, a widely adopted first-order gradient-based method that combines adaptive learning rates with momentum estimation. The parameter update rule is defined as

m_{t} = β_{1} m_{t - 1} + (1 - β_{1}) \nabla_{θ} L_{t o t a l}, v_{t} = β_{2} v_{t - 1} + (1 - β_{2}) {(\nabla_{θ} L_{t o t a l})}^{2},

(28)

{\hat{m}}_{t} = \frac{m_{t}}{1 - β_{1}^{t}}, {\hat{v}}_{t} = \frac{v_{t}}{1 - β_{2}^{t}},

(29)

θ_{t + 1} = θ_{t} - α_{t} \frac{{\hat{m}}_{t}}{\sqrt{{\hat{v}}_{t}} + ϵ},

(30)

where

m_{t}

and

v_{t}

are the first and second moment estimates,

β_{1}

and

β_{2}

are exponential decay rates (typically 0.9 and 0.999),

α_{t}

is the learning rate at iteration t, and

ϵ

is a small constant for numerical stability. The learning rate is adjusted during training using a scheduler that reduces

α_{t}

when the validation loss plateaus. Data augmentation techniques such as random scaling and time shifting are applied to the training data to improve model generalization and robustness to variations in the input signals.
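A single Adam update following Equations (28)–(30) can be written as below. The default ε = 1e-8 is a commonly used value and an assumption here, since the text does not state it.

```python
import math


def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based iteration index."""
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]        # Eq. (28)
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]
    m_hat = [mi / (1 - beta1 ** t) for mi in m]                         # Eq. (29)
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    theta = [th - lr * mh / (math.sqrt(vh) + eps)                       # Eq. (30)
             for th, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v


theta, m, v = adam_step([1.0, -1.0], [0.5, -2.0], [0.0, 0.0], [0.0, 0.0], t=1)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so each parameter moves by roughly the learning rate regardless of gradient magnitude.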

Algorithm 1 summarizes the complete forward propagation and training procedure of MDCAD-Net, consolidating the mathematical formulations described above into a unified computational flow.

Algorithm 1 MDCAD-Net training procedure

Require: Training set D = {(X_{i}, y_{i})}_{i = 1}^{N}, learning rate α, epochs E
Ensure: Trained parameters Θ
1: Initialize Θ randomly; set α \leftarrow 0.001
2: for epoch = 1 to E do
3:     for each mini-batch (X_{b}, y_{b}) do
4:         H_{1} \leftarrow MaxPool (ReLU (BN (Conv (X_{b}))))
5:         z \leftarrow GAP (H_{1}); s \leftarrow σ (W_{2} δ (W_{1} z))
6:         \tilde{U} \leftarrow s ⊙ H_{1}
7:         F_{1} \leftarrow {DConv}_{3} (\tilde{U}; d = 2); F_{2} \leftarrow {DConv}_{5} (\tilde{U}; d = 2)
8:         H_{MDC} \leftarrow {Conv}_{1 \times 1} ([F_{1}; F_{2}])
9:         M_{c} \leftarrow σ (MLP (AvgPool (H_{MDC})) + MLP (MaxPool (H_{MDC})))
10:        F^{'} \leftarrow M_{c} \otimes H_{MDC}
11:        M_{s} \leftarrow σ ({Conv}_{7} ([{AvgPool}_{s} (F^{'}); {MaxPool}_{s} (F^{'})]))
12:        H_{CBAM} \leftarrow M_{s} \otimes F^{'}
13:        F_{main} \leftarrow MaxPool (ReLU (BN (Conv (H_{CBAM}))))
14:        G \leftarrow σ (BN (FC (H_{CBAM})))
15:        H_{denoise} \leftarrow (G \otimes F_{main}) \oplus Conv (BN (ReLU (MaxPool (H_{CBAM}))))
16:        \hat{y} \leftarrow Softmax (W_{f c} \cdot Flatten (Conv (H_{denoise})) + b_{f c})
17:        L \leftarrow L_{C E} (\hat{y}, y_{b}) + λ L_{r e g} (Θ)
18:        Θ \leftarrow Adam (Θ, \nabla_{Θ} L, α)
19:    end for
20:    α \leftarrow ReduceLROnPlateau (α, L_{val})
21: end for
22: return Θ

4. Experimental Results and Analysis

4.1. Datasets

We conduct experiments on the Case Western Reserve University (CWRU) bearing dataset [47], which is a widely recognized benchmark for bearing fault diagnosis research. The dataset contains vibration signals collected from rolling element bearings under various fault conditions using accelerometers mounted at different locations on the motor housing.

The CWRU bearing dataset comprises vibration signals collected at two sampling frequencies: 12 kHz for drive end (DE) and fan end (FE) bearings and 48 kHz for drive end bearings. In this study, we focus on the 12 kHz drive end accelerometer data, which includes measurements from both healthy bearings and bearings with single-point faults introduced using electro-discharge machining. The faults are categorized into three types based on their locations: inner race (IR), outer race (OR), and ball (rolling element) faults. Additionally, three fault severity levels are considered, corresponding to fault diameters of 0.007 inches (7 mils), 0.014 inches (14 mils), and 0.021 inches (21 mils). The dataset used in our experiments consists of 10 fault classes as shown in Table 1. Figure 3 shows the CWRU bearing test rig, which consists of an electric motor, a torque transducer/encoder, a dynamometer, and bearings at the drive end and fan end positions.

Data Preprocessing and Segmentation. The continuous vibration signals are segmented into overlapping samples using a sliding window with 50% overlap (a hop of 512 points). Each sample contains 1024 data points, corresponding to approximately 0.085 s of vibration data at the 12 kHz sampling frequency. This window length is selected to capture sufficient temporal information for fault characteristic extraction while maintaining a reasonable sample size for efficient model training. The overlap increases the number of training samples and helps the model learn robust features from different temporal positions within the fault signals.
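The windowing described above can be sketched as follows. The function name `segment` is hypothetical, and incomplete trailing windows are dropped, which is an assumption the text does not spell out.

```python
def segment(signal, window=1024, overlap=0.5):
    """Split a 1D signal into fixed-length windows with fractional overlap."""
    hop = int(window * (1 - overlap))  # 512 samples for 50% overlap
    return [signal[start:start + window]
            for start in range(0, len(signal) - window + 1, hop)]


signal = list(range(4096))             # toy record of 4096 samples
windows = segment(signal)
window_seconds = 1024 / 12_000         # duration of one window at 12 kHz
```

A 4096-point record yields 7 windows instead of the 4 that non-overlapping segmentation would give, and each window spans about 0.085 s.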

The dataset is split into training (70%), validation (15%), and testing (15%) sets following a stratified random split to ensure balanced class distribution across all subsets. All vibration signals are normalized to zero mean and unit variance using standardization before being fed into the network, ensuring consistent input scale across different fault conditions and facilitating stable gradient-based optimization during training.
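A minimal sketch of the standardization and the stratified 70/15/15 split is shown below. Normalizing each segment independently, the rounding of per-class counts, and the helper names are assumptions made for illustration; only the split fractions and the seed of 42 come from the text.

```python
import random
import statistics


def zscore(x):
    """Standardize a sequence to zero mean and unit variance."""
    mu = statistics.mean(x)
    sd = statistics.pstdev(x)
    return [(v - mu) / sd for v in x]


def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=42):
    """Return train/val/test index lists with the class balance preserved."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    train, val, test = [], [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_train = round(fractions[0] * len(idxs))
        n_val = round(fractions[1] * len(idxs))
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]
    return train, val, test


labels = [c for c in range(10) for _ in range(20)]  # 10 classes, 20 samples each
train_idx, val_idx, test_idx = stratified_split(labels)
z = zscore([1.0, 2.0, 3.0, 4.0])
```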

Figure 4 presents a multi-domain visualization of CWRU bearing vibration signals, demonstrating the discriminative characteristics of different fault conditions. The time-domain waveforms (Figure 4a–c) show that normal bearings exhibit relatively smooth oscillations, while fault conditions display significant amplitude modulations and impulsive features with peak amplitudes reaching 1.5 times the normal values for inner race faults. The frequency-domain spectra (Figure 4d–f) reveal that fault conditions produce characteristic frequency peaks at bearing defect frequencies (e.g., BPFI around 500–800 Hz) and increased high-frequency content above 3 kHz, distinguishing them from the low-frequency concentration observed in normal bearings. The time-frequency spectrograms (Figure 4g,h) further illustrate temporal variations, with fault conditions exhibiting periodic modulation patterns and intermittent high-frequency components absent in normal operation. These multi-domain signal characteristics validate the feasibility of deep learning-based approaches for automated bearing fault diagnosis.

4.2. Experimental Setting

All experiments are conducted using PyTorch 1.13.0 on a workstation equipped with an NVIDIA RTX 3090 GPU (24 GB memory), Intel Core i9-10900K CPU, and 64 GB RAM. The input to MDCAD-Net consists of 1D vibration signals with a fixed length of 1024 time steps, reshaped to match the network’s input requirements. All models are trained for 30 epochs with a batch size of 32 for both training and validation. We employ the Adam optimizer with an initial learning rate of 0.001 and implement a ReduceLROnPlateau learning rate scheduler that reduces the learning rate by a factor of 0.5 when the validation loss plateaus for 5 consecutive epochs. Cross-entropy loss is used as the objective function, and early stopping with a patience of 10 epochs is applied to prevent overfitting. To ensure reproducible results, we fix the random seed to 42 for all experiments.
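The interaction between the learning rate schedule and early stopping can be illustrated with a simplified simulation. This is a sketch of the stated policy (halve the rate after 5 epochs without improvement, stop after 10), not a reimplementation of PyTorch's `ReduceLROnPlateau`, whose thresholding and counting rules differ in detail; the validation losses are made up.

```python
def simulate_schedule(val_losses, lr=0.001, factor=0.5,
                      lr_patience=5, stop_patience=10):
    """Return the learning rate in effect after each completed epoch."""
    best = float("inf")
    bad_lr = 0     # epochs without improvement since the last LR change
    bad_stop = 0   # epochs without improvement since the best loss
    history = []
    for loss in val_losses:
        if loss < best:
            best, bad_lr, bad_stop = loss, 0, 0
        else:
            bad_lr += 1
            bad_stop += 1
        if bad_lr >= lr_patience:
            lr *= factor
            bad_lr = 0
        history.append(lr)
        if bad_stop >= stop_patience:
            break  # early stopping
    return history


# Hypothetical losses: improvement for two epochs, then a long plateau.
lr_history = simulate_schedule([1.0, 0.9] + [0.95] * 12)
```

In this toy run the rate is halved at epochs 7 and 12, and training stops at epoch 12, well before the 30-epoch budget.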

Table 2 summarizes the key hyperparameters and architectural configurations of MDCAD-Net. These settings are carefully tuned through preliminary experiments to balance model performance and computational efficiency.

4.3. Evaluation Metrics

To comprehensively evaluate the performance of MDCAD-Net and baseline methods for bearing fault diagnosis, we employ four standard classification metrics widely adopted in the machine learning and fault diagnosis literature: accuracy, precision, recall, and F1-score. These metrics provide complementary perspectives on model performance, capturing both overall correctness and class-specific diagnostic capabilities.

Accuracy measures the overall proportion of correctly classified samples across all fault classes

Accuracy = \frac{1}{N} \sum_{i = 1}^{N} \mathbb{1} ({\hat{y}}_{i} = y_{i}),

(31)

where N is the total number of test samples,

y_{i}

is the ground truth label for the i-th sample,

{\hat{y}}_{i}

is the predicted label, and

\mathbb{1} (\cdot)

is the indicator function that equals 1 when the condition is true and 0 otherwise.

For multi-class classification, we compute precision and recall for each class and report the macro-averaged values. Precision for class k measures the proportion of correctly identified samples among all samples predicted as class k

{Precision}_{k} = \frac{T P_{k}}{T P_{k} + F P_{k}},

(32)

where

T P_{k}

(true positives) is the number of samples correctly classified as class k, and

F P_{k}

(false positives) is the number of samples incorrectly classified as class k. The macro-averaged precision is computed as

Precision = \frac{1}{K} \sum_{k = 1}^{K} {Precision}_{k},

(33)

where K is the total number of fault classes.

Recall for class k measures the proportion of correctly identified samples among all samples that truly belong to class k

{Recall}_{k} = \frac{T P_{k}}{T P_{k} + F N_{k}},

(34)

where

F N_{k}

(false negatives) is the number of samples of class k that are incorrectly classified as other classes. The macro-averaged recall is

Recall = \frac{1}{K} \sum_{k = 1}^{K} {Recall}_{k} .

(35)

The F1-score provides a harmonic mean of precision and recall, offering a balanced measure of diagnostic performance

{F 1}_{k} = \frac{2 \cdot {Precision}_{k} \cdot {Recall}_{k}}{{Precision}_{k} + {Recall}_{k}},

(36)

F 1 - score = \frac{1}{K} \sum_{k = 1}^{K} {F 1}_{k} .

(37)

All metrics are reported as decimal values in the range [0, 1], where higher values indicate better performance. The use of macro-averaging ensures that each fault class contributes equally to the overall evaluation, which is particularly important for assessing model performance across different fault types and severities.
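The metrics in Equations (31)–(37) can all be derived from a confusion matrix, as in the sketch below. The two-class matrix is a made-up example, and returning 0 when a denominator is zero is a convention chosen here, not one stated in the text.

```python
def macro_metrics(cm):
    """cm[i][j] = number of class-i samples predicted as class j."""
    K = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[k][k] for k in range(K)) / total
    precisions, recalls, f1s = [], [], []
    for k in range(K):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(K)) - tp   # predicted k, truly other
        fn = sum(cm[k]) - tp                        # truly k, predicted other
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    return accuracy, sum(precisions) / K, sum(recalls) / K, sum(f1s) / K


acc, prec, rec, f1 = macro_metrics([[9, 1], [2, 8]])
```

With balanced classes, as in this example, macro-averaged recall coincides with accuracy, while precision and F1 can still differ.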

4.4. Baselines

To comprehensively evaluate MDCAD-Net, we compare it against eight representative baseline models that span different architectural paradigms. We select five deep convolutional neural networks, including ResNet-18 [48] with residual connections, Inception-v3 [49] with multi-scale feature extraction, VGG-16 [50] with very deep architecture, WenCNN [51], and MA1DCNN [52] specifically designed for bearing fault diagnosis with specialized 1D convolutional architectures. Additionally, we include SequentialLSTM [53], which captures temporal dependencies through long short-term memory units, as well as two Transformer-based models: Autoformer [54], incorporating decomposition with auto-correlation and TARNet [55], employing task-aware reconstruction mechanisms. To ensure a strictly fair comparison, all baseline models were reimplemented by the authors using PyTorch and trained from scratch under identical experimental conditions: the same data preprocessing pipeline (z-score normalization, 1024-point windowing with 50% overlap), the same stratified 70%/15%/15% train/validation/test split with random seed 42, the same Adam optimizer (learning rate 0.001, batch size 32, 30 epochs), and the same ReduceLROnPlateau scheduler. For 2D architectures (ResNet-18, Inception-v3, VGG-16), the 1D vibration signals were reshaped into single-channel 2D inputs following standard practice. No hyperparameter tuning was performed individually for any baseline; all share the same training protocol to eliminate confounding factors. Implementation details follow the original papers with necessary adaptations for 1D vibration signal input.

4.5. Main Results

Table 3 presents the comprehensive performance comparison of MDCAD-Net against eight baseline models on the CWRU bearing fault diagnosis task. MDCAD-Net achieves the best overall performance with 0.9893 accuracy and maintains balanced performance across all evaluation metrics, including precision (0.9894), recall (0.9893), and F1-score (0.9892). The consistent performance across all metrics indicates that the model achieves reliable fault detection without bias toward specific fault types or classes, demonstrating the effectiveness of integrating multi-dilated convolutions with dual attention mechanisms and denoising capabilities.

Among the baseline models, convolutional neural network architectures demonstrate strong performance, with ResNet-18 achieving competitive results at 0.9854 accuracy due to its deep residual connections that facilitate gradient flow and feature reuse across multiple layers. Inception-v3 obtains 0.9821 accuracy through multi-scale convolutional kernels that capture features at different temporal resolutions, while the specialized bearing fault diagnosis models MA1DCNN and WenCNN achieve 0.9765 and 0.9612 accuracy, respectively, validating the effectiveness of domain-specific architectural designs for vibration signal analysis. Transformer-based architectures show varying performance, with TARNet demonstrating 0.9687 accuracy through temporal attention mechanisms, while Autoformer exhibits relatively lower performance at 0.9401 accuracy, suggesting that general-purpose time series models may require additional adaptations for fault diagnosis tasks. The recurrent architecture SequentialLSTM achieves 0.9487 accuracy, indicating that temporal sequential modeling alone may be insufficient for capturing the complex spectral patterns in bearing vibration signals without explicit multi-scale feature extraction and attention-based feature selection mechanisms.

To provide deeper insights into the model’s classification behavior, Figure 5 presents the confusion matrix for MDCAD-Net on the CWRU dataset. The matrix demonstrates excellent diagonal dominance, indicating strong discriminative capability across all 10 fault classes. Most classes achieve perfect classification (100%), including Normal, 7-Inner, 7-Outer, 14-Inner, 14-Outer, 21-Inner, and 21-Outer, demonstrating the model’s exceptional ability to distinguish between different fault locations and damage severities. Only three ball fault classes show minor misclassifications: 7-Ball exhibits 3.2% confusion with 21-Ball, 14-Ball shows 4.8% distributed misclassifications across multiple classes including Normal (1.6%), 7-Ball (0.5%), 21-Inner (1.1%), and 21-Outer (1.6%), and 21-Ball demonstrates 2.7% confusion with 14-Inner. These confusions primarily occur between ball fault classes with different damage sizes, which share similar frequency characteristics but differ in amplitude modulation patterns, representing the most challenging discrimination task in bearing fault diagnosis. The model’s ability to correctly classify over 98% of samples across all fault severities demonstrates the effectiveness of the multi-scale feature extraction and dual attention mechanisms in capturing subtle differences between fault signatures.

Figure 6 further breaks down the performance across individual fault classes, presenting accuracy, precision, recall, and F1-score for each of the 10 classes through a comprehensive four-panel visualization. The results reveal distinct performance patterns across different fault locations. Normal bearing state achieves perfect classification with all metrics at 1.00, providing a reliable baseline for fault detection. Outer race faults (7-Outer, 14-Outer, 21-Outer) consistently achieve perfect performance across all damage severities, with accuracy, precision, recall, and F1-score all at 1.00, indicating that outer race defects generate highly distinctive vibration patterns that are easily separable from other fault types. Inner race faults (7-Inner, 14-Inner, 21-Inner) also demonstrate excellent performance with all metrics above 0.97, showing that the model effectively captures the characteristic modulation patterns associated with inner raceway damage. Ball faults represent the most challenging category, with 14-Ball showing the lowest accuracy at 0.9520 and precision at 0.9524, yet maintaining robust recall (0.9520) and F1-score (0.9520) above 0.95. The relatively lower performance on ball faults is expected, as rolling element defects produce more complex and variable vibration signatures that depend on ball position, load zone, and contact angle dynamics. This class-wise analysis confirms that MDCAD-Net maintains balanced and robust performance across all fault types and damage severities, which is essential for reliable industrial fault diagnosis systems where both false alarms and missed detections carry significant operational costs.

The superior performance of MDCAD-Net compared to baseline models can be attributed to three key architectural innovations. First, the multi-dilated convolution module enables simultaneous capture of both fine-grained transient impulses and long-range periodic patterns in vibration signals through parallel branches with different receptive fields, providing richer temporal representations than single-scale convolutional approaches. Second, the dual attention mechanism combining SENet and CBAM selectively emphasizes discriminative features across both channel and spatial dimensions, suppressing noise-contaminated channels while focusing on informative temporal regions that carry fault-specific signatures. Third, the denoising block with gated attention effectively filters background noise and measurement interference that are prevalent in real-world industrial environments where sensor signals are subject to multiple sources of contamination. The synergistic combination of these components enables MDCAD-Net to extract robust fault signatures that generalize well across different fault types, damage severities, and operating conditions, as evidenced by the consistent high performance across all classes and the minimal confusion between fault categories.

4.6. Ablation Study

To systematically validate the contribution of each component in MDCAD-Net, we conduct ablation studies by removing individual modules while keeping other components unchanged. Table 4 presents the performance degradation when removing specific modules, demonstrating the importance of each architectural component for achieving optimal fault diagnosis performance.

The full MDCAD-Net model achieves the best performance, with an accuracy of 0.9893 and precision, recall, and F1-score all within 0.0001 of that value. Removing the denoising block reduces accuracy to 0.9812 (a drop of 0.81 percentage points), indicating that the gated attention mechanism provides meaningful noise suppression for real-world bearing vibration signals. The CBAM module has a larger impact: its removal decreases accuracy to 0.9754 (1.39 percentage points), supporting the effectiveness of sequential channel and spatial attention for emphasizing discriminative temporal regions. The MDC module contributes substantially, with its absence reducing accuracy to 0.9631 (2.62 percentage points), confirming the importance of multi-scale temporal feature extraction through parallel dilated convolutions with different receptive fields. The SENet module has the largest impact among individual components, with its removal decreasing accuracy to 0.9487 (4.06 percentage points), highlighting the essential role of channel-wise feature recalibration in learning discriminative representations from vibration signals. These results validate our design choices and demonstrate that each component contributes meaningfully to the overall performance, with SENet and MDC being the most critical modules for achieving state-of-the-art bearing fault diagnosis.
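The degradation figures quoted for each removed module are absolute differences from the full model's accuracy. The short Python sketch below recomputes them from the Table 4 accuracies and, for comparison, also gives the drop relative to the full model; the helper names are illustrative assumptions.

```python
# Accuracies copied from Table 4.
full_accuracy = 0.9893
ablated_accuracy = {
    "w/o Denoising Block": 0.9812,
    "w/o CBAM": 0.9754,
    "w/o MDC": 0.9631,
    "w/o SENet": 0.9487,
}


def drop_in_points(full, ablated):
    """Absolute accuracy drop, expressed in percentage points."""
    return round((full - ablated) * 100, 2)


def relative_drop_percent(full, ablated):
    """Accuracy drop relative to the full model's accuracy, in percent."""
    return round((full - ablated) / full * 100, 2)


drops = {name: drop_in_points(full_accuracy, acc)
         for name, acc in ablated_accuracy.items()}
relative = {name: relative_drop_percent(full_accuracy, acc)
            for name, acc in ablated_accuracy.items()}

for name in ablated_accuracy:
    print(f"{name}: {drops[name]} points absolute, {relative[name]}% relative")
```

Because the full model's accuracy is close to 1, the absolute and relative figures differ only in the second decimal place, but the absolute differences are the ones that match the values reported in the text.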

To further investigate inter-module interactions and determine whether any modules are redundant, we conduct combined ablation experiments by removing two modules simultaneously and comparing the observed degradation with the sum of individual removals. If the combined drop exceeds the additive sum, the modules exhibit synergy (each enhances the other’s contribution); if it equals the sum, the modules are independent; if it falls below the sum, partial functional overlap exists. Table 5 presents the results.

All four module pairs exhibit synergistic interactions (Δ_combined > Δ_sum in magnitude), confirming that each module enhances the effectiveness of its counterpart rather than providing redundant functionality. The strongest synergy appears between SENet and MDC (a drop of 7.37 percentage points vs. an expected 6.68, an excess of 0.69), consistent with the architectural design rationale that channel recalibration before multi-scale extraction pre-emphasizes informative frequency bands, amplifying the discriminative power of subsequent dilated convolutions. The SENet–CBAM pair also shows notable synergy (6.04 vs. 5.45 percentage points), indicating that channel and spatial attention serve genuinely complementary rather than overlapping roles. The minimal baseline (Conv only) achieves 0.9078, confirming that the four modules collectively contribute an improvement of 8.15 percentage points that cannot be attributed to any single component.
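The additive comparison behind Table 5 can be checked with a few lines of Python. The sketch below takes the single-module drops from Table 4 and the combined drops from Table 5 (all in percentage points) and labels each pair; the function name and the `tolerance` used to decide when two values count as equal are assumptions introduced for illustration.

```python
# Accuracy drops in percentage points, taken from Tables 4 and 5.
single_drop = {"SENet": 4.06, "MDC": 2.62, "CBAM": 1.39, "Denoise": 0.81}
combined_drop = {
    ("SENet", "CBAM"): 6.04,
    ("SENet", "MDC"): 7.37,
    ("MDC", "Denoise"): 3.89,
    ("CBAM", "Denoise"): 2.70,
}


def classify_interaction(pair, tolerance=0.005):
    """Compare a pair's combined drop with the sum of its individual drops.

    `tolerance` (hypothetical) absorbs rounding in the reported accuracies.
    """
    expected = round(sum(single_drop[module] for module in pair), 2)
    excess = round(combined_drop[pair] - expected, 2)
    if excess > tolerance:
        label = "synergy"
    elif excess < -tolerance:
        label = "overlap"
    else:
        label = "independent"
    return expected, excess, label


interactions = {pair: classify_interaction(pair) for pair in combined_drop}
for pair, (expected, excess, label) in interactions.items():
    print(f"{pair[0]} & {pair[1]}: expected {expected}, excess {excess} -> {label}")
```

Sorting the pairs by `excess` reproduces the ranking discussed in the text, with the SENet and MDC pair showing the largest gap between the combined and additive drops.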

4.7. Parameter Sensitivity Analysis

Understanding how key hyperparameters affect model performance is critical for practical deployment and reproducibility. We conduct a comprehensive sensitivity analysis on six key parameters: learning rate, batch size, dilation rates in the MDC module, SENet reduction ratio, CBAM reduction ratio, and kernel size. As illustrated in Figure 7, each parameter exhibits distinct sensitivity patterns across the four evaluation metrics, providing valuable insights into the robustness and optimization requirements of MDCAD-Net for bearing fault diagnosis applications.

Training hyperparameters demonstrate the most pronounced impact on model performance. Learning rate exhibits the highest sensitivity, with the optimal value of 1 × 10⁻³ achieving peak performance across all metrics. Excessively large learning rates cause catastrophic training instability, as evidenced by sharply declining performance curves, particularly for recall, which shows the steepest descent pattern. Batch size analysis reveals that size 32 provides optimal performance, while smaller batches introduce severe degradation due to excessive gradient noise. Dilation rates in the MDC module prove critical for multi-scale feature extraction, with configuration [1, 2, 4] achieving superior performance by effectively capturing both fine-grained transient impulses and long-range temporal dependencies. Alternative configurations show substantial degradation, demonstrating that proper receptive field design is essential for detecting early-stage bearing faults.
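To make the link between dilation rate and receptive field concrete, the following Python sketch applies the standard relation for a single dilated 1-D convolution, in which a kernel of size k with dilation d spans d(k − 1) + 1 input samples. It is evaluated for the dilation configuration [1, 2, 4] and for the kernel sizes 3 and 5 listed in Table 2; how the branches are stacked or fused inside the MDC module is not modeled here, so the numbers describe one layer in isolation.

```python
def dilated_receptive_field(kernel_size, dilation):
    """Number of input samples spanned by one dilated 1-D convolution."""
    return dilation * (kernel_size - 1) + 1


# Dilation configuration [1, 2, 4] with a kernel size of 3.
fields = {d: dilated_receptive_field(3, d) for d in (1, 2, 4)}

# The same dilations with both kernel sizes listed in Table 2.
fields_by_kernel = {
    (k, d): dilated_receptive_field(k, d)
    for k in (3, 5)
    for d in (1, 2, 4)
}

for (k, d), span in fields_by_kernel.items():
    print(f"kernel={k}, dilation={d}: spans {span} samples")
```

With a kernel size of 3, the three branches span 3, 5, and 9 samples using the same number of weights per branch, which is why parallel dilations can cover short impulses and longer periodic structure without enlarging the kernels.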

Architectural parameters exhibit more stable but still substantial sensitivity patterns. SENet and CBAM reduction ratios both achieve optimal performance at ratio 16, with the curves showing symmetric degradation toward larger ratios that create information bottlenecks. CBAM demonstrates higher sensitivity than SENet, suggesting that spatial attention is more critical than channel attention for bearing fault diagnosis. Kernel size analysis presents a monotonic degradation pattern where smaller kernels consistently outperform larger ones, with size 3 capturing appropriate temporal resolution while larger kernels obscure transient fault impulses through excessive smoothing. Across all parameters, recall consistently exhibits the largest performance variations, highlighting that maintaining robust fault detectability is the primary challenge in hyperparameter optimization for diverse bearing operating conditions.
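The bottleneck effect of the reduction ratio can be illustrated by counting the hidden units and parameters of a squeeze-and-excitation bottleneck. The sketch below assumes the common two-layer design (C → C/r → C) with bias terms and uses the 64 channels listed in Table 2 as the initial convolution width; both the layer layout and the channel count at the SENet stage are assumptions, since they are not restated in this section.

```python
def se_block_parameters(channels, reduction):
    """Hidden width and parameter count of a two-layer SE bottleneck.

    Assumes fully connected layers C -> C/r -> C, each with a bias term.
    """
    hidden = max(channels // reduction, 1)
    squeeze = channels * hidden + hidden      # weights + biases of the first layer
    excite = hidden * channels + channels     # weights + biases of the second layer
    return hidden, squeeze + excite


# Assumed channel count: the 64 initial convolution channels from Table 2.
se_summary = {r: se_block_parameters(64, r) for r in (4, 8, 16, 32)}

for r, (hidden, params) in se_summary.items():
    print(f"reduction={r}: hidden units={hidden}, parameters={params}")
```

Under these assumptions, a ratio of 16 leaves only 4 hidden units for 64 channels and a ratio of 32 leaves 2, which gives a sense of how quickly larger ratios restrict the information passed through the channel-attention path.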

To further investigate whether the module arrangement affects performance, we evaluate five different placement orders while keeping all components unchanged. Table 6 presents the results.

The proposed order (SENet→MDC→CBAM→Denoise) achieves the best performance. Applying SENet before MDC allows channel recalibration to pre-emphasize fault-relevant frequency bands before multi-scale extraction. Placing CBAM after MDC enables spatial-temporal attention to refine the multi-scale representations. Positioning the denoising block at the end ensures noise suppression operates on the most refined features. Placing the denoising block first (last row) causes the largest degradation (2.81 percentage points), confirming that early-stage denoising removes potentially useful signal components before the network has learned discriminative features.

4.8. Feature Visualization

To provide intuitive insights into the learned representations, we employ t-distributed Stochastic Neighbor Embedding (t-SNE) to visualize the high-dimensional feature embeddings extracted by MDCAD-Net from the test set. As shown in Figure 8, the model successfully learns discriminative feature representations that form well-separated clusters in the two-dimensional embedding space, with each cluster corresponding to a specific bearing health condition.

The visualization reveals several key characteristics of the learned feature space. First, the normal bearing state forms a compact and isolated cluster that is clearly separated from all fault conditions, demonstrating the model’s robust capability to distinguish healthy bearings from faulty ones. Second, fault types from different locations (inner race, ball, and outer race) form distinct regional groupings, indicating that MDCAD-Net effectively captures the unique vibration signatures associated with different fault mechanisms. Third, within each fault type, samples with different damage severities (0.007, 0.014, and 0.021 inches) exhibit systematic spatial progression, suggesting that the learned features encode damage severity information in a continuous manner. Notably, the spatial arrangement shows that ball fault samples are positioned between inner race and outer race faults, which aligns with the confusion matrix analysis showing that ball faults are the most challenging to classify. The clear inter-class separability combined with intra-class compactness validates the effectiveness of the multi-scale feature extraction and dual attention mechanisms in learning discriminative representations for bearing fault diagnosis.

To provide quantitative evidence of feature clustering quality beyond visual inspection, we compute three clustering metrics on the learned feature embeddings for each ablation variant. Table 7 presents the results.

The full MDCAD-Net achieves the highest Silhouette Score (0.8234) and Calinski-Harabasz Index (2845.6) and the lowest Davies-Bouldin Index (0.3567), confirming superior inter-class separability and intra-class compactness. The progressive improvement from the bottom row (w/o SENet) to the full model mirrors the single-module ablation results, with SENet contributing the most to clustering quality by recalibrating channel weights that emphasize fault-discriminative frequency bands.
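As an illustration of what the Silhouette Score measures, the following dependency-free Python sketch computes the mean silhouette coefficient s(i) = (b − a) / max(a, b), where a is the mean distance from sample i to the other members of its own cluster and b is the mean distance to the nearest other cluster. The toy points and labels are hypothetical, and the paper does not state which implementation or feature space was used for Table 7, so this is only a reference sketch of the metric itself.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient with Euclidean distance (O(n^2) sketch).

    Requires at least two distinct labels; singleton clusters score 0.
    """
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)

    def mean_distance(i, members):
        others = [j for j in members if j != i]
        if not others:
            return 0.0
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # singleton cluster contributes a coefficient of 0
        a = mean_distance(i, own)
        b = min(mean_distance(i, members)
                for other, members in clusters.items() if other != label)
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / len(points)


# Hypothetical 2-D embeddings: two tight groups that are far apart.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good_labels = [0, 0, 1, 1]    # labels agree with the geometry
mixed_labels = [0, 1, 0, 1]   # labels cut across the two groups

good_score = silhouette_score(toy_points, good_labels)
mixed_score = silhouette_score(toy_points, mixed_labels)
print(f"well-separated: {good_score:.4f}, mixed: {mixed_score:.4f}")
```

Values near 1 indicate compact, well-separated clusters, while values near or below 0 indicate overlapping assignments, which is the sense in which a higher score in Table 7 reflects better class separability.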

Beyond quantitative clustering quality, the spatial structure of the t-SNE feature space is consistent with known bearing vibration mechanisms. Outer race fault clusters are tightly grouped and clearly separated from other fault types, which aligns with the physical mechanism that outer race defects produce stationary-source impacts at a fixed angular position relative to the sensor, generating highly repeatable impulse patterns. Inner race fault clusters exhibit slightly larger intra-class spread, reflecting the amplitude modulation at shaft frequency caused by the rotating defect location. Ball fault clusters show the greatest dispersion and are positioned between inner and outer race clusters in the embedding space, consistent with the physical mechanism that rolling element defects produce impacts whose amplitude and frequency depend on the instantaneous ball position within the load zone and the varying contact angle during rotation. This correspondence between the learned feature space geometry and known vibration mechanics provides evidence that MDCAD-Net captures physically meaningful fault characteristics rather than merely exploiting superficial statistical correlations in the training data.

4.9. Noise Robustness Comparison

To evaluate the denoising block from a signal processing perspective, we compare MDCAD-Net against classical denoising preprocessing methods under various signal-to-noise ratio (SNR) conditions. Gaussian white noise is added to the test signals at SNR levels ranging from −6 dB to +6 dB. Table 8 presents the classification accuracy of MDCAD-Net compared with three classical denoising methods applied as preprocessing before the same CNN backbone (MDCAD-Net without the denoising block): wavelet thresholding (Daubechies-4, soft thresholding), empirical mode decomposition (EMD, first 5 IMFs reconstructed), and spectral filtering (4th-order Butterworth, cutoff at 5 kHz).
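A common way to inject Gaussian white noise at a prescribed SNR is to scale the noise variance to the measured signal power, using P_noise = P_signal / 10^(SNR/10). The Python sketch below follows this convention on a synthetic sine wave; the function names, the test signal, and the random seed are assumptions for illustration, and the exact noise-generation procedure used in the experiments is not described in this section.

```python
import math
import random


def add_white_gaussian_noise(signal, snr_db, rng):
    """Return the signal plus zero-mean Gaussian noise at the requested SNR (dB)."""
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def measured_snr_db(clean, noisy):
    """Empirical SNR in dB, computed from the clean signal and its noisy copy."""
    signal_power = sum(x * x for x in clean) / len(clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(signal_power / noise_power)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
clean = [math.sin(2 * math.pi * 0.01 * n) for n in range(20000)]  # synthetic stand-in signal
noisy = add_white_gaussian_noise(clean, -6, rng)
print(f"target SNR: -6 dB, measured: {measured_snr_db(clean, noisy):.2f} dB")
```

At −6 dB the noise power is roughly four times the signal power, which gives a sense of how severe the lowest SNR condition in Table 8 is.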

MDCAD-Net maintains classification accuracy above 87% even at −6 dB SNR, outperforming the best classical method (CNN+Wavelet) by 5.22 percentage points. The performance advantage increases under more severe noise conditions: at −6 dB, MDCAD-Net surpasses the no-denoising baseline by 9.22 points, while wavelet preprocessing only provides a 4.00-point improvement. This confirms that the learnable denoising block provides genuine noise suppression capability by jointly optimizing noise reduction and fault classification in an end-to-end manner, preserving fault-discriminative features that signal-agnostic preprocessing methods inadvertently remove.
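The statement that the advantage grows as noise becomes more severe can be verified directly from Table 8. The sketch below computes the accuracy margin of MDCAD-Net over wavelet preprocessing and over the no-denoising baseline at each SNR level; the variable and function names are illustrative.

```python
# Accuracy (%) copied from Table 8; "clean" marks the noise-free test set.
snr_levels = ["clean", 6, 0, -2, -4, -6]
mdcad_net = [98.93, 97.34, 95.12, 93.12, 90.67, 87.56]
cnn_wavelet = [98.45, 95.89, 92.67, 89.45, 86.12, 82.34]
cnn_no_denoise = [98.12, 94.23, 90.34, 86.12, 82.34, 78.34]


def margins(ours, reference):
    """Per-level accuracy margin in percentage points."""
    return [round(a - b, 2) for a, b in zip(ours, reference)]


gain_over_wavelet = margins(mdcad_net, cnn_wavelet)
gain_over_baseline = margins(mdcad_net, cnn_no_denoise)

for level, wavelet_gain, baseline_gain in zip(
        snr_levels, gain_over_wavelet, gain_over_baseline):
    print(f"SNR {level}: +{wavelet_gain} vs wavelet, +{baseline_gain} vs no denoising")
```

Both margins increase at every step from the clean condition down to −6 dB, consistent with the trend described in the text.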

4.10. Cross-Load Validation

To evaluate generalization under different operating conditions, we conduct cross-load validation experiments using the CWRU dataset collected under four motor loads: 0 HP (1797 RPM), 1 HP (1772 RPM), 2 HP (1750 RPM), and 3 HP (1730 RPM). In each experiment, the model is trained exclusively on data from one load condition and tested on a different load condition, with no overlap between training and testing data. Table 9 presents the results for MDCAD-Net and three representative baselines.

MDCAD-Net achieves an average cross-load accuracy of 95.62%, outperforming ResNet-18 by 1.91 percentage points, MA1DCNN by 4.15 points, and WenCNN by 5.69 points. The performance advantage over baselines is more pronounced in cross-load scenarios than in standard evaluation (Table 3), indicating that the multi-scale feature extraction and adaptive denoising mechanisms contribute to learning load-invariant fault representations that generalize across varying operating conditions. MDCAD-Net maintains accuracy above 94% across all six transfer scenarios, demonstrating robust cross-condition generalization.
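The averages and margins quoted for the cross-load experiments follow from the six transfer scenarios in Table 9. A brief Python sketch of the calculation is given below; the dictionary layout and variable names are illustrative choices.

```python
from statistics import mean

# Accuracy (%) per transfer scenario, copied from Table 9 in row order:
# 0HP->1HP, 0HP->2HP, 0HP->3HP, 1HP->0HP, 2HP->0HP, 3HP->0HP.
cross_load = {
    "MDCAD-Net": [97.12, 95.78, 94.23, 96.45, 95.34, 94.78],
    "ResNet-18": [95.34, 93.89, 92.12, 94.67, 93.45, 92.78],
    "MA1DCNN": [93.12, 91.78, 89.67, 92.45, 91.23, 90.56],
    "WenCNN": [91.78, 90.23, 88.12, 90.89, 89.67, 88.89],
}

averages = {model: round(mean(scores), 2) for model, scores in cross_load.items()}
margin_over = {
    model: round(averages["MDCAD-Net"] - avg, 2)
    for model, avg in averages.items()
    if model != "MDCAD-Net"
}
worst_case = min(cross_load["MDCAD-Net"])

print(averages)
print(margin_over)
print(f"lowest MDCAD-Net scenario accuracy: {worst_case}")
```

The lowest single-scenario accuracy for MDCAD-Net is the 0HP → 3HP transfer, the pair of conditions with the largest gap in load.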

5. Conclusions

This paper presents MDCAD-Net, a multi-dilated convolution attention denoising network that integrates multi-scale temporal feature extraction, dual attention mechanisms, and adaptive denoising within a unified end-to-end framework for bearing fault diagnosis. On the CWRU benchmark, MDCAD-Net achieves 98.93% accuracy, outperforming eight baseline models. Combined ablation studies confirm genuine inter-module synergy among all four components. Cross-load validation yields an average accuracy of 95.62%, surpassing the best baseline by 1.91 percentage points, while noise robustness experiments demonstrate that the learnable denoising block maintains above 87% accuracy at −6 dB SNR, outperforming classical denoising methods across all tested conditions. Quantitative clustering metrics further confirm that the learned representations are well-separated and consistent with known bearing vibration mechanisms.

The results demonstrate the effectiveness of the proposed framework under both standard and simulated noisy conditions. Although the current evaluation relies on artificially injected Gaussian noise rather than real industrial measurements, the consistent performance advantage across a wide range of SNR levels suggests promising potential for practical deployment. Future research will focus on extending MDCAD-Net to other rotating machinery diagnosis tasks, investigating transfer learning strategies for limited-data scenarios, incorporating multi-modal sensor information, and validating the approach on industrial datasets with authentic measurement noise.

Author Contributions

Conceptualization, R.D., R.Y. and G.J.; Data curation, R.D. and G.J.; Formal analysis, R.D. and G.J.; Funding acquisition, R.Y.; Investigation, R.D. and G.J.; Methodology, R.D. and R.Y.; Project administration, R.Y.; Resources, R.Y.; Software, R.D. and G.J.; Supervision, R.Y.; Validation, R.D., R.Y. and G.J.; Visualization, R.D. and G.J.; Writing—original draft, R.D., G.J. and R.Y.; Writing—review and editing, R.D., G.J. and R.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of Hubei Province of China (grant number 2024AFB428).

Data Availability Statement

The data presented in this study are openly available in Case Western Reserve University Bearing Data Center at https://engineering.case.edu/bearingdatacenter/download-data-file (accessed on 1 March 2026).

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

Du, Y.; Geng, X.; Zhang, B.; Zhou, Q.; Cheng, S.; Zhang, H. Novel Bearing Fault Diagnosis Model for Wind Turbine: PSR-CNN-DLSTM Combined with Transfer Learning. Chin. J. Mech. Eng. 2025, 39, 100204.
Hu, L.; Yao, L.; Tang, X.; Zhang, S.; Xu, Y. A feature-guided adaptive frequency band optimization framework for fault diagnosis of wind turbine pitch bearings. Int. J. Electr. Power Energy Syst. 2026, 175, 111585.
Sheng, C.; Zhang, M.; Rao, X.; Huang, M.; Zhang, X. Adaptive filtering and multi-scale MixStyle network for cross-domain fault diagnosis of marine electric thruster bearings. Mech. Syst. Signal Process. 2026, 246, 113910.
Chen, Z.; Xu, B.; Zhang, Z. Rolling Bearing Fault Diagnosis Based on Recurrence Plot. IEEE Access 2024, 12, 149710–149721.
Ma, J.; Bai, X.; Ma, F.; Zhuo, S.; Sun, B.; Li, C. Convolutional Neural Network Design Based on Weak Magnetic Signals and Its Application in Aircraft Bearing Fault Diagnosis. IEEE Sens. J. 2024, 24, 36031–36043.
Luu, T.T.; Huynh, D.A. An efficient lightweight multi-scale CNN framework with CBAM and SPP for bearing fault diagnosis. Intell. Syst. Appl. 2026, 29, 200628.
Wei, W.; Yuan, Y. VMD-DCA-BiGRU wind turbine bearing fault diagnosis method with attention mechanism integration. Measurement 2026, 266, 120466.
Wu, L.; Ding, N.; Wang, L.; Li, J.; Zhang, H. Physics-informed attention LSTM: A dual-knowledge fusion framework for interpretable bearing fault diagnosis under small data scenarios. Measurement 2026, 263, 120216.
Su, H.; Zhang, H.; Yan, X. Physics-informed blind wavelet deconvolution transfer network for cross-machine bearing fault diagnosis. Eng. Appl. Artif. Intell. 2026, 166, 113541.
Deng, F.; Hao, R.; Yang, S. Dual-space multi-node collaborative transfer learning for cross-condition and cross-machine axlebox bearing fault diagnosis. Chin. J. Mech. Eng. 2026, 100219.
Huang, Q.; Li, J.; Wang, J.; Meng, Z.; Zhang, J. Cross-scenarios few-shot fault diagnosis for rolling bearings via inter-domain similarity-guided meta-learning with dual-attention multiscale feature denoising network. Eng. Appl. Artif. Intell. 2026, 167, 113828.
Liu, C.; Yuan, L.; Lv, K.; Ran, T.; Li, C. AOVMD-ScaleShift-Net: An Integrated Framework for Enhanced Rolling Bearing Fault Diagnosis. IEEE Signal Process. Lett. 2025, 32, 4274–4278.
Chen, Y.; Shi, J.; Shen, C.; Yang, H.; Hua, Z.; Huang, W.; Zhu, Z. Time-frequency aware feature disentanglement learning for intelligent bearing fault diagnosis under variable speed conditions. Expert Syst. Appl. 2026, 303, 130664.
Cheng, J.; Ran, R.; Fang, B.; Li, B. Symmetric Positive Definite manifold deep metric learning for bearing fault diagnosis. Eng. Appl. Artif. Intell. 2026, 167, 113821.
Zou, Y.; Luo, S.; Wu, X.; Jiang, Q.; Zhang, P.; Du, T.; Zhang, Y.; Sun, P.; Xu, M. MEHD-Net: A multimodal-enhanced hybrid denoising network for marine gearbox bearing fault diagnosis under extreme noise. Measurement 2026, 267, 120504.
Liu, C.; Luo, Z.; Huang, Z.; Sang, Y.; Wang, X. Multikernel correntropy transfer robust dictionary learning and its application in bearing fault diagnosis. Eng. Appl. Artif. Intell. 2026, 165, 113466.
Hu, Q.; Fu, X.; Sun, D.; Xu, D.; Guan, Y. A Lightweight Rolling Bearing Fault Diagnosis Method Based on Multiscale Depth-Wise Separable Convolutions and Network Pruning. IEEE Access 2024, 12, 186131–186144.
Chen, L.; He, Y.; Tan, A.; Bai, X.; Li, Z.; Wang, X. Fault Diagnosis of Gearbox Bearings Based on Multi-Feature Fusion Dual-Channel CNN-Transformer-CAM. Machines 2026, 14, 92.
Shi, J.; Qi, L.; Ye, S.; Li, C.; Jiang, C.; Ni, Z.; Zhao, Z.; Tong, Z.; Fei, S.; Tang, R.; et al. Intelligent Fault Diagnosis of Rolling Bearings Based on Digital Twin and Multi-Scale CNN-AT-BiGRU Model. Symmetry 2025, 17, 1803.
Chen, J.; Hu, W.; Zhang, G.; Hu, J.; Huang, Q.; Chen, Z.; Blaabjerg, F. A Novel Knowledge Sharing Method for Rolling Bearing Fault Detection Against Impact of Different Signal Sampling Frequencies. IEEE Trans. Instrum. Meas. 2023, 72, 3512912.
Xia, Z.; Dong, S.; Zou, S.; Zhu, S.; Zhao, X. Cross-scale hybrid contrast network for generalized zero-shot fault diagnosis of rolling bearings. Eng. Appl. Artif. Intell. 2026, 166, 113608.
Lei, N.; Cui, J.; Han, J.; Chen, X.; Tang, Y. Rolling Bearing Fault Diagnosis Using Deep Transfer Learning Based on Joint Generalized Sliced Wasserstein Distance. IEEE Access 2024, 12, 41452–41463.
Zhang, J.; Zhang, K.; An, Y.; Luo, H.; Yin, S. An Integrated Multitasking Intelligent Bearing Fault Diagnosis Scheme Based on Representation Learning Under Imbalanced Sample Condition. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 6231–6242.
Xu, D.; Jia, M.; Chen, T.; Liu, Y.; Chen, D.; Chai, T.; Yang, T. Federated Open-Set Fault Diagnosis for Unknown Bearing Fault Detection. Mech. Syst. Signal Process. 2026, 246, 113917.
Jia, F.; Hao, L.; Yao, P.; Shen, J.; Huang, H.; Yu, T.; Xu, X. Simulation knowledge transfer: A new approach for intelligent fault diagnosis of rolling bearings using simulation-reality domain mixup adaptation. Int. J. Struct. Integr. 2025, 17, 102–133.
Liu, X.; Sun, W.; Li, H.; Wang, Z.; Li, Q. Imbalanced Sample Fault Diagnosis of Rolling Bearing Using Deep Condition Multidomain Generative Adversarial Network. IEEE Sens. J. 2023, 23, 1271–1285.
Luo, P.; Yin, Z.; Yuan, D.; Gao, F.; Liu, J. An Intelligent Method for Early Motor Bearing Fault Diagnosis Based on Wasserstein Distance Generative Adversarial Networks Meta Learning. IEEE Trans. Instrum. Meas. 2023, 72, 3517611.
Wu, Z.; Wang, S.; Liu, C.; Wu, H.; Yi, J.; Pang, Y.; Cheng, G. A Data Augmentation Method for Shearer Rocker Arm Bearing Fault Diagnosis Based on GA-WT-SDP and WCGAN. Machines 2026, 14, 144.
Liu, X.; Zhang, R.; Fan, J.; Li, L.; Li, Z.; Zhou, T. A Hybrid Denoising Model for Rolling Bearing Fault Diagnosis: Improved Edge Strategy Whale Optimization Algorithm-Based Variational Mode Decomposition and Dataset-Specific Wavelet Thresholding. Symmetry 2026, 18, 168.
Shi, G.; Qin, C.; Xia, P.; Zhang, Z.; Liu, C. Generalized envelope nonlinear Gini index-gram guided two-stage chirp mode decomposition for shield machine main bearing fault diagnosis. Adv. Eng. Inform. 2026, 71, 104354.
Guo, D.; Chen, J.; Liu, X.; Fei, J. Research on Rolling Bearing Fault Diagnosis Based on IRBMO-CYCBD. Mathematics 2026, 14, 201.
Yikun, E.; Feng, Z.; Liu, Z.; Chen, H.; Zhang, Y.; Gao, T.; Lin, R. Dynamic modeling and vibration analysis for cycloid gear bearing fault diagnosis of rotate vector reducers. Mech. Syst. Signal Process. 2026, 244, 113748.
Chen, T.; Guo, L.; Gao, H.; Feng, T.; Yu, Y. Clustering Weighted Envelope Spectrum for Rolling Bearing Fault Diagnosis. IEEE Trans. Autom. Sci. Eng. 2025, 22, 3922–3932.
Fu, L.; Dun, G.; Liu, J.; Tan, D.; Ma, Z. Fault diagnosis via resonance demodulation based on the dynamic characteristics of defective bearings. Measurement 2026, 264, 120292.
Cao, Z.; Dai, J.; Xu, W.; Chang, C. Sparse Bayesian Learning Approach for Compound Bearing Fault Diagnosis. IEEE Trans. Ind. Inform. 2024, 20, 1562–1574.
Zhang, C.; Wei, S.; Dong, G.; Zeng, Y.; Zhu, G.; Zhou, X.; Liu, F. Time-Domain Sparsity-Based Bearing Fault Diagnosis Methods Using Pulse Signal-to-Noise Ratio. IEEE Trans. Instrum. Meas. 2024, 73, 3516804.
Li, Y.; Ding, Q. Geometric entropy: A novel nonlinear measure for bearing fault diagnosis. Adv. Eng. Inform. 2026, 69, 104031.
Liu, Y.; Zhou, Y.; Hu, T.; Li, Z.; Li, B.; Liu, M. A weak-fault diagnosis method for bearings based on quadratic Manhattan entropy. Measurement 2026, 264, 120242.
Liu, Y.; Cheng, Y.; Yang, S.; Wu, J. Vibration signal models for fault diagnosis of rolling element bearing. Measurement 2026, 264, 120303.
Peng, C.; Sheng, Y.; Gui, W.; Tang, Z.; Li, C. A Rolling Bearing Fault Diagnosis Method Based on Multimodal Knowledge Graph. IEEE Trans. Ind. Inform. 2024, 20, 13047–13057.
Wang, J.; Chen, Y.; Liu, K.; Zhu, Z.Q.; Wei, D.; Zhou, S.; Li, K.; Luan, H. Multi-Electrical Signal Analysis Based Bearing Fault Diagnosis for Permanent Magnet Synchronous Machines Under Uncertain Controller Bandwidths. IEEE Trans. Energy Convers. 2024, 39, 2514–2528.
Liu, Z.; Xu, K.; Miao, X.; He, Q.; Pan, Y.; Yu, H. Online diagnosis for bogie bearing unbalance impact faults of railway trains with wavelet phase space reconstruction mechanism. Ain Shams Eng. J. 2026, 17, 103970.
Jiang, Z.; Dong, Z.; Fu, X.; Gao, Z.; Gong, L.; Cao, J.; Qi, Y.; Liu, T.; Li, W.; Zhang, C. Weak vibration energy powered acceleration monitoring system for bearing fault diagnosis. Mech. Syst. Signal Process. 2026, 244, 113823.
Dong, F.; Li, P.; Chen, Z.; Yu, G.A.; Du, H.; Zhang, Z.; Sha, W.; Huo, Y.; Du, T.; Xu, M. Development of an embedded triboelectric rolling bearing for fault diagnosis and self-powered monitoring. Tribol. Int. 2026, 216, 111570.
Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
Li, S.Y.; Gu, K.R. Smart fault-detection machine for ball-bearing system with chaotic mapping strategy. Sensors 2019, 19, 2178.
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826.
Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2015, arXiv:1409.1556.
Wen, L.; Li, X.; Gao, L.; Zhang, Y. A new convolutional neural network-based data-driven fault diagnosis method. IEEE Trans. Ind. Electron. 2018, 65, 5990–5998.
Wang, H.; Liu, Z.; Peng, D.; Qin, Y. Understanding and learning discriminant features based on multiattention 1DCNN for wheelset bearing fault diagnosis. IEEE Trans. Ind. Inform. 2020, 16, 5735–5745.
Zhao, H.; Sun, S.; Jin, B. Sequential fault diagnosis based on LSTM neural network. IEEE Access 2018, 6, 12929–12939.
Wu, H.; Xu, J.; Wang, J.; Long, M. Autoformer: Decomposition Transformers with Auto-Correlation for long-term series forecasting. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Sydney, Australia, 2021; pp. 22419–22430.
Chowdhury, R.R.; Zhang, X.; Shang, J.; Gupta, R.K.; Hong, D. TARNet: Task-aware reconstruction for time-series Transformer. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 14–18 August 2022; pp. 212–220.

Figure 1. Overall architecture of MDCAD-Net. (a) Main network with cascaded modules. (b) Multi-Dilated Convolutions (MDC). (c) Squeeze-and-Excitation Networks (SENet). (d) Convolutional Block Attention Module (CBAM).

Figure 2. Architecture of the Denoising Block with gated attention mechanism and residual connections.

Figure 3. CWRU bearing fault test rig.

Figure 4. Multi-domain visualization of bearing vibration signals under different health conditions. (a–c) Time-domain waveforms of normal condition (a), inner race fault (b), and ball fault (c). (d–f) Corresponding frequency-domain spectra. (g,h) Time-frequency spectrograms of normal (g) and outer race fault (h) signals. (i) Statistical feature comparison (root mean square, RMS) across all bearing conditions.

Figure 5. Confusion matrix for MDCAD-Net.

Figure 6. Per-class performance metrics for MDCAD-Net. (a) Accuracy. (b) Precision. (c) Recall. (d) F1-score.

Figure 7. Parameter sensitivity analysis. (a) Effect of learning rate; (b) Effect of batch size; (c) Effect of dilation rates; (d) Effect of SENet reduction ratio; (e) Effect of CBAM reduction ratio; (f) Effect of convolution kernel size. The four metrics (Accuracy, Precision, Recall, and F1-score) are used to evaluate the model performance under different parameter configurations.

Figure 8. t-SNE visualization of learned feature representations for MDCAD-Net.

Table 1. CWRU bearing fault classes and their descriptions.

Class	Description	Fault Type	Fault Size (Inches)
Normal	Healthy bearing	-	-
7-Inner	Inner race fault	IR	0.007
7-Ball	Ball fault	Ball	0.007
7-Outer	Outer race fault	OR	0.007
14-Inner	Inner race fault	IR	0.014
14-Ball	Ball fault	Ball	0.014
14-Outer	Outer race fault	OR	0.014
21-Inner	Inner race fault	IR	0.021
21-Ball	Ball fault	Ball	0.021
21-Outer	Outer race fault	OR	0.021

Table 2. MDCAD-Net hyperparameter configuration.

Parameter	Value
Number of Classes	10
Initial Conv Channels	64
SENet Reduction Ratio	16
MDC Kernel Sizes	3, 5
MDC Dilation Rate	2
CBAM Channel Reduction	16
CBAM Spatial Kernel	7
Optimizer	Adam
Learning Rate	0.001
LR Scheduler	ReduceLROnPlateau
Batch Size	32
Training Epochs	30
Loss Function	Cross-Entropy

Table 3. Performance comparison on bearing fault diagnosis. Bold values indicate the best results.

Model	Accuracy	Precision	Recall	F1-Score
MDCAD-Net (Ours)	0.9893	0.9894	0.9893	0.9892
ResNet-18	0.9854	0.9856	0.9854	0.9853
Inception-v3	0.9821	0.9824	0.9821	0.9820
MA1DCNN	0.9765	0.9768	0.9765	0.9764
TARNet	0.9687	0.9690	0.9687	0.9686
WenCNN	0.9612	0.9615	0.9612	0.9611
VGG-16	0.9543	0.9548	0.9543	0.9542
Autoformer	0.9401	0.9405	0.9401	0.9400
SequentialLSTM	0.9487	0.9521	0.9487	0.9503

Table 4. Ablation study results. Bold values indicate the best results.

Model Variant	Accuracy	Precision	Recall	F1-Score
MDCAD-Net (Full)	0.9893	0.9894	0.9893	0.9892
w/o Denoising Block	0.9812	0.9815	0.9812	0.9811
w/o CBAM	0.9754	0.9758	0.9754	0.9753
w/o MDC	0.9631	0.9636	0.9631	0.9630
w/o SENet	0.9487	0.9493	0.9487	0.9485

Table 5. Extended ablation study with combined module removal. Δ_combined denotes the accuracy drop from the full model; Δ_sum denotes the sum of individual drops.

Model Variant	Accuracy	Δ_combined	Δ_sum	Interaction
MDCAD-Net (Full)	0.9893	-	-	-
w/o SENet & CBAM	0.9289	−6.04%	−5.45%	Synergy
w/o SENet & MDC	0.9156	−7.37%	−6.68%	Synergy
w/o MDC & Denoise	0.9504	−3.89%	−3.43%	Synergy
w/o CBAM & Denoise	0.9623	−2.70%	−2.20%	Synergy
Baseline (Conv only)	0.9078	−8.15%	-	-

Table 6. Effect of module placement order on classification accuracy. Bold values indicate the best results.

Module Order	Accuracy	Precision	Recall	F1
Conv→SE→MDC→CBAM→Denoise (Ours)	0.9893	0.9894	0.9893	0.9892
Conv→MDC→SE→CBAM→Denoise	0.9823	0.9826	0.9823	0.9822
Conv→SE→CBAM→MDC→Denoise	0.9789	0.9792	0.9789	0.9788
Conv→CBAM→MDC→SE→Denoise	0.9756	0.9760	0.9756	0.9755
Conv→Denoise→SE→MDC→CBAM	0.9612	0.9618	0.9612	0.9611

Table 7. Quantitative evaluation of feature clustering quality. ↑ indicates higher is better; ↓ indicates lower is better. Bold values indicate the best results.

Model Variant	Silhouette ↑	Davies-Bouldin ↓	Calinski-Harabasz ↑
MDCAD-Net (Full)	0.8234	0.3567	2845.6
w/o Denoising Block	0.7856	0.4123	2534.2
w/o CBAM	0.7523	0.4534	2312.8
w/o MDC	0.7189	0.5012	2078.4
w/o SENet	0.6734	0.5623	1823.6
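
For readers unfamiliar with the first metric in Table 7, the following is a minimal pure-Python sketch of the mean silhouette coefficient with Euclidean distance. It is not the authors' evaluation code; the function name, the toy 1-D points, and the conventions for singleton clusters and zero distances are assumptions made for illustration.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all samples (Euclidean distance)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    def mean_dist(i, members):
        # Mean distance from sample i to the listed members, excluding i itself.
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    scores = []
    for i, lab in enumerate(labels):
        if len(clusters[lab]) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        a = mean_dist(i, clusters[lab])  # cohesion: own cluster
        b = min(mean_dist(i, m) for other, m in clusters.items() if other != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy example: two tight, well-separated 1-D clusters.
toy_points = [(0.0,), (1.0,), (10.0,), (11.0,)]
toy_labels = [0, 0, 1, 1]
print(round(silhouette_score(toy_points, toy_labels), 4))
```

Values close to 1 indicate compact, well-separated clusters, which is the sense in which a higher silhouette score in Table 7 is better.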

Table 8. Classification accuracy (%) under different SNR levels with various denoising strategies. Bold values indicate the best results.

SNR (dB)	MDCAD-Net	CNN+Wavelet	CNN+EMD	CNN+Spectral	CNN (No Denoise)
Clean	98.93	98.45	97.89	98.23	98.12
6	97.34	95.89	95.12	95.45	94.23
0	95.12	92.67	91.78	92.12	90.34
−2	93.12	89.45	88.56	88.89	86.12
−4	90.67	86.12	85.23	85.56	82.34
−6	87.56	82.34	81.45	81.78	78.34
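
The SNR levels in Table 8 follow the usual definition SNR = 10·log10(P_signal / P_noise). As a hedged illustration of how a test signal at a given SNR might be produced, the sketch below adds white Gaussian noise scaled to a target SNR. The additive Gaussian noise model, the function names, and the synthetic sine signal are assumptions; this section does not state how the noisy inputs were generated.

```python
import math
import random


def add_noise_at_snr(signal, snr_db, rng=None):
    """Return signal plus white Gaussian noise scaled to the target SNR in dB."""
    rng = rng or random.Random(0)
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def measured_snr_db(clean, noisy):
    """Empirical SNR in dB of a noisy signal relative to its clean version."""
    p_signal = sum(c * c for c in clean) / len(clean)
    p_noise = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(p_signal / p_noise)


# Synthetic stand-in for a vibration segment (assumption, not CWRU data).
clean = [math.sin(2 * math.pi * k / 50) for k in range(20000)]
for level in (6, 0, -2, -4, -6):
    noisy = add_noise_at_snr(clean, level, random.Random(1))
    print(level, round(measured_snr_db(clean, noisy), 2))
```

At −6 dB the noise power is roughly four times the signal power, which gives a sense of how severe the last row of the table is.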

Table 9. Cross-load validation accuracy (%). Models trained on one load and tested on a different load. Bold values indicate the best results.

Train → Test	MDCAD-Net	ResNet-18	MA1DCNN	WenCNN
0HP → 1HP	97.12	95.34	93.12	91.78
0HP → 2HP	95.78	93.89	91.78	90.23
0HP → 3HP	94.23	92.12	89.67	88.12
1HP → 0HP	96.45	94.67	92.45	90.89
2HP → 0HP	95.34	93.45	91.23	89.67
3HP → 0HP	94.78	92.78	90.56	88.89
Average	95.62	93.71	91.47	89.93
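
The "Average" row of Table 9 is the arithmetic mean of the six transfer settings for each model. A short check, with the values copied from the table and an illustrative helper name (`column_averages`):

```python
from statistics import mean

# Cross-load accuracies (%) from Table 9, in row order:
# 0HP→1HP, 0HP→2HP, 0HP→3HP, 1HP→0HP, 2HP→0HP, 3HP→0HP.
cross_load = {
    "MDCAD-Net": [97.12, 95.78, 94.23, 96.45, 95.34, 94.78],
    "ResNet-18": [95.34, 93.89, 92.12, 94.67, 93.45, 92.78],
    "MA1DCNN": [93.12, 91.78, 89.67, 92.45, 91.23, 90.56],
    "WenCNN": [91.78, 90.23, 88.12, 90.89, 89.67, 88.89],
}


def column_averages(table):
    """Mean accuracy per model, rounded to two decimals as in the table."""
    return {model: round(mean(values), 2) for model, values in table.items()}


for model, avg in column_averages(cross_load).items():
    print(f"{model}: {avg:.2f}")
```

The recomputed means agree with the reported averages of 95.62, 93.71, 91.47, and 89.93.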
