Bearing Fault Diagnosis Based on a Depthwise Separable Atrous Convolution and ASPP Hybrid Network

Gu, Xiaojiao; Liu, Chuanyu; Li, Jinghua; Yu, Xiaolin; Tian, Yang

doi:10.3390/machines14010093


by Xiaojiao Gu ¹, Chuanyu Liu ¹, Jinghua Li ², Xiaolin Yu ¹,* and Yang Tian ¹

¹ College of Mechanical Engineering, Shenyang Ligong University, Nanping Middle Road 6, Shenyang 110159, China

² Mechanical Engineering, Dalian University of Technology, Dalian 116024, China

* Author to whom correspondence should be addressed.

Machines 2026, 14(1), 93; https://doi.org/10.3390/machines14010093

Submission received: 16 December 2025 / Revised: 7 January 2026 / Accepted: 7 January 2026 / Published: 13 January 2026

(This article belongs to the Special Issue Data-Driven RUL Prediction: Innovations in Generalization, Uncertainty, and Efficiency for Industrial PHM)


Abstract

To address the computational redundancy, inadequate multi-scale feature capture, and poor noise robustness of traditional deep networks used for bearing vibration and acoustic signal feature extraction, this paper proposes a fault diagnosis method based on Depthwise Separable Atrous Convolution (DSAC) and Atrous Spatial Pyramid Pooling (ASPP). First, the Continuous Wavelet Transform (CWT) is applied to the vibration and acoustic signals to convert them into time–frequency representations. The vibration CWT is then fed into a multi-scale feature extraction module to obtain preliminary vibration features, whereas the acoustic CWT is processed by a Deep Residual Shrinkage Network (DRSN). The two feature streams are concatenated in a feature fusion module and subsequently fed into the DSAC and ASPP modules, which together expand the effective receptive field and aggregate multi-scale contextual information. Finally, global pooling followed by a classifier outputs the bearing fault category, enabling high-precision bearing fault identification. Experimental results show that, under both clean data and multiple low signal-to-noise ratio (SNR) noise conditions, the proposed DSAC-ASPP method achieves higher accuracy and lower variance than baselines such as ResNet, VGG, and MobileNet, while requiring fewer parameters and FLOPs and exhibiting superior robustness and deployability.

Keywords:

rolling bearings; fault diagnosis; deep learning; acoustic vibration signals

1. Introduction

Rolling bearings are core components of rotating machinery, and their operating condition directly affects system performance, reliability, and safety, which makes them indispensable in critical fields such as aerospace and automotive manufacturing. In real industrial environments, harsh operating conditions, such as high temperatures, heavy loads, and contamination, frequently cause rolling bearings to develop a diverse range of faults such as fatigue spalling, wear, plastic deformation, and corrosion [1]. These faults not only degrade equipment performance but can also lead to unexpected shutdowns and cascading failures in severe cases. Therefore, developing efficient and accurate bearing fault diagnosis methods is of great engineering significance [2]. Traditional bearing fault diagnosis methods typically rely on the manual extraction of time-domain and frequency-domain features from signals such as vibration and temperature, followed by classification using shallow models such as support vector machines (SVMs) and artificial neural networks [3,4]. However, these approaches depend heavily on empirical knowledge and hand-crafted feature engineering, making it difficult to comprehensively capture fault characteristics under complex operating conditions. Moreover, shallow models have limited capability to model strong nonlinear relationships and complex noise, resulting in insufficient diagnostic accuracy and robustness in real industrial environments [5,6].

With the rapid development of modern industry and intelligent manufacturing, data-driven deep learning methods have become a research hotspot in bearing fault diagnosis [5,6]. Convolutional neural networks (CNNs), as representative deep learning models, adopt a hierarchical architecture composed of convolutional, pooling, and fully connected layers to automatically learn multi-level features from raw signals or their transformed representations. Comparative evidence has shown that CNN pipelines built on time–frequency representations can differ substantially from signal-driven (or signal-preprocessing-driven) CNNs in fault detection performance, motivating careful input-representation selection [7]. After achieving remarkable success in tasks such as image recognition and speech processing, CNNs have also been widely applied to bearing fault diagnosis [8,9,10]. Beyond conventional CNNs, Jiang et al. proposed a convolutional capsule network for rolling bearing fault diagnosis, leveraging capsule representations to enhance discriminative feature learning under complex conditions [11]. Song et al. proposed a CNN–BiLSTM transfer learning scheme based on an improved particle swarm optimization algorithm, achieving high diagnostic accuracy under multiple operating conditions and limited-sample scenarios [12]. Jia et al. introduced a bearing fault diagnosis method based on the Gram Matrix Time–Frequency Enhancement Network (GTFE-Net). They first designed a GNR denoising strategy that leverages the periodic self-similarity of vibration signals. Then they constructed an end-to-end CNN model with parallel processing branches for raw signals, GNR-denoised signals, and spectra. The integration of GNR reduces noise interference and enhances the efficiency of fault feature extraction [13]. Ruan et al. developed the PGCNN method, which adaptively sets the input and kernel sizes according to fault periods and envelope information derived from the physical characteristics of bearing signals. This approach alleviates the problems of arbitrary parameter design and performance instability in traditional CNNs [14]. However, as research has progressed, conventional convolutional networks still exhibit limitations when processing complex textures and multi-scale fault patterns: on the one hand, their high computational cost hinders deployment in resource-constrained scenarios; on the other hand, in complex industrial noise environments, abundant interference components in bearing signals can easily mask early-stage fault features. As a result, deep learning models find it difficult to stably and accurately extract effective information, thereby compromising diagnostic accuracy and reliability.

Parallel to architectural innovations, the field is witnessing a paradigm shift towards more adaptive and context-aware fault diagnosis frameworks. On the one hand, research focuses on developing models that can dynamically integrate physical knowledge to enhance generalization across varying scenarios [15]. On the other hand, advanced feature fusion methodologies have emerged as a powerful strategy to overcome the limitations of single-source data, particularly in high-noise industrial environments. One such technique is the hybrid multimodal fusion of multi-sensor signals [16]. These emerging directions underscore the critical importance of building powerful contextual modeling and adaptive calibration capabilities to tackle complex industrial patterns. Concurrently, the paradigm of foundation models, particularly large vision-language models (VLMs), is being explored for their powerful cross-modal understanding and generalization capabilities. For instance, frameworks like CNC-VLM demonstrate how aligning visual time–frequency patterns with textual knowledge can enhance diagnostic reasoning [17].

To address these challenges and overcome the limitations of traditional convolutional networks in multi-scale feature extraction, dilated convolution has been introduced as an effective technique [18,19]. Dilated (or atrous) convolution expands the receptive field by inserting holes into the convolution kernel without increasing the number of parameters. This enables shallower networks to capture feature information over a larger range, thereby facilitating the extraction of multi-scale fault features. Deng et al. combined dynamically adaptive dilated convolution with an attention mechanism for knowledge graph embedding and link prediction, demonstrating the advantages of dilated convolution in modeling global information while preserving local details [20]. Wang Yang et al. developed a multi-input fusion fault diagnosis model. They processed raw signals using the Fast Fourier Transform (FFT) and Continuous Wavelet Transform (CWT) to obtain frequency-domain signals and time–frequency representations. By integrating standard convolution, deformable convolution, and squeeze-and-excitation mechanisms, they achieved effective multimodal feature fusion [21]. Li et al. proposed FAC-CNN, a dilated-convolution-based network for fault diagnosis in industrial fans and centrifugal pumps, which achieved superior performance compared with traditional CNNs, ANNs, and SVMs [22]. These studies demonstrate the effectiveness of dilated convolutions for multi-scale feature extraction in fault diagnosis. However, when operating conditions and data volume are limited, relying solely on dilated convolution may still lead to overfitting. At the same time, effectively reducing parameter complexity and computational load while maintaining high accuracy has become a critical requirement for engineering applications of deep models.

Against this backdrop, depthwise separable atrous convolution (DSAC) has attracted significant attention because it substantially reduces the number of parameters and the computational load. Specifically, DSAC decomposes standard convolution into two steps: depthwise convolution and pointwise convolution. This approach preserves feature representation capability while significantly reducing the parameter count and computational complexity, thereby enhancing runtime efficiency and making it particularly suitable for resource-constrained industrial applications. Cui et al. proposed a lightweight rolling bearing fault diagnosis method based on the Gramian Angular Field (GAF) and coordinate attention (CA). By constructing a lightweight CNN with depthwise separable convolutions, inverted residual blocks, and linear bottleneck layers, combined with attention mechanisms to enhance key feature representations, they achieved low computational overhead and high diagnostic accuracy on the CWRU and experimental datasets [23]. Cao et al. designed a lightweight semantic segmentation network that uses depthwise separable convolutions as the basic building blocks for object segmentation tasks. This approach significantly improved inference speed while maintaining segmentation accuracy, further supporting the potential of depthwise separable designs such as DSAC for real-time applications [24].

When processing acoustic signals, residual networks are well-suited to modeling complex, non-stationary acoustic patterns because of their strong feature representation capability and deep architecture, and they demonstrate excellent characterization performance across different time scales and frequency components [25,26,27]. However, when processing signals that contain a large amount of redundant information, traditional residual networks struggle to explicitly separate critical information from noise components, which limits feature extraction efficiency and diagnostic robustness. To address this limitation, the deep residual shrinkage network (DRSN) introduces soft-thresholding and attention mechanisms within the residual network framework. This enables the adaptive suppression of non-informative features and realizes an integrated “feature-extraction–denoising” modeling framework [28,29,30]. Zhang et al. addressed the adverse effects of high environmental noise on vibration-signal-based fault diagnosis by proposing a novel global multi-attention deep residual shrinkage network (GMA-DRSN). This architecture employs multi-scale attention and shrinkage mechanisms to suppress environmental noise, thereby achieving superior performance in vibration-signal-based fault diagnosis [31]. Hu et al. integrated the whale optimization algorithm (WOA), variational mode decomposition (VMD), and DRSN to develop a multi-condition rotating machinery fault diagnosis method, which outperformed traditional approaches on multi-condition gear fault datasets [32]. Luo et al. further proposed a dual-channel DRSN (DC-DRSN) to address the insufficient diagnostic accuracy of planetary gear systems under low signal-to-noise ratio (SNR) conditions, achieving higher diagnostic precision and robustness [33]. These studies demonstrate that combining residual learning with adaptive shrinkage mechanisms is an effective approach to enhancing fault diagnosis performance in noisy environments.

For multi-scale feature aggregation, spatial pyramid pooling (SPP) is a commonly used structure that balances receptive-field expansion and feature alignment [34,35]. SPP uniformly pools feature maps of different spatial sizes into a fixed-dimensional representation, thereby capturing multi-scale receptive-field information. However, expanding the receptive field often reduces spatial resolution and introduces a strong dependence on the data distribution. In contrast, atrous spatial pyramid pooling (ASPP) integrates local details and global contextual information by employing parallel branches of dilated convolutions with different dilation rates, without significantly increasing the number of parameters. This further enhances the model’s ability to represent multi-scale features [36,37].

To address these challenges while accounting for the characteristics of bearing vibration and acoustic signals, this paper proposes a hybrid network model for bearing fault diagnosis, termed DSAC–ASPP, which combines depthwise separable atrous convolutions with atrous spatial pyramid pooling. The proposed method first synchronously acquires bearing vibration and acoustic signals from an experimental platform, and then maps both time-domain signals into time–frequency representations via the Continuous Wavelet Transform (CWT), which is consistent with mainstream time–frequency-CNN diagnostic pipelines and their reported advantages in fault detection [7]. Subsequently, the vibration time–frequency map is fed into a multi-scale feature extraction module to obtain preliminary vibration features, whereas the acoustic time–frequency map is fed into a deep residual shrinkage network (DRSN). Residual learning and adaptive soft-thresholding shrinkage are employed to suppress strong noise and extract robust acoustic features. The two feature streams are then concatenated in a feature fusion module and fed into a lightweight convolutional backbone based on depthwise separable atrous convolution (DSAC) and atrous spatial pyramid pooling (ASPP). This design expands the effective receptive field and aggregates multi-scale contextual information while reducing the parameter count and computational overhead. Building on this, a SimAM lightweight attention mechanism is introduced to recalibrate high-level feature representations. Finally, global pooling followed by a classifier outputs the bearing fault category, thereby enhancing recognition accuracy in complex industrial noise environments. Furthermore, experimental validation on public datasets and real-world industrial data demonstrates that the proposed method can achieve efficient and accurate bearing fault diagnosis under complex industrial noise conditions. This provides a new and effective approach for bearing health management and fault prediction, and it has important theoretical value and engineering application significance.

The integration of advanced feature extraction, noise resilience, and computational efficiency into a unified diagnostic framework remains an outstanding challenge in the field of bearing fault diagnosis. Current methodologies often excel in one or two aspects—such as the lightweight design of depthwise separable convolutions for efficiency, the targeted noise suppression of shrinkage networks, or the fusion of multiple signal modalities—yet a holistic solution is still lacking. To precisely position our contribution within the existing literature and clarify its distinctive advantages, a comparative analysis is presented in Table 1.

The remainder of this paper is organized as follows. Section 2 presents the theoretical background and key building blocks, including Depthwise Separable Atrous Convolution (DSAC), Atrous Spatial Pyramid Pooling (ASPP), Deep Residual Shrinkage Network (DRSN), and the lightweight SimAM attention mechanism, and then formulates the proposed dual-channel DSAC–ASPP fault diagnosis framework. Section 3 describes the experimental setup and data preprocessing pipeline (including the CWT-based time–frequency transformation for vibration and acoustic signals), details the training and evaluation procedure, and reports comparative results against representative CNN backbones under both clean and noisy conditions. Section 4 summarizes the main findings and conclusions of this study.

2. Theoretical Background

2.1. Depthwise Separable Atrous Convolution (DSAC)

Depthwise Separable Atrous Convolution (DSAC) is an efficient convolutional variant that synergistically combines the parameter efficiency of Depthwise Separable Convolution with the multi-scale receptive field advantage of Atrous Convolution (Dilated Convolution). This design aims to drastically reduce the model’s parameter complexity and computational cost while enhancing its ability to extract multi-scale fault features from time–frequency representations.

Standard Convolution is the fundamental operation. For an input feature map X ∈ R^{H × W × M} (where H, W, and M are the height, width, and number of input channels), applying N filters of size K_h × K_w × M produces an output Y ∈ R^{H × W × N}. Its computational cost in Floating-Point Operations (FLOPs) is

C_{std} = H \times W \times M \times N \times K_{h} \times K_{w}

(1)

In this equation, H × W represents the spatial size of the output feature map.

Depthwise Separable Convolution factorizes this operation into two stages to lower computational cost, a decomposition strategy that forms the foundation for building efficient diagnostic models [38].

Depthwise Convolution: A spatial convolution is applied independently to each input channel using M filters of size K_h × K_w × 1. Because each filter sees only one channel, the cost does not depend on N:

C_{dw} = H \times W \times M \times K_{h} \times K_{w}

(2)

Pointwise Convolution: A 1 × 1 convolution mixes the channels from the depthwise output. Using N filters of size 1 × 1 × M, its cost is

C_{pw} = H \times W \times M \times N

(3)

The total cost for depthwise separable convolution is C_dsc = C_dw + C_pw. The factorization process of depthwise separable convolution is illustrated in Figure 1.

Atrous Convolution enlarges the receptive field without increasing parameters or computation by inserting “holes” (zeros) between kernel elements, controlled by a dilation rate r. A base kernel of size K has an effective receptive field of K + (K − 1)(r − 1) when dilated, which has been proven effective for extracting fault features under variable conditions [39]. DSAC integrates the atrous mechanism into the depthwise convolution step. Therefore, the computational cost of a DSAC module, C_DSAC, is the sum of atrous depthwise convolution and pointwise convolution:

C_{DSAC} = H \times W \times M \times K_{h} \times K_{w} + H \times W \times M \times N

(4)

The ratio of its cost to that of standard convolution is in Equation (5):

R = \frac{C_{DSAC}}{C_{std}} = \frac{H \times W \times M \times (K_{h} \times K_{w} + N)}{H \times W \times M \times N \times K_{h} \times K_{w}} = \frac{1}{N} + \frac{1}{K_{h} \times K_{w}}

(5)

For common 3 × 3 kernels (K_h = K_w = 3) and a large N, the ratio in Equation (5) approaches 1/9, so DSAC needs roughly one ninth of the theoretical computation of a standard convolution. The integration of depthwise separable and atrous convolutions has been validated as an effective architecture for achieving high diagnostic accuracy under noisy conditions [38], and its two-stage factorization is the basis of the proposed DSAC module. When the dilation rate is 1, dilated convolution is equivalent to standard convolution. When the dilation rate exceeds 1, it significantly expands the receptive field of the convolution kernel without increasing the number of parameters or the computational cost, thereby capturing broader contextual information. In DSAC, multi-scale feature extraction is achieved by setting different dilation rates, which enhances the network’s ability to jointly model local and global information. Subsequently, pointwise convolution performs a weighted fusion of features along the channel dimension, thereby strengthening feature interaction and comprehensive representation across channels. The specific structure of the DSAC module is shown in Figure 2.
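As a quick numerical check of Equations (1), (4), and (5) and of the effective receptive field formula K + (K − 1)(r − 1), the following minimal sketch evaluates them for one layer. The layer dimensions (a 64 × 64 map, 32 input channels, 128 output channels) and the function names are illustrative assumptions and are not taken from the paper.

```python
def conv_costs(h, w, m, n, kh, kw):
    """Return (standard, DSAC) operation counts following Equations (1) and (4)."""
    c_std = h * w * m * n * kh * kw
    c_dsac = h * w * m * kh * kw + h * w * m * n
    return c_std, c_dsac


def effective_kernel(k, r):
    """Effective extent of a k-tap kernel dilated by rate r: K + (K - 1)(r - 1)."""
    return k + (k - 1) * (r - 1)


# Hypothetical layer: 64 x 64 feature map, 32 -> 128 channels, 3 x 3 kernel.
c_std, c_dsac = conv_costs(64, 64, 32, 128, 3, 3)
ratio = c_dsac / c_std                      # equals 1/N + 1/(K_h * K_w)
print(f"cost ratio = {ratio:.4f}")          # cost ratio = 0.1189
print([effective_kernel(3, r) for r in (1, 2, 4)])   # [3, 5, 9]
```

With N = 128 the ratio is already close to 1/9 ≈ 0.111, and the dilation rate changes only the effective extent of the kernel, not the operation count.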

2.2. Dual-Channel Fault Diagnosis

The proposed dual-channel processing of vibration and acoustic signals is grounded in their complementary physical characteristics and noise profiles. Vibration signals, measured via accelerometers, directly capture structural resonances and impact forces from surface defects. In contrast, acoustic emission (AE) signals, captured via microphones, are stress waves generated by internal friction, crack propagation, and lubricant film collapse, often revealing sub-surface or incipient faults before they manifest significantly in the vibration spectrum.

This physical difference leads to a critical distinction in their signal-to-noise ratio (SNR) characteristics. Vibration signals, while noisy, typically have a higher inherent SNR for macro-faults as the sensor is directly coupled to the source. Acoustic signals, however, propagate through air and are susceptible to contamination from ambient machinery noise, airflow, and electromagnetic interference, resulting in a substantially lower operational SNR.

Therefore, we apply the Deep Residual Shrinkage Network (DRSN) specifically to the acoustic signal channel. The DRSN’s core mechanism, adaptive soft-thresholding, is designed to suppress noise components that are irrelevant to fault features. This is particularly effective for acoustic signals where noise is a dominant, non-stationary component. The DRSN learns to apply channel-wise thresholds, automatically shrinking low-amplitude, noisy coefficients towards zero while preserving high-amplitude fault-induced impulses. Applying DRSN to the cleaner vibration signal offers marginal gain but increases complexity, whereas for low-SNR acoustic data, it provides essential denoising and feature purification, making the two modalities commensurate for effective fusion. This targeted use aligns with the principle of allocating specialized processing where it is most needed, optimizing overall model efficiency and robustness.

2.2.1. Atrous Spatial Pyramid Pooling (ASPP)

Atrous spatial pyramid pooling (ASPP) introduces dilated convolutions and 1 × 1 convolutions into the traditional spatial pyramid pooling framework. The input feature map is processed by dilated convolutions with different dilation rates, which capture larger receptive fields and extract richer feature information, while max-pooling layers fuse feature information and 1 × 1 convolutions reduce computational complexity. ASPP also incorporates an upsampling step that recovers the features so that processing can continue at a higher resolution. Finally, the features extracted by the dilated convolutions are fused with those obtained from the pooling layers to obtain multi-scale feature representations. In object detection tasks, this approach significantly enhances accuracy while reducing the probability of false detections.

ASPP operates by processing the input feature map through a sequence of operations: 1 × 1 convolution, dilated convolutions with different dilation rates, pooling, another 1 × 1 convolution, upsampling, and a final dilated convolution. The feature maps generated by the dilated convolutions, the 1 × 1 convolutions, and the pooling layer are concatenated before being passed to the next layer. The architecture of ASPP is illustrated in Figure 3.
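The parallel-branch idea can be illustrated with a minimal one-dimensional sketch: the same 3-tap kernel is applied at several dilation rates, a pooled branch is broadcast back to the input length, and all branches are stacked. The dilation rates (1, 2, 4), the use of average pooling for the pooled branch, and the function names are assumptions made for illustration; the paper's ASPP operates on two-dimensional feature maps with learned weights.

```python
def atrous_conv1d(signal, kernel, rate):
    """Zero-padded 1-D dilated convolution (cross-correlation) that keeps the input length."""
    half = (len(kernel) - 1) * rate // 2      # padding per side for an odd-length kernel
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, weight in enumerate(kernel):
            idx = i - half + j * rate         # taps are spaced `rate` samples apart
            if 0 <= idx < len(signal):
                acc += weight * signal[idx]
        out.append(acc)
    return out


def aspp_branches(signal, kernel, rates):
    """One branch per dilation rate plus a pooled branch, stacked as a list of rows."""
    branches = [atrous_conv1d(signal, kernel, r) for r in rates]
    mean = sum(signal) / len(signal)          # global average pooling (assumed here)
    branches.append([mean] * len(signal))     # broadcast back to full length ("upsampling")
    return branches


signal = [0, 0, 0, 0, 1, 0, 0, 0, 0]          # unit impulse at the centre
features = aspp_branches(signal, [1.0, 1.0, 1.0], rates=(1, 2, 4))
for row in features[:3]:
    print(row)
```

For the impulse input, the three dilated branches respond at offsets of 1, 2, and 4 samples around the centre, which shows how the branches cover different scales with the same three weights.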

2.2.2. Deep Residual Shrinkage Network (DRSN)

Traditional residual networks struggle to promptly highlight key features when processing signals that contain substantial noise and redundant components. To address this problem, the deep residual shrinkage network (DRSN) introduces soft-thresholding shrinkage units on top of the residual structure. By combining soft-threshold denoising with attention mechanisms, DRSN performs adaptive compression and filtering of features, effectively suppressing redundant information while enhancing useful features. The soft-threshold function employed in DRSN offers smoother transitions than hard thresholds, significantly reducing local jitter caused by thresholding. Its derivative form resembles that of ReLU, yielding only 0 or 1 values, which helps mitigate the vanishing gradient problem. The mathematical expressions for the relevant soft-threshold functions are given in Equations (6) and (7).

y = \begin{cases} x - τ, & x > τ \\ 0, & - τ \leq x \leq τ \\ x + τ, & x < - τ \end{cases}

(6)

\frac{\partial y}{\partial x} = \begin{cases} 1, & x > τ \\ 0, & - τ \leq x \leq τ \\ 1, & x < - τ \end{cases}

(7)
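The soft-threshold function and its derivative can be sketched in a few lines. The channel-wise threshold below is set to τ = α · mean(|x|), which is how DRSN thresholds are commonly described; here α is a fixed, hypothetical value, whereas in a DRSN it is produced by a small learned sub-network. The function names and sample values are illustrative only.

```python
def soft_threshold(x, tau):
    """Soft thresholding: shrink x towards zero by tau; values in [-tau, tau] become 0."""
    if x > tau:
        return x - tau
    if x < -tau:
        return x + tau
    return 0.0


def soft_threshold_grad(x, tau):
    """Derivative of the soft threshold: 1 where |x| > tau, 0 otherwise."""
    return 1.0 if abs(x) > tau else 0.0


def shrink_channel(channel, alpha):
    """Channel-wise shrinkage with tau = alpha * mean(|x|); alpha is fixed here (assumption)."""
    tau = alpha * sum(abs(v) for v in channel) / len(channel)
    return [soft_threshold(v, tau) for v in channel], tau


# Hypothetical channel: small noisy values plus two large impulses.
channel = [0.1, -0.2, 3.0, -0.05, -2.5, 0.15]
shrunk, tau = shrink_channel(channel, alpha=0.5)
print(tau, shrunk)
```

In this example the four low-amplitude entries are set to zero while the two impulses are kept with their magnitude reduced by τ, which is the behaviour the DRSN relies on for denoising.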

2.2.3. Attention Mechanisms

The attention mechanism dynamically allocates computational resources, enabling the model to focus on critical regions within the input data and thereby enhancing the discriminative power of feature representations. To address challenges such as strong noise interference and weak fault features in bearing fault signals, this paper introduces a channel–spatial attention mechanism into the network to amplify fault-related feature responses while suppressing irrelevant background components.

A representative example of channel attention is the squeeze-and-excitation (SE) module. It obtains channel-level global descriptors through global pooling and uses two fully connected layers to adaptively adjust channel weights, thereby highlighting feature channels that are highly correlated with fault patterns. The basic structure of the SE module is shown in Figure 4. Both existing studies and our experimental results demonstrate that incorporating the SE module effectively improves fault classification performance under variable load conditions.

The spatial attention mechanism focuses on adaptively adjusting responses across different spatial locations, enabling the network to concentrate on regions where faults occur. A typical channel–spatial joint attention module is CBAM, which sequentially applies weighting to the channel and spatial dimensions. Its simple structure facilitates integration and has been widely applied in tasks such as image classification, object detection, and semantic segmentation. Its architecture is shown in Figure 5.

However, CBAM employs a serial modeling strategy for the channel and spatial dimensions, computing channel attention and spatial attention sequentially in two stages. This approach struggles to simultaneously capture the interaction between channel and spatial information. Additionally, implementing CBAM introduces extra parameters and computational overhead, which hinders deployment in resource-constrained scenarios.

To address these issues, this paper adopts a lightweight attention mechanism, SimAM, which precisely models the importance of each neuron within the feature maps. It employs a “separability” mathematical model to measure the independence of an individual neuron relative to others: an ideal salient feature should exhibit low variance within the same-category samples, significant mean differences across different categories, distinct firing patterns compared with surrounding neurons, and a certain degree of spatial inhibition. Based on this idea, SimAM constructs an energy function to characterize the linear separability between the target neurons and the others, thereby assigning attention weights to each neuron. This emphasizes critical fault features while suppressing noise components. The corresponding module structure is shown in Figure 6, and the energy function is given in Equation (8).

e_{t} (w_{t}, b_{t}, y, x_{i}) = {(y_{t} - \hat{t})}^{2} + \frac{1}{M - 1} \sum_{i = 1}^{M - 1} {(y_{o} - {\hat{x}}_{i})}^{2}

(8)

In this equation, t represents the target neuron, x_i denotes the remaining neurons, w_t and b_t are the weight and bias of the linear transform, M is the number of neurons, and y_t and y_o are the target values assigned to the target neuron and to the remaining neurons, respectively (set to 1 and −1 in Equation (11)).

\hat{t} = w_{t} t + b_{t}

(9)

{\hat{x}}_{i} = w_{t} x_{i} + b_{t}

(10)

The regularized objective function is defined in Equation (11).

e_{t} (w_{t}, b_{t}, y, x_{i}) = \frac{1}{M - 1} \sum_{i = 1}^{M - 1} {(- 1 - (w_{t} x_{i} + b_{t}))}^{2} + {(1 - (w_{t} t + b_{t}))}^{2} + λ {w_{t}}^{2}

(11)

Here, λ represents the regularization coefficient. The closed-form solutions for w_t and b_t are obtained from Equation (11) and are summarized in Equation (12).

\begin{cases} w_{t} = \frac{2 (t - μ_{t})}{{(t - μ_{t})}^{2} + 2 σ_{t}^{2} + 2 λ} \\ b_{t} = - \frac{1}{2} (t + μ_{t}) w_{t} \end{cases}

(12)

Here, μ_t represents the mean, σ_t² denotes the variance. The simplified expression of the minimum energy is formalized in Equation (13).

e_{t}^{*} = \frac{4 ({\hat{σ}}^{2} + λ)}{{(t - μ_{t})}^{2} + 2 {\hat{σ}}^{2} + 2 λ}

(13)

Here, \hat{σ}^2 represents the variance. The mean and variance are computed over all M neurons of the channel, as in Equation (14).

\begin{cases} μ_{t} = \frac{1}{M} \sum_{i = 1}^{M} x_{i} \\ {\hat{σ}}^{2} = \frac{1}{M} \sum_{i = 1}^{M} {(x_{i} - μ_{t})}^{2} \end{cases}

(14)

From this formulation, a lower energy indicates that neuron t is more distinct from its neighboring neurons and thus more important, so importance can be measured by the inverse of the minimum energy in Equation (13). After discarding constant terms, the final importance measure is written in Equation (15).

z = \frac{{(t - μ_{t})}^{2}}{4 ({\hat{σ}}^{2} + λ)}

(15)

Here, z represents the simplified inverse energy; a larger z corresponds to a more important neuron.
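The computation above needs no learnable parameters, which can be seen from a minimal single-channel sketch. It takes the inverse of the minimum energy in Equation (13), using the mean and variance of Equation (14), and passes it through a sigmoid gate as in the publicly available SimAM reference implementation. The value λ = 1e-4, the function names, and the toy feature map are assumptions for illustration; the reference implementation also normalizes the variance by M − 1 rather than M.

```python
import math


def simam_weights(feature_map, lam=1e-4):
    """Per-neuron SimAM attention weights for one channel given as a 2-D list of floats."""
    values = [v for row in feature_map for v in row]
    m = len(values)
    mu = sum(values) / m                                  # channel mean, Equation (14)
    var = sum((v - mu) ** 2 for v in values) / m          # channel variance, Equation (14)
    weights = []
    for row in feature_map:
        gated = []
        for t in row:
            # Inverse of the minimum energy in Equation (13): larger means more distinctive.
            inv_energy = (t - mu) ** 2 / (4.0 * (var + lam)) + 0.5
            gated.append(1.0 / (1.0 + math.exp(-inv_energy)))   # sigmoid gate
        weights.append(gated)
    return weights


def apply_simam(feature_map, lam=1e-4):
    """Rescale every neuron by its attention weight."""
    weights = simam_weights(feature_map, lam)
    return [[t * g for t, g in zip(row, w_row)] for row, w_row in zip(feature_map, weights)]


# Hypothetical 3 x 3 channel with one salient response in the centre.
fmap = [[0.1, 0.1, 0.1], [0.1, 2.0, 0.1], [0.1, 0.1, 0.1]]
weights = simam_weights(fmap)
print(round(weights[1][1], 3), round(weights[0][0], 3))
```

The salient centre neuron receives a clearly larger weight than the background neurons, which is the intended effect of emphasizing fault-related responses.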

2.2.4. Integrated Architecture of the DSAC-ASPP Hybrid Network

Based on this rationale, the proposed DSAC–ASPP bearing fault diagnosis model is constructed as follows. First, vibration signals and acoustic signals are converted into time–frequency maps using the Continuous Wavelet Transform (CWT). The vibration time–frequency map is then fed into a multi-scale feature extraction module to obtain preliminary vibration features. The acoustic time–frequency map is fed into a deep residual shrinkage network (DRSN), which suppresses strong noise and extracts robust acoustic features through residual learning and adaptive soft-threshold shrinkage. Subsequently, the two feature streams are concatenated in a feature fusion module and fed into a lightweight atrous convolutional backbone based on depthwise separable atrous convolution (DSAC) and atrous spatial pyramid pooling (ASPP). This expands the effective receptive field and enables the aggregation of multi-scale contextual information. Concurrently, a lightweight SimAM attention mechanism is introduced to recalibrate high-level features. Finally, the recalibrated fused features pass through global pooling and a classifier to complete fault classification. The overall model architecture is shown in Figure 7.
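The first stage of this pipeline, the CWT, can be sketched with the standard library alone. The example below correlates a signal with scaled copies of a real-valued Morlet wavelet and stacks the magnitudes into a small time–frequency map. The choice of the Morlet wavelet, the three scales, and the 160 Hz test tone are assumptions for illustration, since the mother wavelet and scale grid are not specified in this section; only the 5120 Hz sampling rate comes from Section 3.1.

```python
import math


def morlet(t):
    """Real-valued Morlet mother wavelet exp(-t^2 / 2) * cos(5 t)."""
    return math.exp(-0.5 * t * t) * math.cos(5.0 * t)


def cwt_row(signal, scale):
    """CWT coefficients at one scale (in samples), with zero padding at the edges."""
    half = int(4 * scale)                       # wavelet is negligible beyond |t| > 4
    taps = [morlet(k / scale) / math.sqrt(scale) for k in range(-half, half + 1)]
    n = len(signal)
    row = []
    for b in range(n):
        acc = 0.0
        for k, weight in enumerate(taps):
            idx = b + k - half
            if 0 <= idx < n:
                acc += signal[idx] * weight
        row.append(acc)
    return row


def scalogram(signal, scales):
    """Stack |CWT| rows into a time-frequency map with one row per scale."""
    return [[abs(c) for c in cwt_row(signal, s)] for s in scales]


fs = 5120.0                                     # sampling rate reported in Section 3.1
f0 = 160.0                                      # hypothetical test tone, not from the paper
signal = [math.sin(2 * math.pi * f0 * i / fs) for i in range(512)]
scales = [13, 26, 52]                           # hypothetical; centre frequency ~ 5 / (2*pi) * fs / scale
tf_map = scalogram(signal, scales)
energy = [sum(v * v for v in row) for row in tf_map]
best_scale = scales[energy.index(max(energy))]
print(best_scale)                               # 26
```

Scale 26 corresponds to roughly 157 Hz at this sampling rate, so the row with the highest energy matches the 160 Hz tone. A practical pipeline would use a dense scale grid and an optimized library routine instead of these explicit loops.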

3. Experimental Results

3.1. Experimental Setup

Vibration and acoustic signals were acquired using the QPZZ-II rotating machinery vibration analysis and fault diagnosis test bench in the laboratory. The QPZZ-II acquisition system, designed for collecting vibration signals from rotating machinery and analyzing faults, primarily consists of four components: an electric motor, a torque sensor, a power meter, and an electronic controller, as illustrated in Figure 8.

This experimental platform facilitates the simulation of bearing operating conditions. The vibration data acquisition process involves real-time monitoring of the test object using various sensors (accelerometers, acoustic sensors, etc.), with accelerometers used for vibration signal acquisition. Data acquisition cards are connected to the sensors to collect and record the experimental data. The accelerometer was mounted radially on the housing of the test bearing, while the microphone was positioned 17 cm away. Experiments were conducted on N205EM deep groove ball bearings (inner diameter: 25 mm, outer diameter: 52 mm, width: 15 mm). A constant rotational speed of 1200 r/min was maintained, and a radial load was applied via the test bench’s loading system. Both vibration and acoustic signals were sampled at 5120 Hz. Data covering four bearing health states were collected: normal condition (N), inner race fault (IF), outer race fault (OF), and rolling element fault (BF). The bearing components under these four states are presented in Figure 9, providing visual reference for the fault characteristics that generate the acquired signals. Subsequently, the experimental data were stored, and real-time waveforms were visualized for initial inspection.

3.2. Data Preprocessing

Four bearing conditions with distinct fault characteristics were considered: normal, outer ring fault, inner ring fault, and rolling element fault, for which both vibration and acoustic signals were collected. Continuous segments of 10,000 sampling points (approximately 1.95 s of signal at the 5120 Hz sampling rate) were extracted to plot the time-domain waveforms. The time-domain plots of the vibration and acoustic signals are shown in Figure 10 and Figure 11.
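The relation between segment length and duration can be checked with a short sketch. The sampling rate and segment length are taken from the text; the helper `segment_signal` and the choice of non-overlapping segments are assumptions, since the text does not state whether consecutive segments overlap.

```python
SAMPLING_RATE_HZ = 5120     # stated in Section 3.1
SEGMENT_LENGTH = 10_000     # stated in Section 3.2

def segment_signal(signal, length=SEGMENT_LENGTH):
    """Split a 1-D sequence into consecutive non-overlapping segments of equal length."""
    n_segments = len(signal) // length
    return [signal[i * length:(i + 1) * length] for i in range(n_segments)]

duration_s = SEGMENT_LENGTH / SAMPLING_RATE_HZ    # 10000 / 5120 = 1.953125 s
segments = segment_signal(list(range(25_000)))    # toy signal: 2 full segments, remainder dropped
```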

Figure 10 and Figure 11 reveal distinct differences in the raw signals across the four bearing states. To further analyze these variations, the previously introduced continuous wavelet transform (CWT) method was applied to convert the signals of the four bearing states into time–frequency representations. The resulting time–frequency representations of the vibration and acoustic signals are shown in Figure 12 and Figure 13.

3.3. Experimental Method

Fault diagnosis based on DSAC–ASPP primarily involves the following steps: dataset partitioning, model training, model evaluation, model testing, and fault classification. The dataset was partitioned into training, validation, and test sets with a ratio of 6:2:2, a common practice in deep learning research that balances model training and performance evaluation. The number of training samples per fault category (1200) is sufficient to ensure stable convergence of the proposed DSAC-ASPP model (approximately 1.2 M parameters) without observed overfitting. Model training involves feeding the partitioned dataset into the DSAC–ASPP model. Model evaluation aims to determine the optimal network parameters, while model testing verifies the fault diagnosis performance of the proposed model. The dataset partitioning is shown in Table 2.
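A 6:2:2 hold-out split of the kind described above might be implemented as follows. This is a sketch under assumptions: the helper name `split_indices`, the seeded shuffle, and the per-class application (which reproduces the 1200/400/400 counts of Table 2 for 2000 samples per class) are illustrative choices rather than the authors' procedure.

```python
import random

def split_indices(n_samples, ratios=(0.6, 0.2, 0.2), seed=0):
    """Shuffle sample indices and cut them into train/validation/test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # seeded for reproducibility
    n_train = round(n_samples * ratios[0])
    n_val = round(n_samples * ratios[1])
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]            # remainder goes to the test set
    return train, val, test

train, val, test = split_indices(2000)          # one class with 2000 samples
```

Applying the split separately to each class keeps the three subsets balanced across the four health states.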

The experimental dataset is balanced, with an equal number of samples for all fault categories. The standard categorical cross-entropy loss was therefore used during training without class weights.

The performance of deep learning models depends heavily on hyperparameter selection. These hyperparameters, which must be set before training, directly affect model accuracy and generalization capability. The learning rate controls the magnitude of gradient updates: an excessively small value causes slow convergence, whereas an excessively large value may cause the model to oscillate near local minima or even diverge. The batch size influences the direction and stability of gradient descent: too few samples per batch make convergence more difficult, whereas too many may trap the model in local optima. The number of training epochs must also be controlled: too few epochs cause underfitting, whereas too many may lead to overfitting. In this study, the model hyperparameters were determined via grid search and are summarized in Table 3. The AdamW optimizer was adopted with an initial learning rate of 0.001, a batch size of 32, and a weight decay coefficient of 0.01, and the model was trained for 50 epochs.

The fixed hold-out split (6:2:2 ratio) was employed for all experiments, without k-fold cross-validation, to maintain consistency and manage computational cost. To assess the stability and variability of the results, each experiment, including all baseline models, was repeated five times with different random seeds. Consequently, all performance metrics reported in this study are presented as the mean ± standard deviation across these five independent runs.

The dilation rates within the DSAC modules and the ASPP configuration were designed to capture multi-scale fault signatures from the CWT time–frequency representations. The DSAC blocks employ parallel branches with dilation rates {1, 2, 4}, enabling a progressive expansion of the receptive field from local details to broader context. For the ASPP module, three parallel dilated convolutions with rates {6, 12, 18} were configured, along with a 1 × 1 convolution and a global average pooling branch. These larger rates operate on higher-level feature maps to aggregate contextual information at complementary scales, which is crucial for robust fault discrimination under varying noise conditions. This configuration yielded the best and most stable performance on the validation set. All baseline models were trained under this unified setup to ensure a fair comparison.

The specific steps for fault diagnosis using the DSAC–ASPP method are as follows. First, sensors on the test bench collect bearing vibration and acoustic signals. These one-dimensional data are converted into time–frequency maps via the CWT and then divided into training, validation, and test sets. Next, the network parameters are initialized, and the training and validation sets of time–frequency maps are fed into the model for training. Convergence is assessed by monitoring changes in the loss function. Model parameters are saved according to validation-set accuracy; if the performance is suboptimal, the parameters are fine-tuned to enhance generalization. Finally, the best trained model is used to make predictions on the test set to verify its recognition accuracy.
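To make the effect of the dilation rates concrete, the sketch below computes the span covered by a single dilated kernel using the standard relation k_eff = k + (k − 1)(d − 1). The 3 × 3 kernel size is an assumption for illustration, as the kernel size is not stated in this section, and the function name is hypothetical.

```python
def effective_kernel_size(kernel_size, dilation):
    """Span (in feature-map positions) covered by one dilated convolution kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)

DSAC_RATES = (1, 2, 4)      # dilation rates of the DSAC branches
ASPP_RATES = (6, 12, 18)    # dilation rates of the ASPP branches

dsac_spans = [effective_kernel_size(3, d) for d in DSAC_RATES]   # [3, 5, 9]
aspp_spans = [effective_kernel_size(3, d) for d in ASPP_RATES]   # [13, 25, 37]
```

Under this assumption, each branch covers a progressively wider context with the same nine weights per channel, which is why dilation enlarges the receptive field without adding parameters.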

To evaluate robustness, Gaussian white noise was added at SNR levels of 0 dB, −2 dB, −4 dB, and −8 dB. The SNR is computed as in Equation (16).

SNR (dB) = 10 \log_{10} (\frac{P_{signal}}{P_{noise}})

(16)

Here, P_signal and P_noise denote the power of the original signal and the added noise, respectively. Each experiment was repeated five times with different random seeds; the mean and standard deviation are reported. Future work will include additional noise types to better simulate industrial environments.
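One way to add Gaussian white noise at a prescribed SNR, following the definition SNR (dB) = 10 log10(P_signal / P_noise), is sketched below. The function names, the seeded generator, the rescaling of the noise to hit the target power exactly, and the 50 Hz test tone are all assumptions for illustration and are not taken from the authors' code.

```python
import math
import random

def signal_power(x):
    """Mean squared amplitude of a sequence."""
    return sum(v * v for v in x) / len(x)

def add_white_noise(signal, snr_db, seed=0):
    """Return signal plus Gaussian white noise scaled to the requested SNR in dB."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    target_noise_power = signal_power(signal) / (10.0 ** (snr_db / 10.0))
    scale = math.sqrt(target_noise_power / signal_power(noise))
    return [s + scale * n for s, n in zip(signal, noise)]

def measure_snr_db(signal, noisy):
    """Recover the SNR in dB from a clean signal and its noisy version."""
    noise = [y - s for s, y in zip(signal, noisy)]
    return 10.0 * math.log10(signal_power(signal) / signal_power(noise))

clean = [math.sin(2 * math.pi * 50 * t / 5120) for t in range(5120)]  # toy 50 Hz tone
noisy = add_white_noise(clean, snr_db=-4.0)
```

At −4 dB the noise power is about 2.5 times the signal power, which illustrates how severe the lowest SNR settings are.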

3.4. Results and Evaluation

To assess the stability of model performance, a hold-out validation strategy was employed. Considering the computational cost of training deep models, k-fold cross-validation was not used. However, all reported results are presented as the mean ± standard deviation from five independent training runs with different random seeds (as shown in Table 3), providing a reliable estimate of performance and its variance.
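The mean ± standard deviation reporting can be sketched as follows. The run values are placeholders chosen for illustration and are not results from this study, and the use of the sample standard deviation (n − 1 denominator) is an assumption, since the text does not state which estimator was used.

```python
import statistics

def summarize_runs(values):
    """Return (mean, sample standard deviation) over repeated runs."""
    return statistics.mean(values), statistics.stdev(values)

run_accuracies = [97.0, 98.0, 98.0, 98.0, 99.0]   # placeholder values, NOT from the paper
mean_acc, std_acc = summarize_runs(run_accuracies)
summary = f"{mean_acc:.2f} ± {std_acc:.2f}"       # "98.00 ± 0.71"
```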

3.4.1. Comparison of Training Convergence and Classification Performance

The performance of different models on the vibration signal dataset is shown in Figure 14. Figure 14a,b present the changes in accuracy and loss over training epochs for each model, respectively. Overall, the proposed DSAC–ASPP model achieves the best performance on both metrics. Its training accuracy rapidly increases to nearly 100% within the first few epochs and then stabilizes. The corresponding loss curve drops sharply within approximately 10 epochs and converges to a value close to zero, with almost no noticeable fluctuations. This indicates that the model exhibits fast convergence and good training stability. Inception_v3 and MobileNet_v2 also achieve high accuracy within the first 20–30 epochs. Their loss values decrease rapidly at the beginning and then fluctuate within a narrow range during convergence, indicating strong fitting capability for this task. Among the more advanced models, LightNAS exhibits a convergence profile similar to these efficient networks, achieving competitive accuracy quickly but with slightly higher final loss. ViT-FDM, while demonstrating stable convergence and lower loss fluctuation owing to its global self-attention mechanism, requires more epochs to reach its peak accuracy. MFACNN shows moderate convergence speed, with its loss curve declining steadily, reflecting its focus on frequency-aware feature extraction. However, the convergence speed and stability of all these models are slightly inferior to those of DSAC–ASPP. In contrast, VGG-16 and ResNet50 exhibit slower accuracy improvement, with loss values decreasing more slowly and showing larger fluctuations during training. ResNet50, in particular, demonstrates more pronounced optimization instability and higher final loss values, suggesting that its architecture may not be fully aligned with the characteristics of the current dataset. 
A comprehensive analysis of the accuracy and loss curves reveals that DSAC–ASPP outperforms the other comparison models in convergence speed, training stability, and final performance. It is therefore more suitable as the feature extraction and classification framework for this bearing fault diagnosis task.

Furthermore, to assess the stability and reproducibility of the results, each experiment was repeated five times with different random seeds. The comprehensive performance is evaluated using Accuracy, Precision, Recall, and F1-score, with the final results reported as the mean ± standard deviation across all runs, as summarized in Table 3. The small standard deviations reflect stable performance across repetitions.

Inference latency was measured on two representative platforms: an NVIDIA GeForce RTX 4060 GPU (6 GB) and an Intel Core i5-13500H CPU (2.6 GHz). The average latency per sample (224 × 224) on GPU/CPU is: DSAC-ASPP (2.1 ms/15.3 ms), Inception_v3 (5.8 ms/45.2 ms), VGG_16 (12.5 ms/98.7 ms), ResNet50 (6.3 ms/52.1 ms), ViT-Based (22.5 ms/156.8 ms), LightNAS (3.5 ms/28.4 ms), MFACNN (4.2 ms/35.6 ms). As shown in Table 3, the proposed DSAC-ASPP model attains the highest average values and the smallest standard deviations across all four evaluation metrics (Accuracy, Precision, Recall, and F1-score), which demonstrates its superior diagnostic performance and excellent stability. Remarkably, it achieves this leading result with only 1.2 M parameters—far fewer than all other compared models, including the lightweight-oriented LightNAS (3.5 M) and the recent ViT-based approach (85.7 M). These results confirm that DSAC-ASPP effectively balances high accuracy, strong robustness, and exceptional parameter efficiency, directly addressing the key challenges of computational redundancy and noise sensitivity.
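Per-sample latency figures of the kind quoted above are typically obtained with a warm-up phase followed by a timed loop. The sketch below shows that pattern with the standard library only; the helper name, the warm-up and repeat counts, and the stand-in workload are assumptions, and timing a real model on a GPU would additionally require synchronizing the device before reading the clock.

```python
import time

def mean_latency_ms(fn, sample, warmup=5, repeats=50):
    """Average wall-clock time of fn(sample) in milliseconds."""
    for _ in range(warmup):          # warm-up calls are excluded from the timing
        fn(sample)
    start = time.perf_counter()
    for _ in range(repeats):
        fn(sample)
    return (time.perf_counter() - start) * 1000.0 / repeats

# Stand-in workload; a trained model's forward pass would be used in practice.
latency = mean_latency_ms(lambda x: sum(v * v for v in x), list(range(1000)))
```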

3.4.2. Validation of Dual-Channel Design and Complementary Information

To quantitatively validate the dual-channel architecture and demonstrate the complementarity of the two modalities, we conducted an ablation study under 0 dB Gaussian white noise, as summarized in Table 4.

The ablation study quantitatively validates the core design of the dual-channel architecture. First, the full model exhibits a 4.16% accuracy gain over the vibration-only variant, providing direct evidence for the complementary diagnostic value of the acoustic signal. Second, the DRSN proves critical for the acoustic channel: its replacement with a standard convolutional block leads to a severe 9.09% performance drop, whereas its inclusion enables the acoustic-only model to achieve 92.80% accuracy. Finally, the low feature similarity between the two streams confirms they learn distinct representations. These results collectively substantiate the premise of complementary fusion and justify the modality-specific processing pipeline.

3.4.3. Confusion Matrix Analysis

To further analyze the model’s classification performance, a confusion matrix is used to evaluate the classification results. The confusion matrix clearly displays the model’s predictions across different categories, with the horizontal axis representing the predicted category and the vertical axis representing the true category. In the experiments, the number of training epochs was set to 100, and the corresponding results are shown in Figure 15.
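A confusion matrix with the axis convention described above (rows for the true category, columns for the predicted category) can be built as in this sketch. The label order follows Table 2; the function name and the six example predictions are hypothetical.

```python
LABELS = ("N", "IF", "OF", "BF")   # health-state labels from Table 2

def confusion_matrix(y_true, y_pred, labels=LABELS):
    """Count samples per (true class, predicted class) pair."""
    index = {label: i for i, label in enumerate(labels)}
    cm = [[0] * len(labels) for _ in labels]
    for truth, pred in zip(y_true, y_pred):
        cm[index[truth]][index[pred]] += 1   # row = true class, column = predicted class
    return cm

y_true = ["N", "IF", "OF", "BF", "IF", "OF"]   # hypothetical ground truth
y_pred = ["N", "IF", "OF", "BF", "OF", "OF"]   # one IF sample misclassified as OF
cm = confusion_matrix(y_true, y_pred)
```

Off-diagonal entries, such as the single count in the IF row and OF column here, show which fault types are confused with each other.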

As can be observed from the confusion matrices, the model can accurately distinguish different bearing fault categories under clean conditions. In real industrial environments, bearing signals are typically accompanied by various forms of noise. Adding white noise simulates complex real-world conditions, making the training data more representative of actual operating scenarios. Analyzing data with added white noise helps validate the fault diagnosis model’s adaptability across different noise environments. The signal-to-noise ratio (SNR) was set to 0, −2, −4, and −8 dB. Future studies will incorporate multiple noise types (e.g., motor noise, mechanical vibration noise) for a more comprehensive assessment.

3.4.4. t-SNE Visualization Analysis

Comparing t-SNE visualization results across different training epochs reveals that the model’s classification capability significantly improves as training progresses: at 10 epochs, data points are mixed with blurred boundaries; by 30 epochs, clusters begin to form and separation trends emerge; at 60 epochs, the boundaries between categories become clearer and the overlap is further reduced; finally, at 100 epochs, clusters corresponding to different fault states are nearly completely separated, indicating that the model has learned highly discriminative features and that the classification performance is close to optimal. t-SNE visualization of learned features under different training epochs is shown in Figure 16.

3.4.5. Generalization Analysis

Comparative experiments using Inception_v3, MobileNet_v2, VGG-16, and ResNet50 were conducted to validate the generalization capability of the proposed model. As shown in Figure 17, DSAC–ASPP demonstrates the strongest noise resistance among all models, maintaining high accuracy under various noise conditions and exhibiting the best generalization performance. Inception_v3 and MobileNet_v2 follow closely, demonstrating stable performance and approaching optimal results at higher signal-to-noise ratios. VGG-16 and ResNet50 exhibit weaker noise resistance, particularly under low signal-to-noise ratio conditions, where their performance declines markedly, indicating relatively insufficient generalization capability. The generalization capabilities of the different models are summarized in Table 5, while their accuracy under white noise is illustrated in Figure 17.

4. Conclusions and Future Work

4.1. Conclusions

This study proposes a bearing fault diagnosis method that integrates depthwise separable atrous convolution (DSAC) and atrous spatial pyramid pooling (ASPP). Vibration and acoustic signals are first transformed into CWT-based time–frequency maps; then, vibration features and noise-robust acoustic features are extracted in parallel and fused, followed by DSAC–ASPP-based lightweight feature learning and SimAM attention recalibration for final classification. Experimental results lead to the following conclusions:

Atrous spatial pyramid pooling (ASPP) effectively captures multi-scale features across varying receptive fields by employing dilated convolutions with multiple dilation rates. Combined with global pooling and 1 × 1 convolutions, it achieves comprehensive global information integration and feature fusion, significantly enhancing the network’s representational capability for diverse bearing fault patterns.
The dual-channel fault diagnosis framework, which integrates depthwise separable convolution, channel–spatial attention, and SimAM mechanisms, enhances the representation of input features while improving training stability and mitigating issues such as gradient explosion.
On laboratory-collected bearing vibration and acoustic signal datasets, the DSAC–ASPP model achieves superior diagnostic performance. It not only outperforms conventional CNN backbones but also surpasses several state-of-the-art methods, including ViT-FDM, LightNAS, and MFACNN, in terms of overall accuracy and robustness. Furthermore, under varying intensities of superimposed Gaussian white noise, the proposed model maintains the highest classification accuracy with minimal performance degradation, demonstrating its strong generalization capability and noise robustness. This makes it suitable for reliable multi-source signal fault diagnosis under complex operating conditions. Moreover, this performance is achieved with high efficiency and low inference latency. This presents a favorable trade-off compared to computationally heavy foundation models, which typically require orders of magnitude more parameters and longer inference times, making our approach more amenable to real-world, resource-constrained and real-time deployment.

Despite the promising results, several limitations should be acknowledged, which also point to directions for future research. First, the experimental data were collected on a laboratory test bench with predefined bearing states, and the noisy scenarios were emulated by superimposing artificial white noise. This setup may not fully capture the complexity of real industrial environments, which involve variable loads, transient speeds, and diverse interference sources. Second, although the proposed method was compared against several representative CNN backbones and recent advanced models, further validation on broader public datasets and cross-domain industrial scenarios is needed to thoroughly verify its generalizability and robustness.

4.2. Future Work

To address these limitations and enhance the practical applicability of the proposed method, future work will focus on the following directions:

Real-world data validation and collection. Vibration and acoustic signals will be collected from operational rotating machinery under real industrial conditions. The method will be evaluated using naturally progressing faults and diverse operational profiles to better assess its practicality.
Adaptation to variable operating conditions. Domain adaptation and transfer learning techniques will be investigated to improve the model’s robustness under non-stationary conditions, such as changing rotational speeds, fluctuating loads, and complex noise environments.
Lightweight and efficient deployment. Model compression techniques (e.g., pruning, quantization) and optimized time–frequency transformation pipelines will be explored to reduce computational overhead and memory footprint, facilitating real-time or edge-device deployment without compromising diagnostic accuracy.
Extension to complex fault scenarios. The fault label space will be expanded to include compound faults and early-stage degradation patterns. Fine-grained health assessment beyond discrete fault classification will also be studied to enable more predictive maintenance capabilities.

By pursuing these research directions, we aim to bridge the gap between laboratory-based validation and real-world industrial applications, ultimately contributing to more robust, efficient, and practical solutions for machinery fault diagnosis.

Author Contributions

Conceptualization, X.G. and C.L.; methodology, X.G. and X.Y.; software, X.G. and C.L.; validation, X.G. and J.L.; formal analysis, X.Y.; investigation, X.Y.; resources, X.Y.; data curation, C.L. and Y.T.; writing—original draft preparation, X.G.; writing—review and editing, C.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Science Research Fund of the Department of Science and Technology of Liaoning Province, China (2025-MSLH-587), and by the National Natural Science Foundation of China (52275119).

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Chafic, C.E.; Hocine, K.; Salima, B.; Pont, M.; Hay, M. Prediction of fatigue damage and spalling in a multilayered journal bearing shell. Tribol. Int. 2022, 175, 107850.
Mahendra, R.S.; Shivdayal, P. Crack propagation and fatigue life estimation of spur gear with and without spalling failure. Theor. Appl. Fract. Mech. 2023, 127, 104020.
Gu, J.; Congalton, G.R. Assessing the impact of mixed pixel proportion training data on SVM-based remote sensing classification: A simulated study. Remote Sens. 2025, 17, 1274.
Li, X.; Wu, W.; Zhu, F.; Guan, S.; Zhang, W.; Li, Z. FA-Unet: A deep learning method with fusion of frequency domain features for fruit leaf disease identification. Horticulturae 2025, 11, 783.
Wang, Z.; Wang, S.; Cheng, Y. Fault feature extraction of parallel-axis gearbox based on IDBO-VMD and t-SNE. Appl. Sci. 2024, 14, 289.
Li, W.; Cai, H.; Yang, X.; Xue, Y.; Ye, J.; Hu, X. Dual-channel parallel multimodal feature fusion for bearing fault diagnosis. Machines 2025, 13, 950.
Spirto, M.; Melluso, F.; Nicolella, A.; Malfi, P.; Cosenza, C.; Savino, S.; Niola, V. A Comparative Study between SDP-CNN and Time–Frequency-CNN based Approaches for Fault Detection. J. Dyn. Monit. Diagn. 2025, early access.
Chen, X.; Zhang, B.; Gao, D. Bearing fault diagnosis based on multi-scale CNN and LSTM model. J. Intell. Manuf. 2021, 32, 597–613.
Zhou, Z.; Ai, Q.; Lou, P.; Hu, J.; Yan, J. A novel method for rolling bearing fault diagnosis based on Gramian angular field and CNN-ViT. Sensors 2024, 24, 3967.
Cui, K.; Liu, M.; Meng, Y. A new fault diagnosis of rolling bearing on FFT image coding and L-CNN. Meas. Sci. Technol. 2024, 35, 076108.
Jiang, G.; Li, D.; Feng, K.; Li, Y.; Zheng, J.; Ni, Q.; Li, H. Rolling Bearing Fault Diagnosis Based On Convolutional Capsule Network. J. Dyn. Monit. Diagn. 2023, 2, 275–289.
Song, B.; Liu, Y.; Fang, J.; Liu, W.; Zhong, M.; Liu, X. An optimized CNN–BiLSTM network for bearing fault diagnosis under multiple working conditions with limited training samples. Neurocomputing 2024, 574, 127284.
Jia, L.; Chow, T.W.S.; Yuan, Y. GTFE-Net: A Gramian time frequency enhancement CNN for bearing fault diagnosis. Eng. Appl. Artif. Intell. 2023, 119, 105794.
Ruan, D.; Jin, W.; Yang, J.; Gühmann, C. CNN parameter design based on fault signal analysis and its application in bearing fault diagnosis. Adv. Eng. Inform. 2023, 55, 101877.
Gao, S.; Zhao, K.; Wen, T. Learning criteria of normalized regressor-based adaptive observer for actuator fault diagnosis of disturbed systems. Nonlinear Dyn. 2024, 113, 9551–9576.
Xu, Z.; Chen, X.; Li, Y.; Xu, J. Hybrid Multimodal Feature Fusion with Multi-Sensor for Bearing Fault Diagnosis. Sensors 2024, 24, 1792.
Wang, Z.; Chen, J.; Wang, C.; Peng, C.; Xuan, J.; Shi, T.; Zuo, M. CNC-VLM: An RLHF-optimized industrial large vision-language model with multimodal learning for imbalanced CNC fault detection. Mech. Syst. Signal Process. 2026, 245, 113838.
Wei, C.; Fan, H. Attention mechanism-enhanced multi-scale separable atrous convolution method for motor bearing fault diagnosis. In Proceedings of the 2025 International Conference on Electrical Automation and Artificial Intelligence (ICEAAI), Guangzhou, China, 10–12 January 2025; IEEE: New York, NY, USA, 2025; pp. 969–973.
Mbiethieu, C.; Tsopze, N.; Mephu Nguifo, E. XLITE-Unet: Extremely light and efficient deep learning architecture with selective atrous and axial depthwise convolution for image segmentation. Comput. Vis. Image Underst. 2025, 262, 104543.
Deng, W.; Zhang, Y.; Yu, H.; Li, H. Knowledge graph embedding based on dynamic adaptive atrous convolution and attention mechanism for link prediction. Inf. Process. Manag. 2024, 61, 103642.
Wang, Y.; Yang, M.; Zhang, Y.; Xu, Z.; Zhou, Z.; Li, H. A bearing fault diagnosis model based on deformable atrous convolution and squeeze-and-excitation aggregation. IEEE Trans. Instrum. Meas. 2021, 70, 1–10.
Li, S.; Wang, H.; Song, L.; Cui, L.; Li, X. An adaptive data fusion strategy for fault diagnosis based on the convolutional neural network. Measurement 2020, 165, 108122.
Chen, J.; Zhang, Q.; Zhang, S.; Peng, L.; Wen, J. A lightweight model for bearing fault diagnosis based on Gramian angular field and coordinate attention. Machines 2022, 10, 282.
Cao, Z.; Xu, X.; Hu, B.; Zhou, M. Rapid detection of blind roads and crosswalks by using a lightweight semantic segmentation network. IEEE Trans. Intell. Transp. Syst. 2021, 22, 6188–6197.
Xie, S.; Wang, J.; Li, Y.; Yang, L. Bearing fault diagnosis method based on improved meta-ResNet and sample weighting under noisy labels. Struct. Health Monit. 2025, 24, 3707–3727.
Li, P.; Wang, W.; Yang, X.; Liu, S.; Zhang, L.; Cheng, Y. Intelligent fault diagnosis of rolling bearings based on wavelet transform and improved ResNet under noisy labels and environments. Eng. Appl. Artif. Intell. 2022, 115, 105269.
Qiu, G.; Nie, Y.; Peng, Y.; Huang, P.; Chen, J.; Gu, Y. A variable-speed-condition fault diagnosis method for crankshaft bearing in the RV reducer with WSO-VMD and ResNet-SWIN. Qual. Reliab. Eng. Int. 2024, 40, 2321–2347.
Dong, H.; Zhang, R.; Mi, Y. Fault diagnosis on bearing of electric motor based on DRSN-BIGRU using stator current signals. J. Phys. Conf. Ser. 2025, 3079, 012050.
Zhang, Z.; Zhang, C.; Zhang, X.; Chen, L.; Shi, H.; Li, H. A self-adaptive DRSN-GPReLU for bearing fault diagnosis under variable working conditions. Meas. Sci. Technol. 2022, 33, 124005.
Zhang, Z.; Zhang, C.; Chen, L.; Li, H. LGMA-DRSN: A lightweight convex global multi-attention deep residual shrinkage network for fault diagnosis. Meas. Sci. Technol. 2023, 34, 115011.
Zhang, Z.; Zhang, C.; Chen, L.; Li, H.; Shi, H. GMA-DRSNs: A novel fault diagnosis method with global multi-attention deep residual shrinkage networks. Measurement 2022, 196, 111203.
Hu, H.; Jiang, A.; Wu, X.; An, Z.; Zhang, S. Multi-condition fault diagnosis method for rotating machinery based on whale optimization variational mode decomposition algorithm and deep residual network. Meas. Sci. Technol. 2025, 36, 076118.
Luo, L.; Liu, Y. Fault diagnosis of planetary gear train crack based on DC-DRSN. Appl. Sci. 2024, 14, 6873.
Guo, S.; Yang, T.; Gao, W.; Zhang, C.; Zhang, Y. An intelligent fault diagnosis method for bearings with variable rotating speed based on Pythagorean spatial pyramid pooling CNN. Sensors 2018, 18, 3857.
Yan, X.; Zhang, Y.; Jin, Q. Chemical process fault diagnosis based on improved ResNet fusing CBAM and SPP. IEEE Access 2023, 11, 46678–46690.
Yang, X.; Yunlei, F.; Hui, L. Lightweight semantic segmentation of complex structural damage recognition for actual bridges. Struct. Health Monit. 2023, 22, 3250–3269.
Chen, J.; Wang, H.; Su, B.; Li, Z. Rolling bearing fault diagnosis based on DRS frequency spectrum image and improved DQN. Trans. Can. Soc. Mech. Eng. 2024, 48, 437–446.
Lei, Y.; Yang, B.; Jiang, X.; Jia, F.; Li, N.; Nandi, A.K. Applications of Machine Learning to Machine Fault Diagnosis: A Review and Roadmap. Mech. Syst. Signal Process. 2020, 138, 106587.
Yao, D.; Zhou, T.; Yang, J.; Meng, C.; Huan, B. Fault diagnosis of rolling bearings based on dynamic convolution and dual-channel feature fusion under variable working conditions. Meas. Sci. Technol. 2024, 35, 066110.
Hua, C.; Luo, K.; Wu, Y.; Shi, R. YOLO-ABD: A Multi-Scale Detection Model for Pedestrian Anomaly Behavior Detection. Symmetry 2024, 16, 1003.

Figure 1. Schematic of depthwise separable convolution.

Figure 2. Structure of the depthwise separable atrous convolution (DSAC). X represents the input feature maps, and Y represents the output feature maps.

Figure 3. Architecture of the ASPP module.

Figure 4. Basic architecture of channel attention (SE module). Different colors or shades of color visually represent the varying weight values assigned to different channels.

Figure 5. CBAM attention mechanism structure diagram. \otimes represents element-wise multiplication.

Figure 6. Schematic diagram of the SimAM module [40]. Color variations represent the importance weights of neurons at different spatial locations: bright colors indicate low neuronal energy and high importance, while dark colors signify high neuronal energy and low importance.

Figure 7. Overall architecture of the combined acoustic-vibration fault diagnosis model.

Figure 8. QPZZ-II rotating machinery vibration analysis and fault diagnosis test bench used to collect the experimental data (manufacturer: Jiangsu Qianpeng Diagnosis Engineering Co., Ltd., Zhenjiang, Jiangsu, China).

Figure 9. Bearings with different fault types. (a) Normal condition. (b) Outer race fault. (c) Inner race fault. (d) Rolling element fault.

Figure 10. Time-domain waveforms of vibration signals under four bearing conditions. (a) Normal. (b) Inner ring fault. (c) Outer ring fault. (d) Rolling element fault.

Figure 11. Time-domain waveforms of acoustic signals under four bearing conditions. (a) Normal. (b) Inner ring fault. (c) Outer ring fault. (d) Rolling element fault.

Figure 12. CWT-based time–frequency representations of bearing vibration signals under four states. (a) Normal. (b) Inner ring fault. (c) Outer ring fault. (d) Rolling element fault.

Figure 13. CWT-based time–frequency representations of bearing acoustic signals under four states. (a) Normal. (b) Inner ring fault. (c) Outer ring fault. (d) Rolling element fault.

Figure 14. Accuracy and loss curves of different models. (a) Accuracy over training epochs. (b) Loss over training epochs.

Figure 15. Confusion matrices of DSAC–ASPP under different training epochs. (a) Epochs = 10. (b) Epochs = 30. (c) Epochs = 60. (d) Epochs = 100.

Figure 16. t-SNE visualization of learned features under different training epochs.

Figure 17. Accuracy of each model under Gaussian white noise.

Table 1. Summary of Bearing Fault Diagnosis Approaches.

Method	Core Feature Extraction	Multi-Modal Signal Processing	Noise Robustness Strategy	Model Efficiency/Lightweight Design	Primary Focus
Traditional CNNs	Stacked standard convolutions	Typically single-modal	Relies on data augmentation & deep architecture	High parameter count, computationally intensive	Generic feature learning
Lightweight CNNs	Depthwise separable convolutions	Typically single-modal	Limited inherent robustness	High	Deployment efficiency
Advanced Diagnostic Networks	Time–frequency enhancement	Single-modal, focused on pre-processing	Pre-processing denoising via GNR strategy	Medium	Feature enhancement in noisy environments
Attention-Based Networks	Residual shrinkage + multi-scale attention	Single-modal	High	Lower	Strong noise suppression
Multi-Modal Fusion Networks	Multi-input fusion	Multi-Modal Signal Processing	Not explicitly emphasized	Lower	Multi-source information fusion
DSAC-ASPP	Depthwise separable atrous convolution (DSAC) + ASPP	Multi-Modal Signal Processing	High	High	Integrated multi-modality, strong robustness, and high computational efficiency
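
Table 1 rates designs built on depthwise separable convolutions as efficient. The arithmetic behind that rating can be sketched as follows, using the standard weight-count formulas for a k × k convolution and the usual expression for the span of a dilated kernel. The layer sizes (64 input channels, 128 output channels, 3 × 3 kernel) are hypothetical values chosen for illustration and are not taken from the paper.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weights in a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    return k * k * c_in + c_in * c_out


def effective_kernel_size(k, dilation):
    """Span covered by a k-tap kernel at the given dilation (atrous) rate."""
    return k + (k - 1) * (dilation - 1)


# Hypothetical layer sizes, for illustration only (not from the paper).
c_in, c_out, k = 64, 128, 3
std = standard_conv_params(c_in, c_out, k)
dsc = depthwise_separable_params(c_in, c_out, k)
print(f"standard: {std}, depthwise separable: {dsc}, ratio: {dsc / std:.4f}")

# Dilation widens the receptive field without adding any weights.
for rate in (1, 2, 4):
    print(f"dilation {rate}: effective kernel size {effective_kernel_size(k, rate)}")
```

For these assumed sizes the separable form needs roughly 12% of the weights of the standard convolution, and raising the dilation rate from 1 to 4 widens a 3-tap kernel to a span of 9 at no additional parameter cost.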

Table 2. Dataset Partitioning.

Condition	Training Samples	Validation Samples	Test Samples	Label
Normal	1200	400	400	N
Inner ring fault	1200	400	400	IF
Outer ring fault	1200	400	400	OF
Rolling element fault	1200	400	400	BF
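
The partitioning in Table 2 corresponds to a class-balanced 60/20/20 split of 2000 samples per condition. A minimal sketch of such a split is given below. The random shuffling, the seed, the function name `stratified_split`, and the placeholder sample identifiers are assumptions made for illustration; the source does not describe how samples were assigned to each subset.

```python
import random


def stratified_split(samples_by_class, n_train, n_val, n_test, seed=0):
    """Split each class separately so every subset stays class-balanced."""
    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for label, samples in samples_by_class.items():
        if len(samples) < n_train + n_val + n_test:
            raise ValueError(f"not enough samples for class {label}")
        shuffled = list(samples)
        rng.shuffle(shuffled)
        a, b = n_train, n_train + n_val
        splits["train"] += [(s, label) for s in shuffled[:a]]
        splits["val"] += [(s, label) for s in shuffled[a:b]]
        splits["test"] += [(s, label) for s in shuffled[b:b + n_test]]
    return splits


# Placeholder sample IDs stand in for the CWT time-frequency images.
labels = ["N", "IF", "OF", "BF"]
data = {lab: [f"{lab}_{i:04d}" for i in range(2000)] for lab in labels}

splits = stratified_split(data, n_train=1200, n_val=400, n_test=400)
print({name: len(items) for name, items in splits.items()})
```

With the per-class counts from Table 2, the subsets contain 4800, 1600, and 1600 samples, respectively.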

Table 3. Comparative analysis of bearing fault diagnosis methodologies.

Model	Accuracy (%)	Precision (%)	Recall (%)	F1-Score (%)	Parameters (M)	GFLOPs
DSAC-ASPP	98.21 ± 0.28	98.18 ± 0.31	99.23 ± 0.25	99.20 ± 0.27	1.2	0.45
Inception_v3	97.80 ± 0.55	97.65 ± 0.60	97.90 ± 0.58	97.77 ± 0.59	23.8	5.2
VGG_16	96.50 ± 0.85	96.40 ± 0.90	96.55 ± 0.88	96.47 ± 0.89	138.4	15.5
Resnet_50	95.80 ± 0.95	95.70 ± 1.00	95.85 ± 0.98	95.77 ± 0.99	25.6	4.1
ViT-Based Model	96.75 ± 0.45	97.60 ± 0.50	98.85 ± 0.48	98.72 ± 0.49	85.7	16.3
LightNAS Model	96.95 ± 0.35	97.88 ± 0.40	98.99 ± 0.38	98.93 ± 0.39	3.5	1.8
MFACNN	96.50 ± 0.60	96.45 ± 0.65	98.55 ± 0.62	98.50 ± 0.63	5.8	2.5
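
The metrics reported in Table 3 can all be derived from a confusion matrix. The sketch below assumes macro averaging, that is, the unweighted mean of per-class values; the paper does not state its averaging convention, so this is one possible reading rather than the authors' procedure. The confusion matrix is hypothetical (400 test samples per class, matching Table 2) and does not reproduce any reported result.

```python
def macro_metrics(cm):
    """Macro-averaged metrics; cm[i][j] counts true class i predicted as j."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(n))
    precisions, recalls, f1s = [], [], []
    for i in range(n):
        tp = cm[i][i]
        predicted = sum(cm[r][i] for r in range(n))  # column sum
        actual = sum(cm[i])                          # row sum
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)
    return {
        "accuracy": correct / total,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Hypothetical confusion matrix (rows: true N, IF, OF, BF), not from the paper.
cm = [
    [398, 1, 1, 0],
    [2, 392, 4, 2],
    [0, 5, 393, 2],
    [1, 2, 3, 394],
]
metrics = macro_metrics(cm)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.2f}%")
```

Under this convention, macro recall coincides with accuracy whenever the test set is class-balanced, so reported values that differ between the two columns would indicate a different averaging scheme.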

Table 4. Ablation study on signal modalities and the role of DRSN.

Model Configuration	Accuracy (%)	Difference vs. Full Dual-Channel Model (percentage points)
full dual-channel model	94.31 ± 0.91	—
Vibration-Only	90.15 ± 1.35	−4.16
Acoustic-Only (with DRSN)	92.80 ± 1.28	−1.51
Acoustic-Only (with standard Conv)	85.22 ± 1.52	−9.09

Table 5. Generalization results for different models.

Accuracy under Gaussian white noise (%), by SNR
Model	−8 dB	−4 dB	−2 dB	0 dB
DSAC-ASPP	79.64 ± 1.06	85.21 ± 0.73	89.38 ± 0.46	94.31 ± 0.21
Inception_v3	75.13 ± 0.89	77.64 ± 0.65	81.12 ± 0.52	91.86 ± 0.31
MobileNet_v2	66.37 ± 1.52	71.43 ± 1.15	76.94 ± 0.83	88.46 ± 0.46
VGG-16	73.11 ± 1.64	75.48 ± 1.03	79.32 ± 0.75	90.17 ± 0.19
ResNet50	54.62 ± 1.86	60.15 ± 1.24	66.75 ± 0.84	75.37 ± 0.26
ViT-FDM	77.50 ± 1.20	83.15 ± 0.95	87.20 ± 0.70	93.05 ± 0.40
LightNAS	75.80 ± 1.35	80.92 ± 1.10	85.41 ± 0.85	91.88 ± 0.55
MFACNN	74.25 ± 1.50	79.36 ± 1.25	83.97 ± 1.00	90.75 ± 0.60
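
The SNR levels in Table 5 fix the ratio of signal power to noise power; at −8 dB, for example, the noise carries about 6.3 times the power of the signal. A minimal sketch of corrupting a signal with Gaussian white noise at a target SNR follows. The sinusoidal test tone, the sampling rate, and the function names are hypothetical, and it is an assumption that noise is added to the raw time-domain signal before the CWT; the source does not give these details.

```python
import math
import random


def add_gaussian_noise(signal, snr_db, seed=0):
    """Add zero-mean Gaussian white noise scaled to the target SNR in dB."""
    rng = random.Random(seed)
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def measured_snr_db(clean, noisy):
    """Empirical SNR of a noisy signal relative to its clean reference."""
    signal_power = sum(c * c for c in clean) / len(clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(signal_power / noise_power)


# Hypothetical 100 Hz test tone sampled at 12 kHz for one second.
fs = 12000
clean = [math.sin(2 * math.pi * 100 * t / fs) for t in range(fs)]

for snr in (-8, -4, -2, 0):
    noisy = add_gaussian_noise(clean, snr)
    print(f"target {snr} dB, measured {measured_snr_db(clean, noisy):.2f} dB")
```

The measured SNR deviates slightly from the target because the noise power is set in expectation rather than rescaled to match exactly for each realization.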

Share and Cite

MDPI and ACS Style

Gu, X.; Liu, C.; Li, J.; Yu, X.; Tian, Y. Bearing Fault Diagnosis Based on a Depthwise Separable Atrous Convolution and ASPP Hybrid Network. Machines 2026, 14, 93. https://doi.org/10.3390/machines14010093
