A Bearing Fault Diagnosis Method Based on an Attention Mechanism and a Dual-Branch Parallel Network

Liu, Qiang; Chen, Minghao; Tang, Mingxin; Lai, Hongxi

doi:10.3390/app16094511

Open Access Article


¹ School of Mechanical Engineering, Guangdong Ocean University, Zhanjiang 524088, China

² Guangdong Provincial Key Laboratory of Intelligent Equipment for South China Sea Marine Ranching, Guangdong Ocean University, Zhanjiang 524088, China

³ Guangdong Marine Equipment and Manufacturing Engineering Research Center, Guangdong Ocean University, Zhanjiang 524088, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(9), 4511; https://doi.org/10.3390/app16094511

Submission received: 9 April 2026 / Revised: 26 April 2026 / Accepted: 30 April 2026 / Published: 3 May 2026


Abstract

Rolling bearings are core functional components of rotating machinery, and their application scope continues to expand across modern production and daily life, making research on rolling bearing fault diagnosis increasingly significant. Effective vibration feature extraction and improved classification models are crucial to achieving accurate and automated fault diagnosis of rolling bearings. We propose a fault diagnosis approach based on a Swin Transformer–Improved ResNet module. In the data preprocessing stage, the frequency-domain features and time-domain multi-scale features of fault signals are extracted using FFT and VMD, respectively. Dual-channel feature extraction is then performed using both the Swin Transformer and the Improved ResNet module, followed by feature fusion through an ECA module, thereby enhancing diagnostic accuracy and model robustness. The architecture retains shallow-level feature details while incorporating global contextual information, improving feature representation and detection precision. Extensive experiments were carried out on the SEU bearing dataset, including model validation, ablation analysis, comparative evaluation, and simulated noise testing. Under uniform experimental conditions, the proposed model achieved an average classification accuracy of 99.41%, outperforming other models by at least 0.96%. Even under severe noise interference with a signal-to-noise ratio of −4, the model maintained an average accuracy of 91.92%, exceeding that of noise-resistant counterparts. Moreover, generalization experiments on the CWRU bearing dataset under varying load conditions yielded an average fault diagnosis accuracy exceeding 98%, confirming the model’s strong cross-domain adaptability.

Keywords: bearing fault diagnosis; deep learning; parallel fusion; attention mechanism

1. Introduction

Driven by attributes such as high adaptability, superior integrability, dynamic reconfiguration capability, and substantial data processing power, artificial intelligence has continued to evolve, placing it at the forefront of Industry 4.0 development [1]. In modern industrial production, mechanical equipment constitutes a critical element underpinning the efficient operation of industrial systems. The integration of artificial intelligence, IoT, and related technologies has markedly improved the refinement, systematization, intelligence, and automation of mechanical systems [2]. Simultaneously, the deployment of multiple sensors in industrial scenarios has become increasingly prevalent [3]. Bearings, as essential components in rotating machinery, play a vital role in ensuring operational stability. Statistical studies reveal that approximately 40% of motor failures stem from bearing defects [4], underscoring the critical importance of bearing fault diagnosis. Efforts from both academia and industry have consistently focused on exploring effective methods for bearing fault diagnosis. Current research predominantly centers around three major domains: signal processing, machine learning, and deep learning. The Fast Fourier Transform (FFT) [5] is widely recognized as a fundamental approach for frequency-domain analysis, transforming time-domain signals into frequency representations to reveal spectral features. Other commonly adopted techniques include Empirical Mode Decomposition (EMD) [6], Wavelet Transform (WT) [7], and Variational Mode Decomposition (VMD) [8]. However, due to the variability of operating conditions in real industrial settings and the intricate nature of fault mechanisms, single signal processing approaches often fail to extract health-related features accurately. With advancements in IoT sensing technologies, the pervasive use of sensors in industrial equipment has facilitated the rise of data-driven diagnostic methodologies. 
Machine learning methods have gained prominence for their ability to identify fault characteristics. Methods such as Multilayer Perceptron (MLP) [9], Hidden Markov Model (HMM) [10], and Support Vector Machine (SVM) [11] have demonstrated considerable success in bearing fault identification. Nonetheless, the complex and dynamic operating conditions encountered in practice render manually crafted feature-based methods unable to meet the strict performance demands of contemporary industrial fault diagnosis.

With its well-validated automatic feature extraction capabilities, deep learning has been broadly deployed in numerous research domains, ranging from image recognition and speech recognition to natural language processing, and has thus become a leading research direction in intelligent information processing. Deep learning algorithms, particularly Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Long Short-Term Memory networks (LSTMs), and Deep Belief Networks (DBNs), have significantly propelled advancements in fault diagnosis. Jiang et al. [12] constructed an interpretable CNN using gradient-weighted class activation mapping, which achieved high diagnostic accuracy while offering transparent interpretability of the prediction process. To capture multi-scale features and enhance prognostic precision, Liu et al. [13] introduced a novel bearing remaining useful life prediction framework, TcLstmNet-CBAM. To improve feature extraction from signals, Wang et al. [14] employed Andrews plots to extract fault features from online process measurements and used a convolutional neural network to further extract diagnostic information from the outputs of the Andrews plots, so as to address the uncertainty in setting the correct dimension of the features extracted from them. Image-based methods have been widely applied in fault detection algorithms for mechanical systems. These images are derived from vibration signals transformed from the time domain to the time–frequency domain. Spirto et al. [15] compared image-based convolutional neural network methods that use time–frequency transforms and SDP transforms to generate the input images, and found that the latter can significantly reduce computational costs. Tang et al. [16] developed a bidirectional DBN-based diagnostic approach, which improved feature learning efficiency and reduced reliance on the quality of training data. Despite these developments, several limitations persist. 
CNNs are constrained by local receptive fields and require deep stacking to capture global context, which increases model complexity. Moreover, pooling operations may discard semantically important information. RNNs face challenges such as vanishing or exploding gradients, limited long-term memory retention, lack of parallelizability, and extended training durations. LSTMs are structurally complex, leading to increased risk of overfitting, while DBNs inherently struggle to model intricate pattern variations and multi-modal data relationships. In practical industrial applications, noisy environments are prevalent, making the presence of noise-contaminated labels inevitable [17]. Lin et al. [18] proposed an insulated bearing fault diagnosis method integrating shape-aware kernel attention (SAKA) and dynamic physics guidance (DPG). Liu et al. [19] proposed a method based on attention-enhanced MpResCNN-BiLSTM to address the challenges of bearing fault diagnosis in bogie transmission systems. Wang et al. [20] proposed a lightweight intra–inter-domain adaptive network (LIIDAN). To solve the problem of noise interference, He et al. [21] proposed the deep residual network (DRN), which mitigates the vanishing and exploding gradient problems in deep neural networks through the use of cross-connections and residual modules, maintaining high classification performance even with substantial network depth. Zhao et al. [22] developed the Deep Residual Shrinkage Network (DRSN), which retains the depth advantages of DRN while incorporating an attention mechanism and soft-threshold function for effective noise suppression. Zhu et al. [23] introduced the Improved Deep Residual Shrinkage Network (IDRSN), designed for engines under various fault levels and operating conditions. However, limitations remain. DRNs may suffer from feature redundancy, and their simple residual pathways can constrain the upper limit of representational capacity. 
DRSNs are relatively difficult to train and debug, exhibit low computational efficiency, and possess limited applicability. IDRSNs adopt a single-channel diagnostic structure, which hinders comprehensive feature learning. Their diagnostic efficiency is lower than dual-channel approaches, making deployment challenging in industrial scenarios requiring high accuracy and reliability.

The Transformer framework, a landmark attention-driven neural network structure first presented and formalized by Vaswani et al. [24], replaces the RNN structures traditionally used in natural language processing with a self-attention mechanism, thereby eliminating the need for convolutional or recurrent layers. Notably, self-attention offers advantages in terms of parallel computation and shorter path lengths between distant positions. The Transformer has since been widely adopted across a broad range of modern deep learning applications. However, the original Transformer primarily emphasizes global attention while often neglecting local contextual dependencies. Furthermore, its quadratic computational complexity with respect to sequence length limits scalability in long-sequence scenarios. To mitigate these limitations, Fang et al. [25] proposed CLFormer, which integrates convolutional embedding with linear self-attention to enable fault diagnosis under limited-sample conditions. Gao et al. [26] introduced the Twins Transformer, which effectively extracts both temporal and frequency-domain features via a cross-attention mechanism, although the model remains relatively complex. To address challenges in bearing fault diagnosis, including data scarcity, class imbalance, and high levels of noise, Hou et al. [27] developed a hybrid model that combines Transformer and ResNet architectures for joint feature extraction. Nevertheless, traditional architectures that rely on Softmax classification layers encounter difficulties in small-sample contexts, where linear classifiers often fail to capture complex data distributions, leading to overfitting or biased parameter estimation. Tang et al. [28] proposed a composite model based on an FFT-Transformer framework, leveraging multi-head attention to extract fault features from vibration signals, thus enabling the accurate identification and differentiation of concurrent faults. 
However, substituting self-attention entirely with linear FFT operations typically reduces the model’s expressive capacity, resulting in inferior performance compared to standard Transformer models in tasks requiring complex comprehension. To further enhance feature representation and address more intricate challenges, Liu et al. [29] proposed the Swin Transformer. By introducing a shifted window mechanism, this architecture significantly reduces computational complexity while enabling effective global feature extraction through self-attention. This design enhances the model’s generalization performance across diverse application domains.

In light of the strengths and limitations of existing technologies, this paper proposes a bearing fault diagnosis model based on the Swin Transformer–Improved ResNet module. In the data preprocessing stage, the frequency-domain features and time-domain multi-scale features of fault signals are extracted using FFT and VMD methods, respectively. Subsequently, a dual-branch parallel feature extraction network is constructed by integrating the Swin Transformer and Improved ResNet module. Finally, an ECA module is introduced to adaptively adjust feature channels and assign weights during the fusion process. The core academic contributions of this work to the field of intelligent fault diagnosis are outlined in the following:

The integration of Swin Transformer and Improved ResNet module combines the local feature extraction capability of ResNet with the global contextual modeling capacity of the self-attention mechanism in Swin Transformer. By fusing features extracted from different stages and both branches, the proposed model enhances the richness of feature representation and ensures comprehensive feature learning, thereby improving detection accuracy.
The Improved ResNet module incorporates the low computational complexity of depthwise separable convolution and the non-linear enhancement capability of pointwise convolution. By replacing the two standard convolution layers in traditional deep residual networks, the Improved ResNet module improves the model’s robustness against noise.
By employing an efficient adaptive fusion strategy based on the ECA module, features from both branches are reweighted and fused, which reduces the computational complexity of the dual-branch model while further enhancing its representational power and generalization capability.

The structure of this paper is as follows: Section 2 reviews related knowledge. Section 3 describes the proposed model. Section 4 presents the experiments, and finally, Section 5 concludes the paper.

2. Related Knowledge

2.1. Swin Transformer

The Swin Transformer (Shifted Window Transformer) is a novel Transformer architecture that has demonstrated competitive performance across a wide range of computer vision applications. By integrating localized window partitioning with a shifted window mechanism, it effectively captures global contextual dependencies while significantly reducing computational complexity, thereby improving overall processing efficiency. The model is composed of three primary components: Patch Partition, Swin Transformer block, and Patch Merging. In this study, the Swin-T (Tiny) variant is employed, with its architectural details illustrated in Figure 1.

Following segmentation and linear embedding, the signals are fed into the Swin Transformer block, as depicted in Figure 2. The block comprises W-MSA, SW-MSA, MLP, and LN components, where W-MSA denotes the window-based multi-head self-attention module, SW-MSA the shifted window-based multi-head self-attention module, MLP the multi-layer perceptron, and LN layer normalization. Restricting attention to windows reduces the overall computational complexity, while the shifted windows of SW-MSA restore information exchange between neighboring windows. The W-MSA and SW-MSA blocks are arranged in an alternating fashion, with layer normalization applied before each attention module and each MLP, and a residual connection added after each module to facilitate convergence and alleviate gradient vanishing.

To enable efficient modeling, the Swin Transformer computes self-attention within multi-scale local windows rather than employing global self-attention. By restricting attention operations to non-overlapping local windows, this approach substantially reduces computational overhead while preserving modeling capability. The hierarchical architecture, combined with a shifted window mechanism, facilitates cross-window information exchange, resulting in linear computational complexity with respect to input size. For a feature map of h × w patch tokens partitioned into windows of M × M tokens, the computational complexities of the two attention mechanisms are as follows:

$$ \begin{cases} \Omega(\mathrm{MSA}) = 4hwC^{2} + 2(hw)^{2}C \\ \Omega(\text{W-MSA}) = 4hwC^{2} + 2M^{2}hwC \end{cases} \tag{1} $$

In this context, $\Omega(\mathrm{MSA})$ denotes the computational complexity of global self-attention, while $\Omega(\text{W-MSA})$ refers to that of window-based attention. Here, $C$ represents the embedding dimension, $M$ denotes the window size, and $h$ and $w$ correspond to the height and width of the feature map measured in patches. When $M$ is fixed, the complexity of global self-attention grows quadratically with the number of patches, as $(hw)^{2}$, whereas that of window-based attention grows linearly with $hw$. This reduction in complexity contributes to improved computational efficiency and faster training.
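The gap between the two expressions in Equation (1) can be checked numerically. The short sketch below simply evaluates both formulas; the values of C, M, and the feature-map sizes are illustrative assumptions and are not taken from the paper's configuration.

```python
def msa_flops(h: int, w: int, C: int) -> int:
    """Eq. (1), first line: global multi-head self-attention."""
    return 4 * h * w * C**2 + 2 * (h * w) ** 2 * C


def wmsa_flops(h: int, w: int, C: int, M: int) -> int:
    """Eq. (1), second line: window-based self-attention with M x M windows."""
    return 4 * h * w * C**2 + 2 * M**2 * h * w * C


# Illustrative values only (assumed, not the paper's settings).
C, M = 96, 7
for side in (56, 112):
    g = msa_flops(side, side, C)
    l = wmsa_flops(side, side, C, M)
    print(f"{side}x{side}: MSA={g:,}  W-MSA={l:,}  ratio={g / l:.1f}")
```

Doubling the number of patches doubles the W-MSA cost, while the second term of the MSA cost quadruples, which is the linear versus quadratic behavior described above.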

The formula for calculating attention is as follows:

$$ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d}} + B\right)V \tag{2} $$

where $Q, K, V \in \mathbb{R}^{M^{2} \times d}$ are the query, key, and value matrices, respectively; $d$ is the dimensionality of the query or key vectors; and $B$ is the bias matrix added in the attention calculation.
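As a minimal sketch of Equation (2), the following NumPy function computes single-head attention for the tokens of one window. The window size, head dimension, random inputs, and the zero bias matrix are all assumptions made for illustration; a trained model would use learned projections and a learned bias.

```python
import numpy as np


def window_attention(Q, K, V, B):
    """Single-head attention inside one window, following Eq. (2).

    Q, K, V: arrays of shape (M*M, d); B: bias of shape (M*M, M*M).
    """
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d) + B
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V


rng = np.random.default_rng(0)
M, d = 4, 8  # hypothetical window size and head dimension
Q, K, V = (rng.standard_normal((M * M, d)) for _ in range(3))
B = np.zeros((M * M, M * M))  # zero bias used as a placeholder
out = window_attention(Q, K, V, B)
print(out.shape)  # (16, 8)
```

Each output row is a convex combination of the rows of V, so the result keeps the shape of the value matrix.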

When the W-MSA module is used, self-attention operations are confined within individual windows, which blocks direct information exchange across different windows. To mitigate this inherent limitation, an SW-MSA module is placed after each W-MSA layer. The shifted window mechanism redefines the window boundaries through cyclic shifting, enabling information flow across adjacent windows while preserving computational efficiency. This design facilitates global context modeling, enabling the network to simultaneously capture both local features and long-range dependencies, ultimately improving overall performance.
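The effect of cyclic shifting on window membership can be illustrated on a toy map. The sketch below assumes a shift of half the window size, as in the original Swin Transformer design, and omits the attention mask that a full implementation applies to tokens wrapped around the border; the function names are hypothetical.

```python
import numpy as np


def window_partition(x, M):
    """Split an (H, W) map into non-overlapping M x M windows -> (num_windows, M, M)."""
    H, W = x.shape
    x = x.reshape(H // M, M, W // M, M)
    return x.transpose(0, 2, 1, 3).reshape(-1, M, M)


def cyclic_shift(x, M):
    """Roll the map by M // 2 toward the top-left before partitioning."""
    return np.roll(x, shift=(-(M // 2), -(M // 2)), axis=(0, 1))


x = np.arange(16).reshape(4, 4)  # token indices on a 4 x 4 map
regular = window_partition(x, M=2)
shifted = window_partition(cyclic_shift(x, M=2), M=2)
print(regular[0])  # [[0 1] [4 5]]
print(shifted[0])  # [[5 6] [9 10]]
```

The first shifted window groups tokens 5, 6, 9, and 10, which belong to four different regular windows, so attention computed after the shift mixes information across the original window boundaries.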

The Swin Transformer block consists of two consecutive stages. In the first stage, as shown in Equation (3), the input feature $z^{l-1}$ undergoes LN and W-MSA computation, and the output is added to $z^{l-1}$ to obtain the intermediate feature $\hat{z}^{l}$. Next, $\hat{z}^{l}$ is passed through an MLP with LN, and the result is added to $\hat{z}^{l}$ to yield $z^{l}$. This process completes the first stage.

$$ \begin{cases} \hat{z}^{l} = \text{W-MSA}(\mathrm{LN}(z^{l-1})) + z^{l-1} \\ z^{l} = \mathrm{MLP}(\mathrm{LN}(\hat{z}^{l})) + \hat{z}^{l} \end{cases} \tag{3} $$

In the second stage, the output $z^{l}$ serves as the input for LN and SW-MSA, followed by a residual connection to produce $\hat{z}^{l+1}$. This result is then processed by another MLP with LN, and the MLP output is added to $\hat{z}^{l+1}$ to obtain the final output $z^{l+1}$ of the current layer.

$$ \begin{cases} \hat{z}^{l+1} = \text{SW-MSA}(\mathrm{LN}(z^{l})) + z^{l} \\ z^{l+1} = \mathrm{MLP}(\mathrm{LN}(\hat{z}^{l+1})) + \hat{z}^{l+1} \end{cases} \tag{4} $$

Here, $z^{l-1}$ denotes the input feature of the Swin Transformer block; $\hat{z}^{l}$ and $z^{l}$ represent the intermediate and output features of the first stage, respectively; and $\hat{z}^{l+1}$ and $z^{l+1}$ correspond to the intermediate and final output features of the second stage.
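The residual wiring of the two stages described above can be sketched independently of the attention details. In the example below the attention and MLP sublayers are passed in as callables, and the layer normalization has no learned scale or shift; both simplifications, and the function names, are assumptions for illustration only.

```python
import numpy as np


def layer_norm(z, eps=1e-5):
    """Normalize each token (last axis) to zero mean and unit variance."""
    mu = z.mean(axis=-1, keepdims=True)
    var = z.var(axis=-1, keepdims=True)
    return (z - mu) / np.sqrt(var + eps)


def swin_block_pair(z_prev, w_msa, sw_msa, mlp1, mlp2):
    """Two consecutive stages with pre-normalization and residual additions."""
    z_hat = w_msa(layer_norm(z_prev)) + z_prev           # first stage, attention
    z = mlp1(layer_norm(z_hat)) + z_hat                  # first stage, MLP
    z_hat_next = sw_msa(layer_norm(z)) + z               # second stage, attention
    z_next = mlp2(layer_norm(z_hat_next)) + z_hat_next   # second stage, MLP
    return z_next


# Stand-in sublayers (hypothetical): zero maps make every residual branch vanish.
zero = lambda t: np.zeros_like(t)
tokens = np.random.default_rng(0).standard_normal((16, 8))
out = swin_block_pair(tokens, zero, zero, zero, zero)
print(np.allclose(out, tokens))  # True
```

With all sublayers returning zero, the input passes through unchanged, which shows how the skip connections preserve the signal even when a sublayer contributes nothing.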

2.2. Improved ResNet Module

The core component of DRN is the residual building block. The structure of a conventional residual module is illustrated in Figure 3. Let $x$ denote the input of the residual module, $F(x)$ the residual mapping function, $G(x)$ the underlying identity mapping function, and $W^{l}$ the weight matrix applied as the input passes through the first convolutional layer within the residual module. This block primarily comprises convolutional layers, BN layers, and ReLU activation functions, with skip connections enabling cross-layer signal propagation. Such a design allows gradients from deeper layers to be effectively propagated to earlier layers during training, thereby mitigating performance degradation in deep neural networks through residual learning.

Chollet [30] proposed the use of depthwise separable convolution to replace standard convolution operations, addressing the challenges of diminished diagnostic performance and increased computational cost associated with deeper networks. The depthwise separable convolution consists of two sequential components: a depthwise convolution (DWC) layer with conventional kernel width, and a pointwise convolution (PWC) layer with a 1 × 1 kernel that performs channel-wise linear combinations. Unlike standard convolution, the DWC layer applies convolutions independently on each channel using a number of kernels equal to the number of input channels. Its mathematical formulation is as follows:

$$ S_{i, j, m} = \sum_{w, h}^{W, H} V_{w, h, m} \cdot X_{i + w, j + h, m} \tag{5} $$

In the formula, $S$ represents the output feature; $V$ refers to the convolution kernel with width $W$ and height $H$; $X$ represents the input feature map; $m$ indicates the $m$-th channel of the feature; $(i, j)$ stands for the spatial coordinates of the output feature in the $m$-th channel; and $(w, h)$ corresponds to the coordinates of the matching weight element in the convolution kernel for the $m$-th channel.
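A direct NumPy transcription of Equation (5) makes the per-channel independence explicit. The sketch below assumes valid padding, stride 1, and zero-based kernel offsets; these choices and the toy input are illustrative assumptions rather than the paper's layer settings.

```python
import numpy as np


def depthwise_conv2d(X, V):
    """Depthwise convolution of Eq. (5), valid padding, stride 1.

    X: input of shape (rows, cols, channels); V: kernels of shape (W, H, channels).
    Channel m of the output depends only on channel m of the input.
    """
    W, H, C = V.shape
    out_rows, out_cols = X.shape[0] - W + 1, X.shape[1] - H + 1
    S = np.zeros((out_rows, out_cols, C))
    for w in range(W):
        for h in range(H):
            # Adds V[w, h, m] * X[i + w, j + h, m] for every output position (i, j).
            S += V[w, h, :] * X[w:w + out_rows, h:h + out_cols, :]
    return S


X = np.ones((4, 4, 2))
X[..., 1] *= 2.0          # make the two channels distinguishable
V = np.ones((3, 3, 2))
S = depthwise_conv2d(X, V)
print(S[..., 0])  # every entry is 9.0
print(S[..., 1])  # every entry is 18.0
```

Because no term mixes different values of m, a separate pointwise convolution is needed afterwards to combine information across channels.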

Pointwise convolution operates similarly to standard convolution, with its primary function being the weighted combination of output features along the channel dimension. One-dimensional depthwise separable convolution first employs a depthwise convolution to extract features independently from each channel, followed by a pointwise convolution to integrate the resulting features across channels. Assuming an input feature map with width $W_{n}$ and $M$ channels, a convolution kernel of width $W$, and a total of $K$ kernels, the computational cost $T_{1}$ of the depthwise separable convolution and the computational cost $T_{2}$ of the standard convolution are given in Formula (6).

$$ \begin{cases} T_{1} = W_{n} \times M \times W + K \times M \times W_{n} \\ T_{2} = W_{n} \times M \times K \times W \end{cases} \tag{6} $$

According to Formula (6), the ratio of the computational cost of depthwise separable convolution to that of standard convolution is $T_{1}/T_{2} = \frac{1}{K} + \frac{1}{W}$. Since the kernel width $W$ typically takes values such as 3, 5, or 7, and the number of kernels $K$ is greater than 1, this ratio is below 1, so the computational cost of depthwise separable convolution is consistently lower than that of standard convolution.
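The cost ratio can be verified numerically from Formula (6). The layer sizes in this sketch are hypothetical values chosen only to make the arithmetic concrete.

```python
def separable_cost(W_n, M, W, K):
    """T1 in Formula (6): depthwise pass plus pointwise pass."""
    return W_n * M * W + K * M * W_n


def standard_cost(W_n, M, W, K):
    """T2 in Formula (6): standard one-dimensional convolution."""
    return W_n * M * K * W


# Hypothetical layer sizes chosen only for illustration.
W_n, M, W, K = 1024, 16, 3, 32
ratio = separable_cost(W_n, M, W, K) / standard_cost(W_n, M, W, K)
print(ratio, 1 / K + 1 / W)
```

For these values the separable layer needs roughly 36% of the operations of the standard layer, and the saving grows as the kernel width or the number of kernels increases.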

As illustrated in Figure 4, the core innovation of the Optimized Residual Building Block (ORBB) proposed in this study lies in replacing the standard convolutional layers in the residual module with depthwise separable convolutional layers. The DWC layer reduces model complexity by decoupling channel-wise correlations, thereby enhancing the parameter efficiency of the convolution kernels. The subsequent pointwise convolution layer projects multi-channel features into a single channel while preserving salient information and enhancing the non-linear representational capacity of the network. This structure improves the model’s ability to extract subtle features that may be obscured by noise. In addition, a BN layer is incorporated after the depthwise separable convolution to mitigate feature distribution shifts under noisy conditions, thereby further improving the model’s noise robustness and generalization performance.

2.3. Efficient Channel Attention Networks

Wang et al. [31] proposed the Efficient Channel Attention (ECA) module, illustrated in Figure 5, a lightweight channel attention mechanism designed to learn inter-channel dependencies efficiently. The core concept of ECA is to capture channel-wise relevance and importance differences without incurring substantial computational overhead. Accordingly, the ECA module is integrated after the feature fusion stage in this study.

Initially, global average pooling (GAP) is applied to the output features. The corresponding computation is formulated as follows:

$$ y_{c} = \frac{1}{H \times W} \sum_{i = 1}^{H} \sum_{j = 1}^{W} X(i, j, c) \tag{7} $$

In the formula, $y_{c}$ represents the $c$-th element of the feature vector $y$ obtained after the GAP layer; $H$ and $W$ represent the height and width of the feature map, respectively; and $X(i, j, c)$ denotes the feature value at the $i$-th row and $j$-th column of the $c$-th channel. Following global average pooling, the adaptive one-dimensional convolution kernel size $k$ is determined from the number of channels $C$. The calculation is as follows:

$$ k = \psi(C) = \left| \frac{\log_{2} C}{\gamma} + \frac{b}{\gamma} \right|_{\mathrm{odd}} \tag{8} $$

where $|\cdot|_{\mathrm{odd}}$ denotes the nearest odd number, $\gamma = 2$, and $b = 1$. After obtaining the kernel size $k$, a one-dimensional convolution is performed to compute the channel attention weights, as shown below:

$$ \omega = \sigma(\mathrm{Conv1D}_{k}(y)) \tag{9} $$

In the formula, $\omega$ represents the attention weight; $\mathrm{Conv1D}_{k}(\cdot)$ is a one-dimensional convolution with a kernel size of $k$; and $\sigma(\cdot)$ is the Sigmoid function.

The final weighted features are generated by channel-wise multiplication of the original input features with the learned attention weights, which enhances the ability to capture features of different scales and dynamically strengthens the key sensor channels related to faults while suppressing noise interference.
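The three ECA steps (pooling, adaptive one-dimensional convolution across channels, and sigmoid reweighting) can be sketched in a few lines of NumPy. The rounding rule used to reach an odd kernel size, the zero padding at the channel boundaries, and the random kernel standing in for learned weights are assumptions made for this illustration.

```python
import math
import numpy as np


def eca_kernel_size(C, gamma=2, b=1):
    """Eq. (8): map the channel count C to an odd 1-D kernel size."""
    t = int(abs(math.log2(C) / gamma + b / gamma))
    return t if t % 2 else t + 1  # assumption: even values are bumped to the next odd


def eca(X, kernel):
    """Apply ECA to X of shape (H, W, C) with a 1-D kernel of odd length k."""
    y = X.mean(axis=(0, 1))                 # Eq. (7): global average pooling -> (C,)
    k = len(kernel)
    y_pad = np.pad(y, k // 2)               # zero padding keeps the output length at C
    conv = np.array([y_pad[c:c + k] @ kernel for c in range(len(y))])
    omega = 1.0 / (1.0 + np.exp(-conv))     # Eq. (9): sigmoid gives weights in (0, 1)
    return X * omega                        # channel-wise reweighting


rng = np.random.default_rng(0)
X = rng.standard_normal((4, 4, 64))
k = eca_kernel_size(64)                     # 3 for C = 64 with gamma = 2, b = 1
kernel = rng.standard_normal(k)             # stand-in for learned weights
out = eca(X, kernel)
print(k, out.shape)
```

Only k weights are involved regardless of the number of channels, which is why the module adds very little overhead to the fused feature map.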

3. Proposed Method

Conventional one-dimensional neural networks are prone to signal distortion caused by high-frequency interference and random noise in complex operational environments. Consequently, these networks often struggle to extract discriminative features from raw signals and fail to achieve the desired prediction accuracy. To overcome this limitation, this study proposes a bearing fault diagnosis method based on a dual-branch parallel fusion model that integrates the Swin Transformer and the Improved ResNet module. The overall framework is illustrated in Figure 6. First, the frequency-domain features and time-domain multi-scale features of fault signals are extracted using FFT and VMD, respectively, thereby emphasizing components associated with fault characteristic frequencies. The resulting frequency-domain samples are then fed into the feature extraction module. Finally, the extracted features are processed via fully connected layers and a Softmax classifier to output multiple fault categories, enabling accurate bearing fault diagnosis.

In this paper, the Swin Transformer architecture originally designed for two-dimensional images is adapted to the task of fault diagnosis using one-dimensional vibration signals. Since one-dimensional time-series signals lack a two-dimensional spatial structure, FFT and VMD are employed to construct pseudo-two-dimensional feature maps, enabling compatibility with the input format of the original model.

For the one-dimensional frequency-domain amplitude features output by FFT, dimension regularization is first performed via zero-padding or truncation, so that the total feature length can be decomposed into a two-dimensional size $H \times W$ that meets the patch partitioning requirements. Min–max normalization is then applied to map the features to the range [0, 1]. Finally, the one-dimensional vector is reshaped into an $H \times W$ two-dimensional grid, forming a single-channel pseudo-image input.

For the multi-component IMF features obtained by VMD, length alignment and normalization are performed component-wise. Each component is then reshaped into an H × W two-dimensional structure and stacked along the channel dimension to form multi-channel pseudo-image patches. After the above dimension adjustment, normalization and reshaping operations, the features can be directly fed into the Patch Partition module of the Swin Transformer, enabling normal computation of window attention and hierarchical downsampling. In this way, the adaptation to one-dimensional vibration signals is achieved without modifying the backbone structure of the model.
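The FFT branch of this conversion (pad or truncate, min–max normalize, reshape) can be sketched as follows. The use of a one-sided spectrum, the 32 × 32 target size, and the synthetic test tone are assumptions for illustration; the paper's actual sample length and grid size are not stated here.

```python
import numpy as np


def fft_to_pseudo_image(signal, H, W):
    """Turn a 1-D signal into an (H, W) single-channel map of FFT amplitudes."""
    amp = np.abs(np.fft.rfft(signal))              # one-sided amplitude spectrum (assumed)
    target = H * W
    if amp.size < target:
        amp = np.pad(amp, (0, target - amp.size))  # zero-pad the tail
    else:
        amp = amp[:target]                         # truncate
    span = amp.max() - amp.min()
    amp = (amp - amp.min()) / span if span > 0 else np.zeros_like(amp)
    return amp.reshape(H, W)


t = np.arange(2048) / 2048
signal = np.sin(2 * np.pi * 50 * t)   # synthetic tone with 50 cycles in the record
img = fft_to_pseudo_image(signal, H=32, W=32)
print(img.shape, img.min(), img.max())
```

The same pad, normalize, and reshape steps could be applied to each VMD component, with the resulting maps stacked along a channel axis (for example with `np.stack`) to form the multi-channel input described above.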

3.1. Signal Preprocessing

Raw vibration signals are initially subjected to preprocessing via FFT and VMD, with the aim of extracting time–frequency-domain features. Subsequent to this preprocessing step, the processed data is partitioned into three subsets: training, validation, and test sets. Specifically, for the signal preprocessing procedure, each of the four distinct types of fault signals undergoes both FFT and VMD. The fault time-domain signals, after being transformed by FFT, are presented visually in Figure 7.

It can be seen from the figure that the image after FFT has relatively obvious features, indicating that FFT can effectively extract frequency-domain features. There are obvious differences between the spectrograms corresponding to different faults, and these features help to distinguish different fault types and their severity levels.

Although the Fast Fourier Transform has many advantages, it cannot effectively capture the variation in frequency with time when dealing with non-stationary signals. To avoid this problem, VMD is introduced in the signal preprocessing stage.

The VMD method transforms the multi-component signal decomposition problem into a variational optimization problem, and adaptively iterates to obtain several Intrinsic Mode Functions (IMFs), exhibiting excellent performance in noise suppression and component separation. By constructing and solving a constrained variational problem, VMD iteratively updates each mode and its central frequency, enabling an energy concentration of each mode around its respective central frequency with minimized bandwidth. For simplicity of analysis, the number of modes to be decomposed is generally assumed to be $K$, and each mode $u_{k}(t)$ can be regarded as a band-pass signal near the central frequency $\omega_{k}$. When the modes are separated from each other in the frequency spectrum, the overall signal spectrum is reasonably partitioned.

Utilizing the central frequency method, an analysis is conducted on the central frequencies corresponding to different values of K. The outcomes of the VMD are illustrated in Figure 8. It is found that, when K equals 4, certain modes start to display similar central frequencies, which signifies the occurrence of over-decomposition; excessive decomposition does not necessarily facilitate feature extraction. Other parameters are configured as follows: α = 4000 and τ = 0.03. After the data is denoised by means of VMD, noise components are effectively eliminated, while both the fault and normal signal characteristics are preserved. This denoising process ultimately contributes to enhancing the accuracy of fault classification.

Finally, the vector concatenation method is adopted for combination. The FFT spectrum and the VMD components are stacked along the channel dimension to generate multi-scale fusion features. By combining FFT and VMD, the signal can first be transformed into the frequency-domain using FFT to obtain its spectral information. Meanwhile, VMD is applied to the fault signal to decompose it into a series of modal functions. By analyzing this information, multi-scale features in the fault signal can be mined, thereby better understanding the time–frequency characteristics of the signal. The method enables a more comprehensive analysis of the signal and facilitates signal processing tasks in application scenarios such as fault detection and diagnosis.

The full dataset is randomly split into three subsets with a stratified ratio of 7:2:1, namely the training set, validation set, and held-out test set. The training set is used for model fitting, the validation set for early stopping and hyperparameter optimization, and the independent test set is exclusively used to quantify the generalization capability of the proposed dual-branch fusion model. Test data does not participate in the training process.
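A stratified 7:2:1 split of this kind can be sketched with NumPy alone. The function name, the random seed, and the toy label vector (four classes with 100 samples each) are hypothetical; the actual sample counts of the dataset are not assumed here.

```python
import numpy as np


def stratified_split(labels, ratios=(0.7, 0.2, 0.1), seed=0):
    """Return train/val/test index arrays with per-class proportions close to `ratios`."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    parts = ([], [], [])
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))  # shuffle within the class
        n_train = int(round(ratios[0] * idx.size))
        n_val = int(round(ratios[1] * idx.size))
        parts[0].extend(idx[:n_train])
        parts[1].extend(idx[n_train:n_train + n_val])
        parts[2].extend(idx[n_train + n_val:])                # remainder goes to test
    return tuple(np.array(p, dtype=int) for p in parts)


labels = np.repeat(np.arange(4), 100)  # hypothetical: 4 classes, 100 samples each
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))  # 280 80 40
```

Splitting class by class keeps every fault type represented in all three subsets, and the three index sets are disjoint, so test samples never reach the training loop.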

3.2. Swin Transformer–Improved ResNet Module

In the proposed model, the preprocessed signals are simultaneously processed through two parallel branches:

Branch 1: The preprocessed data is fed into the Swin Transformer network, which uses its window-based attention mechanism to extract local fault-related features from the signals. Drawing on the design principles of convolutional neural networks, the Swin Transformer builds up global context through its shifted-window attention mechanism while reducing the computational complexity from quadratic to linear in the input resolution. This linear complexity substantially lowers the computational overhead and the training cost. The specific parameters of the Swin Transformer network are listed in Table 1.

Branch 2: As illustrated in Figure 9, the frequency-domain signal samples are concurrently input into the Improved ResNet module. The input signals first pass through a standard convolutional layer with wide kernels followed by a pooling layer, which reduces the influence of noise on useful feature extraction. The signals are then processed by the ORBB layer and the pooling layer, where the combined use of DWC and PWC strengthens the extraction of non-sensitive features that may be masked by noise. A BN layer is applied to alleviate internal covariate shift caused by noise interference.
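The quadratic-to-linear reduction mentioned for Branch 1 can be made concrete with the cost expressions given in the original Swin Transformer paper, $4hwC^2 + 2(hw)^2C$ for global self-attention and $4hwC^2 + 2M^2hwC$ for window attention with $M \times M$ windows. The sketch below only evaluates these two expressions; the function names are hypothetical, and the inputs are the Stage 1 settings of Table 1.

```python
def msa_flops(h, w, c):
    """Global multi-head self-attention cost on an h x w map with c channels."""
    tokens = h * w
    return 4 * tokens * c**2 + 2 * tokens**2 * c


def window_msa_flops(h, w, c, m):
    """Window-based self-attention cost with non-overlapping m x m windows."""
    tokens = h * w
    return 4 * tokens * c**2 + 2 * m**2 * tokens * c


# Stage 1 of Table 1: 56 x 56 tokens, 96 channels, 7 x 7 windows.
print(msa_flops(56, 56, 96))            # 2003828736
print(window_msa_flops(56, 56, 96, 7))  # 145108992
```

Because the window size is fixed, the second expression grows in proportion to the number of tokens, whereas the first grows with its square.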

3.3. Feature Fusion

Simply concatenating or applying weighted summation to the features extracted from the two branches may be insufficient to fully exploit their complementary information. To address this limitation, an ECA module is incorporated to dynamically assign channel-wise weights, enabling the model to autonomously learn the relative importance of different channels. In the proposed approach, the features obtained from the Swin Transformer and Improved ResNet module branches are integrated using the ECA mechanism. The fused features are subsequently processed through fully connected (FC) layers and a Softmax classifier to complete the final classification. This fusion strategy, when applied within a dual-channel architecture, mitigates feature redundancy more effectively than in single-branch models and adaptively adjusts the contributions of different channels. The fault diagnosis workflow of the Swin Transformer–Improved ResNet module is illustrated in Figure 10.

Step 1: Acquire fault data using the data collection system.

Step 2: Apply FFT to the vibration signals and construct the dataset.

Step 3: Perform dual-branch feature extraction using the Swin Transformer and Improved ResNet module.

Step 4: Fuse the extracted features using the ECA module.

Step 5: Use the trained model to diagnose the samples in the test set.
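The channel re-weighting performed in Step 4 can be illustrated with a dependency-free sketch of the ECA operation: global average pooling per channel, a one-dimensional convolution across neighbouring channels, and a sigmoid gate. The kernel-size rule follows the ECA-Net paper (with $\gamma = 2$ and $b = 1$). The list-based feature layout, the function names, and the convolution weights are assumptions made for illustration; in a trained model the weights are learned.

```python
import math


def eca_kernel_size(channels, gamma=2, b=1):
    """Adaptive odd kernel size used by ECA-Net for a given channel count."""
    t = int(abs((math.log2(channels) + b) / gamma))
    return t if t % 2 else t + 1


def eca_reweight(features, conv_weights):
    """Apply ECA to features given as a list of channels (lists of floats)."""
    pad = len(conv_weights) // 2
    # Squeeze: global average pooling gives one descriptor per channel.
    pooled = [sum(ch) / len(ch) for ch in features]
    padded = [0.0] * pad + pooled + [0.0] * pad
    scaled = []
    for i, ch in enumerate(features):
        # Local cross-channel interaction: 1D convolution over neighbours.
        z = sum(w * padded[i + j] for j, w in enumerate(conv_weights))
        gate = 1.0 / (1.0 + math.exp(-z))  # sigmoid channel weight
        scaled.append([gate * x for x in ch])
    return scaled


# Toy concatenated branch features and untrained, hypothetical weights.
fused = [[1.0, 3.0], [0.0, 0.0], [2.0, 2.0], [4.0, 0.0]]
out = eca_reweight(fused, conv_weights=[0.0, 1.0, 0.0])
print(eca_kernel_size(256))  # 5
```

Each channel is scaled by a weight between 0 and 1, so channels with weak pooled responses are attenuated relative to the others without any dimensionality reduction.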

4. Experimental Verification

To evaluate the accuracy and effectiveness of the proposed Swin Transformer–Improved ResNet, a series of experiments were conducted, including model validation, comparative analysis, ablation studies, complexity assessments, and noise robustness evaluations. In addition, to further assess the model’s robustness in practical industrial scenarios, a comparative experiment was performed using the SEU bearing dataset. Generalization capability was also tested through experiments on the CWRU bearing dataset. The model was implemented using PyTorch 2.2.1 under a Python 3.7 environment. The hardware setup included an Intel Core i9-13900K processor and 128 GB of RAM. For the sake of fair comparison and reproducibility of the experimental results, all models were subjected to exactly the same preprocessing steps during both the training and evaluation stages. A dropout rate of 0.5 was applied, randomly deactivating 50% of neurons during training to prevent overfitting. Cross-entropy is adopted as the loss function. The Adam optimizer is used with a learning rate of 0.0003, and the model is trained for 50 epochs.
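For reference, the cross-entropy loss named above can be written out for a single sample. The snippet is a plain-Python sketch of the standard softmax cross-entropy, not the training code; the function name and the logit values are illustrative assumptions.

```python
import math


def cross_entropy(logits, target):
    """Softmax cross-entropy for one sample (target is the true class index)."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target]


# Five bearing classes: an uninformed prediction versus a confident correct one.
print(cross_entropy([0.0] * 5, target=2))  # ln(5), about 1.609
print(cross_entropy([0.0, 0.0, 6.0, 0.0, 0.0], target=2))
```

The loss equals $\ln 5$ when all five classes receive equal scores and approaches zero as the score of the correct class dominates.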

4.1. Dataset Description

4.1.1. SEU Bearing Dataset

The Southeast University (SEU) dataset [32] consists of two sub-datasets: a bearing dataset and a gear dataset. Its fault data was collected from a Drivetrain Dynamic Simulator. Two different operating conditions were set during data acquisition, namely a speed–system load of 20 Hz–0 V and 30 Hz–2 V, with a sampling frequency of 5120 Hz. The SEU dataset contains 10 fault conditions in total, including 5 rolling bearing fault conditions, specifically the normal condition, rolling element fault, inner race fault, outer race fault, and compound fault. The specific fault types of the bearings are shown in Table 2.

A dedicated fault simulation test rig is adopted for the acquisition of the target bearing vibration dataset, which is composed of a drive motor, motor controller, planetary gearbox, transmission gearbox, load regulation unit, and brake controller, with its specific structural composition depicted in Figure 11.

4.1.2. CWRU Bearing Dataset

The dataset used for cross-domain generalization validation experiments is the standard rolling bearing fault dataset sourced from the Bearing Data Center of Case Western Reserve University (CWRU) [33]. The experimental configuration is depicted in Figure 12, which mainly includes a horsepower motor, sensors, dynamometers, and control electronics. The tested bearing model is the SKF6205 motor bearing.

To conduct the generalization experiment and evaluate the model’s generalization capability, vibration signals obtained from the drive end of the CWRU benchmark dataset were used, with a sampling frequency of 12 kHz. The dataset includes four fault diameter categories, each corresponding to a distinct fault type. All the faults in question were recorded under four different levels of motor load, specifically 0 hp, 1 hp, 2 hp, and 3 hp. The data is categorized into 10 classes to distinguish both the fault location and diameter, as detailed in Table 3. Based on the load conditions, the dataset is divided into four subsets: A, B, C, and D.

4.2. Model Validation Experiment

In the validation experiment, the SEU bearing dataset was utilized. To reduce randomness, each experiment was repeated five times, and the averaged results are presented in Figure 13. During each training epoch, input data is processed concurrently through the dual-branch architecture for feature extraction, followed by classification using fully connected layers. The model exhibits stable behavior throughout the training process. Both the training accuracy and the cross-entropy loss begin to converge at around the 30th epoch. The final average classification accuracy reaches 99.41%, while the cross-entropy loss decreases to below 0.02.

To further analyze the classification performance of the model across different fault categories, the confusion matrix is presented in Figure 14. The vertical axis represents the true labels of the samples, and the horizontal axis corresponds to the labels predicted by the model. Most values are concentrated on the main diagonal, which indicates that the large majority of samples in each category are classified correctly and that the proposed model has strong fault identification capability and practical application value.

Additionally, t-distributed stochastic neighbor embedding (t-SNE) [34] is employed for dimensionality reduction and data visualization, as shown in Figure 15. The t-SNE method is applied to both the original features and those extracted by the proposed model, allowing a comparison of the feature distributions and of the model’s capacity to discriminate between the various bearing fault types. As illustrated, nearly all faulty samples are accurately identified, demonstrating the effectiveness of the proposed model in fault classification tasks.

4.3. Performance Evaluation Metrics

The feasibility and practicality of the model are assessed with accuracy, precision, recall, and F1-score, and specificity is additionally defined. Together, these metrics provide a quantitative assessment of the proposed method’s overall effectiveness.

Accuracy is a fundamental evaluation metric for classification models, representing the proportion of correctly classified instances relative to the total number of instances in the dataset. It provides an overall measure of the model’s predictive performance. It is defined as

$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$

(10)
Precision quantifies the proportion of instances predicted as positive that are truly positive, reflecting the reliability of the model’s positive predictions. It is defined as

$\mathrm{Precision} = \frac{TP}{TP + FP}$

(11)
Recall measures the proportion of actual positive instances that are correctly identified by the model, serving as an evaluation indicator of the model’s capacity to capture the positive class samples. It is defined as

$\mathrm{Recall} = \frac{TP}{TP + FN}$

(12)
F1-score represents the harmonic mean of precision and recall, delivering a balanced assessment that accounts for both false positives and false negatives. This metric is highly applicable to imbalanced data distribution scenarios. It is defined as

$\mathrm{F1\text{-}Score} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$

(13)
Specificity is a crucial indicator for evaluating the performance of fault diagnosis models, which measures the model’s ability to accurately identify negative samples. This metric can effectively reflect the model’s capacity to restrain misjudgments and reduce false positives. It is defined as

$\mathrm{Specificity} = \frac{TN}{TN + FP}$

(14)

Among these, True Positive (TP) is defined as the correct recognition of an instance that genuinely falls into the positive class. False Positive (FP) occurs when a sample that is actually negative is incorrectly categorized as positive. True Negative (TN) refers to the accurate classification of a sample that is truly a negative instance. False Negative (FN) occurs when a positive instance is incorrectly classified as negative.
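For a multi-class task, these quantities are usually obtained per class in a one-vs-rest manner from the confusion matrix. The sketch below assumes that convention, with rows as true labels and columns as predicted labels; the function names and the counts are made up for illustration, and every class is assumed to be predicted at least once so that no denominator is zero.

```python
def per_class_counts(cm, k):
    """TP, FP, FN, TN for class k; rows of cm are true labels, columns predicted."""
    total = sum(sum(row) for row in cm)
    tp = cm[k][k]
    fn = sum(cm[k]) - tp
    fp = sum(row[k] for row in cm) - tp
    tn = total - tp - fn - fp
    return tp, fp, fn, tn


def class_metrics(cm, k):
    """Precision, recall, F1-score, and specificity for class k."""
    tp, fp, fn, tn = per_class_counts(cm, k)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    specificity = tn / (tn + fp)
    return precision, recall, f1, specificity


def overall_accuracy(cm):
    """Fraction of all samples that lie on the main diagonal."""
    return sum(cm[i][i] for i in range(len(cm))) / sum(sum(r) for r in cm)


# Toy 3-class confusion matrix (made-up counts).
cm = [[48, 2, 0],
      [1, 47, 2],
      [0, 3, 47]]
print(overall_accuracy(cm))
print(class_metrics(cm, 0))
```

Averaging the per-class values over all classes gives macro-averaged precision, recall, and F1-score.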

4.4. Model Ablation Experiment

Ablation experiments were conducted to investigate the factors influencing model performance and to determine the optimal configuration by evaluating various architectural settings. To assess the contribution of each key component of the proposed method, a series of ablation studies was performed on the Swin Transformer–Improved ResNet feature extraction framework and the ECA module. By progressively removing individual modules and comparing the diagnostic performance of each configuration, it was observed that using only the Swin Transformer or only the Improved ResNet module led to relatively inferior results. Integrating the ECA module for feature fusion further improved performance compared to configurations without it. These findings indicate that every module incorporated in the model contributes to its performance. To reduce randomness, each experiment was repeated five times, and the averaged results are presented in Table 4.

Experimental results show that the Improved ResNet module and the Swin Transformer module, when used alone, reach accuracies of only 94.05% and 94.51%, respectively. When the two modules are fused in parallel, the accuracy improves to 95.82%. Further integration with the ECA module increases the diagnostic accuracy to 96.58%.

On this basis, a signal preprocessing module is introduced to further enhance the feature quality. After adding FFT or VMD, the accuracy rises to 97.87% and 97.35%, respectively, indicating that both frequency-domain spectral features and modal components make positive contributions to the enhancement of diagnostic information.

When the combined preprocessing strategy of FFT and VMD is adopted, the model achieves the best performance with an accuracy of 99.41%. Meanwhile, the precision, recall and F1-Score also reach the highest values, which verifies the effectiveness and robustness of the proposed method in identifying complex faults.

4.5. Model Comparison Experiment

To further demonstrate the reliable performance of the proposed method, a comparative analysis of diagnostic accuracy across multiple models was conducted using the SEU bearing dataset. To ensure fairness and comparability, all models were trained and evaluated under identical preprocessing procedures, and their hyperparameters were tuned in a consistent manner. Several mainstream deep learning models were selected as baselines: LiConvFormer [35], Autoformer [36], CLFormer [25], ResNet18 [21], CNN-LSTM [37], TCN [38], and CNN [39]. To reduce randomness, each experiment was repeated five times. A spider chart illustrating the performance metrics of the eight models (the proposed model and the seven baselines) on the SEU bearing dataset is presented in Figure 16.

The comparative results are summarized in Table 5. Under identical preprocessing conditions, experiments with the selected baseline and Transformer-based models clearly demonstrate that the proposed method outperforms the others across multiple evaluation metrics, including accuracy, precision, recall, and F1-score. The experimental findings confirm that the window-based self-attention mechanism of the Swin Transformer effectively captures local fault features and facilitates the integration of multi-scale feature representations. The Improved ResNet module enhances model generalization through the computational efficiency of depthwise separable convolutions. Furthermore, the ECA module adaptively fuses multi-branch features by assigning dynamic weights to feature channels and reducing redundant information, thereby improving the discriminative and expressive capabilities of the extracted features. In summary, under uniform conditions, the proposed method exhibits superior performance compared to alternative models, underscoring its strong potential for practical fault diagnosis applications.
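The efficiency argument for depthwise separable convolutions can be checked with a simple weight count. The sketch assumes one-dimensional convolutions without bias terms, and the layer sizes are hypothetical because the channel widths of the Improved ResNet module are not listed here.

```python
def standard_conv1d_params(c_in, c_out, kernel_size):
    """Weights in a standard 1D convolution (bias ignored)."""
    return c_in * c_out * kernel_size


def separable_conv1d_params(c_in, c_out, kernel_size):
    """Weights in a depthwise (DWC) plus pointwise (PWC) pair (bias ignored)."""
    depthwise = c_in * kernel_size  # one filter per input channel
    pointwise = c_in * c_out        # 1 x 1 mixing across channels
    return depthwise + pointwise


# Hypothetical layer sizes, chosen only for illustration.
print(standard_conv1d_params(64, 128, 3))   # 24576
print(separable_conv1d_params(64, 128, 3))  # 8384
```

For a kernel of size $k$, the ratio of the two counts is $1/C_{out} + 1/k$, so the saving grows with wider kernels and more output channels.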

4.6. Simulated Noise Experiment

The signal-to-noise ratio (SNR) [40] is widely adopted to measure the noise level contained in vibration signals, which is formulated as follows:

$\mathrm{SNR} = 10 \lg \left( \frac{P_{signal}}{P_{noise}} \right)$

(15)

where $P_{signal}$ represents the average power of the original signal and $P_{noise}$ denotes the average power of the noise component.

To simulate the noise interference encountered during actual bearing operation and to evaluate the diagnostic reliability of the proposed model under high-noise conditions, white Gaussian noise was artificially added to the test set. This setup mimics the distributional discrepancy between training and testing data typically observed in real-world industrial environments. For the SEU bearing dataset, noise was added at SNR levels ranging from −4 dB to 2 dB. Comparative experiments were conducted against three residual network-based models (IDRSN [23], DRSN [22], and DRN [21]) as well as a conventional CNN model. To reduce randomness, each experiment was repeated five times.
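Adding white Gaussian noise at a prescribed SNR amounts to scaling the noise so that the power ratio in the SNR definition takes the desired value. The following sketch shows one way to do this; the function names, the fixed seed, and the synthetic 50 Hz tone are assumptions, and only the 5120 Hz sampling frequency is taken from the SEU dataset description.

```python
import math
import random


def mean_power(x):
    """Average power of a sequence of samples."""
    return sum(v * v for v in x) / len(x)


def add_white_noise(signal, snr_db, seed=0):
    """Return signal plus Gaussian white noise scaled to the requested SNR (dB)."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    target_noise_power = mean_power(signal) / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / mean_power(noise))
    return [s + scale * n for s, n in zip(signal, noise)]


def measured_snr_db(clean, noisy):
    """SNR of a noisy signal relative to its clean version, in dB."""
    noise = [n - c for c, n in zip(clean, noisy)]
    return 10 * math.log10(mean_power(clean) / mean_power(noise))


# Synthetic 50 Hz tone sampled at 5120 Hz, standing in for a vibration record.
clean = [math.sin(2 * math.pi * 50 * t / 5120) for t in range(2048)]
noisy = add_white_noise(clean, snr_db=-4)
print(round(measured_snr_db(clean, noisy), 6))  # -4.0
```

Here the noise is rescaled by its sample power, so the requested SNR is met for each realization rather than only on average.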

Table 6 demonstrates that our method maintains the highest average fault identification accuracy across both noise-free settings and noisy environments, where the SNR varies from −4 dB to 2 dB. In particular, under the harsh high-noise condition of −4 dB, the presented model reaches an average accuracy of 91.92%, which outperforms comparative approaches including IDRSN, DRSN, DRN and CNN, with their respective accuracy values of 88.33%, 84.67%, 80.67% and 68.67%. When exposed to moderate noise at 2 dB SNR, our method realizes an average diagnostic accuracy of 98.82%, leading the optimal baseline IDRSN by a margin of 1.49%. Only at the SNR of −1 dB does the recognition performance of IDRSN closely approximate that of the developed model. The overall superiority and robustness of the proposed method under different noise intensities are further visualized in Figure 17.

To further evaluate diagnostic performance under significant noise interference, the recognition stability of different models was compared using box plots. In these plots, model A represents the proposed method; B, C, D, and E correspond to the IDRSN, DRSN, DRN, and CNN models, respectively. Experiments were conducted under noisy conditions with SNRs of −4 dB and −2 dB. As depicted in Figure 18, the proposed model achieved superior recognition performance with low variance, even in high-noise scenarios. As evidenced by the above results, the presented strategy yields remarkable gains in both detection precision and robustness, which is critical for bearing fault diagnosis amid intense noise disturbance.

4.7. Model Generalization Performance

The ability of the diagnostic model to identify bearing damage under diverse experimental conditions is an essential index for quantitative performance evaluation. To assess this capability, a generalization performance experiment was designed. Variations in load can modify the vibration signal characteristics of bearings. Validating the model under different load conditions enables an assessment of its adaptability to signal feature changes across various speeds. This process helps verify the model’s stability and adaptability in diverse operating conditions and ensures that false alarms or misdiagnoses do not occur due to speed fluctuations.

Fault data under four distinct load conditions (A, B, C, and D) was employed to construct the training and testing sets. For example, A → B indicates that dataset A was utilized for model training and dataset B was utilized for testing. Similarly, other cross-load scenarios follow the same configuration. Five validation experiments were conducted, and the average value was computed as the final diagnostic result. As shown in Figure 19, under varying load conditions, the proposed model achieved a mean fault diagnosis accuracy exceeding 98% on the CWRU dataset. Although a slight decrease in accuracy was observed with increasing load variation, the declining trend remained modest. These results demonstrate the model’s capability to extensively extract two-dimensional time–frequency features and exhibit excellent generalization and diagnostic performance.
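The cross-load protocol can be enumerated programmatically. The sketch below lists every ordered pair of distinct load subsets, using the load assignment of Table 3; whether all twelve pairs or only a subset are evaluated in Figure 19 is not assumed here.

```python
from itertools import permutations

# Load subsets of the CWRU data: A = 0 hp, B = 1 hp, C = 2 hp, D = 3 hp.
loads = {"A": 0, "B": 1, "C": 2, "D": 3}

# Every ordered (train, test) pair with different loads, e.g. A -> B.
tasks = [(src, dst) for src, dst in permutations(loads, 2)]
for src, dst in tasks:
    print(f"{src} -> {dst}: train at {loads[src]} hp, test at {loads[dst]} hp")
```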

5. Conclusions

To enhance the diagnostic performance of fault diagnosis models under complex operational conditions, a novel approach based on a dual-branch Swin Transformer–Improved ResNet module integrated with an ECA module is proposed. The main contributions are summarized as follows:

The Swin Transformer branch employs a window-based self-attention mechanism to capture local features from fault signals, enabling effective integration of multi-level feature representations, while the Improved ResNet module branch utilizes depthwise separable convolutions to reduce computational complexity and improve generalization capability. These two models are integrated via a dual-branch parallel structure for feature extraction to enhance overall model robustness.

The incorporated ECA module adaptively recalibrates feature channels by assigning differential weights and minimizing information redundancy, after which the fused features are processed through adaptive pooling and FC layers for fault classification. Embedding the ECA module after feature fusion highlights and reinforces important features while suppressing redundant ones, a design that enables dynamic feature enhancement and noise suppression in bearing fault diagnosis.

Simulation results demonstrate that the proposed method surpasses several state-of-the-art approaches across multiple metrics. Noise interference experiments further confirm its enhanced robustness in complex scenarios compared to alternative methods, ablation analyses confirm that the proposed structural design contributes critically to the improvement of the model’s overall diagnostic performance, and generalization tests underscore the model’s adaptability, making it highly suitable for bearing fault detection across diverse operational environments.

Several directions remain for future work. First, noise robustness will be improved by incorporating a broader spectrum of diverse and challenging noise types. Second, bearing fault signals will be gathered from a wider array of industrial production scenarios, so that model robustness and interference resistance can be enhanced using real-world datasets. In addition, inspired by Li et al. [40], who developed a continual learning model (UACLF) for fault diagnosis of rotating machinery in dynamic environments, future work will explore the integration of continual learning with the proposed model to achieve superior fault diagnosis performance in such dynamic settings.

Author Contributions

Methodology, Q.L. and M.C.; software, M.C. and H.L.; formal analysis, Q.L. and M.C.; data curation, M.C. and H.L.; writing—review and editing, M.C. and H.L.; supervision, Q.L. and M.T.; funding acquisition, Q.L. and M.T. All authors have read and agreed to the published version of the manuscript.

Funding

Natural Science Foundation of Guangdong Province (2025A1515012901); National Training Program of Innovation and Entrepreneurship for Undergraduates (No. 202510566004, No. 202210566015); Undergraduate Innovation Team Project of Guangdong Ocean University (CXTD2023008); Postgraduate Education Innovation Project of Guangdong Ocean University (202452).

Data Availability Statement

Data is contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

FFT	Fast Fourier Transform
EMD	Empirical Mode Decomposition
WT	Wavelet Transform
SVMs	Support Vector Machines
MLP	Multilayer Perceptron
HMM	Hidden Markov Model
CNNs	Convolutional Neural Networks
DRN	Deep Residual Network
DRSN	Deep Residual Shrinkage Network
IDRSN	Improved Deep Residual Shrinkage Network
RNNs	Recurrent Neural Networks
LSTMs	Long Short-Term Memory Networks
DBNs	Deep Belief Networks
FFT-Transformer	Fast Fourier Transform-Transformer
LN	Layer Normalization
MSA	Multi-head Self-Attention
ResNet	Residual Network
DWC	Depthwise Convolution
PWC	Pointwise Convolution
ORBB	Optimized Residual Building Block
BN	Batch Normalization
FC	Fully Connected
t-SNE	t-Distributed Stochastic Neighbor Embedding
TP	True Positive
FP	False Positive
TN	True Negative
FN	False Negative
SNR	Signal-to-Noise Ratio
ECA	Efficient Channel Attention

References

Jan, Z.; Ahamed, F.; Mayer, W.; Patel, N.; Grossmann, G.; Stumptner, M.; Kuusk, A. Artificial intelligence for industry 4.0: Systematic review of applications, challenges, and opportunities. Expert Syst. Appl. 2023, 216, 119456.
Javaid, M.; Haleem, A.; Singh, R.P.; Suman, R. Artificial Intelligence Applications for Industry 4.0: A Literature-Based Study. J. Ind. Integr. Manag. 2022, 7, 83–111.
Wang, S.; Feng, Z. Multi-sensor fusion rolling bearing intelligent fault diagnosis based on VMD and ultra-lightweight GoogLeNet in industrial environments. Digit. Signal Process. 2024, 145, 104306.
Gai, J.; Shen, J.; Hu, Y.; Wang, H. An integrated method based on hybrid grey wolf optimizer improved variational mode decomposition and deep neural network for fault diagnosis of rolling bearing. Measurement 2020, 162, 107901.
Strömbergsson, D.; Marklund, P.; Berglund, K.; Larsson, P.E. Bearing monitoring in the wind turbine drivetrain: A comparative study of the FFT and wavelet transforms. Wind Energy 2020, 23, 1381–1393.
Bodile, R.; Rao, T.V.K.H. Adaptive Filtering of Electrocardiogram Signal Using Hybrid Empirical Mode Decomposition-Jaya Algorithm. J. Circuits Syst. Comput. 2021, 30, 2150209.
Ylmaz, A.; Bayrak, G. A new signal processing-based islanding detection method using pyramidal algorithm with undecimated wavelet transform for distributed generators of hydrogen energy. Int. J. Hydrogen Energy 2022, 47, 19821–19836.
Xiong, B.; Meng, X.; Xiong, G.; Ma, H. Multi-branch wind power prediction based on optimized variational mode decomposition. Energy Rep. 2022, 8, 11181–11191.
Li, J.; Yao, X.; Wang, X.; Yu, Q.; Zhang, Y. Multiscale local features learning based on BP neural network for rolling bearing intelligent fault diagnosis. Measurement 2019, 153, 107419.
Prasanth, A. Certain Investigations on Energy-Efficient Fault Detection and Recovery Management in Underwater Wireless Sensor Networks. J. Circuits Syst. Comput. 2020, 30, 2150137.
Zhang, X.; Li, C.; Wang, X.; Wu, H. A novel fault diagnosis procedure based on improved symplectic geometry mode decomposition and optimized SVM. Measurement 2020, 173, 108644.
Jiang, K.; Yang, Z.; Jin, T.; Chen, C.; Liu, Z.; Zhang, B. CNN-Based Rolling Bearing Fault Diagnosis Method With Quantifiable Interpretability. IEEE Trans. Instrum. Meas. 2025, 74, 3525912.
Liu, Q.; Dai, Z.; Lai, H.; Chen, M.; Huang, H.; Fu, J.; Hou, M.; Xu, X.; Wang, G.; Yan, J. A noval RUL prediction method for rolling bearing: TcLstmNet-CBAM. Sci. Rep. 2025, 15, 14055.
Wang, S.; Zhang, J. An Intelligent Process Fault Diagnosis System based on Andrews Plot and Convolutional Neural Network. J. Dyn. Monit. Diagn. 2022, 1, 127–138.
Spirto, M.; Melluso, F.; Nicolella, A.; Malfi, P.; Cosenza, C.; Savino, S.; Niola, V. A Comparative Study Between SDP-CNN and Time-Frequency-CNN-Based Approaches for Fault Detection. J. Dyn. Monit. Diagn. 2025, 5, 25–37.
Tang, J.; Wu, J.; Hu, B.; Liu, J. Towards a fault diagnosis method for rolling bearing with Bi-directional deep belief network. Appl. Acoust. 2022, 192, 108727.
Wang, M.; Yu, H.T.; Min, F. Noise label learning through label confidence statistical inference. Knowl.-Based Syst. 2021, 227, 107234.
Lin, H.; Wang, G.; Lv, Y.; Shao, C. Insulated bearing fault diagnosis method based on shape-aware attention and dynamic physical information guidance. Meas. Sci. Technol. 2025, 36, 076125.
Liu, Q.; Lai, H.; Wen, B.; Hou, D.; Liao, J.; Deng, C. A Novel Method of Bearing Fault Diagnosis for Train Bogie Transmission System Based on MpResCNN-BiLSTM Model With Attention Mechanism. IEEE Trans. Intell. Transp. Syst. 2025, 1–12.
Wang, G.; Li, C.; Lv, Y.; Zhong, Z.; Shao, C.; Zhang, H.; Lin, H.; Shi, W. A lightweight intra-class inter-class domain adaptation network approach for diagnosing bearing faults under different operating conditions. Struct. Health Monit. 2025, 14759217251329289.
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: Piscataway, NJ, USA, 2016.
Zhao, M.; Zhong, S.; Fu, X.; Tang, B. Deep Residual Shrinkage Networks for Fault Diagnosis. IEEE Trans. Ind. Inform. 2019, 16, 4681–4690.
Zhu, X.; Zhang, J.; Wang, X.; Wang, H.; Song, Y.; Pei, G.; Gou, X.; Deng, L.; Lin, J. Improved deep residual shrinkage network for a multi-cylinder heavy-duty engine fault detection with single channel surface vibration. Energy AI 2024, 16, 100356.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762.
Fang, H.; Deng, J.; Bai, Y.; Feng, B.; Li, S.; Shao, S.; Chen, D. CLFormer: A Lightweight Transformer Based on Convolutional Embedding and Linear Self-Attention With Strong Robustness for Bearing Fault Diagnosis Under Limited Sample Conditions. IEEE Trans. Instrum. Meas. 2022, 71, 3504608.
Gao, Z.; Wang, Y.; Li, X.; Yao, J. Twins transformer: Rolling bearing fault diagnosis based on cross-attention fusion of time and frequency domain features. Meas. Sci. Technol. 2024, 35, 096113.
Hou, S.; Lian, A.; Chu, Y. Bearing fault diagnosis method using the joint feature extraction of Transformer and ResNet. Meas. Sci. Technol. 2023, 34, 075108.
Tang, J.; Cheng, X.; Sun, J.; Qing, J. A novel method for untrained detection of compound fault in rolling bearing via fast Fourier Transform-Transformer model. Measurement 2025, 253, 117755.
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021.
Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: Piscataway, NJ, USA, 2017.
Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019.
Smith, W.A.; Randall, R.B. Rolling element bearing diagnostics using the Case Western Reserve University data: A benchmark study. Mech. Syst. Signal Process. 2015, 64–65, 100–131.
Linderman, G.C.; Rachh, M.; Hoskins, J.G.; Steinerberger, S.; Kluger, Y. Fast interpolation-based t-SNE for improved visualization of single-cell RNA-seq data. Nat. Methods 2019, 16, 1.
Yan, S.; Shao, H.; Wang, J.; Zheng, X.; Liu, B. LiConvFormer: A lightweight fault diagnosis framework using separable multiscale convolution and broadcast self-attention. Expert Syst. Appl. 2024, 237, 121338.
Wu, H.; Xu, J.; Wang, J.; Long, M. Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting. Adv. Neural Inf. Process. Syst. 2021, 34, 22419–22430.
Kim, T.-Y.; Cho, S.-B. Predicting residential energy consumption using CNN-LSTM neural networks. Energy 2019, 182, 72–81.
Bai, S.; Kolter, J.Z.; Koltun, V. An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling. arXiv 2018, arXiv:1803.01271.
Krizhevsky, A.; Sutskever, I.; Hinton, G. ImageNet Classification with Deep Convolutional Neural Networks; NIPS: Grenada, Spain, 2012.
Ma, J.; Zhao, Z.; Chen, J.; Li, A.; Chi, E. SNR: Sub-Network Routing for Flexible Parameter Sharing in Multi-task Learning. In Proceedings of the AAAI’19: AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; AAAI: Menlo Park, CA, USA, 2019.
Li, J.; Yue, K.; Chen, Z.; Xia, J.; Li, W.; Zhang, X. An Uncertainty-Aware Continual Learning Framework for Fault Diagnosis of Rotating Machinery With Homogeneous-Heterogeneous Faults. IEEE Trans. Autom. Sci. Eng. 2024, 25, 3284–3298.

Figure 1. Swin Transformer network architecture.

Figure 2. Swin Transformer block.

Figure 3. Residual module.

Figure 4. Optimized Residual Building Block.

Figure 5. Efficient Channel Attention network.

Figure 6. The framework diagram of the proposed model.

Figure 7. Vibration signal: (a) raw data; (b) data after FFT.

Figure 8. VMD results: (a) inner race state; (b) outer race state.

Figure 9. Improved ResNet module.

Figure 10. Model diagnosis framework.

Figure 11. Experiment system for SEU dataset.

Figure 12. Experiment system for CWRU dataset.

Figure 13. Model training results.

Figure 14. Confusion matrix.

Figure 15. t-SNE visualization: (a) original signal; (b) proposed model.

Figure 16. Spider chart of SEU bearing dataset.

Figure 17. The recognition accuracy rates of different models under different noise conditions.

Figure 18. Box plots of accuracy under different SNRs: (a) −4 dB; (b) −2 dB.

Figure 19. Classification accuracy in the model generalization experiments.

Table 1. Parameters of the Swin Transformer.

Step	Downsampling Rate	Output Size	Swin Transformer
Stage1	4	56 × 56	concat 4 × 4, 96-d, LN; $\begin{bmatrix} \text{window size: } 7 \times 7 \\ \text{dim } 96,\ \text{heads: } 3 \end{bmatrix} \times 2$
Stage2	8	28 × 28	concat 4 × 4, 192-d, LN; $\begin{bmatrix} \text{window size: } 7 \times 7 \\ \text{dim } 192,\ \text{heads: } 6 \end{bmatrix} \times 2$
Stage3	16	14 × 14	concat 4 × 4, 384-d, LN; $\begin{bmatrix} \text{window size: } 7 \times 7 \\ \text{dim } 384,\ \text{heads: } 12 \end{bmatrix} \times 6$
Stage4	32	7 × 7	concat 4 × 4, 768-d, LN; $\begin{bmatrix} \text{window size: } 7 \times 7 \\ \text{dim } 768,\ \text{heads: } 24 \end{bmatrix} \times 12$
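
The output sizes in Table 1 follow from dividing the input resolution by each stage's downsampling rate. The snippet below is a minimal sketch of that arithmetic. It assumes a 224 × 224 input, which is inferred from 56 × 4 and is not stated in the table; the function name `stage_shapes` is hypothetical.

```python
# Stage settings copied from Table 1: (name, downsampling rate, embedding dim, heads).
STAGES = [
    ("Stage1", 4, 96, 3),
    ("Stage2", 8, 192, 6),
    ("Stage3", 16, 384, 12),
    ("Stage4", 32, 768, 24),
]


def stage_shapes(input_size=224):
    """Return (name, feature-map side, dim, per-head dim) for each stage.

    input_size=224 is an assumption; Table 1 lists only the output sizes.
    """
    shapes = []
    for name, rate, dim, heads in STAGES:
        side = input_size // rate  # spatial side length after downsampling
        shapes.append((name, side, dim, dim // heads))
    return shapes


for name, side, dim, head_dim in stage_shapes():
    print(f"{name}: {side} x {side}, {dim}-d, {head_dim}-d per head")
```

Under these assumptions each 7 × 7 window tiles the feature map exactly at every stage, since 56, 28, 14 and 7 are all multiples of 7.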

Table 2. Different types of bearing faults.

Label	Fault Type	Explanation	Samples
1	Inner	Cracks in the inner ring	1000
2	Ball	Cracks in the ball	1000
3	Health	Normal	1000
4	Outer	Cracks in the outer ring	1000
5	Combination	Cracks in the inner and outer rings	1000

Table 3. Description of bearing fault.

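Fault	Diameter (inch)	Label	Load (hp)/Dataset
Norm	0	0	0/A, 1/B, 2/C, 3/D
Ball	0.07	1	0/A, 1/B, 2/C, 3/D
Ball	0.14	2	0/A, 1/B, 2/C, 3/D
Ball	0.21	3	0/A, 1/B, 2/C, 3/D
Inner Race	0.07	4	0/A, 1/B, 2/C, 3/D
Inner Race	0.14	5	0/A, 1/B, 2/C, 3/D
Inner Race	0.21	6	0/A, 1/B, 2/C, 3/D
Outer Race	0.07	7	0/A, 1/B, 2/C, 3/D
Outer Race	0.14	8	0/A, 1/B, 2/C, 3/D
Outer Race	0.21	9	0/A, 1/B, 2/C, 3/D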

Table 4. Ablation experiment.

Swin Transformer	Improved ResNet	ECA	VMD	FFT	Accuracy (%)	Precision (%)	Recall (%)	F1-Score (%)
√					94.05	93.89	93.92	93.9
	√				94.51	94.31	94.09	94.2
√	√				95.82	95.68	95.79	95.74
√	√	√			96.58	96.25	96.3	96.27
√	√	√	√		97.35	97.01	97.06	97.03
√	√	√		√	97.87	97.61	97.53	97.57
√	√	√	√	√	99.41	99.4	99.4	99.4

Table 5. Performance comparison results.

Experimental Scheme	Accuracy (%)	Precision (%)	Recall (%)	F1-Score (%)	Specificity (%)
Proposed Model	99.41 ± 0	99.4 ± 0	99.4 ± 0	99.4 ± 0	99.85 ± 0
LiConvFormer	98.45 ± 0.24	98.24 ± 0.24	98.35 ± 0.24	98.29 ± 0.24	98.18 ± 0.24
Autoformer	97.67 ± 0.36	97.21 ± 0.36	97.12 ± 0.36	97.16 ± 0.36	98.09 ± 0.36
CLFormer	96.73 ± 0.52	96.26 ± 0.52	96.13 ± 0.52	96.19 ± 0.52	97.12 ± 0.52
ResNet18	95.44 ± 0.86	95.24 ± 0.86	95.02 ± 0.86	95.13 ± 0.86	95.3 ± 0.86
CNN-LSTM	92.53 ± 1.05	92.5 ± 1.05	92.12 ± 1.05	92.31 ± 1.05	92.38 ± 1.05
TCN	90.82 ± 1.29	90.78 ± 1.29	90.41 ± 1.29	90.59 ± 1.29	90.65 ± 1.29
CNN	89.42 ± 1.57	89.21 ± 1.57	88.86 ± 1.57	89.03 ± 1.57	89.15 ± 1.57
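
The five metrics reported in Table 5 can all be derived from a multi-class confusion matrix. The sketch below shows one common way to do so. It assumes macro averaging over classes, which the table does not specify, and the function name `macro_metrics` and the toy 3-class matrix are hypothetical and not taken from the experiments.

```python
def macro_metrics(cm):
    """Compute accuracy and macro-averaged precision, recall, F1 and specificity.

    cm[i][j] is the number of samples of true class i predicted as class j.
    Macro averaging is an assumption, not a detail confirmed by the paper.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    precision, recall, f1, specificity = [], [], [], []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                        # class k missed
        fp = sum(cm[i][k] for i in range(n)) - tp   # others labelled as k
        tn = total - tp - fn - fp
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        precision.append(p)
        recall.append(r)
        f1.append(2 * p * r / (p + r) if p + r else 0.0)
        specificity.append(tn / (tn + fp) if tn + fp else 0.0)

    def mean(values):
        return sum(values) / len(values)

    return {
        "accuracy": sum(cm[k][k] for k in range(n)) / total,
        "precision": mean(precision),
        "recall": mean(recall),
        "f1": mean(f1),
        "specificity": mean(specificity),
    }


# Toy confusion matrix for illustration only.
toy_cm = [[9, 1, 0], [0, 10, 0], [1, 0, 9]]
for metric, value in macro_metrics(toy_cm).items():
    print(f"{metric}: {100 * value:.2f}%")
```

Specificity is computed one-vs-rest per class, which is why it can exceed accuracy when the classes are balanced and errors are few.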

Table 6. Recognition accuracy rates of different models under different noise backgrounds.

SNR (dB)	Proposed Model (%)	IDRSN (%)	DRSN (%)	DRN (%)	CNN (%)
−4	91.92	88.33	84.67	80.67	68.67
−3	92.33	90.33	86.64	81.33	72.52
−2	94.32	93.03	90.83	87.62	76.67
−1	95.47	95.42	92.67	89.43	79.35
0	96.58	96.05	93.33	91.33	81.67
1	97.87	96.87	94.43	92.33	84.36
2	98.82	97.93	95.83	93	87.33
normal	99.41 ± 0	98.53	96.43	94.33	89.42
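
The SNR levels in Table 6 correspond to noise whose power is scaled relative to the signal power, SNR (dB) = 10 log10(P_signal / P_noise). The sketch below shows how a signal could be corrupted at a target SNR. It assumes additive white Gaussian noise, which is a common choice but is not confirmed here as the noise model used in the experiments; the function names, the 50 Hz test tone and the 12,000-sample length are hypothetical.

```python
import math
import random


def add_noise(signal, snr_db, seed=0):
    """Add white Gaussian noise so the result has roughly the target SNR (dB).

    Gaussian noise is an assumption made for illustration.
    """
    rng = random.Random(seed)
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)  # standard deviation of zero-mean noise
    return [x + rng.gauss(0.0, sigma) for x in signal]


def measured_snr_db(clean, noisy):
    """Estimate the SNR (dB) of a noisy signal against its clean version."""
    signal_power = sum(x * x for x in clean) / len(clean)
    noise_power = sum((y - x) ** 2 for x, y in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(signal_power / noise_power)


# Hypothetical test tone standing in for a vibration segment.
clean = [math.sin(2 * math.pi * 50 * t / 12000) for t in range(12000)]
noisy = add_noise(clean, snr_db=-4)
print(f"target -4 dB, measured {measured_snr_db(clean, noisy):.2f} dB")
```

At −4 dB the noise power is about 2.5 times the signal power, so the measured value should land close to the target, with small deviations caused by the finite sample length.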

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Liu, Q.; Chen, M.; Tang, M.; Lai, H. A Bearing Fault Diagnosis Method Based on an Attention Mechanism and a Dual-Branch Parallel Network. Appl. Sci. 2026, 16, 4511. https://doi.org/10.3390/app16094511

