1. Introduction
Driven by attributes such as high adaptability, superior integrability, dynamic reconfiguration capability, and substantial data processing power, artificial intelligence has continued to evolve, placing it at the forefront of Industry 4.0 development [1]. In modern industrial production, mechanical equipment constitutes a critical element underpinning the efficient operation of industrial systems. The integration of artificial intelligence, IoT, and related technologies has markedly improved the refinement, systematization, intelligence, and automation of mechanical systems [2]. Simultaneously, the deployment of multiple sensors in industrial scenarios has become increasingly prevalent [3]. Bearings, as essential components in rotating machinery, play a vital role in ensuring operational stability. Statistical studies reveal that approximately 40% of motor failures stem from bearing defects [4], underscoring the critical importance of bearing fault diagnosis. Efforts from both academia and industry have consistently focused on exploring effective methods for bearing fault diagnosis. Current research predominantly centers around three major domains: signal processing, machine learning, and deep learning. The Fast Fourier Transform (FFT) [5] is widely recognized as a fundamental approach for frequency-domain analysis, transforming time-domain signals into frequency representations to reveal spectral features. Other commonly adopted techniques include Empirical Mode Decomposition (EMD) [6], Wavelet Transform (WT) [7], and Variational Mode Decomposition (VMD) [8]. However, due to the variability of operating conditions in real industrial settings and the intricate nature of fault mechanisms, single signal processing approaches often fail to extract health-related features accurately. With advancements in IoT sensing technologies, the pervasive use of sensors in industrial equipment has facilitated the rise of data-driven diagnostic methodologies. Machine learning methods have gained prominence for their ability to identify fault characteristics. Methods such as Multilayer Perceptron (MLP) [9], Hidden Markov Model (HMM) [10], and Support Vector Machine (SVM) [11] have demonstrated considerable success in bearing fault identification. Nonetheless, the complex and dynamic operating conditions encountered in practice render manually crafted feature-based methods unable to meet the strict performance demands of contemporary industrial fault diagnosis.
With its well-validated automatic feature extraction capabilities, deep learning has been broadly deployed in numerous research domains, ranging from image recognition and speech recognition to natural language processing, and has thus become a leading research direction in intelligent information processing. Deep learning algorithms, particularly convolutional neural networks (CNNs), recurrent neural networks (RNNs), long short-term memory networks (LSTMs), and deep belief networks (DBNs), have significantly propelled advancements in fault diagnosis. Jiang et al. [12] constructed an interpretable CNN using gradient-weighted class activation mapping, which achieved high diagnostic accuracy while offering transparent interpretability of the prediction process. To capture multi-scale features and enhance prognostic precision, Liu et al. [13] introduced a novel bearing remaining useful life prediction framework, TcLstmNet-CBAM. To improve the feature extraction capability of signals, Wang et al. [14] employed Andrews plots to extract fault features from online process measurements and used a convolutional neural network to further extract diagnostic information from the outputs of the Andrews plots, so as to address the uncertainty in setting the correct dimension of the extracted features. Image-based methods have been widely applied in fault detection algorithms for mechanical systems. These images are derived from vibration signals transformed from the time domain to the time–frequency domain. Spirto et al. [15] compared image-based convolutional neural network methods that take time–frequency transforms and SDP transforms as input images, and found that the latter can significantly reduce computational costs. Tang et al. [16] developed a bidirectional DBN-based diagnostic approach, which improved feature learning efficiency and reduced reliance on the quality of training data. Despite these developments, several limitations persist. CNNs are constrained by local receptive fields and require deep stacking to capture global context, which increases model complexity. Moreover, pooling operations may discard semantically important information. RNNs face challenges such as vanishing or exploding gradients, limited long-term memory retention, lack of parallelizability, and extended training durations. LSTMs are structurally complex, leading to increased risk of overfitting, while DBNs inherently struggle to model intricate pattern variations and multi-modal data relationships. In practical industrial applications, noisy environments are prevalent, making the presence of noise-contaminated labels inevitable [17]. Lin et al. [18] proposed an insulated bearing fault diagnosis method integrating shape-aware kernel attention (SAKA) and dynamic physics guidance (DPG). Liu et al. [19] proposed a method based on attention-enhanced MpResCNN-BiLSTM to address the challenges of bearing fault diagnosis in bogie transmission systems. Wang et al. [20] proposed a lightweight intra–inter-domain adaptive network (LIIDAN). To solve the problem of noise interference, He et al. [21] proposed the deep residual network (DRN), which mitigates the vanishing and exploding gradient problems in deep neural networks through the use of cross-connections and residual modules, maintaining high classification performance even with substantial network depth. Zhao et al. [22] developed the Deep Residual Shrinkage Network (DRSN), which retains the depth advantages of DRN while incorporating an attention mechanism and soft-threshold function for effective noise suppression. Zhu et al. [23] introduced the Improved Deep Residual Shrinkage Network (IDRSN), designed for engines under various fault levels and operating conditions. However, limitations remain. DRNs may suffer from feature redundancy, and their simple residual pathways can constrain the upper limit of representational capacity. DRSNs are relatively difficult to train and debug, exhibit low computational efficiency, and possess limited applicability. IDRSNs adopt a single-channel diagnostic structure, which hinders comprehensive feature learning. Their diagnostic efficiency is lower than that of dual-channel approaches, making deployment challenging in industrial scenarios requiring high accuracy and reliability.
The Transformer framework, a landmark attention-driven neural network structure first presented and formalized by Vaswani et al. [24], replaces the RNN structures traditionally used in natural language processing with a self-attention mechanism, thereby eliminating the need for convolutional or recurrent layers. Notably, self-attention offers advantages in terms of parallel computation and shorter path lengths between distant positions in a sequence. The Transformer has since been widely adopted across a broad range of modern deep learning applications. However, the original Transformer primarily emphasizes global attention while often neglecting local contextual dependencies. Furthermore, its quadratic computational complexity with respect to sequence length limits scalability in long-sequence scenarios. To mitigate these limitations, Fang et al. [25] proposed CLFormer, which integrates convolutional embedding with linear self-attention to enable fault diagnosis under limited-sample conditions. Gao et al. [26] introduced the Twins Transformer, which effectively extracts both temporal and frequency-domain features via a cross-attention mechanism, although the model remains relatively complex in structure. To address challenges in bearing fault diagnosis, including data scarcity, class imbalance, and high levels of noise, Hou et al. [27] developed a hybrid model that combines Transformer and ResNet architectures for joint feature extraction. Nevertheless, traditional architectures that rely on Softmax classification layers encounter difficulties in small-sample contexts, where linear classifiers often fail to capture complex data distributions, leading to overfitting or biased parameter estimation. Tang et al. [28] proposed a composite model based on an FFT-Transformer framework, leveraging multi-head attention to extract fault features from vibration signals, thus enabling the accurate identification and differentiation of concurrent faults. However, substituting self-attention entirely with linear FFT operations typically reduces the model’s expressive capacity, resulting in inferior performance compared to standard Transformer models in tasks requiring complex comprehension. To further enhance feature representation and address more intricate challenges, Liu et al. [29] proposed the Swin Transformer. By introducing a shifted window mechanism, this architecture significantly reduces computational complexity while enabling effective global feature extraction through self-attention. This design enhances the model’s generalization performance across diverse application domains.
In light of the strengths and limitations of existing technologies, this paper proposes a bearing fault diagnosis model based on the Swin Transformer–Improved ResNet module. In the data preprocessing stage, the frequency-domain features and time-domain multi-scale features of fault signals are extracted using FFT and VMD methods, respectively. Subsequently, a dual-branch parallel feature extraction network is constructed by integrating the Swin Transformer and Improved ResNet module. Finally, an ECA module is introduced to adaptively adjust feature channels and assign weights during the fusion process. The core academic contributions of this work to the field of intelligent fault diagnosis are outlined in the following:
The integration of Swin Transformer and Improved ResNet module combines the local feature extraction capability of ResNet with the global contextual modeling capacity of the self-attention mechanism in Swin Transformer. By fusing features extracted from different stages and both branches, the proposed model enhances the richness of feature representation and ensures comprehensive feature learning, thereby improving detection accuracy.
The Improved ResNet module incorporates the low computational complexity of depthwise separable convolution and the non-linear enhancement capability of pointwise convolution. By replacing the two standard convolution layers in traditional deep residual networks, the Improved ResNet module improves the model’s robustness against noise.
By employing an efficient adaptive fusion strategy based on the ECA module, features from both branches are reweighted and fused, which reduces the computational complexity of the dual-branch model while further enhancing its representational power and generalization capability.
The structure of this paper is as follows: Section 2 reviews related knowledge, Section 3 describes the proposed model, Section 4 presents the experiments, and Section 5 concludes the paper.
2. Related Knowledge
2.1. Swin Transformer
The Swin Transformer (Shifted Window Transformer) is a novel Transformer architecture that has demonstrated competitive performance across a wide range of computer vision applications. By integrating localized window partitioning with a shifted window mechanism, it effectively captures global contextual dependencies while significantly reducing computational complexity, thereby improving overall processing efficiency. The model is composed of three primary components: Patch Partition, Swin Transformer block, and Patch Merging. In this study, the Swin-T (Tiny) variant is employed, with its architectural details illustrated in
Figure 1.
Following segmentation and linear embedding, the signals are fed into the Swin Transformer block, as depicted in Figure 2. The block comprises W-MSA, SW-MSA, MLP, and LN components. Specifically, W-MSA refers to the window-based multi-head self-attention module, SW-MSA denotes the shifted window-based multi-head self-attention module, MLP denotes the multi-layer perceptron, and LN indicates layer normalization. W-MSA reduces the overall computational complexity by restricting attention to local windows, while SW-MSA restores information exchange between neighboring windows. The W-MSA and SW-MSA modules are arranged in an alternating fashion, with layer normalization applied before each attention module and each MLP, and residual connections introduced after each module to facilitate convergence and alleviate gradient vanishing.
To enable efficient modeling, the Swin Transformer computes self-attention within local windows rather than employing global self-attention. By restricting attention operations to non-overlapping local windows, this approach substantially reduces computational overhead while preserving modeling capability. The hierarchical architecture, combined with a shifted window mechanism, facilitates cross-window information exchange, resulting in linear computational complexity with respect to input size. For a feature map containing $N = hw$ patch tokens, the computational complexities of the two attention mechanisms are as follows:

$$\Omega(\mathrm{MSA}) = 4hwC^{2} + 2(hw)^{2}C, \qquad \Omega(\text{W-MSA}) = 4hwC^{2} + 2M^{2}hwC \tag{1}$$

In this context, $\Omega(\mathrm{MSA})$ denotes the computational complexity of global self-attention, while $\Omega(\text{W-MSA})$ refers to that of window-based self-attention. Here, $C$ represents the embedding dimension, $M$ denotes the window size, and $h$ and $w$ correspond to the height and width of the feature map in patches. When $M$ is fixed, the complexity of global self-attention is quadratic in $hw$, whereas that of window-based attention grows linearly with $hw$. This reduction in complexity contributes to improved computational efficiency and faster training.
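As a quick numerical check of the quadratic-versus-linear behaviour described here, the following sketch evaluates the two cost expressions from the original Swin Transformer formulation, 4hwC² + 2(hw)²C for global attention and 4hwC² + 2M²hwC for window attention. The function names and the values C = 96 and M = 7 are illustrative assumptions, not settings reported in this paper.

```python
def msa_cost(h: int, w: int, C: int) -> int:
    """Global self-attention cost: 4hwC^2 + 2(hw)^2 C."""
    n = h * w
    return 4 * n * C**2 + 2 * n**2 * C


def wmsa_cost(h: int, w: int, C: int, M: int) -> int:
    """Window attention cost with M x M windows: 4hwC^2 + 2 M^2 hw C."""
    n = h * w
    return 4 * n * C**2 + 2 * M**2 * n * C


if __name__ == "__main__":
    C, M = 96, 7  # illustrative embedding dimension and window size
    for side in (56, 112, 224):
        g = msa_cost(side, side, C)
        l = wmsa_cost(side, side, C, M)
        print(f"{side}x{side}: MSA={g:.3e}  W-MSA={l:.3e}  ratio={g / l:.1f}")
```

Doubling the number of tokens exactly doubles the window-attention cost, whereas the second term of the global cost quadruples.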
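The formula for calculating attention is as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{SoftMax}\left(\frac{QK^{T}}{\sqrt{d}} + B\right)V \tag{2}$$

where $Q$, $K$, and $V$, with $Q, K, V \in \mathbb{R}^{M^{2} \times d}$, correspond to the query, key, and value matrices, respectively; $d$ refers to the dimensionality of the query or key matrix, with $B$ set as the bias matrix for attention calculation.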
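A minimal NumPy sketch of this attention computation for a single window and a single head is given here for illustration. The function name, the window size, the head dimension, and the all-zero bias are assumptions made for the example; in a trained model the bias matrix is learned.

```python
import numpy as np


def window_attention(q, k, v, bias):
    """Single-head attention for one window: SoftMax(QK^T / sqrt(d) + B) V."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d) + bias                   # (tokens, tokens)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)         # row-wise softmax
    return weights @ v, weights


rng = np.random.default_rng(0)
M, d = 4, 8                        # illustrative window size and head dimension
tokens = M * M                     # number of patch tokens in one window
q, k, v = (rng.standard_normal((tokens, d)) for _ in range(3))
bias = np.zeros((tokens, tokens))  # placeholder for the learned bias matrix B
out, weights = window_attention(q, k, v, bias)
```

Each row of `weights` sums to one, so every output token is a convex combination of the value vectors inside the same window.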
When the W-MSA module is used alone, self-attention operations are confined within individual windows, which blocks direct information exchange across different windows. To mitigate this inherent limitation, an SW-MSA module is placed after each W-MSA layer. The shifted window mechanism redefines window boundaries through cyclic shifting, enabling information flow across adjacent windows while preserving computational efficiency. This design facilitates global context modeling, enabling the network to simultaneously capture both local features and long-range dependencies, ultimately improving overall performance.
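The effect of the cyclic shift can be illustrated with a small NumPy sketch. Every position of an 8 × 8 map is labelled with the index of the regular window it belongs to, and the map is then re-partitioned after a shift of half a window. The helper names and sizes are assumptions for illustration only.

```python
import numpy as np


def window_partition(x, M):
    """Split an (H, W) map into non-overlapping M x M windows -> (num_windows, M, M)."""
    H, W = x.shape
    return x.reshape(H // M, M, W // M, M).transpose(0, 2, 1, 3).reshape(-1, M, M)


def cyclic_shift(x, shift):
    """Roll the map up and to the left by `shift` positions, wrapping at the borders."""
    return np.roll(x, shift=(-shift, -shift), axis=(0, 1))


M = 4
H = W = 8
# Label every position with the index of the regular window it belongs to.
rows, cols = np.indices((H, W))
window_id = (rows // M) * (W // M) + (cols // M)

regular = window_partition(window_id, M)
shifted = window_partition(cyclic_shift(window_id, M // 2), M)

# Number of distinct original windows represented inside each window.
ids_per_regular = [len(np.unique(win)) for win in regular]
ids_per_shifted = [len(np.unique(win)) for win in shifted]
```

Each regular window holds tokens from a single window, while each shifted window mixes tokens from four. Practical implementations additionally mask attention between tokens that are only adjacent because of the wrap-around, which this sketch omits.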
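The Swin Transformer block consists of two consecutive stages. In the first stage, as shown in Equation (3), the input feature $z^{l-1}$ undergoes LN and W-MSA computation. The output is then added to $z^{l-1}$ to obtain the intermediate feature $\hat{z}^{l}$. Next, $\hat{z}^{l}$ is passed through an MLP with LN, and the result is added to $\hat{z}^{l}$ to yield $z^{l}$. This process completes the first stage.

$$\hat{z}^{l} = \text{W-MSA}\left(\mathrm{LN}\left(z^{l-1}\right)\right) + z^{l-1}, \qquad z^{l} = \mathrm{MLP}\left(\mathrm{LN}\left(\hat{z}^{l}\right)\right) + \hat{z}^{l} \tag{3}$$

In the second stage, the output $z^{l}$ serves as the input for LN and SW-MSA, followed by a residual connection to produce $\hat{z}^{l+1}$. This result is then processed by another MLP with LN, and the MLP output is summed with $\hat{z}^{l+1}$, through which the final output $z^{l+1}$ of the current layer is obtained.

$$\hat{z}^{l+1} = \text{SW-MSA}\left(\mathrm{LN}\left(z^{l}\right)\right) + z^{l}, \qquad z^{l+1} = \mathrm{MLP}\left(\mathrm{LN}\left(\hat{z}^{l+1}\right)\right) + \hat{z}^{l+1} \tag{4}$$

Here, $z^{l-1}$ denotes the input feature of the Swin Transformer block, $\hat{z}^{l}$ and $z^{l}$ represent the intermediate features and output features of the first stage, respectively, while $\hat{z}^{l+1}$ and $z^{l+1}$ correspond to the intermediate and final output features of the second stage.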
2.2. Improved ResNet Module
The core component of DRN is the residual building block. The structure of a conventional residual module is illustrated in Figure 3. Let $x$ denote the input of the residual module, $F(x)$ the residual mapping function, $H(x) = F(x) + x$ the underlying mapping obtained by adding the identity shortcut, and $W_1$ the weight matrix applied as the input passes through the first convolutional layer within the residual module. This block primarily comprises convolutional layers, BN layers, and ReLU activation functions, with skip connections enabling cross-layer signal propagation. Such a design allows gradients from deeper layers to be effectively propagated to earlier layers during training, thereby mitigating performance degradation in deep neural networks through residual learning.
Chollet [30] proposed the use of depthwise separable convolution to replace standard convolution operations, addressing the challenges of diminished diagnostic performance and increased computational cost associated with deeper networks. The depthwise separable convolution consists of two sequential components: a depthwise convolution (DWC) layer with conventional kernel width, and a pointwise convolution (PWC) layer with a 1 × 1 kernel that performs channel-wise linear combinations. Unlike standard convolution, the DWC layer applies convolutions independently on each channel using a number of kernels equal to the number of input channels. Its mathematical formulation is as follows:

$$S_{i,j,c} = \sum_{w=1}^{W} \sum_{h=1}^{H} V_{w,h,c} \cdot X_{i+w,\, j+h,\, c} \tag{5}$$

In the formula, $S$ represents the output feature; $V$ refers to the convolution kernel with width $W$ and height $H$; $X$ represents the input feature map; $c$ indicates the channel of the feature; $(i, j)$ stands for the spatial coordinates of the output feature in the $c$-th channel; and $(w, h)$ corresponds to the coordinates of the matching weight element in the convolution kernel for the $c$-th channel.
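The per-channel nature of the depthwise step, and the channel mixing done by the pointwise step, can be made concrete with a deliberately simple NumPy sketch. It uses explicit loops, no padding, and unit stride. The function names and array layouts are assumptions chosen for readability rather than the implementation used in this work.

```python
import numpy as np


def depthwise_conv2d(X, V):
    """Valid-mode depthwise convolution (cross-correlation form).

    X: input of shape (rows, cols, channels)
    V: one kernel per channel, shape (k_rows, k_cols, channels)
    """
    rows, cols, C = X.shape
    kr, kc, Cv = V.shape
    assert C == Cv, "depthwise convolution needs one kernel per input channel"
    S = np.zeros((rows - kr + 1, cols - kc + 1, C))
    for c in range(C):                       # channels are never mixed here
        for i in range(S.shape[0]):
            for j in range(S.shape[1]):
                S[i, j, c] = np.sum(V[:, :, c] * X[i:i + kr, j:j + kc, c])
    return S


def pointwise_conv(S, P):
    """1 x 1 convolution: linear combination across channels, P has shape (C_in, C_out)."""
    return S @ P


X = np.ones((5, 5, 2))
V = np.ones((3, 3, 2))
S = depthwise_conv2d(X, V)                       # every entry equals 9
Y = pointwise_conv(S, np.array([[1.0], [2.0]]))  # 1 * 9 + 2 * 9 = 27
```

Changing one input channel leaves the depthwise output of the other channels untouched; only the pointwise step combines information across channels.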
Pointwise convolution operates similarly to standard convolution, with its primary function being the weighted combination of output features along the channel dimension. One-dimensional depthwise separable convolution first employs a depthwise convolution to extract features independently from each channel, followed by a pointwise convolution to integrate the resulting features across channels. Assuming an input feature map with width $L$ and $C_{in}$ channels, a convolution kernel of width $D_K$, and a total of $C_{out}$ kernels, the computational cost $Q_{\mathrm{DSC}}$ of the depthwise separable convolution and the computational cost $Q_{\mathrm{SC}}$ of the standard convolution are as shown in Formula (6).

$$Q_{\mathrm{DSC}} = D_K C_{in} L + C_{in} C_{out} L, \qquad Q_{\mathrm{SC}} = D_K C_{in} C_{out} L \tag{6}$$

According to Formula (6), the ratio of the computational cost of depthwise separable convolution to that of standard convolution is $Q_{\mathrm{DSC}} / Q_{\mathrm{SC}} = 1/C_{out} + 1/D_K$. Since the kernel width typically takes values such as 3, 5, or 7, and the number of kernels is greater than 1, the computational cost of depthwise separable convolution is consistently lower than that of conventional standard convolution.
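The cost comparison can be checked with a few lines of Python. The sketch counts multiplications for a one-dimensional layer, assuming the output keeps the input width. The names `L`, `c_in`, `c_out`, and `k`, as well as the numeric values, are illustrative choices rather than settings taken from this paper.

```python
import math


def standard_conv_cost(L, c_in, c_out, k):
    """Multiplications for a standard 1-D convolution with c_out kernels of width k."""
    return k * c_in * c_out * L


def separable_conv_cost(L, c_in, c_out, k):
    """Multiplications for a depthwise (width k) plus pointwise (width 1) convolution."""
    depthwise = k * c_in * L
    pointwise = c_in * c_out * L
    return depthwise + pointwise


def cost_ratio(c_out, k):
    """Closed-form ratio separable / standard = 1/c_out + 1/k."""
    return 1 / c_out + 1 / k


L, c_in, c_out, k = 1024, 16, 32, 3   # illustrative layer sizes
ratio = separable_conv_cost(L, c_in, c_out, k) / standard_conv_cost(L, c_in, c_out, k)
```

With these example sizes the separable layer needs roughly 36% of the multiplications of the standard layer, and the ratio depends only on the number of kernels and the kernel width.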
As illustrated in
Figure 4, the core innovation of the Optimized Residual Building Block (ORBB) proposed in this study lies in replacing the standard convolutional layers in the residual module with depthwise separable convolutional layers. The DWC layer reduces model complexity by decoupling channel-wise correlations, thereby enhancing the parameter efficiency of the convolution kernels. The subsequent pointwise convolution layer projects multi-channel features into a single channel while preserving salient information and enhancing the non-linear representational capacity of the network. This structure improves the model’s ability to extract subtle features that may be obscured by noise. In addition, a BN layer is incorporated after the depthwise separable convolution to mitigate feature distribution shifts under noisy conditions, thereby further improving the model’s noise robustness and generalization performance.
2.3. Efficient Channel Attention Networks
Wang et al. [31] proposed the Efficient Channel Attention (ECA) module, illustrated in Figure 5, which is a lightweight channel attention mechanism designed to enhance the model’s capability to learn inter-channel dependencies in an efficient and straightforward manner. The core concept of ECA is to capture channel-wise relevance and importance differences without incurring substantial computational overhead. Accordingly, the ECA module is integrated after the feature fusion stage in this study.
Initially, a global average pooling (GAP) operation is applied to the output features. The corresponding computational process is mathematically formulated in the following equation:

$$g_c = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} x_c(i, j) \tag{7}$$

In the formula, $g$ represents the feature vector obtained after the GAP layer; $H$ and $W$ represent the height and width of the feature map, respectively; and $x_c(i, j)$ denotes the feature value at the $i$-th row and $j$-th column of the $c$-th channel. Following global average pooling, the adaptive one-dimensional convolution kernel size $k$ is dynamically determined based on the number of channels $C$. The calculation is as follows:

$$k = \psi(C) = \left| \frac{\log_2 C}{\gamma} + \frac{b}{\gamma} \right|_{\mathrm{odd}} \tag{8}$$

where $\gamma = 2$ and $b = 1$, and $|\cdot|_{\mathrm{odd}}$ denotes the nearest odd number. After obtaining the kernel size $k$, a one-dimensional convolution operation is performed to compute the channel attention weights, as shown below:

$$\omega = \sigma\left(\mathrm{C1D}_k(g)\right) \tag{9}$$

In the formula, $\omega$ represents the attention weight; $\mathrm{C1D}_k$ is a one-dimensional convolution with a kernel size of $k$; and $\sigma$ is the Sigmoid function.
The final weighted features are generated by executing an element-wise channel-wise multiplication between the original input features and the network-learned attention weights, which enhances the ability to capture features of different scales and dynamically strengthens the key sensor channels related to faults while suppressing noise interference signals.
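The three ECA steps (global average pooling, a one-dimensional convolution across channels with an adaptively sized kernel, and Sigmoid reweighting) can be sketched in NumPy as follows. This is an illustrative sketch: the helper names are assumptions, the random kernel stands in for learned weights, and rounding an even kernel size up to the next odd integer is one common convention for the odd-size rule.

```python
import math
import numpy as np


def eca_kernel_size(C, gamma=2, b=1):
    """Adaptive kernel size from the channel count, forced to be odd."""
    t = int(abs((math.log2(C) + b) / gamma))
    return t if t % 2 else t + 1


def eca(x, conv_weights):
    """x: feature map of shape (C, H, W); conv_weights: 1-D kernel of odd length k."""
    C = x.shape[0]
    k = len(conv_weights)
    g = x.mean(axis=(1, 2))                   # global average pooling -> (C,)
    padded = np.pad(g, k // 2)                # zero padding keeps the output length C
    z = np.array([padded[c:c + k] @ conv_weights for c in range(C)])
    w = 1.0 / (1.0 + np.exp(-z))              # Sigmoid -> one weight per channel
    return x * w[:, None, None], w


rng = np.random.default_rng(0)
x = rng.standard_normal((64, 8, 8))           # illustrative fused feature map
kernel = rng.standard_normal(eca_kernel_size(64))  # stand-in for learned weights
y, w = eca(x, kernel)
```

Because the convolution slides over neighbouring channels only, the number of attention parameters equals the kernel size (3 for 64 channels here) rather than growing with the square of the channel count.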
3. Proposed Method
Conventional one-dimensional neural networks are prone to signal distortion caused by high-frequency interference and random noise in complex operational environments. Consequently, these networks often struggle to extract discriminative features from raw signals and fail to achieve the desired prediction accuracy. To overcome this limitation, we propose a dual-branch parallel fusion model-based bearing fault diagnosis method in this study that integrates the Swin Transformer and Improved ResNet module. The overall framework is illustrated in
Figure 6. First, the frequency-domain features and time-domain multi-scale features of fault signals are extracted using FFT and VMD methods, respectively, thereby emphasizing components associated with fault characteristic frequencies. The resulting frequency-domain samples are then fed into the feature extraction module. Finally, the extracted features are processed via fully connected layers and a Softmax classifier to output multiple fault categories, enabling accurate bearing fault diagnosis.
In this paper, the Swin Transformer architecture originally designed for two-dimensional images is adapted to the task of fault diagnosis using one-dimensional vibration signals. Since one-dimensional time-series signals lack a two-dimensional spatial structure, FFT and VMD are employed to construct pseudo-two-dimensional feature maps, enabling compatibility with the input format of the original model.
For the one-dimensional frequency-domain amplitude features output by FFT, dimension regularization is first performed via zero-padding or truncation, so that the total feature length can be decomposed into a two-dimensional size that meets the patch partitioning requirements. Min–max normalization is then applied to map the features to the range [0, 1]. Finally, the one-dimensional vector is reshaped into an H × W two-dimensional grid, forming a single-channel pseudo-image input.
For the multi-component IMF features obtained by VMD, length alignment and normalization are performed component-wise. Each component is then reshaped into an H × W two-dimensional structure and stacked along the channel dimension to form multi-channel pseudo-image patches. After the above dimension adjustment, normalization and reshaping operations, the features can be directly fed into the Patch Partition module of the Swin Transformer, enabling normal computation of window attention and hierarchical downsampling. In this way, the adaptation to one-dimensional vibration signals is achieved without modifying the backbone structure of the model.
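The pad-or-truncate, normalize, and reshape steps described in the two preceding paragraphs can be sketched as follows. The segment length, the 32 × 32 grid, the synthetic test signal, and the sinusoids standing in for VMD modes are all assumptions made for illustration; only the 5120 Hz sampling rate is borrowed from the SEU dataset description.

```python
import numpy as np


def to_pseudo_image(vec, H, W):
    """Pad/truncate a 1-D feature vector to H*W, min-max scale it, reshape to (H, W)."""
    target = H * W
    vec = np.asarray(vec, dtype=float)[:target]      # truncate if too long
    vec = np.pad(vec, (0, target - len(vec)))        # zero-pad if too short
    span = vec.max() - vec.min()
    vec = (vec - vec.min()) / span if span > 0 else np.zeros_like(vec)
    return vec.reshape(H, W)


fs, n = 5120, 2048                    # sampling rate (Hz) and illustrative segment length
t = np.arange(n) / fs
signal = np.sin(2 * np.pi * 120 * t) + 0.5 * np.sin(2 * np.pi * 900 * t)

amplitude = np.abs(np.fft.rfft(signal))[: n // 2]    # one-sided amplitude spectrum, 1024 bins
fft_image = to_pseudo_image(amplitude, 32, 32)[None]  # (1, 32, 32) single-channel input

# Sinusoids used as stand-ins for VMD modes; each is truncated to 1024 samples.
imfs = np.stack([np.sin(2 * np.pi * f * t) for f in (120, 900, 1800)])
imf_image = np.stack([to_pseudo_image(m, 32, 32) for m in imfs])  # (3, 32, 32)
```

Stacking `fft_image` and `imf_image` along the first axis would then give a multi-channel pseudo-image of the kind that a patch-partition layer expects.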
3.1. Signal Preprocessing
Raw vibration signals are initially subjected to preprocessing via FFT and VMD, with the aim of extracting time–frequency-domain features. Subsequent to this preprocessing step, the processed data is partitioned into three subsets: training, validation, and test sets. Specifically, for the signal preprocessing procedure, each of the four distinct types of fault signals undergoes both FFT and VMD. The fault time-domain signals, after being transformed by FFT, are presented visually in
Figure 7.
It can be seen from the figure that the image after FFT has relatively obvious features, indicating that FFT can effectively extract frequency-domain features. There are obvious differences between the spectrograms corresponding to different faults, and these features help to distinguish different fault types and their severity levels.
Although the Fast Fourier Transform has many advantages, it cannot effectively capture the variation in frequency with time when dealing with non-stationary signals. To avoid this problem, VMD is introduced in the signal preprocessing stage.
The VMD method transforms the multi-component signal decomposition problem into a variational optimization problem, and adaptively iterates to obtain several Intrinsic Mode Functions (IMFs), exhibiting excellent performance in noise suppression and component separation. By constructing and solving a constrained variational problem, VMD iteratively updates each mode and its central frequency, enabling an energy concentration of each mode around its respective central frequency with minimized bandwidth. For simplicity of analysis, the number of modes to be decomposed is generally assumed to be $K$, and each mode can be regarded as a band-pass signal near the central frequency $\omega_k$. When the modes are separated from each other in the frequency spectrum, the overall signal spectrum is reasonably partitioned.
Utilizing the central frequency method, an analysis is conducted on the central frequencies corresponding to different values of K. The outcomes of the VMD are illustrated in
Figure 8. It is found that, when K equals 4, certain modes start to display similar central frequencies, which signifies the occurrence of over-decomposition; excessive decomposition does not necessarily facilitate feature extraction. Other parameters are configured as follows: α = 4000 and τ = 0.03. After the data is denoised by means of VMD, noise components are effectively eliminated, while both the fault and normal signal characteristics are preserved. This denoising process ultimately contributes to enhancing the accuracy of fault classification.
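The central frequency criterion can be automated with a small helper that flags a value of K as over-decomposed once two neighbouring modes have nearly the same central frequency. The sketch is a plain-Python illustration: the function names, the 10% relative-gap threshold, and the frequency values are hypothetical and are not the values shown in Figure 8.

```python
def min_relative_gap(center_freqs):
    """Smallest gap between neighbouring central frequencies, relative to the larger one."""
    f = sorted(center_freqs)
    return min((hi - lo) / hi for lo, hi in zip(f, f[1:]))


def largest_safe_k(freqs_by_k, threshold=0.1):
    """Largest K, scanning upward, reached before two modes come closer than `threshold`."""
    best = None
    for k in sorted(freqs_by_k):
        if min_relative_gap(freqs_by_k[k]) < threshold:
            break                     # similar central frequencies: over-decomposition
        best = k
    return best


# Hypothetical central frequencies in Hz, NOT taken from the paper's Figure 8.
freqs_by_k = {
    2: [310.0, 1480.0],
    3: [305.0, 1210.0, 1950.0],
    4: [300.0, 1190.0, 1240.0, 1960.0],
}
selected_k = largest_safe_k(freqs_by_k)
```

For these made-up values the check stops at K = 4, where two modes sit within about 4% of each other, and returns the previous K. The appropriate threshold depends on the signal and would need to be tuned.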
Finally, the vector concatenation method is adopted for combination. The FFT spectrum and the VMD components are stacked along the channel dimension to generate multi-scale fusion features. By combining FFT and VMD, the signal can first be transformed into the frequency-domain using FFT to obtain its spectral information. Meanwhile, VMD is applied to the fault signal to decompose it into a series of modal functions. By analyzing this information, multi-scale features in the fault signal can be mined, thereby better understanding the time–frequency characteristics of the signal. The method enables a more comprehensive analysis of the signal and facilitates signal processing tasks in application scenarios such as fault detection and diagnosis.
The full dataset is randomly split into three subsets with a stratified ratio of 7:2:1, namely the training set, validation set, and held-out test set. The training set is used for model fitting, the validation set for early stopping and hyperparameter optimization, and the independent test set is exclusively used to quantify the generalization capability of the proposed dual-branch fusion model. Test data does not participate in the training process.
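A stratified 7:2:1 split can be sketched with the standard library alone, as shown here. The function name, the fixed seed, and the toy label list are assumptions for illustration; the procedure actually used in the experiments may differ in detail.

```python
import random
from collections import defaultdict


def stratified_split(labels, ratios=(0.7, 0.2, 0.1), seed=0):
    """Return (train, val, test) index lists, splitting each class by `ratios`."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(ratios[0] * len(indices))
        n_val = round(ratios[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]      # the remainder goes to the test set
    return train, val, test


labels = [c for c in range(5) for _ in range(100)]   # 5 classes x 100 samples, illustrative
train, val, test = stratified_split(labels)
```

Splitting inside each class keeps the class proportions identical across the three subsets, and the index sets are disjoint by construction, so no test sample can leak into training.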
3.2. Swin Transformer–Improved ResNet Module
In the proposed model, the preprocessed signals are simultaneously processed through two parallel branches:
Branch 1: The preprocessed data is fed into the Swin Transformer network, which utilizes its window-based attention mechanism to extract fault-related features from the signals. The Swin Transformer, which draws inspiration from the hierarchical design of convolutional neural networks, achieves global modeling capability through its shifted window attention mechanism while reducing computational complexity from quadratic to linear in the input resolution. This linear complexity substantially decreases the computational overhead and the training cost of the model. The specific parameter information of the Swin Transformer network is shown in Table 1.
Branch 2: As illustrated in Figure 9, the frequency-domain signal samples are concurrently input into the Improved ResNet module. The input signals first pass through a standard convolutional layer with wide kernels followed by a pooling layer, which reduces the influence of noise on useful feature extraction. The signals are then processed by the ORBB layer and the pooling layer, where the combined use of DWC and PWC strengthens the extraction of subtle features that may be masked by noise. A BN layer is applied to alleviate internal covariate shift caused by noise interference.
3.3. Feature Fusion
Simply concatenating or applying weighted summation to the features extracted from the two branches may be insufficient to fully exploit their complementary information. To address this limitation, an ECA module is incorporated to dynamically assign channel-wise weights, enabling the model to autonomously learn the relative importance of different channels. In the proposed approach, the features obtained from the Swin Transformer and Improved ResNet module branches are integrated using the ECA mechanism. The fused features are subsequently processed through fully connected (FC) layers and a Softmax classifier to complete the final classification. This fusion strategy, when applied within a dual-channel architecture, mitigates feature redundancy more effectively than in single-branch models and adaptively adjusts the contributions of different channels. The fault diagnosis workflow of the Swin Transformer–Improved ResNet module is illustrated in
Figure 10.
Step 1: Acquire fault data using the data collection system.
Step 2: Apply FFT and VMD to the vibration signals and construct the dataset.
Step 3: Perform dual-branch feature extraction using the Swin Transformer and Improved ResNet module.
Step 4: Fuse the extracted features using the ECA module.
Step 5: Use the trained model to diagnose the samples in the test set.
4. Experimental Verification
To evaluate the accuracy and effectiveness of the proposed Swin Transformer–Improved ResNet, a series of experiments were conducted, including model validation, comparative analysis, ablation studies, complexity assessments, and noise robustness evaluations. In addition, to further assess the model’s robustness in practical industrial scenarios, a comparative experiment was performed using the SEU bearing dataset. Generalization capability was also tested through experiments on the CWRU bearing dataset. The model was implemented using PyTorch 2.2.1 under a Python 3.7 environment. The hardware setup included an Intel Core i9-13900K processor and 128 GB of RAM. For the sake of fair comparison and reproducibility of experimental results, all models were subjected to the exact same preprocessing steps during both the training and evaluation stages. A dropout rate of 0.5 was applied, randomly deactivating 50% of neurons during training to prevent overfitting. The cross-entropy loss function was adopted as the training loss. The Adam optimizer was used with a learning rate of 0.0003, and the model was trained for 50 epochs.
4.1. Dataset Description
4.1.1. SEU Bearing Dataset
The Southeast University (SEU) dataset [
32] consists of two sub-datasets: a bearing dataset and a gear dataset. Its fault data was collected from a Drivetrain Dynamic Simulator. Two different operating conditions were set during data acquisition, namely a speed–system load of 20 Hz–0 V and 30 Hz–2 V, with a sampling frequency of 5120 Hz. The SEU dataset contains 10 fault conditions in total, including 5 rolling bearing fault conditions, specifically the normal condition, rolling element fault, inner race fault, outer race fault, and compound fault. The specific fault types of the bearings are shown in
Table 2.
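As a rough illustration of the frequency-domain preprocessing applied to such signals, the sketch below computes a single-sided amplitude spectrum at the stated sampling frequency of 5120 Hz. The window length of 256 samples and the 100 Hz test tone are assumptions chosen for the example, and a naive DFT is used instead of an optimized FFT routine to keep the code dependency-free.

```python
import cmath
import math

def magnitude_spectrum(signal, fs):
    """Single-sided amplitude spectrum via a naive DFT (O(n^2), illustration only)."""
    n = len(signal)
    half = n // 2
    freqs = [k * fs / n for k in range(half)]
    mags = []
    for k in range(half):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(signal))
        scale = 1.0 / n if k == 0 else 2.0 / n  # DC bin is not doubled
        mags.append(scale * abs(acc))
    return freqs, mags

FS = 5120      # sampling frequency stated for the SEU dataset (Hz)
N = 256        # window length (assumption); frequency resolution is FS / N = 20 Hz
TONE_HZ = 100  # hypothetical test tone, not a real fault frequency

signal = [math.sin(2 * math.pi * TONE_HZ * t / FS) for t in range(N)]
freqs, mags = magnitude_spectrum(signal, FS)
peak_hz = freqs[max(range(len(mags)), key=mags.__getitem__)]
```

With these settings the tone falls exactly on a frequency bin, so the spectrum shows a single peak of unit amplitude at 100 Hz.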
The bearing vibration data were acquired on a dedicated fault simulation test rig composed of a drive motor, motor controller, planetary gearbox, transmission gearbox, load regulation unit, and brake controller. Its structure is depicted in
Figure 11.
4.1.2. CWRU Bearing Dataset
The dataset used for cross-domain generalization validation experiments is the standard rolling bearing fault dataset sourced from the Bearing Data Center of Case Western Reserve University (CWRU) [
33]. The experimental configuration is depicted in
Figure 12, which mainly includes a horsepower motor, sensors, dynamometers, and control electronics. The tested bearing model is the SKF6205 motor bearing.
To conduct the generalization experiment and evaluate the model’s generalization capability, vibration signals obtained from the drive end of the CWRU benchmark dataset were used, with a sampling frequency of 12 kHz. The dataset includes four fault diameter categories, each corresponding to a distinct fault type. All the faults in question were recorded under four different levels of motor load, specifically 0 hp, 1 hp, 2 hp, and 3 hp. The data is categorized into 10 classes to distinguish both the fault location and diameter, as detailed in
Table 3. Based on the load conditions, the dataset is divided into four subsets: A, B, C, and D.
4.2. Model Validation Experiment
In the validation experiment, the SEU bearing dataset was utilized. To reduce randomness, each experiment was repeated five times, and the average results of the five runs are presented in
Figure 13. During each training epoch, input data is processed concurrently through the dual-branch architecture for feature extraction, followed by classification using fully connected layers. The model exhibits stable behavior throughout training. Both training accuracy and cross-entropy loss begin to converge around the 30th epoch. The final average classification accuracy reaches 99.41%, while the cross-entropy loss decreases to below 0.02.
To further analyze the classification performance of the model across different fault categories, the confusion matrix is presented in
Figure 14. The vertical axis of the matrix represents the true labels of the samples, and the horizontal axis corresponds to the labels predicted by the model. Most values are concentrated on the main diagonal, which indicates that the large majority of samples in each fault category are classified correctly and that the proposed model has strong fault identification capability.
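A confusion matrix with this orientation (rows for true labels, columns for predicted labels) can be built as in the sketch below. The label lists are hypothetical toy values, not results from the paper, and the helper names are illustrative.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Rows index the true class, columns index the predicted class."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

def diagonal_fraction(matrix):
    """Share of samples on the main diagonal, i.e. the overall accuracy."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total

# Hypothetical labels for a three-class toy problem.
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2, 2]
cm = confusion_matrix(y_true, y_pred, n_classes=3)
acc = diagonal_fraction(cm)
```

Off-diagonal entries show which classes are confused with each other; here one class-1 sample is predicted as class 2.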
Additionally, t-distributed stochastic neighbor embedding (t-SNE) [
34] is employed for dimensionality reduction and data visualization, as shown in
Figure 15. The t-SNE method is applied to both the original features and those extracted by the proposed model, allowing a comparison of the feature distributions and of the model’s capacity to discriminate between bearing fault types. As illustrated, nearly all faulty samples are accurately identified, demonstrating the effectiveness of the proposed model in fault classification tasks.
4.3. Performance Evaluation Metrics
Five evaluation metrics are adopted to assess the feasibility and practicality of the model, namely accuracy, precision, recall, F1-score, and specificity. These metrics provide a quantitative assessment of the proposed method’s overall effectiveness.
Accuracy is a fundamental evaluation metric for classification models, representing the proportion of correctly classified instances relative to the total number of instances in the dataset. It provides an overall measure of the model’s predictive performance. It is defined as
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
Precision quantifies the proportion of instances predicted as positive that are truly positive, reflecting the reliability of the model’s positive predictions. It is defined as
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
Recall measures the proportion of actual positive instances that are correctly identified by the model, indicating the model’s capacity to capture positive-class samples. It is defined as
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
F1-score represents the harmonic mean of precision and recall, delivering a balanced assessment that accounts for both false positives and false negatives. This metric is well suited to imbalanced data distributions. It is defined as
$$\mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
Specificity measures the model’s ability to correctly identify negative samples, and therefore reflects its capacity to limit false positives. It is defined as
$$\mathrm{Specificity} = \frac{TN}{TN + FP}$$
Among these, True Positive (TP) is defined as the correct recognition of an instance that genuinely falls into the positive class. False Positive (FP) occurs when a sample that is actually negative is incorrectly categorized as positive. True Negative (TN) refers to the accurate classification of a sample that is truly a negative instance. False Negative (FN) occurs when a positive instance is incorrectly classified as negative.
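In a multi-class setting, TP, FP, TN, and FN are usually counted per class in a one-vs-rest fashion and then averaged. The sketch below shows one way to do this from a confusion matrix with true labels on the rows. Macro averaging is an assumption here, since the averaging scheme is not stated in the text, and the example matrix is hypothetical.

```python
def per_class_counts(cm, k):
    """One-vs-rest counts (TP, FP, TN, FN) for class k; rows of cm are true labels."""
    total = sum(sum(row) for row in cm)
    tp = cm[k][k]
    fn = sum(cm[k]) - tp
    fp = sum(row[k] for row in cm) - tp
    tn = total - tp - fn - fp
    return tp, fp, tn, fn

def macro_metrics(cm):
    """Accuracy plus macro-averaged precision, recall, F1-score and specificity."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    sums = {"precision": 0.0, "recall": 0.0, "f1": 0.0, "specificity": 0.0}
    for k in range(n):
        tp, fp, tn, fn = per_class_counts(cm, k)
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        sums["precision"] += p
        sums["recall"] += r
        sums["f1"] += 2 * p * r / (p + r) if p + r else 0.0
        sums["specificity"] += tn / (tn + fp) if tn + fp else 0.0
    result = {name: value / n for name, value in sums.items()}
    result["accuracy"] = sum(cm[i][i] for i in range(n)) / total
    return result

# Hypothetical three-class confusion matrix (not taken from the paper).
cm = [[9, 1, 0],
      [0, 10, 0],
      [1, 0, 9]]
metrics = macro_metrics(cm)
```

For class 0 in this toy matrix the counts are TP = 9, FP = 1, TN = 19, and FN = 1, so its precision and recall are both 0.9.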
4.4. Model Ablation Experiment
Ablation experiments were conducted to investigate the factors influencing model performance and to determine optimal configurations by evaluating various architectural settings. To assess the contribution of each key component in the proposed method, a series of ablation studies was performed on the Swin Transformer–Improved ResNet feature extraction framework and the ECA module. By progressively removing individual modules and comparing the diagnostic performance of each configuration, it was observed that using only the Swin Transformer or only the Improved ResNet module led to relatively inferior results, and that integrating the ECA module for feature fusion improved performance compared to configurations without it. These findings indicate that every module incorporated in the model contributes to its performance. To reduce randomness, each experiment was repeated five times. The average results of the five runs are presented in
Table 4.
Experimental results show that using either the Improved ResNet module or the Swin Transformer module alone provides only limited fault recognition capability, with accuracies of 94.05% and 94.51%, respectively. When the two modules are fused in parallel, the accuracy improves to 95.82%. Further integration with the ECA module increases the diagnostic accuracy to 96.58%.
On this basis, a signal preprocessing module is introduced to further enhance the feature quality. After adding FFT or VMD, the accuracy rises to 97.87% and 97.35%, respectively, indicating that both frequency-domain spectral features and modal components make positive contributions to the enhancement of diagnostic information.
When the combined preprocessing strategy of FFT and VMD is adopted, the model achieves the best performance, with an accuracy of 99.41%. The precision, recall, and F1-score also reach their highest values, which verifies the effectiveness and robustness of the proposed method in identifying complex faults.
4.5. Model Comparison Experiment
To further demonstrate the reliable performance of the proposed method, a comparative analysis of diagnostic accuracy across multiple models was conducted using the SEU bearing dataset. To ensure fairness and comparability, all models were trained and evaluated under identical preprocessing procedures, and their hyperparameters were tuned in a consistent manner. Several mainstream deep learning models were selected as baselines: LiConvFormer [
35], Autoformer [
36], CLFormer [
25], ResNet18 [
21], CNN-LSTM [
37], TCN [
38], and CNN [
39]. To reduce randomness, each experiment was repeated five times. A spider chart illustrating the performance metrics of the eight models (the proposed method and the seven baselines) on the SEU bearing dataset is presented in
Figure 16.
The comparative results are summarized in
Table 5. Under identical preprocessing conditions, experiments with the selected baseline and Transformer-based models clearly demonstrate that the proposed method outperforms the others across multiple evaluation metrics, including accuracy, precision, recall, and F1-score. The experimental findings confirm that the window-based self-attention mechanism of the Swin Transformer effectively captures local fault features and facilitates the integration of multi-scale feature representations. The Improved ResNet module enhances model generalization through the computational efficiency of depthwise separable convolutions. Furthermore, the ECA module adaptively fuses multi-branch features by assigning dynamic weights to feature channels and reducing redundant information, thereby improving the discriminative and expressive capabilities of the extracted features. In summary, under uniform conditions, the proposed method exhibits superior performance compared to alternative models, underscoring its strong potential for practical fault diagnosis applications.
4.6. Simulated Noise Experiment
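The signal-to-noise ratio (SNR) [
40] is widely adopted to measure the noise level contained in vibration signals, and is formulated as follows:
$$\mathrm{SNR}_{\mathrm{dB}} = 10 \log_{10}\left(\frac{P_{\mathrm{signal}}}{P_{\mathrm{noise}}}\right)$$
where $P_{\mathrm{signal}}$ represents the average power of the original signal and $P_{\mathrm{noise}}$ denotes the average power of the noise component.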
To simulate the noise interference encountered during actual bearing operation and to evaluate the diagnostic reliability of the proposed model under high-noise conditions, white Gaussian noise was artificially introduced into the test set. This setup mimics the distributional discrepancy between training and testing data typically observed in real-world industrial environments. For the SEU bearing dataset, noise was added at SNR levels ranging from −4 dB to 2 dB. Comparative experiments were conducted against three residual network-based models, namely IDRSN [
23], DRSN [
22], and DRN [
21], as well as a conventional CNN model. To reduce randomness, each experiment was repeated five times.
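The noise injection described above can be sketched as follows: the noise power is derived from the signal power and the target SNR in decibels, and zero-mean Gaussian noise with that variance is added. The sine test signal, the seed, and the function names are assumptions made for illustration and are not taken from the experimental code.

```python
import math
import random

def add_awgn(signal, target_snr_db, rng):
    """Add zero-mean white Gaussian noise so the result has the target SNR (dB)."""
    p_signal = sum(x * x for x in signal) / len(signal)
    p_noise = p_signal / (10 ** (target_snr_db / 10))
    sigma = math.sqrt(p_noise)
    noise = [rng.gauss(0.0, sigma) for _ in signal]
    noisy = [x + n for x, n in zip(signal, noise)]
    return noisy, noise

def measure_snr_db(signal, noise):
    """SNR in dB computed from the average powers of signal and noise."""
    p_signal = sum(x * x for x in signal) / len(signal)
    p_noise = sum(n * n for n in noise) / len(noise)
    return 10 * math.log10(p_signal / p_noise)

rng = random.Random(0)  # fixed seed for repeatability (assumption)
clean = [math.sin(2 * math.pi * 50 * t / 5120) for t in range(20000)]  # toy signal
noisy, noise = add_awgn(clean, target_snr_db=-4.0, rng=rng)
measured = measure_snr_db(clean, noise)
```

At −4 dB the noise power is roughly 2.5 times the signal power, which is why this setting is treated as a harsh condition.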
Table 6 shows that the proposed method maintains the highest average fault identification accuracy in both the noise-free setting and noisy environments with SNRs from −4 dB to 2 dB. In particular, under the harsh −4 dB condition, the proposed model reaches an average accuracy of 91.92%, outperforming IDRSN, DRSN, DRN, and CNN, which achieve 88.33%, 84.67%, 80.67%, and 68.67%, respectively. Under moderate noise at 2 dB SNR, the proposed method attains an average diagnostic accuracy of 98.82%, exceeding the best baseline, IDRSN, by 1.49 percentage points. Only at an SNR of −1 dB does the recognition performance of IDRSN closely approach that of the proposed model. The overall superiority and robustness of the proposed method under different noise intensities are further visualized in
Figure 17.
To further evaluate diagnostic performance under significant noise interference, the recognition stability of different models was compared using box plots. In these plots, model A represents the proposed method; B, C, D, and E correspond to the IDRSN, DRSN, DRN, and CNN models, respectively. Experiments were conducted under noisy conditions with SNRs of −4 dB and −2 dB. As depicted in
Figure 18, the proposed model achieved superior recognition performance with low variance, even in high-noise scenarios. As evidenced by the above results, the presented strategy yields remarkable gains in both detection precision and robustness, which is critical for bearing fault diagnosis amid intense noise disturbance.
4.7. Model Generalization Performance
The ability of the diagnostic model to identify bearing damage under diverse experimental conditions is an essential index for quantitative performance evaluation. To assess this capability, a generalization performance experiment was designed. Variations in load can modify the vibration signal characteristics of bearings. Validating the model under different load conditions enables an assessment of its adaptability to the resulting changes in signal features. This process helps verify the model’s stability and adaptability under diverse operating conditions and helps ensure that false alarms or misdiagnoses do not arise from load fluctuations.
Fault data under four distinct load conditions (A, B, C, and D) was employed to construct the training and testing sets. For example, A → B indicates that dataset A was utilized for model training and dataset B was utilized for testing. Similarly, other cross-load scenarios follow the same configuration. Five validation experiments were conducted, and the average value was computed as the final diagnostic result. As shown in
Figure 19, under varying load conditions, the proposed model achieved a mean fault diagnosis accuracy exceeding 98% on the CWRU dataset. Although a slight decrease in accuracy was observed with increasing load variation, the decline remained modest. These results indicate that the model extracts two-dimensional time–frequency features effectively and exhibits strong generalization and diagnostic performance.
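The cross-load protocol can be enumerated programmatically, as in the sketch below. The mapping of subsets A to D onto the 0 to 3 hp loads and the use of all twelve ordered source-target pairs are assumptions, since the text only gives A → B as an example.

```python
from itertools import permutations

# Assumed mapping of subset names to motor load in hp (not stated explicitly).
LOAD_SUBSETS = {"A": 0, "B": 1, "C": 2, "D": 3}

def cross_load_tasks(subsets):
    """All ordered (train, test) pairs in which the two subsets differ."""
    return list(permutations(subsets, 2))

tasks = cross_load_tasks(LOAD_SUBSETS)
labels = [f"{src} → {dst}" for src, dst in tasks]
```

Each task trains on one load condition and tests on another, so the test data always come from a load the model has not seen during training.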
5. Conclusions
To enhance the diagnostic performance of fault diagnosis models under complex operational conditions, a novel approach based on a dual-branch Swin Transformer–Improved ResNet module integrated with an ECA module is proposed. The main contributions are summarized as follows:
(1) The Swin Transformer branch employs a window-based self-attention mechanism to capture local features from fault signals, enabling effective integration of multi-level feature representations, while the Improved ResNet module branch utilizes depthwise separable convolutions to reduce computational complexity and improve generalization capability. These two models are integrated via a dual-branch parallel structure for feature extraction to enhance overall model robustness.
(2) The incorporated ECA module adaptively recalibrates feature channels by assigning differential weights and minimizing information redundancy, after which the fused features are processed through adaptive pooling and FC layers for fault classification. Embedding the ECA module after feature fusion highlights and reinforces important features while suppressing redundant ones, a design that enables dynamic feature enhancement and noise suppression in bearing fault diagnosis.
(3) Simulation results demonstrate that the proposed method surpasses several state-of-the-art approaches across multiple metrics. Noise interference experiments further confirm its enhanced robustness in complex scenarios compared to alternative methods, and ablation analyses confirm that the proposed structural design contributes critically to the improvement of the model’s overall diagnostic performance.
(4) Generalization tests underscore the model’s adaptability, making it highly suitable for bearing fault detection across diverse operational environments.
Future work holds potential for further enhancement. Subsequent research in bearing fault diagnosis will prioritize the following aspects. First, noise robustness will be improved by incorporating a broader spectrum of diverse and challenging noise types. Second, bearing fault signals will be gathered from a wider array of industrial production scenarios to enhance model robustness and interference resistance using real-world datasets. Third, inspired by Li et al. [
40], who developed a continual learning model (UACLF) for fault diagnosis of rotating machinery in dynamic environments, the integration of continual learning with the proposed model will be explored to achieve superior fault diagnosis performance in such dynamic settings.