1. Introduction
Across modern industrial settings, maintaining the operational reliability of mechanical equipment is the foundation for ensuring smooth production, notably for rotation-driven machinery like aero-engines, high-speed trains, drive motors, and gearboxes. In these rotating machines, rolling bearings are widely used as core components [
1,
2]. Due to high-speed operating conditions and heavy loads, however, localized faults frequently occur in rolling bearings. Furthermore, compound faults, which accelerate the performance degradation of rolling bearings, are more destructive to mechanical equipment than single faults. Therefore, accurate and prompt detection of compound bearing faults is crucial for ensuring the stable operation of mechanical equipment in modern industrial systems.
Inspired by advances in artificial intelligence [
3], various technologies have been applied to fault diagnosis. Existing bearing fault diagnosis methods can be categorized from three complementary perspectives: signal representation (e.g., time–frequency analysis), learning paradigms (e.g., feature-based or end-to-end deep learning), and sensing modalities (e.g., single-sensor or multi-sensor fusion). Across broader manufacturing applications, frequency analysis and multi-sensor CNN architectures are well-established strategies for condition monitoring and predictive maintenance. Building upon these techniques, our work focuses on an architectural bottleneck: how to efficiently extract and fuse heterogeneous, multi-scale data under complex and variable industrial conditions. Time–frequency domain methods preprocess raw signals prior to model input, commonly using algorithms such as the Short-Time Fourier Transform (STFT), Fast Fourier Transform (FFT), and Wavelet Transform (WT) to extract temporal, spectral, and other discriminative features from vibration signals [
4]. For instance, Peng et al. [
5] developed a fault diagnosis model that integrates STFT with ResNet [
6], converting sensor signals into time–frequency representations and extracting features through residual learning. Similarly, Liang et al. [
7] proposed a small-sample bearing fault diagnosis method combining CNN with Generative Adversarial Networks (GANs), where the time–frequency images obtained by WT were enhanced via GANs to further improve the diagnostic performance of CNN-based models. Time–frequency representations can be used in both feature-based pipelines and deep learning models, where they often serve as informative inputs to improve feature separability under complex operating conditions.
More recently, Li et al. [
8] introduced an adaptive variational mode extraction method that segments the autopower spectrum via K-means clustering and iteratively optimizes mode parameters to isolate fault-related components under strong nonstationary operating conditions of wind turbine bearings. Jian et al. [
9] proposed a fault diagnosis approach based on compressive sensing and a parameter-efficient SqueezeNet model, where the original vibration signal is first sparsely reconstructed using the CoSaMP algorithm and then encoded into a recurrence plot to preserve temporal correlations before classification. To enhance model interpretability, Gao et al. [
10] designed a fully interpretable convolutional neural network in which each layer—from wavelet-inspired convolution to sparse feature selection—corresponds to a physically meaningful signal processing operation. Meanwhile, Zhang et al. [
11] combined adaptive wavelet denoising with a Kolmogorov–Arnold Network (KAN)-based Transformer, leveraging time-series gradient activation mapping to ensure transparent and reliable diagnosis under noisy and small-sample conditions.
On the other hand, the “end-to-end” fault diagnosis method refers to directly using the collected data as the input of the network model [
12]. For instance, Xie et al. [
13] proposed a method for converting multiple signals into RGB images based on PCA, fusing the signals into three-channel RGB images and simultaneously using an improved residual network model to enhance the accuracy of fault diagnosis. Niu et al. [
14] developed a DR-CNN model that enhances discriminative feature learning by integrating multi-sensor data from different operating conditions and domain knowledge while introducing two feature attention modules to improve feature representation and diagnostic performance. Wang et al. [
15] proposed a feature fusion network incorporating dual attention mechanisms. Fan et al. [
16] presented a parameter-efficient multi-scale and multi-attention fusion network, which strengthens feature learning and noise resistance by adaptively adjusting feature weights. Recent end-to-end works further address noise, dynamics, and system complexity. Zhao et al. [
17] proposed a pure convolutional network based on ConvNeXt architecture with dense connections and a joint spatial-channel attention module, which enhances feature representation and achieves robust fault diagnosis under extremely low signal-to-noise ratios. Cheng et al. [
18] integrated local convolutional operators with relation-aware self-attention in a Transformer framework, enabling simultaneous capture of transient impacts and long-range temporal dependencies in raw vibration signals. Zhang et al. [
19] developed an adaptive representation dual classifier residual network that mitigates catastrophic forgetting in domain-incremental scenarios through decision boundary separation and feature subspace alignment. Zhang et al. [
20] formulated wind turbine fault diagnosis as a multi-task learning problem, where operational data under different conditions are treated as separate tasks in a shared-bottom network to improve generalization under non-stationary and outlier-prone industrial environments. Finally, Ma et al. [
21] proposed a spatio-temporal graph convolutional neural network that constructs a Gaussian kernel-based adjacency matrix to model sensor correlations and jointly applies graph convolution to extract spatial and temporal features for system-level fault diagnosis.
Beyond traditional CNNs, emerging Transformer architectures have recently been introduced to fault diagnosis due to their superior ability to model long-range global dependencies. However, standard Transformers are often constrained by massive parameter counts and quadratic computational complexity, hindering their deployment on industrial edge devices. To address this trade-off, cutting-edge lightweight transformers have been proposed. For instance, Hou et al. [
22] developed a lightweight transformer based on feature fusion and global–local parallel stacked self-activation units, which achieves efficient feature extraction while significantly reducing resource requirements. While such lightweight transformers provide an excellent balance between global and local sensing, advanced CNN-based architectures offer distinct advantages in edge-computing scenarios. By replacing self-attention with dual-convolution strategies, CNN-based models can mimic global sensing capabilities while pushing the parameter count to an absolute minimum.
Furthermore, the impact of AI-driven fault management extends significantly beyond standard rotating machinery into the realm of complex nonlinear systems. Modern industrial settings frequently involve equipment with profound dynamic complexities, such as manipulator systems subjected to uncertainties and time delays. For instance, recent advancements have explored neural network-based reinforcement iterative learning schemes to achieve robust fault estimation in such nonlinear uncertain environments [
23]. Discussing these methodologies highlights the broader applicability and necessity of advanced deep learning frameworks in addressing evolving industrial fault diagnostic challenges.
However, relying solely on a single signal source for fault information is inherently limited since it can only reflect the operating condition from a specific physical domain (mechanical or electrical). While vibration signals are effective at capturing dynamic anomalies, they are susceptible to sensor mounting location and environmental noise; moreover, their fault signatures can be extremely weak or masked under certain failure modes [
24]. Conversely, motor current signals reflect electromagnetic modulation effects induced by mechanical faults and offer advantages of simple sensor deployment and low cost, yet their fault-related features are often buried within the dominant fundamental frequency component, leading to significantly reduced diagnostic sensitivity—especially under light-load operating conditions [
25,
26]. To mitigate these limitations, multi-sensor fusion leverages complementary information from heterogeneous measurements to improve feature completeness and robustness. Furthermore, since fault signatures may appear at different time–frequency scales, multi-scale feature extraction is beneficial for capturing both local details and global structures. In this work, we adopt a feature-level fusion scheme where multi-scale features are extracted in parallel from vibration and current signals, and then selectively aggregated by the proposed skip fusion module to enhance cross-sensor interactions. This approach not only enables complementary integration of mechanical and electrical information but also significantly enhances diagnostic accuracy while reducing the risk of misdiagnosis caused by single-sensor failures.
To address these issues, a bearing fault diagnosis method based on multi-sensor and multi-scale feature fusion is proposed. Specifically, the method uses multi-sensor data such as vibration and current signals as inputs to provide richer feature information. Leveraging the ability of neural networks to uncover hidden patterns in training samples, multi-scale features are extracted from multi-source data. Moreover, attention mechanisms are devised to adaptively select important multi-scale features and perform branch fusion. Then, self-calibrated convolution and dilated convolution are embedded without introducing any additional parameters or increasing model complexity. By alternating two layers of self-calibrated convolution and dilated convolution, more discriminative features are extracted to improve the accuracy of fault diagnosis while keeping the model parameter-efficient. Finally, the fault classification results are output through a Softmax classifier. In summary, our main contributions are fourfold:
A parameter-efficient method for bearing fault diagnosis is proposed. By utilizing vibration, current, and other multi-sensor signals as inputs, the model extracts multi-scale features from multi-source data to enhance diagnostic accuracy.
A skip-attention module is designed to adaptively select key features across multiple scales and perform branch fusion, enabling more effective extraction of critical information and improving the overall diagnostic performance.
Self-calibrated convolution and dilated convolution are incorporated, where two layers of these modules are alternately applied to capture more discriminative features. Moreover, this design introduces no additional parameters or computational complexity.
Extensive experiments are conducted on bearing datasets to evaluate the proposed approach. The results demonstrate that the method achieves superior accuracy, recall, and F1-score compared with existing approaches, under both constant and variable operating conditions.
The remainder of this paper is organized as follows:
Section 2 outlines the framework of the proposed bearing fault diagnosis model.
Section 3 details the complete diagnostic procedure.
Section 4 reports the experimental results and compares the method with state-of-the-art approaches on benchmark datasets. Finally,
Section 5 summarizes the main findings and concludes the paper.
2. The Proposed Method
To address the low diagnostic accuracy of single-sensor methods under constant and variable operating conditions, a bearing fault diagnosis approach based on multi-sensor and multi-scale skip fusion is proposed, as shown in
Figure 1.
The proposed model consists of three main components, the first of which is the Fourier Transform module. To prevent overfitting and ensure robust model training, data augmentation is initially applied to increase the number of vibration and current samples. Subsequently, because bearing fault characteristics are typically more pronounced in the frequency domain, the augmented time-domain signals are converted using the Fast Fourier Transform (FFT). This transformation provides clearer and more discriminative representations for downstream fault feature extraction.
The second component is the multi-sensor and multi-scale feature extraction with skip fusion module. To achieve comprehensive and effective feature representation, multi-scale convolutional extraction modules are constructed for both current and vibration signals. On this basis, a skip-attention submodule is introduced to adaptively assign weights to the extracted features, emphasizing critical information while suppressing redundant or irrelevant responses. Grouped convolution operations are further applied to reduce feature dimensionality, which not only decreases computational cost but also enhances feature compactness. Finally, the features extracted from the current and vibration branches are fused to improve the model’s overall capability in recognizing bearing fault characteristics.
The third component is the dual-convolution fault diagnosis module. Without introducing extra parameters or increasing model complexity, this module enhances the network’s nonlinearity and representational power to capture more informative fault features. A two-layer dual-convolution structure is designed to enable the model to learn more complex, abstract, and discriminative representations, thereby further improving the diagnostic accuracy and robustness.
2.1. Fourier Transform Module
Since neural network models require sufficient data to learn meaningful representations, limited training samples can cause excessive sensitivity to specific features and lead to overfitting. To enhance stability and robustness while ensuring effective feature learning, an overlapping segmentation strategy is used to augment the data.
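The overlapping segmentation strategy can be illustrated with a minimal sketch. The function name `overlapping_windows` and the window and stride values are illustrative assumptions; the actual window length and overlap ratio are not stated here.

```python
def overlapping_windows(signal, window, stride):
    """Cut a 1-D signal into fixed-length windows that overlap when stride < window.

    `window` and `stride` are hypothetical parameters chosen for illustration.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    # The last start index is chosen so that every window is complete.
    return [signal[start:start + window]
            for start in range(0, len(signal) - window + 1, stride)]


# A 10-sample toy signal with window 4 and stride 2 (50% overlap).
samples = overlapping_windows(list(range(10)), window=4, stride=2)
```

With a signal of length N, this yields floor((N - window) / stride) + 1 samples, so a smaller stride produces more (and more strongly correlated) training samples from the same recording.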
The fault signals of bearings are usually periodic. To efficiently extract spectral features, the signals are transformed into the frequency domain using the Fast Fourier Transform (FFT). Taking the data from the first sampling moment of the bearing dataset N15_M07_F04_KA04 from the University of Paderborn, Germany, as an example, the current and vibration signals are transformed into the frequency domain through the FFT.
Figure 2 displays the resulting amplitudes of the current and vibration in the frequency domain.
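As a rough illustration of this step, the sketch below computes a single-sided amplitude spectrum with a textbook radix-2 FFT written in plain Python. It is not the authors' implementation: the helper names, the 1024-sample window, and the synthetic 1 kHz test tone are assumptions, and only the 64 kHz sampling rate is taken from the dataset description.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; the length must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a non-zero power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def amplitude_spectrum(x, fs):
    """Return (frequencies in Hz, single-sided amplitudes) for a real signal."""
    n = len(x)
    half = n // 2
    spectrum = fft(x)
    amps = [2.0 * abs(c) / n for c in spectrum[:half]]
    amps[0] /= 2.0  # the DC bin has no mirrored counterpart
    freqs = [k * fs / n for k in range(half)]
    return freqs, amps


# Synthetic unit-amplitude 1 kHz tone sampled at 64 kHz (illustrative input).
fs, n = 64_000, 1024
tone = [math.sin(2 * math.pi * 1000 * i / fs) for i in range(n)]
freqs, amps = amplitude_spectrum(tone, fs)
peak = max(range(len(amps)), key=amps.__getitem__)
```

In practice a library routine such as a real-input FFT would replace the hand-written transform; the point here is only the mapping from a time-domain window to the amplitude vector that the network receives.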
2.2. Multi-Sensor Multi-Scale Feature Extraction and Skip Fusion Module
The multi-sensor multi-scale feature extraction and skip fusion module takes the frequency-domain representations of the augmented current and vibration signals as input. Because fault features are unevenly distributed between the current and vibration signals, single-scale convolution cannot capture all the details and structures in the data, whereas multi-scale convolution extracts richer features. Meanwhile, the skip-attention submodule assigns higher weights to the features that contribute more to the classification task. Dimensionality reduction is achieved through grouped convolution, which reduces the model’s complexity and computational load. The skip fusion module is shown in
Figure 3.
First, convolutional kernels are employed to extract basic features from the frequency-domain representations of the current and vibration signals, respectively. Second, because the fault features are imbalanced, the width of the convolutional kernels is increased, and a two-layer multi-scale convolutional structure with four kernel sizes is used to extract basic feature information. Smaller kernels capture local detail features, while larger kernels capture global features over a broader range.
In this study, to further enhance the network model’s ability to recognize fault features, we designed a skip-attention submodule in each branch. By introducing the SCSA mechanism [
27], which dynamically adjusts the importance weights of features across different channels, the network focuses more on the features that contribute significantly to the classification task while suppressing noise features irrelevant to fault diagnosis. Specifically, the skip-attention submodule first performs self-calibration on the features of each channel based on the SCSA mechanism: by calculating the mean and variance of the features in each channel, it adaptively adjusts their scale. This process not only enhances the distinguishability of the features but also enables the model to better accommodate distribution differences across channels. Subsequently, using the calibrated features, the skip-attention submodule computes channel-wise attention weights, which reflect the importance of each channel’s features to the classification task. Channels with higher weights are given greater influence, while those with lower weights are suppressed, thereby achieving dynamic feature selection and optimization. Finally, skip connections improve the efficiency and stability of feature propagation, further strengthening the model’s capacity to extract critical features.
By applying skip-attention after each branch, the model effectively extracts multi-scale features and adaptively adjusts their importance, further enhancing fault diagnosis performance. This mechanism also improves robustness and accuracy when processing complex fault signals.
2.3. Dual-Convolution Fault Diagnosis Module
Since the fused features still preserve their inherent multi-scale characteristics, Self-Calibrated Convolutions (SCNet) [
28] are introduced to extract deeper-level representations while keeping the model parameter-efficient. SCNet achieves this enhancement without introducing additional parameters or increasing model complexity. It performs convolutional feature transformations in both the original scale space and a downsampled self-calibrated space with lower resolution. By adaptively modeling long-range dependencies across channels and spatial locations at each position, SCNet generates more discriminative feature representations and strengthens the model’s feature extraction capability. The structure of the self-calibrated convolution is shown in
Figure 4, and the procedure is provided in Algorithm 1.
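| Algorithm 1: Self-calibrated convolution. |
| Input: Feature map $X$, convolution kernel $K$ |
| Output: Feature map $Y$ |
| Split $X$ into $X_1, X_2$; |
| Split $K$ into $K_1, K_2, K_3, K_4$; |
| $T_1 = \mathrm{AvgPool}_r(X_1)$; |
| $X_1' = \mathrm{Up}(T_1 * K_2)$; |
| $Y_1' = (X_1 * K_3) \cdot \sigma(X_1 + X_1')$; |
| $Y_1 = Y_1' * K_4$; |
| $Y_2 = X_2 * K_1$; |
| $Y = \mathrm{Concat}(Y_1, Y_2)$. |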
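The self-calibration steps can be sketched for a toy one-dimensional, single-channel-per-branch case. This sketch assumes the standard SCNet formulation from [28] (average pooling, upsampling, and a sigmoid gate); the helper names, the pooling rate `r = 2`, and nearest-neighbor upsampling are illustrative assumptions, not the authors' implementation.

```python
import math


def conv1d_same(x, kernel):
    """Zero-padded 1-D cross-correlation that keeps the input length (odd kernel)."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(w * padded[i + j] for j, w in enumerate(kernel))
            for i in range(len(x))]


def avg_pool(x, r):
    """Non-overlapping average pooling with window and stride r."""
    return [sum(x[i:i + r]) / r for i in range(0, len(x) - r + 1, r)]


def upsample_nearest(x, length, r):
    """Nearest-neighbor upsampling of a pooled signal back to `length` samples."""
    return [x[min(i // r, len(x) - 1)] for i in range(length)]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def self_calibrated_conv(x1, x2, k1, k2, k3, k4, r=2):
    """x1, x2: the two halves of the split feature map (len(x1) >= r assumed)."""
    t1 = avg_pool(x1, r)                                   # low-resolution view
    x1_up = upsample_nearest(conv1d_same(t1, k2), len(x1), r)
    gate = [sigmoid(a + b) for a, b in zip(x1, x1_up)]     # calibration weights
    y1_cal = [f * g for f, g in zip(conv1d_same(x1, k3), gate)]
    y1 = conv1d_same(y1_cal, k4)                           # calibrated branch
    y2 = conv1d_same(x2, k1)                               # plain branch
    return [y1, y2]                                        # channel concatenation


identity = [0.0, 1.0, 0.0]
x1 = [2.0] * 8
x2 = [float(i) for i in range(1, 9)]
y1, y2 = self_calibrated_conv(x1, x2, identity, identity, identity, identity)
```

With identity kernels the plain branch passes `x2` through unchanged, while every sample of the calibrated branch is scaled by the gate value sigmoid(2 + 2), which makes the gating effect easy to inspect.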
Because bearing signal features vary considerably, traditional convolutional modules struggle to identify the characteristic information of bearing signals effectively, and they also increase the number of convolutional kernel parameters and operations. To address this issue and enhance the capability of feature extraction, Dilated Convolution (DConv) is introduced to expand the convolutional sampling range and enlarge the receptive field [
29]. Dilated convolution enlarges the receptive field through spaced sampling, allowing more information to be captured without changing the kernel size, stride, or parameter count. In AMSF-SF-DC, the dilation rate is set to
, enabling a
kernel to achieve an effective receptive field of size 5 and thus improving feature extraction capability.
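The relationship between kernel size, dilation rate, and receptive field can be checked with a short sketch. For a kernel of size k and dilation rate d, the effective kernel size is d(k − 1) + 1. The pairing of a size-3 kernel with dilation rate 2 used below is an assumption chosen because it reproduces the stated receptive field of 5; the function names are illustrative.

```python
def effective_kernel_size(kernel_size, dilation):
    """Span covered by a dilated kernel: d * (k - 1) + 1."""
    return dilation * (kernel_size - 1) + 1


def dilated_conv1d(x, kernel, dilation):
    """Unpadded 1-D dilated cross-correlation; taps are spaced `dilation` apart."""
    span = effective_kernel_size(len(kernel), dilation)
    return [sum(w * x[i + j * dilation] for j, w in enumerate(kernel))
            for i in range(len(x) - span + 1)]


# Assumed example values: kernel size 3, dilation rate 2.
span = effective_kernel_size(3, 2)
ramp = [float(i) for i in range(10)]
out = dilated_conv1d(ramp, [1.0, 0.0, -1.0], dilation=2)
```

The kernel still holds only three weights, yet each output sample depends on inputs four positions apart, which is how the receptive field grows without any change in parameter count.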
By combining self-calibrated convolution and dilated convolution, the proposed dual-convolution module enhances feature representation from complementary perspectives. Specifically, self-calibrated convolution focuses on adaptive recalibration of local and channel-wise features, which improves the discriminability of fine-grained fault characteristics, while dilated convolution expands the receptive field to capture broader contextual information and long-range dependencies without increasing parameter count. Compared with using a single-convolution strategy, this dual-convolution design enables the model to simultaneously model local details and global structures, thereby improving robustness and diagnostic performance under complex operating conditions.
3. The Process of Bearing Fault Diagnosis
The overall process of the bearing fault diagnosis method is shown in
Figure 5. Current and vibration signals are first collected to construct the dataset. To avoid information leakage caused by overlapping windows, the raw signals are first divided into training, validation, and test sets in an 8:1:1 ratio at the acquisition-segment level. Then, an overlapping segmentation technique is applied for data augmentation within each subset, followed by Fast Fourier Transform (FFT) processing to obtain the corresponding frequency-domain representations.
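The ordering of these steps matters: splitting first and windowing afterward ensures that no two overlapping windows land in different subsets. A minimal sketch of the split follows, assuming recordings are identified by integer IDs and shuffled with a fixed seed (both are illustrative choices, not details given by the authors).

```python
import random


def split_records(record_ids, seed=0):
    """Split whole recordings 8:1:1 into train/validation/test before windowing."""
    ids = list(record_ids)
    random.Random(seed).shuffle(ids)   # fixed seed only for reproducibility
    n_train = len(ids) * 8 // 10
    n_val = len(ids) // 10
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    return train, val, test


# 20 recordings per operating condition, as in the dataset description.
train_ids, val_ids, test_ids = split_records(range(20))
```

Overlapping segmentation and the FFT are then applied to each subset separately, so every augmented sample inherits the subset of its source recording.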
Subsequently, a multi-sensor and multi-scale feature extraction and fusion model is constructed, and the corresponding model parameters are initialized. The frequency-domain signals of both current and vibration data, obtained after augmentation and FFT, are fed into the proposed model to extract representative features from the training samples. The multi-scale convolutional features are then weighted through an attention mechanism, where the attention coefficients of each feature vector are adaptively adjusted to emphasize critical fault information.
The group convolution operation is performed, dividing the convolution kernel into multiple groups. Each group of convolution kernels only processes a subset of the input feature map. Here, the convolution kernels are partitioned into 4 groups, each utilizing a kernel size. The features extracted by the two branch networks are then fused.
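The parameter saving from grouping can be estimated with a short calculation. In a one-dimensional convolution with g groups, each output channel connects to only C_in / g input channels, so the weight count is C_out × (C_in / g) × k. The channel counts and kernel size below are hypothetical values for illustration; only the group count of 4 comes from the text.

```python
def conv1d_param_count(in_channels, out_channels, kernel_size, groups=1, bias=True):
    """Number of learnable parameters in a (grouped) 1-D convolution layer."""
    if in_channels % groups or out_channels % groups:
        raise ValueError("channel counts must be divisible by groups")
    weights = out_channels * (in_channels // groups) * kernel_size
    return weights + (out_channels if bias else 0)


# Hypothetical layer: 64 -> 64 channels, kernel size 3, no bias.
standard = conv1d_param_count(64, 64, 3, groups=1, bias=False)
grouped = conv1d_param_count(64, 64, 3, groups=4, bias=False)
```

For these assumed sizes, four groups cut the weight count to one quarter of the standard layer, and the multiply-accumulate count shrinks by the same factor.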
Self-calibrated and dilated convolutions are alternately applied to enhance feature representation and generalization. The resulting features are then passed through pooling, flattening, and fully connected layers, with the Softmax classifier producing the final fault categories.
The validation set is then fed into the trained multi-scale network model to evaluate its performance, and the model parameters are fine-tuned based on the validation results. If the number of training iterations M has reached the preset number N, the process proceeds to the next step; otherwise, training continues.
Finally, the trained model is evaluated on the test set, the performance metrics are computed, and the final results are obtained to complete the process.
4. Experiments
In the experiments, the bearing dataset from the University of Paderborn (PU) was used. The experimental test bench consists of a drive motor, a torque measurement shaft, a rolling bearing test module, a flywheel, and a load motor [
30]. All experimental signals were collected by installing 6203-type rolling bearings with various fault types in the test module. The dataset contains measurements of radial force, current signals, vibration signals, rotational speed, and load torque. Four distinct operating conditions were configured, as summarized in
Table 1. For each condition, 20 groups of data were recorded, with each sample lasting 4 s. The vibration and current signals were sampled at 64 kHz, while the mechanical parameters (rotational speed, load torque, and radial force) were sampled at 4 kHz. Temperature signals were also recorded at 64 kHz. The ambient temperature during testing ranged from 5 to 50 °C.
The dataset provides three types of damage. The damage data is normalized based on the percentage of the damage length to the pitch circle circumference, and the degree of damage is assessed through the normalized parameters to establish a five-level damage grading system, as detailed in
Table 2.
Based on the condition and degree of damage of the bearing, it can be categorized into nine types: NO (Normal Operation), IR(A) (Inner Race fault ≤ 2 mm), IR(B) (2 mm < Inner Race fault ≤ 4.5 mm), IR(C) (Inner Race fault > 4.5 mm), OR(A) (Outer Race fault ≤ 2 mm), OR(B) (2 mm < Outer Race fault ≤ 4.5 mm), (IR + OR)(A) (composite fault ≤ 2 mm), (IR + OR)(B) (2 mm < composite fault ≤ 4.5 mm), and (IR + OR)(C) (composite fault > 4.5 mm). Here, damage extent A represents early/incipient faults, and extent B represents moderate faults. It is worth noting that while the original PU dataset defines five damage levels (A through E), our experimental scope intentionally excludes the severe damage levels D and E. The rationale is twofold: First, from a practical engineering perspective, levels D and E represent catastrophic macroscopic failures (with damage lengths exceeding 13.5 mm). Such extreme degradation generates massive, unmistakable vibration signals that can be easily detected by basic threshold-based monitoring systems, rendering sophisticated deep learning diagnosis unnecessary. Second, the primary contribution of the proposed AMSF-SF-DC method—particularly its multi-sensor fusion and multi-scale feature extraction capabilities—is designed specifically to capture the weak, easily masked signatures characteristic of early (Level A) and moderate (Levels B and C) faults. Focusing the validation on these challenging categories better demonstrates the model’s superiority and practical value for early predictive maintenance. Finally, the specific data used in the experiments of this section are detailed in
Table 3, and the verification is conducted under three operating conditions: P1, P2, and P3.
4.1. Experimental Setup
The experiments were conducted using the PyTorch 2.1.0 deep learning framework, which supports GPU acceleration and dynamic computation graphs. All tests ran on a workstation with an Intel(R) Xeon(R) Silver 4210R CPU @ GHz, 64 GB RAM, and an NVIDIA GeForce RTX 3090 GPU under Windows 10. The code environment was Python .
For model training, the Adam optimizer and binary cross-entropy loss were used. To reduce overfitting, regularization with a weight decay of was applied. The learning rate was set to , with a batch size of 64, a step size of 10, and 50 training epochs. Instead of relying on a computationally exhaustive grid search, these specific hyperparameters were determined through empirical tuning and preliminary trial experiments. Guided by standard practices in recent deep learning-based fault diagnosis literature, we fine-tuned these values to achieve an optimal balance between convergence speed, model stability, and overfitting prevention on the target dataset.
4.2. Evaluation Metrics
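In fault diagnosis, multiple metrics are used to evaluate model performance, including Accuracy, Recall, Precision, Specificity, and F1-Score. Accuracy measures the proportion of correctly classified samples, Recall reflects the proportion of actual positive samples that are correctly identified, and Precision indicates the proportion of true positives among predicted positives. Specificity represents the proportion of actual negative samples that are correctly identified. F1-Score, the harmonic mean of Precision and Recall, provides a balanced assessment of both metrics. The corresponding calculation formulas are given as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Specificity} = \frac{TN}{TN + FP}$$
$$\mathrm{F1\text{-}Score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$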
Here, TP, TN, FP, and FN represent the number of true positives, true negatives, false positives, and false negatives, respectively.
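For a multi-class problem, these counts are usually obtained per class in a one-vs-rest manner from the confusion matrix. The sketch below assumes rows hold true classes and columns hold predicted classes; the function names and the 3-class toy matrix are illustrative and not taken from the experiments.

```python
def accuracy(cm):
    """Overall accuracy: correctly classified samples over all samples."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total


def per_class_metrics(cm):
    """One-vs-rest precision, recall, specificity, and F1 for each class."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    results = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                        # class c predicted as another class
        fp = sum(cm[r][c] for r in range(n)) - tp   # other classes predicted as c
        tn = total - tp - fn - fp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        results.append({"precision": precision, "recall": recall,
                        "specificity": specificity, "f1": f1})
    return results


# Toy confusion matrix (rows: true class, columns: predicted class).
cm = [[5, 0, 0],
      [1, 3, 1],
      [0, 0, 5]]
acc = accuracy(cm)
metrics = per_class_metrics(cm)
```

Averaging the per-class values (macro-averaging) is one common way to report a single figure per metric; whether the reported results use macro or weighted averaging is not stated here.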
4.3. Model Complexity and Parameter-Efficient Analysis
To further validate the parameter-efficient design of the proposed AMSF-SF-DC model, we conduct a comprehensive comparison of model complexity and computational efficiency with several representative deep learning-based fault diagnosis methods. The number of Parameters (Params), FLoating-point OPerations (FLOPs), training time, and inference time are adopted as evaluation metrics.
All models are implemented under the same hardware and software environment to ensure a fair comparison. The experimental results are summarized in
Table 4. As shown, the proposed method achieves a favorable trade-off between diagnostic performance and computational cost. Although incorporating multi-scale fusion and dual-convolution structures inherently increases the computational load (FLOPs) compared to ultra-simple baselines like WDCNN, our model remains highly parameter-efficient with only 0.55 M parameters. This extremely small memory footprint is a significant advantage for deployment on edge devices with limited storage and RAM. Furthermore, while the FLOPs reach 1353.95 M (approx. 1.35 GFLOPs), this computational demand is easily accommodated by modern industrial edge AI platforms. As shown in
Table 4, the inference time for the entire test set is merely 2.2559 s (fractions of a millisecond per sample). This verifies that the proposed AMSF-SF-DC model achieves a highly favorable trade-off between superior diagnostic accuracy and computational cost, fully meeting the strict real-time condition monitoring requirements of modern industrial applications.
4.4. Single-Condition Experimental Results
To verify the effectiveness of AMSF-SF-DC, based on multi-sensor and multi-scale feature fusion, under single operating conditions, the accuracy of fault diagnosis using multi-sensor feature fusion is compared with that using single sensor signals. As shown in
Figure 6, the branch networks use identical parameters when diagnosing faults with single-sensor inputs.
As shown in
Figure 6, the accuracy curves of the training and testing sets for the three testing methods under three operating conditions are presented for 50 epochs. The proposed AMSF-SF-DC model demonstrates higher stability and faster convergence than the models using single-sensor inputs. For the single vibration signal and the fused multi-sensor features, both the training and testing curves stabilize after approximately 10 epochs, whereas the single current signal stabilizes after around 20 epochs. The notable fluctuation near the 10th epoch is attributed to the presence of two different fundamental frequency interference components in the current signal.
To examine the classification performance of each fault type in greater detail, confusion matrices are computed for the three methods (single current signal, single vibration signal, and fused multi-sensor features). The confusion matrix reflects the relationship between the predicted labels and the actual classes, where rows denote the true categories and columns denote the predicted ones, as shown in
Figure 7. The specific experimental settings are provided in
Table 5.
Under the three operating conditions, using the single current signal as input results in poor classification performance, mainly due to the fewer fault features in the current signal. Using the single vibration signal as input yields the second-best classification performance, which is very close to that of the fusion of both signals, mainly because the fault feature information is primarily contained in the vibration signal. The best classification performance is achieved by fusing the features of both signals. As shown in
Figure 7b,e,f,h,i, all methods perform well on NO, IN(A), IN(C), OR(A), and IN + OR, with an accuracy of 1. There is approximately a
classification error in IN(B) and OR(B). Overall, the confusion matrix provides detailed information on classification performance and helps identify model biases toward particular classes. In fault diagnosis, the number of samples is often imbalanced across classes; confusion matrix analysis assesses the model’s performance on each class separately and therefore reveals errors that an aggregate accuracy figure would hide under such imbalance.
The input signals and the fully connected layer outputs are visualized using the t-SNE algorithm. In these plots, the x- and y-axes carry no practical meaning. Different colors and marker shapes distinguish the bearing fault types. According to the label settings used in the experiments, the legend annotations are placed in the upper-right corner, as shown in
Figure 8.
Figure 8a,b show the original current and vibration signals under operating condition P1, where different fault types exhibit overlap. After feature extraction, the t-SNE features of the last fully connected layer—based on current, vibration, and fused inputs—are presented in
Figure 8c–e. The fused multi-sensor features achieve much clearer separation of fault categories than single-sensor inputs, with minimal overlap between clusters.
To evaluate the AMSF-SF-DC model under single operating conditions, its performance was compared with widely used fault diagnosis models, including ECNN [
31], WDCNN [
32], MTAGN [
33], LiConvFormer [
34], MT1DCNN [
35], ResCNN [
36], DWCN [
37], and MSCFormer [
38]. The results for the three operating conditions are shown in
Table 6. Using both current and vibration signals, the proposed model performs multi-scale convolutional feature extraction, attention-based fusion, and alternating self-calibrated and dilated convolutions to refine representations. This structure enhances fault feature discrimination, leading to superior accuracy, precision, recall, and F1-score compared with all baseline models. Two of these baselines, DWCN [
37] and MSCFormer [
38], are recent state-of-the-art fault diagnosis methods. As shown in
Table 6, the proposed AMSF-SF-DC consistently outperforms both of them under all single operating conditions, demonstrating its superior feature representation and generalization capability.
4.5. Variable-Condition Experimental Results
In practice, bearings often face different operating environments. Therefore, to replicate real working conditions and to verify that the proposed model classifies well across them, experiments are conducted under multiple variable operating conditions, which also evaluates the generalization performance of the model. Six experimental groups are conducted under three operating conditions, with the detailed settings provided in
Table 7.
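One reading that is consistent with six groups over three operating conditions is that each group is an ordered pair of distinct conditions, written source → target as in the P1 → P2 group. The short sketch below enumerates such pairs. The label `P3` and the interpretation of the arrow as "train on the first condition, test on the second" are assumptions made for illustration; the actual settings are those of Table 7.

```python
from itertools import permutations

# P1 and P2 appear in the text; "P3" is an assumed name for the third condition.
conditions = ["P1", "P2", "P3"]

# Ordered (source, target) pairs of distinct conditions: 3 * 2 = 6 groups.
tasks = list(permutations(conditions, 2))

# Human-readable labels in the source -> target notation.
labels = [f"{src} -> {dst}" for src, dst in tasks]
```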
Compared with single operating conditions, the fault diagnosis accuracy for single-sensor signals drops significantly, whereas that for multi-sensor signals decreases only slightly. Several factors may explain this. Variable operating conditions can introduce more noise and interference, which degrades the quality of the sensor data and thus the overall performance of the model. In addition, if the model fails to learn how features change between operating conditions during training, it will perform poorly under new conditions. In summary, multi-sensor feature fusion performs well under both single and variable operating conditions, demonstrating good generalization and adaptability.
To assess the classification performance of each fault type under variable operating conditions, the first experiment group (P1 → P2) is taken as an example, and confusion matrices for single-sensor and multi-sensor fusion methods are computed. As shown in
Figure 9, comparison of the three methods indicates that the proposed approach achieves markedly better classification performance than single-sensor models, even under varying operating conditions.
The t-SNE plot of the input current and vibration signals indicates that the features associated with different fault types are largely overlapping and mixed.
Figure 10c–e illustrate the t-SNE visualizations of the output features from the final fully connected layer for the three different methods. By comparison, the multi-sensor feature fusion method yields a much clearer separation of fault categories than single-sensor approaches, and the proposed model maintains strong diagnostic performance and robustness under variable operating conditions.
To evaluate the proposed method under variable operating conditions, comparative experiments were conducted on the same bearing dataset using the same baseline models. The data show that the proposed model achieves higher recall, precision, specificity, and F1-score than all comparison methods, with accuracy results reported in
Table 8.
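For reference, the metrics named above can all be derived from a multi-class confusion matrix by treating each class in turn as the positive class. The following is a minimal sketch under the assumption of macro-averaging (an unweighted mean over classes); the averaging scheme, the function names, and the example matrix are illustrative and are not taken from the experiments.

```python
from typing import Dict, List


def per_class_metrics(cm: List[List[int]]) -> List[Dict[str, float]]:
    """One-vs-rest metrics per class; rows of cm are true, columns predicted."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    results = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                       # class k predicted as another class
        fp = sum(cm[r][k] for r in range(n)) - tp  # other classes predicted as k
        tn = total - tp - fn - fp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        results.append({"precision": precision, "recall": recall,
                        "specificity": specificity, "f1": f1})
    return results


def macro_average(results: List[Dict[str, float]], key: str) -> float:
    """Unweighted mean of one metric over all classes (assumed scheme)."""
    return sum(r[key] for r in results) / len(results)


def accuracy(cm: List[List[int]]) -> float:
    """Fraction of samples on the diagonal."""
    return sum(cm[k][k] for k in range(len(cm))) / sum(sum(row) for row in cm)


# Hypothetical 3-class confusion matrix (illustration only).
cm = [[2, 0, 0], [0, 2, 1], [0, 1, 3]]
metrics = per_class_metrics(cm)
macro_recall = macro_average(metrics, "recall")
acc = accuracy(cm)
```

Because accuracy and the macro-averaged metrics weight classes differently, a model can rank first on one and not on the other for the same task.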
Overall, the model maintains balanced performance and effectively distinguishes different fault categories. Although the proposed AMSF-SF-DC model achieves highly competitive results across most scenarios, its accuracy on the P1 → P2 task (91.81%) is slightly lower than that of the MSCFormer baseline (93.61%). This reflects a deliberate architectural trade-off: unlike Transformer-based models that rely on computationally expensive global self-attention mechanisms, our model maintains a very low parameter count (0.55 M), offering a parameter-efficient solution for edge-computing deployment. Furthermore, to explicitly validate the robustness of the multi-sensor fusion scheme under complex industrial conditions, additional noise interference experiments were conducted. Additive White Gaussian Noise was injected into the test signals at various Signal-to-Noise Ratios (SNRs: 20, 15, 10, 5, and 0 dB). As illustrated in
Figure 11, the proposed model remains robust to noise. When the SNR decreases to the extremely noisy condition of 0 dB, the comprehensive evaluation score increases only moderately, from 0.1361 to 0.1957. The Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) do not deteriorate but instead improve slightly (RMSE decreases from 63.3735 to 59.4595). This suggests that the multi-sensor fusion strategy, coupled with the skip-attention mechanism, filters out destructive interference, and that the injected noise may act as a form of implicit regularization that stabilizes deep feature representations.
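The noise injection step can be sketched as follows. For a target SNR in dB, the noise power is the measured signal power divided by 10^(SNR/10), and zero-mean Gaussian samples with that variance are added to the signal. The test signal (a 50 Hz sine sampled at an assumed 5000 Hz), the seed, and the function names are hypothetical; this is a minimal illustration of the standard procedure, not the exact implementation used in the experiments.

```python
import math
import random
from typing import List


def add_awgn(signal: List[float], snr_db: float, rng: random.Random) -> List[float]:
    """Add zero-mean white Gaussian noise scaled to the target SNR (in dB)."""
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def measured_snr_db(clean: List[float], noisy: List[float]) -> float:
    """Empirical SNR in dB, computed from the clean signal and the residual noise."""
    signal_power = sum(x * x for x in clean) / len(clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(signal_power / noise_power)


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Assumed test signal: 50 Hz sine, 5000 Hz sampling rate, 20000 samples.
clean = [math.sin(2 * math.pi * 50 * t / 5000) for t in range(20000)]

# SNR levels listed in the text.
snr_levels = (20, 15, 10, 5, 0)
measured = {snr: measured_snr_db(clean, add_awgn(clean, snr, rng))
            for snr in snr_levels}
```

At 0 dB the noise power equals the signal power, which is why this level is the most demanding of the five.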
4.6. Ablation Studies
To further verify the effectiveness of the individual modules in the proposed fault diagnosis model, four variants are evaluated under the three operating conditions: MSF-DC (Multi-Sensor Fusion + Dual Convolution), AMSF (Adaptive Multi-Sensor Fusion + CBAM), AMSF-SF (Adaptive Multi-Sensor Fusion + Skip Fusion), and AMSF-DC (Adaptive Multi-Sensor Fusion + Dual Convolution). The experimental results are presented in
Table 9. The analysis indicates that within the AMSF-SF-DC model, the self-calibrated convolution module contributes the most to performance improvement, followed by the dilated convolution module, with the attention mechanism having the smallest relative impact.
Ablation experiments were conducted on MSF-DC, AMSF, AMSF-SF, and AMSF-DC in the six experiments under variable operating conditions. The experimental results are shown in
Table 10. Analysis of the above results indicates that under both single and variable operating conditions, the parameter-efficient modules of the attention mechanism, self-calibrated convolution, and dilated convolution play important roles in the AMSF-SF-DC model. Specifically, as observed in
Table 10, the AMSF-SF variant suffers from performance degradation under severe variable conditions. This instability stems primarily from cross-domain feature interference rather than simple overfitting. Under varying operating conditions, the feature distributions of heterogeneous sensor signals shift drastically. If these signals are fused without deep recalibration, the unaligned features interfere with each other and confuse the classifier. The full AMSF-SF-DC model resolves this by using the self-calibrated convolution as a stabilizer that extracts domain-invariant representations from the fused features, thereby suppressing feature interference and supporting robust cross-domain generalization.