4.1. Experimental Setup
This section evaluates the proposed MBT-XAI framework from three aspects: diagnostic performance, robustness under noise interference, and explanation consistency in the TF domain. Rather than focusing solely on classification accuracy, the experiments are designed to examine: (1) whether the proposed multi-wavelet three-channel TF representation provides more discriminative information than single-wavelet or simple fusion schemes; (2) whether the proposed method maintains stable performance under different noise levels and operating conditions; and (3) whether the discriminative evidence identified by the model can be traced back to physically meaningful TF regions associated with fault-related frequency structures.
The experiments are conducted on two bearing datasets with complementary characteristics: the public CWRU benchmark dataset and the self-collected IMUST industrial dataset.
The CWRU dataset was acquired from the bearing test rig at Case Western Reserve University, as shown in Figure 5. The vibration signals were sampled at 12 kHz. In this study, data under the 0 HP load condition are selected, covering four classes: ball fault (BF), inner race fault (IF), outer race fault (OF), and normal condition (N). The detailed class definitions are summarized in Table 2. This dataset is widely used as a standard benchmark for evaluating bearing fault diagnosis methods.
The IMUST dataset was collected using a bearing fault experimental platform developed at the Key Laboratory of Intelligent Diagnosis and Control of Mechanical Systems, Inner Mongolia University of Science and Technology, as shown in Figure 6. The vibration signals were sampled at 25 kHz, with the accelerometer mounted above the bearing housing, and all data were collected under no-load conditions. Compared with public datasets mainly containing electrical discharge machining (EDM)-induced defects, the IMUST dataset includes real fatigue pitting and crack faults, and further incorporates compound fault conditions, thereby providing a more realistic and challenging industrial diagnosis scenario.
Specifically, the IMUST dataset contains five categories: BF, compound fault (CF), IF, OF, and N, as summarized in Table 2. The fault sizes for IF and OF are both 0.2 mm, while CF represents a combination of rolling element pitting and a 0.2 mm inner race crack. Representative fault samples are illustrated in Figure 7. The rotational speed is 1995 rpm.
For sample construction, the raw vibration signals are segmented into fixed-length samples containing 1024 sampling points. Each sample is transformed into a TF representation using the continuous wavelet transform (CWT) with 128 scales. Three wavelet functions, namely Morlet, Mexican Hat, and Complex Morlet, are employed to generate complementary TF responses. The magnitudes of the wavelet coefficients are used as TF energy maps. For the proposed representation, the three wavelet responses are independently normalized and then arranged as a unified three-channel TF image. In addition, an equal-weight fusion image obtained by averaging the three wavelet responses is constructed as a comparison baseline. All TF images are resized to 256 × 256 pixels before being fed into the diagnostic model.
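The difference between the proposed three-channel input and the equal-weight baseline can be illustrated with a minimal sketch. It assumes that the three wavelet magnitude maps have already been computed by a CWT routine (not shown), that segmentation uses non-overlapping windows, and that "independently normalized" means per-map min-max scaling; none of these details are stated in the text, and all function names are hypothetical.

```python
from typing import List

Map = List[List[float]]  # one TF magnitude map: rows = scales, columns = time


def segment(signal: List[float], length: int = 1024) -> List[List[float]]:
    """Split a raw signal into fixed-length samples.

    Non-overlapping windows are an assumption; trailing points that do
    not fill a whole window are dropped.
    """
    return [signal[i:i + length]
            for i in range(0, len(signal) - length + 1, length)]


def minmax(tf_map: Map) -> Map:
    """Scale one map to [0, 1] (assumed form of the normalization)."""
    flat = [v for row in tf_map for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # guard against a constant map
    return [[(v - lo) / span for v in row] for row in tf_map]


def stack_channels(maps: List[Map]) -> List[Map]:
    """Three-channel input: each wavelet map is normalized on its own
    and kept as a separate channel."""
    return [minmax(m) for m in maps]


def equal_weight_average(maps: List[Map]) -> Map:
    """Baseline: average the normalized maps into one channel
    (normalizing before averaging is an assumption)."""
    norm = [minmax(m) for m in maps]
    rows, cols = len(norm[0]), len(norm[0][0])
    return [[sum(m[r][c] for m in norm) / len(norm) for c in range(cols)]
            for r in range(rows)]
```

In the stacked form, the identity of each wavelet view survives as a channel index, whereas the averaged form discards it before the network sees the input.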
The generated samples are partitioned into training, validation, and test subsets in a stratified manner at a ratio of 7:2:1, so that the class distribution is preserved across all subsets. The same partition protocol is applied to all competing methods to ensure fair comparison. To reduce the influence of random initialization and data shuffling, the random seed is fixed to 42. All major experimental results are reported as mean ± standard deviation over 3 independent runs. In each run, the model is re-initialized and trained from scratch under the same hyperparameter configuration, and the final performance is evaluated on the held-out test subset.
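A stratified 7:2:1 partition with a fixed seed can be sketched as follows. This is an illustrative implementation using only the standard library, not the authors' code; the function name and the rounding rule for per-class counts are assumptions.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_split(labels: Sequence[str],
                     ratios: Tuple[float, float, float] = (0.7, 0.2, 0.1),
                     seed: int = 42) -> Dict[str, List[int]]:
    """Return sample indices for train/val/test, split class by class."""
    rng = random.Random(seed)
    by_class: Dict[str, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    split: Dict[str, List[int]] = {"train": [], "val": [], "test": []}
    for label in sorted(by_class):       # fixed class order for reproducibility
        idxs = by_class[label]
        rng.shuffle(idxs)
        n_train = round(ratios[0] * len(idxs))
        n_val = round(ratios[1] * len(idxs))
        split["train"] += idxs[:n_train]
        split["val"] += idxs[n_train:n_train + n_val]
        split["test"] += idxs[n_train + n_val:]  # remainder goes to test
    return split
```

Because the split is performed within each class, every subset keeps approximately the original class proportions, and reusing the same seed reproduces the same partition for all competing methods.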
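To evaluate robustness under noise contamination, additive white Gaussian noise (AWGN) is injected into the original vibration signals. Let the original signal be denoted by $x(n)$, $n = 1, \dots, N$.
The average signal power is defined as
$$P_s = \frac{1}{N}\sum_{n=1}^{N} x^2(n).$$
For a target signal-to-noise ratio $\mathrm{SNR}_{\mathrm{dB}}$, the corresponding linear-scale SNR is
$$\mathrm{SNR}_{\mathrm{lin}} = 10^{\mathrm{SNR}_{\mathrm{dB}}/10}.$$
The noise power is given by
$$P_n = \frac{P_s}{\mathrm{SNR}_{\mathrm{lin}}}.$$
Zero-mean Gaussian noise $w(n) \sim \mathcal{N}(0, P_n)$ is generated and added to the original signal to obtain the noisy signal
$$y(n) = x(n) + w(n).$$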
In this study, three AWGN levels, namely 0 dB, −2 dB, and −4 dB, are considered to evaluate the degradation behavior of the model under progressively stronger interference. To further approximate non-white disturbances encountered in practical industrial environments, pink noise and brown noise are additionally introduced. These colored-noise signals are generated under the same target SNR levels as the AWGN setting and are added to the original vibration signals using the same power-control principle. In this way, the robustness evaluation covers both white-noise corruption and frequency-dependent colored-noise interference.
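The power-control principle for the AWGN case can be sketched in a few lines: the signal power is measured, the noise power is set from the target SNR, and Gaussian noise with that variance is added. The function names are hypothetical, and colored-noise generation is not covered by this sketch.

```python
import math
import random
from typing import List, Sequence


def signal_power(x: Sequence[float]) -> float:
    """Average power: mean of the squared samples."""
    return sum(v * v for v in x) / len(x)


def add_awgn(x: Sequence[float], snr_db: float, seed: int = 42) -> List[float]:
    """Add zero-mean white Gaussian noise at the requested SNR (in dB)."""
    rng = random.Random(seed)
    snr_linear = 10.0 ** (snr_db / 10.0)
    noise_power = signal_power(x) / snr_linear
    sigma = math.sqrt(noise_power)  # std of zero-mean noise = sqrt(power)
    return [v + rng.gauss(0.0, sigma) for v in x]


def measured_snr_db(clean: Sequence[float], noisy: Sequence[float]) -> float:
    """Empirical SNR of a noisy signal relative to its clean version."""
    noise = [b - a for a, b in zip(clean, noisy)]
    return 10.0 * math.log10(signal_power(clean) / signal_power(noise))
```

At −4 dB the noise power is roughly 2.5 times the signal power, so the three levels considered here all correspond to noise at least as strong as the signal itself.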
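To comprehensively assess the classification performance, Accuracy, Precision, Recall, F1-score, and false positive rate (FPR) are adopted. Under the one-versus-rest setting, true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) are defined in the standard way. The corresponding metrics are computed as
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN},$$
$$\mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN}.$$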
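For reference, the five metrics can be computed from one-versus-rest counts as in the following sketch, which uses the standard definitions. The function names are hypothetical, and returning 0 for an empty denominator is a convention chosen here, not something stated in the text.

```python
from typing import Dict, Sequence


def ovr_counts(y_true: Sequence[str], y_pred: Sequence[str],
               positive: str) -> Dict[str, int]:
    """Count TP/FP/TN/FN, treating `positive` as the class of interest."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for t, p in zip(y_true, y_pred):
        if p == positive:
            counts["TP" if t == positive else "FP"] += 1
        else:
            counts["FN" if t == positive else "TN"] += 1
    return counts


def ovr_metrics(c: Dict[str, int]) -> Dict[str, float]:
    """Accuracy, Precision, Recall, F1-score, and FPR from the counts."""
    def ratio(num: float, den: float) -> float:
        return num / den if den else 0.0  # convention for empty denominators

    precision = ratio(c["TP"], c["TP"] + c["FP"])
    recall = ratio(c["TP"], c["TP"] + c["FN"])
    return {
        "accuracy": ratio(c["TP"] + c["TN"], sum(c.values())),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "fpr": ratio(c["FP"], c["FP"] + c["TN"]),
    }
```

Per-class values obtained this way can then be averaged over classes; which averaging scheme is used in the reported results is not specified in this section.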
The training process is carried out under a unified set of hyperparameter configurations to ensure stable convergence and fair comparison across all experiments. The main training settings of the proposed MBT-XAI model are summarized in Table 3.
4.3. Effectiveness and Stability of Multi-Wavelet RGB Input
First, the impact of different input representation strategies on diagnostic performance is evaluated under low-noise conditions. The comparison methods include three single-wavelet input representations, an equal-weight averaging (EWA) fusion strategy, and the proposed RGB-based multi-wavelet fusion strategy. The experimental results are summarized in Table 4.
Table 4 presents an ablation study on different time–frequency input representation strategies. Among the single-wavelet inputs, Morlet achieves the best performance, indicating its relatively strong capability in capturing oscillatory fault-related patterns. In contrast, Mexican Hat (89.47%) and Complex Morlet (90.10%) yield lower accuracy, suggesting that a single wavelet is insufficient to simultaneously characterize the heterogeneous signal components in bearing faults, including impulsive transients, narrowband resonance, and modulation-related structures.
By comparison, tensor stacking significantly improves performance (96.28%), confirming that preserving the separability of wavelet views is essential for effective feature learning. This improvement demonstrates that multi-view time–frequency representations provide additional discriminative information when their structural independence is maintained, allowing the network to exploit complementary characteristics across different wavelet domains.
Nevertheless, the proposed structured three-channel encoding (implemented in an RGB format) achieves the best performance (98.13%), outperforming tensor stacking by 1.85 percentage points. It should be emphasized that this gain is not due to the RGB representation itself, which is mathematically equivalent to a three-channel tensor. Instead, the improvement arises from the explicit preservation of wavelet-view identity together with the subsequent cross-channel attention (CCA) mechanism. In tensor stacking, inter-channel interactions are implicitly entangled during early feature extraction, which may dilute view-specific diagnostic structures. In contrast, the proposed structured encoding, combined with CCA, enables the model to perform reliability-aware feature fusion by selectively emphasizing more informative wavelet views. This mechanism offers a plausible explanation for the observed advantage over naive tensor stacking.
Table 5 further investigates the impact of patch size on diagnostic performance. For both datasets, the intermediate patch size of 16 × 16 achieves the highest accuracy (98.13% for CWRU and 96.23% for IMUST), while smaller (8 × 8) and larger (32 × 32) patch sizes result in performance degradation. Specifically, reducing the patch size from 16 × 16 to 8 × 8 leads to a decrease of 0.49 percentage points on CWRU and 0.52 percentage points on IMUST. Although smaller patches provide higher spatial resolution, they increase redundancy and sensitivity to local noise, making it more difficult for the Transformer to capture stable global dependencies. Conversely, increasing the patch size to 32 × 32 results in a larger drop (1.72 percentage points on CWRU and 1.85 percentage points on IMUST), indicating that excessive spatial aggregation leads to the loss of localized fault-related features such as impulsive transients and narrowband frequency components.
These results demonstrate that the choice of patch size is not arbitrary, but must balance local detail preservation and global structure modeling. The optimal patch size aligns with the intrinsic scale of time–frequency fault signatures, thereby supporting effective Transformer-based representation learning.
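The trade-off can be made concrete by counting tokens. The sketch below assumes ViT-style non-overlapping square patches on the 256 × 256 input, which is a common choice but is not confirmed by the text; the function name is hypothetical.

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size:
        raise ValueError("patch size must divide the image size")
    return (image_size // patch_size) ** 2


for patch in (8, 16, 32):
    tokens = num_patches(256, patch)
    # Self-attention compares every token with every other token.
    print(f"patch {patch:>2}: {tokens:>4} tokens, {tokens ** 2:>7} attention pairs")
```

Under these assumptions, 8 × 8 patches yield 1024 tokens, 16 × 16 patches yield 256, and 32 × 32 patches yield only 64, so each step in patch size changes the sequence length by a factor of four and the number of pairwise attention interactions by a factor of sixteen.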
Further analysis is conducted to examine whether the multi-wavelet fusion mechanism exhibits stability and interpretability. Statistical and visualization analyses of the CCA module under different noise conditions are presented in Figure 9. It can be observed that the relative weighting relationships among different wavelet channels remain generally consistent during both training and inference, without abrupt fluctuations or random reassignment.
More importantly, the weight distribution is not uniform across channels. One channel consistently receives higher attention weights, while others are assigned lower but non-zero contributions. This behavior indicates that the model does not perform equal weight fusion but instead learns a structured and stable weighting pattern that reflects the relative discriminative reliability of different wavelet views.
From a signal-processing perspective, this phenomenon is physically meaningful. Different wavelets respond differently to noise contamination and fault-induced structures. For example, impulsive interference may distort high-frequency transient responses, while colored noise tends to bias energy distributions in specific frequency bands. The observed stable weighting pattern suggests that the CCA mechanism implicitly suppresses noise-dominated responses while preserving physically informative components associated with fault-related features.
Therefore, the proposed multi-wavelet fusion in MBT-XAI is not a simple feature concatenation process, but a structured and dynamically weighted integration of complementary physical observations. The consistent channel-weight allocation further provides empirical evidence that the model performs reliability-aware feature selection rather than arbitrary feature aggregation.
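The contrast between equal-weight fusion and a non-uniform but non-zero weighting can be illustrated with a generic softmax-weighted combination of three channels. This is not the CCA module itself, whose internal structure is not given in this section; the per-channel scores stand in for whatever the attention mechanism learns, and all names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Map = List[List[float]]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax; outputs are positive and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_fusion(channels: Sequence[Map],
                    scores: Sequence[float]) -> Tuple[List[float], Map]:
    """Fuse channels with softmax weights derived from per-channel scores."""
    weights = softmax(scores)
    rows, cols = len(channels[0]), len(channels[0][0])
    fused = [[sum(w * ch[r][c] for w, ch in zip(weights, channels))
              for c in range(cols)]
             for r in range(rows)]
    return weights, fused
```

With equal scores the weights reduce to 1/3 each, which is exactly equal-weight averaging; unequal scores give one channel a dominant weight while the others keep smaller, strictly positive contributions, matching the qualitative pattern described above.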
Overall, the proposed multi-wavelet structured encoding, together with the CCA-based adaptive fusion mechanism, provides not only improved diagnostic performance but also a stable and interpretable feature integration process. This design establishes a reliable foundation for subsequent TST-based traceback and physics-consistent interpretability analysis.
4.4. Comprehensive Robustness and Comparative Evaluation
To systematically assess the practical applicability of the proposed MBT-XAI framework in complex industrial environments, a comprehensive evaluation is conducted from three complementary perspectives: robustness against noise interference, training convergence stability, and comparative performance against representative deep learning architectures. This evaluation is designed not only to report performance improvements, but also to explain why the proposed multi-wavelet fusion with reliability-aware channel weighting provides superior robustness and generalization capability. Specifically, this section aims to answer two key questions: (1) whether the multi-wavelet fusion strategy coupled with the CCA-based dynamic channel weighting mechanism can preserve stable diagnostic performance under progressively deteriorating signal-to-noise ratios; and (2) whether the proposed framework provides consistent and mechanism-driven advantages over mainstream CNN-, Transformer-, and multi-scale-based models.
To emulate realistic background interference, additive noise scenarios with SNR levels of 0 dB, −2 dB, and −4 dB are constructed.
Figure 10 illustrates the corresponding training and validation accuracy and loss curves of MBT-XAI under these conditions.
It can be observed that, despite increasing noise intensity, the optimization process remains well behaved across all cases. Both accuracy and loss curves exhibit smooth and largely monotonic convergence, with no evidence of overfitting, divergence, or training instability. More importantly, the convergence trajectories remain highly consistent across different noise levels, suggesting that the learned feature representations are relatively insensitive to noise perturbations and are instead governed by stable underlying structural patterns. Higher noise levels induce moderate fluctuations during the early training stages, but the convergence rate and final performance remain largely unaffected. This behavior suggests that the model is able to suppress noise-induced perturbations during representation learning, rather than overfitting to noisy patterns.
Beyond convergence behavior, robustness is further assessed by comparing the diagnostic accuracy of different models under pink-noise interference on the CWRU and IMUST datasets.
Figure 11 presents the cross-model performance comparisons at SNR levels of −2 dB and −4 dB, thereby providing a more direct evaluation of robustness under progressively deteriorated noise conditions.
The results show that the F1-score of MBT-XAI declines in a gradual and near-linear manner as noise intensity increases, with degradation slopes that are significantly smaller than those observed in conventional CNNs, multi-scale convolutional models, and single time–frequency input architectures. Competing methods exhibit significant performance collapse under −4 dB conditions, whereas MBT-XAI maintains relatively stable performance across all noise levels. This behavior suggests that the proposed multi-wavelet representation consistently preserves salient structural information even under severe noise interference.
From a mechanistic perspective, this robustness can be attributed to the synergistic interaction between multi-wavelet feature diversity and the CCA-based dynamic channel weighting module. Different noise types affect wavelet responses in a non-uniform manner. For example, impulsive disturbances tend to distort high-frequency transient components, while colored noise may bias specific frequency bands. The proposed CCA mechanism implicitly performs a reliability-aware selection, in which noise-dominated channels are suppressed and physically informative responses are preserved. As a result, the feature fusion process is not a passive aggregation but an adaptive filtering mechanism guided by signal reliability. This explains why MBT-XAI maintains stable performance under low-SNR conditions, while conventional models that rely on single representations or fixed fusion strategies suffer from significant degradation.
To ensure the fairness and representativeness of the comparative experiments, the selected benchmark models cover several representative paradigms in bearing fault diagnosis. CNN-Transformer represents hybrid architectures that combine convolutional feature extraction with Transformer-based sequence modeling, thereby jointly modeling local structures and long-range dependencies. The wavelet-related model represents methods that explicitly exploit wavelet-domain information for signal enhancement and fault characterization. Ref. [31] WCAResNet incorporates channel attention into a residual convolutional architecture, reflecting attention-enhanced CNN-based diagnosis. Ref. [32] TF-ViT is built upon the Vision Transformer architecture and emphasizes global dependency modeling in time–frequency images, although its representation remains restricted to a single time–frequency view. Ref. [33] CMB-ResNet adopts a multi-branch convolutional design to achieve multi-scale feature fusion and serves as a representative modern CNN-based diagnostic framework [34]. Collectively, these baselines cover convolution-based, attention-enhanced, Transformer-based, wavelet-assisted, and multi-scale learning paradigms, thereby providing a comprehensive and fair basis for evaluating the effectiveness of the proposed MBT-XAI framework. The comparative results are presented in Figure 12.
Under the mildest noise condition considered (SNR = 0 dB), all deep models achieve relatively high accuracy, while MBT-XAI shows a consistent but moderate improvement. However, this gap becomes significantly larger as noise intensity increases. Under −2 dB and −4 dB conditions, conventional CNN and multi-scale architectures exhibit substantial performance degradation. Transformer-based models also suffer from reduced accuracy due to their reliance on single-view TF representations. In contrast, MBT-XAI consistently maintains higher Accuracy, Precision, Recall, and F1-score, while achieving lower FPR.
This performance gap indicates that the superiority of MBT-XAI does not arise from model scale, but from its ability to preserve multi-view physical structures and perform reliability-aware feature integration.
Table 6 compares the representative methods in terms of model complexity and diagnostic performance. The proposed MBT-XAI achieves the highest accuracy of 98.13%, with a parameter count of 10.393 M and FLOPs of 10.678 G. Although MBT-XAI introduces a moderate increase in computational cost, the performance gain is disproportionately larger, indicating a favorable trade-off between accuracy and efficiency.
More importantly, models with comparable or even larger computational cost do not achieve similar performance improvements, suggesting that the advantage of MBT-XAI is rooted in representation effectiveness rather than parameter scaling.
To further analyze the contribution of each core component, ablation experiments are conducted. The results are summarized in Table 7. It can be observed that removing the multi-wavelet input, the CCA module, or the Transformer encoder leads to a significant decrease in diagnostic accuracy.
In particular, the accuracy drops from 98.13% to 80.11% when all key modules are removed. Together with the single-module results, this indicates that each component contributes materially to the overall framework: although removing modules reduces the parameter count and FLOPs, the diagnostic performance deteriorates sharply.
These results demonstrate that the performance improvement of MBT-XAI is not due to increased model complexity but arises from the synergistic interaction between multi-view physical representation and reliability-aware feature fusion.
4.5. Interpretability and Physical-Consistency Verification
High diagnostic accuracy does not necessarily imply engineering-level trustworthiness. For safety-critical rotating machinery systems, diagnostic models are required not only to maintain stable performance under complex noise conditions, but also to explain whether their decision-making basis is consistent with underlying bearing fault mechanisms. To this end, this study systematically validates the interpretability results of MBT-XAI based on the proposed TST mechanism from five complementary perspectives: saliency localization, physical consistency, causal fidelity, temporal traceability, and spectral evidence comparison.
Figure 13 illustrates the time–frequency saliency distributions produced by the TST mechanism under various bearing health conditions. It can be observed that the saliency patterns associated with different fault types exhibit clear distinctions and remain consistent with classical bearing fault mechanisms. For example, salient regions for inner-race faults are predominantly concentrated in high-frequency resonance bands, whereas outer-race faults exhibit stripe-like structures corresponding to periodic impact responses. Under normal operating conditions, the saliency distribution appears relatively diffuse and fails to form a stable or structured pattern.
To mitigate the potential subjectivity introduced by qualitative analysis, theoretical fault characteristic frequency bands are further constructed as ground-truth regions. The theoretical characteristic frequency bands are derived from bearing geometric parameters and rotational speed, with a fixed tolerance window defined around the corresponding center frequencies. Four metrics, including IoU, Coverage, Over-coverage, and Under-coverage, are employed to quantitatively evaluate the explanation results, as summarized in Table 8.
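Deriving the center frequencies from geometry and speed can be sketched with the standard kinematic formulas for rolling-element bearings with a stationary outer race. The bearing geometry is not given in this section, so the values in the usage note are hypothetical, as are the function names and the tolerance width.

```python
import math
from typing import Dict, Tuple


def characteristic_frequencies(rpm: float, n_balls: int, ball_d: float,
                               pitch_d: float,
                               contact_deg: float = 0.0) -> Dict[str, float]:
    """Standard bearing fault frequencies in Hz (outer race stationary)."""
    fr = rpm / 60.0  # shaft rotation frequency
    ratio = (ball_d / pitch_d) * math.cos(math.radians(contact_deg))
    return {
        "FTF": 0.5 * fr * (1 - ratio),                          # cage
        "BPFO": 0.5 * n_balls * fr * (1 - ratio),               # outer race
        "BPFI": 0.5 * n_balls * fr * (1 + ratio),               # inner race
        "BSF": 0.5 * (pitch_d / ball_d) * fr * (1 - ratio ** 2),  # ball spin
    }


def tolerance_band(center_hz: float, tol_hz: float) -> Tuple[float, float]:
    """Fixed-width window around a center frequency."""
    return (center_hz - tol_hz, center_hz + tol_hz)
```

For example, at 1995 rpm the shaft frequency is 33.25 Hz; with a hypothetical bearing of 8 rolling elements, a 10 mm ball diameter, and a 50 mm pitch diameter, the formulas give approximately 106.4 Hz for the outer race and 159.6 Hz for the inner race, around which a tolerance band can then be placed.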
As can be observed from the table, compared with CAM and ViT-based attention methods, MBT-XAI achieves higher IoU and Coverage values, while significantly reducing both over-coverage and under-coverage ratios. These results indicate that the TST mechanism can accurately cover the true fault-related frequency bands, while effectively suppressing redundant activation regions that are irrelevant to fault diagnosis. The quantitative results further demonstrate that the TST mechanism achieves a more favorable balance between localization accuracy and explanation compactness, thereby providing objective evidence for its physical consistency.
From a physical perspective, the IoU metric quantifies the degree of overlap between the model-extracted explanation regions and the theoretical fault-related frequency bands, where a higher value indicates that the model’s decision basis is more strongly concentrated on frequency regions with explicit physical significance. Coverage reflects whether the model is capable of fully covering the theoretical fault frequency bands, thereby avoiding the omission of critical fault-related information. In contrast, excessively high Over-coverage indicates that the model attends to a substantial number of frequency regions unrelated to fault characteristics, whereas elevated Under-coverage suggests that the model fails to sufficiently capture the essential fault-related features. The results presented in Table 8 demonstrate that MBT-XAI effectively suppresses both over-coverage and under-coverage while simultaneously improving IoU and Coverage, indicating that the resulting explanations achieve a more favorable balance between compactness and completeness.
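One plausible way to compute the four metrics on binary masks is sketched below. The exact definitions used for the reported values are not given in this section, so the formulas here are assumptions: IoU as intersection over union, Coverage as the fraction of the theoretical band that is explained, Over-coverage as the fraction of the explanation lying outside the band, and Under-coverage as the fraction of the band that is missed.

```python
from typing import Dict, Sequence


def region_metrics(explained: Sequence[int],
                   truth: Sequence[int]) -> Dict[str, float]:
    """Overlap metrics between a binary explanation mask and a binary
    ground-truth band mask (both flattened to the same length)."""
    inter = sum(1 for e, g in zip(explained, truth) if e and g)
    n_exp = sum(1 for e in explained if e)
    n_gt = sum(1 for g in truth if g)
    union = n_exp + n_gt - inter

    def ratio(num: int, den: int) -> float:
        return num / den if den else 0.0

    return {
        "iou": ratio(inter, union),
        "coverage": ratio(inter, n_gt),
        "over_coverage": ratio(n_exp - inter, n_exp),   # explanation outside band
        "under_coverage": ratio(n_gt - inter, n_gt),    # band left unexplained
    }
```

Under these definitions Coverage and Under-coverage sum to one, while IoU penalizes both kinds of error at once, which is why a compact and complete explanation scores well on all four.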
The consistency of saliency distributions alone is insufficient to establish a causal relationship between the explanation results and the model’s actual decision-making process. Accordingly, deletion and insertion perturbation tests are employed to verify the causal faithfulness of the salient regions identified by the TST mechanism, with the results illustrated in Figure 14. In the deletion test, as highly salient regions are progressively removed, the diagnostic performance of the model deteriorates rapidly, indicating that these regions play a necessary role in the prediction process. In the insertion test, as highly salient regions are gradually restored from a fully masked input, the model confidence rapidly increases and approaches saturation, suggesting that these regions constitute sufficient discriminative evidence. These perturbation-based results indicate that the salient regions identified by the TST mechanism are not only visually plausible, but also genuinely drive the model’s decision-making process in a causal sense.
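The two perturbation tests follow a simple recipe, sketched below on a flattened input. The toy linear scorer stands in for the trained network, and masking with a constant baseline value is an assumption; all names are hypothetical.

```python
from typing import Callable, List, Sequence

Model = Callable[[Sequence[float]], float]


def _ranked(saliency: Sequence[float]) -> List[int]:
    """Indices ordered from most to least salient."""
    return sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)


def deletion_curve(model: Model, x: Sequence[float],
                   saliency: Sequence[float], baseline: float = 0.0) -> List[float]:
    """Model score as the most salient elements are masked one by one."""
    current = list(x)
    scores = [model(current)]
    for i in _ranked(saliency):
        current[i] = baseline
        scores.append(model(current))
    return scores


def insertion_curve(model: Model, x: Sequence[float],
                    saliency: Sequence[float], baseline: float = 0.0) -> List[float]:
    """Model score as the most salient elements are restored one by one,
    starting from a fully masked input."""
    current = [baseline] * len(x)
    scores = [model(current)]
    for i in _ranked(saliency):
        current[i] = x[i]
        scores.append(model(current))
    return scores


def toy_model(x: Sequence[float]) -> float:
    """Stand-in scorer: a fixed weighted sum (not the MBT-XAI network)."""
    weights = (0.6, 0.3, 0.1, 0.0)
    return sum(w * v for w, v in zip(weights, x))
```

A faithful saliency map produces a deletion curve that falls steeply at the start and an insertion curve that rises steeply at the start; a map unrelated to the model's decision would produce curves close to those of a random ordering.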
To further establish a direct connection between the model explanation results and the underlying mechanical behaviors, a spectral–temporal integral mapping (STIM) is introduced based on time–frequency energy representations, which back-projects two-dimensional saliency distributions onto one-dimensional temporal attention curves. By jointly weighting saliency and physical energy, STIM highlights key temporal locations that the model strongly relies on and that exhibit sufficient energy, thereby avoiding the ambiguity introduced by time localization based solely on saliency intensity. Experimental results demonstrate that the high-response temporal positions identified by STIM are highly aligned with actual impact events in the original vibration signals along the time axis, and exhibit periodic characteristics consistent with the bearing defect contact process, as illustrated in Figure 15. These results indicate that MBT-XAI not only explains the key regions attended by the model in the time–frequency domain, but also enables the discriminative evidence to be traced back to time-domain impact events that can be directly verified by engineering practitioners, thereby establishing a complete closed-loop interpretability framework from model decision-making to physical behavior.
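The back-projection idea can be illustrated with a minimal sketch in which the temporal curve at each time step is the sum over frequency of saliency multiplied by energy, normalized to a peak of one. The exact STIM formula is not given in this section, so this product-and-sum form, the normalization, and the function name are all assumptions.

```python
from typing import List, Sequence


def stim_curve(saliency: Sequence[Sequence[float]],
               energy: Sequence[Sequence[float]]) -> List[float]:
    """Project a 2-D saliency map onto the time axis, weighting each
    time-frequency cell by its energy (rows = frequency, cols = time)."""
    n_freq, n_time = len(saliency), len(saliency[0])
    raw = [sum(saliency[f][t] * energy[f][t] for f in range(n_freq))
           for t in range(n_time)]
    peak = max(raw)
    # Normalize to [0, 1] when there is any response at all.
    return [v / peak for v in raw] if peak > 0 else raw
```

Because each cell contributes only when both factors are non-zero, a region that is salient but carries no energy, or energetic but ignored by the model, does not produce a temporal peak.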
To further examine the stability and physical fidelity of the proposed TST mechanism, the back-traced spectral evidence is compared with the energy spectrum of the original signal under different bearing conditions.
Figure 16 illustrates the evidence alignment behaviors under inner-race fault, outer-race fault, and normal operating conditions. Under fault conditions, the back-traced evidence exhibits strong concentration around characteristic frequency bands while suppressing irrelevant background components, indicating that the back-tracing process selectively preserves diagnostically meaningful energy structures rather than replicating the entire spectrum. In contrast, under normal operating conditions, the overall magnitude of the spectral evidence is significantly reduced, indicating that the model does not introduce artificial discriminative patterns in the absence of fault-related excitations.
In addition, the smooth and bounded characteristics of the back-traced evidence across all conditions indicate that the back-tracing mapping does not amplify high-frequency noise or induce spectral distortion. These observations provide additional experimental evidence supporting the stability and measurement consistency of the proposed interpretability mechanism.