Multi-Wavelet Fusion Transformer with Token-to-Spectrum Traceback for Physically Interpretable Bearing Fault Diagnosis

Fan, Hongzhi; Zhang, Chao; Sun, Mingyu; Xu, Kexi; Zhang, Wenyang; Zhang, Ximing

doi:10.3390/vibration9020028


by Hongzhi Fan ^1,2, Chao Zhang ^1,2,3,*, Mingyu Sun ^1,2, Kexi Xu ^1,2, Wenyang Zhang ^1,2 and Ximing Zhang ^1,2

¹ College of Mechanical Engineering, Inner Mongolia University of Science and Technology, Baotou 014010, China

² Inner Mongolia Autonomous Region Key Laboratory of Intelligent Diagnosis and Control of Electromechanical Systems, Baotou 014010, China

³ School of Digital and Intelligent Industry, Inner Mongolia University of Science and Technology, Baotou 014010, China

* Author to whom correspondence should be addressed.

Vibration 2026, 9(2), 28; https://doi.org/10.3390/vibration9020028

Submission received: 2 March 2026 / Revised: 25 March 2026 / Accepted: 31 March 2026 / Published: 15 April 2026


Abstract

Rolling bearing fault diagnosis under complex and noisy operating conditions requires not only high diagnostic accuracy but also interpretability that can be quantitatively verified against physically meaningful excitation structures. However, many existing deep learning approaches rely on a single time–frequency (TF) representation and provide limited, non-verifiable links between model decisions and the original vibration patterns. To address this issue, we propose MBT-XAI, a multi-wavelet TF fusion network with a Token-to-Spectrum Traceback (TST) mechanism for structure-preserving, physics-consistent interpretability. Three complementary wavelets, namely Morlet, Mexican Hat, and Complex Morlet, are used to construct multi-view TF representations, which are encoded into RGB channels and adaptively fused via cross-channel attention within a Transformer backbone. TST maps patch-token attributions back to the TF domain, enabling quantitative evaluation of physics consistency through overlap-based metrics. Experiments on the public CWRU dataset and an industrial IMUST dataset show that MBT-XAI achieves 98.13 ± 0.24% and 96.23 ± 0.31% accuracy at SNR = 0 dB, outperforming the strongest baseline by 2.83% and 2.43%, respectively. Under AWGN contamination, MBT-XAI maintains 95.44 ± 0.38%/93.45 ± 0.47% accuracy on CWRU and 95.80 ± 0.33%/92.91 ± 0.51% accuracy on IMUST at SNR = −2/−4 dB. Under colored-noise contamination, the proposed method also preserves robust performance under pink and brown noise at the same SNR levels. Quantitative interpretability evaluation further indicates high alignment between salient frequency regions and theoretical fault-characteristic bands, with IoU = 80.21 ± 0.86% and Coverage = 91.70 ± 0.63%. In addition, MBT-XAI requires 10.393 M parameters and 10.678 GFLOPs, with an inference latency of 14.7 ms per sample (batch size = 1) on an NVIDIA GeForce RTX 3060 GPU. 
These results suggest that multi-wavelet TF modeling with attention-based fusion and TF-level traceback provides an accurate, robust, and physics-consistent framework for intelligent bearing fault diagnosis.

Keywords:

rolling bearing fault diagnosis; physics-consistent interpretability; multi-wavelet time-frequency analysis; attention mechanism

1. Introduction

In modern industrial production systems, rotating machinery is extensively employed across critical sectors such as energy, transportation, metallurgy, and advanced manufacturing, where its operational stability is directly linked to production safety and equipment reliability. As one of the critical supporting components in rotating machinery, rolling bearings typically operate under high-speed rotation, heavy loads, and complex environmental noise, making them highly susceptible to localized defects such as fatigue pitting, spalling, and cracking. If such incipient faults are not detected in a timely manner, they may progressively evolve into severe structural damage or even catastrophic failure. Consequently, achieving accurate, stable, and reliable bearing fault measurement and diagnosis under complex operating conditions remains a central challenge in machinery condition monitoring and predictive maintenance [1,2].

Traditional bearing fault diagnosis methods primarily rely on manually engineered features in the time domain, frequency domain, or time–frequency domain, combined with classifiers such as support vector machines and random forests. Although these methods have achieved reasonable performance under controlled conditions [3], their effectiveness is highly dependent on expert knowledge and exhibits limited generalization under strong noise, non-stationary signals, and variable operating conditions. In recent years, deep learning techniques have driven a paradigm shift in bearing fault diagnosis from feature engineering-driven approaches toward data-driven representation learning. Convolutional neural networks (CNNs) demonstrate strong capability in extracting time–frequency texture features through local receptive field modeling. In contrast, Transformer architectures leverage self-attention mechanisms to model long-range dependencies and global correlations, offering distinct advantages over convolution-based models [4,5,6,7]. Meanwhile, integrating time–frequency analysis techniques [8,9,10,11], such as the continuous wavelet transform (CWT) and short-time Fourier transform (STFT), to map one-dimensional vibration signals into two-dimensional time–frequency representations for deep learning models has become a prevalent strategy for improving diagnostic performance under complex operating conditions [12,13,14]. Although these methods have achieved significant improvements in diagnostic accuracy, their strengths are primarily reflected in classification performance, while the measurement credibility of the underlying decision rationale remains largely unaddressed.

From an engineering perspective, high classification accuracy alone is insufficient to support practical deployment in safety-critical rotating machinery systems. Diagnostic models must also be able to answer whether their decision rationale genuinely originates from physical excitations and characteristic frequency structures consistent with bearing fault mechanisms. General explainable artificial intelligence (XAI) techniques, such as LRP and Grad-CAM [15,16,17,18], have been widely used to attribute model decisions to input regions [19,20,21]. However, most existing XAI methods focus on visual attribution rather than establishing physically meaningful correspondence with signal characteristics [22,23]. More recently, explainable and interpretable diagnosis models for machinery have been investigated. For example, Wang et al. proposed a globally interpretable CNN that incorporates bearing semantics to align diagnostic evidence with characteristic frequency components, improving the transparency of the decision basis [24]. In addition, Transformer-based models have been introduced for complex industrial diagnosis scenarios; Zhang et al. developed a coupled time–frequency attention Transformer to address long-tail fault diagnosis, highlighting the potential of attention-based global modeling under class imbalance and industrial disturbances [25]. Furthermore, recent surveys have systematically reviewed Transformer-based intelligent fault diagnosis methods for mechanical equipment, indicating rapid progress and broad adoption of this paradigm [26]. Nevertheless, for TF-based Transformer models, establishing a reproducible and structure-preserving connection between patch-token evidence and the original TF measurement structures, together with quantitative verification of physical consistency, remains challenging [27].

This issue is particularly pronounced in Transformer-based Patch–Token representation models. Although self-attention mechanisms are effective in modeling global correlations, the correspondence between high-level semantic decisions and the original time–frequency measurement structures is not explicitly established. As a result, model outputs cannot be readily traced back to specific physical frequency bands or impulsive events. In the absence of a structure-preserving traceback mechanism, interpretation results are highly sensitive to network depth, attention-head distributions, and gradient noise, thereby undermining their credibility in engineering measurement applications [28].

Based on the above analysis, this paper proposes the MBT-XAI diagnostic framework, together with the Token-to-Spectrum Traceback (TST) interpretation mechanism, from the perspectives of measurement credibility and physical consistency. The proposed method treats different continuous wavelets as multi-view physical observations of the same vibration signal and explicitly preserves their separability at the input level through orthogonal channel encoding. Cross-channel attention mechanisms are further employed to adaptively model the diagnostic reliability of each view. On this basis, a discriminative evidence traceback mechanism that is endogenous to the Transformer architecture is constructed to map model decisions back to the original time–frequency domain in a structure-preserving manner. As a result, the proposed framework provides a measurement-oriented diagnostic solution that combines high diagnostic performance with verifiable interpretability for bearing fault diagnosis.

The main contributions of this paper are summarized as follows.

(1): A multi-wavelet time-frequency representation framework is developed to capture complementary fault-related patterns from bearing vibration signals. By combining wavelet responses with different time-frequency localization characteristics, the proposed representation preserves transient, resonant, and modulation-related information in a unified three-channel input space.
(2): A Transformer-based fusion architecture is constructed to model cross-channel dependencies among the multi-wavelet representations. This design enhances the integration of complementary diagnostic evidence and facilitates interpretable analysis of channel-wise contributions during feature fusion.
(3): A TST mechanism is introduced to map model decision evidence from token space back to the original time-frequency domain in a structure-preserving manner. Furthermore, overlap-based metrics, including IoU and Coverage, are employed to quantitatively evaluate the physical consistency between the identified salient regions and theoretical fault-frequency bands.

2. Theoretical Foundations

This section presents the theoretical foundations of the proposed MBT-XAI framework. It includes three components: (1) the continuous wavelet transform (CWT)-based time–frequency (TF) energy representation used to convert raw vibration signals into structured TF measurements; (2) the scaled dot-product self-attention mechanism underlying the Transformer backbone [29,30]; and (3) the formal definition and structural-consistency rationale of the proposed Token-to-Spectrum Traceback (TST) mechanism. To avoid ambiguity between theoretical formulation and implementation details, experimental hyperparameters such as signal segmentation length, CWT settings, network configurations, and threshold selection rules are reported separately in Section 4.

2.1. CWT-Based Time–Frequency Representation

Rolling bearing vibration signals typically exhibit pronounced non-stationary characteristics, with statistical properties that vary significantly over time. The CWT performs joint time–frequency analysis under a multi-resolution framework through scale and translation operations and has been widely applied in rotating machinery fault diagnosis. Given a discrete-time vibration signal x[n], its CWT with respect to a mother wavelet ψ(\cdot) is defined as

W_{ψ}(a, b) = \sum_{n} x[n] \frac{1}{\sqrt{a}} ψ^{*}\left(\frac{n - b}{a}\right)

(1)

where a > 0 denotes the scale parameter, b denotes the translation parameter, and (\cdot)^{*} denotes complex conjugation. The scale parameter controls the analysis resolution in frequency, whereas the translation parameter determines temporal localization. By discretizing a and b, a two-dimensional TF representation of the signal can be obtained.

For a TF coefficient W_{x}(t, f), the corresponding TF energy distribution is defined as

E(t, f) = |W_{x}(t, f)|^{2}

(2)

where E(t, f) represents the local energy density of the signal at time t and frequency f. This definition is consistent with the physical interpretation of impulsive energy and structural resonance responses in mechanical vibration analysis. All subsequent TF feature construction, saliency analysis, and physical traceback procedures are conducted based on this energy representation.

The equivalent frequency associated with scale a is determined by the center-frequency mapping

f = \frac{f_{c}}{a Δt}

(3)

where f_{c} denotes the center frequency of the selected wavelet and Δt is the sampling interval. Based on this mapping, all frequency-domain interpretation and fault-band verification in this study are conducted in a unified equivalent-frequency domain. Therefore, the CWT-based TF energy representation not only provides stable structured inputs for the deep model, but also establishes a physically interpretable reference space for comparing model-highlighted regions with theoretical fault-characteristic frequency bands. A representative CWT TF representation of a bearing vibration signal is shown in Figure 1.
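As a concrete illustration of Equations (1)–(3), the following minimal sketch evaluates the CWT sum directly and converts the coefficients into an energy map. It is not the implementation used in this study: the function names, the Morlet-type wavelet form, and its shape parameter w0 are assumptions introduced for illustration, and the direct double sum is chosen for readability rather than speed.

```python
import cmath
import math


def morlet(t, w0=6.0):
    """Complex Morlet-type mother wavelet; w0 is an assumed shape parameter."""
    return math.pi ** -0.25 * cmath.exp(1j * w0 * t) * math.exp(-0.5 * t * t)


def cwt_energy(x, scales, wavelet=morlet):
    """Evaluate Eq. (1) by direct summation and return the energy map of Eq. (2).

    The result is indexed as energy[scale_index][translation_index].
    """
    energy = []
    for a in scales:
        row = []
        for b in range(len(x)):
            # Eq. (1) without the 1/sqrt(a) factor, which is applied below.
            w = sum(x[n] * wavelet((n - b) / a).conjugate() for n in range(len(x)))
            # |w / sqrt(a)|^2 = |w|^2 / a, i.e. Eq. (2).
            row.append(abs(w) ** 2 / a)
        energy.append(row)
    return energy


def scale_to_frequency(a, fc, dt):
    """Equivalent frequency of scale a according to Eq. (3)."""
    return fc / (a * dt)
```

For this particular wavelet form the center frequency is approximately w0 / (2π) cycles per unit of the wavelet argument, so a scale can be selected for a target frequency f as a = f_{c} / (f Δt). A sinusoid then produces its largest energy in the row whose equivalent frequency matches its own.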

2.2. Transformer Self-Attention Mechanism

The Transformer architecture is built upon the self-attention mechanism, which models pairwise dependencies between arbitrary positions in an input sequence without relying on recurrent or convolutional operations. This property is particularly suitable for TF representations of vibration signals, where fault-related structures may exhibit long-range dependencies across both time and frequency dimensions. Given an input token sequence X, the query, key, and value matrices are obtained by linear projection:

Q = X W_{Q}, K = X W_{K}, V = X W_{V}

(4)

where W_{Q}, W_{K}, W_{V} are trainable projection matrices. With d denoting the dimension of the key vectors, the single-head scaled dot-product attention is formulated as:

Attention (Q, K, V) = Softmax (\frac{Q K^{⊤}}{\sqrt{d}}) V

(5)

However, in multi-view TF analysis, different channels may exhibit different levels of reliability under varying noise conditions and fault states. Standard self-attention does not explicitly distinguish the credibility of heterogeneous TF views and may therefore propagate redundant or noise-contaminated information across tokens. This motivates the channel-aware fusion strategy employed in the proposed MBT-XAI framework, whose purpose is to adaptively emphasize more informative TF views under the current measurement conditions.
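Equations (4) and (5) can be sketched in a few lines. The example below is a plain-Python illustration of single-head scaled dot-product attention, not the backbone used in this study; the helper names and the list-of-lists matrix layout are assumptions made to keep the example dependency-free.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]


def attention(x, wq, wk, wv):
    """Single-head attention: Eq. (4) projections followed by Eq. (5)."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    d = len(q[0])                          # key dimension used for scaling
    kt = [list(col) for col in zip(*k)]    # K transposed
    scores = matmul(q, kt)                 # Q K^T, one row per query token
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v), weights
```

Each row of the returned weight matrix sums to one and describes how strongly a token attends to every other token. These weights describe token interactions only; on their own they are not class-specific.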

2.3. Theoretical Basis of Token-to-Spectrum Traceback

Vision Transformer-based models partition a two-dimensional input into regular patches and map them into a sequence of token embeddings. This tokenization process induces a deterministic correspondence between each token and a local region of the original TF plane, which provides the basis for the proposed Token-to-Spectrum Traceback (TST) mechanism.

Let the TF energy map be denoted by E \in R^{H \times W}. It is partitioned into N = HW / P^{2} non-overlapping patches of size P \times P, and each patch is mapped to a token embedding. Let Ω_{i} \subseteq \{1, \dots, H\} \times \{1, \dots, W\} denote the TF support region corresponding to the i-th patch token, where i = 1, \dots, N. After patch embedding and Transformer encoding, the model outputs a class logit y_{c} for target class c.

To quantify the class-specific contribution of the i-th token, TST defines the raw token relevance as a token activation-gradient attribution score:

r_{i}^{(c)} = \mathrm{ReLU}\left(\sum_{k = 1}^{D} \frac{\partial y_{c}}{\partial z_{i, k}} z_{i, k}\right), \quad i = 1, \dots, N,

(6)

where z_{i, k} denotes the k-th component of the encoded token representation of token i, and \partial y_{c} / \partial z_{i, k} measures the sensitivity of the class score to that component. This formulation combines token activation magnitude with class-specific gradient sensitivity, thereby suppressing token responses that are not positively associated with the target decision.

To remove scale inconsistency across tokens and obtain a normalized evidence distribution, the relevance scores are normalized as

\tilde{r}_{i}^{(c)} = \frac{r_{i}^{(c)}}{\sum_{j = 1}^{N} r_{j}^{(c)} + ε}, \quad \sum_{i = 1}^{N} \tilde{r}_{i}^{(c)} \approx 1,

(7)

where ε > 0 is a small constant introduced for numerical stability. The normalized token evidence vector is then written as

\tilde{r}^{(c)} = [\tilde{r}_{1}^{(c)}, \tilde{r}_{2}^{(c)}, \dots, \tilde{r}_{N}^{(c)}].

(8)

The token-level evidence is mapped back to the original TF plane through a deterministic patch-wise backprojection operator T(\cdot), defined as

M^{(c)}(u, v) = \sum_{i = 1}^{N} \tilde{r}_{i}^{(c)} \, 1_{(u, v) \in Ω_{i}}, \quad M^{(c)} \in R^{H \times W},

(9)

where 1_{(u, v) \in Ω_{i}} is the indicator function that equals 1 when the TF coordinate (u, v) lies within the patch support region Ω_{i}, and 0 otherwise. If a dense visualization at the original image resolution is required, M^{(c)} can be further refined by a bounded interpolation or upsampling operator. In this study, TST is therefore not treated as an inverse reconstruction of latent features, but as a deterministic attribution backprojection from class-specific token evidence to the corresponding TF support regions.

The above definition yields two useful structural properties. First, because each token is projected only onto its associated patch region Ω_{i}, the traceback process preserves patch-level spatial locality. Second, if the optional interpolation operator is bounded, the resulting TF saliency map is stable under bounded perturbations of the token evidence vector. Hence, TST provides a reproducible and structurally well-defined mapping from Transformer token evidence to the TF domain. It should be noted, however, that structural stability alone does not imply physical validity. Whether the traced-back salient regions are consistent with known bearing fault mechanisms must be further assessed using external physical references, as described in Section 2.4.
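The traceback of Equations (6)–(9) can be illustrated with a short sketch. It assumes that the encoded token activations and the gradients of the class logit with respect to them have already been obtained (for example from an automatic-differentiation framework) and are passed in as plain lists. The function names and the row-major ordering of tokens over the patch grid are assumptions of this example, not details confirmed by the text.

```python
def token_relevance(z, grad):
    """Eq. (6): ReLU of the gradient-weighted activation sum for each token."""
    return [max(0.0, sum(g * a for g, a in zip(gi, zi))) for zi, gi in zip(z, grad)]


def normalize(r, eps=1e-8):
    """Eq. (7): divide by the total relevance plus a small stabilizer."""
    total = sum(r) + eps
    return [v / total for v in r]


def backproject(r_norm, height, width, patch):
    """Eq. (9): assign each token's score to every pixel of its patch region.

    Tokens are assumed to be ordered row by row over the patch grid.
    """
    cols = width // patch
    return [
        [r_norm[(u // patch) * cols + (v // patch)] for v in range(width)]
        for u in range(height)
    ]
```

Because the patches do not overlap, exactly one indicator term in Equation (9) is nonzero for each coordinate, so the sum reduces to a lookup of the owning token. Tokens with a negative gradient-weighted sum receive zero relevance and leave their patch region empty in the map.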

2.4. Structural-Consistency Verification via Overlap-Based Metrics

The TST mechanism defined in Section 2.3 produces a class-specific saliency map on the TF plane. To evaluate whether the highlighted regions are consistent with known bearing fault mechanisms, this study introduces a quantitative structural-consistency assessment based on the overlap between model-derived saliency regions and theoretically expected fault-frequency regions.

For each fault category, a theoretical fault-related TF region is constructed according to prior bearing fault knowledge, and the TST-derived saliency map is compared against this reference region. Two complementary overlap-based metrics are adopted. The first is Intersection-over-Union (IoU), which measures the overall spatial agreement between the saliency region and the theoretical fault region. The second is Coverage, which measures the extent to which the expected physical region is captured by the saliency map.

These metrics provide a quantitative supplement to qualitative heatmap visualization and allow the explanation results to be evaluated in a physics-consistent manner. The practical construction of saliency masks, the definition of theoretical fault-band regions, and the detailed computation of IoU and Coverage used in this study are presented in Section 3.5.
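The two overlap metrics can be sketched as follows. IoU follows its standard definition. Coverage is written here as the fraction of the theoretical region that falls inside the saliency mask, which is an assumed reading of the verbal description above. The fixed threshold used for binarization is likewise a placeholder, since the actual threshold selection rule is not given in this section.

```python
def binarize(saliency, threshold):
    """Turn a saliency map into a 0/1 mask using an assumed fixed threshold."""
    return [[1 if v >= threshold else 0 for v in row] for row in saliency]


def iou_and_coverage(sal_mask, ref_mask):
    """Return (IoU, Coverage) for two binary masks of identical shape."""
    inter = union = ref = 0
    for srow, rrow in zip(sal_mask, ref_mask):
        for s, r in zip(srow, rrow):
            inter += s & r    # pixel is salient and inside the reference region
            union += s | r    # pixel belongs to at least one of the two masks
            ref += r          # size of the reference region
    iou = inter / union if union else 0.0
    coverage = inter / ref if ref else 0.0
    return iou, coverage
```

Under these definitions Coverage is never smaller than IoU, because the reference region is contained in the union. A saliency mask that spills far outside the expected bands can therefore still reach high Coverage while its IoU drops, which is why the two metrics are reported together.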

3. The Proposed Method

The proposed MBT-XAI framework is designed to establish a closed-loop pipeline from raw vibration measurements to quantitatively interpretable diagnostic evidence. Beyond improving fault classification accuracy, its primary objective is to provide measurement-level explanation of diagnostic decisions by integrating complementary TF observation views with a structure-preserving traceback mechanism.

The overall workflow of the proposed framework is illustrated in Figure 2. First, the acquired raw vibration signals are segmented, and multiple CWTs are applied to convert one-dimensional temporal measurements into TF energy representations with explicit diagnostic relevance. Second, the resulting multi-wavelet TF maps are organized into a structured three-channel input and fed into a Transformer encoder for global discriminative modeling. During feature fusion, a cross-channel attention (CCA) mechanism is introduced to adaptively characterize the relative discriminative reliability of different wavelet views under the current measurement condition. Finally, after fault classification, the model evidence is traced back to the original TF domain through the proposed TST mechanism, thereby yielding class-specific saliency patterns that can be quantitatively compared with theoretically expected fault-related frequency regions.

Unlike many existing deep learning-based diagnosis frameworks that treat TF maps merely as generic image features, MBT-XAI explicitly regards them as complementary measurement views derived from different TF analysis operators. This design establishes the structural basis for both multi-view discriminative modeling and subsequent physics-consistency verification of the interpretation results.

3.1. Multi-Wavelet Time–Frequency Representation

Rolling bearing fault vibration signals usually contain multiple coexisting components, including defect-induced impulsive transients, resonance-related narrowband responses, and modulation or sideband structures associated with shaft rotation and fault repetition. Because a single wavelet inevitably involves a trade-off between time localization and frequency localization, it is generally difficult for one TF representation alone to characterize all these heterogeneous signal components in a balanced manner. To address this issue, three continuous wavelets with complementary analysis characteristics are employed in this study: Morlet, Mexican Hat, and Complex Morlet.

The three selected wavelets are not chosen arbitrarily. Rather, they are intended to capture different diagnostic signatures commonly observed in defective bearing vibration signals. The Morlet wavelet provides relatively strong frequency localization and is suitable for describing narrowband resonance responses and harmonic structures. The Mexican Hat wavelet has stronger sensitivity to localized impulsive transients and is therefore effective for capturing defect-induced impact characteristics. The Complex Morlet wavelet retains phase-related information and is more suitable for representing modulation phenomena and sideband-like structures. By combining these three views, the proposed framework seeks to preserve complementary fault-relevant TF information that may not be adequately represented by a single wavelet transform alone.

After unified scale mapping, spatial resampling, and normalization, the three wavelet transforms produce three TF energy maps with identical spatial dimensions, denoted by

\{S_{1}, S_{2}, S_{3}\}.

(10)

As illustrated in Figure 3, these TF maps are not directly superimposed at the pixel level. Instead, their view-specific identity is explicitly preserved so as to avoid premature mixing of heterogeneous physical response characteristics.

To further clarify the rationale for selecting these three wavelets, their complementary TF analysis characteristics are summarized in Table 1. The purpose of this design is not to increase the number of input views for its own sake, but to preserve diagnostically meaningful TF responses associated with different fault-related signal structures.

As shown in Table 1, the three wavelets provide complementary analysis capabilities rather than redundant descriptions of the same signal content. This complementarity forms the basis for the subsequent multi-view fusion strategy adopted in MBT-XAI.

3.2. Orthogonal RGB Channel Encoding and Multi-View Preservation

To explicitly preserve the separability of multi-wavelet views at the input layer of the deep model, the three groups of normalized time–frequency energy maps are mapped onto three orthogonal RGB channels, thereby constructing a three-channel tensor

X \in R^{H \times W \times 3}

(11)

where

X(:, :, 1) = \tilde{S}_{\mathrm{Mor}}, \quad X(:, :, 2) = \tilde{S}_{\mathrm{Mex}}, \quad X(:, :, 3) = \tilde{S}_{\mathrm{Cmor}}

(12)

This RGB-style encoding does not introduce any color semantics. Instead, it serves as a structured implementation choice for preserving the one-to-one correspondence between each wavelet view and a dedicated input channel. Compared with arbitrary tensor stacking, this design offers two practical advantages in the present framework. First, it maintains explicit view identity while remaining directly compatible with standard vision-style patch embedding modules. Second, it avoids introducing additional front-end redesign associated with higher-order channel stacking, thereby enabling controlled multi-view fusion under a unified input interface.

Therefore, the role of the three-channel encoding in this study is not to claim a new image representation scheme, but to provide a compact and reproducible container for multi-wavelet TF observations. This structured design also creates the necessary input basis for the subsequent cross-channel attention mechanism, which adaptively models the relative contribution of different wavelet views.
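The channel assignment of Equations (11) and (12) amounts to normalizing each view separately and stacking the results along a third axis. The sketch below assumes per-view min-max normalization, which the text does not specify, and uses hypothetical function names.

```python
def min_max_normalize(s):
    """Rescale one TF energy map to [0, 1]; a constant map becomes all zeros."""
    lo = min(min(row) for row in s)
    hi = max(max(row) for row in s)
    span = hi - lo
    return [[(v - lo) / span if span else 0.0 for v in row] for row in s]


def stack_views(s_mor, s_mex, s_cmor):
    """Build X[u][v][k] with k = 0, 1, 2 for Morlet, Mexican Hat, Complex Morlet."""
    views = [min_max_normalize(s) for s in (s_mor, s_mex, s_cmor)]
    height, width = len(views[0]), len(views[0][0])
    return [[[view[u][v] for view in views] for v in range(width)] for u in range(height)]
```

Normalizing each view on its own keeps a wavelet with systematically larger coefficients from dominating the other two channels, and the fixed channel order preserves the one-to-one correspondence between wavelet and channel.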

3.3. Cross-Channel Attention-Based Multi-View Fusion

Although the three wavelet TF views describe the same vibration signal, their discriminative reliability may differ under different fault states and measurement conditions. For example, one wavelet view may better highlight impulsive fault signatures, whereas another may provide clearer modulation or resonance information. A fixed equal-weight fusion strategy may therefore dilute informative views or propagate noise-contaminated responses.

To address this issue, a cross-channel attention (CCA) mechanism is introduced to adaptively fuse multi-wavelet features. As illustrated in Figure 4, let the intermediate feature representation extracted from the three-channel TF input be denoted by

F \in R^{H' \times W' \times C},

(13)

where C is the number of feature channels. Global channel descriptors are first obtained by spatial pooling,

g_{c} = \frac{1}{H' W'} \sum_{u = 1}^{H'} \sum_{v = 1}^{W'} F_{c}(u, v), \quad c = 1, \dots, C,

(14)

where F_{c} denotes the c-th channel of F. Based on the pooled descriptor vector g, channel attention weights are generated by a lightweight gating function,

α = σ(W_{2} δ(W_{1} g)),

(15)

where W_{1} and W_{2} are trainable weight matrices, δ(\cdot) denotes a nonlinear activation function, and σ(\cdot) is the sigmoid function. The recalibrated feature map is then obtained as

\hat{F}_{c} = α_{c} F_{c}, \quad c = 1, \dots, C.

(16)

The objective of CCA is to estimate the relative importance of different wavelet-derived channels and to strengthen the contribution of more informative views during feature integration. In this way, the proposed framework does not simply concatenate multiple TF inputs, but explicitly models the reliability difference among heterogeneous observation channels. By performing channel-aware fusion before Transformer-based global modeling, the framework can preserve complementary view-specific characteristics while suppressing redundant or weakly discriminative responses. This design provides the feature basis for subsequent class-specific traceback analysis in the fused token space.
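The gating of Equations (14)–(16) can be sketched as follows. The example stores the feature map channel-first for convenience and takes δ to be a ReLU; both are assumptions of this illustration, as are the function name and the absence of bias terms.

```python
import math


def channel_attention(feature, w1, w2):
    """Recalibrate a feature map stored as feature[c][u][v].

    Returns the channel weights alpha and the rescaled feature map.
    """
    # Eq. (14): global average pooling, one descriptor per channel.
    g = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature]
    # Inner part of Eq. (15): hidden = delta(W1 g), with delta assumed to be ReLU.
    hidden = [max(0.0, sum(w * gc for w, gc in zip(row, g))) for row in w1]
    # Outer part of Eq. (15): alpha = sigmoid(W2 hidden).
    alpha = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden)))) for row in w2]
    # Eq. (16): scale every channel by its weight.
    scaled = [[[a * v for v in r] for r in ch] for a, ch in zip(alpha, feature)]
    return alpha, scaled
```

Because the sigmoid acts on each channel independently, the weights lie strictly between 0 and 1 but do not have to sum to one. Several channels can therefore be emphasized at the same time instead of competing for a fixed budget.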

3.4. Token-to-Spectrum Traceback for Class-Specific Interpretation

To bridge the gap between high-level Transformer token representations and physically meaningful TF structures, the proposed MBT-XAI framework incorporates a Token-to-Spectrum Traceback (TST) mechanism. The purpose of TST is not to reconstruct latent features, but to map class-specific discriminative evidence from the encoded token space back to the original TF support domain in a reproducible and structure-preserving manner.

In the present framework, TST operates on the deterministic correspondence between Transformer patch tokens and local TF regions. Based on the formal definition introduced in Section 2.3, let z_{i} \in R^{D} denote the encoded representation of the i-th fused token, and let y_{c} be the logit associated with target class c. The class-specific relevance of token i is computed as

r_{i}^{(c)} = \mathrm{ReLU}\left(\sum_{d = 1}^{D} \frac{\partial y_{c}}{\partial z_{i, d}} z_{i, d}\right),

(17)

where only positive contributions are retained. The token relevance scores are then normalized as

\tilde{r}_{i}^{(c)} = \frac{r_{i}^{(c)}}{\sum_{j = 1}^{N} r_{j}^{(c)} + ε},

(18)

where ε is a small constant for numerical stability.

In this study, the input TF image size is 256 \times 256, and the Vision Transformer adopts a patch size of 16 \times 16. Therefore, the token sequence corresponds to a 16 \times 16 token grid. The normalized token evidence is mapped back to the TF plane according to the patch support of each token, yielding a patch-wise saliency map. To obtain a dense visualization at the original TF resolution, bicubic interpolation is then applied to the patch-wise traceback result, producing the final class-specific saliency map

M^{(c)} \in R^{256 \times 256}.

(19)

This design is particularly suitable for bearing fault diagnosis because discriminative evidence is often concentrated in localized frequency bands, transient energy regions, and modulation-related TF structures. By preserving the intrinsic correspondence between token positions and TF patches, TST allows the highlighted evidence to remain aligned with the original measurement structure.

It should be noted that the proposed TST mechanism is different from conventional CNN-based visualization methods such as CAM and Grad-CAM, as well as from direct attention-map visualization in Transformer models. CNN-based class activation methods rely on convolutional receptive fields and often produce coarse activation maps after repeated downsampling, whereas direct attention visualization reflects token interaction patterns but is not necessarily class-specific. By contrast, TST explicitly uses the target class output as the attribution objective and deterministically projects token-level evidence back to the original TF support domain through the patch structure of the Transformer input. Therefore, TST provides a more structured and class-relevant explanation for Transformer-based fault diagnosis.

3.5. Physics-Consistency Evaluation of TST Saliency

To quantitatively assess whether the salient TF regions identified by TST are consistent with known fault mechanisms, a physics-consistency evaluation strategy is introduced. The key idea is to compare the model-derived saliency region with theoretically expected fault-related frequency regions derived from standard bearing kinematics.

In this study, all experiments are conducted on bearings of type 6205. For a rolling bearing with

n

rolling elements, rolling element diameter

d

, pitch diameter

D

, contact angle

θ

, and shaft rotational frequency

f_{r}

, the characteristic fault frequencies are defined according to classical bearing kinematics as follows:

\begin{matrix} f_{BPFO} = \frac{n}{2} f_{r} \left(1 - \frac{d}{D} \cos θ\right), \end{matrix}

(20)

\begin{matrix} f_{BPFI} = \frac{n}{2} f_{r} \left(1 + \frac{d}{D} \cos θ\right), \end{matrix}

(21)

\begin{matrix} f_{BSF} = \frac{D}{2 d} f_{r} \left(1 - \left(\frac{d}{D} \cos θ\right)^{2}\right), \end{matrix}

(22)

\begin{matrix} f_{FTF} = \frac{1}{2} f_{r} \left(1 - \frac{d}{D} \cos θ\right), \end{matrix}

(23)

where f_{BPFO}, f_{BPFI}, f_{BSF}, and f_{FTF} denote the ball-pass frequency of the outer race, ball-pass frequency of the inner race, ball spin frequency, and fundamental train frequency, respectively.
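As a hedged illustration, Equations (20)–(23) can be evaluated directly. The function name and the geometry in the example call are placeholders chosen for easy arithmetic; they are not the 6205 datasheet values, which are not listed in this section.

```python
import math

def bearing_fault_frequencies(n, d, D, theta_deg, f_r):
    """Return BPFO, BPFI, BSF, and FTF in Hz following Eqs. (20)-(23)."""
    ratio = (d / D) * math.cos(math.radians(theta_deg))
    return {
        "BPFO": 0.5 * n * f_r * (1.0 - ratio),
        "BPFI": 0.5 * n * f_r * (1.0 + ratio),
        "BSF": D / (2.0 * d) * f_r * (1.0 - ratio ** 2),
        "FTF": 0.5 * f_r * (1.0 - ratio),
    }

# Placeholder geometry for illustration only (not the 6205 parameters).
freqs = bearing_fault_frequencies(n=8, d=1.0, D=4.0, theta_deg=0.0, f_r=10.0)
```

A quick sanity check that follows from the formulas is that BPFO + BPFI = n · f_{r}. For the IMUST speed of 1995 rpm, the shaft frequency passed as `f_r` would be 1995/60 = 33.25 Hz.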

For the 6205 bearing used in this study, the geometric parameters are fixed, and the characteristic frequencies are therefore determined primarily by the rotational frequency f_{r}. Under each operating condition, the theoretical fault-related frequency bands are constructed around the corresponding characteristic frequency and, when applicable, its low-order harmonics. For a target characteristic frequency f_{c}, the m-th expected band is defined as

\begin{matrix} B_{m} (f_{c}) = [m f_{c} - Δ f, m f_{c} + Δ f], \end{matrix}

(24)

where m denotes the harmonic order and Δf is the tolerance bandwidth, i.e., the half-width of each band. In this study, the theoretical region is constructed using the fundamental frequency and its first few low-order harmonics that fall within the TF frequency range. Unless otherwise stated, explicit sideband modeling is not separately introduced in the overlap evaluation, and the resulting theoretical mask is therefore intended to represent the primary expected fault-frequency structure.

For each fault category c, the set of expected fault bands is denoted by

\begin{matrix} B^{(c)} = {\{B_{m} (f_{c})\}}_{m = 1}^{M_{c}}, \end{matrix}

(25)

where M_{c} is the number of retained bands for class c. These bands are mapped onto the vertical frequency axis of the TF image according to the frequency range covered by the CWT representation, and then extended along the full time axis to form a binary theoretical fault-region mask, denoted by Ω_{FCF}^{(c)}.
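The mask construction described above can be sketched as follows. The sketch assumes that the center frequency of each TF image row is available as an array `row_freqs` (this depends on the CWT scale-to-frequency mapping, which is not detailed here); `max_order` and `delta_f` are hypothetical parameters, since their values are not stated in this section.

```python
import numpy as np

def fault_band_mask(row_freqs, f_c, delta_f, n_time, max_order=3):
    """Binary mask, True on rows inside [m*f_c - delta_f, m*f_c + delta_f]."""
    row_freqs = np.asarray(row_freqs, dtype=float)
    rows = np.zeros(row_freqs.shape, dtype=bool)
    for m in range(1, max_order + 1):
        centre = m * f_c
        if centre > row_freqs.max():
            break  # this harmonic lies outside the TF frequency range
        rows |= np.abs(row_freqs - centre) <= delta_f
    # Extend the selected frequency rows along the full time axis.
    return np.repeat(rows[:, None], n_time, axis=1)
```

Since the mask is constant along time, it encodes only where in frequency the fault energy is expected, not when the impacts occur.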

Given the normalized saliency map M^{(c)}, the model-identified salient region is obtained by thresholding:

\begin{matrix} Ω_{sal}^{(c)} = \{(t, f) \mid M^{(c)} (t, f) \geq τ\}, \end{matrix}

(26)

where τ = 0.6 in this study.

Based on Ω_{sal}^{(c)} and Ω_{FCF}^{(c)}, two overlap-based metrics are used for quantitative evaluation. The first is Intersection-over-Union (IoU),

\begin{matrix} \mathrm{IoU}^{(c)} = \frac{| Ω_{sal}^{(c)} \cap Ω_{FCF}^{(c)} |}{| Ω_{sal}^{(c)} \cup Ω_{FCF}^{(c)} |}, \end{matrix}

(27)

which measures the overall spatial agreement between the saliency region and the theoretical fault region. The second is Coverage,

\begin{matrix} \mathrm{Coverage}^{(c)} = \frac{| Ω_{sal}^{(c)} \cap Ω_{FCF}^{(c)} |}{| Ω_{FCF}^{(c)} |}, \end{matrix}

(28)

which measures the proportion of the expected physical region captured by the saliency map.
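A minimal sketch of Equations (26)–(28), assuming the saliency map and the theoretical mask are arrays of the same shape; the function name and the zero fallback for empty regions are illustrative choices.

```python
import numpy as np

def overlap_metrics(saliency, fcf_mask, tau=0.6):
    """Return (IoU, Coverage) between the thresholded saliency and the FCF mask."""
    sal_region = np.asarray(saliency, dtype=float) >= tau   # Eq. (26)
    fcf_region = np.asarray(fcf_mask, dtype=bool)
    inter = np.logical_and(sal_region, fcf_region).sum()
    union = np.logical_or(sal_region, fcf_region).sum()
    iou = inter / union if union else 0.0                    # Eq. (27)
    n_fcf = fcf_region.sum()
    coverage = inter / n_fcf if n_fcf else 0.0               # Eq. (28)
    return float(iou), float(coverage)
```

Because the union is never smaller than the theoretical region, Coverage is always at least as large as IoU; a large gap between the two indicates salient area outside the expected bands.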

Higher values of IoU and Coverage indicate better alignment between the model explanation and the theoretically expected fault-frequency structure. These metrics therefore provide a quantitative supplement to qualitative heatmap inspection and allow the interpretability results of MBT-XAI to be assessed in a physics-consistent manner. It should also be noted that the proposed overlap analysis evaluates consistency with expected fault-frequency regions rather than causal correctness in a strict physical sense. Therefore, IoU and Coverage are used here as quantitative indicators of structural agreement between model-highlighted evidence and bearing fault theory.

4. Experiments and Discussion

4.1. Experimental Setup

This section evaluates the proposed MBT-XAI framework from three aspects: diagnostic performance, robustness under noise interference, and explanation consistency in the TF domain. Rather than focusing solely on classification accuracy, the experiments are designed to examine: (1) whether the proposed multi-wavelet three-channel TF representation provides more discriminative information than single-wavelet or simple fusion schemes; (2) whether the proposed method maintains stable performance under different noise levels and operating conditions; and (3) whether the discriminative evidence identified by the model can be traced back to physically meaningful TF regions associated with fault-related frequency structures.

The experiments are conducted on two bearing datasets with complementary characteristics: the public CWRU benchmark dataset and the self-collected IMUST industrial dataset.

The CWRU dataset was acquired from the bearing test rig at Case Western Reserve University, as shown in Figure 5. The vibration signals were sampled at 12 kHz. In this study, data under the 0 HP load condition are selected, including ball fault (BF), inner race fault (IF), outer race fault (OF), and normal condition (N). The detailed class definitions are summarized in Table 2. This dataset is widely used as a standard benchmark for evaluating bearing fault diagnosis methods.

The IMUST dataset was collected using a bearing fault experimental platform developed at the Key Laboratory of Intelligent Diagnosis and Control of Electromechanical Systems, Inner Mongolia University of Science and Technology, as shown in Figure 6. The vibration signals were sampled at 25 kHz, with the accelerometer mounted above the bearing housing, and all data were collected under no-load conditions. Compared with public datasets mainly containing electrical discharge machining (EDM)-induced defects, the IMUST dataset includes real fatigue pitting and crack faults, and further incorporates compound fault conditions, thereby providing a more realistic and challenging industrial diagnosis scenario.

Specifically, the IMUST dataset contains five categories: BF, compound fault (CF), IF, OF, and N, as summarized in Table 2. The fault sizes for IF and OF are both 0.2 mm, while CF represents a combination of rolling element pitting and a 0.2 mm inner race crack. Representative fault samples are illustrated in Figure 7. The rotational speed is 1995 rpm.

For sample construction, the raw vibration signals are segmented into fixed-length samples containing 1024 sampling points. Each sample is transformed into a TF representation using the continuous wavelet transform (CWT) with 128 scales. Three wavelet functions, namely Morlet, Mexican Hat, and Complex Morlet, are employed to generate complementary TF responses. The magnitudes of the wavelet coefficients are used as TF energy maps. For the proposed representation, the three wavelet responses are independently normalized and then arranged as a unified three-channel TF image. In addition, an equal-weight fusion image obtained by averaging the three wavelet responses is constructed as a comparison baseline. All TF images are resized to 256 × 256 pixels before being fed into the diagnostic model.
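The segmentation and channel-stacking steps can be sketched as below. The CWT itself is omitted (one option would be `pywt.cwt` from PyWavelets), and three precomputed magnitude maps are taken as input. Non-overlapping segmentation and min-max normalization are assumptions, since the text specifies neither the overlap nor the normalization type; all function names are illustrative.

```python
import numpy as np

def segment_signal(signal, length=1024):
    """Split a 1-D signal into non-overlapping segments (overlap policy assumed)."""
    signal = np.asarray(signal, dtype=float)
    n_segments = signal.size // length
    return signal[: n_segments * length].reshape(n_segments, length)

def minmax(tf_map, eps=1e-12):
    """Scale one TF magnitude map to [0, 1] independently of the other views."""
    tf_map = np.asarray(tf_map, dtype=float)
    return (tf_map - tf_map.min()) / (tf_map.max() - tf_map.min() + eps)

def stack_views(views):
    """Arrange independently normalized wavelet views as image channels."""
    return np.stack([minmax(v) for v in views], axis=-1)

def equal_weight_average(views):
    """Comparison baseline: plain average of the normalized views."""
    return np.mean([minmax(v) for v in views], axis=0)
```

Independent normalization keeps a high-energy wavelet response from dominating the other two channels, whereas averaging discards which view each value came from.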

The generated samples are partitioned into training, validation, and test subsets in a stratified manner at a ratio of 7:2:1, so that the class distribution is preserved across all subsets. The same partition protocol is applied to all competing methods to ensure fair comparison. To reduce the influence of random initialization and data shuffling, the random seed is fixed to 42. All major experimental results are reported as mean ± standard deviation over 3 independent runs. In each run, the model is re-initialized and trained from scratch under the same hyperparameter configuration, and the final performance is evaluated on the held-out test subset.
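A stratified 7:2:1 partition can be sketched with NumPy alone. The per-class rounding rule and the function name are assumptions; the paper does not describe its implementation.

```python
import numpy as np

def stratified_split(labels, ratios=(0.7, 0.2, 0.1), seed=42):
    """Return train/val/test index arrays that preserve class proportions."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    train, val, test = [], [], []
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        n_train = int(round(ratios[0] * len(idx)))
        n_val = int(round(ratios[1] * len(idx)))
        train.extend(idx[:n_train])
        val.extend(idx[n_train:n_train + n_val])
        test.extend(idx[n_train + n_val:])  # remainder goes to the test subset
    return np.array(train), np.array(val), np.array(test)
```

Splitting within each class, rather than over the pooled samples, is what guarantees that the class distribution is the same in all three subsets.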

To evaluate robustness under noise contamination, additive white Gaussian noise (AWGN) is injected into the original vibration signals. Let the original signal be denoted by

x = [x_{1}, x_{2}, \dots, x_{N}]

(29)

The average signal power is defined as

P_{x} = \frac{1}{N} \sum_{i = 1}^{N} {|x_{i}|}^{2}

(30)

For a target signal-to-noise ratio \mathrm{SNR}_{dB}, the corresponding linear-scale SNR is

\mathrm{SNR} = 10^{\frac{\mathrm{SNR}_{dB}}{10}}

(31)

The noise power is given by

P_{n} = \frac{P_{x}}{\mathrm{SNR}}

(32)

Zero-mean Gaussian noise n \sim \mathcal{N}(0, P_{n}) is generated and added to the original signal to obtain the noisy signal

\tilde{x} = x + n

(33)
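Equations (29)–(33) translate into a few lines of NumPy. This is a sketch under the stated definitions; the function name and the optional `rng` argument are illustrative.

```python
import numpy as np

def add_awgn(x, snr_db, rng=None):
    """Add white Gaussian noise so that the expected SNR equals snr_db."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    p_x = np.mean(np.abs(x) ** 2)            # Eq. (30): average signal power
    snr_linear = 10.0 ** (snr_db / 10.0)     # Eq. (31): dB to linear scale
    p_n = p_x / snr_linear                   # Eq. (32): required noise power
    noise = rng.normal(0.0, np.sqrt(p_n), size=x.shape)  # variance equals P_n
    return x + noise                         # Eq. (33)
```

At 0 dB the noise power equals the signal power; at −4 dB the linear SNR is 10^{−0.4} ≈ 0.398, so the noise carries roughly 2.51 times the signal power.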

In this study, three AWGN levels, namely 0 dB, −2 dB, and −4 dB, are considered to evaluate the degradation behavior of the model under progressively stronger interference. To further approximate non-white disturbances encountered in practical industrial environments, pink noise and brown noise are additionally introduced. These colored-noise signals are generated under the same target SNR levels as the AWGN setting and are added to the original vibration signals using the same power-control principle. In this way, the robustness evaluation covers both white-noise corruption and frequency-dependent colored-noise interference.
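The generation procedure for the colored noise is not specified here. One common approach, shown below purely as an assumption, shapes white noise in the frequency domain so that its power spectral density falls as 1/f^α (α = 1 for pink, α = 2 for brown) and then rescales it to the target noise power. Both function names are hypothetical.

```python
import numpy as np

def colored_noise(n, alpha, rng):
    """Unit-power noise with PSD proportional to 1/f**alpha (spectral shaping)."""
    spectrum = np.fft.rfft(rng.standard_normal(n))
    freqs = np.fft.rfftfreq(n)
    shaping = np.zeros_like(freqs)
    shaping[1:] = freqs[1:] ** (-alpha / 2.0)  # amplitude ~ f^(-alpha/2); DC removed
    shaped = np.fft.irfft(spectrum * shaping, n=n)
    return shaped / np.sqrt(np.mean(shaped ** 2))

def add_colored_noise(x, snr_db, alpha, rng):
    """Add 1/f**alpha noise to a 1-D signal at the target SNR."""
    x = np.asarray(x, dtype=float)
    p_n = np.mean(x ** 2) / 10.0 ** (snr_db / 10.0)  # same power rule as AWGN
    return x + np.sqrt(p_n) * colored_noise(x.size, alpha, rng)
```

Unlike white noise, this noise concentrates its energy at low frequencies, so the same nominal SNR can affect low-frequency fault bands more strongly than high-frequency resonance bands.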

To comprehensively assess the classification performance, Accuracy, Precision, Recall, F1-score, and false positive rate (FPR) are adopted. Under the one-versus-rest setting, true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) are defined in the standard way. The corresponding metrics are computed as

\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}

(34)

\mathrm{Precision} = \frac{TP}{TP + FP}

(35)

\mathrm{Recall} = \frac{TP}{TP + FN}

(36)

\mathrm{FPR} = \frac{FP}{FP + TN}

(37)

F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}

(38)
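The one-versus-rest counts and Equations (34)–(38) can be computed from a confusion matrix as sketched below. The row/column convention and the zero fallback for empty denominators are assumptions, and how the per-class values are averaged across classes is not stated in this section.

```python
def one_vs_rest_metrics(confusion, cls):
    """confusion[i][j] = number of samples of true class i predicted as class j."""
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    tp = confusion[cls][cls]
    fn = sum(confusion[cls]) - tp
    fp = sum(confusion[i][cls] for i in range(k)) - tp
    tn = total - tp - fn - fp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "fpr": fpr,
        "f1": f1,
    }
```

For example, with the two-class matrix `[[5, 1], [2, 4]]` and `cls=0`, the counts are TP = 5, FN = 1, FP = 2, and TN = 4.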

The training process is carried out under a unified set of hyperparameter configurations to ensure stable convergence and fair comparison across all experiments. The main training settings of the proposed MBT-XAI model are summarized in Table 3.

4.2. Experimental Design and Validation Logic

The overall experimental design and validation logic of this study are illustrated in Figure 8, where a series of complementary experiments (A–F) is designed to systematically evaluate the representation effectiveness, noise robustness, and physical fidelity of the proposed MBT-XAI framework. The experimental design is organized into six sub-experiments. First, Experiment A evaluates the effectiveness of the multi-wavelet RGB input construction in enhancing discriminative representations. In Experiment B, different signal-to-noise ratio (SNR) conditions are introduced to analyze the robustness of the model in terms of training dynamics and convergence stability under noise interference. Experiment C evaluates the overall performance of the proposed method under complex operating conditions through benchmark comparisons with multiple mainstream deep learning models. Experiment D quantitatively analyzes the contribution of the core modules to the final performance through ablation studies. To reveal the internal decision-making mechanisms of the model, Experiment E employs channel-attention visualization to analyze the model’s adaptive weighting behavior across different frequency bands. Experiment F combines quantitative metrics with causal perturbation analysis to verify the reliability and physical consistency of the generated interpretation results.

4.3. Effectiveness and Stability of Multi-Wavelet RGB Input

First, the impact of different input representation strategies on diagnostic performance is evaluated under low-noise conditions. The comparison methods include three single-wavelet input representations, an equal-weight averaging (EWA) fusion strategy, and the proposed RGB-based multi-wavelet fusion strategy. The experimental results are summarized in Table 4.

Table 4 presents an ablation study on different time–frequency input representation strategies. Among the single-wavelet inputs, Morlet achieves the best performance, indicating its relatively strong capability in capturing oscillatory fault-related patterns. In contrast, Mexican Hat (89.47%) and Complex Morlet (90.10%) yield lower performance, suggesting that a single wavelet is insufficient to simultaneously characterize the heterogeneous signal components in bearing faults, including impulsive transients, narrowband resonance, and modulation-related structures.

By comparison, tensor stacking significantly improves performance (96.28%), confirming that preserving the separability of wavelet views is essential for effective feature learning. This improvement demonstrates that multi-view time–frequency representations provide additional discriminative information when their structural independence is maintained, allowing the network to exploit complementary characteristics across different wavelet domains.

Nevertheless, the proposed structured three-channel encoding (implemented in an RGB format) achieves the best performance (98.13%), outperforming tensor stacking by 1.85 percentage points. It should be emphasized that this gain is not due to the RGB representation itself, as it is mathematically equivalent to a three-channel tensor. Instead, the performance improvement arises from the explicit preservation of wavelet-view identity together with the subsequent cross-channel attention mechanism. In tensor stacking, inter-channel interactions are implicitly entangled during early feature extraction, which may dilute view-specific diagnostic structures. In contrast, the proposed structured encoding, combined with cross-channel attention (CCA), enables the model to perform reliability-aware feature fusion by selectively emphasizing more informative wavelet views. This mechanism explains the observed performance advantage over naive tensor stacking.

Table 5 further investigates the impact of patch size on diagnostic performance. For both datasets, the intermediate patch size of 16 × 16 achieves the highest accuracy (98.13% for CWRU and 96.23% for IMUST), while smaller (8 × 8) and larger (32 × 32) patch sizes result in performance degradation. Specifically, reducing the patch size from 16 × 16 to 8 × 8 leads to a decrease of 0.49% on CWRU and 0.52% on IMUST. Although smaller patches provide higher spatial resolution, they increase redundancy and sensitivity to local noise, making it more difficult for the Transformer to capture stable global dependencies. Conversely, increasing the patch size to 32 × 32 results in a more significant performance drop (−1.72% on CWRU and −1.85% on IMUST), indicating that excessive spatial aggregation leads to the loss of localized fault-related features such as impulsive transients and narrowband frequency components.

These results demonstrate that the choice of patch size is not arbitrary, but must balance local detail preservation and global structure modeling. The optimal patch size aligns with the intrinsic scale of time–frequency fault signatures, thereby supporting effective Transformer-based representation learning.
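The trade-off can be made concrete by counting tokens. For the 256 × 256 input and a square patch of side P, the number of patch tokens follows directly from the image geometry; the remark on attention cost is a general property of standard self-attention rather than a measurement from this study.

```latex
N(P) = \left(\frac{256}{P}\right)^{2}, \qquad N(8) = 1024, \quad N(16) = 256, \quad N(32) = 64 .
```

Halving the patch side therefore quadruples the token count, and the pairwise attention cost grows with N^{2}, whereas a 32 × 32 patch merges 1024 TF pixels into a single token.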

Further analysis is conducted to examine whether the multi-wavelet fusion mechanism exhibits stability and interpretability. Statistical and visualization analyses of the CCA module under different noise conditions are presented in Figure 9. It can be observed that the relative weighting relationships among different wavelet channels remain generally consistent during both training and inference, without abrupt fluctuations or random reassignment.

More importantly, the weight distribution is not uniform across channels. One channel consistently receives higher attention weights, while others are assigned lower but non-zero contributions. This behavior indicates that the model does not perform equal weight fusion but instead learns a structured and stable weighting pattern that reflects the relative discriminative reliability of different wavelet views.

From a signal-processing perspective, this phenomenon is physically meaningful. Different wavelets respond differently to noise contamination and fault-induced structures. For example, impulsive interference may distort high-frequency transient responses, while colored noise tends to bias energy distributions in specific frequency bands. The observed stable weighting pattern suggests that the CCA mechanism implicitly suppresses noise-dominated responses while preserving physically informative components associated with fault-related features.

Therefore, the proposed multi-wavelet fusion in MBT-XAI is not a simple feature concatenation process, but a structured and dynamically weighted integration of complementary physical observations. The consistent channel-weight allocation further provides empirical evidence that the model performs reliability-aware feature selection rather than arbitrary feature aggregation.

Overall, the proposed multi-wavelet structured encoding, together with the CCA-based adaptive fusion mechanism, provides not only improved diagnostic performance but also a stable and interpretable feature integration process. This design establishes a reliable foundation for subsequent TST-based traceback and physics-consistent interpretability analysis.

4.4. Comprehensive Robustness and Comparative Evaluation

To systematically assess the practical applicability of the proposed MBT-XAI framework in complex industrial environments, a comprehensive evaluation is conducted from three complementary perspectives: robustness against noise interference, training convergence stability, and comparative performance against representative deep learning architectures. This evaluation is designed not only to report performance improvements, but also to explain why the proposed multi-wavelet fusion with reliability-aware channel weighting provides superior robustness and generalization capability. Specifically, this section aims to answer two key questions: (1) whether the multi-wavelet fusion strategy coupled with the CCA-based dynamic channel weighting mechanism can preserve stable diagnostic performance under progressively deteriorating signal-to-noise ratios; and (2) whether the proposed framework provides consistent and mechanism-driven advantages over mainstream CNN-, Transformer-, and multi-scale-based models.

To emulate realistic background interference, additive noise scenarios with SNR levels of 0 dB, −2 dB, and −4 dB are constructed. Figure 10 illustrates the corresponding training and validation accuracy and loss curves of MBT-XAI under these conditions.

It can be observed that, despite increasing noise intensity, the optimization process remains well behaved across all cases. Both accuracy and loss curves exhibit smooth, largely monotonic convergence patterns, with no evidence of overfitting, divergence, or training instability. More importantly, the convergence trajectories remain highly consistent across different noise levels, indicating that the learned feature representations are not sensitive to noise perturbations but are instead governed by stable underlying structural patterns. While higher noise levels induce moderate fluctuations during early training stages, the convergence rate and final performance remain largely unaffected. This behavior suggests that the model is able to suppress noise-induced perturbations during representation learning, rather than overfitting to noisy patterns.

Beyond convergence behavior, robustness is further assessed by comparing the diagnostic accuracy of different models under pink-noise interference on two benchmark datasets, namely CWRU and IMUST. Figure 11 presents the cross-model performance comparisons at SNR levels of −2 dB and −4 dB, thereby providing a more direct evaluation of robustness under progressively deteriorated noise conditions.

The results show that the F1-score of MBT-XAI declines in a gradual and near-linear manner as noise intensity increases, with degradation slopes that are significantly smaller than those observed in conventional CNNs, multi-scale convolutional models, and single time–frequency input architectures. Competing methods exhibit significant performance collapse under −4 dB conditions, whereas MBT-XAI maintains relatively stable performance across all noise levels. This behavior suggests that the proposed multi-wavelet representation consistently preserves salient structural information even under severe noise interference.

From a mechanistic perspective, this robustness can be attributed to the synergistic interaction between multi-wavelet feature diversity and the CCA-based dynamic channel weighting module. Specifically, different noise types affect wavelet responses in a non-uniform manner. For example, impulsive disturbances tend to distort high-frequency transient components, while colored noise may bias specific frequency bands. The proposed CCA mechanism implicitly performs a reliability-aware selection, where noise-dominated channels are suppressed and physically informative responses are preserved. As a result, the feature fusion process is not a passive aggregation but an adaptive filtering mechanism guided by signal reliability. This explains why MBT-XAI maintains stable performance under low-SNR conditions, while conventional models that rely on single representations or fixed fusion strategies suffer from significant degradation.

To ensure a fair and representative comparison, the selected benchmark models cover several representative paradigms in bearing fault diagnosis. CNN-Transformer represents hybrid architectures that combine convolutional feature extraction with Transformer-based sequence modeling, thereby jointly modeling local structures and long-range dependencies. The wavelet-related model represents methods that explicitly exploit wavelet-domain information for signal enhancement and fault characterization. Ref. [31] WCAResNet incorporates channel attention into a residual convolutional architecture, reflecting attention-enhanced CNN-based diagnosis. Ref. [32] TF-ViT is built upon the Vision Transformer architecture and emphasizes global dependency modeling in time–frequency images, although its representation remains restricted to a single time–frequency view. Ref. [33] CMB-ResNet adopts a multi-branch convolutional design to achieve multi-scale feature fusion and serves as a representative modern CNN-based diagnostic framework [34]. Collectively, these baselines cover convolution-based, attention-enhanced, Transformer-based, wavelet-assisted, and multi-scale learning paradigms, thereby providing a comprehensive basis for evaluating the effectiveness of the proposed MBT-XAI framework. The comparative results are presented in Figure 12.

Under the mildest noise condition considered (SNR = 0 dB), all deep models achieve relatively high accuracy, while MBT-XAI shows a consistent but moderate improvement. However, this gap becomes significantly larger as noise intensity increases. Under −2 dB and −4 dB conditions, conventional CNN and multi-scale architectures exhibit substantial performance degradation. Transformer-based models also suffer from reduced accuracy due to their reliance on single-view TF representations. In contrast, MBT-XAI consistently maintains higher Accuracy, Precision, Recall, and F1-score, while achieving lower FPR.

This performance gap indicates that the superiority of MBT-XAI does not arise from model scale, but from its ability to preserve multi-view physical structures and perform reliability-aware feature integration.

Table 6 compares the representative methods in terms of model complexity and diagnostic performance. The proposed MBT-XAI achieves the highest accuracy of 98.13%, with a parameter count of 10.393 M and FLOPs of 10.678 G. Although MBT-XAI introduces a moderate increase in computational cost, the performance gain is disproportionately larger, indicating a favorable trade-off between accuracy and efficiency.

More importantly, models with comparable or even larger computational cost do not achieve similar performance improvements, suggesting that the advantage of MBT-XAI is rooted in representation effectiveness rather than parameter scaling.

To further analyze the contribution of each core component, ablation experiments are conducted.

The results are summarized in Table 7. It can be observed that removing the multi-wavelet input, the CCA module, or the Transformer encoder leads to a significant decrease in diagnostic accuracy.

In particular, the performance drops from 98.13% to 80.11% when all key modules are removed, indicating that each component plays a critical role in the overall framework. Despite the reduction in parameter count and FLOPs, the diagnostic performance deteriorates sharply.

These results demonstrate that the performance improvement of MBT-XAI is not due to increased model complexity but arises from the synergistic interaction between multi-view physical representation and reliability-aware feature fusion.

4.5. Interpretability and Physical-Consistency Verification

High diagnostic accuracy does not necessarily imply engineering-level trustworthiness. For safety-critical rotating machinery systems, diagnostic models are required not only to maintain stable performance under complex noise conditions, but also to explain whether their decision-making basis is consistent with underlying bearing fault mechanisms. To this end, this study systematically validates the interpretability results of MBT-XAI based on the proposed TST mechanism from five complementary perspectives: saliency localization, physical consistency, causal fidelity, temporal traceability, and spectral evidence comparison.

Figure 13 illustrates the time–frequency saliency distributions produced by the TST mechanism under various bearing health conditions. It can be observed that the saliency patterns associated with different fault types exhibit clear distinctions and remain consistent with classical bearing fault mechanisms. For example, salient regions for inner-race faults are predominantly concentrated in high-frequency resonance bands, whereas outer-race faults exhibit stripe-like structures corresponding to periodic impact responses. Under normal operating conditions, the saliency distribution appears relatively diffuse and fails to form a stable or structured pattern.

To mitigate the potential subjectivity introduced by qualitative analysis, theoretical fault characteristic frequency bands are further constructed as ground-truth regions. The theoretical characteristic frequency bands are derived from bearing geometric parameters and rotational speed, with a fixed tolerance window defined around the corresponding center frequencies. Four metrics, namely IoU, Coverage, Over-coverage, and Under-coverage, are employed to quantitatively evaluate the explanation results, as summarized in Table 8.

As can be observed from the table, compared with CAM and ViT-based attention methods, MBT-XAI achieves higher IoU and Coverage values, while significantly reducing both over-coverage and under-coverage ratios. These results indicate that the TST mechanism can accurately cover the true fault-related frequency bands, while effectively suppressing redundant activation regions that are irrelevant to fault diagnosis. The quantitative results further demonstrate that the TST mechanism achieves a more favorable balance between localization accuracy and explanation compactness, thereby providing objective evidence for its physical consistency.

From a physical perspective, the IoU metric quantifies the degree of overlap between the model-extracted explanation regions and the theoretical fault-related frequency bands, where a higher value indicates that the model’s decision basis is more strongly concentrated on frequency regions with explicit physical significance. Coverage reflects whether the model is capable of fully covering the theoretical fault frequency bands, thereby avoiding the omission of critical fault-related information. In contrast, excessively high Over-coverage indicates that the model attends to a substantial number of frequency regions unrelated to fault characteristics, whereas elevated Under-coverage suggests that the model fails to sufficiently capture the essential fault-related features. The results presented in Table 8 demonstrate that MBT-XAI effectively suppresses both over-coverage and under-coverage while simultaneously improving IoU and Coverage, indicating that the resulting explanations achieve a more favorable balance between compactness and completeness.

The consistency of saliency distributions alone is insufficient to establish a causal relationship between the explanation results and the model’s actual decision-making process. Accordingly, deletion and insertion perturbation tests are employed to verify the causal faithfulness of the salient regions identified by the TST mechanism, with the results illustrated in Figure 14. In the deletion test, as highly salient regions are progressively removed, the diagnostic performance of the model deteriorates rapidly, indicating that these regions play a necessary role in the prediction process. In the insertion test, as highly salient regions are gradually restored from a fully masked input, the model confidence rapidly increases and approaches saturation, suggesting that these regions constitute sufficient discriminative evidence. These perturbation-based experimental results indicate that the salient regions identified by the TST mechanism are not only visually plausible, but also genuinely drive the model’s decision-making process in a causal sense.
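The deletion and insertion procedures can be sketched generically as follows. Here `predict` is a hypothetical callable that returns the target-class confidence for an input image, and the fill value, step count, and function name are assumptions rather than the settings used in the paper.

```python
import numpy as np

def perturbation_curve(image, saliency, predict, steps=10, mode="deletion", fill=0.0):
    """Confidence after removing (deletion) or restoring (insertion) salient pixels."""
    image = np.asarray(image, dtype=float)
    h, w = saliency.shape
    order = np.argsort(saliency.ravel())[::-1]        # most salient pixels first
    flat = image.reshape(h * w, -1)                   # works for HxW and HxWxC inputs
    current = flat.copy() if mode == "deletion" else np.full_like(flat, fill)
    scores = [predict(current.reshape(image.shape))]
    for chunk in np.array_split(order, steps):
        if mode == "deletion":
            current[chunk] = fill                     # mask the next most salient pixels
        else:
            current[chunk] = flat[chunk]              # restore them from the original
        scores.append(predict(current.reshape(image.shape)))
    return np.array(scores)
```

A faithful saliency map should give a deletion curve that drops early and an insertion curve that rises early; the area under each curve is a common scalar summary.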

To further establish a direct connection between the model explanation results and the underlying mechanical behaviors, a spectral–temporal integral mapping (STIM) is introduced based on time–frequency energy representations, which back-projects two-dimensional saliency distributions onto one-dimensional temporal attention curves. By jointly weighting saliency and physical energy, STIM highlights key temporal locations that the model strongly relies on and that exhibit sufficient energy, thereby avoiding the ambiguity introduced by time localization based solely on saliency intensity. Experimental results demonstrate that the high-response temporal positions identified by STIM are highly aligned with actual impact events in the original vibration signals along the time axis, and exhibit periodic characteristics consistent with the bearing defect contact process, as illustrated in Figure 15. These results indicate that MBT-XAI not only explains the key regions attended by the model in the time–frequency domain, but also enables the discriminative evidence to be traced back to time-domain impact events that can be directly verified by engineering practitioners, thereby establishing a complete closed-loop interpretability framework from model decision-making to physical behavior.
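The exact STIM formula is not given in this passage. The sketch below is one plausible reading, in which saliency and TF energy are multiplied point-wise and integrated over frequency; the product weighting, the peak normalization, and the function name are all assumptions.

```python
import numpy as np

def temporal_attention_curve(saliency, energy):
    """Collapse (frequency, time) maps into a 1-D time curve weighted by TF energy."""
    weighted = np.asarray(saliency, dtype=float) * np.asarray(energy, dtype=float)
    curve = weighted.sum(axis=0)          # integrate over the frequency axis (rows)
    peak = curve.max()
    return curve / peak if peak > 0 else curve
```

Under this reading, a time instant scores highly only when it is both salient to the model and energetic in the signal, which is the property the text attributes to STIM.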

To further examine the stability and physical fidelity of the proposed TST mechanism, the back-traced spectral evidence S(f) is compared with the energy spectrum P(f) of the original signal under different bearing conditions. Figure 16 illustrates the evidence alignment behaviors under inner-race fault, outer-race fault, and normal operating conditions. Under fault conditions, the back-traced evidence exhibits strong concentration around characteristic frequency bands while suppressing irrelevant background components, indicating that the back-tracing process selectively preserves diagnostically meaningful energy structures rather than replicating the entire spectrum. In contrast, under normal operating conditions, the overall magnitude of the spectral evidence is significantly reduced, indicating that the model does not introduce artificial discriminative patterns in the absence of fault-related excitations.

In addition, the smooth and bounded characteristics of S(f) across all conditions indicate that the back-tracing mapping does not amplify high-frequency noise or induce spectral distortion. These observations provide additional experimental evidence supporting the stability and measurement consistency of the proposed interpretability mechanism.

5. Conclusions

This study addressed a key challenge in rolling bearing fault diagnosis, namely how to improve diagnostic accuracy while enhancing the transparency and quantitative interpretability of deep learning-based decision making. To this end, MBT-XAI was developed by integrating multi-view time–frequency representations and cross-channel discriminative reliability modeling within a unified Transformer-based framework. By combining Morlet, Mexican Hat, and Complex Morlet wavelet responses, the proposed method preserves complementary fault-related structures associated with transient impulses, resonance components, and modulation patterns, thereby improving the discriminability of the input representation.

Experimental results on both the public CWRU benchmark dataset and the industrial IMUST dataset demonstrated that the proposed framework achieved 98.13 ± 0.24% and 96.23 ± 0.31% accuracy, respectively, at SNR = 0 dB and remained robust under AWGN and colored-noise contamination. In addition to classification accuracy, the introduced TST mechanism established a structure-preserving connection between model evidence in the token space and the original time–frequency domain. The traced salient regions showed high agreement with theoretical fault-related frequency bands and impact-related TF structures (IoU = 80.21 ± 0.86%, Coverage = 91.70 ± 0.63%), indicating that the proposed framework can provide quantitatively interpretable diagnostic evidence under the evaluated conditions.

Nevertheless, several limitations should be acknowledged. First, although both AWGN and colored noise were considered, the present robustness evaluation still does not cover all interference patterns encountered in practical industrial environments, such as strong impulsive disturbances, periodic electromagnetic interference, and more severe speed fluctuations. Second, although representative kernel-based, CNN-based, Transformer-based, and multi-scale baselines were included, the current comparison does not exhaust all recent strong hybrid diagnostic architectures. Third, the proposed framework should be regarded as physics-consistent rather than fully physics-constrained, since physical knowledge is mainly incorporated through representation design and quantitative consistency verification, rather than being explicitly enforced in the training objective or structural constraints. Fourth, the current study focused on fixed-condition bearing diagnosis using two datasets, and the generalization ability of the method to variable-speed conditions, heterogeneous multi-source sensing scenarios, and broader classes of rotating machinery still requires further validation. Future work will therefore focus on extending the framework to more complex industrial conditions, incorporating stronger prior-knowledge constraints, and validating the proposed method on more diverse datasets and operating environments.

Author Contributions

Conceptualization, H.F. and C.Z.; methodology, H.F.; software, H.F.; validation, H.F., K.X. and M.S.; formal analysis, H.F.; investigation, H.F.; resources, C.Z.; data curation, H.F. and W.Z.; writing—original draft preparation, H.F.; writing—review and editing, H.F. and C.Z.; visualization, H.F.; supervision, C.Z.; project administration, C.Z.; funding acquisition, C.Z. and X.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by the National Natural Science Foundation of China (Project No. 52365014), the Inner Mongolia Natural Science Foundation (Project No. 2025QN05040), the Young Science and Technology Talents Support Program for Doctoral Students of Inner Mongolia Association for Science and Technology (Project No. QTBS2520), and the Major Special Project of Inner Mongolia Autonomous Region (KCX2024010).

Data Availability Statement

The data generated and/or analyzed during the current study are not publicly available for legal/ethical reasons but are available from the corresponding author on reasonable request.

Acknowledgments

The authors would like to thank the editor and reviewers for their valuable comments and constructive suggestions.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. Representative time–frequency energy map of a bearing vibration signal obtained by CWT.

Figure 2. Overall workflow of the proposed MBT-XAI framework.

Figure 3. Construction of the multi-wavelet time–frequency representation and structured three-channel fusion.

Figure 4. Schematic illustration of the cross-channel attention mechanism.

Figure 5. Bearing fault test rig of the CWRU dataset.

Figure 6. Bearing fault test rig of the IMUST dataset.

Figure 7. Representative fault samples of the IMUST dataset.

Figure 8. Experimental design.

Figure 9. Variation of CCA channel weights under different noise conditions.

Figure 10. Training and validation curves of the proposed model under noise: (a–c) accuracy at SNR = 0 dB, −2 dB, and −4 dB, respectively; (d–f) the corresponding loss curves.

Figure 11. Robustness comparison of different diagnostic models under pink-noise interference on the CWRU and IMUST datasets in terms of accuracy. (a) CWRU dataset at SNR = −2 dB; (b) CWRU dataset at SNR = −4 dB; (c) IMUST dataset at SNR = −2 dB; (d) IMUST dataset at SNR = −4 dB.

Figure 12. Accuracy comparison of different diagnostic models under multiple noise conditions.

Figure 13. Time–frequency saliency maps generated by the TST mechanism under different bearing conditions: (a) inner-race fault, showing concentrated high-frequency bands; (b) outer-race fault, exhibiting periodic stripe-like structures; (c) rolling-element fault, with distinct resonance patterns; (d) normal condition, presenting diffuse and non-structured distributions.

Figure 14. Deletion and insertion curves for evaluating the causal faithfulness of the TST saliency maps.

Figure 15. STFT-based temporal traceback results under different bearing conditions: (a) inner-race fault; (b) outer-race fault; (c) rolling-element fault; (d) normal condition.

Figure 16. Comparison between back-traced spectral evidence and the original signal spectrum under different bearing conditions: (a) inner-race fault; (b) outer-race fault; (c) rolling-element fault.

Table 1. Complementary characteristics of the three selected wavelets for bearing fault TF representation.

Wavelet	Time Localization	Frequency Localization	Phase Sensitivity	Diagnostic Strength in Bearing Signals
Morlet	Moderate	Strong	Limited	Suitable for narrowband resonance responses, harmonic components, and relatively stable fault-frequency structures
Mexican Hat	Strong	Moderate	No explicit phase retention	Sensitive to transient impacts and localized impulsive energy caused by defect contacts
Complex Morlet	Moderate	Strong	Yes	Effective for modulation-related structures, sideband-like patterns, and phase-sensitive TF responses
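To make the three-view construction in Table 1 concrete, the following sketch computes a slow, direct CWT with each wavelet and stacks the normalised magnitudes as three image channels. It is a minimal illustration under stated assumptions, not the authors' pipeline: the wavelet parameters (`w0 = 5`), the scale range, the channel order (Morlet, Mexican Hat, Complex Morlet), the min-max normalisation, and the synthetic impulse train are all hypothetical, and the resizing to the 256 × 256 input is omitted.

```python
import numpy as np

def morlet(t, w0=5.0):
    """Real Morlet wavelet: cosine carrier under a Gaussian envelope."""
    return np.cos(w0 * t) * np.exp(-t**2 / 2.0)

def mexican_hat(t):
    """Real Mexican Hat (Ricker) wavelet, unnormalised."""
    return (1.0 - t**2) * np.exp(-t**2 / 2.0)

def complex_morlet(t, w0=5.0):
    """Complex Morlet wavelet: complex carrier, so phase is retained."""
    return np.exp(1j * w0 * t) * np.exp(-t**2 / 2.0)

def cwt_magnitude(signal, scales, wavelet, half_width=5.0):
    """Direct CWT, one convolution per scale; returns |coefficients|."""
    rows = []
    for a in scales:
        n = int(np.ceil(half_width * a))
        t = np.arange(-n, n + 1) / a          # wavelet argument t / a
        kernel = wavelet(t) / np.sqrt(a)      # 1/sqrt(a) energy normalisation
        # Each wavelet here satisfies psi(-t) = conj(psi(t)), so convolving with
        # psi equals correlating with conj(psi), as the CWT definition requires.
        rows.append(np.abs(np.convolve(signal, kernel, mode="same")))
    return np.vstack(rows)                    # shape (n_scales, n_samples)

def to_unit_range(x, eps=1e-12):
    """Min-max scale an array into [0, 1]."""
    return (x - x.min()) / (x.max() - x.min() + eps)

def multi_wavelet_rgb(signal, scales):
    """Stack the three wavelet views as the last (channel) axis."""
    views = [cwt_magnitude(signal, scales, w)
             for w in (morlet, mexican_hat, complex_morlet)]
    return np.stack([to_unit_range(v) for v in views], axis=-1)

# Synthetic impulse train: decaying 3 kHz bursts every 200 samples.
fs = 12_000
t = np.arange(1024) / fs
signal = np.zeros_like(t)
for start in range(100, 1024, 200):
    tau = t[start:] - t[start]
    signal[start:] += np.exp(-800.0 * tau) * np.sin(2 * np.pi * 3000.0 * tau)

scales = np.arange(1, 17)
rgb = multi_wavelet_rgb(signal, scales)
print(rgb.shape)
```

Keeping the three views in separate channels, rather than averaging them, leaves the later fusion stage free to weight each wavelet differently.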

Table 2. Fault category definitions of the CWRU and IMUST datasets.

CWRU Dataset		IMUST Dataset
Label	Fault Type	Label	Fault Type
C1	BF (minor)	L1	BF (1995 rpm)
C2	IF (minor)	L2	CF (1995 rpm)
C3	OF (minor)	L3	IF (1995 rpm)
C4	N	L4	OF (1995 rpm)
		L5	N

Table 3. Key training hyperparameters of the proposed model.

Hyperparameter	Setting
Input size	256 × 256
Batch size	32
Number of epochs	150
Optimizer	AdamW
Learning rate	3 × 10⁻⁵
Weight decay	1 × 10⁻⁴
Learning rate scheduler	ReduceLROnPlateau
Loss function	Cross-entropy + Focal loss
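The combined loss listed in Table 3 can be sketched as follows. The table does not state how the two terms are weighted or which focusing parameter is used, so the additive form with `lam = 1.0` and `gamma = 2.0` below is an assumption, and the function name `ce_plus_focal` is hypothetical.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with max-subtraction for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def ce_plus_focal(logits, labels, gamma=2.0, lam=1.0):
    """Mean of cross-entropy plus lam times focal loss (assumed combination)."""
    probs = softmax(np.asarray(logits, dtype=float))
    labels = np.asarray(labels, dtype=int)
    p_true = probs[np.arange(len(labels)), labels]  # probability of the true class
    ce = -np.log(p_true)                            # cross-entropy per sample
    focal = (1.0 - p_true) ** gamma * ce            # down-weights easy samples
    return float(np.mean(ce + lam * focal))

# Four classes, as in the CWRU label set of Table 2.
uniform_logits = np.zeros((2, 4))                   # uninformative prediction
confident_logits = np.array([[8.0, 0.0, 0.0, 0.0],
                             [0.0, 8.0, 0.0, 0.0]])
labels = np.array([0, 1])

loss_uniform = ce_plus_focal(uniform_logits, labels)
loss_confident = ce_plus_focal(confident_logits, labels)
print(loss_uniform, loss_confident)
```

The focal term vanishes for confidently correct samples, so the combined loss behaves like plain cross-entropy on easy samples while emphasising hard ones.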

Table 4. Diagnostic performance of different time–frequency input representation strategies.

Input Strategy	Description	Accuracy (%)	F1-Score (%)
Morlet	Single-wavelet CWT image	94.61	93.77
Mexican Hat (Mexh)	Single-wavelet CWT image	89.47	88.70
Complex Morlet (cmorl)	Single-wavelet CWT image	90.10	89.50
EWA fusion	Element-wise weighted averaging of three wavelet maps	93.20	92.83
Tensor stacking	Direct channel-wise stacking of three wavelet maps	96.28	95.14
RGB fusion	Three wavelet maps encoded as RGB channels	98.13	96.07

Table 5. Ablation study on patch size.

Dataset	Patch Size	Accuracy (%)	F1-Score (%)
CWRU	8 × 8	97.64	95.73
	16 × 16	98.13	96.07
	32 × 32	96.41	94.37
IMUST	8 × 8	95.71	94.88
	16 × 16	96.23	95.46
	32 × 32	94.38	93.51

Table 6. Comparison of model complexity and diagnostic performance among representative methods.

Method	Parameter Count (M)	FLOPs (G)	Accuracy (%)
CNN-Transformer	8.742	8.964	96.84
Wavelet-related	9.186	9.537	97.21
WCAResNet	7.427	9.427	94.27
TF-ViT	7.864	9.864	93.64
CMB-ResNet	8.953	8.953	89.53
MBT-XAI	10.393	10.678	98.13

Table 7. Ablation results: impact of removing key modules on model performance.

Configuration	Parameter Count (M)	FLOPs (G)	Accuracy (%)
Baseline	10.393	10.678	98.13
Group I	9.683	10.113	93.36
Group II	9.290	9.649	89.58
Group III	8.181	8.748	84.11
Group IV	7.580	7.635	80.11

Table 8. Quantitative physics-consistency indicators (IoU, Coverage, Over Coverage, and Under Coverage) of different interpretation methods.

Method	IoU (%)	Coverage (%)	Over Coverage (%)	Under Coverage (%)
CAM ResNet	74.50	83.50	8.3	5.6
ViT Attention	76.30	88.40	7.5	4.1
MBT-XAI	80.21	91.70	5.6	2.7
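For orientation, the two headline overlap metrics in Table 8 can be sketched on binary masks. The definitions used here are the common ones and are assumptions with respect to this paper: IoU = |S ∩ B| / |S ∪ B| and Coverage = |S ∩ B| / |B|, where S is the thresholded salient region and B the theoretical fault band. The Over Coverage and Under Coverage columns are not reproduced because their definitions are not given in this part of the text. The masks below are toy data.

```python
import numpy as np

def overlap_metrics(salient_mask, band_mask):
    """Return (IoU, Coverage) for two boolean masks of equal shape."""
    s = np.asarray(salient_mask, dtype=bool)
    b = np.asarray(band_mask, dtype=bool)
    if s.shape != b.shape:
        raise ValueError("masks must have the same shape")
    inter = np.logical_and(s, b).sum()
    union = np.logical_or(s, b).sum()
    iou = inter / union if union else 0.0            # assumed: |S∩B| / |S∪B|
    coverage = inter / b.sum() if b.sum() else 0.0   # assumed: |S∩B| / |B|
    return float(iou), float(coverage)

# Toy 1-D example over 100 frequency bins.
bins = np.arange(100)
band_mask = (bins >= 40) & (bins < 60)       # theoretical band: bins 40..59
salient_mask = (bins >= 42) & (bins < 64)    # salient region:   bins 42..63

iou, coverage = overlap_metrics(salient_mask, band_mask)
print(iou, coverage)   # 18/24 and 18/20
```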
