A Long-Tail Fault Diagnosis Method Based on a Coupled Time–Frequency Attention Transformer

Zhang, Li; Zhang, Ying; Luo, Hao; Ren, Tongli; Li, Hongsheng

doi:10.3390/act14050255

by Li Zhang, Ying Zhang, Hao Luo *, Tongli Ren and Hongsheng Li

College of Information, Liaoning University, Shenyang 110036, China

* Author to whom correspondence should be addressed.

Actuators 2025, 14(5), 255; https://doi.org/10.3390/act14050255

Submission received: 9 April 2025 / Revised: 16 May 2025 / Accepted: 19 May 2025 / Published: 20 May 2025

(This article belongs to the Section Actuators for Manufacturing Systems)


Abstract

Bearings are essential rotational components that enable mechanical equipment to operate effectively. In real-world industrial environments, bearings are subjected to high temperatures and loads, making failure prediction and health management critical for ensuring stable equipment operations and safeguarding both personnel and property. To address long-tail defect identification, we propose a coupled time–frequency attention model that accounts for the long-tail distribution and pervasive noise present in production environments. The model efficiently learns amplitude and phase information by first converting the time-domain signal into the frequency domain with the Fast Fourier Transform (FFT) and then processing the data using a real–imaginary attention mechanism. To capture dependencies in long sequences, a multi-head self-attention mechanism is then implemented in the time domain. Furthermore, the model’s ability to fully learn features is enhanced through the linear coupling of time–frequency domain attention, which effectively mitigates noise interference and corrects imbalances in data distribution. The performance of the proposed model is compared with that of advanced models under the conditions of imbalanced label distribution, cross-load, and noise interference, proving its superiority. The model is evaluated using the Case Western Reserve University (CWRU) and laboratory bearing datasets.

Keywords:

bearing fault diagnosis; long-tail distribution; coupling time–frequency attention; real–imaginary attention

1. Introduction

Bearings are essential components of the rotating mechanisms in various mechanical systems [1]. However, bearings are susceptible to malfunctions due to harsh operating conditions and prolonged use, which can compromise equipment performance and pose significant safety risks [2]. Given their critical role in machinery, it is vital to promptly identify and resolve any potential bearing issues. Consequently, bearing fault diagnosis technology has emerged as a key area of focus, ensuring the continuous and safe operation of machinery [3,4].

With the rapid advancement of intelligent fault diagnosis technology, machine learning methods have been widely adopted to enhance diagnostic performance. For instance, Zhao et al. [5] proposed a rolling bearing fault diagnosis method that combines convolutional neural networks (CNN) with principal component analysis (PCA) for efficient fractal feature extraction. Wu et al. [6] introduced an intelligent fault classification and diagnosis model for rolling bearings based on Fast Fourier Transform (FFT) integrated with a time convolutional network (SE-TCN), utilizing a support vector machine (SVM) as the classifier, with parameters optimized by the Particle Swarm Optimization (PSO) algorithm. Aburakhia et al. [7] developed a random forest (RF) algorithm optimized using Bayesian methods to classify operational states under varying motor speeds. Although these approaches demonstrate the potential of advanced machine learning techniques, traditional machine learning methods still encounter limitations due to their shallow architecture. Specifically, such architectures struggle to extract meaningful features from raw, high-dimensional data, making diagnostic performance heavily reliant on expert knowledge and the quality of handcrafted features [8].

The capacity of deep learning to extract high-level abstract properties from unprocessed data is one of its primary advantages. The current mainstream deep learning models include convolutional neural networks [9], recurrent neural networks (RNNs) [10], generative adversarial networks (GANs) [11], and transformer [12]. Zhang et al. [13] proposed a new lightweight CNN fault diagnosis framework called the Dilated Perception Coupled Convolutional Neural Network (DPCCNN). This framework enhances the receptive field of small convolutional kernels through dilated convolution, while the self-attention mechanism compensates for limited interactions among deep convolutional layers, thereby reducing the number of model parameters and computational complexity. Wang et al. [14] enhanced the feature extraction process of CNN by applying the principles of Variational Bayesian Inference, resulting in the development of the Variational Bayesian Inference Convolutional Neural Network (VBICNN) to obtain preliminary diagnostic results for single-channel signals. Additionally, to address the redundancy of information in multi-channel signals, a voting strategy was implemented to fuse the preliminary diagnostic results from the single-channel model, thereby yielding the final outcome. Mansouri et al. [15] introduced a dimensionality-reduced RNN to overcome the limitations of traditional RNN techniques in managing system uncertainties. Niu et al. [16] employed a fault diagnosis technique specifically designed for bearings with limited sample sizes, utilizing Variational Mode Decomposition (VMD) and Symmetric Dot Pattern (SDP), in conjunction with a pre-trained and subsequently fine-tuned Residual Network-18. Zhou et al. [17] designed a novel generator and discriminator for GAN, utilizing a global optimization approach to generate more discriminative fault samples. 
The generator synthesizes fault features from small datasets through an autoencoder (AE), with training guided by feature reconstruction and diagnostic errors. The discriminator filter screens the generated samples to minimize misclassification. A rolling bearing fault diagnosis method based on Feature Mapping Reconstruction GAN (FMRGAN) was proposed by Chen et al. [18]. The feature mapping reconstruction module of the adaptive generative reorganization kernel is employed to construct the generator. The coordinate attention (CA) mechanism is introduced into the discriminator to enhance the training sample set by generating high-quality fault data, thus improving the performance of the fault diagnosis model under limited data conditions.

In fault diagnosis datasets, normal samples are typically abundant, while the number of samples in most fault categories is significantly smaller. Formally, consider a dataset with $m$ classes $(C_{1}, C_{2}, \dots, C_{m})$, where $C_{1}$ denotes the normal class containing $n_{1}$ samples, and $(C_{2}, \dots, C_{m})$ represent fault classes with $n_{2}, \dots, n_{m}$ samples, respectively. This distribution satisfies $n_{1} \gg n_{2} \approx \dots \approx n_{m}$, characteristic of a long-tailed distribution in statistics. In class-imbalanced fault diagnosis scenarios, the number of fault samples is relatively small, typically ranging from 20 to 50 samples. For few-shot fault diagnosis, tail-class instances may contain as few as 1 to 20 samples.

For imbalanced fault diagnosis under long-tail data distributions, Peng et al. [19] proposed a supervised contrastive learning method combined with progressive balanced resampling. This approach generates balanced training batches to enhance model performance in long-tail classification tasks. Huang et al. [20] introduced a novel fault diagnosis model called TransGRU, which is based on transformers and Gated Recurrent Unit (GRU) networks, specifically addressing the long-tail effect. The model extracts features from long-distance multi-sensor data and refines the diagnostic results through a gating mechanism. Furthermore, the authors designed an adaptive conditional loss (ACL) function by integrating focal loss dynamics, class-customized weights, and confusion weights, tailored for long-tail fault diagnosis scenarios. Luo et al. [21] proposed a Normalized Guided and Gradient-Weighted Unsupervised Domain Adaptation Network (NG-UDAN) for intelligent fault diagnosis. This network integrates a residual feature extractor and a domain normalization (DN) module to enhance domain-invariant feature extraction. They further employed localized maximum mean discrepancy (LMMD) loss to minimize the conditional distribution discrepancy between source and target domains, addressing both domain shift and intra-domain class imbalance. Jian et al. [22] proposed a Long-Tail Multi-Domain Generalized Fault Diagnosis (LMGFD) paradigm with a two-stage framework. The first stage utilizes a Balanced Multi-Order Moment Matching (BMMM) module to align long-tail subdomains, coupled with a Balanced Prototype Supervised Contrastive (BPSC) module to mitigate contrast imbalance. In the second stage, the focal loss is extended to a multi-class version to strengthen the tail class loss, thereby alleviating the overfitting problem. Zhang et al. 
[23] introduced a fault diagnosis approach tailored for small samples, leveraging a dual-path convolutional network integrated with an attention mechanism (DCA) and bidirectional gated recurrent units (BiGRUs). The effectiveness of such methods can be significantly boosted through the application of advanced regularization training techniques. Liu et al. [24] introduced Fourier Transform into the proxy tasks of self-supervised learning, constructing a time–frequency contrastive model. Long et al. [25] proposed a self-training semi-supervised deep learning (SSDL) model for effective learning in environments with a limited number of labeled samples and a large amount of unlabeled data. Chen et al. [26] introduced a method called Multi-Channel Calibrated Transformer with Shifted Windows (MCSwin-T) to address challenges such as domain shift and data scarcity. Wang et al. [27] proposed a novel attention-guided joint learning convolutional neural network (JL-CNN) by integrating diagnostic and denoising tasks. Chen et al. [28] proposed a diagnostic method based on a dual-path convolutional and capsule network with a multi-branch attention mechanism to enhance diagnostic performance under the conditions of limited labeled data and strong noise.

Existing research methods commonly use strategies such as reweighting, resampling, and class balancing, which have shown promising results. However, most intelligent fault diagnosis approaches focus on a single domain: either the time domain or the frequency domain. As a result, feature extraction often fails to fully capture all the attributes of fault signals. To address this limitation, we propose a coupled time–frequency attention mechanism (CTFAM), which uses the FFT to simultaneously model both time-domain and frequency-domain features. This dual-domain approach effectively resolves the problem of insufficient single-domain feature representation. A novel real–imaginary attention mechanism is introduced to process the signals while preserving their imaginary components, with a specific focus on phase information. This mechanism enables the more accurate learning of signal features. By coupling the attention of the two domains in a linear serial fashion, CTFAM independently extracts key fault-signal features from each domain and efficiently fuses them. This approach not only enhances the model’s ability to learn critical features but also reduces noise interference and suppresses redundant information, ultimately improving diagnostic accuracy and robustness.

The main contributions of the study are as follows:

(1): To address the loss of model accuracy caused by long-tail data distributions in real production environments, an innovative CTFAM is proposed, which processes signals in the frequency domain using the FFT while retaining the imaginary part. We also design a real–imaginary attention mechanism to effectively extract important information in the frequency domain.
(2): Based on CTFAM, an enhanced transformer model called the coupled time–frequency attention transformer (CTFAT) is proposed for precise fault identification and diagnosis. One-dimensional convolution is applied to the signal to decompose it into blocks, which effectively reduces the influence of high-frequency noise and addresses the issue of excessively long inputs caused by the length of the signal.
(3): The model’s diagnostic performance is analyzed using bearing datasets from CWRU and the laboratory. The experimental results demonstrate that, compared to other related methods, the model exhibits exceptional generalization ability and anti-interference capabilities in complex environments, especially when dealing with long-tailed data samples.

The structure of the remainder of this article is outlined below. Section 2 introduces the attention network of the transformer and its variant architectures. Section 3 proposes the real–imaginary attention, coupled time–frequency attention mechanism, and an improved transformer model incorporating the enhanced time–frequency attention mechanism. In Section 4, the performance of the proposed model is validated using bearing datasets from CWRU and the laboratory. Lastly, Section 5 offers the closing remarks and summary.

2. Related Work

2.1. Transformer Attention Network

The attention mechanism enables models to automatically identify key components of input features. By allocating varying levels of focus, the model can process input data more flexibly, allowing it to learn essential information and characteristics. This enhances its adaptability and efficiency in handling complex input distributions or noisy scenarios. Various types of attention mechanisms have been developed, including frequency domain attention [29], channel attention [30], and self-attention [31]. The fundamental architecture of the transformer consists of input embeddings, positional encodings, an encoder, a decoder, and a classifier. Since the input to a transformer is a sequence, each element of the sequence is first mapped to a high-dimensional vector. Positional encoding is introduced to incorporate positional information into the input vector, enabling the model to capture the relative positions of elements within the sequence. The input then enters the encoder, where it is transformed through the multi-head attention mechanism, which applies attention in parallel across multiple independent heads. The relationships learned by each attention head can be viewed as distinct representations of the input, while the outputs from the various heads are combined to produce the final output, as illustrated in Figure 1.

As the most central part of the transformer, the multi-head self-attention layer realizes the feature extraction and attention allocation of the whole model. Its most important components are the query, key, and value: the query and key are used to calculate the attention weights, which are then used to form a weighted sum of the values of the input sequence to obtain the final attention output. Assume that the input matrix is $X \in \mathbb{R}^{B \times L \times D}$, where $B$ represents the batch size, $L$ represents the sequence length, and $D$ represents the feature dimension. We then initialize three weight matrices $W_{Q}, W_{K}, W_{V} \in \mathbb{R}^{D \times d_{k}}$ for each head, with $d_{k} = \frac{D}{H}$, where $H$ represents the number of attention heads. First, the input matrix is multiplied with the three weight matrices for spatial transformation to obtain three matrices $Q, K, V \in \mathbb{R}^{B \times L \times H \times d_{k}}$ for calculation, as shown in Equation (1):

$$\begin{array}{l} Q = X \cdot W_{Q} \\ K = X \cdot W_{K} \\ V = X \cdot W_{V} \end{array}$$

(1)

For a given $Q$ and $K$, the similarity is calculated using the dot product operation, and the resulting similarity is called the energy score, as described in Equation (2):

$$\mathrm{Energy}(Q, K) = Q \cdot K^{\top}$$

(2)

The attention weight $\alpha$ is obtained by normalizing the energy score with the softmax function. To keep large dot products from saturating the softmax, which would cause the gradient to vanish, the energy score is first divided by $\sqrt{d_{k}}$, the square root of the per-head feature dimension, as shown in Equation (3):

$$\alpha = \mathrm{softmax}\left(\frac{\mathrm{Energy}(Q, K)}{\sqrt{d_{k}}}\right)$$

(3)

Finally, the attention weights are used to compute the weighted sum of $V$, which gives the final attention output, as shown in Equation (4), where $N$ is the number of input samples (sequence elements), and $\alpha_{i}$ and $V_{i}$ are the attention weight and value of the $i$-th element:

$$\mathrm{Attention}(Q, K, V) = \sum_{i = 1}^{N} \alpha_{i} V_{i}$$

(4)

After extracting features through the multi-head self-attention layer, these features are processed by the feed-forward neural network and finally input to the linear classification layer to obtain the classification results.
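Equations (1)–(4) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the per-head weight matrices are assumed to be stacked into single $D \times D$ matrices, no output projection is applied, and the function name, shapes, and random weights are chosen purely for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtract the maximum for numerical stability before exponentiating.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, num_heads):
    """X: (B, L, D). W_q, W_k, W_v: (D, D), i.e. H per-head (D, d_k) blocks stacked."""
    B, L, D = X.shape
    d_k = D // num_heads

    def project(W):
        # Eq. (1): linear projection, then split into heads -> (B, H, L, d_k).
        return (X @ W).reshape(B, L, num_heads, d_k).transpose(0, 2, 1, 3)

    Q, K, V = project(W_q), project(W_k), project(W_v)
    energy = Q @ K.transpose(0, 1, 3, 2)             # Eq. (2): (B, H, L, L)
    alpha = softmax(energy / np.sqrt(d_k), axis=-1)  # Eq. (3)
    out = alpha @ V                                  # Eq. (4): weighted sum of values
    # Merge the heads back into one feature axis -> (B, L, D).
    out = out.transpose(0, 2, 1, 3).reshape(B, L, D)
    return out, alpha

# Illustrative shapes only (not taken from the paper).
rng = np.random.default_rng(0)
B, L, D, H = 2, 8, 16, 4
X = rng.standard_normal((B, L, D))
W_q, W_k, W_v = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))
out, alpha = multi_head_self_attention(X, W_q, W_k, W_v, H)
```

Each row of `alpha` sums to one, so every output position is a convex combination of the value vectors within its head.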

2.2. Improved Transformer Model

Text-processing-specific deep learning models have made significant advancements due to the fundamental architecture of the transformer and its variations, such as BERT. In response to the challenges posed by large image sizes, Han et al. [32] proposed the vision transformer (ViT), a paradigm based on transformer architecture for image classification. ViT addresses the issue of oversized images by dividing them into a fixed number of patches, which are then input into the model. To resolve the problem of fixed output resolution within the ViT model, Liu et al. and Wang et al. proposed the Swin Transformer [33] and Pyramid ViT [34], respectively. The Swin Transformer introduces a window-based attention mechanism: the image is divided into windows, self-attention is computed within each window, and features are exchanged between adjacent windows by alternately shifting the windows. Pyramid ViT introduces a gradually shrinking pyramid structure to generate multi-scale feature maps and improve feature map resolution. In addition, it uses a simple and effective spatial dimension reduction attention mechanism to reduce computational and memory costs. The feature maps extracted by ViT, Swin Transformer, and Pyramid ViT are shown in Figure 2.

ViT and its variants can significantly enhance the feature extraction ability of the model through feature segmentation or window attention. However, once the input data are in an abnormal distribution or disturbed by noise, the diagnostic accuracy of ViT-based models becomes unstable, and their robustness degrades. To address this, future work should focus on refining both the attention mechanism and network architecture to improve robustness in long-tailed distributions.

3. Method

3.1. Real–Imaginary Attention

In signal processing, time-domain signals characterize how signals evolve over time. Time-domain features sensitively capture waveform details of transient shocks and periodic pulses. By leveraging self-attention mechanisms for global temporal modeling, repetitive shock signals at different time points can be directly captured to precisely identify the temporal patterns of fault characteristics. Frequency-domain signals, conversely, reveal the distribution of signal components across frequencies. Using the FFT to convert discrete time-domain signals, the magnitude of the FFT real part directly reflects the energy of each frequency component. Analyzing the real-part magnitude spectrum can determine the vibration intensity of various frequency components during bearing operation. Faults will cause amplitude changes at corresponding characteristic frequencies. The FFT imaginary part encodes phase information, with distinct phase patterns for different fault types. Analyzing these phase characteristics enables accurate fault localization and diagnostic analysis. Complementary time–frequency analysis empowers models with comprehensive and in-depth understanding of signal behaviors and properties, serving as indispensable dimensions in signal processing. At present, there are many methods to convert time-domain signals into frequency domain signals, including the Fourier Transform [35], Wavelet Transform [36], and Hilbert Transform [37]. In this work, the FFT is used because of its low computational complexity and fast processing speed.

A signal processed using the FFT contains both real and imaginary parts. A real–imaginary attention mechanism (RIA) is proposed to process the imaginary part of the signal, which contains the phase information, more effectively, so as to realize the comprehensive analysis and understanding of the signal. We assume that the input frequency domain signal is $X_{fre} = x + y i$, where $x$ is the real part, $y$ is the imaginary part, and $i$ is the imaginary unit. The signal is decomposed into its real part $X_{br}$ and imaginary part $X_{bi}$, which are fed into the real-part branch and the imaginary-part branch, respectively. The specific structure is shown in Figure 3.

We aimed to improve the generalization ability of the model, reducing the risk of covariate shift and overfitting in the data. The use of layer normalization (LN) for frequency domain data helps maintain signal characteristics and better meets the needs of model training, as shown in Equation (5):

$$\begin{array}{l} X_{br}^{'} = \mathrm{LN}(X_{br}) \\ X_{bi}^{'} = \mathrm{LN}(X_{bi}) \end{array}$$

(5)

where $X_{br}^{'}$ and $X_{bi}^{'}$ are the output signals of the real and imaginary parts after LN, respectively.

To resist the interference of low-frequency noise and to strengthen feature fusion between high-dimensional channels, a 1 × 1 convolution is then used to raise the feature dimension of the signal, as described in Equation (6):

$$\begin{array}{l} X_{br}^{\Delta} = \sigma(\mathrm{Conv}(X_{br}^{'})) \\ X_{bi}^{\Delta} = \sigma(\mathrm{Conv}(X_{bi}^{'})) \end{array}$$

(6)

where $X_{br}^{\Delta}$ and $X_{bi}^{\Delta}$ are the output signals of the real and imaginary parts after the convolution, respectively, $\mathrm{Conv}$ is the convolution function, and $\sigma$ is the ReLU activation function, which is used to improve nonlinear mapping and mitigate vanishing gradients.

After convolution, the dimension-raised feature maps are multiplied by the learnable attention matrices $W_{real}$ and $W_{imag}$. By introducing the attention matrices to assign different weights to different positions in the feature map, the model can dynamically learn and adjust the importance of each position, as shown in Equation (7):

$$\begin{array}{l} X_{br}^{attn} = W_{real} \cdot X_{br}^{\Delta} \\ X_{bi}^{attn} = W_{imag} \cdot X_{bi}^{\Delta} \end{array}$$

(7)

where $W_{real}$ and $W_{imag}$ are the real-part attention matrix and the imaginary-part attention matrix, respectively, and $X_{br}^{attn}$ and $X_{bi}^{attn}$ are the signals output after the allocation of attention.

Finally, each branch is projected back to its original dimension using a 1 × 1 convolution, and the two branches are then merged to restore a complex-valued output, as described in Equation (8):

$$\begin{array}{l} X_{real} = \sigma(\mathrm{Conv}(X_{br}^{attn})) \\ X_{imag} = \sigma(\mathrm{Conv}(X_{bi}^{attn})) \\ X_{output} = (X_{real} - X_{imag}) + (X_{real} + X_{imag}) i \end{array}$$

(8)

where $\mathrm{Conv}$ is the convolution function, $\sigma$ is the ReLU activation function, and $X_{real}$ and $X_{imag}$ are the data processed by the real-part branch and the imaginary-part branch, respectively; they are restored to the complex frequency domain feature $X_{output}$ after merging.
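The data flow of Equations (5)–(8) can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the 1 × 1 convolution is written as a per-position linear map over channels, the product with $W_{real}$ and $W_{imag}$ in Equation (7) is taken to be element-wise, layer normalization has no learnable affine parameters, and all names, shapes, and random weights are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the channel (last) axis; no learnable scale or shift.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def relu(x):
    return np.maximum(x, 0.0)

def ria_branch(x, w_up, w_attn, w_down):
    """One branch. x: (L, C); w_up: (C, C_hi); w_attn: (L, C_hi); w_down: (C_hi, C)."""
    x = layer_norm(x)        # Eq. (5)
    x = relu(x @ w_up)       # Eq. (6): 1x1 conv as a per-position linear map
    x = w_attn * x           # Eq. (7): element-wise weighting (assumption)
    return relu(x @ w_down)  # Eq. (8), first two lines

def real_imaginary_attention(x_fre, params_real, params_imag):
    x_real = ria_branch(x_fre.real, *params_real)
    x_imag = ria_branch(x_fre.imag, *params_imag)
    # Eq. (8), last line: merge the two branches into a complex output.
    return (x_real - x_imag) + 1j * (x_real + x_imag)

# Illustrative shapes and random weights (not from the paper).
rng = np.random.default_rng(0)
L, C, C_hi = 32, 4, 16

def init_params():
    return (rng.standard_normal((C, C_hi)) * 0.1,
            rng.standard_normal((L, C_hi)),
            rng.standard_normal((C_hi, C)) * 0.1)

signal = rng.standard_normal((L, C))
x_fre = np.fft.fft(signal, axis=0)
y = real_imaginary_attention(x_fre, init_params(), init_params())
```

Because both branches end in a ReLU, the sum and the difference of the output's imaginary and real parts are non-negative in this sketch, which follows directly from the form of Equation (8).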

3.2. Coupled Time–Frequency Attention Mechanism

Conventional parallel processing techniques frequently fall short in fusing time-domain and frequency-domain data when dealing with uneven and long-tail data distributions. CTFAM takes into account both time-domain and frequency-domain properties, connecting signals from both domains in a linear fashion, in contrast to earlier parallel calculation approaches. This approach enhances the model’s feature extraction capabilities and anti-noise performance when dealing with long-tail distributions. The specific structure of CTFAM is illustrated in Figure 4.

The structure is composed of frequency domain attention blocks and time-domain attention blocks that are linearly stacked. For the frequency domain attention block, the time-domain signal is first transformed into the frequency domain by the FFT, as shown in Equation (9):

$$X_{fre}(k) = \sum_{n = 0}^{N - 1} X_{in}(n) e^{- j \frac{2 \pi}{N} k n}, \quad k = 0, 1, \dots, N - 1$$

(9)

where $X_{in}$ is the discrete time-domain input signal, $N$ is the signal length, $e$ is the base of the natural logarithm, $j$ is the imaginary unit, and $X_{fre}(k)$ is the complex signal in the frequency domain, representing the component of the signal at the normalized frequency $k/N$.
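As a quick check of Equation (9), the sketch below evaluates the sum directly and compares it with `numpy.fft.fft`. The test signal (two sinusoids at 20 Hz and 60 Hz, sampled at a hypothetical 256 Hz) is invented for illustration and is unrelated to the bearing data.

```python
import numpy as np

def naive_dft(x):
    """Direct evaluation of Eq. (9): X(k) = sum_n x(n) * exp(-j*2*pi*k*n/N)."""
    N = len(x)
    n = np.arange(N)
    k = n.reshape(-1, 1)
    # Row k of the matrix holds the complex exponentials for frequency bin k.
    return (x * np.exp(-2j * np.pi * k * n / N)).sum(axis=1)

# Hypothetical test signal: with fs == N, bin k corresponds to k Hz.
fs = 256
N = 256
t = np.arange(N) / fs
x = np.sin(2 * np.pi * 20 * t) + 0.5 * np.sin(2 * np.pi * 60 * t)

X = naive_dft(x)
magnitude = np.abs(X[: N // 2])   # amplitude information
phase = np.angle(X[: N // 2])     # phase information carried by the complex values
peak_bin = int(np.argmax(magnitude))
```

The direct sum costs $O(N^{2})$ operations, whereas the FFT computes the same values in $O(N \log N)$, which is the efficiency argument made in Section 3.1.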

The transformed frequency domain signal $X_{fre}$ is input into RIA blocks for frequency domain attention score allocation. After the model has focused on the key features, the inverse Fast Fourier Transform (IFFT) is used to convert the frequency domain signal back into a time-domain signal, as shown in Equation (10):

$$X_{time}^{'} = \mathrm{IFFT}(\mathrm{LN}(\mathrm{RIA}(X_{fre})))$$

(10)

where RIA is the real–imaginary attention proposed in the previous section.

Then, the processed time-domain signal $X_{time}^{'}$ is input into the feed-forward network, and a residual connection is made between the result and the initial time-domain signal $X_{time}$. This can effectively alleviate the gradient vanishing problem and help the network to train better, as shown in Equation (11):

$$X_{output} = X_{time} + f(X_{time}^{'})$$

(11)

where $f$ is the feed-forward function and $X_{output}$ is the output of the frequency domain attention block. For the time-domain attention block, the input signal is directly transformed by multi-head self-attention and then added to the original signal through a residual connection. After LN, it is input into the feed-forward network with a second residual connection and LN, and, finally, the output feature $X_{cls}$ is passed to the classifier, as shown in Equation (12):

$$\begin{array}{l} X_{attn} = \mathrm{LN}(\mathrm{MHA}(X_{input}) + X_{input}) \\ X_{cls} = \mathrm{LN}(f(X_{attn}) + X_{attn}) \end{array}$$

(12)

where MHA is the multi-head self-attention mechanism and $X_{attn}$ is the input signal after self-attention.
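The serial wiring of Equations (9)–(12) can be sketched as below. The sketch only shows how the two blocks are chained; `ria`, `mha`, and the feed-forward functions are passed in as callables, and the stand-ins used in the example are deliberately trivial. Two details are assumptions rather than statements from the text: layer normalization is applied to the real and imaginary parts separately, and only the real part of the IFFT result is kept.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def frequency_block(x_time, ria, ffn):
    x_fre = np.fft.fft(x_time, axis=0)                 # Eq. (9), along the sequence axis
    z = ria(x_fre)
    z = layer_norm(z.real) + 1j * layer_norm(z.imag)   # LN per part (assumption)
    x_time_prime = np.fft.ifft(z, axis=0).real         # Eq. (10); real part kept (assumption)
    return x_time + ffn(x_time_prime)                  # Eq. (11)

def time_block(x, mha, ffn):
    x_attn = layer_norm(mha(x) + x)                    # Eq. (12), first line
    return layer_norm(ffn(x_attn) + x_attn)            # Eq. (12), second line

def ctfam(x_time, ria, mha, ffn_freq, ffn_time):
    # Frequency-domain block first, then the time-domain block (serial coupling).
    return time_block(frequency_block(x_time, ria, ffn_freq), mha, ffn_time)

# Trivial stand-ins, used only to exercise the wiring.
identity = lambda z: z
zero_ffn = lambda h: np.zeros_like(h)
toy_mha = lambda h: h.mean(axis=0, keepdims=True) * np.ones_like(h)  # uniform attention

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 8))   # (sequence length, feature dimension)
out = ctfam(x, ria=identity, mha=toy_mha, ffn_freq=np.tanh, ffn_time=np.tanh)
passthrough = frequency_block(x, identity, zero_ffn)
```

With a zero feed-forward function the frequency block reduces to the identity, which makes the role of the residual path in Equation (11) easy to see.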

3.3. Coupled Time–Frequency Attention Transformer Model

We propose a transformer model named CTFAT, which is based on the coupled time–frequency attention mechanism. We focus on the problem that the long-tail distribution of data and noise interference affect the performance of the model under actual working conditions. In CTFAT, the low-frequency features of the time-domain signal are first extracted using 1D convolution to resist the interference of high-frequency noise. To alleviate the problem of insufficient receptive fields caused by the fixed resolution of the subsequent attention mechanism, the overall signal is divided into several local receptive fields at the same time. Second, the position coding is combined with the signal to introduce position information for the transformer, so as to capture different position features and strengthen the feature space connection. Then, the processed time-domain signal is input into the CTFAM module; the overall model diagram is shown in Figure 5.

The features of the processed time-domain signal are extracted through time–frequency dual domain attention in the CTFAM module. The frequency domain attention focuses on multi-channel frequency domain signal processing with high-dimensional local resolution, and the frequency domain signal is processed through RIA to obtain more comprehensive feature information. The time-domain attention focuses on the time-domain signal features with low-dimensional global resolution through the multi-head attention mechanism and finally maps the features into specific fault categories through the classifier. Figure 6 shows the flow design of the proposed improved time–frequency attention bearing intelligent fault diagnosis framework.
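The front end described above (1D convolution that splits the signal into local blocks, followed by position coding) might look like the following sketch. The text does not give the kernel size, stride, embedding width, or the type of position coding, so the values below, the choice of a stride equal to the kernel length, and the sinusoidal encoding are all assumptions made for illustration.

```python
import numpy as np

def patch_embed(signal, kernels, bias):
    """Strided 1D convolution with stride == kernel length.

    signal: (N,); kernels: (D, P); bias: (D,). Returns tokens of shape (N // P, D).
    With stride equal to the kernel length, the convolution is the same as
    cutting the signal into non-overlapping patches and applying a linear map.
    """
    P = kernels.shape[1]
    n_patches = len(signal) // P
    patches = signal[: n_patches * P].reshape(n_patches, P)
    return patches @ kernels.T + bias

def sinusoidal_positions(n_positions, dim):
    """Standard sinusoidal positional encoding (dim assumed even)."""
    pos = np.arange(n_positions)[:, None]
    two_i = np.arange(0, dim, 2)[None, :]
    angles = pos / (10000 ** (two_i / dim))
    pe = np.zeros((n_positions, dim))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Hypothetical sizes: a 1024-point window, patches of 16 points, 32 channels.
rng = np.random.default_rng(0)
window = rng.standard_normal(1024)
kernels = rng.standard_normal((32, 16)) * 0.1
bias = np.zeros(32)

tokens = patch_embed(window, kernels, bias)
pe = sinusoidal_positions(tokens.shape[0], tokens.shape[1])
model_input = tokens + pe
```

Shortening the sequence from 1024 samples to 64 tokens is what keeps the attention cost, which grows quadratically with sequence length, manageable for long signals.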

4. Experiment

4.1. Experimental Environment and Dataset Introduction

To verify the diagnostic performance of the model under long-tailed data distribution, we conducted various experiments on the bearing data from CWRU and the laboratory bearing data. All experiments were carried out using PyTorch 2.0.1 and Python 3.9.12 on a system equipped with an NVIDIA GeForce GTX 1650 graphics card (NVIDIA, Santa Clara, CA, USA), an Intel(R) Core(TM) i5-9300H CPU (Intel, Santa Clara, CA, USA) running at a main frequency of 2.40 GHz, and 8 GB of RAM. During training, the CTFAT model employs class-weighted cross-entropy as the loss function to mitigate the gradient bias toward majority classes in long-tailed distributions. The Adam optimizer and a cosine annealing learning rate strategy are used to facilitate better model convergence. The training consists of 200 epochs, with an average training time of 5.02 s per epoch. The model size is 9.03 M, and its computational cost is 39.86 M FLOPs.
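The two training ingredients named above, class-weighted cross-entropy and cosine annealing, can be written out in a few lines. The weighting scheme is not specified in the text, so inverse-frequency weights are assumed here; the class counts (100 normal samples and 30 samples for each of nine fault classes) follow the dataset C setting used in Case 1, and the maximum learning rate of 1e-3 is a hypothetical value.

```python
import math
import numpy as np

def inverse_frequency_weights(class_counts):
    # Assumed scheme: weight_c = total / (num_classes * count_c).
    counts = np.asarray(class_counts, dtype=float)
    return counts.sum() / (len(counts) * counts)

def weighted_cross_entropy(logits, targets, class_weights):
    """Weighted mean of per-sample negative log-likelihoods.

    Normalizing by the sum of the sample weights matches the 'mean' reduction
    of torch.nn.CrossEntropyLoss when a weight tensor is supplied.
    """
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]
    w = class_weights[targets]
    return float((w * nll).sum() / w.sum())

def cosine_annealing_lr(epoch, total_epochs, lr_max, lr_min=0.0):
    # Closed form of a cosine schedule from lr_max down to lr_min.
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * epoch / total_epochs))

counts = [100] + [30] * 9            # normal class first, then nine fault classes
weights = inverse_frequency_weights(counts)

logits = np.zeros((4, 10))           # uninformative predictions
targets = np.array([0, 1, 5, 9])
loss = weighted_cross_entropy(logits, targets, weights)

lrs = [cosine_annealing_lr(e, 200, lr_max=1e-3) for e in (0, 100, 200)]
```

Under this scheme a fault-class sample contributes roughly 3.3 times as much to the loss as a normal-class sample, which counteracts the gradient bias toward the majority class.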

The CWRU experimental platform consists of a drive motor, a coupling, and a bearing system, where the bearing is driven by the coupling. Accelerometers are employed to collect waveforms from various locations to diagnose the health status of the bearing. This platform utilizes SKF6205 bearings, and specific operating parameters are outlined in Table 1. The structure of the experimental platform is illustrated in Figure 7, which includes a drive motor, a torque sensor, and a dynamometer. It is divided into a drive end and a fan end, with accelerometers positioned at the 12 o’clock position on both ends, operating at sampling frequencies of 12 kHz and 48 kHz, respectively. Faults in the inner race (IR), outer race (OR), and rolling elements (RE) are artificially induced using electrical discharge machining. Each fault type is categorized into three damage levels: mild (7 mils), moderate (14 mils), and severe (21 mils), resulting in a total of nine fault types. The motor delivers a power output of 2 horsepower and operates under four loading conditions ranging from 0 to 3 (0 to 2.25 kW), with rotational speeds varying between 1797 and 1730 r/min. To simulate the long-tail distribution of fault data in the real production environment, the samples in the bearing dataset are rebalanced to reduce the number of samples in the fault classes. Table 2 shows the division of the CWRU datasets. Figure 8 shows the time-domain acceleration signals of the drive-end bearing under four different operating states, using the drive-end bearing under a 0 kW load as an example.

Figure 9 illustrates a rolling bearing device in the laboratory. The platform employs UPH205 bearings and consists of components such as a motor, acceleration sensors, and a support bearing. The bearing part of the platform includes two bearings: the drive end (DE) adjacent to the motor and the non-drive end (NDE). The rotational speed was 1000 r/min, and the acquisition frequency was 5 kHz; the sensor device was placed in the 12 o’clock direction of the bearing for diagnostic signal acquisition. Four main categories of data were collected: normal, inner race (IR), outer race (OR), and rolling elements (RE). Taking the bearings at 1000 r/min as an example, the signals for different fault locations are shown in Table 3. To fit the long-tail distribution of signals in the real production environment, the laboratory dataset is also rebalanced, and the specific division is shown in Table 4.
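One simple way to build such a long-tailed split from a balanced pool of samples is to keep many normal samples and randomly subsample each fault class. The sketch below illustrates the idea; the function name, the class sizes (100 normal, 30 per fault class, 200 available per class), and the use of random subsampling are assumptions for illustration, and the actual divisions are those given in Tables 2 and 4.

```python
import numpy as np

def make_long_tail(labels, n_normal, n_fault, normal_class=0, seed=0):
    """Return sorted indices of a subset with n_normal normal and n_fault samples per fault class."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    keep = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        n_keep = n_normal if c == normal_class else n_fault
        n_keep = min(n_keep, len(idx))   # never request more samples than exist
        keep.append(rng.choice(idx, size=n_keep, replace=False))
    return np.sort(np.concatenate(keep))

# Hypothetical balanced pool: 10 classes (class 0 = normal), 200 samples each.
labels = np.repeat(np.arange(10), 200)
subset = make_long_tail(labels, n_normal=100, n_fault=30)
class_counts = np.bincount(labels[subset], minlength=10)
```

Fixing the seed keeps the subset reproducible, so that repeated runs compare models on the same long-tailed split.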

4.2. Data Processing

Data processing refers to the preprocessing and cleaning of the data in the dataset before model training, so as to better adapt to the training needs of the model. Sliding window sampling is a method of extracting subseries from time series or sequence data. By defining a fixed-size window, we slide the window over the data sequence, taking one subsequence at a time within the window. This allows local information at different time points in the sequence data to be obtained for the purposes of training the model. Specifically, the steps for sliding window sampling are as follows:

(1): Based on the sampling frequency, the window length is set to 1024 data points to strike a reasonable balance between frequency resolution and time resolution. This configuration keeps enough temporal locality to capture the time-domain features of fault impacts while still resolving the characteristic frequencies of bearing faults.
(2): To mitigate the spectral leakage caused by truncating bearing fault impact signals at a non-integer number of periods, a Hanning window is applied to the signals. This significantly improves the reliability of characteristic frequency band identification, because the amplitude peaks at the fault frequencies more accurately reflect the energy distribution. The procedure is as follows: first, position the Hanning window at the starting point of the signal sequence; then, slide the window along the time axis at a fixed stride to segment the signal.
(3): At each location, extract the subsequences within the window as a sample.
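The trade-off in step (1) can be checked with a short calculation. The sketch below assumes the 12 kHz CWRU sampling rate, the 1797 r/min shaft speed, and the SKF6205 geometry from Table 1; the contact angle of 0° and the standard kinematic formulas for the outer race (BPFO), inner race (BPFI), and ball spin (BSF) frequencies are assumptions added here for illustration and are not stated in the text.

```python
import math

def fault_frequencies(shaft_hz, n_balls, ball_d, pitch_d, contact_angle_deg=0.0):
    """Standard kinematic bearing fault frequencies (Hz), assuming no slip."""
    ratio = (ball_d / pitch_d) * math.cos(math.radians(contact_angle_deg))
    bpfo = 0.5 * n_balls * shaft_hz * (1.0 - ratio)              # outer race
    bpfi = 0.5 * n_balls * shaft_hz * (1.0 + ratio)              # inner race
    bsf = 0.5 * (pitch_d / ball_d) * shaft_hz * (1.0 - ratio ** 2)  # ball spin
    return {"BPFO": bpfo, "BPFI": bpfi, "BSF": bsf}

fs = 12_000                      # sampling frequency in Hz (CWRU, 12 kHz case)
window = 1024                    # window length in samples
delta_f = fs / window            # frequency resolution of one window, in Hz
window_duration = window / fs    # time span covered by one window, in s

shaft_hz = 1797 / 60             # shaft speed at 0 load, in Hz
freqs = fault_frequencies(shaft_hz, n_balls=9, ball_d=7.94, pitch_d=39.04)

print(f"frequency resolution: {delta_f:.2f} Hz, window duration: {window_duration * 1e3:.1f} ms")
for name, value in freqs.items():
    print(f"{name}: {value:.1f} Hz ({value / delta_f:.1f} frequency bins)")
```

Under these assumptions, each characteristic frequency sits several frequency bins away from zero, while one window still spans only a few shaft revolutions, which is the balance step (1) refers to.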

Assuming that the sequence length is L, the window size is W, and the stride length is S, the starting position L_{i} of the i-th sliding window advances as shown in Equation (13):

L_{i + 1} = L_{i} + S

(13)

To prevent spectral leakage from window overlap and ensure independent feature extraction, the stride S is set equal to the window size W (non-overlapping sampling), which prevents energy distortion between adjacent segments. The process is shown in Figure 10.
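The sampling steps above can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function name `sliding_window_segments` and the synthetic test signal are hypothetical, and each window is assumed to start S samples after the previous one, with S = W = 1024 and a Hanning window applied to every segment.

```python
import numpy as np

def sliding_window_segments(signal, window=1024, stride=1024, apply_hann=True):
    """Cut a 1-D signal into fixed-size windows taken every `stride` samples.

    Returns the start index of every window and an array of shape
    (n_segments, window). Trailing samples that do not fill a whole
    window are discarded.
    """
    signal = np.asarray(signal, dtype=float)
    if signal.size < window:
        raise ValueError("signal is shorter than one window")
    n_segments = (signal.size - window) // stride + 1
    starts = np.arange(n_segments) * stride          # each start = previous + stride
    segments = np.stack([signal[s:s + window] for s in starts])
    if apply_hann:
        segments = segments * np.hanning(window)     # taper each segment
    return starts, segments

# Synthetic stand-in for one second of a 12 kHz vibration record.
rng = np.random.default_rng(0)
x = rng.standard_normal(12_000)
starts, segs = sliding_window_segments(x)
print(starts[:3], segs.shape)
```

With a stride equal to the window size, consecutive segments share no samples, which matches the non-overlapping setting described above; a smaller stride would produce overlapping segments instead.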

4.3. Case 1: CWRU Public Bearing Dataset

Batch size is a crucial training hyperparameter that significantly affects both the training speed and the performance of the model. To analyze its impact on diagnostic performance, experiments are conducted on bearing dataset C, which includes 30 samples for each fault type and 100 healthy samples, under a load of 1.5 kW. Each experiment is repeated five times to reduce random error. As illustrated in Figure 11, the choice of batch size greatly influences the model’s performance, and the model achieves high accuracy with a batch size of 64. Therefore, the batch size is set to 64 for all subsequent experiments.

Next, the proposed CTFAT model is trained and tested under different loads. The healthy class always contains 100 samples, while the number of fault samples differs between groups. Five groups with different fault sample sizes are tested under each load condition, and each group is evaluated using five random seeds to reduce bias from random factors. Comparison Methods 1 to 3 are ResNet18 [16], MCSwin-T [26], and Pyramid Vision Transformer [34], respectively; Method 4 is CTFAT, the model proposed in this study. The model parameters and results are shown in Table 5 and Figure 12. CTFAT has significantly fewer parameters than the other models (2.34M, compared with 6.28M to 26.25M) because part of the time-domain multi-head self-attention layers is replaced with frequency-domain attention blocks. This not only improves the parameter efficiency of the model but also reduces overfitting to the healthy head class.

Figure 12 shows that the accuracy of each method increases with the number of fault samples. The proposed CTFAT model achieves an accuracy above 90% under all load conditions, and its average accuracy remains above 93% even in the restricted scenario with only 10 fault samples. Because production tasks vary in actual production environments, it is also essential to test the model across operating conditions under a long-tail distribution to evaluate its adaptability and generalization. To this end, the model is trained on data from one load condition and evaluated on the other load conditions. Each experiment is repeated five times to reduce random error. As shown in Figure 13, the CTFAT model performs well across the cross-load tasks and outperforms the reference models in most test scenarios.

Unlike signals recorded in the laboratory, the vibration signals collected in factories and on actual production lines often contain a large amount of noise, which seriously affects the diagnostic accuracy of the model. Fault diagnosis based only on clean signals therefore struggles to meet the requirements of practical applications. To simulate the industrial context realistically, Gaussian white noise is added to the original vibration signals in this study. By adjusting the signal-to-noise ratio (SNR), the diagnostic accuracy of the model is tested under different noise intensities, and the robustness of the proposed method is evaluated. Five SNR levels are selected for the test: −2 dB, 0 dB, 2 dB, 4 dB, and 6 dB. The SNR is calculated according to Equation (14):

\mathrm{SNR} = 10 \lg \left( \frac{P_{signal}}{P_{noise}} \right)

(14)

where P_{signal} indicates the original input signal intensity and P_{noise} indicates the noise intensity.

In this study, Gaussian noise is added to the original input signal to create a noisy signal, and the model’s noise resistance is evaluated accordingly. Specifically, the experiment is conducted under a load condition of 0 kW using dataset E (50 samples for each fault class and 100 healthy samples), with noise added at the different SNR levels to train the model.
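The noise injection based on Equation (14) can be sketched as below. The sketch assumes that the signal and noise intensities are measured as mean squared amplitude (average power) and that the noise is zero-mean; the function names `add_gaussian_noise` and `measured_snr_db` and the sinusoidal test signal are hypothetical and used only for illustration.

```python
import numpy as np

def add_gaussian_noise(signal, snr_db, rng=None):
    """Add zero-mean Gaussian white noise so that the SNR is about `snr_db`."""
    rng = np.random.default_rng() if rng is None else rng
    signal = np.asarray(signal, dtype=float)
    p_signal = np.mean(signal ** 2)                    # average signal power
    p_noise = p_signal / (10.0 ** (snr_db / 10.0))     # Equation (14) solved for noise power
    noise = rng.normal(0.0, np.sqrt(p_noise), size=signal.shape)
    return signal + noise, noise

def measured_snr_db(signal, noise):
    """Empirical SNR in dB from the clean signal and the added noise."""
    return 10.0 * np.log10(np.mean(signal ** 2) / np.mean(noise ** 2))

# Synthetic stand-in for a vibration record: a 100 Hz tone sampled at 12 kHz.
rng = np.random.default_rng(0)
t = np.arange(12_000) / 12_000
clean = np.sin(2 * np.pi * 100 * t)

for snr in (-2, 0, 2, 4, 6):
    noisy, noise = add_gaussian_noise(clean, snr, rng)
    print(snr, round(float(measured_snr_db(clean, noise)), 2))
```

Because the noise is random, the measured SNR of a finite record only approximates the target value; the deviation shrinks as the record gets longer.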

As shown in Figure 14, the diagnostic accuracy of all models generally improves as the SNR increases. The proposed CTFAT model outperforms the comparison models across all SNR conditions. Even in the challenging −2 dB SNR environment, the CTFAT model achieves a diagnostic accuracy of 84.41%.

Finally, confusion matrices are used to assess each model’s ability to diagnose input signals under a long-tail distribution. Dataset C (30 samples for each fault type and 100 healthy samples) is tested under a load of 0 kW, and the confusion matrix for each model is presented in Figure 15. As illustrated in the figure, when the data are subject to a long-tail distribution and noise interference, the diagnostic performance of Method 1 and Method 2 declines, and they cannot accurately distinguish between the various fault categories. Method 3 also misclassifies, to some extent, the different damage levels of rolling element faults. In contrast, Method 4 accurately identifies input signals across all categories and effectively distinguishes between the different fault types, which supports the reliability of the proposed model.

4.4. Case 2: Laboratory Bearing Dataset

To comprehensively evaluate the generalization ability of the proposed model, it is also tested on the laboratory bearing dataset. As with the CWRU dataset, the number of healthy samples is fixed at 100, while the number of fault samples per class is gradually reduced from 50 to 10 across the groups. To reduce the impact of random variability, each group of experiments is run with five different random seeds. The experimental results are presented in Figure 16.

Figure 16 demonstrates that the model maintains strong diagnostic performance even when the sample size is significantly reduced. In particular, under the challenging condition where each fault type has only 10 samples, Method 4 still achieves an accuracy of 93.59%, which underscores its advantage in feature extraction over the other models. Similarly, to assess the noise resistance of each model on the laboratory dataset, dataset J (50 samples for each fault type and 100 healthy samples) is selected for the robustness test. The results presented in Figure 17 indicate that, in the challenging environment with an SNR of −2 dB, the diagnostic accuracies of all four models decrease. However, the diagnostic accuracy of the CTFAT model remains higher than that of the other models.

To further investigate the classification performance of each model under noise-free conditions, confusion matrices and t-SNE visualization are used to compare the models on dataset J. The details are given in Figure 18 and Figure 19. As shown in Figure 18, several comparison methods identify the healthy class well when processing long-tail distribution data. Some mild fault classes can also be identified effectively even without enough training samples, because the damage is shallow and the signal waveform is not complicated. However, for more complex fault categories, Method 1 struggles to differentiate them when the sample size is insufficient and the overall dataset follows a long-tail distribution. Method 2 tends to confuse moderate inner race faults with severe outer race faults to some extent. Method 3 shows uncertainty in distinguishing between moderate and mild rolling element faults. Method 4 achieves the highest test accuracy when trained on long-tail distribution data, demonstrating its superior performance in such scenarios. Figure 19 presents the t-SNE visualization of the features learned by the four methods, clearly indicating that Method 4 produces the best feature clustering.

To reveal the feature classification effect of each layer of the CTFAT model and verify the validity of each module, ablation experiments are conducted on the real–imaginary attention (RIA) and multi-head self-attention (MHA) modules. The model without the RIA module is denoted as TAT, and the model without the MHA module is denoted as FAT. Ablation experiments are performed on dataset J for the CTFAT model and these two variants, with accuracy, precision, recall, and specificity used as evaluation metrics. In addition, t-SNE dimensionality reduction is applied to the output of each layer of the CTFAT model. The metrics are listed in Table 6, and the layer-wise feature visualization is shown in Figure 20. As shown in Table 6, under the long-tail distribution scenario, the TAT model containing only the MHA module performs poorly, with an accuracy of only 62.81%, and it lags behind the other models on the remaining metrics. In contrast, the FAT model containing only the RIA module performs strongly, with an accuracy of 95.40%. This result shows that the RIA module can effectively use frequency-domain information for feature extraction under a long-tail distribution, thereby improving diagnostic accuracy. However, lacking time-domain attention, the FAT model still fails to recognize some samples. The CTFAT model combines the advantages of RIA and MHA, making full use of both time-domain characteristics and frequency-domain information, and reaches an accuracy of 99.06%.
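The four evaluation metrics can be computed from a multi-class confusion matrix as sketched below. The text does not state how the per-class values are averaged, so macro averaging (the unweighted mean over classes) is an assumption here; the function name `macro_metrics` and the 3-class toy matrix are hypothetical and do not reproduce the values in Table 6.

```python
import numpy as np

def _safe_div(num, den):
    """Element-wise division that returns 0 where the denominator is 0."""
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

def macro_metrics(cm):
    """Accuracy plus macro-averaged precision, recall, and specificity.

    `cm[i, j]` counts samples of true class i predicted as class j.
    """
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp      # predicted as the class but belonging elsewhere
    fn = cm.sum(axis=1) - tp      # belonging to the class but predicted elsewhere
    tn = cm.sum() - tp - fp - fn
    return {
        "accuracy": float(tp.sum() / cm.sum()),
        "precision": float(np.mean(_safe_div(tp, tp + fp))),
        "recall": float(np.mean(_safe_div(tp, tp + fn))),
        "specificity": float(np.mean(_safe_div(tn, tn + fp))),
    }

# Toy long-tail example: one majority class (50 samples) and two minority classes (10 each).
cm = [[50, 0, 0],
      [2, 8, 0],
      [0, 1, 9]]
m = macro_metrics(cm)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

Under a long-tail distribution, overall accuracy is dominated by the majority class, whereas macro-averaged recall weights each fault class equally, which is one reason to report several metrics side by side.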

Figure 20 provides an intuitive view of the feature distribution at each layer. After the input signal passes through the patch embedding layer, the features do not form a clear structure and the overall distribution is disordered. This is because samples of the majority class dominate the attention allocation process, causing the model to place excessive emphasis on the general temporal patterns of healthy signals. In contrast, samples of the minority fault classes are scarce, and the model has difficulty learning their unique features. After the signals are processed by the real attention layer and the imaginary attention layer, which integrate the amplitude and phase features of the signal, the separation of most samples improves significantly and the boundaries between classes become clearer. However, a small number of sample points are still assigned to other classes. After CTFAM (RIA + MHA) processing, the features of different classes are clearly separated in space, and features of the same class are tightly aggregated, with almost no outliers. These results indicate that CTFAT has superior feature learning ability under a long-tail distribution.

5. Conclusions

This study proposes a long-tailed intelligent fault diagnosis framework based on the coupled time–frequency domain attention mechanism, focusing on the long-tailed distribution problem of fault diagnosis data and the diagnostic challenges encountered in noisy environments. The framework achieves a core technological innovation through the real–imaginary attention mechanism and the coupled time–frequency domain architecture. First, the Fast Fourier Transform is applied to the signal, and both the real part and the imaginary part are completely retained. The frequency domain features are captured through the real–imaginary attention mechanism, solving the problem of insufficient feature extraction in a single signal domain. Subsequently, through linear serial coupling, the fault features, such as amplitude changes and periodic shocks reflected in the time domain, are deeply integrated with the characteristic frequency components and phase information in the frequency domain, effectively dealing with the interference of strong noise and the insufficiency of single time–frequency domain information. The experimental results show that the diagnostic accuracy and generalization performance of this framework on two standard bearing datasets are superior to those of the comparative methods. The ablation experiment further verifies the effectiveness of the modules, providing a new solution for fault diagnosis in long-tailed data scenarios.

Although this study demonstrates good performance in theoretical verification and on standard datasets, the current framework still has certain limitations: it does not address the long-term data monitoring of actual industrial equipment, and the bearing fault data under real working conditions was not used for verification. Future work will focus on this issue to promote its implementation in practical industrial scenarios and improve the practicality and diagnostic accuracy of the model under complex working conditions.

Author Contributions

Conceptualization, Y.Z. and L.Z.; methodology, Y.Z.; software, Y.Z.; validation, Y.Z., T.R. and H.L. (Hongsheng Li); formal analysis, Y.Z.; investigation, L.Z.; resources, L.Z. and H.L. (Hao Luo); data curation, L.Z.; writing—original draft preparation, Y.Z.; writing—review and editing, Y.Z. and L.Z.; visualization, Y.Z.; supervision, L.Z.; project administration, H.L. (Hao Luo); funding acquisition, H.L. (Hao Luo). All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key R&D Program of China under the grants 2024YFC3013901 and 2024YFC3013903, the National Natural Science Foundation of China under grant 52427805, and the Liaoning Provincial Department of Education’s Science and Technology Research Project under grant LKMZ20220450.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used in this study can be requested from the corresponding author. Due to confidentiality requirements in the laboratory where the testing equipment is located, these data are not publicly disclosed.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

This table lists the acronyms used in the article.

Acronym	Meaning
FFT	Fast Fourier Transform
CWRU	Case Western Reserve University
CNN	Convolutional Neural Networks
PCA	Principal Component Analysis
SVM	Support Vector Machine
PSO	Particle Swarm Optimization
RF	Random Forest
RNN	Recurrent Neural Networks
GAN	Generative Adversarial Networks
VMD	Variational Mode Decomposition
SDP	Symmetric Dot Pattern
CA	Coordinate Attention
GRU	Gated Recurrent Unit
BiGRU	Bidirectional Gated Recurrent Units
SSDL	Self-training Semi-supervised Deep Learning
MCSwin-T	Multi-channel Calibrated Transformer with Shifted Windows
CTFAM	Coupled Time–Frequency Attention Mechanism
CTFAT	Coupled Time–Frequency Attention Transformer
ViT	Vision Transformer
RIA	Real-Imaginary Attention
LN	Layer Normalization
IFFT	Inverse Fast Fourier Transform
IR	Inner Race
OR	Outer Race
RE	Rolling Elements

References

1. Chen, Z.X.; Yang, Y.; He, C.B.; Liu, Y.B.; Liu, X.Z.; Cao, Z. Feature Extraction Based on Hierarchical Improved Envelope Spectrum Entropy for Rolling Bearing Fault Diagnosis. IEEE Trans. Instrum. Meas. 2023, 72, 1–12.
2. Liu, G.Z.; Wu, L.F. Incremental bearing fault diagnosis method under imbalanced sample conditions. Comput. Ind. Eng. 2024, 192, 110203.
3. Wang, G.; Liu, D.D.; Cui, L.L. Auto-Embedding Transformer for Interpretable Few-Shot Fault Diagnosis of Rolling Bearings. IEEE Trans. Reliab. 2024, 73, 1270–1279.
4. Wu, X.G.; Peng, H.D.; Cui, X.Y.; Guo, T.W.; Zhang, Y.T. Multichannel Vibration Signal Fusion Based on Rolling Bearings and MRST-Transformer Fault Diagnosis Model. IEEE Sens. J. 2024, 24, 16336–16346.
5. Zhao, K.C.; Xiao, J.Q.; Li, C.; Xu, Z.F.; Yue, M.N. Fault diagnosis of rolling bearing using CNN and PCA fractal based feature extraction. Measurement 2023, 223, 113754.
6. Wu, Y.Q.; Dai, J.Y.; Yang, X.Q.; Shao, F.M.; Gong, J.C.; Zhang, P.; Liu, S.D. The Fault Diagnosis of Rolling Bearings Based on FFT-SE-TCN-SVM. Actuators 2025, 14, 152.
7. Aburakhia, S.A.; Myers, R.; Shami, A. A Hybrid Method for Condition Monitoring and Fault Diagnosis of Rolling Bearings With Low System Delay. IEEE Trans. Instrum. Meas. 2022, 71, 1–13.
8. Hakim, M.; Omran, A.A.B.; Ahmed, A.N.; Al-Waily, M.; Abdellatif, A. A systematic review of rolling bearing fault diagnoses based on deep learning and transfer learning: Taxonomy, overview, application, open challenges, weaknesses and recommendations. Ain Shams Eng. J. 2023, 14, 101945.
9. Chen, X.H.; Zhang, B.K.; Gao, D. Bearing fault diagnosis base on multi-scale CNN and LSTM model. J. Intell. Manuf. 2021, 32, 971–987.
10. Zhao, X.; Chen, S.; Gao, K.; Luo, L. Bidirectional Recurrent Neural Network based on Multi-Kernel Learning Support Vector Machine for Transformer Fault Diagnosis. Int. J. Adv. Comput. Sci. Appl. 2023, 14, 125–135.
11. Shao, H.; Li, W.; Cai, B.; Wan, J.; Xiao, Y.; Yan, S. Dual-Threshold Attention-Guided GAN and Limited Infrared Thermal Images for Rotating Machinery Fault Diagnosis Under Speed Fluctuation. IEEE Trans. Ind. Inform. 2023, 19, 9933–9942.
12. Zhi, S.D.; Su, K.Y.; Yu, J.; Li, X.Y.; Shen, H.K. An unsupervised transfer learning bearing fault diagnosis method based on multi-channel calibrated Transformer with shiftable window. Struct. Health Monit. Int. J. 2025, 34, 14759217251324671.
13. Zhang, J.B.; Zhao, Z.Q.; Jiao, Y.H.; Zhao, R.C.; Hu, X.L.; Che, R.W. DPCCNN: A new lightweight fault diagnosis model for small samples and high noise problem. Neurocomputing 2025, 626, 129526.
14. Wang, Z.Y.; Xu, X.; Song, D.L.; Zheng, Z.J.; Li, W.D. A Novel Bearing Fault Diagnosis Method Based on Improved Convolutional Neural Network and Multi-Sensor Fusion. Machines 2025, 13, 216.
15. Mansouri, M.; Dhibi, K.; Hajji, M.; Bouzara, K.; Nounou, H.; Nounou, M. Interval-Valued Reduced RNN for Fault Detection and Diagnosis for Wind Energy Conversion Systems. IEEE Sens. J. 2022, 22, 13581–13588.
16. Niu, J.; Pan, J.; Qin, Z.; Huang, F.; Qin, H. Small-Sample Bearings Fault Diagnosis Based on ResNet18 with Pre-Trained and Fine-Tuned Method. Appl. Sci. 2024, 14, 5360.
17. Zhou, F.N.; Yang, S.; Fujita, H.; Chen, D.M.; Wen, C.L. Deep learning fault diagnosis method based on global optimization GAN for unbalanced data. Knowl. Based Syst. 2020, 187, 104837.
18. Chen, Y.S.; Qiang, Y.K.; Chen, J.H.; Yang, J.L. FMRGAN: Feature Mapping Reconstruction GAN for Rolling Bearings Fault Diagnosis Under Limited Data Condition. IEEE Sens. J. 2024, 24, 25116–25131.
19. Peng, P.; Lu, J.X.; Tao, S.T.; Ma, K.; Zhang, Y.; Wang, H.W.; Zhang, H.M. Progressively Balanced Supervised Contrastive Representation Learning for Long-Tailed Fault Diagnosis. IEEE Trans. Instrum. Meas. 2022, 71, 1–12.
20. Huang, M.; Sheng, C.X. Adaptive-conditional loss and correction module enhanced informer network for long-tailed fault diagnosis of motor. J. Comput. Des. Eng. 2024, 11, 306–318.
21. Luo, H.; Wang, X.Y.; Zhang, L. Normalization-Guided and Gradient-Weighted Unsupervised Domain Adaptation Network for Transfer Diagnosis of Rolling Bearing Faults Under Class Imbalance. Actuators 2025, 14, 39.
22. Jian, C.X.; Mo, G.P.; Peng, Y.H.; Ao, Y.H. Long-tailed multi-domain generalization for fault diagnosis of rotating machinery under variable operating conditions. Struct. Health Monit. Int. J. 2024, 24, 1927–1945.
23. Zhang, X.; He, C.; Lu, Y.P.; Chen, B.A.; Zhu, L.; Zhang, L. Fault diagnosis for small samples based on attention mechanism. Measurement 2022, 187, 110242.
24. Liu, Y.; Wen, W.G.; Bai, Y.H.; Meng, Q.Z. Self-supervised feature extraction via time-frequency contrast for intelligent fault diagnosis of rotating machinery. Measurement 2023, 210, 112551.
25. Long, J.Y.; Chen, Y.B.; Yang, Z.; Huang, Y.W.; Li, C. A novel self-training semi-supervised deep learning approach for machinery fault diagnosis. Int. J. Prod. Res. 2023, 61, 8238–8251.
26. Chen, Z.H.; Chen, J.L.; Liu, S.; Feng, Y.; He, S.L.; Xu, E.Y. Multi-channel Calibrated Transformer with Shifted Windows for few-shot fault diagnosis under sharp speed variation. ISA Trans. 2022, 131, 501–515.
27. Wang, H.; Liu, Z.L.; Peng, D.D.; Cheng, Z. Attention-guided joint learning CNN with noise robustness for bearing fault diagnosis and vibration signal denoising. ISA Trans. 2022, 128, 470–484.
28. Chen, B.A.; Liu, T.T.; He, C.; Liu, Z.C.; Zhang, L. Fault Diagnosis for Limited Annotation Signals and Strong Noise Based on Interpretable Attention Mechanism. IEEE Sens. J. 2022, 22, 11865–11880.
29. Ding, Y.F.; Jia, M.P.; Miao, Q.H.; Cao, Y.D. A novel time-frequency Transformer based on self-attention mechanism and its application in fault diagnosis of rolling bearings. Mech. Syst. Signal Process. 2022, 168, 108616.
30. Qin, Z.Q.; Zhang, P.Y.; Wu, F.; Li, X. FcaNet: Frequency Channel Attention Networks. In Proceedings of the 18th IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 763–772.
31. Zhao, H.S.; Jia, J.; Koltun, V. Exploring Self-attention for Image Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 10073–10082.
32. Han, K.; Wang, Y.; Chen, H.; Chen, X.; Guo, J.; Liu, Z.; Tang, Y.; Xiao, A.; Xu, C.; Xu, Y.; et al. A Survey on Vision Transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 87–110.
33. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 9992–10002.
34. Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 548–558.
35. Bai, X.; Ma, Z.; Meng, G. Bearing Fault Diagnosis Based on Wavelet Transform and Residual Shrinkage Network. In Proceedings of the 2022 International Conference on Computer Network, Electronic and Automation (ICCNEA), Xi’an, China, 23–25 September 2022; pp. 373–378.
36. Liu, Y. One-level Stationary Wavelet Packet Transform & Hilbert Transform based Rolling Bearing Fault Diagnosis. In Proceedings of the 2018 IEEE International Conference on Information and Automation (ICIA), Wuyishan, China, 11–13 August 2018; pp. 1475–1479.
37. Yang, C.P.; Qiao, Z.J.; Zhu, R.H.; Xu, X.F.; Lai, Z.H.; Zhou, S.T. An Intelligent Fault Diagnosis Method Enhanced by Noise Injection for Machinery. IEEE Trans. Instrum. Meas. 2023, 72, 1–11.
38. Smith, W.A.; Randall, R.B. Rolling element bearing diagnostics using the Case Western Reserve University data: A benchmark study. Mech. Syst. Signal Process. 2015, 64–65, 100–131.
39. Zhang, L.; Gu, S.; Luo, H.; Ding, L.; Guo, Y. Residual Shrinkage ViT with Discriminative Rebalancing Strategy for Small and Imbalanced Fault Diagnosis. Sensors 2024, 24, 890.

Figure 1. (a) Self-attention mechanism and (b) multi-head self-attention mechanism.

Figure 2. ViT, Swin Transformer, and Pyramid ViT feature extraction process diagram.

Figure 3. Schematic of the RIA structure.

Figure 4. Structure diagram of CTFAM.

Figure 5. Diagram of CTFAT.

Figure 6. Flowchart of the proposed intelligent fault diagnosis method.

Figure 7. CWRU bearing fault experimental platform [38].

Figure 8. Time-domain signal waveform diagrams of different types of bearings in the CWRU database.

Figure 9. Bearing fault test platform in the laboratory [39].

Figure 10. Sampling using a sliding window.

Figure 11. Accuracy for different batch sizes.

Figure 12. Variation of accuracy of each load in the CWRU dataset.

Figure 13. Diagnostic accuracy of different models across working conditions.

Figure 14. Comparison diagram of anti-noise experiments of various models in the CWRU dataset.

Figure 15. Confusion matrix of each method in the CWRU dataset.

Figure 16. Variation of the accuracy of each model in the laboratory dataset.

Figure 17. Comparison diagram of anti-noise experiments of various models in the laboratory dataset.

Figure 18. Confusion matrix of various methods in the laboratory dataset.

Figure 19. Visualization of the t-SNE features of each method.

Figure 20. Comparison of noise suppression experiments for various models on the laboratory dataset.

Table 1. Parameters of SKF6205 bearing.

Structural Name	Inner Diameter/mm	Outer Diameter/mm	Rolling Element Diameter/mm	Pitch Diameter/mm	Limiting Speed/(r/min)	Number of Balls	Weight/kg
Parameters	25	52	7.94	39.04	14,000	9	0.132

Table 2. The division of CWRU datasets.

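Type of Fault	Degree of Damage/mm	Label	Number of Training Samples	Datasets
Normal	0	0	100	A/B/C/D/E
Inner race fault	0.178	1	10/20/30/40/50	A/B/C/D/E
Inner race fault	0.356	2	10/20/30/40/50	A/B/C/D/E
Inner race fault	0.533	3	10/20/30/40/50	A/B/C/D/E
Outer race fault	0.178	4	10/20/30/40/50	A/B/C/D/E
Outer race fault	0.356	5	10/20/30/40/50	A/B/C/D/E
Outer race fault	0.533	6	10/20/30/40/50	A/B/C/D/E
Ball fault	0.178	7	10/20/30/40/50	A/B/C/D/E
Ball fault	0.356	8	10/20/30/40/50	A/B/C/D/E
Ball fault	0.533	9	10/20/30/40/50	A/B/C/D/E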

Table 3. Sample status of laboratory fault datasets [39].

Type of Fault	Motor Speed	Fault Position
Normal	1000 r/min
Inner race fault	1000 r/min
Outer race fault	1000 r/min
Ball fault	1000 r/min

Table 4. Division of laboratory datasets.

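Type of Fault	Degree of Damage	Label	Number of Training Samples	Datasets
Normal	None	0	100	F/G/H/I/J
Inner race fault	Slight	1	10/20/30/40/50	F/G/H/I/J
Inner race fault	Moderate	2	10/20/30/40/50	F/G/H/I/J
Inner race fault	Severe	3	10/20/30/40/50	F/G/H/I/J
Outer race fault	Slight	4	10/20/30/40/50	F/G/H/I/J
Outer race fault	Moderate	5	10/20/30/40/50	F/G/H/I/J
Outer race fault	Severe	6	10/20/30/40/50	F/G/H/I/J
Ball fault	Slight	7	10/20/30/40/50	F/G/H/I/J
Ball fault	Moderate	8	10/20/30/40/50	F/G/H/I/J
Ball fault	Severe	9	10/20/30/40/50	F/G/H/I/J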

Table 5. Comparison of parameters of each model.

Model	Method 1	Method 2	Method 3	Method 4
Parameters	6.28M	10.42M	26.25M	2.34M

Table 6. Results of the CTFAT ablation experiment on the laboratory dataset.

Model Name	Included Module	Accuracy/%	Precision	Recall	Specificity
TAT	MHA	62.81	0.6330	0.6280	0.9587
FAT	RIA	95.40	0.9568	0.9540	0.9950
CTFAT	RIA + MHA	99.06	0.9909	1.000	0.9989

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
