1. Introduction
With the rapid advancement of Industry 4.0, the demand for health monitoring and operational stability of intelligent industrial equipment has increased significantly [1,2]. As core components of rotating machinery, rolling bearings directly affect the safety and stability of industrial systems. Because they operate under variable speed and load conditions, bearings are susceptible to a variety of faults. Furthermore, when localized faults occur, the resulting transient pulses are often masked by interference noise and harmonic components, making them difficult to detect through vibration signal analysis and posing substantial risks to industrial machinery [3,4,5]. Therefore, the monitoring and fault diagnosis of rolling bearings is critical for ensuring the safe and reliable operation of industrial equipment [6,7].
Traditional fault diagnosis methods typically rely on the analysis of 1D vibration signals from sensors, requiring professionals to manually extract and assess signal features before designing classifiers [8,9,10]. These approaches have achieved high accuracy by utilizing entropy features, autocorrelation analysis, or specialized classifiers such as the U-SVM. However, they are often time-consuming and labor-intensive, which limits their adaptability to large-scale or real-time industrial applications.
The success of convolutional neural networks (CNNs) has spurred the development of data-driven fault diagnosis frameworks. For example, Wang et al. [11] integrated CNNs with hidden Markov models to enhance the classification of multi-fault signals. Gan et al. [12] introduced a hierarchical diagnosis network (HDN) using deep belief networks for layered fault recognition. Shao et al. [13] combined compressed sensing with convolutional deep belief networks for efficient feature learning. Wang et al. [14] proposed a multi-sensor fusion CNN that improves diagnostic precision by incorporating diverse signal sources. These methods automate feature extraction and achieve state-of-the-art performance when sufficient training data are available. Moreover, recent works have proposed more specialized architectures for interpretable or compound fault diagnosis. Li et al. [15] proposed WavCapsNet, an interpretable method capable of intelligently diagnosing compound faults in vibration signals. Wen et al. [16] introduced a hierarchical convolutional neural network (HCNN) capable of simultaneously identifying both the type and severity of faults. Ma et al. [17] proposed a hybrid feature transformation approach that improves fault diagnosis performance by combining random forests with autoencoders. However, most of these methods are sample-driven, focusing on model selection, network architecture, and hyperparameter tuning. When sample data are limited, such models often fail to achieve the desired detection accuracy due to insufficient training data, falling short of practical requirements.
Despite the challenges posed by limited or imbalanced data, the data generation capabilities of generative adversarial networks (GANs) offer promising solutions for fault diagnosis [18]. The underlying principle of GANs is that a generator and a discriminator optimize each other through a zero-sum game, ultimately converging to a Nash equilibrium [19,20]. For instance, Yang et al. [21] proposed a feature fusion GAN with embedded category constraints to improve small-sample generation fidelity. Guo et al. [22] utilized an ACGAN variant for multi-label fault generation, enhancing classifier performance. Shao et al. [23] designed a 1D CNN-based ACGAN for realistic signal synthesis. Qin et al. [24] incorporated attention modules into GANs to handle multi-sensor and compound fault scenarios effectively. Furthermore, several targeted GAN-based frameworks have been proposed for specific applications. Gao et al. [25] introduced ICoT-GAN for bearing fault diagnosis under limited data, using global–local feature fusion. Yang et al. [26] proposed CGAN-2-D-CNN, coupling data augmentation with a 2D CNN classifier. Liu et al. [27] presented a hybrid GAN-capsule network to balance sample distribution and enhance feature discrimination.
These developments demonstrate the increasing integration of generative modeling with traditional and deep learning-based diagnostics, offering a promising pathway for overcoming data scarcity and imbalance in real-world fault diagnosis scenarios.
While existing studies have addressed the issues of data scarcity and imbalance to some extent through variants of GANs, several limitations remain:
(1) Most current approaches primarily rely on simple statistical features derived from time-domain signals, overlooking the critical role that frequency and time-frequency domain features play in characterizing fault modes. This single-feature dependence hinders generative networks from effectively capturing both global features and the fine-grained feature distributions associated with complex fault modes, resulting in generated samples that inadequately represent the diversity and complexity of fault signals.
(2) Many existing generative networks employ traditional convolutional structures that are capable of capturing local features but are limited in modeling the cross-domain dependencies and intricate features of high-dimensional vibration signals. This limitation becomes particularly pronounced in the case of multimodal faults, where the generated samples fail to reflect the complex interdependencies of real-world signals.
(3) Most GANs rely on conventional loss functions (e.g., Jensen–Shannon divergence), which are susceptible to issues such as gradient vanishing and mode collapse during adversarial training, leading to inconsistencies in the quality of generated samples. Moreover, the inherent conflict between the discriminator’s classification task and its adversarial discrimination role further compromises both the quality and diversity of the generated samples.
To address the aforementioned challenges, we propose a Multi-Domain Feature Transformer Generative Adversarial Network (MDFT-GAN). This framework leverages multi-domain feature fusion, an enhanced network structure, and an improved training mechanism to significantly enhance the quality and diversity of generated samples, overcoming the limitations of existing methods. The main contributions of this paper are as follows:
(1) We introduce a multi-domain information fusion strategy that effectively combines time-domain, frequency-domain, and time-frequency domain features, capitalizing on their complementarity. This approach provides comprehensive feature support for generating complex fault modes, thereby improving the completeness and diversity of the generated samples.
(2) To enhance the generative network’s ability to model both global dependencies and local features, we design a network structure that combines convolutional layers with a Transformer encoder. A multi-head self-attention mechanism and channel attention are incorporated to refine the quality and feature representation of the generated samples. Additionally, we introduce an adversarial loss based on Wasserstein distance, supplemented with a gradient penalty mechanism, to significantly improve training stability while maintaining an auxiliary classification loss.
(3) We conduct a comprehensive comparative analysis using two publicly available bearing fault datasets. The results demonstrate the superiority of MDFT-GAN in terms of both the quality and diversity of the generated samples. Experimental findings further confirm that the proposed method outperforms mainstream generative adversarial networks, exhibiting higher accuracy and robustness in fault diagnosis tasks.
(4) To enhance the transparency and interpretability of the diagnostic process, we further incorporate a Grad-CAM-based visual interpretability framework. This enables visualization of hierarchical feature activations in the discriminator and classifier, offering insight into the learned fault representations and supporting the explainability of decision boundaries across different fault types.
The remainder of this paper is organized as follows: Section 2 outlines the theoretical foundations relevant to the research; Section 3 provides a detailed description of the proposed method; Section 4 presents experimental validation of the method's effectiveness; and Section 5 concludes the paper, discussing potential directions for future research.
3. The Proposed Method
3.1. Motivation
To address the data imbalance and complex feature extraction problems in bearing fault diagnosis, this paper proposes a diagnosis method based on MDFT-GAN. First, to alleviate the problems of limited and imbalanced data, this paper designs a novel generative adversarial network model, MDFT-GAN. Second, to further improve the fault classification accuracy, this paper proposes an improved classifier model based on residual blocks and a hybrid vision transformer.
3.2. Proposed Fault Diagnosis Method Based on MDFT-GAN
Figure 3 illustrates the overall framework of the fault diagnosis method based on MDFT-GAN. The method consists of three components: multi-domain data fusion, the data augmentation model, and the fault classification model. These components work together to achieve multi-domain feature representation, high-quality sample generation, and accurate fault classification.
The multi-domain feature representation process converts a one-dimensional vibration signal into a two-dimensional RGB image by extracting time-domain signals, frequency-domain features using the FFT, and time-frequency domain features via the STFT. This fusion retains the signal’s local details while capturing its global distribution characteristics, providing high-quality input for downstream tasks.
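As a concrete illustration of this fusion step, the sketch below stacks time-domain, FFT, and STFT views of one segment into the three channels of an RGB image; the window lengths, image size, and resizing strategy are illustrative assumptions, not the exact settings used in this paper.

```python
import numpy as np
from scipy.signal import stft

def multidomain_rgb(segment, img_size=64, fs=12000):
    """Fuse time, frequency, and time-frequency views of a 1D segment
    into a (img_size, img_size, 3) uint8 RGB image (illustrative sketch)."""
    # Time-domain channel: reshape the raw 4096-point segment into a square matrix
    td = np.resize(segment, (img_size, img_size))

    # Frequency-domain channel: FFT magnitude spectrum, tiled/truncated to a square
    spec = np.abs(np.fft.rfft(segment))
    fd = np.resize(spec, (img_size, img_size))

    # Time-frequency channel: STFT magnitude, crudely resized for illustration
    _, _, Z = stft(segment, fs=fs, nperseg=256, noverlap=128)
    tf = np.resize(np.abs(Z), (img_size, img_size))

    def to_uint8(x):
        # Normalize each channel independently to [0, 255]
        x = (x - x.min()) / (x.max() - x.min() + 1e-12)
        return (x * 255).astype(np.uint8)

    return np.stack([to_uint8(td), to_uint8(fd), to_uint8(tf)], axis=-1)

# Example: one 4096-point vibration segment -> 64x64 RGB image
rgb = multidomain_rgb(np.random.randn(4096))
print(rgb.shape, rgb.dtype)   # (64, 64, 3) uint8
```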
The data augmentation model employs the MDFT-GAN structure, comprising a generator (G) and a discriminator (D). The generator produces high-quality samples by modeling multi-domain features, while the discriminator enhances the quality of sample generation through adversarial training. The model utilizes multi-level MDFT modules to extract both global and local features, accurately capturing the data distribution and alleviating the impact of sample imbalance on classification performance.
The fault classification model is structured as a hybrid feature extraction framework, incorporating a front-end convolutional module, a Transformer encoder, and an improved classification layer. Local features are extracted through the front-end convolutional module, while the Transformer encoder models global dependencies using a self-attention mechanism. Finally, the enhanced classification head ensures high-precision predictions across multiple fault categories.
In summary, this method enables efficient modeling and diagnosis of vibration signals by leveraging multi-domain information fusion, data augmentation, and classification framework optimization.
3.3. The Design of the Data Augmentation Model
Although ACGAN offers certain advantages in ensuring sample category consistency, it has significant limitations when dealing with complex working condition data: (1) it struggles to effectively capture both the global and local characteristics of the data; (2) under conditions of severe sample category imbalance, the generated samples fail to fully represent the characteristics of the real data. These limitations are particularly pronounced in complex multi-category fault scenarios, where the inability to accurately model the distribution of different categories in the generated samples reduces the robustness of the classification model.
To address these shortcomings, this paper introduces MDFT-GAN, designed to generate high-resolution, high-fidelity samples that meet the generation requirements of multi-class complex fault data. The structure of MDFT-GAN is illustrated in Figure 4.
The generator input consists of two components: (a) a noise vector, which is a randomly sampled latent vector used to introduce diversity into the generated samples, and (b) a category label vector, which is mapped to a high-dimensional feature vector through an embedding layer. The parameter matrix of the embedding layer is defined as

$$E \in \mathbb{R}^{N_c \times d_e},$$

where $N_c$ is the number of categories and $d_e$ is the embedding dimension. The noise vector $Z$ and category vector $C$ are element-wise multiplied to form joint features, which are then converted into a four-dimensional tensor. This tensor serves as the initial input to the MDFT module in the generator.
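A minimal PyTorch sketch of this input construction is shown below; the latent dimension, class count, and initial spatial size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GeneratorInput(nn.Module):
    """Combine a noise vector and a class label into the 4D tensor
    that feeds the first generator block (illustrative sketch)."""
    def __init__(self, z_dim=100, n_classes=10, init_ch=256, init_hw=4):
        super().__init__()
        self.embed = nn.Embedding(n_classes, z_dim)           # E in R^{N_c x d_e}
        self.project = nn.Linear(z_dim, init_ch * init_hw * init_hw)
        self.init_ch, self.init_hw = init_ch, init_hw

    def forward(self, z, labels):
        joint = z * self.embed(labels)                        # element-wise fusion
        x = self.project(joint)                               # flat joint feature
        return x.view(-1, self.init_ch, self.init_hw, self.init_hw)

z = torch.randn(8, 100)
labels = torch.randint(0, 10, (8,))
print(GeneratorInput()(z, labels).shape)   # torch.Size([8, 256, 4, 4])
```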
The MDFT module aims to construct an efficient and generalizable generative feature representation framework. By synergizing convolutional operations with a Transformer encoder, the module introduces a multi-domain feature modeling strategy, achieving a deep integration of local details and global information, rather than merely stacking modules. The MDFT module first extracts local features and retains spatial information using 2D convolution. Simultaneously, it innovatively integrates an attention mechanism with dynamic feature reallocation to address the challenge of high-dimensional feature interaction in complex data distributions.
Specifically, after convolutional feature extraction, the MDFT module introduces multi-head attention (MHA), which dynamically adjusts feature weights between different channels according to the uneven distribution of global features. This approach not only captures the global dependencies of the generated features but also enhances the expressiveness of sparse features by optimizing the attention distribution, addressing the limitations of traditional attention mechanisms in processing high-dimensional data. Furthermore, the module incorporates efficient channel attention (ECA) to refine feature interactions between channels. This not only improves the consistency of the generated samples under different category conditions but also significantly enhances feature representation in multi-class imbalance scenarios.
In addition, the MDFT module employs a unique normalization and feature mapping strategy to address the issue of modal mismatch in multi-domain feature fusion. The module utilizes multi-level feature modeling to efficiently capture both the global and local distributions of complex signals, providing a strong conditional constraint for the category relevance of generated samples. This design transcends traditional module stacking by seamlessly integrating feature extraction, enhancement, and redistribution, ensuring the stability and adaptability of MDFT-GAN in generating high-quality samples.
For an input feature $X \in \mathbb{R}^{C \times H \times W}$, each channel is first compressed along the spatial dimension using Global Average Pooling (GAP) to obtain a channel feature vector $Y$. The process is as follows:

$$y_c = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} x_c(i, j),$$

where $C$ is the number of channels and $H$ and $W$ are the height and width of the feature map, respectively. Next, the channel features $Y$ are reshaped into a sequence format suitable for MHA calculations, serving as the input for subsequent dynamic modeling. To enable dynamic modeling of channel dependencies, MHA constructs the query $Q$, key $K$, and value $V$ through linear mappings, as follows:

$$Q = YW^{Q}, \quad K = YW^{K}, \quad V = YW^{V}, \qquad \mathrm{head}_i = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right)V_i,$$

where $d_k$ represents the dimension of each attention head, $h$ is the number of attention heads, and $W^{Q}, W^{K}, W^{V}$ are the linear projection matrices. MHA captures long-distance dependencies between channels while incorporating multi-scale information representation capabilities. To further optimize the feature distribution, a normalization layer (NL) is applied before MHA to enhance training efficiency and mitigate overfitting. The process is as follows:

$$A = \mathrm{MHA}\big(\mathrm{NL}(Y)\big),$$

where $Y$ and $A$ represent the embedding sequence and the MHA output, respectively, and $\mathrm{NL}(\cdot)$ denotes the normalization operation. To improve feature selection accuracy, the ECA module optimizes feature weighting using an adaptive weight allocation mechanism, expressed as follows:

$$\omega = \sigma\big(\mathrm{Conv1D}_k(A)\big), \qquad \tilde{X} = \omega \odot X,$$

where $\sigma$ denotes the Sigmoid activation function and $\tilde{X}$ is the output feature after channel attention weighting.
By combining MHA’s modeling of global channel dependencies with ECA’s dynamic weighting of channel features, the MDFT module effectively captures comprehensive feature information from the wide-area fused image.
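A compact sketch of this channel-attention path (GAP, normalization, MHA over the channel sequence, and ECA-style gating) is given below; the embedding width, head count, and kernel size are assumptions for illustration rather than the exact MDFT configuration.

```python
import torch
import torch.nn as nn

class MDFTChannelAttention(nn.Module):
    """GAP -> embed -> LayerNorm -> MHA over the channel sequence ->
    ECA-style 1D-conv gating of the feature map (illustrative sketch)."""
    def __init__(self, channels, d_model=8, n_heads=4, eca_kernel=3):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.embed = nn.Linear(1, d_model)        # lift each channel scalar
        self.norm = nn.LayerNorm(d_model)
        self.mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.proj = nn.Linear(d_model, 1)         # back to one weight per channel
        self.eca = nn.Conv1d(1, 1, eca_kernel, padding=eca_kernel // 2)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                               # x: (B, C, H, W)
        y = self.gap(x).flatten(2)                      # (B, C, 1): GAP vector Y
        y = self.norm(self.embed(y))                    # NL(Y)
        a, _ = self.mha(y, y, y)                        # A = MHA(NL(Y))
        w = self.proj(a).transpose(1, 2)                # (B, 1, C)
        w = self.sigmoid(self.eca(w)).transpose(1, 2)   # ECA gating -> (B, C, 1)
        return x * w.unsqueeze(-1)                      # channel-reweighted features

x = torch.randn(2, 64, 16, 16)
print(MDFTChannelAttention(64)(x).shape)   # torch.Size([2, 64, 16, 16])
```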
In summary, the generator and discriminator of MDFT-GAN are constructed using multiple MDFT modules, enabling the comprehensive representation of both global and local signal information. This design significantly improves generation quality and discrimination accuracy, providing robust support for high-quality sample generation and precise fault classification.
3.4. The Design of Classification Model
The main limitation of current bearing fault classification models is that they cannot effectively capture multi-scale features. Traditional CNNs can extract local features but struggle to model global information, whereas Transformers excel at capturing global features but underperform in extracting fine details. To address this problem, we improve the ViT model [30] and propose an enhanced hybrid vision transformer (EH-ViT) that combines the local feature extraction capability of CNNs with the global modeling advantage of the Transformer. The model extracts local features through a convolutional front-end, models global features using multi-head self-attention and feed-forward networks, and introduces a two-stage feature selection mechanism in the classification layer, which effectively improves the representation and classification of global features. In bearing fault diagnosis tasks, EH-ViT accurately captures the complex relationship between local and global features and demonstrates strong adaptability and robustness in the face of imbalanced data and diverse fault modes.
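The following condensed PyTorch sketch illustrates the general hybrid CNN + Transformer design described above; the layer counts, widths, and pooling strategy are assumptions and do not reproduce the exact EH-ViT architecture.

```python
import torch
import torch.nn as nn

class HybridViTClassifier(nn.Module):
    """Convolutional front-end for local features + Transformer encoder for
    global dependencies + MLP classification head (illustrative sketch)."""
    def __init__(self, n_classes=10, embed_dim=128, n_heads=4, depth=2):
        super().__init__()
        self.conv = nn.Sequential(                       # local feature extractor
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, embed_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        layer = nn.TransformerEncoderLayer(embed_dim, n_heads,
                                           dim_feedforward=256, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Sequential(                       # two-stage selection head
            nn.LayerNorm(embed_dim), nn.Linear(embed_dim, 64),
            nn.GELU(), nn.Linear(64, n_classes),
        )

    def forward(self, x):                                # x: (B, 3, 64, 64)
        f = self.conv(x)                                 # (B, D, 16, 16)
        tokens = f.flatten(2).transpose(1, 2)            # (B, 256, D) patch tokens
        g = self.encoder(tokens).mean(dim=1)             # pool over token positions
        return self.head(g)                              # class logits

logits = HybridViTClassifier()(torch.randn(4, 3, 64, 64))
print(logits.shape)   # torch.Size([4, 10])
```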
3.5. Model Training Procedure
The training process of the data augmentation model is shown in Algorithm 1.
Algorithm 1: Training Process of the MDFT-GAN

Require: training epochs $E$, batch size $B$, noise dimension $Z$, number of classes $C$, generator $G$, discriminator $D$, learning rates $\alpha_G$ and $\alpha_D$, gradient penalty coefficient $\lambda$, data loader
Ensure: trained generator $G$ and discriminator $D$
1: Initialize generator $G$ and discriminator $D$ with random weights
2: Initialize Adam optimizers for $G$ and $D$ with learning rates $\alpha_G$ and $\alpha_D$
3: for each epoch do
4:  for each batch do
5:   Sample real data: $(x_r, y_r)$ from the data loader
6:   Sample noise and labels: $z \sim \mathcal{N}(0, I)$, $y_f \sim p(y)$
7:   Generate fake data: $x_f = G(z, y_f)$
8:   (a) Train the discriminator:
9:    Discriminator outputs $\big(D_{adv}(x_r), D_{cls}(x_r)\big)$, $\big(D_{adv}(x_f), D_{cls}(x_f)\big)$
10:   Adversarial and auxiliary losses:
11:    $\mathcal{L}_{real} = -\mathbb{E}\big[D_{adv}(x_r)\big] + \mathrm{CE}\big(D_{cls}(x_r), y_r\big)$
12:    $\mathcal{L}_{fake} = \mathbb{E}\big[D_{adv}(x_f)\big] + \mathrm{CE}\big(D_{cls}(x_f), y_f\big)$
13:   Apply gradient penalty: $\mathcal{L}_{GP} = \lambda\,\mathbb{E}_{\hat{x}}\big[(\lVert \nabla_{\hat{x}} D_{adv}(\hat{x}) \rVert_2 - 1)^2\big]$
14:   Total loss of discriminator: $\mathcal{L}_D = \mathcal{L}_{real} + \mathcal{L}_{fake} + \mathcal{L}_{GP}$
15:   Update discriminator parameters: $\theta_D \leftarrow \theta_D - \alpha_D \nabla_{\theta_D}\mathcal{L}_D$
16:   (b) Train the generator (every $n_{critic}$ steps):
17:    Generate fake data: $x_f = G(z, y_f)$
18:    Generator output: $\big(D_{adv}(x_f), D_{cls}(x_f)\big)$
19:    Total loss of generator: $\mathcal{L}_G = -\mathbb{E}\big[D_{adv}(x_f)\big] + \mathrm{CE}\big(D_{cls}(x_f), y_f\big)$
20:   Update generator parameters: $\theta_G \leftarrow \theta_G - \alpha_G \nabla_{\theta_G}\mathcal{L}_G$
21:  end for
22: end for
23: Save the trained models: generator $G$, discriminator $D$
To ensure the effectiveness and robustness of both the data augmentation model and the fault classification model, this paper implements two independent training mechanisms.
The loss function of the discriminator consists of two components: an adversarial loss and an auxiliary category classification loss. For real samples $x_r$, the real sample loss $\mathcal{L}_{real}$ is defined as

$$\mathcal{L}_{real} = -\mathbb{E}_{x_r}\big[D_{adv}(x_r)\big] + \mathrm{CE}\big(D_{cls}(x_r), y_r\big),$$

where $y_r$ represents the true label and $\mathrm{CE}(\cdot,\cdot)$ denotes the cross-entropy loss. For generated samples $x_f = G(z, y_f)$, the generated loss $\mathcal{L}_{fake}$ is defined as

$$\mathcal{L}_{fake} = \mathbb{E}_{x_f}\big[D_{adv}(x_f)\big] + \mathrm{CE}\big(D_{cls}(x_f), y_f\big),$$

where $y_f$ denotes the generated label. To improve training stability, a Gradient Penalty (GP) term is introduced:

$$\mathcal{L}_{GP} = \lambda\,\mathbb{E}_{\hat{x}}\Big[\big(\lVert \nabla_{\hat{x}} D_{adv}(\hat{x}) \rVert_2 - 1\big)^2\Big],$$

where $\lambda$ is a regularization parameter and $\hat{x}$ is sampled along straight lines between pairs of real and generated samples. Consequently, the total loss of the discriminator is expressed as

$$\mathcal{L}_D = \mathcal{L}_{real} + \mathcal{L}_{fake} + \mathcal{L}_{GP}.$$

The objective of the generator is to deceive the discriminator while ensuring that the generated samples are assigned the correct class labels. The generator's loss function is defined as

$$\mathcal{L}_G = -\mathbb{E}_{x_f}\big[D_{adv}(x_f)\big] + \mathrm{CE}\big(D_{cls}(x_f), y_f\big).$$

The fault classification model achieves accurate recognition of different fault modes through supervised learning and is optimized using a cross-entropy-based loss function. To further enhance model performance, real and generated data are combined for training. The loss function of the fault classification model is defined as

$$\mathcal{L}_{C} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\,\log \hat{y}_{i,c},$$

where $y_{i,c}$ is the one-hot true label and $\hat{y}_{i,c}$ is the predicted probability of sample $i$ belonging to class $c$.
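Assuming a discriminator that returns both an adversarial score and class logits, the discriminator and generator losses above can be sketched in PyTorch as follows; the penalty coefficient and interpolation scheme follow the standard WGAN-GP formulation rather than values reported here.

```python
import torch
import torch.nn.functional as F

def gradient_penalty(D, x_real, x_fake, lam=10.0):
    """WGAN-GP penalty on samples interpolated between real and fake images."""
    eps = torch.rand(x_real.size(0), 1, 1, 1, device=x_real.device)
    x_hat = (eps * x_real + (1 - eps) * x_fake).requires_grad_(True)
    adv_score, _ = D(x_hat)
    grads = torch.autograd.grad(adv_score.sum(), x_hat, create_graph=True)[0]
    return lam * ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()

def discriminator_loss(D, x_real, y_real, x_fake, y_fake):
    adv_r, cls_r = D(x_real)
    adv_f, cls_f = D(x_fake.detach())
    l_real = -adv_r.mean() + F.cross_entropy(cls_r, y_real)     # L_real
    l_fake = adv_f.mean() + F.cross_entropy(cls_f, y_fake)      # L_fake
    return l_real + l_fake + gradient_penalty(D, x_real, x_fake.detach())

def generator_loss(D, x_fake, y_fake):
    adv_f, cls_f = D(x_fake)
    return -adv_f.mean() + F.cross_entropy(cls_f, y_fake)       # L_G
```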
3.6. MDFT-GAN-Based Fault Diagnosis Steps
To summarize, the fault diagnosis process based on MDFT-GAN includes the following five steps.
Step 1: Collection of the original bearing vibration signals and construction of a dataset. The dataset includes signals from various operating conditions, encompassing normal signals and multiple fault modes. A multi-domain fusion method is applied to preprocess the one-dimensional signals into multi-domain two-dimensional images. The preprocessed dataset is then divided into training and test sets.
Step 2: Establishment of an MDFT-GAN-based data augmentation framework and initialization of the model parameters. The generator and discriminator are trained using an adversarial learning mechanism. The generator employs the MDFT module to produce high-quality data, while the discriminator applies multi-domain feature distribution learning to differentiate between real and synthetic signals and perform supervised classification.
Step 3: Use of the trained MDFT-GAN generator to create high-quality synthetic data. The generated data are automatically labeled with fault categories based on the input category labels.
Step 4: Quantitative evaluation of the quality of the generated data using metrics such as SSIM and PSNR. The high-quality synthetic data are combined with real data to construct an augmented dataset, thereby expanding the training set and improving the model’s generalization ability.
Step 5: Training of the fault classification model using the augmented dataset. The classifier learns pattern features from both real and synthetic data. Finally, the classification model on the test dataset is validated to assess its fault diagnosis performance and ensure high accuracy, even with limited training data.
In summary, MDFT-GAN not only generates high-quality synthetic data but also significantly enhances the accuracy and robustness of fault classification. By expanding the training dataset and improving its feature distribution, it provides an effective solution for small-sample fault diagnosis tasks.
4. Experiments and Analysis of Results
4.1. Experimental Setup
To validate the effectiveness of the proposed MDFT-GAN-based rolling bearing fault diagnosis method, extensive experiments and data analyses were conducted on two case datasets: the CWRU dataset [31,32] and the Jiangnan University dataset [33]. The experimental process was divided into three stages. First, MDFT-GAN was trained using preprocessed data to generate high-quality synthetic samples. Second, the quality of the generated samples was evaluated using sample quality assessment metrics. Third, the generated samples were combined with the original samples to expand the training dataset for fault diagnosis model training. The effectiveness and superiority of the MDFT-GAN-based method were demonstrated by comparing its performance with several state-of-the-art fault diagnosis methods.
The method was implemented using the PyTorch deep learning framework in Python 3.11. The development environment was PyCharm 2023, and the experimental hardware configuration included an Intel Xeon Platinum 8362 CPU and an NVIDIA GeForce RTX 3090 GPU. The hyperparameter settings for the modeling process are detailed in Table 1, and the network structure of the proposed method is shown in Table 2.
4.2. Sample Quality Evaluation Indicator
To evaluate the data generation capability of MDFT-GAN, this paper adopts a joint evaluation method based on sample quality metrics, including the Peak Signal-to-Noise Ratio (PSNR), the Structural Similarity Index Measure (SSIM), cosine similarity, and the Pearson correlation coefficient, all of which are widely used in computer vision [34,35,36,37]. These four metrics jointly assess pixel error, perceptual error, and feature distribution, providing a holistic evaluation of the quality of the generated images. The generative models used for comparison are GAN [18], ACGAN [28], DCGAN [38], WCGAN-GP [20], and the proposed MDFT-GAN.
PSNR quantifies the overall pixel error between the generated image and the original image. The formula for PSNR is as follows:

$$\mathrm{PSNR} = 10\log_{10}\!\left(\frac{\mathrm{MAX}^2}{\mathrm{MSE}}\right), \qquad \mathrm{MSE} = \frac{1}{MN}\sum_{i=1}^{M}\sum_{j=1}^{N}\big(I(i,j) - K(i,j)\big)^2,$$

where MSE denotes the mean square error; $M$ and $N$ denote the width and height of the image, respectively; $I(i,j)$ and $K(i,j)$ are the pixel values of the original and generated images, respectively; and $\mathrm{MAX}$ represents the maximum possible pixel value.
SSIM evaluates the perceptual similarity between the generated image and the original image. Its formula is as follows:

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)},$$

where $\mu_x$ and $\mu_y$ are the means of the two images; $\sigma_x^2$ and $\sigma_y^2$ are their variances; $\sigma_{xy}$ denotes the covariance of the two images; and $C_1$ and $C_2$ are constants that ensure computational stability.
Cosine similarity (CS) measures the angular similarity between the original and generated images and is calculated as:

$$\mathrm{CS}(x, y) = \frac{x \cdot y}{\lVert x \rVert\,\lVert y \rVert},$$

where $x$ and $y$ represent the vectorized forms of the two images.
The Pearson correlation (PC) coefficient assesses the linear correlation between the original and generated images. It is defined as:

$$\mathrm{PC}(x, y) = \frac{\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{N}(y_i - \bar{y})^2}},$$

where $x_i$ and $y_i$ are the $i$th pixel values of the two images; $\bar{x}$ and $\bar{y}$ are their mean pixel values; and $N$ is the total number of pixels.
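These four metrics can be computed with standard tooling; the sketch below uses scikit-image for PSNR and SSIM and NumPy for the similarity measures, assuming 8-bit RGB images of equal size.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def sample_quality(real, fake):
    """Return (PSNR, SSIM, cosine similarity, Pearson r) for two uint8 RGB images."""
    psnr = peak_signal_noise_ratio(real, fake, data_range=255)
    ssim = structural_similarity(real, fake, channel_axis=-1, data_range=255)

    x = real.astype(np.float64).ravel()
    y = fake.astype(np.float64).ravel()
    cs = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))   # cosine similarity
    pc = np.corrcoef(x, y)[0, 1]                                  # Pearson correlation
    return psnr, ssim, cs, pc

real = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
fake = np.clip(real + np.random.randint(-10, 10, real.shape), 0, 255).astype(np.uint8)
print(sample_quality(real, fake))
```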
4.3. Data Preprocessing
The vibration signals of rolling bearings are inherently nonlinear and nonsmooth, and the signals collected by sensors often include random noise and shock interference. As a result, direct feature extraction and sample generation from one-dimensional vibration signals tend to be less effective. However, when these signals are transformed into images, the extracted patterns and features more intuitively reflect the dynamic changes in the signals. To fully leverage the limited vibration signal data, RGB images are generated by fusing time-domain, frequency-domain, and time-frequency domain information.
Figure 5 illustrates the corresponding images for the ten categories of CWRU bearing states across the time domain (TD), frequency domain (FRD), time-frequency domain (TFD), and fusion domain (FD).
The time-domain transformation retains the temporal characteristics of the original signal, while the time-frequency domain transformation captures the dynamic variations within the signals. By employing multi-domain feature fusion, the model effectively captures the nonlinear and nonsmooth characteristics associated with complex mechanical failures, enabling improved feature extraction and fault analysis.
To ensure consistent visual representation, all generated RGB images were constructed by normalizing each of the time-domain, frequency-domain, and time-frequency domain channels to the [0, 255] range. These normalized values were rendered using the standard RGB colormap without additional contrast enhancement or nonlinear remapping in order to faithfully preserve the structural properties of the signal transformations. All subfigures use consistent image dimensions and interpolation settings to support fair cross-model visual comparison.
4.4. Case Study 1: CWRU Dataset
This case utilizes a bearing dataset published by Case Western Reserve University (CWRU), which is widely used to evaluate bearing fault diagnosis performance and is publicly accessible through the Bearing Data Center website. The data were collected using the equipment depicted in Figure 6, which includes motors, torque sensors/encoders, and control electronics. The dataset comprises vibration signals for four bearing states: ball fault (BF), inner ring fault (IRF), outer ring fault (ORF), and normal (N). Each fault condition includes various damage diameters (e.g., 0.007, 0.014, and 0.021 inches), and the available sampling frequencies are 12 kHz and 48 kHz.
In this paper, vibration signal data for the four bearing states sampled at 12 kHz were selected, resulting in a total of 10 bearing state categories. Each signal was divided into sample blocks of length 4096, and overlapping sampling (with an overlap of 2048 points) was applied to mitigate boundary effects during preprocessing. A detailed description of the data is provided in Table 3.
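The overlapping segmentation described above (4096-point windows with a 2048-point overlap) can be implemented with a simple sliding window, as in the generic sketch below (not tied to the CWRU file format).

```python
import numpy as np

def segment_signal(signal, win_len=4096, overlap=2048):
    """Split a 1D vibration signal into overlapping blocks of length win_len."""
    step = win_len - overlap
    n_blocks = (len(signal) - win_len) // step + 1
    return np.stack([signal[i * step: i * step + win_len] for i in range(n_blocks)])

sig = np.random.randn(120000)             # e.g., ~10 s of a 12 kHz channel
blocks = segment_signal(sig)
print(blocks.shape)                        # (57, 4096) with 50% overlap
```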
4.4.1. Data Augmentation Model Evaluation
To assess the data generation capability of the MDFT-GAN model under varying conditions, performance tests are conducted with limited training data, with sample sizes of 50, 40, 30, and 20 per category. A detailed analysis is presented for the case with 50 samples per category.
Figure 7 shows the model training progression under this condition. The loss curve reveals that during the first 2000 epochs, the generator and discriminator engage in intense adversarial training, causing significant fluctuations in loss. Between 2000 and 4000 epochs, the generator persistently attempts to deceive the discriminator, while the discriminator adapts to distinguish generated data more effectively. By 4000 to 6000 epochs, the system reaches a relatively stable state, indicating convergence. As also illustrated in Figure 7, the discriminator's accuracy steadily improves throughout training, approaching 1 by the end, demonstrating successful adaptation to the generated data and achievement of a steady state.
Figure 8 presents the 10 classes of signal samples generated by the MDFT-GAN model after training under the 50-sample condition, alongside their corresponding original training samples. It is evident that the images generated by MDFT-GAN closely resemble the original samples in terms of texture, state, and feature distribution. The absence of noticeable artifacts or significant noise further demonstrates the model’s exceptional signal generation capability and stability.
Figure 9 presents a qualitative comparison of generated samples across ten fault classes using four baseline generative models: GAN, ACGAN, DCGAN, and WCGAN-GP. Although these models can approximate coarse signal structures, clear discrepancies emerge in their ability to recover fault-discriminative features and maintain inter-class separability. GAN exhibits class-invariant high-frequency noise and texture collapse, especially in Classes 3 and 6, indicative of poor convergence and mode collapse. ACGAN alleviates some instability through label conditioning, yet suffers from oversmoothed representations in Classes 1 and 7, diluting crucial fault-specific frequency modulations. DCGAN introduces better periodicity, but its limited receptive field results in local inconsistency and spatial fragmentation—particularly visible in Classes 4 and 9—compromising semantic coherence. WCGAN-GP demonstrates improved noise suppression and global smoothness but lacks fine-grained detail restoration in Classes 2 and 8 due to the absence of cross-channel or contextual attention. These artifacts are not merely perceptual flaws; they reduce the fidelity of synthetic samples as training data and weaken their contribution to diagnostic learning. In contrast, the proposed MDFT-GAN effectively preserves both intra-class textural consistency and inter-class discriminative patterns as a result of its dual design: (i) multi-domain input encoding captures complementary signal characteristics across time, frequency, and time-frequency domains, and (ii) Transformer-based channel attention enhances global structural modeling while retaining fine detail. These design choices collectively enable the generation of diagnostically meaningful and visually faithful samples.
Figure 10 further compares the generation results of ACGAN and MDFT-GAN through localized zooming. In the zoomed-in areas, while the overall style of the ACGAN-generated images resembles the original samples, the restoration of texture details and feature distribution is clearly inadequate. In contrast, the signal samples generated by MDFT-GAN not only closely match the original images in terms of detail but also demonstrate greater stability during the generation process, significantly reducing artifacts and noise interference. This highlights MDFT-GAN’s ability to more accurately capture complex signal features, further validating its superiority in generation quality.
To rigorously evaluate the sample generation capability of MDFT-GAN, this study employs multiple quantitative metrics to assess image quality and similarity to real images. As shown in Table 4 and Figure 11, MDFT-GAN achieves the highest SSIM in most categories, with an average value of 0.91, demonstrating superior structural consistency and texture restoration. In terms of PSNR, MDFT-GAN slightly trails WGAN-GP but outperforms the other models. Notably, for complex categories such as Class 7 and Class 9, it achieves the highest PSNR, highlighting its strong generative capacity for intricate visual structures.
Figure 12a,b illustrate the cosine similarity and Pearson correlation coefficients of different models, respectively. MDFT-GAN consistently achieves high cosine similarity across categories, significantly outperforming other models, particularly in Categories 4, 5, and 10. Regarding Pearson correlation coefficients, MDFT-GAN shows a strong correlation close to 1.0 in most categories, indicating a superior fit to the real data distribution. In contrast, GAN exhibits significant fluctuations in correlation values across multiple categories. Although ACGAN and DCGAN show improvement over GAN, they still fall short of MDFT-GAN, particularly in complex categories.
To further evaluate the robustness of MDFT-GAN under limited sample conditions, experiments were conducted using different numbers of training samples, with results shown in Table 5. The experimental settings included 20, 30, and 40 training samples, and the SSIM and PSNR performance of MDFT-GAN was compared across these sample sizes. The results indicate that as the number of training samples decreases, the performance of MDFT-GAN slightly declines, but the overall change remains minimal, demonstrating strong stability.
4.4.2. Fault Classification Model Evaluation
To validate the effectiveness of the data generated by MDFT-GAN in classification tasks, t-SNE was employed to visually analyze the feature distributions generated by different models. The comparison results are presented in Figure 13 and Table 6, along with the accuracy and F1 scores of the classification task, calculated as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},$$

where $TP$, $TN$, $FP$, and $FN$ denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.
Figure 13a–f show the t-SNE feature distributions of the data generated by each model. From the clustering patterns of the feature distributions, it is evident that the data generated by MDFT-GAN exhibit compact clusters, with clear separation between classes and high consistency within classes. In contrast, the feature distributions generated by GAN and ACGAN show significant overlap, suggesting that the generated data lack distinguishability and are less representative of the real data. This indicates that the data generated by MDFT-GAN more accurately represent the true feature distribution and align closely with the real data.
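For reference, the projection step behind Figure 13 can be reproduced with scikit-learn's t-SNE; the features passed in below are random stand-ins for whichever intermediate representations are actually visualized.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_tsne(features, labels, perplexity=30):
    """Project high-dimensional features to 2D and color points by class."""
    emb = TSNE(n_components=2, perplexity=perplexity, init="pca",
               random_state=0).fit_transform(features)
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="tab10", s=8)
    plt.title("t-SNE of generated-sample features")
    plt.show()

# Example with random stand-in features for 10 classes, 50 samples each
feats = np.random.randn(500, 128)
labels = np.repeat(np.arange(10), 50)
plot_tsne(feats, labels)
```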
Table 6 describes the effect of the data generated by different models on fault classification performance. With 50 samples per category, the proposed MDFT-GAN model achieves a classification accuracy of 99.41, significantly better than the compared models. In addition, MDFT-GAN attains the highest F1 score of 99.53, showing excellent classification performance and balanced results across categories. Even when the number of samples per category is reduced to 30, MDFT-GAN still maintains a classification accuracy of 98.69 and an F1 score of 98.74, further validating its robustness under limited samples.
To validate the effectiveness of the proposed EH-ViT on the generated data in classification tasks, this study compares its performance with several classical machine learning methods and deep learning models. The comparison methods include random forest (RaF), support vector machine (SVM), hierarchical CNN (HCNN), 2D-CNN, and 2D-ResNet. The experiments are evaluated using four metrics: accuracy, precision, recall, and F1 score, where precision and recall are calculated as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}.$$
Table 7 shows that random forest and SVM achieve comparable performance among the traditional machine learning methods. Among the deep learning methods, the hierarchical CNN performs better in terms of F1 score, whereas the classification performance of 2D-CNN and 2D-ResNet is lower, with 2D-ResNet achieving an accuracy of 92.54 and an F1 score of 93.06. In contrast, the proposed method outperforms all compared models on all four evaluation metrics.
These results indicate that a classification model utilizing data generated by MDFT-GAN can more effectively capture data characteristics, delivering superior performance in both classification accuracy and stability.
4.5. Case Study 2: JNU Dataset
In this section, the experiments utilize the bearing fault diagnosis dataset provided by Jiangnan University. The data acquisition setup is illustrated in Figure 14. The dataset encompasses four bearing operating states: normal (N), inner ring fault (IRF), outer ring fault (ORF), and rolling element fault (REF). Vibration signals were collected under various typical operating conditions at three motor speeds: 600 r/min, 800 r/min, and 1000 r/min. The signal data were recorded using a high-precision data acquisition device with a sampling frequency of 50 kHz, while a torque sensor was employed to measure power and speed during the experiments.
As shown in Table 8, the experiment selected data from four operating states at rotational speeds of 600 r/min and 800 r/min. Given the relatively low rotational speed (600 r/min = 10 Hz), we ensured that each 4096-point sample (corresponding to 81.9 ms at 50 kHz) covers nearly a full rotation cycle. This temporal span allows the preservation of key low-frequency components associated with fault periodicity. Moreover, the use of FFT and STFT over the full window helps maintain adequate spectral resolution at lower frequencies.
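As a quick check of this claim against the stated parameters:

$$T_{\mathrm{win}} = \frac{4096}{50{,}000\ \mathrm{Hz}} \approx 81.9\ \mathrm{ms}, \qquad T_{\mathrm{rev}} = \frac{60}{600\ \mathrm{r/min}} = 100\ \mathrm{ms},$$

so each window spans roughly 0.82 of a shaft revolution at 600 r/min and about 1.1 revolutions at 800 r/min ($T_{\mathrm{rev}} = 75$ ms).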
4.5.1. Data Augmentation Model Evaluation
Figure 15 illustrates the loss function curves of the MDFT-GAN during training. In the initial stages, the generator's loss gradually decreases, while the discriminator's loss stabilizes, indicating that the MDFT-GAN has reached a dynamic equilibrium. The discriminative accuracy curve in Figure 15b further demonstrates the model's discriminative ability: the discriminator's accuracy increases rapidly during the early stages of training and approaches 1.0, signifying its ability to effectively distinguish between generated and real data. This validates the effectiveness of the MDFT-GAN training process.
Figure 16 presents images of vibration signals generated by (1) GAN, (2) ACGAN, (3) DCGAN, (4) WCGAN-GP, and (5) MDFT-GAN across eight categories, compared with the ground truth (GT). The visual comparison indicates that MDFT-GAN demonstrates significant advantages in image detail preservation, texture distribution accuracy, and noise suppression, outperforming the other models. Specifically, images generated by GAN exhibit significant random noise and lower overall quality. ACGAN and WCGAN-GP partly improve image quality but still suffer from missing details and feature distortion. In comparison, the images generated by MDFT-GAN are highly similar to the real images in all categories, with clear texture distributions and accurate feature details, demonstrating excellent generation capability.
Figure 17a,b show the cosine similarity and Pearson correlation coefficient results for each model, respectively. MDFT-GAN performs consistently better than the other models, indicating that the images it generates accurately capture the feature distributions of the different categories.
To visualize the feature distribution of the generated data, t-SNE was used to reduce the dimensionality of the generated data features, as shown in Figure 18. Figure 18a–f show that the features of the MDFT-GAN-generated data have a compact distribution with clear separation between classes, effectively retaining the structure of the real data features. The feature distributions generated by the other models exhibit intra-class mixing or inter-class overlap and are difficult to distinguish accurately, especially at category boundaries.
4.5.2. Fault Classification Model Evaluation
To evaluate the impact of the data generated by different generative models on classification performance, this study compares GAN, ACGAN, WGAN-GP, DCGAN, and the proposed MDFT-GAN. The experimental results are presented in Table 9. Although ACGAN and DCGAN are the stronger performers among the compared models, MDFT-GAN outperforms all of them, with an accuracy of 99.91 and an F1 score of 99.90. This suggests that superior data quality facilitates subsequent fault diagnosis.
To further validate the effectiveness of the classification model proposed in this paper, it was compared with several mainstream classification models. The results are detailed in Table 10.
Among the machine learning models, the ability of random forest and SVM to classify complex fault signals remains limited, owing to the limitations of traditional feature extraction methods in capturing high-dimensional representations. The performance of the deep learning models varies, with 2D-ResNet performing relatively well at an accuracy of 98.12. The proposed method, however, performs best on all evaluation metrics, achieving 99.25 on each. These results highlight the superiority of MDFT-GAN in generating high-quality synthetic data and efficiently capturing the features of complex fault signals.
4.6. Imbalanced Training Sample Evaluation
To evaluate the generative ability of the MDFT-GAN model under data imbalance conditions, two imbalanced training sample settings were designed: (a) 4-classification: under identical damage size conditions, the data volumes for different fault types are imbalanced; (b) 7-classification: under varying combinations of damage sizes and fault types, the data volumes are imbalanced. The specific training data settings for these scenarios are detailed in Table 11.
In the 4-class experiment, the dataset consists of four types: ball fault (BF), inner race fault (IRF), outer race fault (ORF), and normal (N). The damage size for each fault type is 0.007, and the sample sizes are 40, 30, 20, and 50, respectively, reflecting a significant imbalance between the different categories.
In the 7-class experiment, the dataset is expanded to include combinations of two different damage sizes (0.007 and 0.021), covering the same three fault types (BF, IRF, ORF) and the normal state (N). In this setting, the data imbalance is more pronounced, with differences in the number of samples not only across fault types but also within the same fault type under different damage sizes. For example, the sample size for Ball Fault is 40 for a damage size of 0.007 but drops to 20 for a damage size of 0.021. This experimental design more closely resembles the uneven data distribution typically encountered in real-world industrial scenarios.
Figure 19 and Figure 20 show the images generated by the GAN family of models trained under the four-class and seven-class imbalance conditions. The results show that the proposed method significantly outperforms the other models in terms of detail restoration and global consistency. The images generated by GAN and ACGAN exhibit significant blurring and loss of detail, whereas WCGAN-GP, although it improves contrast, still falls short of restoring complex textures. In contrast, the images generated by MDFT-GAN not only closely resemble the ground truth in terms of texture details but also maintain a high degree of consistency in global structure. This shows that the method is able to account for both detailed and global features when generating high-quality images.
Figure 21 and Figure 22 show the cosine similarity and Pearson correlation coefficient curves of the compared models, respectively. The results show that MDFT-GAN achieves high levels of cosine similarity and correlation in all categories, which underscores the excellent agreement between its generated image distribution and the real image distribution. Compared with the other models, MDFT-GAN generates more stable samples with higher consistency with the real image distribution.
Figure 23a,b show the confusion matrix results of the proposed model for the four- and seven-category experiments, respectively. In the four-category experiment, the classification results show an accuracy of 100 for all categories despite the imbalanced sample distribution. In the more complex seven-category experiment, the data variance is more pronounced, yet the proposed model still exhibits excellent classification performance; although slight classification errors were observed in the first and fourth categories, the overall performance remained strong.
These results confirm that MDFT-GAN can generate high-quality samples and exhibit excellent adaptability and robustness in the presence of multi-category and unbalanced data, making it a reliable solution for complex industrial diagnostic tasks.
4.7. Ablation Experiments
To verify the effectiveness of each module in the MDFT-GAN model and its contribution to overall performance, a series of ablation experiments were conducted, as summarized in Table 12. Figure 24 and Figure 25 visually illustrate the comparison results. The comparison includes three model variants: ACGAN, ACGAN with Conv-RES, and MDFT-GAN. The visual quality and quantitative classification performance of the generated images highlight the contribution of each module to the overall improvement.
The MDFT-GAN with both Conv-RES and the Transformer is highly consistent with the ground truth in terms of local details and global distribution, greatly improving the quality of the generated samples. The classification accuracy results further validate this conclusion: with 50 samples per category, the classification accuracy of ACGAN is 83.60, which improves to 89.60 after the introduction of the Conv-RES module, indicating that Conv-RES effectively enhances local feature extraction; after the Transformer is introduced, the classification accuracy reaches 99.41.
4.8. Interpretability Analysis
To further elucidate the hierarchical characterization mechanism of the MDFT-GAN discriminator in fault pattern recognition, the feature responses of Class 4 and Class 7 samples from the CWRU dataset at each discriminative layer are visualized using the Grad-CAM method, as shown in Figure 26.
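For reference, a minimal Grad-CAM sketch over an intermediate convolutional layer is shown below; the model, layer index, and input size are hypothetical stand-ins for the discriminator and classifier layers visualized in Figure 26.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def grad_cam(model, layer, image, class_idx):
    """Compute a Grad-CAM heatmap for `class_idx` at a chosen conv layer (sketch)."""
    feats, grads = {}, {}
    h1 = layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))

    logits = model(image.unsqueeze(0))                   # (1, n_classes)
    model.zero_grad()
    logits[0, class_idx].backward()
    h1.remove(); h2.remove()

    w = grads["a"].mean(dim=(2, 3), keepdim=True)        # channel-wise weights
    cam = F.relu((w * feats["a"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear",
                        align_corners=False)
    return (cam / (cam.max() + 1e-8)).squeeze()          # normalized heatmap in [0, 1]

# Hypothetical stand-in for a discriminator/classifier backbone:
model = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 10),
)
heatmap = grad_cam(model, model[2], torch.randn(3, 64, 64), class_idx=4)
print(heatmap.shape)    # torch.Size([64, 64])
```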
The results show that in the shallow layers (Figure 26a,e), the model mainly responds to low-frequency periodic structures, reflecting its effective suppression of background noise and its ability to model spectral priors. The activations of the intermediate layers (Figure 26b,f) are significantly enhanced, and the model begins to perceive local non-stationary features: the activations of Class 4 samples concentrate at amplitude mutation points, while Class 7 shows a sparse multi-region distribution, revealing the model's differential sensitivity to local degradation features under different fault patterns.
In the high-level semantic space (Figure 26c,d,g,h), the feature responses of both classes become further focused and show increasing intra-class consistency: the activations of Class 4 expand toward the edge regions to integrate broader fault information, while Class 7 forms a contracted discriminative core at the center.
This indicates that the model realizes class-specific semantic embedding and spatial aggregation at a deep level. The above activation evolution process systematically reveals the structural path of MDFT-GAN from low-level perception to high-level discrimination and demonstrates its discriminative robustness and semantic interpretability for complex fault types.
Overall, these results demonstrate the effectiveness and robustness of MDFT-GAN in modeling complex vibration signals and performing fault diagnosis. The model excels in both generation and classification tasks, providing a reliable solution for challenging diagnostic applications.
5. Conclusions
This study addresses the persistent challenge of bearing fault diagnosis in industrial applications, particularly under conditions of limited and imbalanced datasets. We introduced a novel Multi-Domain Feature Transformer Generative Adversarial Network (MDFT-GAN) that effectively augments data by transforming bearing signals into two-dimensional RGB images across time, frequency, and time-frequency domains. The integration of a Transformer encoder with an efficient channel attention mechanism within the MDFT submodule allows the MDFT-GAN to capture intricate global feature interactions and local dependencies, thereby generating high-quality synthetic samples. Additionally, the proposed enhanced hybrid vision transformer classification model, which combines front-end convolutional layers with residual connections, significantly improves the robustness and accuracy of fault classification.
Experimental evaluations conducted on the CWRU and Jiangnan University fault datasets demonstrate that the MDFT-GAN method substantially outperforms existing state-of-the-art approaches in terms of both robustness and diagnostic accuracy. These results underscore the efficacy of our approach in mitigating the limitations posed by scarce and unbalanced data, thereby advancing the reliability of bearing fault diagnosis in industrial settings.
Furthermore, a Grad-CAM-based interpretability framework is incorporated to visualize hierarchical feature activations within the discriminator and classifier, offering intuitive and quantitative insight into the model’s decision process and enhancing its transparency in real-world industrial deployment.
Future work will explore the application of MDFT-GAN to other types of industrial fault diagnosis and investigate how to integrate other domain-specific functions to further improve performance. In addition, more lightweight generative models will continue to be explored without compromising model performance.