PerDCGAN: A Perceptual Generative Framework for High-Fidelity Bearing Fault Diagnosis

Li, Yuantao; Li, Ao; Wang, Xiaoli; Yin, Jiancheng

doi:10.3390/app16084054

School of Mechanical Engineering, Shandong University of Technology, Zibo 255049, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(8), 4054; https://doi.org/10.3390/app16084054

Submission received: 11 March 2026 / Revised: 6 April 2026 / Accepted: 12 April 2026 / Published: 21 April 2026

(This article belongs to the Special Issue Mechanical Fault Diagnosis and Signal Processing)

Featured Application

This work can be applied to intelligent health monitoring and early fault warning of rotating machinery in smart manufacturing.

Abstract

Data imbalance significantly hinders the performance of deep learning models in rolling bearing fault diagnosis. While Generative Adversarial Networks (GANs) are widely used for data augmentation, traditional architectures employing pixel-level loss functions often fail to capture complex time-frequency textures, resulting in blurred spectrograms and the loss of transient fault characteristics. To address this, we propose a data augmentation framework based on a Perceptually Optimized Deep Convolutional GAN (PerDCGAN). By integrating a perceptual loss function derived from a pre-trained VGG-16 network, the generator is constrained at the feature level rather than the pixel level, explicitly enforcing the preservation of structural details and high-frequency impact patterns. Extensive experiments on the Case Western Reserve University (CWRU) and Paderborn University (PU) datasets demonstrate that the proposed method effectively mitigates spectral blurring. Ablation studies confirm the synergistic effect of the joint loss function. Furthermore, under extreme 0 dB noise conditions, the classifier augmented by PerDCGAN maintains a robust diagnostic accuracy of 89.65% on the PU dataset, significantly outperforming standard DCGAN and demonstrating strong potential for complex industrial applications.

Keywords:

fault diagnosis; continuous wavelet transform; deep convolutional generative adversarial network; VGG16

1. Introduction

Data-driven fault diagnosis methods, particularly those based on Deep Learning (DL), have revolutionized the health monitoring of rotating machinery [1,2,3]. Unlike traditional model-based approaches that rely on complex physical modeling, DL models such as Deep Autoencoders [3], Wide Kernel CNNs (WDCNN) [4], and Recurrent Neural Networks [5] can automatically extract hierarchical features from vibration signals, achieving superior diagnostic accuracy. However, the success of these algorithms depends heavily on the availability of large-scale, well-balanced labeled datasets. In practical industrial scenarios, fault samples are inherently scarce compared to normal operation data. This “small-sample” dilemma frequently leads to model overfitting and poor generalization, and it has become a primary bottleneck in intelligent fault diagnosis.

To address the data scarcity issue, Generative Adversarial Networks (GANs) and advanced generative modeling have emerged as dominant techniques for data augmentation, undergoing rapid theoretical and structural advancements. Recent progress in signal synthesis includes improved Wasserstein formulations, such as WGAN with Gradient Penalty (WGAN-GP) [6], which explicitly enforce Lipschitz constraints to stabilize the adversarial training dynamics of mechanical signals. Concurrently, self-attention mechanisms have been integrated into generative frameworks to capture long-range temporal dependencies and global contextual information within complex vibration sequences [7]. Furthermore, diffusion-based approaches have recently gained significant traction [8,9], demonstrating remarkable capabilities in synthesizing high-fidelity time-frequency representations through iterative forward and reverse denoising processes. Building upon these generative paradigms, recent studies have employed various specific architectures to synthesize artificial fault samples. For instance, Zhuo et al. [10] and Li et al. [11] utilized modified GANs and Wasserstein GANs (WGAN) to generate spectrograms and vibration signals, effectively mitigating class imbalance. Similarly, strategies combining Auxiliary Classifier GANs (ACGAN) with optimization algorithms [12] or Transformer networks [13] have been proposed to enhance generation stability.

While recent studies have extensively employed various GAN architectures to synthesize artificial fault samples, a comprehensive 2025 literature review by Zhang et al. [14] highlights that industrial adoption is still severely hindered by low-fidelity generation under extreme data scarcity. Specifically, Luo et al. [15] demonstrated that standard GANs are highly susceptible to training instability and mode collapse when trained on limited bearing fault data. Most conventional GAN algorithms operate by minimizing pixel-level discrepancies, such as L1 or MSE norms [16]. However, this pixel-level optimization penalizes all pixels equally, failing to distinguish between critical semantic features and irrelevant background noise [17]. Furthermore, minimizing pixel-level error inadvertently acts as a low-pass filter [18], which is particularly detrimental for mechanical vibration signals where diagnostic information manifests as sparse, high-frequency transient impacts [19]. As a result, standard optimization tends to average out these subtle time-frequency textures, leading to severe spectral blurring and the loss of discriminative fault signatures.

While perceptual loss is an established technique in computer vision for enhancing visual aesthetics, its application to bearing fault diagnosis introduces a fundamentally novel physical interpretation. This mechanism has not yet been utilized in the generative augmentation of fault time-frequency spectrograms. Unlike natural optical images, spectrograms are 2D mappings of 1D mechanical kinematics, where critical fault signatures—such as periodic transient impacts—manifest as sparse, high-frequency textural structures. By repurposing the perceptual constraint as a strict “physical kinematics preserver,” our method fundamentally overcomes the detrimental low-pass filtering effect of traditional pixel-level losses that obliterate these transients, thereby capturing the true mechanical degradation features.

This fundamental mismatch between standard pixel-level loss functions and the physical characteristics of transient fault signals represents a significant research gap. To bridge this gap and overcome the limitation of spectral blurring in traditional GANs, this paper proposes a data augmentation framework based on a Perceptually Optimized Deep Convolutional GAN (PerDCGAN). Instead of solely relying on pixel-wise differences, we shift the generative constraint from the pixel level to the feature level by introducing a perceptual loss mechanism derived from a pre-trained VGG-16 network. The core insight is that the intermediate feature maps of VGG-16 are highly sensitive to abstract texture and structural semantics. By constraining the generator to match these high-level features rather than raw pixels, our method explicitly forces the synthesis of time-frequency images that preserve the sharp, transient impacts and authentic textures of bearing faults, ensuring the generation of high-fidelity samples crucial for accurate few-shot fault diagnosis. The main contributions of this study are summarized as follows:

(1): We propose the PerDCGAN framework, introducing a perceptual loss mechanism to the generative augmentation of bearing fault signals. Moving beyond its conventional use for visual aesthetics in computer vision, we utilize this feature-level constraint to explicitly preserve the high-frequency transient impacts and periodic structural textures inherent in mechanical time-frequency spectrograms, thereby overcoming the spectral blurring caused by traditional pixel-level loss functions.
(2): We quantitatively and qualitatively demonstrate that the proposed method effectively mitigates mode collapse and generates fault samples with superior structural consistency. Compared to the standard DCGAN, PerDCGAN yields significantly higher visual fidelity, achieving SSIM values exceeding 0.60 and reducing FID scores to a range of 51.4 to 59.6.
(3): Validation on both the Case Western Reserve University (CWRU) and Paderborn University (PU) datasets demonstrates that augmenting the training set with the generated samples increases diagnostic accuracy from 93.0% to 96.2% on the CWRU dataset, and from 88.24% to 98.21% on the PU dataset. Furthermore, under 0 dB noise conditions, the model achieves accuracies of 84.5% and 89.65% on the CWRU and PU datasets, respectively, verifying its effectiveness and robustness in few-shot diagnosis scenarios.

2. Materials and Methods

2.1. Continuous Wavelet Transform

Continuous Wavelet Transform (CWT) [20] is a multiscale analysis method with strong time-frequency analysis capabilities. Let a continuous signal be $f(t) \in L^{2}(\mathbb{R})$, where $t$ denotes time, $\mathbb{R}$ represents the set of real numbers, and $L^{2}(\mathbb{R})$ denotes the space of square-integrable (finite-energy) signals. A mother wavelet function $\psi(t)$ must satisfy the admissibility condition:

$$C_{\psi} \equiv \int_{-\infty}^{\infty} \frac{|\Psi(\omega)|^{2}}{|\omega|} \, d\omega < \infty$$

(1)

where $\Psi(\omega)$ is the Fourier transform of $\psi(t)$, $\omega$ is the angular frequency, and $C_{\psi}$ is the admissibility constant. Through scaling and translation operations, the mother wavelet generates a family of basis functions:

$$\psi_{a,\tau}(t) = \frac{1}{\sqrt{|a|}} \, \psi\left(\frac{t - \tau}{a}\right)$$

(2)

where $a$ ($a \neq 0$) is the scale parameter controlling the dilation of the wavelet, and $\tau$ is the translation parameter indicating the time shift. The continuous wavelet transform of the signal $f(t)$ is defined as the inner product of $f(t)$ and the wavelet basis:

$$W_{f}(a,\tau) \equiv \frac{1}{\sqrt{|a|}} \int_{-\infty}^{\infty} f(t) \, \psi^{*}\left(\frac{t - \tau}{a}\right) dt$$

(3)

where $W_{f}(a,\tau)$ indicates the wavelet transform coefficients, and $\psi^{*}$ denotes the complex conjugate of the mother wavelet. A larger scale parameter $a$ corresponds to a dilated wavelet, which captures low-frequency, macroscopic features of the signal with high frequency resolution. Conversely, a smaller $a$ compresses the wavelet, providing high time resolution to effectively capture high-frequency, transient fault impacts.
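As a rough illustration of Equation (3), the following NumPy sketch discretizes the wavelet integral as a correlation between the signal and a scaled, conjugated wavelet. The complex Morlet wavelet, its centre frequency `w0 = 6`, and the function names `morlet` and `cwt` are assumptions made for this example; the text does not state which mother wavelet or implementation was used.

```python
import numpy as np

def morlet(t, w0=6.0):
    """Complex Morlet mother wavelet; w0 is an assumed centre frequency."""
    return np.pi ** -0.25 * np.exp(1j * w0 * t) * np.exp(-0.5 * t ** 2)

def cwt(signal, scales, dt):
    """Discretize Equation (3) for each scale a and every time shift tau."""
    n = len(signal)
    t = (np.arange(n) - n // 2) * dt          # wavelet time axis centred on zero
    coeffs = np.empty((len(scales), n), dtype=complex)
    for i, a in enumerate(scales):
        psi = morlet(t / a) / np.sqrt(abs(a))  # scaled wavelet with 1/sqrt(|a|) factor
        # np.correlate conjugates its second argument, which supplies psi^*
        coeffs[i] = np.correlate(signal, psi, mode="same") * dt
    return coeffs
```

For a Morlet wavelet, a scale $a$ responds most strongly to frequencies near $w_0 / (2\pi a)$, so small scales pick up the high-frequency transients described above. Taking `np.abs(coeffs)` gives the scalogram magnitude that would be rendered as an image.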

2.2. Deep Convolutional Generative Adversarial Network

By integrating convolutional architectures into GAN technology and implementing subsequent enhancements, an unsupervised deep learning model termed Deep Convolutional Generative Adversarial Network [21] has been proposed. The DCGAN architecture comprises two core components: a Generator (G) and a Discriminator (D). Through an adversarial training process, these two components engage in a competitive game: G generates synthetic samples that approximate the feature distribution of real data, while D distinguishes between real and generated samples. This adversarial interaction progressively refines both components until they reach a Nash equilibrium [22]. The overall structure is depicted in Figure 1.

In the DCGAN architecture, the generator takes a 100-dimensional random noise vector z as its input. The network structure comprises one fully connected layer followed by four deconvolutional layers. Specifically, the deconvolutional layers employ transposed convolution with a kernel size of 4 × 4 and a stride of 2 for upsampling, thereby enlarging the feature maps. Each transposed convolutional layer is followed by a batch normalization layer and a Rectified Linear Unit (ReLU) activation function [23]. At the output layer, a Hyperbolic Tangent (Tanh) activation function [24] is applied to generate the final samples. The objective of the discriminator is to determine whether the input sample originates from real data or is generated by the generator. Its architecture is symmetrical to that of the generator, consisting of four convolutional layers and one fully connected layer. The convolutional layers utilize 4 × 4 kernels with a stride of 2 to perform downsampling, thereby reducing the spatial dimensions of the feature maps. Each convolutional layer is followed by a batch normalization layer and a LeakyReLU [25] activation function. Finally, a sigmoid [26] activation function is applied at the output layer to produce the discrimination probability.
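The doubling and halving of feature-map sizes can be checked with the standard output-size formulas for transposed and strided convolutions. The sketch below assumes a padding of 1 and a 4 × 4 starting feature map, neither of which is specified in the text; only the 4 × 4 kernel and the stride of 2 come from the description above.

```python
def deconv_out(size, kernel=4, stride=2, padding=1):
    """Output side length of a transposed convolution (no dilation or output padding)."""
    return (size - 1) * stride - 2 * padding + kernel

def conv_out(size, kernel=4, stride=2, padding=1):
    """Output side length of a strided convolution."""
    return (size + 2 * padding - kernel) // stride + 1

def trace(size, layers, step):
    """Apply `step` repeatedly and record the side length after each layer."""
    sizes = [size]
    for _ in range(layers):
        sizes.append(step(sizes[-1]))
    return sizes

print(trace(4, 4, deconv_out))   # [4, 8, 16, 32, 64]
print(trace(64, 4, conv_out))    # [64, 32, 16, 8, 4]
```

Under these assumed settings, each generator layer doubles the spatial size and each discriminator layer halves it, which is the symmetry the text describes.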

2.3. The Proposed PerDCGAN Model

2.3.1. Network Architecture of Generator and Discriminator

To address the scarcity of fault samples in bearing fault diagnosis, the bearing fault dataset is augmented using an enhanced Deep Convolutional Generative Adversarial Network model. The model comprises two core components: a generator and a discriminator. As illustrated in Figure 2, the framework of the generator G consists of one fully connected layer and four transposed convolutional layers. The fully connected layer first maps the 100-dimensional random noise following a Gaussian distribution to a feature map with 512 channels. This feature map is then progressively upsampled through the transposed convolutional layers, enhancing the feature representation and ultimately generating the synthetic image. To enhance the model’s stability and generalization capability, batch normalization is applied to the outputs of the first three transposed convolutional layers, with the exception of the final layer. These normalized outputs are then fed into the LeakyReLU function as the activation mechanism.

During the training process, the objective of the generator is to produce images that can deceive the discriminator. Specifically, the generated samples must make it difficult for the discriminator to distinguish them from real samples. Furthermore, to enhance the quality of the generated samples, the generator also aims to minimize the feature disparity between the generated samples and the real samples.

The framework of the discriminator, as illustrated in Figure 2, is composed of five convolutional layers. With the exception of the first convolutional layer, each subsequent layer is followed by batch normalization and a LeakyReLU activation function. This architecture is designed to extract image features while progressively reducing the spatial dimensions of the feature maps. The resulting vector is then passed through a fully connected layer to output an authenticity probability, determining whether the input image is real or generated.

The discriminator’s objective is to accurately distinguish between real and generated images, while the generator aims to produce counterfeit images that closely resemble the real ones. Through this adversarial interplay, the two components engage in a competitive process that progressively enhances the quality of the generated output.

2.3.2. Formulation of the Joint Objective Function

This study builds upon the foundational framework pioneered and validated by Isola et al. [27] for image-to-image translation tasks, which proposes synergistically guiding the generator by combining adversarial loss with pixel-level reconstruction loss. However, the classical combination of L1 loss and adversarial loss tends to produce blurred texture details when generating highly structured time-frequency diagrams of bearing faults. To address this limitation, a key innovation introduced in our model is the incorporation of insights from the perceptual loss theory proposed by Johnson et al. [28]. This theory demonstrates that employing high-level features extracted by a pre-trained deep network can more effectively capture and preserve the semantic content and visual quality of images compared to pure pixel-level loss. Hence, our joint objective function integrates three components: adversarial loss, L1 loss, and perceptual loss.

The adversarial loss quantifies the generator’s capability to deceive the discriminator and the discriminator’s capacity to distinguish between real and synthetic samples. It is formulated as follows [21]:

$$\min_{G} \max_{D} V(D, G) = \mathbb{E}_{x \sim p_{data}(x)} [\log D(x)] + \mathbb{E}_{z \sim p_{z}(z)} [\log(1 - D(G(z)))]$$

(4)

The generator and discriminator are inherently two independent networks, and therefore their training is performed separately. During the training process, the parameters of one network are fixed while the parameters of the other network are updated. First, the generator is optimized using the following loss function:

$$\min_{G} L_{adv_{G}} = \mathbb{E}_{z \sim p_{z}(z)} [\log(1 - D(G(z)))]$$

(5)

Subsequently, the discriminator is optimized with the following loss function:

$$\max_{D} L_{adv_{D}} = \mathbb{E}_{x \sim p_{data}(x)} [\log D(x)] + \mathbb{E}_{z \sim p_{z}(z)} [\log(1 - D(G(z)))]$$

(6)

where $G$ and $D$ denote the generator and the discriminator, respectively; $V(D, G)$ is the objective function; $\mathbb{E}$ represents the expected value; $z$ indicates random noise; $x$ corresponds to real data; $p_{data}(x)$ describes the distribution of the real data; and $p_{z}(z)$ refers to the distribution of the input random noise.
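A minimal NumPy sketch of Equations (5) and (6) is given below, with expectations replaced by batch means over discriminator outputs. The function names and the small constant `EPS` are illustrative assumptions. The helper `bce` is included only to show that maximizing Equation (6) matches minimizing a binary cross-entropy with label 1 for real samples and label 0 for generated ones.

```python
import numpy as np

EPS = 1e-12  # assumed guard against log(0)

def discriminator_objective(d_real, d_fake):
    """Equation (6): the discriminator maximizes this quantity."""
    return np.mean(np.log(d_real + EPS)) + np.mean(np.log(1.0 - d_fake + EPS))

def generator_objective(d_fake):
    """Equation (5): the generator minimizes this quantity."""
    return np.mean(np.log(1.0 - d_fake + EPS))

def bce(pred, target):
    """Binary cross-entropy averaged over the batch."""
    return -np.mean(target * np.log(pred + EPS)
                    + (1.0 - target) * np.log(1.0 - pred + EPS))
```

When the discriminator outputs 0.5 for every sample, Equation (6) evaluates to $-\log 4$, the value associated with the equilibrium of the original GAN game.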

Although L1 loss is commonly used in conditional image translation tasks [27], we include it in our unconditional model to directly control pixel-level differences. Minimizing the L1 distance helps maintain the overall energy balance and keeps the background of the spectrograms sparse. This effectively prevents the model from generating random background noise, providing a clean base image. Once the low-frequency global structure is stabilized by this L1 penalty, the perceptual loss can fully concentrate on reconstructing the fine-grained, high-frequency transient impact patterns. The L1 loss is calculated as follows:

$$L_{L1} = \frac{1}{N} \sum_{i=1}^{N} \left\| x_{i} - G(z)_{i} \right\|_{1}$$

(7)

where $N$ is the total number of pixels in the image, $x_{i}$ represents the value of the $i$-th pixel in the real image, and $G(z)_{i}$ represents the corresponding pixel value of the synthetic image generated by $G$.
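Equation (7) reduces to a mean absolute difference over pixels, as in the short sketch below. The name `l1_loss` is hypothetical, and how real and generated images are paired within a batch is not described in the text, so the one-to-one pairing here is an assumption.

```python
import numpy as np

def l1_loss(real, fake):
    """Equation (7): mean absolute pixel difference between two same-shaped images."""
    return np.mean(np.abs(real - fake))

real = np.zeros((2, 2))
fake = np.array([[1.0, -1.0], [0.0, 2.0]])
print(l1_loss(real, fake))  # 1.0
```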

VGG perceptual loss further ensures the perceptual quality of generated images by comparing their feature representations against those of real images within the VGG feature extraction network. This loss function quantifies the difference between high-level representations of two images extracted from a pre-trained deep neural network, an approach that has been extensively validated in numerous image generation tasks [29,30]. Inspired by this approach, our study employs perceptual loss to generate high-quality fault images. The perceptual loss applied in the enhanced DCGAN comprises two components: a feature reconstruction loss that measures content differences, and a style loss that quantifies stylistic disparities. Both loss components are constructed upon the intermediate feature space embedded within the pre-trained VGG-16 model.

The Feature Reconstruction Loss is defined as follows [16]:

$$L_{feat} = \frac{1}{C_{j} H_{j} W_{j}} \left\| V_{j}(x) - V_{j}(G(z)) \right\|_{2}^{2}$$

(8)

where $C_{j}$, $H_{j}$, and $W_{j}$ denote the channel count, height, and width of the output feature maps, respectively, and $V_{j}(\cdot)$ represents the non-linear feature extraction operation at the $j$-th layer of the VGG-16 network, as visually illustrated by the feature maps in Figure 3.
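The feature reconstruction loss in Equation (8) can be sketched as follows. The arrays stand in for VGG-16 activations of shape (C, H, W); extracting real activations requires the pre-trained network, which is outside the scope of this example, and the function name is hypothetical.

```python
import numpy as np

def feature_reconstruction_loss(feat_real, feat_fake):
    """Equation (8): squared L2 distance between feature maps, divided by C*H*W."""
    c, h, w = feat_real.shape
    return np.sum((feat_real - feat_fake) ** 2) / (c * h * w)
```

Because the distance is measured between activations rather than pixels, two images that differ pixel by pixel can still score a low loss if the chosen layer responds to them in a similar way.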

Style loss was initially proposed to quantify stylistic differences between images, capturing abstract features such as texture and color. In this study, we employ style loss to enhance the feature-level similarity between generated and real images. The style features are extracted from specific convolutional layers of a pre-trained neural network and are defined as the correlations between different feature map groups. This relationship is typically represented by the Gram matrix [31], which computes the inner products of vectorized feature maps to capture spatially independent texture information. For a given input $x$, the output of the $j$-th layer in VGG-16, denoted as $\phi_{j}(x)$ (the same feature map written as $V_{j}(x)$ in Equation (8)), has dimensions $C_{j} \times H_{j} \times W_{j}$. The elements of the Gram matrix are defined as follows:

$$GM_{\phi(x)}^{c,c'} = \frac{1}{C_{j} H_{j} W_{j}} \sum_{h=1}^{H_{j}} \sum_{w=1}^{W_{j}} \phi_{j}(x)_{h,w,c} \, \phi_{j}(x)_{h,w,c'}$$

(9)

The outputs from the Conv1_2, Conv2_2, Conv3_3, and Conv4_3 layers of VGG-16 are used to construct four correlation matrices. The style loss is then defined as the distance between the Gram matrices of the generated and real images:

$$L_{style} = \sum_{n=1}^{4} \left\| GM(V_{n}(x)) - GM(V_{n}(G(z))) \right\|_{F}^{2}$$

(10)
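Equations (9) and (10) can be sketched in NumPy as shown below. The feature maps are stored channel-first, shape (C, H, W), which is an assumption that differs from the index order written in Equation (9) but yields the same matrix. The function names are hypothetical and the inputs stand in for VGG-16 activations.

```python
import numpy as np

def gram_matrix(feat):
    """Equation (9): channel-by-channel inner products, normalized by C*H*W."""
    c, h, w = feat.shape
    flat = feat.reshape(c, h * w)          # one row per channel
    return flat @ flat.T / (c * h * w)

def style_loss(feats_real, feats_fake):
    """Equation (10): squared Frobenius distance between Gram matrices, summed over layers."""
    return sum(np.sum((gram_matrix(r) - gram_matrix(f)) ** 2)
               for r, f in zip(feats_real, feats_fake))
```

Since the Gram matrix sums over all spatial positions, shuffling those positions leaves it unchanged. This is the sense in which it captures texture statistics independently of where a pattern occurs.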

The perceptual loss is defined as the sum of the feature reconstruction loss and the style loss:

$$L_{percept} = L_{feat} + L_{style}$$

(11)

These loss functions act synergistically, enabling the generator to produce images that are both realistic and of high perceptual quality. Hence, the combined loss function is formulated as follows:

$$L_{G} = \lambda_{adv} L_{adv} + \lambda_{L1} L_{L1} + \lambda_{p} L_{percept}$$

(12)

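where $\lambda_{adv}$, $\lambda_{L1}$, and $\lambda_{p}$ represent the weighting coefficients that control the relative contributions of the adversarial loss, L1 loss, and perceptual loss, respectively. Here $L_{adv}$ denotes the generator's adversarial loss $L_{adv_{G}}$ from Equation (5).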
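The joint objective in Equations (11) and (12) is a plain weighted sum, sketched below. The default weights are placeholders chosen for illustration only; the values actually used are those listed in Table 1, which is not reproduced here.

```python
def generator_loss(l_adv, l_l1, l_feat, l_style,
                   lam_adv=1.0, lam_l1=10.0, lam_p=1.0):
    """Equation (12), with the perceptual term built from Equation (11).

    The default lambda values are hypothetical placeholders.
    """
    l_percept = l_feat + l_style                                  # Equation (11)
    return lam_adv * l_adv + lam_l1 * l_l1 + lam_p * l_percept    # Equation (12)
```

Setting `lam_p=0.0` or `lam_l1=0.0` removes the corresponding term, which is one simple way to set up the kind of ablation mentioned in the abstract.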

2.4. Overall Diagnostic Framework and Implementation

2.4.1. Overall Diagnostic Flowchart

To overcome the spectral blurring and mode collapse issues inherent in standard adversarial training, we propose a diagnostic framework centered on PerDCGAN, as illustrated in Figure 4.

Step 1: Signal Segmentation. The raw vibration signals for each fault category are segmented using an overlapping sliding window technique. Each segment contains 4096 consecutive data points, resulting in a dataset of 1500 samples.

Step 2: Time-Frequency Transformation. The segmented 1D signals are converted into 2D time-frequency scalograms via Continuous Wavelet Transform. The constructed image dataset is then stratified into training, validation, and testing subsets.

Step 3: PerDCGAN Training & Augmentation. The PerDCGAN model is trained on the limited fault images. Crucially, a pre-trained VGG-16 [32] network (a deep convolutional architecture widely recognized for robust feature extraction) is integrated to calculate feature reconstruction loss and style loss. This perceptual constraint guides the generator to synthesize high-fidelity samples with rich texture details until the training stabilizes.

Step 4: Classifier Construction. A VGG-16-based fault diagnosis model is established and trained using the augmented dataset from the Case Western Reserve University archive.

Step 5: Performance Evaluation. The final classification results are visualized using confusion matrices [33], and the feature learning capability is analyzed via t-SNE [34] to verify the separability of different fault categories.
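Step 1 of the procedure above can be sketched as an overlapping sliding window. The 4096-point window comes from the text, whereas the step size of 512 points and the function name `segment` are assumptions, since the overlap is not stated.

```python
import numpy as np

def segment(signal, window=4096, step=512):
    """Cut a 1D signal into overlapping windows of `window` points, advancing by `step`."""
    count = (len(signal) - window) // step + 1
    if count < 1:
        return np.empty((0, window))       # signal shorter than one window
    return np.stack([signal[i * step : i * step + window] for i in range(count)])
```

A smaller step produces more (and more strongly overlapping) segments from the same recording, which is how a fixed-length signal can yield a large candidate pool.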

2.4.2. Experimental Setup and Hyperparameters

The proposed framework was implemented using Python version 3.9 and the PyTorch deep learning framework on an NVIDIA RTX 4070 GPU (NVIDIA Corporation, Santa Clara, CA, USA). The comprehensive hyperparameter configuration utilized during the training phase is summarized in Table 1.

2.4.3. Model Training Procedure

The core of fault sample augmentation lies in the iterative adversarial training between the generator and the discriminator, which ultimately converges to a Nash equilibrium [21]. This process preserves the optimized network parameters and enables the generation of novel samples. During adversarial training, the training set samples and random vectors are jointly fed into the generator. The generator captures the data distribution characteristics of the real samples, learns from them, and produces synthetic samples that approximate the real ones. These generated samples, along with the real samples, are then input into the discriminator for authenticity discrimination. As the adversarial training progresses, the image generation capability of the generator improves, making it increasingly difficult for the discriminator to distinguish between real and generated samples. Simultaneously, the discriminative ability of the discriminator is continuously enhanced through this process.

The alternating adversarial training procedure maps the theoretical loss functions to the computational graph through a systematic execution. Initially, the weight parameters of the Generator (G) and Discriminator (D) are initialized, and the pre-trained VGG-16 network is loaded with frozen weights to serve exclusively as the feature extractor for the perceptual loss. During each training iteration, a mini-batch of 4 real time-frequency images (x) from the segmented dataset and random noise vectors (z) from a standard Gaussian distribution are sampled. The optimization is performed alternately: first, the real images x and synthetic images G(z) are fed into the Discriminator, whose weights are updated to maximize Equation (6) using the Binary Cross-Entropy (BCE) criterion. Subsequently, the Generator produces a new batch of synthetic images G(z) to calculate the comprehensive objective function.

Specifically, to compute the perceptual penalty, both x and G(z) are propagated through the frozen VGG-16 network to extract their intermediate hierarchical feature maps. The selection of these specific layers is strictly tailored to the physical characteristics of vibration spectrograms. We extract features from the Conv3_3 layer to compute the feature reconstruction loss (Equation (8)) for macro-semantic alignment. Simultaneously, a multi-scale combination of shallow and deep layers—specifically Conv1_2, Conv2_2, Conv3_3, and Conv4_3—is utilized to compute the corresponding Gram matrices (Equation (9)) to derive the style loss (Equation (10)). This configuration is explicitly chosen because the shallow layers (Conv1_2, Conv2_2) are highly sensitive to capturing fine-grained, high-frequency transient patterns (e.g., sharp, localized fault impacts), whereas the deeper layers (Conv3_3, Conv4_3) are adept at extracting coarse global structures (e.g., macroscopic periodic spacing and resonant bands).
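To make the layer selection concrete, the sketch below derives where each named convolution sits in a VGG-16 feature stack laid out as Conv, ReLU, and MaxPool layers in sequence. The layout is an assumption that follows the commonly used ordering of torchvision's `vgg16().features`; the text does not say whether activations are tapped before or after the ReLU, so only the convolution positions are listed.

```python
# VGG-16 configuration: integers are output channels, "M" marks max pooling.
VGG16_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
             512, 512, 512, "M", 512, 512, 512, "M"]

def conv_layer_table(cfg):
    """Map names such as 'conv3_3' to (sequential index, channel count)."""
    table, index, block, conv = {}, 0, 1, 0
    for item in cfg:
        if item == "M":
            index += 1        # the pooling layer occupies one slot
            block += 1
            conv = 0
        else:
            conv += 1
            table[f"conv{block}_{conv}"] = (index, item)
            index += 2        # one slot for the convolution, one for its ReLU
    return table

STYLE_LAYERS = ["conv1_2", "conv2_2", "conv3_3", "conv4_3"]   # Equation (10)
FEATURE_LAYER = "conv3_3"                                      # Equation (8)

table = conv_layer_table(VGG16_CFG)
print([table[name] for name in STYLE_LAYERS])
# [(2, 64), (7, 128), (14, 256), (21, 512)]
```

The channel counts grow from 64 to 512 across the four style layers, so the corresponding Gram matrices range from 64 × 64 to 512 × 512.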

These two components are summed to form the overall perceptual loss (Equation (11)), which is then integrated with the adversarial loss and the L1 pixel-level loss (Equation (7)). The joint gradients are backpropagated to update the weights of G, minimizing the final objective function (Equation (12)) according to the weighting coefficients $\lambda$ specified in Table 1. This alternating process is repeated iteratively for 500 epochs until a stable Nash equilibrium is reached, ultimately yielding a model capable of generating high-fidelity structural textures.

3. Results and Discussion

3.1. Performance Evaluation on the CWRU Dataset

3.1.1. Dataset Construction

To validate the effectiveness of the proposed method, experiments were conducted using the bearing dataset from the Case Western Reserve University bearing test platform [35], as illustrated in Figure 5. The rolling bearing used in the experiment was a 6205-2RS deep-groove ball bearing (SKF, Gothenburg, Sweden). The descriptions of the rolling bearing category labels are provided in Table 2.

Single-point damage faults with three different diameters, specifically 0.1778 mm, 0.3556 mm, and 0.5334 mm, were introduced into the bearings using electrical discharge machining (EDM). Under operating conditions of 1730 r/min and a sampling frequency of 12 kHz, data were collected for four distinct health states: normal bearing (Norm), ball element fault (BE), outer race fault (OR), and inner race fault (IR).

For the experiment, time-domain signal data corresponding to these three distinct fault types and the normal bearing condition were selected. To strictly prevent data leakage, a common issue in time-series data processing where sliding windows may inadvertently share identical data points across subsets, a chronological partitioning strategy was employed prior to data segmentation. Specifically, the continuous 1D raw vibration signal for each health condition was first chronologically divided into mutually exclusive chunks for training and testing. Training set construction: Two separate training sets were constructed using two groups of original image sets of equal size. Each fault category in these sets contained 50 original time-frequency diagram samples, as detailed in Table 3.

Subsequently, an overlapping sliding window technique, utilizing a window size of 4096 points, was applied independently within these pre-divided boundaries. This isolation mechanism ensures that no data points overlap between the generated subsets. Through this procedure, a large pool of 1500 candidate samples per category was initially generated from the training chunks. To simulate an extreme few-shot diagnostic scenario, exactly 50 real time-frequency images per category were randomly selected from the independently segmented training pool to construct the training subsets. The specific data split configurations are detailed in Table 3. These limited samples were utilized to train both the generative models and the baseline diagnostic classifiers. For the test set, 100 completely unseen images per category were independently generated from the reserved testing chunks via Continuous Wavelet Transform. This protocol guarantees that the independent testing subset shares no data points with the training subsets, thereby ensuring a fair and objective final performance evaluation.
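The split-before-segment idea can be illustrated at the level of sample indices. In the sketch below, the 70/30 split, the step of 64 points, and the signal length in the usage lines are hypothetical; only the 4096-point window is taken from the text.

```python
def window_starts(begin, end, window, step):
    """Start indices of every window that fits entirely inside [begin, end)."""
    return list(range(begin, end - window + 1, step))

def chronological_windows(n_points, train_fraction=0.7, window=4096, step=64):
    """Split the index range first, then slide windows inside each chunk separately."""
    cut = int(n_points * train_fraction)
    train = window_starts(0, cut, window, step)
    test = window_starts(cut, n_points, window, step)
    return cut, train, test

cut, train, test = chronological_windows(120_000)
last_train_point = max(train) + 4096 - 1
first_test_point = min(test)
print(last_train_point < first_test_point)  # True
```

Because every training window ends before the cut and every test window starts at or after it, overlapping windows can share points within a subset but never across subsets.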

3.1.2. Comparison of Fault Diagnosis Accuracy

To account for random initialization and ensure the reliability of the evaluation, all diagnostic experiments in this study were repeated 10 times using different random seeds. The classification performance is reported as the mean accuracy alongside the standard deviation.
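The reporting convention can be sketched with the standard library. The accuracy values in the usage lines are made-up placeholders rather than results from the study, and the use of the sample standard deviation (`statistics.stdev`) is an assumption, since the text does not say which estimator was used.

```python
import statistics

def summarize(accuracies):
    """Return (mean, sample standard deviation) over repeated runs."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)

def format_pct(mean, std):
    """Render a fraction-valued mean and std as 'xx.x ± y.y%'."""
    return f"{100 * mean:.1f} ± {100 * std:.1f}%"

# Hypothetical accuracies from three seeds, for illustration only.
mean, std = summarize([0.95, 0.96, 0.97])
print(format_pct(mean, std))  # 96.0 ± 1.0%
```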

Experiment 1: Effect of Training Set Size on Diagnostic Accuracy

To evaluate the proposed PerDCGAN in small-sample fault diagnosis, comparative experiments were conducted against a standard DCGAN and a Wasserstein GAN with Gradient Penalty (WGAN-GP). A baseline dataset was first constructed using only 50 real samples for each health condition to simulate data scarcity. Synthetic time-frequency images generated by DCGAN, WGAN-GP, and PerDCGAN were then incrementally added to this baseline, with the augmentation size ranging from 0 to 30 samples per category. A separate, non-overlapping test set containing 100 images per category was used for evaluation. This setup assesses how synthetic samples from different generative models affect the classifier’s performance under limited data conditions.

As shown in Table 4, augmenting the training set with 30 synthetic samples from PerDCGAN increases the mean diagnostic accuracy from the baseline of 93.0 ± 0.5% to 96.0 ± 0.2%. Under the identical experimental setup, adding 30 samples from the standard DCGAN results in an accuracy of 94.8 ± 0.6%, while the WGAN-GP baseline reaches 95.4 ± 0.3%. Furthermore, as the augmentation scale increases, the standard deviation of PerDCGAN progressively decreases to ±0.2%, indicating higher stability across multiple runs compared to the other baselines.

These quantitative results reflect the limitations of conventional generative objectives in mechanical signal processing. Standard DCGAN relies on pixel-level optimization, which tends to produce blurred spectrograms with limited discriminative information. Although WGAN-GP improves upon DCGAN by optimizing the global data distribution, it lacks explicit constraints on local micro-textures. By integrating perceptual constraints, PerDCGAN preserves the complex time-frequency textures required for fault classification, effectively providing the diagnostic model with more reliable physical features under data-scarce conditions.

Figure 6 details the per-class diagnostic accuracy and common confusions across different augmentation scales. A notable performance improvement is observed in the roller element faults. Using the proposed PerDCGAN, the identification accuracy for roller faults increases from 92% at the baseline (0 added samples) to 99% with 30 augmented samples. In comparison, standard DCGAN and WGAN-GP improve the roller fault accuracy to 94% and 96%, respectively. However, the confusion matrices also indicate that misclassifications between inner and outer race faults remain a challenge across all three models. For example, at the maximum augmentation scale, PerDCGAN still misclassifies 7 outer race samples as inner race faults. This common confusion can likely be attributed to the physical similarities in the resonant frequency bands of these two structural defects.

These class-level results reflect the underlying generation mechanisms of the evaluated models. Standard DCGAN relies on pixel-level optimization, which tends to smooth out the weak, high-frequency transient impacts characteristic of roller faults. While WGAN-GP improves global distribution alignment, it lacks explicit constraints on local micro-textures. Consequently, the synthetic features generated by these baselines struggle to completely separate roller faults from other classes in the feature space. By incorporating texture consistency through the Gram matrix, PerDCGAN helps reconstruct these high-frequency impact boundaries. This feature-level regularization alleviates the spectral blurring typically observed in conventional generative models, supplying the classifier with supplementary physical signatures that assist in delineating decision boundaries under data-scarce conditions.
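The Gram-matrix texture constraint mentioned above can be illustrated with a short sketch. It is written under stated assumptions: the normalization by C·H·W and the mean-squared comparison follow common style-loss practice and may differ from the definition used in this paper, and random arrays stand in for the VGG-16 feature maps.

```python
import numpy as np


def gram_matrix(features):
    """Gram matrix of a (C, H, W) feature map, normalized by C*H*W (assumed)."""
    c, h, w = features.shape
    flat = features.reshape(c, h * w)
    return flat @ flat.T / (c * h * w)


def style_loss(feat_real, feat_fake):
    """Mean squared difference between the two Gram matrices."""
    diff = gram_matrix(feat_real) - gram_matrix(feat_fake)
    return float(np.mean(diff ** 2))


# Random arrays stand in for feature maps of a real and a generated image.
rng = np.random.default_rng(0)
real = rng.standard_normal((8, 16, 16))
fake = rng.standard_normal((8, 16, 16))

print(gram_matrix(real).shape)   # (8, 8)
print(style_loss(real, real))    # 0.0
print(style_loss(real, fake) > 0.0)
```

Because the Gram matrix discards spatial position and keeps only channel co-activation statistics, a loss of this form penalizes texture mismatch rather than pixel-by-pixel differences.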

Experiment 2: Effect of Varying Sample Proportions on Model Accuracy

The training and test sets are the same as those detailed in Table 3.

To rigorously assess whether the synthetic time-frequency representations capture the essential discriminative features of bearing faults—rather than merely mimicking surface-level visuals—a data substitution experiment was conducted. In this setup, subsets of the original real training samples were randomly replaced by an equivalent number of PerDCGAN-generated images, with substitution ratios set at 0%, 7.5%, 15%, and 30%. If the generated samples lacked critical fault patterns, the classifier’s performance would inevitably degrade as real data was removed.
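The substitution protocol can be sketched as below. This is an illustrative assumption rather than the authors' procedure: replacement is applied over the whole 200-image training set from Table 3 (7.5% of 200 is 15 images), and the function and variable names are hypothetical.

```python
import random


def substitute_samples(real, synthetic, ratio, seed=0):
    """Replace a fraction `ratio` of `real` with items drawn from `synthetic`."""
    rng = random.Random(seed)
    n_replace = round(len(real) * ratio)
    if n_replace > len(synthetic):
        raise ValueError("not enough synthetic samples")
    kept = rng.sample(real, len(real) - n_replace)   # real samples that remain
    injected = rng.sample(synthetic, n_replace)      # synthetic replacements
    return kept + injected


# Tuples stand in for images; the first field records the sample's origin.
real = [("real", i) for i in range(200)]
synthetic = [("synthetic", i) for i in range(100)]

for ratio in (0.0, 0.075, 0.15, 0.30):
    mixed = substitute_samples(real, synthetic, ratio)
    n_syn = sum(1 for source, _ in mixed if source == "synthetic")
    print(f"{ratio:.1%}: {len(mixed)} samples, {n_syn} synthetic")
```

The training-set size stays constant at every ratio, so any change in accuracy reflects the quality of the substituted samples rather than the amount of data.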

However, the results in Table 5 show that substituting real data with synthetic samples did not compromise diagnostic accuracy. The model reached its highest mean accuracy of 93.5 ± 0.63% at a 7.5% substitution rate, compared with 93.0 ± 0.42% for the baseline, and accuracy remained at 93.3 ± 0.52% even at a 30% substitution rate. Because these differences are of the same order as the standard deviations, the result is best read as accuracy being maintained rather than clearly improved. The findings nevertheless suggest that PerDCGAN produces high-fidelity samples that retain the underlying fault characteristics.

One plausible explanation for this result is a noise-filtering effect of PerDCGAN. Raw vibration data often contain random environmental noise, which can cause overfitting in small-sample scenarios. Guided by the perceptual loss, PerDCGAN tends to reconstruct consistent periodic impact features while discarding random background noise, effectively generating “purified” fault prototypes. Substituting a portion of the real data with these synthetic samples then acts as a data-level regularization: it reduces irregular noise in the training set and helps the classifier focus on essential diagnostic signatures, which may account for the slightly higher mean accuracy on the test set.

Experiment 3: Robustness Analysis under Noisy Environments

To evaluate model robustness against realistic industrial interference, Additive White Gaussian Noise (AWGN) was injected directly into the raw 1D vibration signals before CWT processing. As a representative example, Figure 7 provides a visual comparison before and after 0 dB noise injection. The intense random noise completely submerges the periodic transient impacts in the 1D waveforms, thereby obscuring the discriminative high-frequency vertical stripes in the corresponding 2D spectrograms.
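The noise-injection step can be sketched with the standard SNR definition, in which the noise power is the signal power divided by 10^(SNR/10). The snippet is an illustration only: the sinusoidal test signal and the 12 kHz sampling rate are placeholder assumptions, and the function name is hypothetical.

```python
import numpy as np


def add_awgn(signal, snr_db, rng=None):
    """Add white Gaussian noise to a 1D signal at the requested SNR in dB."""
    rng = np.random.default_rng() if rng is None else rng
    signal = np.asarray(signal, dtype=float)
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise


# Placeholder signal: a 100 Hz sinusoid sampled at an assumed 12 kHz.
rng = np.random.default_rng(42)
t = np.arange(100_000) / 12_000.0
clean = np.sin(2.0 * np.pi * 100.0 * t)
noisy = add_awgn(clean, snr_db=0.0, rng=rng)

# Empirical check: at 0 dB the noise power should match the signal power.
measured = 10.0 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
print(f"measured SNR: {measured:.2f} dB")
```

In the pipeline described here, the noisy signal would then be passed to the CWT stage in place of the clean one.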

This visual degradation highlights the extreme difficulty of extracting fault signatures under 0 dB conditions. To quantitatively assess and compare diagnostic robustness, the identical dataset splitting strategy was maintained, augmenting the baseline with 30 synthetic samples per class. Classifiers trained with data from standard DCGAN, WGAN-GP, and the proposed PerDCGAN were then evaluated on a dedicated 0 dB noisy test set. The comparative classification accuracies under this adverse environment are detailed in Table 6.

Experiment 4: Ablation Study on Loss Function Components

To validate the individual contributions of the terms in the proposed objective function, particularly the perceptual and pixel-level regularizations, an ablation study was conducted. While the core innovation of PerDCGAN lies in decoupling high-frequency physical impact features from severe background noise, it is methodologically essential to isolate and quantify the impact of each regularization term in a controlled environment. Therefore, this ablation experiment is evaluated on the standard, noise-free CWRU dataset to eliminate the confounding variables introduced by extreme external interference. The generative performance was evaluated across four loss configurations: Adv (Baseline), Adv + L1, Adv + Per, and Adv + Per + L1.

To guarantee a fair comparison, the identical dataset partitioning strategy (train-test split) utilized in the previous experiments was maintained. A strict small-sample augmentation protocol was implemented: for each of the four configurations, the corresponding trained generator was utilized to synthesize exactly 30 augmented samples per fault category. These synthetic samples were subsequently integrated into the identical limited training set to train the downstream classifier. The ultimate efficacy of each loss component is quantitatively evaluated by the diagnostic accuracy on a unified, unseen test set.

As detailed in Table 7, the full PerDCGAN configuration achieves the highest diagnostic accuracy of 96.60% with the smallest standard deviation of ±0.45%. This result supports the synergistic design of the joint objective function. The baseline model driven solely by the adversarial loss yields a lower accuracy of 92.50% and a larger standard deviation of ±1.20%. Integrating individual regularizations provides measurable improvements. Variant A, which adds the L1 loss, increases the accuracy to 94.80% primarily by suppressing low-frequency background artifacts. Variant B, which incorporates the perceptual loss, reaches 95.10% by better aligning the deep semantic features of high-frequency transient impacts. The superior performance of the combined model indicates that the L1 and perceptual losses are complementary rather than redundant. The L1 penalty establishes a clean background, allowing the perceptual loss to focus on reconstructing the precise structural geometry of the fault signatures, which in turn improves the downstream diagnostic accuracy.

3.1.3. Analysis of Experimental Results

To evaluate the training stability and convergence quality, we monitored the evolution of generated samples throughout the training process. Figure 8 compares the progression of the standard DCGAN and the proposed PerDCGAN at Epochs 100 and 500.

As Figure 8 shows, during the early training phase (Epoch 100), both models exhibit under-fitting characteristics. The generated spectrograms appear noisy and blurred, lacking the distinct time-frequency textures required to identify specific fault categories.

However, a significant divergence in performance is observed by Epoch 500. The proposed model achieves stable convergence, producing high-fidelity images with sharp fault patterns. In contrast, the standard DCGAN suffers from training instability and mode collapse, generating blurred, repetitive features that fail to capture the high-frequency details of the real data. These observations indicate that the proposed enhancement strategies mitigate the instability issues inherent in standard GANs, improving both the visual quality and the diversity of the generated fault samples.

To qualitatively assess the feature separability between real and generated samples, t-SNE was employed to project the extracted features into a two-dimensional space. As illustrated in Figure 9, the feature space exhibits distinct clusters corresponding to the different fault categories. As a qualitative tool for visualizing local similarities, t-SNE reveals that the generated samples consistently project into the same local regions as their corresponding real samples. This spatial alignment indicates that the synthetic data successfully captures essential class-specific structural features. Consequently, the augmented samples provide the classifier with distributionally consistent representations, corroborating the diagnostic improvements observed in the quantitative experiments.

3.1.4. Image Quality Assessment Methodology

This study employs the Fréchet Inception Distance (FID) [36] and the Structural Similarity Index (SSIM) [37] to quantitatively assess the generation quality. The FID metric utilizes the Inception-v3 model to compute the Fréchet distance between the deep feature distributions of real and generated images. A lower FID and a higher SSIM indicate that the generated spectrograms successfully preserve structural textures and closely resemble the authentic physical data distribution.
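For reference, the standard definitions of the two metrics are given below. The symbols are introduced here for illustration: $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ denote the mean and covariance of the Inception-v3 features of real and generated images; $\mu_x, \mu_y$, $\sigma_x^2, \sigma_y^2$, and $\sigma_{xy}$ are the local means, variances, and covariance of images $x$ and $y$; and $C_1, C_2$ are small stabilizing constants.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)

\mathrm{SSIM}(x, y) =
  \frac{(2\mu_x \mu_y + C_1)\,(2\sigma_{xy} + C_2)}
       {(\mu_x^2 + \mu_y^2 + C_1)\,(\sigma_x^2 + \sigma_y^2 + C_2)}
```

FID is zero when the two feature distributions have identical mean and covariance, and SSIM equals 1 only for identical images, which is why lower FID and higher SSIM both indicate better generation quality.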

Figure 10 compares the generated time-frequency diagrams against the original samples. DCGAN exhibits severe blurring that obscures critical fault-specific impact bands. While WGAN-GP preserves the general energy contours, its high-frequency details remain noisy. In contrast, PerDCGAN produces cleaner backgrounds and sharper vertical energy stripes, reconstructing the original structural features more effectively than the baseline methods.

Each of the three models (DCGAN, WGAN-GP, and the proposed PerDCGAN) was used to generate 50 synthetic samples per fault category. The generation quality, evaluated by SSIM and FID, is summarized in Table 8.

The baseline DCGAN yields the poorest performance, struggling to capture complex time-frequency distributions. WGAN-GP provides a measurable improvement in FID scores, indicating better macroscopic distribution alignment, yet it remains limited in preserving fine structural details. PerDCGAN outperforms both baselines across all categories. It achieves the lowest FID scores and uniquely elevates SSIM values above 0.60. These metrics quantitatively confirm that the proposed joint regularization successfully preserves the critical structural geometry of the spectrograms, such as the periodic spacing and localized energy of transient impacts.

3.2. Performance Evaluation on the Paderborn University Dataset

3.2.1. Dataset Construction

To validate the generalization performance of the proposed method on complex real-world industrial data, experiments were conducted using the bearing fault diagnosis dataset publicly released by Paderborn University [38]. This dataset is distinguished from other public benchmarks by its inclusion of real-world natural fault data acquired from a modular electromechanical drive test rig rather than artificially machined single-point faults. As shown in Figure 11, the physical structure of the data acquisition apparatus consists of five sequentially connected modules: a drive motor, a torque-measuring shaft, a test bearing module equipped with a vibration measurement device, a flywheel, and a load motor.

To prevent data leakage caused by sliding windows inadvertently sharing identical data points across subsets, a chronological partitioning strategy was employed prior to data segmentation. The continuous 1D raw vibration signal for each physical bearing was first chronologically divided into mutually exclusive training and testing chunks. An overlapping sliding window technique utilizing a window size of 4096 points was then applied independently within these pre-divided boundaries. The selected fault parameters and specific operating conditions for these bearings are detailed in Table 9.
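The split-then-window order described above can be sketched as follows. Only the 4096-point window comes from the text; the 70/30 split and the 1024-point step are assumptions made for illustration, and the function names are hypothetical.

```python
import numpy as np


def sliding_windows(chunk, window=4096, step=1024):
    """Cut a 1D array into overlapping windows of length `window`."""
    starts = range(0, len(chunk) - window + 1, step)
    if len(starts) == 0:
        return np.empty((0, window), dtype=chunk.dtype)
    return np.stack([chunk[s:s + window] for s in starts])


def split_then_window(signal, train_fraction=0.7, window=4096, step=1024):
    """Split chronologically first, then window each part independently."""
    signal = np.asarray(signal)
    cut = round(len(signal) * train_fraction)
    train = sliding_windows(signal[:cut], window, step)
    test = sliding_windows(signal[cut:], window, step)
    return train, test


# Sample indices stand in for a vibration signal so that overlap is easy to check.
signal = np.arange(40_000)
train, test = split_then_window(signal)
print(train.shape, test.shape)
print(train.max() < test.min())  # True: no data point is shared across subsets
```

Windowing after the split means overlapping windows can only share points with other windows in the same subset, which is the leakage guarantee the text describes.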

To simulate a few-shot diagnostic scenario with complex natural faults while maintaining strict class balance, both the training and testing datasets were sourced from the N15_M07_F04 operating condition. The five bearing conditions were grouped into three macroscopic health states: Normal, Outer Race fault, and Inner Race fault. For the limited training set, exactly 60 real time-frequency images were allocated to each state, totaling 180 samples. Within the faulty states, these 60 samples were equally drawn from the two distinct damage profiles, meaning 30 samples from KA04 and 30 from KA15 formed the Outer Race class. A strictly balanced test set of 300 completely unseen images was generated using the identical distribution strategy, comprising 100 samples per macroscopic state as shown in Table 10. This balanced arrangement ensures zero data overlap, prevents majority class bias, and objectively evaluates the model performance against weak natural fault characteristics and intra-class diversity.

3.2.2. Comparison of Fault Diagnosis Accuracy

Since the fundamental generative mechanism, image quality assessment, and loss function ablation have been comprehensively validated through the CWRU case study in Section 3.1, this section focuses exclusively on evaluating the model’s diagnostic capability in highly complex industrial scenarios.

Experiment 5: Effect of Training Set Size on Diagnostic Accuracy on the PU Dataset

This experiment investigates the impact of synthetic sample augmentation on diagnostic generalization across different operating loads. The baseline classifier was trained using exactly 40 real samples per bearing condition. Synthetic time-frequency images generated by DCGAN, WGAN-GP, and PerDCGAN were incrementally added to the training set in volumes up to 30 samples per category. The resulting classification accuracies for each augmentation scale are detailed in Table 11.

Table 11 details the classification results for the natural fault diagnosis. The baseline model achieves 88.24% accuracy, reflecting the inherent difficulty of diagnosing natural faults. In real-world scenarios, weak impact features are heavily obscured by mechanical noise and complicated by significant intra-class diversity (e.g., varying damage severities between KA04 and KA15).

Adding synthetic samples yields distinct scaling behaviors across the generative models. Standard DCGAN provides limited improvement, reaching 91.16% accuracy with a standard deviation of 1.61% at maximum augmentation, suggesting that basic pixel-level generation struggles to reliably reproduce the subtle textures of natural faults. WGAN-GP shows a fast early gain, achieving 92.67% accuracy with 10 augmented samples, but its performance fluctuates as data volume increases, dropping at 20 samples before eventually settling at 94.55%. PerDCGAN follows a different scaling trajectory. While its initial accuracy with 10 samples is lower at 90.18%, its performance improves consistently as data volume increases, reaching 98.21% with a standard deviation of 0.65% at 30 samples. This initial lag followed by steady improvement suggests that the structural features generated by perceptual optimization may require a certain number of samples before they help the classifier establish clear decision boundaries. These findings indicate that perceptual loss can capture subtle physical impact textures more effectively than traditional distribution-matching methods, providing the classifier with robust features for complex fault diagnosis.

Figure 12 compares the classification performance before and after data augmentation. In the baseline matrix, the Normal condition shows 98% accuracy, while the Outer Race and Inner Race conditions achieve 89% and 78% accuracy, respectively. The Inner Race is the most frequently confused category in the baseline, misidentified as the Outer Race 14 times and the Normal state 8 times. Following augmentation with PerDCGAN, classification accuracy improves across all categories, with the Normal and Outer Race conditions reaching 100% and 98% accuracy. Among all fault types, the Inner Race condition demonstrates the most substantial improvement. The specific misclassifications of the Inner Race drop to 2 and 1, respectively, increasing its accuracy from 78% to 97%. This outcome suggests that the generated samples provide highly discriminative structural features for the Inner Race, a category that typically performs poorly due to its subtle signal characteristics, helping the classifier establish more accurate decision boundaries for the most challenging fault conditions.
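The Inner Race figures quoted above follow directly from the corresponding confusion-matrix rows, as the short check below shows. The row values are taken from the text; the column order and the helper name are assumptions made for illustration.

```python
def per_class_accuracy(row, true_index):
    """Fraction of one confusion-matrix row that lands on the true class."""
    return row[true_index] / sum(row)


# Assumed column order: Normal, Outer Race, Inner Race.
INNER = 2
inner_baseline = [8, 14, 78]    # 8 as Normal, 14 as Outer Race, 78 correct
inner_augmented = [1, 2, 97]    # 1 as Normal, 2 as Outer Race, 97 correct

print(per_class_accuracy(inner_baseline, INNER))   # 0.78
print(per_class_accuracy(inner_augmented, INNER))  # 0.97
```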

Experiment 6: Robustness Analysis under Noisy Environments on the PU Dataset

To further assess diagnostic stability under severe background interference, additive white Gaussian noise was superimposed onto the raw 1D vibration signals of the testing set prior to the continuous wavelet transform. Figure 13 illustrates the signal degradation using a 0 dB noise injection as an example. The introduced random noise effectively buries the already weak natural impact transients in the time domain, which severely blurs the discriminative high-frequency structural features within the resulting 2D spectrograms. This configuration simulates harsh industrial environments where extreme noise heavily obscures the essential physical patterns required for accurate fault classification.

The quantitative results under the 0 dB noise condition (Table 12) show how the different generative augmentation methods perform under severe interference, where the noise power equals the signal power. The diagnostic accuracy of the classifier augmented by standard DCGAN drops to 79.42% with a standard deviation of 1.85%. This decline occurs because pixel-level loss functions generally struggle to preserve sharp transient impact boundaries, making the generated features susceptible to additive noise. WGAN-GP yields a moderate improvement at 83.86% accuracy but still shows limitations in maintaining structural integrity under random interference. PerDCGAN maintains an accuracy of 89.65% with a standard deviation of 0.95%, recording the smallest accuracy decrease among the tested models relative to its clean-data performance. This difference suggests a practical advantage of the perceptual loss mechanism: by constraining the generator in the high-level VGG feature space rather than the raw pixel space, PerDCGAN prioritizes the reconstruction of macroscopic periodic structures. Since high-level structural semantics tend to be more resistant to additive white noise than low-level pixel intensities, this approach provides the classifier with features that support more stable decision boundaries under high-noise conditions.
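The "smallest accuracy decrease" claim can be checked with simple arithmetic. The sketch below assumes that the clean-data reference for each model is its accuracy at 30 augmented samples reported in Experiment 5; the text does not state this pairing explicitly.

```python
# Clean-data accuracies at 30 augmented samples (Experiment 5) and
# accuracies on the 0 dB noisy test set (Experiment 6), both in %.
clean = {"DCGAN": 91.16, "WGAN-GP": 94.55, "PerDCGAN": 98.21}
noisy = {"DCGAN": 79.42, "WGAN-GP": 83.86, "PerDCGAN": 89.65}

# Accuracy drop in percentage points for each model.
drops = {model: round(clean[model] - noisy[model], 2) for model in clean}

for model, drop in drops.items():
    print(f"{model}: {drop:.2f} percentage points")

print(min(drops, key=drops.get))  # PerDCGAN
```

Under this assumption the drops are 11.74, 10.69, and 8.56 percentage points for DCGAN, WGAN-GP, and PerDCGAN, respectively.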

4. Conclusions

This study proposed the PerDCGAN framework to overcome the low-fidelity generation and mode collapse issues inherent in standard GAN-based data augmentation for rolling bearing fault diagnosis. By moving from purely pixel-level constraints to a VGG-16-based perceptual loss, the model preserves the high-frequency transient impacts and structural textures of bearing signals. Experimental validation on both the CWRU and PU datasets demonstrates the advantage of this approach. Specifically, ablation studies verified that the perceptual regularization and the L1 loss act synergistically to reconstruct clear time-frequency geometries. Crucially, robustness analysis revealed that under severe 0 dB additive white noise, PerDCGAN maintained a diagnostic accuracy of 89.65% on natural faults, exhibiting a smaller performance degradation than the standard generative models.

Despite these advances, the framework currently relies on a VGG-16 network pre-trained on natural optical images rather than domain-specific mechanical spectrograms. Accordingly, future research will focus on developing domain-adapted feature extractors tailored for time-frequency signals, extending the framework via conditional domain adaptation to synthesize high-fidelity samples under fluctuating speeds and loads, and exploring model compression techniques to facilitate real-time industrial deployment.

Author Contributions

Conceptualization, J.Y. and Y.L.; methodology, Y.L.; software, Y.L. and A.L.; validation, Y.L., A.L. and X.W.; formal analysis, Y.L. and X.W.; investigation, Y.L.; resources, J.Y.; data curation, Y.L. and A.L.; writing—original draft preparation, Y.L.; writing—review and editing, J.Y. and Y.L.; visualization, Y.L.; supervision, J.Y.; project administration, J.Y.; funding acquisition, J.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Natural Science Foundation of China [No. 52502531], the Natural Science Foundation of Shandong Province, China [Grant No. ZR2023QE214 and No. ZR2025MS765], and the Opening Foundation of Shandong Key Laboratory of Intelligent Manufacturing Technology for Advanced Power Equipment, Weifang University, China [No. SKLOIMTFAPE26013].

Data Availability Statement

Publicly available datasets were analyzed in this study. These data can be found at https://engineering.case.edu/bearingdatacenter (accessed on 15 October 2025) (Case Western Reserve University Bearing Data Center) and https://mb.uni-paderborn.de/kat/forschung/bearing-datacenter (accessed on 22 March 2026) (Paderborn University Bearing Data Center).

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Gao, Z.; Cecati, C.; Ding, S.X. A Survey of Fault Diagnosis and Fault-Tolerant Techniques—Part I: Fault Diagnosis With Model-Based and Signal-Based Approaches. IEEE Trans. Ind. Electron. 2015, 62, 3757–3767.
2. Liu, R.; Yang, B.; Zio, E.; Chen, X. Artificial intelligence for fault diagnosis of rotating machinery: A review. Mech. Syst. Signal Process. 2018, 108, 33–47.
3. Lei, Y.; Jia, F.; Lin, J.; Xing, S.; Ding, S.X. An Intelligent Fault Diagnosis Method Using Unsupervised Feature Learning Towards Mechanical Big Data. IEEE Trans. Ind. Electron. 2016, 63, 3137–3147.
4. Zhang, W.; Peng, G.; Li, C.; Chen, Y.; Zhang, Z. A New Deep Learning Model for Fault Diagnosis with Good Anti-Noise and Domain Adaptation Ability on Raw Vibration Signals. Sensors 2017, 17, 425.
5. Zhao, R.; Wang, D.; Yan, R.; Mao, K.; Shen, F.; Wang, J. Machine Health Monitoring Using Local Feature-Based Gated Recurrent Unit Networks. IEEE Trans. Ind. Electron. 2018, 65, 1539–1548.
6. Gulrajani, I.; Ahmed, F.; Arjovsky, M.; Dumoulin, V.; Courville, A. Improved Training of Wasserstein GANs. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; pp. 5767–5777. Available online: https://proceedings.neurips.cc/paper/2017/hash/892c3b1c6dccd52936e27cbd0ff683d6-Abstract.html (accessed on 6 April 2026).
7. Zhang, H.; Goodfellow, I.; Metaxas, D.; Odena, A. Self-Attention Generative Adversarial Networks. In Proceedings of the 36th International Conference on Machine Learning (ICML), Long Beach, CA, USA, 9–15 June 2019; pp. 7354–7363. Available online: http://proceedings.mlr.press/v97/zhang19d.html (accessed on 6 April 2026).
8. Wong, T.Y.; Lim, M.H.; Ngui, W.K.; Leong, M.S. Denoising diffusion implicit model for bearing fault diagnosis under different working loads. ITM Web Conf. 2024, 63, 01025.
9. Yang, Y.; Deng, X. Power Transformer Winding Fault Diagnosis Method Based on Time–Frequency Diffusion Model and ConvNeXt-1D. Appl. Sci. 2026, 16, 2528.
10. Zhou, K.; Diehl, E.; Tang, J. Deep convolutional generative adversarial network with semi-supervised learning enabled physics elucidation for extended gear fault diagnosis under data limitations. Mech. Syst. Signal Process. 2023, 185, 109774.
11. Li, Z.; Jiang, H.; Wang, X. A novel reinforcement learning agent for rotating machinery fault diagnosis with data augmentation. Reliab. Eng. Syst. Saf. 2025, 253, 110486.
12. Guan, S.; Wu, T.Y.; Yang, H.Q. Research on transformer fault diagnosis method based on ACGAN and CGWO-LSSVM. Sci. Rep. 2024, 14, 142.
13. Fu, Z.; Liu, Z.; Ping, S.; Li, W.; Liu, J. TRA-ACGAN: A motor bearing fault diagnosis model based on an auxiliary classifier generative adversarial network and transformer network. ISA Trans. 2024, 149, 381–393.
14. Zhang, S.; Zhang, S.; Wang, B.; Habetler, T.G. Deep Learning Algorithms for Bearing Fault Diagnostics—A Comprehensive Review. IEEE Access 2020, 8, 29857–29881.
15. Luo, J.; Zhu, L.; Li, Q.; Liu, D.; Chen, M. Imbalanced Fault Diagnosis of Rotating Machinery Based on Deep Generative Adversarial Networks with Gradient Penalty. Processes 2021, 9, 1751.
16. Gui, J.; Sun, Z.; Wen, Y.; Tao, D.; Ye, J. A Review on Generative Adversarial Networks: Algorithms, Theory, and Applications. IEEE Trans. Knowl. Data Eng. 2023, 35, 3313–3332.
17. Wang, Q.; Zhou, H.; Li, G.; Guo, J. Single Image Super-Resolution Method Based on an Improved Adversarial Generation Network. Appl. Sci. 2022, 12, 6067.
18. Zhao, H.; Gallo, O.; Frosio, I.; Kautz, J. Loss Functions for Image Restoration with Neural Networks. IEEE Trans. Comput. Imaging 2017, 3, 47–57.
19. Randall, R.B.; Antoni, J. Rolling element bearing diagnostics—A tutorial. Mech. Syst. Signal Process. 2011, 25, 485–520.
20. Mallat, S. A Wavelet Tour of Signal Processing: The Sparse Way, 3rd ed.; Academic Press: Burlington, MA, USA, 2008.
21. Radford, A.; Metz, L.; Chintala, S. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. arXiv 2015, arXiv:1511.06434.
22. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS 2014), Montreal, QC, Canada, 8–13 December 2014; pp. 2672–2680.
23. Nair, V.; Hinton, G.E. Rectified Linear Units Improve Restricted Boltzmann Machines. In Proceedings of the 27th International Conference on Machine Learning (ICML), Haifa, Israel, 21–24 June 2010; pp. 807–814. Available online: https://icml.cc/2010/papers/432.pdf (accessed on 6 April 2026).
24. LeCun, Y.; Bottou, L.; Orr, G.B.; Müller, K.-R. Efficient BackProp. In Neural Networks: Tricks of the Trade, 2nd ed.; Springer: Berlin/Heidelberg, Germany, 2012; pp. 9–48.
25. Maas, A.L.; Hannun, A.Y.; Ng, A.Y. Rectifier Nonlinearities Improve Neural Network Acoustic Models. In Proceedings of the 30th International Conference on Machine Learning (ICML), Atlanta, GA, USA, 16–21 June 2013; p. 3. Available online: https://ai.stanford.edu/~amaas/papers/relu_hybrid_icml2013_final.pdf (accessed on 6 April 2026).
26. Han, J.; Moraga, C. The influence of the sigmoid function parameters on the speed of backpropagation learning. In From Natural to Artificial Neural Computation; Springer: Berlin/Heidelberg, Germany, 1995; pp. 195–201.
27. Isola, P.; Zhu, J.Y.; Zhou, T.; Efros, A.A. Image-to-Image Translation with Conditional Adversarial Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1125–1134.
28. Johnson, J.; Alahi, A.; Li, F.-F. Perceptual Losses for Real-Time Style Transfer and Super-Resolution. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016; pp. 694–711.
29. Cai, X.; Wang, G.; Lou, J.; Jian, M.; Dong, J.; Chen, R.C. Perceptual loss guided Generative adversarial network for saliency detection. Inf. Sci. 2024, 654, 119846.
30. Wang, C.; Xu, C.; Wang, C.; Tao, D. Perceptual Adversarial Networks for Image-to-Image Transformation. IEEE Trans. Image Process. 2018, 27, 4066–4079.
31. Gatys, L.A.; Ecker, A.S.; Bethge, M. Image Style Transfer Using Convolutional Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2414–2423.
32. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556.
33. Stehman, S.V. Selecting and interpreting measures of thematic classification accuracy. Remote Sens. Environ. 1997, 62, 77–89.
34. Van der Maaten, L.; Hinton, G. Visualizing Data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. Available online: https://www.jmlr.org/papers/v9/vandermaaten08a.html (accessed on 6 April 2026).
35. Smith, W.A.; Randall, R.B. Rolling element bearing diagnostics using the Case Western Reserve University data: A benchmark study. Mech. Syst. Signal Process. 2015, 64–65, 100–131.
36. Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; pp. 6629–6640. Available online: https://proceedings.neurips.cc/paper/2017/hash/8a1d694707eb0fefe65871369074926d-Abstract.html (accessed on 6 April 2026).
37. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612.
38. Lessmeier, C.; Kimotho, J.K.; Zimmer, D.; Sextro, W. Condition monitoring of bearing damage in electromechanical drive systems by using motor current signals of electric motors: A benchmark data set for data-driven classification. In Proceedings of the 3rd European Conference of the Prognostics and Health Management Society, Bilbao, Spain, 5–8 July 2016.

Figure 1. DCGAN Architecture.

Figure 2. Architecture of the PerDCGAN Model.

Figure 3. Examples of feature maps extracted from VGG-16. The shallow layers typically capture apparent edge and shape information, while the deep layers encode abstract semantic features.

Figure 4. Diagnostic Flowchart Based on the PerDCGAN.

Figure 5. Bearing data acquisition system of the CWRU dataset.

Figure 6. Evolution of confusion matrices under different data augmentation scales on the CWRU dataset.

Figure 7. Visual comparison of 1D vibration signals and 2D CWT spectrograms under clean and 0 dB noisy conditions on the CWRU dataset.

Figure 8. Training snapshots of the two methods at Epochs 100 and 500. PerDCGAN shows stable convergence, whereas DCGAN exhibits mode collapse.

Figure 9. t-SNE visualization of the feature distribution comparison between original real samples and PerDCGAN-generated samples.

Figure 10. Comparative visualization of generated images against the original. PerDCGAN effectively preserves sharp high-frequency transient stripes compared to the baselines.

Figure 11. Modular electromechanical drive test rig for the PU dataset.

Figure 12. Evolution of confusion matrices under different data augmentation scales on the PU dataset.

Figure 13. Visual comparison of 1D vibration signals and 2D CWT spectrograms under clean and 0 dB noisy conditions on the PU dataset.

Table 1. Hyperparameter settings.

Parameter	Value	Parameter	Value
Optimizer	Adam	$λ_{adv}$	0.1
Learning Rate	0.0002	$λ_{L1}$	1.0
Batch Size	4	$λ_{p}$	0.1
Epochs	500	$λ_{feat}$	1.0
Convergence	Visual Stability	$λ_{style}$	15
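
The weights in Table 1 can be read as coefficients of a joint generator objective. The following is a minimal sketch of how such weights might be combined, assuming the perceptual term is a weighted sum of a feature loss and a style loss that is then scaled by $λ_{p}$. This grouping, the class `LossWeights`, and the function `joint_generator_loss` are assumptions made for illustration; the exact formulation is the one given in Section 2.3.2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LossWeights:
    """Weight values from Table 1; how they are grouped below is an assumption."""
    adv: float = 0.1
    l1: float = 1.0
    per: float = 0.1
    feat: float = 1.0
    style: float = 15.0


def joint_generator_loss(l_adv, l_l1, l_feat, l_style, w=LossWeights()):
    # Assumed grouping: perceptual loss = feature loss + style loss (weighted),
    # which is then scaled by lambda_p and added to the other two terms.
    l_per = w.feat * l_feat + w.style * l_style
    return w.adv * l_adv + w.l1 * l_l1 + w.per * l_per


# Illustrative scalar loss values (not taken from the paper).
total = joint_generator_loss(l_adv=0.7, l_l1=0.05, l_feat=0.4, l_style=0.01)
```

In a real training loop the four inputs would be tensors produced by the discriminator, the pixel-wise comparison, and the VGG-16 feature maps; plain floats are used here only to make the weighting explicit.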

Table 2. Selected Data from the CWRU Dataset.

Label	Condition	Fault Size (mm)
0	Norm	0.1778
0	Norm	0.3556
0	Norm	0.5334
1	Ir	0.1778
1	Ir	0.3556
1	Ir	0.5334
2	Or	0.1778
2	Or	0.3556
2	Or	0.5334
3	Be	0.1778
3	Be	0.3556
3	Be	0.5334

Table 3. Data set split for the CWRU experiments.

Source of Sample	Training Set Size	Test Set Size
DCGAN	200	400
WGAN-GP	200	400
PerDCGAN	200	400

Table 4. Comparison of classification accuracy improvement on the CWRU dataset (Mean ± Std %).

Addition of Generated Samples	DCGAN	WGAN-GP	PerDCGAN
0	93.5 ± 0.62	93.0 ± 0.51	93.0 ± 0.54
10	93.8 ± 0.54	94.2 ± 0.40	94.7 ± 0.42
20	94.3 ± 0.45	95.1 ± 0.44	96.0 ± 0.30
30	94.8 ± 0.60	95.4 ± 0.38	96.2 ± 0.25

Table 5. Effect of Replacing Different Proportions of Original Samples with Generated Samples on Classification Accuracy (Mean ± Std %).

Generated Sample Substitution Rate (%)	Accuracy
0	93.0 ± 0.42
7.5	93.5 ± 0.63
15	93.4 ± 0.14
30	93.3 ± 0.52

Table 6. Comparison of classification accuracy under 0 dB noisy environment on the CWRU dataset (Mean ± Std %).

Model	Accuracy
DCGAN	73.2 ± 2.15
WGAN-GP	77.8 ± 1.56
PerDCGAN	84.5 ± 0.86

Table 7. Ablation Study on Loss Function Components (Mean ± Std %).

Model Configuration	Loss Function	Accuracy
Baseline	Adv	92.50 ± 1.20
Variant A	Adv + L1	94.80 ± 0.85
Variant B	Adv + Per	95.10 ± 0.70
PerDCGAN	Adv + Per + L1	96.60 ± 0.45

Table 8. Quantitative generation quality assessment (SSIM reported as Mean ± Std).

Method	Fault Type	SSIM	FID
DCGAN	Ir	0.5030 ± 0.0474	104.7609
DCGAN	Or	0.5030 ± 0.0474	128.4235
DCGAN	Be	0.4903 ± 0.0798	123.2145
WGAN-GP	Ir	0.5482 ± 0.0512	84.3120
WGAN-GP	Or	0.5513 ± 0.0489	88.1562
WGAN-GP	Be	0.5395 ± 0.0621	92.4718
PerDCGAN	Ir	0.6024 ± 0.0421	59.6256
PerDCGAN	Or	0.6124 ± 0.0551	51.4143
PerDCGAN	Be	0.6013 ± 0.0967	56.1841
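
To compare the three generators at a glance, the per-fault values in Table 8 can be averaged per method. The snippet below does this with an unweighted mean across the three fault types; this average is a convenience summary introduced here, not a metric reported in the paper. Higher SSIM and lower FID indicate generated images closer to the real ones.

```python
from statistics import mean

# (SSIM mean, FID) per fault type, copied from Table 8.
table8 = {
    "DCGAN": {"Ir": (0.5030, 104.7609), "Or": (0.5030, 128.4235), "Be": (0.4903, 123.2145)},
    "WGAN-GP": {"Ir": (0.5482, 84.3120), "Or": (0.5513, 88.1562), "Be": (0.5395, 92.4718)},
    "PerDCGAN": {"Ir": (0.6024, 59.6256), "Or": (0.6124, 51.4143), "Be": (0.6013, 56.1841)},
}

# Unweighted average over fault types for each method: (mean SSIM, mean FID).
summary = {
    method: (
        mean(ssim for ssim, _ in rows.values()),
        mean(fid for _, fid in rows.values()),
    )
    for method, rows in table8.items()
}

for method, (ssim, fid) in summary.items():
    print(f"{method:9s} mean SSIM = {ssim:.4f}, mean FID = {fid:.2f}")
```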

Table 9. Selected fault parameters and operating conditions from the PU dataset.

Label	Bearing No.	Condition	Sample Length	Speed (r/min)	Torque (Nm)	Radial Force (N)
0	K001	Normal	4096	1500	0.7	400
1	KA04	OR	4096	1500	0.7	400
1	KA15	OR	4096	1500	0.7	400
2	KI04	IR	4096	1500	0.7	400
2	KI14	IR	4096	1500	0.7	400

Table 10. Data set split for the PU experiments.

Source of Sample	Training Set Size	Test Set Size
DCGAN	180	300
WGAN-GP	180	300
PerDCGAN	180	300

Table 11. Effect of training set size on diagnostic accuracy on the PU dataset (Mean ± Std %).

Addition of Generated Samples	DCGAN	WGAN-GP	PerDCGAN
0	88.24 ± 1.42	88.24 ± 1.42	88.24 ± 1.42
10	89.58 ± 0.85	92.67 ± 0.96	90.18 ± 0.97
20	90.41 ± 1.47	91.42 ± 0.88	96.48 ± 0.54
30	91.16 ± 1.61	94.55 ± 0.72	98.21 ± 0.65

Table 12. Comparison of classification accuracy under 0 dB noisy environment on the PU dataset (Mean ± Std %).

Model	Accuracy
DCGAN	79.42 ± 1.85
WGAN-GP	83.86 ± 1.42
PerDCGAN	89.65 ± 0.95
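
The noise-robustness results can be restated as percentage-point gaps between PerDCGAN and each baseline. The short calculation below uses only the mean accuracies from Table 6 (CWRU) and Table 12 (PU); the helper name `gain_over` is introduced here for illustration, and the standard deviations are ignored, so the gaps are point estimates only.

```python
# Mean accuracies (%) under 0 dB noise, copied from Table 6 (CWRU) and Table 12 (PU).
noisy_accuracy = {
    "CWRU": {"DCGAN": 73.2, "WGAN-GP": 77.8, "PerDCGAN": 84.5},
    "PU": {"DCGAN": 79.42, "WGAN-GP": 83.86, "PerDCGAN": 89.65},
}


def gain_over(baseline, dataset):
    """Percentage-point gap between PerDCGAN and a baseline on one dataset."""
    scores = noisy_accuracy[dataset]
    return round(scores["PerDCGAN"] - scores[baseline], 2)


gains = {
    (dataset, baseline): gain_over(baseline, dataset)
    for dataset in noisy_accuracy
    for baseline in ("DCGAN", "WGAN-GP")
}

for (dataset, baseline), gap in gains.items():
    print(f"{dataset}: PerDCGAN vs {baseline}: +{gap:.2f} points")
```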

