3.1.1. Dataset Construction
To validate the effectiveness of the proposed method, experiments were conducted using the bearing dataset from the Case Western Reserve University bearing test platform [35], as illustrated in Figure 5. The rolling bearing used in the experiment was a 6205-2RS deep-groove ball bearing (SKF, Gothenburg, Sweden). The descriptions of the rolling bearing category labels are provided in Table 2.
Single-point damage faults with three different diameters, specifically 0.1778 mm, 0.3556 mm, and 0.5334 mm, were introduced into the bearings using electrical discharge machining (EDM). Under operating conditions of 1730 r/min and a sampling frequency of 12 kHz, data were collected for four distinct health states: normal bearing (Norm), ball element fault (BE), outer race fault (OR), and inner race fault (IR).
For the experiment, time-domain signal data corresponding to these three distinct fault types and the normal bearing condition were selected. To strictly prevent data leakage, a common issue in time-series processing where sliding windows may inadvertently share identical data points across subsets, a chronological partitioning strategy was employed prior to data segmentation. Specifically, the continuous 1D raw vibration signal for each health condition was first chronologically divided into mutually exclusive chunks for training and testing. For training set construction, two separate training sets were built from two groups of original image sets of equal size; each fault category in these sets contained 50 original time-frequency diagram samples, as detailed in the table.
Subsequently, an overlapping sliding window technique, utilizing a window size of 4096 points, was applied independently within these pre-divided boundaries. Although windows within the same subset may overlap, this isolation ensures that no data points are shared between the training and testing subsets. Through this procedure, a large pool of 1500 candidate samples per category was initially generated from the training chunks. To simulate an extreme few-shot diagnostic scenario, exactly 50 real time-frequency images per category were randomly selected from the independently segmented training pool to construct the training subsets. The specific data split configurations are detailed in Table 3. These limited samples were utilized to train both the generative models and the baseline diagnostic classifiers. For the test set, 100 completely unseen images per category were independently generated from the reserved testing chunks via Continuous Wavelet Transform. This protocol guarantees that the independent testing subset shares no data points with the training subsets, thereby ensuring a fair and objective final performance evaluation.
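The split-then-segment order described above can be illustrated with a minimal sketch. Only the 4096-point window comes from the text; the function names, the 75/25 split fraction, the stride of 512, and the toy signal are assumptions chosen for illustration, and the sketch uses only the Python standard library rather than the authors' actual pipeline.

```python
def chronological_split(signal, train_fraction):
    """Cut one signal into an earlier training chunk and a later testing chunk."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must lie strictly between 0 and 1")
    cut = int(len(signal) * train_fraction)
    return signal[:cut], signal[cut:]


def sliding_windows(chunk, window, stride):
    """Return every full-length window that fits inside a single chunk.

    Windows overlap whenever stride < window, but they never cross the
    chunk boundary because the last start index is len(chunk) - window.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    last_start = len(chunk) - window
    return [chunk[s:s + window] for s in range(0, last_start + 1, stride)]


# Toy signal whose sample values equal their time indices, which makes
# leakage easy to check. Length, split fraction, and stride are assumed.
signal = list(range(20_000))
train_chunk, test_chunk = chronological_split(signal, train_fraction=0.75)

train_windows = sliding_windows(train_chunk, window=4096, stride=512)
test_windows = sliding_windows(test_chunk, window=4096, stride=512)

# Collect every time index that appears in each subset.
train_points = {x for w in train_windows for x in w}
test_points = {x for w in test_windows for x in w}
print(len(train_windows), len(test_windows), train_points.isdisjoint(test_points))
```

Because the cut is made before any window is drawn, overlapping windows can only share points with other windows from the same chunk. Reversing the order (windowing first, then splitting the windows at random) would not give this guarantee.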
3.1.2. Comparison of Fault Diagnosis Accuracy
To account for random initialization and ensure the reliability of the evaluation, all diagnostic experiments in this study were repeated 10 times using different random seeds. The classification performance is reported as the mean accuracy alongside the standard deviation.
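As a small illustration of this reporting convention, the sketch below aggregates per-seed accuracies into a mean and standard deviation. The text does not state whether the sample or population standard deviation is used; the sample form (n - 1 denominator) is assumed here, and the three input values are toy numbers, not results from this study.

```python
import statistics


def summarize_runs(accuracies):
    """Return (mean, sample standard deviation) of per-seed accuracies."""
    if len(accuracies) < 2:
        raise ValueError("at least two runs are needed for a standard deviation")
    mean = statistics.mean(accuracies)
    std = statistics.stdev(accuracies)  # assumption: n - 1 denominator
    return mean, std


def format_result(accuracies, digits=1):
    """Format the summary in the 'mean ± std%' style used in the tables."""
    mean, std = summarize_runs(accuracies)
    return f"{mean:.{digits}f} ± {std:.{digits}f}%"


# Toy accuracies (in percent) standing in for repeated runs.
toy_runs = [90.0, 92.0, 94.0]
print(format_result(toy_runs))
```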
To evaluate the proposed PerDCGAN in small-sample fault diagnosis, comparative experiments were conducted against a standard DCGAN and a Wasserstein GAN with Gradient Penalty (WGAN-GP). A baseline dataset was first constructed using only 50 real samples for each health condition to simulate data scarcity. Synthetic time-frequency images generated by DCGAN, WGAN-GP, and PerDCGAN were then incrementally added to this baseline, with the augmentation size ranging from 0 to 30 samples per category. A separate, non-overlapping test set containing 100 images per category was used for evaluation. This setup assesses how synthetic samples from different generative models affect the classifier’s performance under limited data conditions.
As shown in
Table 4, augmenting the training set with 30 synthetic samples from PerDCGAN increases the mean diagnostic accuracy from the baseline of 93.0 ± 0.5% to 96.0 ± 0.2%. Under the identical experimental setup, adding 30 samples from the standard DCGAN results in an accuracy of 94.8 ± 0.6%, while the WGAN-GP baseline reaches 95.4 ± 0.3%. Furthermore, as the augmentation scale increases, the standard deviation of PerDCGAN progressively decreases to ±0.2%, indicating higher stability across multiple runs compared to the other baselines.
These quantitative results reflect the limitations of conventional generative objectives in mechanical signal processing. Standard DCGAN relies on pixel-level optimization, which tends to produce blurred spectrograms with limited discriminative information. Although WGAN-GP improves upon DCGAN by optimizing the global data distribution, it lacks explicit constraints on local micro-textures. By integrating perceptual constraints, PerDCGAN preserves the complex time-frequency textures required for fault classification, effectively providing the diagnostic model with more reliable physical features under data-scarce conditions.
Figure 6 details the per-class diagnostic accuracy and common confusions across different augmentation scales. A notable performance improvement is observed for the ball element (BE) faults. Using the proposed PerDCGAN, the identification accuracy for BE faults increases from 92% at the baseline (0 added samples) to 99% with 30 augmented samples. In comparison, standard DCGAN and WGAN-GP improve the BE fault accuracy to 94% and 96%, respectively. However, the confusion matrices also indicate that misclassifications between inner and outer race faults remain a challenge across all three models. For example, at the maximum augmentation scale, PerDCGAN still misclassifies 7 outer race samples as inner race faults. This confusion can likely be attributed to the physical similarity of the resonant frequency bands excited by these two defects.
These class-level results reflect the underlying generation mechanisms of the evaluated models. The pixel-level optimization of the standard DCGAN tends to smooth out the weak, high-frequency transient impacts characteristic of ball element (BE) faults, and WGAN-GP, despite its better global distribution alignment, lacks explicit constraints on local micro-textures. Consequently, the synthetic features generated by these baselines struggle to completely separate BE faults from other classes in the feature space. By incorporating texture consistency through the Gram matrix, PerDCGAN helps reconstruct these high-frequency impact boundaries. This feature-level regularization alleviates the spectral blurring typically observed in conventional generative models, supplying the classifier with supplementary physical signatures that assist in delineating decision boundaries under data-scarce conditions.
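To make the Gram-matrix idea concrete, the following sketch computes channel-wise feature correlations and a squared-difference texture penalty between real and generated feature maps. It is an illustrative assumption rather than the paper's exact formulation: the feature maps are given as plain lists of flattened channels, the 1/(C·N) normalization and the mean-squared comparison are common choices assumed here, and which network layer supplies the features is not specified in this section.

```python
def gram_matrix(features):
    """Channel-by-channel correlation matrix of one feature map.

    `features` is a list of C channels, each a flat list of N activations.
    Entry (i, j) is the normalized inner product of channels i and j.
    """
    c = len(features)
    n = len(features[0])
    if any(len(channel) != n for channel in features):
        raise ValueError("all channels must have the same length")
    scale = 1.0 / (c * n)  # assumed normalization
    return [
        [scale * sum(a * b for a, b in zip(fi, fj)) for fj in features]
        for fi in features
    ]


def gram_loss(real_features, fake_features):
    """Mean squared difference between the two Gram matrices."""
    g_real = gram_matrix(real_features)
    g_fake = gram_matrix(fake_features)
    c = len(g_real)
    if len(g_fake) != c:
        raise ValueError("feature maps must have the same number of channels")
    total = sum(
        (g_real[i][j] - g_fake[i][j]) ** 2 for i in range(c) for j in range(c)
    )
    return total / (c * c)


# Toy 2-channel feature maps with 2 spatial positions each.
real = [[1.0, 0.0], [0.0, 1.0]]
shifted = [[0.0, 1.0], [1.0, 0.0]]   # same activations at swapped positions
blurred = [[1.0, 1.0], [0.0, 0.0]]   # different channel statistics
print(gram_loss(real, shifted), gram_loss(real, blurred))
```

The Gram matrix discards spatial position and keeps only how channels co-activate, which is why the swapped example incurs no penalty while the one with altered channel statistics does. This is the sense in which such a term constrains texture rather than exact pixel placement.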
The training and test sets are the same as those detailed in
Table 3.
To rigorously assess whether the synthetic time-frequency representations capture the essential discriminative features of bearing faults—rather than merely mimicking surface-level visuals—a data substitution experiment was conducted. In this setup, subsets of the original real training samples were randomly replaced by an equivalent number of PerDCGAN-generated images, with substitution ratios set at 0%, 7.5%, 15%, and 30%. If the generated samples lacked critical fault patterns, the classifier’s performance would inevitably degrade as real data was removed.
However, the results presented in Table 5 show that this degradation did not occur. Substituting real data with synthetic samples did not compromise diagnostic accuracy; instead, the model achieved a peak accuracy of 93.5% at a 7.5% substitution rate, slightly above the baseline without substitution. Even at a 30% substitution rate, the accuracy remained at 93.3%. These findings indicate that PerDCGAN produces high-fidelity samples that capture the underlying fault characteristics.
This slight improvement may reflect a noise-filtering effect of PerDCGAN. Raw vibration data often contain random environmental noise, which can cause overfitting in small-sample scenarios. Guided by the perceptual loss, PerDCGAN tends to reconstruct consistent periodic impact features while discarding random background noise, effectively generating “purified” fault prototypes. Substituting a portion of the real data with these synthetic samples therefore acts as a data-level regularization: it reduces irregular noise in the training set and helps the classifier focus on essential diagnostic signatures, slightly improving generalization on the test set.
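The substitution protocol can be sketched as follows. This is a hedged illustration, not the authors' code: the function name, the rounding of the replacement count, the fixed seed, and the pool of 200 tagged placeholder samples are all assumptions, and the text does not state whether the ratio is applied per class or over the whole training set.

```python
import random


def substitute(real_samples, synthetic_samples, ratio, seed=0):
    """Replace a fraction of real samples with synthetic ones.

    The training-set size stays constant: `ratio` of the real samples are
    dropped at random and the same number of synthetic samples is added.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    n_replace = round(len(real_samples) * ratio)  # assumed rounding rule
    if n_replace > len(synthetic_samples):
        raise ValueError("not enough synthetic samples for this ratio")
    rng = random.Random(seed)
    dropped = set(rng.sample(range(len(real_samples)), n_replace))
    kept = [s for i, s in enumerate(real_samples) if i not in dropped]
    injected = rng.sample(synthetic_samples, n_replace)
    return kept + injected


# Placeholder samples tagged by origin; sizes are hypothetical.
real_pool = [("real", i) for i in range(200)]
synthetic_pool = [("synthetic", i) for i in range(60)]

for ratio in (0.0, 0.075, 0.15, 0.30):
    mixed = substitute(real_pool, synthetic_pool, ratio)
    n_synthetic = sum(1 for origin, _ in mixed if origin == "synthetic")
    print(ratio, len(mixed), n_synthetic)
```

Keeping the total size fixed is what distinguishes this experiment from the augmentation setting: any change in accuracy then reflects the quality of the substituted samples rather than a larger training set.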
To evaluate model robustness against realistic industrial interference, Additive White Gaussian Noise (AWGN) was injected directly into the raw 1D vibration signals before CWT processing. As a representative example,
Figure 7 provides a visual comparison before and after 0 dB noise injection. The intense random noise completely submerges the periodic transient impacts in the 1D waveforms, thereby obscuring the discriminative high-frequency vertical stripes in the corresponding 2D spectrograms.
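The noise injection step can be sketched as below, assuming the SNR is defined from the mean power of the clean signal, which is the usual convention but is not spelled out in the text. The function names, the fixed seed, and the toy sinusoid are illustrative assumptions; the sketch uses only the Python standard library.

```python
import math
import random


def signal_power(signal):
    """Mean squared amplitude of a 1D signal."""
    return sum(x * x for x in signal) / len(signal)


def add_awgn(signal, snr_db, seed=None):
    """Add white Gaussian noise so that the result has the target SNR in dB."""
    power = signal_power(signal)
    if power == 0.0:
        raise ValueError("cannot define an SNR for an all-zero signal")
    noise_power = power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    rng = random.Random(seed)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def measured_snr_db(clean, noisy):
    """Empirical SNR of a noisy signal relative to its clean version."""
    noise = [n - c for c, n in zip(clean, noisy)]
    return 10.0 * math.log10(signal_power(clean) / signal_power(noise))


# Toy sinusoid standing in for a raw vibration record.
clean = [math.sin(2.0 * math.pi * 0.01 * n) for n in range(20_000)]
noisy = add_awgn(clean, snr_db=0.0, seed=0)
print(round(measured_snr_db(clean, noisy), 2))
```

At 0 dB the noise carries as much power as the signal itself, which is consistent with the transient impacts being submerged in the waveform. Injecting the noise before the CWT, as described above, means the spectrogram is computed from the corrupted signal rather than being degraded afterwards.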
This visual degradation highlights the extreme difficulty of extracting fault signatures under 0 dB conditions. To quantitatively assess and compare diagnostic robustness, the identical dataset splitting strategy was maintained, augmenting the baseline with 30 synthetic samples per class. Classifiers trained with data from standard DCGAN, WGAN-GP, and the proposed PerDCGAN were then evaluated on a dedicated 0 dB noisy test set. The comparative classification accuracies under this adverse environment are detailed in
Table 6.
To rigorously validate the individual contributions of the proposed objective function, particularly the introduction of the perceptual and pixel-level regularizations, a comprehensive ablation study was conducted. While the core innovation of PerDCGAN lies in decoupling high-frequency physical impact features from severe background noise, it is methodologically essential to isolate and quantify the impact of each regularization term in a controlled environment. Therefore, this ablation experiment was conducted on the standard, noise-free CWRU dataset to eliminate the confounding variables introduced by extreme external interference. The generative performance was evaluated across four distinct loss configurations: Adv (baseline), Adv + L1 (Variant A), Adv + Per (Variant B), and Adv + Per + L1 (the full PerDCGAN).
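The four configurations differ only in which regularization terms are added to the adversarial loss, which the sketch below makes explicit. It assumes a simple weighted sum with hypothetical weights `lambda_per` and `lambda_l1`; the actual weighting used in this work is not given in this section, and the scalar loss values passed in are placeholders.

```python
# Which regularization terms are active in each ablation configuration.
ABLATION_CONFIGS = {
    "Adv": {"per": False, "l1": False},
    "Adv + L1": {"per": False, "l1": True},
    "Adv + Per": {"per": True, "l1": False},
    "Adv + Per + L1": {"per": True, "l1": True},
}


def l1_loss(real, fake):
    """Mean absolute pixel difference between two flattened images."""
    if len(real) != len(fake):
        raise ValueError("images must have the same number of pixels")
    return sum(abs(r - f) for r, f in zip(real, fake)) / len(real)


def generator_loss(adv, per, l1, config, lambda_per=1.0, lambda_l1=1.0):
    """Combine precomputed loss terms according to one ablation configuration.

    `adv`, `per`, and `l1` are scalar loss values; the lambda weights are
    hypothetical defaults, not values reported in the paper.
    """
    flags = ABLATION_CONFIGS[config]
    total = adv
    if flags["per"]:
        total += lambda_per * per
    if flags["l1"]:
        total += lambda_l1 * l1
    return total


# Placeholder loss values showing how the total changes per configuration.
for name in ABLATION_CONFIGS:
    print(name, generator_loss(adv=1.0, per=2.0, l1=3.0, config=name))
```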
To guarantee a fair comparison, the identical dataset partitioning strategy (train-test split) utilized in the previous experiments was maintained. A strict small-sample augmentation protocol was implemented: for each of the four configurations, the corresponding trained generator was utilized to synthesize exactly 30 augmented samples per fault category. These synthetic samples were subsequently integrated into the identical limited training set to train the downstream classifier. The ultimate efficacy of each loss component is quantitatively evaluated by the diagnostic accuracy on a unified, unseen test set.
As detailed in Table 7, the full PerDCGAN configuration achieves the highest diagnostic accuracy of 96.60% with a standard deviation of only ±0.45%. This result supports the synergistic design of the joint objective function. The baseline model driven solely by the adversarial loss yields a lower accuracy of 92.50% and a larger standard deviation of ±1.20%. Integrating the individual regularizations provides measurable improvements. Variant A, which adds the L1 loss, increases the accuracy to 94.80%, primarily by suppressing low-frequency background artifacts. Variant B, which incorporates the perceptual loss, reaches 95.10% by better aligning the deep semantic features of high-frequency transient impacts. The superior performance of the combined model suggests that the L1 and perceptual losses are complementary rather than redundant: the L1 penalty establishes a clean background, allowing the perceptual loss to focus on reconstructing the precise structural geometry of the fault signatures, which in turn improves downstream diagnostic accuracy.
3.1.3. Analysis of Experimental Results
To evaluate the training stability and convergence quality, we monitored the evolution of generated samples throughout the training process.
Figure 8 visualizes the comparative progression of the standard DCGAN and the proposed PerDCGAN at epochs 100 and 500.
As illustrated above, during the early training phase, both models exhibit under-fitting characteristics. The generated spectrograms appear noisy and blurred, lacking distinct time-frequency textures required to identify specific fault categories.
However, a significant divergence in performance is observed by epoch 500. The proposed model achieves stable convergence, producing high-fidelity images with sharp fault patterns. In contrast, the standard DCGAN suffers from training instability and mode collapse, generating blurred, repetitive features that fail to capture the high-frequency details of the real data. These observations indicate that the proposed enhancement strategies effectively mitigate the instability issues inherent in standard GANs, preserving both the visual quality and the diversity of the generated fault samples.
To qualitatively assess class separability and the alignment between real and generated samples, t-SNE was employed to project the extracted features into a two-dimensional space. As illustrated in Figure 9, the feature space exhibits distinct clusters corresponding to the different fault categories. As a qualitative tool for visualizing local similarities, t-SNE reveals that the generated samples consistently project into the same local regions as their corresponding real samples. This spatial alignment indicates that the synthetic data capture essential class-specific structural features. Consequently, the augmented samples provide the classifier with distributionally consistent representations, corroborating the diagnostic improvements observed in the quantitative experiments.