To evaluate the performance of the proposed method, this section presents two case studies: the Southeast University gearbox dataset and a port crane gearbox dataset. All deep learning (DL) models were trained using PyTorch (version 2.7.0+cu128) in a Python (version 3.12.9) environment on a computer equipped with an RTX 4090 GPU and 64 GB of memory. For clarity and consistency, we first give a unified description of the evaluation criteria for data generation quality, and then validate and analyze the proposed method through the two case studies.
4.1. Data Generation Quality Assessment Criteria
To comprehensively evaluate the quality of generated data, this paper assesses the similarity between generated and original data from two perspectives: quantitative statistical analysis and qualitative feature visualization. Through multi-dimensional evaluation criteria, the performance of the generative model in terms of distributional consistency and feature preservation can be more systematically characterized.
(1) Quantitative Evaluation Indicators
This paper employs the Maximum Mean Discrepancy (MMD) metric to quantify the quality of generated samples. MMD is a statistical measure of the difference between two probability distributions, so it allows generated data to be evaluated by comparing its distribution with that of the real data. Its functional form is:
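$$\mathrm{MMD}^2_{\mathcal{F}}(P, Q) = \frac{1}{m^2}\sum_{i=1}^{m}\sum_{j=1}^{m} k(x_i, x_j) - \frac{2}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n} k(x_i, y_j) + \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} k(y_i, y_j) \quad (7)$$
In Equation (7), $\mathcal{F}$ represents the function space associated with the generalized Gaussian kernel, $P$ and $Q$ denote the distributions of genuine fault data and generated fault samples respectively, $m$ and $n$ represent the sample sizes of the two distributions, $x_i$ and $y_j$ denote sample points from $P$ and $Q$ respectively, and $k$ represents the Gaussian kernel function. Multiple experiments were conducted by randomly sampling equal numbers of real and generated fault samples from each dataset.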
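As an illustration of how such a score can be computed, the following is a minimal sketch of the biased empirical estimate of squared MMD with a Gaussian kernel. The function names, the plain-list data layout, and the bandwidth `sigma` are assumptions made for this example, not settings reported in this paper.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel between two equal-length feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd_squared(real, generated, sigma=1.0):
    """Biased empirical estimate of squared MMD between two sample sets.

    `real` and `generated` are lists of feature vectors; `sigma` is an
    assumed kernel bandwidth, not a value taken from the paper.
    """
    m, n = len(real), len(generated)
    k_xx = sum(gaussian_kernel(a, b, sigma) for a in real for b in real) / (m * m)
    k_yy = sum(gaussian_kernel(a, b, sigma) for a in generated for b in generated) / (n * n)
    k_xy = sum(gaussian_kernel(a, b, sigma) for a in real for b in generated) / (m * n)
    return k_xx + k_yy - 2.0 * k_xy
```

A value near zero means the two sample sets are hard to tell apart under the chosen kernel. The result depends on the bandwidth, so scores are only comparable when the same `sigma` is used throughout.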
(2) Qualitative Feature Visualization Methods
To further visually assess the similarity between generated samples and real samples in the feature space, we employed dimensionality reduction visualization techniques such as t-SNE and PCA. t-SNE is a nonlinear dimensionality reduction method that maps high-dimensional data onto a two-dimensional plane, preserving the relative positions of similar samples in the low-dimensional plot. We first extracted high-dimensional features from both real fault signals and generated signals, then performed t-SNE and PCA projections, respectively. The distributions of both sample types were plotted within the same coordinate system.
4.2. Case 1: Southeast University Dataset
The gearbox fault data used in this study originates from Southeast University’s transmission system dynamic test bench based on the Drivetrain Dynamic Simulator (DDS) [30], as shown in Figure 6. This test bench acquires bearing and gear signals under two operating conditions (20 Hz–0 V and 30 Hz–2 V). Bearing signals encompass vibration data for one healthy condition and four fault states (ball fault, inner ring fault, outer ring fault, compound fault). Gear signals include one healthy condition and four fault states (defective fault, tooth breakage fault, root crack fault, surface wear fault). Each data type comprises 8 channel signals: Channel 1 is the motor vibration signal; Channels 2–4 represent the X, Y, and Z-axis vibrations of the planetary gearbox; Channel 5 is the motor output torque; Channels 6–8 are the three-axis vibrations of the parallel gearbox. The data sampling frequency is 12 kHz.
The experimental signals selected for this study are X-direction vibration signals from the parallel gearbox. Detailed information on the dataset partitioning is shown in Table 2. The dataset comprises 10 categories (labels 0–9), with bearing data labeled 0–4 and gear data labeled 5–9. Labels 0 and 5 represent the “normal state,” with 800 training samples and 200 test samples per category. Each of the remaining fault categories (ball fault, inner ring fault, outer ring fault, compound fault, defective fault, tooth breakage fault, root crack fault, and surface wear fault) has 40 training samples and 200 test samples. To simulate data imbalance, an imbalance factor is defined as the ratio of the number of normal samples to the number of samples per fault type; it quantifies the severity of the imbalance. This study employs an imbalance factor of 20:1.
In this study, both bearing and gear vibration signals were segmented into independent samples of 1024 points each without overlap. All samples were divided proportionally into training and test sets, yielding 960 training samples and 1000 test samples for each of the bearing and gear tasks. Only the fault-type samples within the training set were used to train the conditional Wasserstein generative adversarial network with gradient penalty (CWGAN-GP). This approach enriches the features available for minority faults, mitigates data imbalance, and provides diverse training inputs for subsequent diagnostic model construction.
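The non-overlapping segmentation described above can be sketched as follows. The function name and the choice to drop an incomplete trailing window are assumptions for illustration; the paper does not state how leftover points are handled.

```python
def segment_signal(signal, window=1024):
    """Split a 1-D signal into consecutive non-overlapping windows.

    Any trailing points that do not fill a complete window are dropped
    (an assumption made for this sketch).
    """
    n_segments = len(signal) // window
    return [signal[i * window:(i + 1) * window] for i in range(n_segments)]
```

For example, a record of 2500 points yields two samples of 1024 points each, and the remaining 452 points are discarded.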
(1) Analysis of Data Generation Quality Assessment Results
The CWGAN-GP model is trained to generate signals that resemble the fault signals of the Southeast University gearbox, thereby mitigating data imbalance.
Figure 7 illustrates the trend of loss values for the generator and discriminator in the CWGAN-GP model across training iterations. This loss function is constructed based on the Wasserstein distance, incorporating a gradient penalty (GP) term to constrain the discriminator gradient. This approach addresses challenges in GAN training, such as gradient vanishing/explosion and mode collapse. The figure reveals significant loss fluctuations during early training, gradually converging toward near-zero values as iterations progress. This demonstrates the generator and discriminator progressively achieving equilibrium through adversarial learning, validating CWGAN-GP’s optimization of training stability for complex generation tasks. The CWGAN-GP model completed training after 5000 iterations. A certain number of fault signals were generated using the trained generator to achieve data balance. To assess whether the generated fault data meet quality standards, an evaluation was conducted from two perspectives: statistics and visualizations.
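For reference, the Wasserstein loss with gradient penalty mentioned above is commonly written in the following form. This is the standard WGAN-GP critic objective extended with a condition label, given here as a sketch only; the symbols ($D$, $P_r$, $P_g$, $\hat{x}$, $\lambda$, $c$) are assumed notation and may differ from the definitions used elsewhere in this paper.

```latex
L_D = \mathbb{E}_{\tilde{x} \sim P_g}\left[ D(\tilde{x} \mid c) \right]
    - \mathbb{E}_{x \sim P_r}\left[ D(x \mid c) \right]
    + \lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\left[ \left( \lVert \nabla_{\hat{x}} D(\hat{x} \mid c) \rVert_2 - 1 \right)^2 \right],
\qquad
\hat{x} = \epsilon x + (1 - \epsilon)\tilde{x}, \quad \epsilon \sim U[0, 1]
```

Here $x$ is a real sample, $\tilde{x}$ a generated sample, and $\hat{x}$ a random interpolation between them. The last term penalizes critic gradients whose norm deviates from 1, which is how the gradient constraint is enforced.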
As shown in Figure 8, the MMD values are low, indicating that the distribution of the generated data is close to that of the original data. This suggests that the generative model captures the main features of the original data distribution.
The visualization results for real and generated data of bearings and gears are shown in Figure 9. Real and generated samples cluster closely and overlap in the reduced-dimensional feature space, with the point clouds of both datasets primarily intertwined. This indicates that the generated samples successfully capture the distributional characteristics of the real data. The substantial overlapping region between the two sets in the reduced-dimension plot indicates that the generative model captures the intrinsic structure of real samples at the feature level. Such qualitative visual analysis provides intuitive evidence for evaluating the model’s generative performance, validating the consistency of distribution between generated and real data across multidimensional feature spaces.
(2) Results and Analysis
1. Data Imbalanced Grouping Design
To systematically validate the robustness and generalization performance of the proposed method under varying degrees of data imbalance, this study designed five gearbox fault datasets (A–E) as shown in Table 3. Within each dataset, the number of real vibration samples for each fault type was fixed at 40, while CWGAN-GP generated 0, 40, 120, 360, and 760 additional samples to form five training subsets with 40, 80, 160, 400, and 800 samples per fault type, respectively. With 800 normal samples per task, the corresponding imbalance ratios decrease from 20:1 through 10:1, 5:1, and 2:1 to 1:1. Additionally, 200 samples from each category’s original data were allocated as the test set. This design makes it possible to evaluate how well the generated samples compensate for under-represented minority fault features, and to compare diagnostic accuracy under varying training set sizes and balancing conditions. During training, the AdamW optimizer was employed with a batch size of 64 and a learning rate of 0.0001.
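The relation between the target imbalance ratio and the number of synthetic samples needed per fault type can be checked with a short calculation. The sketch below assumes 800 normal samples and 40 real samples per fault type, as stated for this dataset; the function and parameter names are hypothetical.

```python
def augmentation_plan(n_normal=800, n_real_fault=40, target_ratios=(20, 10, 5, 2, 1)):
    """Return (ratio, generated, total_fault) for each target imbalance ratio.

    total_fault is the per-fault-type training size that gives
    n_normal : total_fault == ratio : 1, and generated is the number of
    synthetic samples needed on top of the real ones.
    """
    plan = []
    for ratio in target_ratios:
        total_fault = n_normal // ratio
        generated = total_fault - n_real_fault
        plan.append((ratio, generated, total_fault))
    return plan


if __name__ == "__main__":
    for ratio, generated, total in augmentation_plan():
        print(f"{ratio}:1 -> generate {generated}, total per fault type {total}")
```

The per-fault-type totals of 40, 80, 160, 400, and 800 correspond to ratios of 20:1, 10:1, 5:1, 2:1, and 1:1 against the 800 normal samples.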
In cases of sample imbalance, the improvement in classification performance achieved through sample generation is more pronounced. As shown in Figure 10, experimental results indicate that data augmentation of minority faults using CWGAN-GP significantly enhances the overall accuracy of fault diagnosis. As the number of generated samples gradually increases, the model’s classification performance continues to improve: without introduced synthetic data, the diagnostic accuracy for both bearing and gear tasks remained around 80%. With progressively more synthetic samples, the model’s accuracy significantly increased and stabilized at approximately 97.5% once the sample size reached a certain scale. These results demonstrate that the high-quality synthetic samples generated by CWGAN-GP effectively mitigate the scarcity of minority class samples, enhance the model’s learning capability for critical fault features, and thereby significantly improve diagnostic accuracy and stability under imbalanced conditions.
After reducing the data imbalance ratio from 20:1 to 1:1, the confusion matrices for the bearing and gear tasks are shown in Figure 11, where Figure 11a and Figure 11b correspond to the diagnostic results for bearing and gear faults, respectively. Despite high overall recognition accuracy for both tasks, the model still has difficulty identifying specific fault types. In the bearing task, compound faults and outer ring faults were occasionally misclassified as normal conditions or other single fault types. This indicates that under conditions of multi-source coupling or less pronounced impact features, the distinguishing characteristics of highly complex faults may be affected by background noise or feature overlap. In the gear task, misclassifications primarily occurred among missing tooth faults, broken tooth faults, and root crack faults. This indicates that gear damage of varying degrees or forms exhibits certain similarities in time-frequency domain features. The above analysis indicates that while multi-channel feature fusion and synthetic data generation significantly enhance overall diagnostic performance, challenges persist in distinguishing faults with similar mechanisms or adjacent damage levels. This highlights potential areas for model improvement in practical engineering applications.
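To make the link between a confusion matrix and the reported accuracies concrete, the following minimal sketch builds a confusion matrix from integer labels and derives overall accuracy and per-class recall. The function names and the toy labels in the usage note are illustrative only and are not taken from the experiments.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Rows index the true class, columns index the predicted class."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def overall_accuracy(matrix):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


def per_class_recall(matrix):
    """Recall of each class: diagonal entry divided by its row sum."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]
```

With true labels `[0, 0, 1, 1, 2, 2]` and predictions `[0, 1, 1, 1, 2, 0]`, class 1 is always recognized, while classes 0 and 2 each lose one sample to another class. Off-diagonal entries of this kind are what the misclassification analysis above refers to.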
2. Ablation Experiment
To validate the effectiveness and necessity of the proposed multi-task learning model in bearing and gear fault diagnosis, multiple sets of ablation experiments were designed, with results shown in Table 4. First, single-task models trained exclusively for each component achieved diagnostic accuracies of 90.2% and 91.5%, respectively, demonstrating the model’s basic fault recognition capability under single-task conditions. In contrast, the multi-task learning model, which employs shared feature representations and simultaneously optimizes both bearing and gear tasks, achieved significant improvements in both tasks, with accuracies reaching 97.5%. This demonstrates that sharing underlying features and introducing cross-task information exchange effectively enhances the model’s ability to extract key fault features and improves its generalization performance.
Furthermore, to analyze the roles of different feature extraction channels and the attention fusion module, control models were constructed by removing specific components. After removing the CNN channel, the accuracy rates for bearings and gears decreased to 94.1% and 93.6%, respectively. After removing the BiLSTM channel, the accuracy rates for the two tasks were 95.8% and 95.7%, respectively. When the attention fusion module was disabled, performance similarly declined, with accuracies dropping to 95.1% and 95.8%, respectively. These results demonstrate that each channel and the fusion module play crucial roles in feature extraction and task coordination. The multi-task joint optimization mechanism effectively enhances overall fault identification performance by sharing feature representation spaces, validating the rationality and superiority of the proposed method in multi-component coupled fault diagnosis scenarios.
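The contribution of each component can be read off as the accuracy drop, in percentage points, relative to the full model. The short sketch below tabulates these drops from the accuracies quoted in the text; the dictionary layout and names are assumptions for illustration, and the single-task pair is assumed to be ordered as bearing, then gear.

```python
# Accuracies (%) quoted in the text for Case 1.
FULL_MODEL = {"bearing": 97.5, "gear": 97.5}
VARIANTS = {
    "single-task": {"bearing": 90.2, "gear": 91.5},
    "without CNN channel": {"bearing": 94.1, "gear": 93.6},
    "without BiLSTM channel": {"bearing": 95.8, "gear": 95.7},
    "without attention fusion": {"bearing": 95.1, "gear": 95.8},
}


def accuracy_drops(full, variants):
    """Percentage-point drop of each variant relative to the full model."""
    return {
        name: {task: round(full[task] - acc[task], 1) for task in full}
        for name, acc in variants.items()
    }


if __name__ == "__main__":
    for name, drop in accuracy_drops(FULL_MODEL, VARIANTS).items():
        print(f"{name}: bearing -{drop['bearing']} pp, gear -{drop['gear']} pp")
```

Among the three component removals, dropping the CNN channel costs the most on both tasks, while the single-task setting shows the largest gap overall.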
3. Comparison with Other Methods
To validate the effectiveness and feasibility of the proposed network model, this paper compares CWGAN-GP-MTL with four state-of-the-art models: MT-1DCNN [31], RI-MPCNN [32], MTCASN [24], and MSCNN [33]. The training and testing strategies for each network are identical across all evaluations. As shown in Table 5, the proposed CWGAN-GP-MTL model achieves high classification accuracy in both bearing and gear fault diagnosis tasks, significantly outperforming other comparison models. Compared to MT-1DCNN (which utilizes single-domain features), RI-MPCNN and MSCNN (which employ multi-scale convolutional structures), and MTCASN (which incorporates channel attention), CWGAN-GP-MTL effectively enhances the model’s ability to recognize complex coupled fault features by integrating time-frequency features extracted by CNN with time-domain features extracted by BiLSTM, and by utilizing an attention mechanism to achieve adaptive feature weighting. This demonstrates stronger generalization capabilities and task synergy advantages.
4.3. Case 2: Portal Crane Gearbox Dataset
The gearbox fault dataset used in this study originates from a portal crane gearbox at a port in Shandong Province, as shown in Figure 12. The signals were collected from the hoisting mechanism of the portal crane. During operation, bearing and gear data were captured at a frequency of 5000 Hz. Bearing data comprised four categories: normal condition data, rolling element failure data, inner ring failure data, and outer ring failure data. Gear data also included four categories: normal condition data, gear tooth surface wear, gear tooth cracks, and abnormal gear meshing. The majority of collected data represents normal conditions, with only a small portion indicating fault types, aligning with fault diagnosis under data imbalance scenarios. Detailed information on the training and testing set division is shown in Table 6. Bearing data labels range from 0 to 3, while gear data labels range from 4 to 7.
In the actual operating environment of port gantry cranes, the selection of sensors must comprehensively consider environmental adaptability and engineering feasibility. Temperature signals are susceptible to interference from changes in ambient temperature (such as diurnal temperature variations and the influence of sea breezes), while acoustic emission signals are easily affected by background noise in complex operational environments. Furthermore, torque measurement and multi-sensor fusion solutions typically entail high deployment and maintenance costs. In contrast, vibration signals can directly reflect gear meshing conditions and localized bearing damage characteristics, offering advantages such as fast response times, high sensitivity to early-stage failures, and ease of acquisition. Therefore, this paper selects vibration signals as the primary research focus to better align with practical engineering application requirements.
For the above datasets, both the bearing and gear datasets were segmented into samples of 1024 points each using a non-overlapping approach. The training and test sets were divided according to the format specified in the aforementioned table. Data augmentation was performed using the CWGAN-GP network to enhance diagnostic accuracy for rare fault data.
(1) Analysis of Data Generation Quality Assessment Results
Figure 13 illustrates the trend of loss values for the generator and discriminator in the CWGAN-GP model as training iterations progress. During the initial training phase, both curves exhibit significant fluctuations. Subsequently, they gradually stabilize with iterations and oscillate around the zero range, indicating that the adversarial process has reached a relative equilibrium. After completing 5000 training iterations, the model generates fault signals using the trained generator to achieve data balance.
As shown in Figure 14, the MMD values for bearing samples exhibit a gradual downward trend across categories, indicating that the generator fits the original data distribution progressively better for the later bearing categories. Meanwhile, the MMD values for gear samples reached their lowest point in category 2 and rebounded in category 3. This variation may be related to differences in complexity or sample size across categories, suggesting that the generator still has room for improvement in fitting certain gear categories. Overall, lower MMD values indicate that the generated data more closely approximates the global distribution statistics of the real data, demonstrating the model’s ability to capture several distributional features of the original data.
After extracting the same high-dimensional features from both real fault signals and generated signals, t-SNE and PCA projections were performed, as shown in Figure 15. The reduced-dimensional point clouds exhibit high overlap and intertwined distributions in the feature space, indicating that the generated samples effectively match the primary distribution characteristics of the real samples. This demonstrates a high degree of consistency at the feature level.
(2) Results and Analysis
1. Data Imbalanced Grouping Design
This study constructed five training subsets (denoted as A–E) based on the gearbox fault data of gantry cranes, as configured in Table 7. Specifically, the number of genuine vibration samples for each fault category was fixed at 40. CWGAN-GP was then employed to synthesize 0, 40, 120, 360, and 760 additional samples, respectively, yielding five training subsets with sample sizes of 40, 80, 160, 400, and 800 per category. Training employs the AdamW optimizer with a learning rate of 0.0001 and a batch size of 64.
Experimental results indicate that as the sample imbalance among different fault categories in the gearbox gradually diminishes, the classification performance for both bearing and gear diagnostic tasks shows a continuous improvement trend, as illustrated in Figure 16. After augmenting the minority class samples using generative adversarial networks (GANs) to adjust the sample distribution from a highly imbalanced state to a balanced one, the overall diagnostic accuracy for the bearing task and gear task increased to 97.63% and 99.75%, respectively. The corresponding confusion matrix results are shown in Figure 17. Figure 17a indicates that in the bearing task, the model accurately distinguishes between normal conditions, rolling element faults, and outer ring faults. However, inner ring faults are occasionally misclassified as normal in rare instances. This may stem from the weak early-stage characteristics and subtle impact components of inner ring faults, whose vibration responses exhibit similarities to normal operating conditions in time-domain features, thereby complicating discrimination. In contrast, outer ring faults are easier for the model to capture and identify due to their fixed excitation location and distinct periodic impact characteristics. In the gear task, as shown in Figure 17b, high recognition accuracy is achieved for normal conditions, tooth surface wear, and gear tooth cracks, while a small number of gear meshing anomaly samples are misclassified as normal. This primarily stems from the fact that abnormal meshing does not always accompany obvious localized damage in certain operating conditions. Its characteristics are more often reflected in subtle changes in the overall vibration pattern, leading to some overlap with the normal state in the feature space. Overall, while the generated samples significantly enhance the model’s diagnostic performance under unbalanced conditions, challenges remain in distinguishing categories with weak feature differences or similar failure mechanisms. This highlights potential areas for improvement in practical engineering applications.
2. Ablation Experiment
The ablation results are shown in Table 8. Attention fusion combined with the multi-task model achieved the best performance, indicating that the collaboration among modules is crucial for diagnostic capability. Compared to single-task training, the baseline improved by 11.25 percentage points for bearings and 5.87 percentage points for gears, demonstrating that multi-task sharing significantly enhances cross-task complementary information and sample category generalization ability. Removing attention fusion resulted in declines of approximately 2.03 and 2.25 percentage points, respectively, indicating that attention plays a key role in feature weighting and noise suppression. Using only CNN or only BiLSTM led to a significant decrease in the ability to distinguish certain fault categories, suggesting that CNN excels at extracting time-frequency features to differentiate gear faults, while BiLSTM is better suited for capturing temporal dynamics to identify bearing faults, with the two complementing each other. Therefore, the multi-channel (CNN+BiLSTM) architecture combined with attention and multi-task learning achieves superior diagnostic performance.
3. The Impact of Classification Loss Weighting on Improving CWGAN-GP Performance
To verify the impact of the classification loss term and its weight coefficient on the performance of the improved CWGAN-GP, experiments were conducted with different values of this weight while keeping the gradient penalty coefficient fixed at 10 (a common setting in WGAN-GP). The results are shown in Table 9. When the weight is 0, the model degenerates into a CWGAN-GP without classification constraints. In this case, the MMD values for the bearing and gear datasets were 0.095 and 0.083, respectively, both relatively high, with diagnostic accuracies of 94.6% and 98.8%. This indicates that relying solely on adversarial loss is insufficient to fully constrain the class discrimination characteristics of the generated samples, and discrepancies still exist between the generated distribution and the true distribution. When a small nonzero weight below 0.5 is used (see Table 9), the MMDs decrease to 0.093 and 0.081, respectively, while the accuracies improve to 95.6% and 99.2%. Compared with a weight of 0, both distribution consistency and classification performance improve, but only slightly, indicating that the classification loss has a weak effect at this weight. When the weight is increased to 0.5, the MMDs for bearings and gears are 0.090 and 0.080, respectively, with bearings achieving the optimal value. The accuracies rise to 97.6% and 99.8%, both of which are the highest values, indicating that the classification loss and adversarial loss have reached an optimal balance at this point. When the weight is increased beyond 0.5, the MMDs rise to 0.093 and 0.082, while the accuracies drop to 95.4% and 99.4%, indicating that overly strong classification constraints weaken the model’s ability to capture the true distribution, thereby affecting both generation quality and classification performance. In summary, setting the classification loss weight appropriately is crucial; in this experiment, the model achieved the best overall performance when the weight was 0.5.
4. The Impact of Feature Fusion Strategies on Fault Diagnosis Performance
To further validate the effectiveness of the introduced attention-based feature fusion mechanism in multi-task fault diagnosis, we conducted ablation experiments comparing different feature fusion strategies while maintaining the CNN–BiLSTM dual-channel architecture and the multi-task learning framework. Specifically, we compared attention-based feature fusion, feature concatenation, and mean-based feature fusion in the bearing and gear diagnosis tasks. The experimental results are shown in Table 10. The attention-based feature fusion method achieved the highest diagnostic accuracy for both bearing and gear tasks, at 97.6% and 99.8%, respectively, outperforming both feature concatenation and mean-based feature fusion. Feature concatenation achieved a relatively high accuracy for the gear task (93.1%), but its performance dropped significantly for the bearing task (84.4%); mean-based feature fusion performed relatively well on the bearing task (91.0%), but its accuracy on the gear task (92.5%) was slightly lower than that of concatenation. This indicates that simple feature fusion methods struggle to accommodate the differing requirements of multi-channel features across different tasks. In contrast, the attention-based feature fusion mechanism can adaptively assign importance weights to features across channels based on different tasks, highlighting key discriminative information while suppressing redundant features. This enables more effective feature representation in multi-task joint diagnosis, significantly improving diagnostic performance.
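The difference between the three fusion strategies can be illustrated with a small sketch. The attention variant below is a generic softmax weighting over the two channel outputs, with a hypothetical scoring vector `score_weights` standing in for learned parameters; it is not the exact attention module used in this paper, whose details are defined in the method section.

```python
import math


def concat_fusion(cnn_feat, lstm_feat):
    """Concatenation: keeps both feature vectors side by side."""
    return list(cnn_feat) + list(lstm_feat)


def mean_fusion(cnn_feat, lstm_feat):
    """Mean fusion: element-wise average with fixed, equal weights."""
    return [(a + b) / 2.0 for a, b in zip(cnn_feat, lstm_feat)]


def attention_fusion(cnn_feat, lstm_feat, score_weights):
    """Softmax-weighted sum of the two channels.

    score_weights is a hypothetical learned vector that scores each
    channel; the softmax turns the scores into channel weights.
    """
    channels = [cnn_feat, lstm_feat]
    scores = [sum(w * f for w, f in zip(score_weights, feat)) for feat in channels]
    max_score = max(scores)  # subtract the maximum for numerical stability
    exp_scores = [math.exp(s - max_score) for s in scores]
    total = sum(exp_scores)
    alphas = [e / total for e in exp_scores]
    fused = [
        sum(alpha * feat[i] for alpha, feat in zip(alphas, channels))
        for i in range(len(cnn_feat))
    ]
    return fused, alphas
```

When both channels receive equal scores, the attention variant reduces to mean fusion. When the scores differ, the weights shift toward the higher-scoring channel, which is the adaptive behavior that fixed concatenation or averaging cannot provide.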
5. Comparison with Other Methods
As shown in Table 11, the results of the comparative experiments indicate that the CWGAN-GP-MTL method proposed in this paper significantly outperforms the comparison methods in terms of bearing, gear, and average metrics. Its average performance is approximately 10 percentage points higher than that of the best-performing MSCNN model, demonstrating a clear advantage. At the same time, an analysis from the perspective of computational efficiency reveals differences in training time among the various models. Among them, MT-1DCNN, MSCNN, and MTCASN have relatively simple structures and thus shorter training times, whereas RI-MPCNN and the method proposed in this paper have relatively longer training times due to their higher structural complexity and the introduction of additional modules. Further comparison reveals that MT-1DCNN performs the worst, indicating that single-domain convolution is insufficient for distinguishing complex coupled faults; RI-MPCNN achieves good results for bearings but performs poorly for gears, while MTCASN performs strongly for gears but weakly for bearings, suggesting that different network architectures prioritize different sub-tasks and struggle to simultaneously address the classification requirements of both fault types. In contrast, CWGAN-GP-MTL achieves high accuracy on both tasks, demonstrating that its multi-channel feature extraction, task-sharing mechanism, and fusion strategy possess stronger representational capabilities in capturing the temporal features of bearings and the time-frequency features of gears. An analysis of overall accuracy and computation time reveals that, although the proposed method incurs some computational overhead during the training phase compared to certain lightweight models, it offers significant advantages in terms of accuracy improvement, demonstrating a favorable performance-efficiency trade-off.
In addition, since the CWGAN-GP data augmentation process can be completed offline, it imposes no additional burden on subsequent model training and deployment, thereby ensuring good feasibility in practical engineering applications.
Furthermore, from an engineering implementation perspective, the two-stage diagnostic framework proposed in this paper demonstrates good feasibility at both the training and deployment levels. CWGAN-GP is only used in the offline stage for data augmentation of minority fault samples and remains fixed once training is complete. During actual online diagnostics, only the multi-task fault diagnosis network is required for deployment for forward inference, and thus does not significantly increase the real-time computational load. This approach meets the basic requirements of industrial applications for diagnostic efficiency and resource consumption.