Article

Fault Diagnosis of Portal Crane Gearboxes Based on Improved CWGAN-GP and Multi-Task Learning

Institute of Logistics Science and Engineering, Shanghai Maritime University, Shanghai 201306, China
*
Author to whom correspondence should be addressed.
Actuators 2026, 15(4), 223; https://doi.org/10.3390/act15040223
Submission received: 9 March 2026 / Revised: 12 April 2026 / Accepted: 14 April 2026 / Published: 16 April 2026
(This article belongs to the Special Issue Fault Diagnosis and Prognosis in Actuators)

Abstract

With increasing port automation and operational intensity, the gearboxes of gantry cranes widely used in bulk cargo terminals are prone to bearing and gear failures under prolonged heavy loads, intense vibrations, and complex operating conditions. Since fault samples often exhibit imbalanced distributions, this imposes two higher requirements on diagnostic methods—first, the ability to effectively address sample imbalance and, second, the capability to simultaneously identify multiple fault categories. To address these challenges, this paper proposes a joint diagnostic method integrating an improved Conditional Wasserstein Generative Adversarial Network with Gradient Penalty (CWGAN-GP) and Multi-Task Learning (MTL). First, the modified CWGAN-GP performs conditional augmentation for minority fault classes, evaluating synthetic sample authenticity and diversity through multiple metrics. Subsequently, a multi-channel diagnostic network is constructed, in which vibration signals are fed into two parallel sub-networks: time–frequency features are extracted from the Short-Time Fourier Transform (STFT)-based time–frequency representations via a residual-block Convolutional Neural Network (CNN), while temporal features are captured from the raw time-domain signal using a Bidirectional Long Short-Term Memory (Bi-LSTM) with an attention mechanism. An attention fusion layer then integrates these two feature types, enabling joint classification of bearings and gears within a multi-task learning framework. Experimental validation on public gearbox datasets and port gantry crane gearbox datasets demonstrates that this method achieves an average diagnostic accuracy exceeding 97%. The proposed method reduces the impact of class imbalance, thereby improving the accuracy and stability of multi-task fault identification.

1. Introduction

Gantry cranes serve as critical large-scale equipment in port handling, terminal operations, and industrial production, where their reliable operation is vital to productivity and safety. As the core component in the power transmission chain of gantry cranes, the gearbox converts the high-speed rotational input from the motor into a lower-speed output with greater torque, thereby enabling functions such as lifting and slewing. Cracks or wear in internal gears, or bearing damage within the gearbox, can lead to equipment shutdowns, production delays, and even safety incidents involving personnel or property [1]. However, in actual operation, most collected vibration signals correspond to normal operating conditions, while fault samples are extremely scarce and severely imbalanced in distribution. This makes it difficult for models to fully learn fault characteristics and increases the likelihood of missed detections. Moreover, practical engineering applications often require the simultaneous diagnosis of both gears and bearings. This calls for a unified diagnostic model that can effectively mitigate sample imbalance while efficiently sharing useful representations and coordinating task conflicts within a multi-task framework. Consequently, a fault diagnosis method that integrates minority-class sample augmentation with multi-objective joint identification is of significant theoretical and practical importance.
Methods for mechanical fault diagnosis can be broadly categorized into two main types: traditional data-driven methods and novel data-driven methods [2]. Among these, traditional data-driven methods primarily include signal analysis-based methods and traditional machine learning-based methods. Signal-based diagnostics represent one of the most widely applied methods for gearbox fault detection, centered on analyzing machine operating conditions through signal characteristics such as vibration, sound, temperature, and current. Widely adopted signal processing techniques include Fast Fourier Transform (FFT) [3], Wavelet Transform (WT) [4], and Empirical Mode Decomposition (EMD) [5]. FFT effectively separates frequency components, enabling the identification of specific frequency signals caused by bearings or gears and the detection of fault characteristics such as early wear or cracks. Tao H et al. [6] employed FFT to process raw vibration signals from gearboxes, treating them as graph nodes, and proposed a k-nearest neighbor (KNN) graph construction method utilizing pooling for fuzzy distance calculation. However, FFT struggles with non-stationary signals under complex operating conditions. To address this, wavelet transform and empirical mode decomposition both provide information about both time and frequency, making them more suitable for handling nonlinear, non-stationary signals. Meng L et al. [7] addressed the challenges of feature extraction and low pattern recognition accuracy in gearbox fault diagnosis by applying first-order differentiation followed by continuous wavelet transform to signals, effectively enhancing the resolution of time-frequency feature images. On the other hand, traditional machine learning methods [8] (such as Support Vector Machines (SVM) [9] and Random Forests (RF) [10]), which involve manually designed features combined with classifiers, have achieved some success in classifying gearbox and bearing failures. 
Overall, these methods offer advantages such as simplicity of implementation and high interpretability; however, under complex operating conditions, they still suffer from drawbacks including strong feature dependency, limited noise resistance, and a heavy reliance on human expertise.
With the rapid advancement of sensor technology and computing power, new data-driven fault diagnosis methods—particularly those based on deep learning—have begun to attract widespread attention. Compared to traditional methods, deep learning methods [11]—such as Convolutional Neural Networks (CNNs) [12] and Long Short-Term Memory (LSTM) networks [13]—enable end-to-end feature learning with superior expressive power and noise robustness. They have demonstrated superiority in fault diagnosis for equipment like bearings and gearboxes. Wang X et al. [14] integrated raw vibration and acoustic signals from bearings using a 1D-CNN network. Lv Y et al. [15] proposed a 2D BILSTM network to deeply extract and identify 2D time-frequency features. Kang J et al. [16] integrated long short-term memory with convolutional neural networks to effectively extract local signal features, enhancing their time-series analysis capabilities. In addition, Yuan et al. [17] proposed a gearbox fault diagnosis method based on empirical mode decomposition (EMD), multi-scale convolutional neural networks (MSCNN), and a lightweight convolutional attention mechanism. Through multi-scale feature extraction and attention modeling, this method effectively improved the model’s diagnostic performance under complex operating conditions. However, this method primarily addresses the diagnostic task for a single type of gear failure and does not address multi-component coupled failure scenarios. Furthermore, it does not account for the sample imbalance issue that is prevalent in real-world engineering applications. Although deep learning methods have made significant progress in feature extraction and single-task classification accuracy, their reliance on large amounts of labeled data, performance degradation under low-data-availability conditions, and lack of task-cooperative modeling capabilities in multi-component coupled fault diagnosis scenarios continue to limit their engineering applications. 
To mitigate these issues and enhance diagnostic accuracy and generalization capability, this paper introduces a multi-task learning (MTL) framework. By sharing representations and enabling collaborative learning across tasks, it improves learning efficiency under small sample conditions and promotes the coordinated fusion of multi-source information.
To address the challenges of small sample sizes and multi-component diagnostics in gearboxes, data augmentation and multi-task learning have emerged as critical research directions. On one hand, Generative Adversarial Networks (GANs) and their variants leverage adversarial training to learn data distributions from limited real samples, generating high-quality, diverse synthetic data to mitigate overfitting and performance degradation caused by data imbalance; Guo Q et al. [18] proposed the Multi-Label 1D Generative Adversarial Network (ML1-D-GAN) diagnostic framework to address low accuracy caused by insufficient fault data. Lyu P et al. [19] introduced a novel data augmentation model, the Gradient Penalty Separation Classifier (GPSC), based on GANs. Compared to traditional GANs, this model can more efficiently generate synthetic samples fused with fault samples. On the other hand, multi-task learning (MTL) [20,21,22] achieves feature complementarity and regularization across tasks by sharing underlying features within a single network while configuring dedicated output branches for different tasks, thereby enhancing model generalization and robustness. Niu G et al. [23] proposed a deep residual convolutional neural network with enhanced discriminative feature learning and information fusion capabilities for multi-task bearing diagnosis. Gao L et al. [24] proposed a lightweight, multi-task convolutional explainable shared network (MTCASN) framework for cross-device fault diagnosis to handle failure data from different equipment components. While existing research has explored GANs for bearing or gear fault data augmentation and applied MTL to mechanical fault diagnosis, systematic approaches that organically integrate both methodologies for multi-component, multi-task diagnosis scenarios in portal crane gearboxes remain scarce.
Furthermore, addressing the sample imbalance issue in gearbox fault diagnosis, some studies have explored generative data augmentation approaches. Su Y et al. [25] proposed a method integrating an improved GAN with a dual-stream convolutional network for fault diagnosis of wind turbine gearboxes under small-sample conditions. By incorporating gradient boosting, KNN decision boundaries, and Mahalanobis distance constraints, they enhanced the generation quality and discriminative capability of minority fault samples. Liang P et al. [26] combined the Stockwell transform, data-augmented GANs, and capsule networks to achieve effective identification of both single and composite faults in wind turbine gearboxes, demonstrating robust diagnostic performance under conditions of small sample sizes and class imbalance. Guo Z et al. [27] addressed the uneven distribution of gearbox fault samples by proposing a fault feature generation method based on wavelet packet features and an improved WGAN-GP, validating the effectiveness of generative models in mitigating sample imbalance. Although the aforementioned methods have made some progress in small-sample fault diagnosis for gearboxes, most studies remain focused on single-component or single-task scenarios. There is insufficient consideration for joint modeling and task collaborative learning of coupled faults involving multiple components such as gears and bearings. Furthermore, the collaborative optimization mechanism between generative models and downstream diagnostic models requires further investigation.
To deal with the dual challenges of sample imbalance and multi-task diagnosis in the fault diagnosis of portal crane gearboxes, this paper proposes a joint solution combining an “Improved Conditional Wasserstein GAN (CWGAN-GP)” and a “Multi-channel Multi-task Diagnostic Network.” The primary contributions of this study can be outlined as follows:
(1) A CWGAN-GP optimization strategy combining adversarial loss with auxiliary classification loss was developed. Under the stable generation mechanism ensured by Wasserstein distance and gradient penalties, this approach enables the generator to simultaneously prioritize distribution fidelity and category discriminability during minority class sample generation. This approach effectively mitigates data skew caused by class imbalance in transmission failure datasets at the generative mechanism level, providing high-quality synthetic samples with enhanced discriminative power for subsequent diagnostic models.
(2) Based on data generation and balancing, a multi-task learning framework for integrated gearbox diagnostics has been established. This framework comprises a BiLSTM temporal encoding branch and a CNN spatiotemporal encoding branch, incorporating a cross-modal multi-head attention mechanism for feature fusion. It acquires discriminative shared representations by adaptively modeling the correlation between temporal and spatiotemporal features. Building upon this foundation, a task-sharing layer and task-specific prediction heads are employed to achieve joint learning for component identification and fault discrimination, effectively mitigating representation drift and negative transfer issues in multi-task settings.
(3) Systematic comparison and ablation experiments were conducted on vibration datasets from portal crane gearboxes, verifying that the proposed generative augmentation and multi-task fusion framework maintains leading diagnostic accuracy even under conditions of significant class imbalance. This further demonstrates the robustness and broad application potential of this method in real industrial scenarios.

2. Theoretical Basis

2.1. Improving the CWGAN-GP Network

Building upon the framework of the traditional Conditional Wasserstein GAN with Gradient Penalty (CWGAN-GP), where GP denotes gradient penalty used to stabilize generative adversarial training, this paper improves the loss function structure of CWGAN-GP to address the issue of insufficient discriminability in generated samples under conditions of class imbalance in gearbox failure data. Specifically, by introducing an auxiliary classification branch into the discriminator and jointly incorporating a classification loss term into the optimization objectives of both the generator and the discriminator, we construct an improved CWGAN-GP with category discrimination constraints to generate more distinguishable minority-class fault samples. Unlike the traditional CWGAN-GP, which constrains the distribution of generated samples solely through adversarial loss, our method jointly optimizes adversarial and classification losses. This approach explicitly enhances the category discriminability of the samples while maintaining the consistency of the generated sample distribution.
Conditional Generative Adversarial Networks (CGANs) introduce additional conditional information y based on GANs. By concatenating labels or other control variables with noise as input to the generator, and simultaneously using them as supplementary input to the discriminator, CGANs achieve targeted generation of data for specific categories [28]. The schematic of the conditional GAN is shown in Figure 1. Specifically, the generator receives the concatenated input [z, y] (noise vector z and conditional label y) and generates a sample x = G(z, y). The discriminator receives the sample and its corresponding label (x, y) as input and outputs the conditional probability that x is a real sample. Its objective function is formulated as:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\log D(x \mid y)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z \mid y))\right)\right] \tag{1}$$
In this equation, $G$ denotes the generator network, which takes the latent noise vector $z \sim p_z(z)$ and the conditional label $y$ as inputs to generate synthetic samples via the mapping function $G(z \mid y)$. $D$ represents the discriminator network, which receives samples and their corresponding labels $(x, y)$ as inputs and outputs the conditional authenticity probability $D(x \mid y) \in (0, 1)$. Here, $D(x \mid y)$ measures the confidence that the sample is genuine given label $y$, while $1 - D(G(z \mid y))$ corresponds to the probability that the discriminator judges the generated sample $G(z \mid y)$ as fake. $\mathbb{E}_{x \sim p_{\text{data}}(x)}[\cdot]$ and $\mathbb{E}_{z \sim p_z(z)}[\cdot]$ denote expectations under the true data distribution $p_{\text{data}}(x)$ and the noise distribution $p_z(z)$, respectively. The entire objective forms a minimax adversarial game: the discriminator $D$ maximizes this log-likelihood sum to enhance conditional authenticity detection, while the generator $G$ minimizes it in order to learn to generate samples consistent with the true data distribution under the given condition $y$.
CGAN retains the adversarial training framework of the original GAN, incorporating conditional variables into the inputs of both the generator and discriminator. This enables the generated samples to reflect the information of the given category labels. In fault diagnosis, the conditional information can represent fault types or operational conditions. Through CGAN, synthetic data with specific fault characteristics can be generated to augment the dataset.
Traditional GANs often suffer from vanishing gradients due to the saturation of the objective function (e.g., JS divergence) when distributions differ significantly, resulting in training instability or mode collapse. To address this, we introduce the Wasserstein GAN framework, employing Earth-Mover distance (Wasserstein distance) as a metric for distribution divergence. We further adopt gradient penalties (WGAN-GP) to explicitly constrain the discriminator’s Lipschitz constant, significantly improving training stability and convergence behavior. Extending this framework to its conditional form (CWGAN-GP), we establish a foundation for both class-specific sample generation and more stable training.
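The saturation argument can be checked on a toy case. The sketch below is an illustration only, not part of the proposed method: it compares the Jensen-Shannon divergence and the one-dimensional Wasserstein-1 distance for distributions whose supports do not overlap. The helper names `js_divergence` and `wasserstein_1d` and the toy histograms and samples are assumptions made for this example.

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions (natural log)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # bins with a == 0 contribute nothing to the sum
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def wasserstein_1d(a, b):
    """Wasserstein-1 distance between two equal-size 1-D samples (sorted matching)."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    return float(np.mean(np.abs(a - b)))

# Toy histograms over four bins: the "generated" mass sits one or three bins away.
real_hist = [1, 0, 0, 0]
js_near = js_divergence(real_hist, [0, 1, 0, 0])
js_far = js_divergence(real_hist, [0, 0, 0, 1])

# The same situation expressed as samples on the real line.
w_near = wasserstein_1d([0, 0, 0], [1, 1, 1])
w_far = wasserstein_1d([0, 0, 0], [3, 3, 3])
```

In this toy case the Jensen-Shannon divergence equals log 2 for both the near and the far histogram, so it gives no signal about which one is closer, whereas the Wasserstein distance grows with the shift (1 versus 3). This is the behavior that motivates a Wasserstein-based critic.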
Most generative adversarial models share a common training problem: mode collapse, in which many of the generated samples become nearly identical to one another, resulting in a loss of diversity. Based on the aforementioned improvement principles, this paper designs the generator and discriminator loss functions for the enhanced CWGAN-GP, as shown in Equations (2) and (3).
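$$L(G) = -\mathbb{E}_{z \sim P_z}\left[D(G(z))\right] + \lambda_1\, \mathbb{E}_{z \sim P_z}\left[-\log P\left(y = c \mid G(z)\right)\right] \tag{2}$$
$$L(D) = \mathbb{E}_{z \sim P_z}\left[D(G(z))\right] - \mathbb{E}_{x \sim P_r}\left[D(x)\right] + \lambda_2\, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\left[\left(\left\lVert \nabla_{\hat{x}} D(\hat{x}) \right\rVert_2 - 1\right)^2\right] + \lambda_1\, \mathbb{E}_{x \sim P_r}\left[-\log P\left(y = c \mid x\right)\right] \tag{3}$$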
Equations (2) and (3) represent the loss functions for the generator and discriminator, respectively. $\lambda_1$ denotes the weight coefficient associated with the classification loss, while $\lambda_2$ is the gradient penalty coefficient. Referencing existing WGAN-GP research, this paper sets $\lambda_2$ to 10. The value of $\lambda_1$ is analyzed and validated through ablation experiments, with relevant results presented in Section 4.2. Based on a comprehensive evaluation of sample quality and downstream diagnostic performance, $\lambda_1 = 0.5$ and $\lambda_2 = 10$ are ultimately selected as the default parameter configuration.
In the standard WGAN-GP framework, the generator approximates the overall distribution of real data solely through adversarial loss. Under conditions of class imbalance, it tends to favor high-frequency classes, neglecting minority-class patterns and thereby triggering mode collapse. This paper introduces an auxiliary classification loss into CWGAN-GP. This mechanism constrains the generator with explicit class discrimination while optimizing distribution similarity, thereby encouraging distinct pattern distributions for different categories in the feature space. This joint optimization suppresses the generator's tendency to collapse into a single mode or a small number of modes, enhancing the diversity and coverage of generated samples across categories.
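To make the structure of Equations (2) and (3) concrete, the following minimal sketch evaluates both losses from precomputed quantities. It assumes the usual minimization convention for WGAN-GP (the critic score of generated samples is pushed down and that of real samples up) and writes the classification term as a cross-entropy. The function names, the argument layout, and the idea of passing gradient norms in as plain numbers are assumptions for illustration; in a real training loop the gradient norms at the interpolated points would come from automatic differentiation.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean negative log-likelihood of the true class under softmax(logits)."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

def gradient_penalty(grad_norms):
    """Penalty term E[(||grad D(x_hat)||_2 - 1)^2] from precomputed gradient norms."""
    return float(np.mean((np.asarray(grad_norms) - 1.0) ** 2))

def generator_loss(d_fake, cls_logits_fake, labels, lam1=0.5):
    """Adversarial term plus lam1-weighted classification term, as in Equation (2)."""
    return float(-np.mean(d_fake)) + lam1 * cross_entropy(cls_logits_fake, labels)

def critic_loss(d_real, d_fake, grad_norms, cls_logits_real, labels,
                lam1=0.5, lam2=10.0):
    """Wasserstein term, gradient penalty, and classification term, as in Equation (3)."""
    wasserstein = float(np.mean(d_fake) - np.mean(d_real))
    return (wasserstein
            + lam2 * gradient_penalty(grad_norms)
            + lam1 * cross_entropy(cls_logits_real, labels))
```

The defaults `lam1=0.5` and `lam2=10.0` mirror the values reported in the text. Keeping the three terms as separate functions makes it easy to log each contribution during training and to repeat the ablation over the classification weight.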

2.2. Multitask Learning Networks

Multi-task learning is a machine learning paradigm that simultaneously handles multiple related tasks within a unified model. By sharing underlying representations, it uncovers task-to-task correlations to enhance learning efficiency across all tasks [29]. The advantage of MTL lies in its shared layer parameters, which capture common features across different tasks while reducing overfitting risks. Meanwhile, dedicated layers for each task learn high-level representations tailored to their specific objectives. In the field of fault diagnosis, MTL can be applied to simultaneously diagnose multiple fault modes or severity levels of equipment. The tasks considered in this paper include two subtasks: “bearing fault diagnosis” and “gear fault diagnosis.” By constructing a multi-task deep network architecture featuring shared convolutional layers and task-specific output branches, the model can extract more comprehensive and effective feature information from limited samples, enabling collaborative identification of faults across multiple components. Common MTL strategies include hard parameter sharing (sharing partial layer parameters) and soft parameter sharing (fusing task-specific features through regularization or specialized modules). This paper adopts the hard parameter sharing approach, where common features are extracted in the convolutional layer before separately outputting classification results for each task.
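Hard parameter sharing can be sketched in a few lines. The example below is a simplified illustration under stated assumptions: a single dense layer stands in for the shared convolutional layers, and the class name `HardSharingNet`, the layer sizes, and the numbers of bearing and gear classes are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

class HardSharingNet:
    """One shared trunk feeding two task-specific output heads."""

    def __init__(self, in_dim, shared_dim, n_bearing, n_gear, seed=0):
        rng = np.random.default_rng(seed)
        self.w_shared = rng.normal(0.0, 0.1, (in_dim, shared_dim))      # used by both tasks
        self.w_bearing = rng.normal(0.0, 0.1, (shared_dim, n_bearing))  # bearing head only
        self.w_gear = rng.normal(0.0, 0.1, (shared_dim, n_gear))        # gear head only

    def forward(self, x):
        shared = relu(x @ self.w_shared)  # common representation
        return softmax(shared @ self.w_bearing), softmax(shared @ self.w_gear)

# Hypothetical sizes: 32 input features, 4 bearing classes, 3 gear classes.
net = HardSharingNet(in_dim=32, shared_dim=16, n_bearing=4, n_gear=3)
p_bearing, p_gear = net.forward(np.ones((5, 32)))
```

Because `w_shared` receives gradients from both heads during training, it is pushed toward features that are useful for both tasks, which is the regularizing effect described above.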

3. Gearbox Fault Diagnosis

This paper proposes a multi-task fault diagnosis framework based on CWGAN-GP adversarial generation and multi-domain information fusion, as shown in Figure 2. The overall process comprises three stages: signal segmentation and preprocessing, data generation and augmentation, and diagnostic model training and classification. First, the original vibration signal is split into a training set and a test set according to the sampling conditions, followed by denoising and normalization. Subsequently, the CWGAN-GP generator synthesizes gear and bearing fault samples using random noise and fault labels as inputs to enhance dataset diversity. Adversarial training against the discriminator then pushes the generated samples toward the real samples. Finally, multi-domain feature representations are constructed for both real and synthetic samples. On one hand, the short-time Fourier transform is applied to the one-dimensional vibration signals, mapping them into two-dimensional time-frequency spectra that are fed into the 2D-CNN channel to extract local time-frequency features. On the other hand, the raw time-domain signals are fed in parallel into the Bi-LSTM channel to model the temporal dependencies of the vibration signals. A feature fusion module based on a cross-modal multi-head attention mechanism then adaptively models and fuses the key time-frequency features extracted by the CNN channel with the temporal features output by the Bi-LSTM channel. After further refinement through a shared representation layer, the fused features are fed into the fully connected layers and Softmax classifiers of the bearing and gear fault diagnosis task branches, respectively. Joint training with a weighted multi-task loss enables simultaneous high-precision identification of both fault types.
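The time-frequency mapping step can be illustrated with a small NumPy implementation of the short-time Fourier transform. This is a sketch under assumptions: the frame length, hop size, Hann window, and the 1024 Hz sampling rate of the test tone are illustrative choices, not the settings used in the experiments, and `stft_magnitude` is a hypothetical helper name.

```python
import numpy as np

def stft_magnitude(signal, frame_len=64, hop=32):
    """Magnitude spectrogram of a 1-D signal, shaped (frequency bins, frames)."""
    signal = np.asarray(signal, dtype=float)
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[i * hop: i * hop + frame_len] * window  # windowed, overlapping frames
        for i in range(n_frames)
    ])
    return np.abs(np.fft.rfft(frames, axis=1)).T

# A 1024-sample test tone at 128 Hz, sampled at an assumed 1024 Hz.
t = np.arange(1024) / 1024.0
spec = stft_magnitude(np.sin(2 * np.pi * 128 * t))
peak_bin = int(spec.mean(axis=1).argmax())  # bin spacing is 1024 / 64 = 16 Hz
```

With these settings the spectrogram has 33 frequency bins and 31 frames, and the energy of the test tone concentrates in bin 8 (8 × 16 Hz = 128 Hz). The resulting two-dimensional array is the kind of input a 2D-CNN channel would consume, while the raw `signal` would go to the recurrent channel.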
The model primarily consists of data augmentation and fault diagnosis phases, with the diagnostic process illustrated in Figure 3.
A. Data Augmentation Phase
(1) Initial Data Collection. Gather imbalanced multi-class fault signal samples (e.g., gears, bearings) to form the original fault data training set.
(2) CWGAN-GP Model Training. Train CWGAN-GP on the original fault data to learn the distributional characteristics of various fault signals.
(3) Synthetic Fault Data Generation. Utilize the trained CWGAN-GP generator to synthesize new fault signal samples based on fault category labels.
(4) Fault Data Augmentation and Balancing. Merge synthetic samples with original samples to construct a training dataset with balanced category distribution, while the test set consists solely of original fault samples to evaluate the model’s diagnostic performance on real data.
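The balancing step in phase A can be expressed as a per-class quota of synthetic samples. The sketch below assumes that every class is topped up to the size of the largest class. The text only states that the augmented training set is balanced, so this target is an assumption, and the class names and counts are hypothetical.

```python
from collections import Counter

def synthetic_quota(train_labels):
    """Number of synthetic samples to generate per class to match the largest class."""
    counts = Counter(train_labels)
    target = max(counts.values())
    return {cls: target - n for cls, n in counts.items()}

# Hypothetical imbalanced training labels (the test set is left untouched).
train_labels = ["normal"] * 100 + ["inner_race"] * 20 + ["broken_tooth"] * 5
quota = synthetic_quota(train_labels)
```

For this toy label list the quota is 0 for `normal`, 80 for `inner_race`, and 95 for `broken_tooth`. Only the training split is passed to the function, which matches the requirement that the test set consist solely of original samples.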
B. Fault Diagnosis Phase
(1) Model Construction and Parameter Tuning. A multi-task fusion model (CNN-BiLSTM-MTL) was developed to perform dual-channel feature extraction targeting both the time-frequency characteristics and time-domain sequence features of vibration signals. Key hyperparameters—such as convolution kernel size, Bi-LSTM hidden layer dimension, and task loss weighting—were optimized through a combination of empirical tuning and parameter search.
(2) Model Training and Validation. Perform multi-task joint training on the CNN-BiLSTM-MTL model using the augmented balanced training dataset. Validate model performance on an independent test set, evaluating classification accuracy metrics.

3.1. Data Augmentation

The network architecture is shown in Figure 4. The generator receives random noise z and a class index as input. It first maps the label to an embedding vector of the same dimension as z, then performs element-wise multiplication with z. The result is upsampled to an output dimension of 1024 through three fully connected layers. The final output activation uses tanh to match the data's normalized range of [−1, 1]. The discriminator multiplies the input signal element-wise with the category vector obtained via embedding, then feeds it into three fully connected layers. It ultimately produces two output branches: one outputs a real-valued score for WGAN adversarial training, while the other outputs category logits for auxiliary classification. Training employs the WGAN-GP adversarial objective with a gradient penalty coefficient of λ2 = 10, and the discriminator is updated n_critic = 5 times per generator step using the Adam optimizer. Label smoothing of 0.1 is applied to the auxiliary classification. The trained CWGAN-GP generates diverse samples consistent with the distribution of real fault signals, enabling subsequent training of multi-task diagnostic networks while mitigating data scarcity and class imbalance.
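A forward pass through a generator of this shape can be sketched as follows. Only the label embedding, the element-wise multiplication, the three fully connected layers, the 1024-point output, and the tanh activation come from the description above. The noise dimension, number of classes, hidden widths, leaky-ReLU activation, and random weights are assumptions made so that the example runs on its own.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_CLASSES = 5      # assumed number of fault categories
NOISE_DIM = 100      # assumed latent dimension
HIDDEN = (256, 512)  # assumed hidden widths
SIGNAL_LEN = 1024    # output length stated in the text

# Label embedding with the same dimension as the noise vector.
embedding = rng.normal(0.0, 1.0, (NUM_CLASSES, NOISE_DIM))

# Three fully connected layers: 100 -> 256 -> 512 -> 1024.
dims = (NOISE_DIM,) + HIDDEN + (SIGNAL_LEN,)
weights = [rng.normal(0.0, 0.05, (a, b)) for a, b in zip(dims[:-1], dims[1:])]

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def generate(z, labels):
    """Map noise and class indices to signals with values in [-1, 1]."""
    h = z * embedding[labels]        # element-wise label conditioning
    for w in weights[:-1]:
        h = leaky_relu(h @ w)
    return np.tanh(h @ weights[-1])  # matches the normalized data range

z = rng.normal(size=(4, NOISE_DIM))
fake = generate(z, np.array([0, 1, 2, 3]))
```

Multiplying the noise by a label embedding, rather than concatenating a one-hot vector, keeps the input width of the first layer independent of the number of classes.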

3.2. Residual Learning Unit

To address the issues of feature degradation and insufficient utilization of deep-layer information during multi-layer convolutional feature extraction for gearbox vibration signals, this paper introduces residual learning units into the time-frequency feature extraction channel. This enhances the deep convolutional network's ability to model complex fault features. The design is not a simple application of residual structures, but rather a critical component within the dual-channel feature extraction and multi-task learning framework proposed herein. It ensures the effective representation of deep spatiotemporal features prior to cross-modal fusion.
As shown in Figure 5, the residual unit comprises two layers of 2D convolutions, batch normalization, and ReLU activation functions. Input and output features are directly summed via an identity mapping, preserving gradient stability during backpropagation while preventing shallow discriminative information from weakening as the network deepens. This architecture enhances feature reuse capability without significantly increasing computational complexity. It provides more robust high-level feature representations for subsequent cross-modal attention fusion and multi-task shared layers, thereby improving the model’s diagnostic accuracy and robustness under complex coupled fault conditions.
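The residual unit can be sketched in NumPy for a single feature map. The ordering used here (convolution, normalization, ReLU, convolution, normalization, skip addition, ReLU) is the common ResNet arrangement and is assumed rather than taken from Figure 5. The per-channel normalization over spatial positions is a single-sample stand-in for batch normalization, and all function names are hypothetical.

```python
import numpy as np

def conv2d_same(x, w):
    """3x3-style 'same' cross-correlation. x: (C_in, H, W); w: (C_out, C_in, kH, kW)."""
    c_out, c_in, kh, kw = w.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(x, ((0, 0), (ph, ph), (pw, pw)))
    height, width = x.shape[1:]
    out = np.zeros((c_out, height, width))
    for i in range(kh):
        for j in range(kw):
            patch = padded[:, i:i + height, j:j + width]  # shifted input window
            out += np.einsum("oc,chw->ohw", w[:, :, i, j], patch)
    return out

def channel_norm(x, eps=1e-5):
    """Normalize each channel over its spatial positions (stand-in for batch norm)."""
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def residual_unit(x, w1, w2):
    """Two conv layers plus an identity skip connection; channel count is preserved."""
    out = np.maximum(channel_norm(conv2d_same(x, w1)), 0.0)
    out = channel_norm(conv2d_same(out, w2))
    return np.maximum(out + x, 0.0)  # identity mapping added before the final ReLU
```

If both convolution kernels are zero, the unit reduces to `relu(x)`, which shows why the skip connection lets the input pass through unchanged when the convolutional path contributes nothing.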

3.3. Feature Extraction Based on Bi-LSTM

Given the distinct characteristics of fault features in gearbox vibration signals—exhibiting both pronounced temporal correlation and coexisting local transient features—this paper constructs a temporal modeling module based on a bidirectional long short-term memory (Bi-LSTM) network within the time-domain feature extraction channel. This enhances the model’s ability to perceive dynamically evolving features under complex operating conditions. Unlike traditional sequential models that rely solely on unidirectional temporal information, Bi-LSTM simultaneously models dependencies between historical and future time steps, providing a more comprehensive sequential feature representation for subsequent multi-task diagnostics.
Building upon Bi-LSTM, this paper further introduces a time-dimension-based attention weighting mechanism for adaptive modeling of temporal features across different time points. Specifically, linear mapping of Bi-LSTM outputs at each time step, combined with learnable context vectors, calculates attention weights to highlight key temporal features contributing significantly to fault discrimination while suppressing redundant or noisy information. Subsequently, these features are aggregated through weighted summation to form a discriminative global temporal representation. This temporal attention mechanism effectively enhances the model’s ability to perceive transient impact features and periodic fault patterns within gearbox vibration signals.
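The weighting scheme described above can be written compactly. In the sketch below, the tanh nonlinearity between the linear mapping and the context vector is an assumption (a common choice for additive attention). The function name, the array shapes, and the random stand-in for the Bi-LSTM outputs are likewise hypothetical.

```python
import numpy as np

def temporal_attention(h, w, b, u):
    """Attention pooling over time.

    h: (T, D) Bi-LSTM outputs; w: (D, A) and b: (A,) linear mapping;
    u: (A,) learnable context vector. Returns the pooled (D,) vector and (T,) weights.
    """
    scores = np.tanh(h @ w + b) @ u  # one relevance score per time step
    scores = scores - scores.max()   # numerical stability for the softmax
    alpha = np.exp(scores) / np.exp(scores).sum()
    context = alpha @ h              # weighted sum over time steps
    return context, alpha

rng = np.random.default_rng(0)
h = rng.normal(size=(20, 8))         # stand-in for 20 time steps of Bi-LSTM output
context, alpha = temporal_attention(h, rng.normal(size=(8, 4)), np.zeros(4),
                                    rng.normal(size=4))
```

The weights `alpha` sum to one, so the pooled vector is a convex combination of the per-step outputs. With an all-zero context vector the weights become uniform and the mechanism falls back to plain average pooling.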

3.4. Fault Diagnosis Method Based on CWGAN-GP-MTL

Multitask learning uncovers latent correlations between bearing and gear failures by sharing network parameters. This approach enhances generalization and robustness for small-sample tasks while dynamically balancing learning progress across tasks, thereby achieving synchronized high-precision classification in gearbox fault diagnosis.
(1) Structure of the CWGAN-GP-MTL Model
For vibration data generated and augmented via CWGAN-GP, this paper proposes a dual-channel classification model based on multi-task learning. First, for both augmented and genuine vibration signals, a short-time Fourier transform is applied to map the one-dimensional time-domain signal into a two-dimensional time-frequency spectrum, which is then fed into the 2D-CNN channel to extract time-frequency features. Simultaneously, the augmented and original raw time-domain signals are fed into a Bi-LSTM channel to extract temporal features. A cross-modal multi-head attention mechanism adaptively fuses the features from the time-frequency and time domains. Subsequently, the fused features undergo further refinement through a shared layer before being fed into separate task branches for gear and bearing fault classification. These branches are jointly optimized using a multi-task loss to achieve simultaneous high-precision diagnosis of both fault types. The network configuration of the CWGAN-GP-MTL architecture is shown in Table 1.
This paper employs a dual-channel feature extraction strategy to separately mine complementary information from time-domain sequences and time-frequency spectra. The time-domain channel utilizes a bidirectional long short-term memory network as its backbone, incorporating an attention mechanism on its output to automatically emphasize temporal step features relevant to fault discrimination, thereby enhancing sensitivity to both periodic and transient events. The time-frequency channel converts one-dimensional vibration signals into time-frequency representations via short-time Fourier transform (STFT). A convolutional neural network then performs multi-layer spatial feature extraction on the time-frequency images. Residual learning units and adaptive pooling are incorporated within the network to enhance deep feature learning capabilities and ensure dimensional stability after feature flattening. Features learned by the two channels are fused through a cross-modal multi-head attention mechanism, enabling selective information exchange while preserving discriminative features from both modalities. This design allows the model to maintain intra-modal expressive power while dynamically allocating attention weights during fusion, thereby improving discrimination capability and robustness against complex coupled faults.
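One way to realize cross-modal multi-head attention is to let the features of one channel act as queries over the features of the other. The sketch below assumes that temporal (Bi-LSTM) tokens query time-frequency (CNN) tokens and that both have already been projected to a common width. The direction of attention, the token counts, the width, and the number of heads are all assumptions, and the weights are random placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(query_feats, context_feats, wq, wk, wv, wo, num_heads):
    """query_feats: (Tq, D); context_feats: (Tk, D); wq, wk, wv, wo: (D, D)."""
    tq, d = query_feats.shape
    tk = context_feats.shape[0]
    dh = d // num_heads  # per-head width; D must be divisible by num_heads
    # Project, then split into heads: (heads, tokens, dh).
    q = (query_feats @ wq).reshape(tq, num_heads, dh).transpose(1, 0, 2)
    k = (context_feats @ wk).reshape(tk, num_heads, dh).transpose(1, 0, 2)
    v = (context_feats @ wv).reshape(tk, num_heads, dh).transpose(1, 0, 2)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh))  # (heads, Tq, Tk)
    heads = attn @ v                                         # (heads, Tq, dh)
    merged = heads.transpose(1, 0, 2).reshape(tq, d)         # concatenate heads
    return merged @ wo, attn

rng = np.random.default_rng(0)
D, HEADS = 16, 4
temporal = rng.normal(size=(10, D))  # stand-in for Bi-LSTM tokens
timefreq = rng.normal(size=(6, D))   # stand-in for CNN tokens
wq, wk, wv, wo = (rng.normal(0.0, 0.1, (D, D)) for _ in range(4))
fused, attn = cross_modal_attention(temporal, timefreq, wq, wk, wv, wo, HEADS)
```

Each head produces its own attention map over the time-frequency tokens, so different heads can attend to different spectral regions for the same time step. This is one reading of the "selective information exchange" described above.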
Based on the fused features obtained, a “shared-dedicated” architecture was designed to enable multi-task collaborative learning. The shared network extracts general features from the fused features through channel mapping and residual augmentation modules, aiming to learn underlying representations valuable across all tasks and improve gradient flow during training, thereby reducing the risk of overfitting in individual tasks. Subsequently, a lightweight task branch is established for each subtask. Shared features are projected into task-specific discriminative vectors through channel remapping and small-scale fully connected mapping, preserving task-specific differences and enabling final classification predictions.
It should be noted that the multi-task learning framework proposed in this paper targets the joint fault diagnosis of two critical components: bearings and gears in gearboxes. In the experimental setup, the model’s input comprises two data streams: bearing signal samples and gear signal samples. Both streams correspond to the operational state of the same gearbox and are respectively labeled with bearing fault tags and gear fault tags. After feature extraction and fusion within the network, these two feature streams form a unified shared representation. This representation is simultaneously fed into both the bearing task branch and the gear task branch, each of which outputs a fault category prediction for its respective component.
During training, the loss for the bearing task branch is computed only against the bearing fault labels, and the loss for the gear task branch is computed only against the gear fault labels. The two losses are combined into a multi-task loss and backpropagated together, which constrains the shared layers to learn a general feature representation that is discriminative for both tasks. During the testing phase, unknown samples undergo shared feature extraction and are fed in parallel to both task branches, yielding separate fault identification results for bearings and gears.
(2) Multiple Loss Functions
This paper employs the cross-entropy loss function as the model optimization objective for both bearing and gear fault diagnosis tasks. This loss function measures the discrepancy between the predicted class distribution and the true class distribution, making it suitable for the single-label multi-class fault identification task addressed in this study. By minimizing the negative log-likelihood of the true class, cross-entropy loss drives the model to output probability distributions closer to the true class, thereby enhancing classification accuracy and convergence stability. Two independent cross-entropy loss functions are set for the bearing and gear tasks, expressed as follows:
$$L_{\text{bearing}} = -\sum_{j=1}^{k_b} p_j^{b} \log q_j^{b} \tag{4}$$
$$L_{\text{gear}} = -\sum_{j=1}^{k_g} p_j^{g} \log q_j^{g} \tag{5}$$
Equations (4) and (5) give the cross-entropy losses $L_{\text{bearing}}$ and $L_{\text{gear}}$ for the bearing and gear tasks, respectively. $p^{b}$ and $p^{g}$ denote the target distributions for the two tasks, $q^{b}$ and $q^{g}$ denote the corresponding estimated distributions, and $k_b$ and $k_g$ are the numbers of bearing and gear classes.
To enable the model to train collaboratively for both tasks, the sum of the losses for the bearing and gear components is directly used as the network’s final loss function. The total loss is expressed as follows:
$$Loss = L_{\text{bearing}} + L_{\text{gear}} \tag{6}$$
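A small worked example may make Equations (4) to (6) concrete. The one-hot targets and predicted probabilities below are made-up values for a single sample with five classes per task, and the `eps` guard against `log(0)` is an implementation assumption rather than part of the paper's formulation.

```python
import math


def cross_entropy(p, q, eps=1e-12):
    """Cross-entropy -sum_j p_j * log(q_j) for one sample."""
    return -sum(pj * math.log(qj + eps) for pj, qj in zip(p, q))


# Hypothetical one-hot targets and softmax outputs for one sample.
p_bearing = [0, 0, 1, 0, 0]
q_bearing = [0.05, 0.05, 0.80, 0.05, 0.05]
p_gear = [1, 0, 0, 0, 0]
q_gear = [0.90, 0.025, 0.025, 0.025, 0.025]

loss_bearing = cross_entropy(p_bearing, q_bearing)  # reduces to -log(0.80)
loss_gear = cross_entropy(p_gear, q_gear)           # reduces to -log(0.90)
total_loss = loss_bearing + loss_gear               # unweighted sum, as in Eq. (6)
```

Because the targets are one-hot, each task loss reduces to the negative log-probability assigned to the true class, and the total is their unweighted sum.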
(3) Cross-Modal Multi-Head Attention Fusion Module and Rationale
To achieve effective fusion between temporal BiLSTM features and time-frequency CNN features, this paper introduces a cross-modal multi-head attention mechanism as the feature interaction module within a multi-task learning framework. This mechanism uses the temporal features as the Query and the time-frequency features as both Key and Value. By modeling correlations across modalities through parallel multi-head subspaces, it adaptively assigns weights to both modalities within the shared representation, thereby highlighting key feature dimensions and suppressing redundant information.
Compared to traditional fusion methods (e.g., simple concatenation or weighted averaging), multi-head attention offers the following advantages: (1) Explicitly captures cross-modal correspondences, suitable for time-domain/time-frequency features with significant modal differences; (2) The multi-head architecture enables complementary information learning across different subspaces, enhancing fusion expressiveness; (3) Relatively low parameter count facilitates integration into multi-task frameworks with manageable computational overhead; (4) Improves model robustness and generalization capabilities under complex operating conditions.
Thus, the cross-modal multi-head attention mechanism enhances fusion quality and joint diagnostic performance, serving as a critical component in the fault diagnosis methodology developed in this study.
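The Query/Key/Value roles described above can be sketched with a single-head scaled dot-product cross-attention. This is a simplification under stated assumptions: there are no learned projection matrices, only one head is shown (a multi-head layer would run several such heads in parallel on projected subspaces), and the names `temporal_tokens` and `tf_tokens` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head attention: softmax(Q K^T / sqrt(d)) V, computed row by row."""
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs, weights


# Hypothetical toy features: temporal tokens act as queries,
# time-frequency tokens act as both keys and values.
temporal_tokens = [[10.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
tf_tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
fused, attn = cross_attention(temporal_tokens, tf_tokens, tf_tokens)
```

In this toy case the first query aligns with the first time-frequency token and receives most of the weight, while the all-zero query attends uniformly and returns the mean of the values.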

4. Experimental Verification

To evaluate the performance of the proposed method, this section analyzes case studies of the Southeast University gearbox and the port gearbox. In these case studies, all deep learning (DL) models were trained using PyTorch (version 2.7.0+cu128) within a Python (version 3.12.9) environment on a computer equipped with an RTX 4090 GPU and 64 GB of memory. To ensure clarity and consistency in the experimental analysis, the evaluation criteria for data generation quality are explained first. Subsequently, the effectiveness of the proposed method is validated and analyzed through two case studies.

4.1. Data Generation Quality Assessment Criteria

To comprehensively evaluate the quality of generated data, this paper assesses the similarity between generated and original data from two perspectives: quantitative statistical analysis and qualitative feature visualization. Through multi-dimensional evaluation criteria, the performance of the generative model in terms of distributional consistency and feature preservation can be more systematically characterized.
(1) Quantitative Evaluation Indicators
This paper employs the Maximum Mean Discrepancy (MMD) metric in experiments to quantify the quality and diagnostic performance of generated samples. MMD is a statistical measure for quantifying the difference between two probability distributions, enabling quantitative evaluation of generated data quality through distribution-based similarity comparisons. Its functional form is:
$$\mathrm{MMD}\big(F, P(X), P(G(X))\big) = \left[ \frac{1}{m(m-1)} \sum_{i \neq j}^{m} k(x_i, x_j) - \frac{2}{mn} \sum_{i,j=1}^{m,n} k(x_i, y_j) + \frac{1}{n(n-1)} \sum_{i \neq j}^{n} k(y_i, y_j) \right]^{\frac{1}{2}} \tag{7}$$
In Equation (7), $F$ represents the generalized Gaussian kernel function, $P(X)$ and $P(G(X))$ denote the distributions of genuine fault data and generated fault samples, respectively, $m$ and $n$ represent the sample sizes of the two distributions, $x_i$ and $y_i$ denote sample points from $P(X)$ and $P(G(X))$, respectively, and $k$ represents the Gaussian kernel function. Multiple experiments were conducted by randomly sampling equal numbers of original samples and generated fault samples from each dataset.
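The estimator in Equation (7) can be sketched directly in a few lines. The kernel bandwidth `sigma`, the toy two-dimensional samples, and the clamp to zero before the square root are assumptions: the unbiased estimate inside the brackets can fall slightly below zero for near-identical sample sets, and the paper does not state how that case is handled.

```python
import math


def gaussian_kernel(a, b, sigma=1.0):
    """k(a, b) = exp(-||a - b||^2 / (2 * sigma^2))."""
    sq = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq / (2.0 * sigma ** 2))


def mmd(xs, ys, sigma=1.0):
    """Square root of the unbiased MMD^2 estimate, following Eq. (7)."""
    m, n = len(xs), len(ys)
    xx = sum(gaussian_kernel(xs[i], xs[j], sigma)
             for i in range(m) for j in range(m) if i != j)
    yy = sum(gaussian_kernel(ys[i], ys[j], sigma)
             for i in range(n) for j in range(n) if i != j)
    xy = sum(gaussian_kernel(x, y, sigma) for x in xs for y in ys)
    squared = xx / (m * (m - 1)) - 2.0 * xy / (m * n) + yy / (n * (n - 1))
    # Assumption: clamp small negative estimates to zero before the square root.
    return math.sqrt(max(squared, 0.0))


# Hypothetical feature vectors standing in for real and generated samples.
real = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]]
close_fake = [[0.05, 0.05], [0.1, 0.1], [0.0, 0.0]]
far_fake = [[3.0, 3.0], [3.1, 3.0], [3.0, 3.1]]

mmd_close = mmd(real, close_fake)
mmd_far = mmd(real, far_fake)
```

Generated samples that sit near the real ones give a value near zero, while a shifted cloud gives a clearly larger value, which is the behavior the metric relies on.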
(2) Qualitative Feature Visualization Methods
To further visually assess the similarity between generated samples and real samples in the feature space, we employed dimensionality reduction visualization techniques such as t-SNE and PCA. t-SNE is a nonlinear dimensionality reduction method that maps high-dimensional data onto a two-dimensional plane, preserving the relative positions of similar samples in the low-dimensional plot. We first extracted high-dimensional features from both real fault signals and generated signals, then performed t-SNE and PCA projections, respectively. The distributions of both sample types were plotted within the same coordinate system.

4.2. Case 1: Southeast University Dataset

The gearbox fault data used in this study originates from Southeast University’s transmission system dynamic test bench based on the Drivetrain Dynamic Simulator (DDS) [30], as shown in Figure 6. This test bench acquires bearing and gear signals under two operating conditions (20 Hz–0 V and 30 Hz–2 V). Bearing signals encompass vibration data for one healthy condition and four fault states (ball fault, inner ring fault, outer ring fault, compound fault). Gear signals include one healthy condition and four fault states (defective fault, tooth breakage fault, root crack fault, surface wear fault). Each data type comprises 8 channel signals: Channel 1 is the motor vibration signal; Channels 2–4 represent the X, Y, and Z-axis vibrations of the planetary gearbox; Channel 5 is the motor output torque; Channels 6–8 are the three-axis vibrations of the parallel gearbox. The data sampling frequency is 12 kHz.
The experimental signals selected for this study are X-direction vibration signals from a parallel gearbox. Detailed information on the dataset partitioning is shown in Table 2. The dataset comprises 10 categories (labels 0–9), with bearing data labeled 0–4 and gear data labeled 5–9. Labels 0 and 5 represent the “normal state,” with 800 training samples and 200 test samples per category. The remaining fault categories (ball fault, inner ring fault, outer ring fault, compound fault, defective fault, tooth breakage fault, root crack fault, and surface wear fault) each have 40 training samples and 200 test samples. To simulate data imbalance, an imbalance factor $\beta = N_{\text{normal}} / N_{\text{fault}}$ is defined. $\beta$ quantifies the severity of imbalance, where $N_{\text{normal}}$ represents the number of normal samples and $N_{\text{fault}}$ denotes the number of samples per fault type. This study employs $\beta$ = 20:1.
In this study, both bearing and gear vibration signals were segmented into independent samples of 1024 points each without overlap. All samples were divided proportionally into training and test sets, ultimately yielding 960 training samples and 1000 test samples for each of the bearing and gear tasks. Only the fault-class samples within the training set were used to train the conditional Wasserstein generative adversarial network. This approach augments the minority fault classes, mitigates data imbalance, and provides diverse training inputs for subsequent diagnostic model construction.
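The segmentation and the sample counts above can be checked with a short sketch. The helper name `segment` and the 5000-point toy record are hypothetical, and dropping the trailing remainder is an assumption; the class counts are the ones stated in the text.

```python
def segment(signal, length=1024):
    """Split a 1-D sequence into consecutive non-overlapping windows.

    Any trailing remainder shorter than `length` is dropped (assumed behavior).
    """
    n = len(signal) // length
    return [signal[i * length:(i + 1) * length] for i in range(n)]


# Toy record standing in for one vibration channel.
record = list(range(5000))
samples = segment(record)

# Counts stated in the text for one task (bearing or gear).
n_normal, n_fault, n_fault_types = 800, 40, 4
beta = n_normal / n_fault                      # imbalance factor
train_total = n_normal + n_fault_types * n_fault
test_total = 200 * (1 + n_fault_types)
```

With one normal class of 800 samples and four fault classes of 40 samples each, the training set holds 960 samples per task and the imbalance factor is 20.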
(1) Analysis of Data Generation Quality Assessment Results
The CWGAN-GP model is trained to generate signals that resemble the fault signals of the Southeast University gearbox, thereby mitigating data imbalance. Figure 7 illustrates the trend of loss values for the generator and discriminator in the CWGAN-GP model across training iterations. This loss function is constructed based on the Wasserstein distance, incorporating a gradient penalty (GP) term to constrain the discriminator gradient. This approach addresses challenges in GAN training, such as gradient vanishing/explosion and mode collapse. The figure reveals significant loss fluctuations during early training, with the losses gradually converging toward zero as iterations progress. This indicates that the generator and discriminator progressively reach equilibrium through adversarial learning, supporting the training stability of CWGAN-GP on complex generation tasks. The CWGAN-GP model completed training after 5000 iterations. Fault signals were then generated using the trained generator to achieve data balance. To assess whether the generated fault data meet quality standards, an evaluation was conducted from two perspectives: statistics and visualization.
As shown in Figure 8, the MMD values between generated and original data are low, indicating that the distribution of the generated data is close to and consistent with that of the original data. MMD thus reflects how well the generative model reproduces the original data distribution, and the low values suggest that the model effectively captures the features of the original data.
The visualization results for real and generated data of bearings and gears are shown in Figure 9. Real and generated samples cluster closely and overlap in the reduced-dimensional feature space, with the point clouds of both datasets primarily intertwined. This indicates that the generated samples successfully capture the distributional characteristics of the real data. The substantial overlapping region between the two sets in the reduced-dimension plot indicates that the generative model captures the intrinsic structure of real samples at the feature level. Such qualitative visual analysis provides intuitive evidence for evaluating the model’s generative performance, validating the consistency of distribution between generated and real data across multidimensional feature spaces.
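For readers unfamiliar with the Wasserstein loss with gradient penalty mentioned above, the following sketch evaluates the standard WGAN-GP critic objective for a linear critic, whose input gradient is known in closed form. This is not the paper's network: the class conditioning and the classification term of the improved model are omitted, and the linear critic, the toy samples, and the function names are assumptions made for illustration.

```python
import math
import random


def critic(x, w, b):
    """Linear critic D(x) = w . x + b (a stand-in for a neural critic)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def critic_loss(real, fake, w, b, lam=10.0, seed=0):
    """Standard WGAN-GP critic loss: E[D(fake)] - E[D(real)] + lam * GP."""
    rng = random.Random(seed)
    d_real = sum(critic(x, w, b) for x in real) / len(real)
    d_fake = sum(critic(x, w, b) for x in fake) / len(fake)
    penalties = []
    for x, x_tilde in zip(real, fake):
        eps = rng.random()
        # Random interpolate between a real and a generated sample.
        x_hat = [eps * a + (1.0 - eps) * c for a, c in zip(x, x_tilde)]
        # For a linear critic the gradient at x_hat equals w for every x_hat;
        # a neural critic would obtain it by automatic differentiation at x_hat.
        grad = w
        grad_norm = math.sqrt(sum(g * g for g in grad))
        penalties.append((grad_norm - 1.0) ** 2)
    gp = sum(penalties) / len(penalties)
    return d_fake - d_real + lam * gp


real_batch = [[1.0, 1.0], [2.0, 2.0]]
fake_batch = [[0.0, 0.0], [1.0, 0.0]]
loss_unit_norm = critic_loss(real_batch, fake_batch, w=[0.6, 0.8], b=0.0)
loss_large_norm = critic_loss(real_batch, fake_batch, w=[3.0, 4.0], b=0.0)
```

When the critic's gradient norm equals 1 the penalty vanishes and the loss is just the difference of mean critic scores; a gradient norm of 5 adds a penalty of 10 × (5 − 1)² = 160, which shows how the term discourages steep critics.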
(2) Results and Analysis
1. Data Imbalanced Grouping Design
To systematically validate the robustness and generalization performance of the proposed method under varying degrees of data imbalance, this study designed five gearbox fault datasets (A–E), as shown in Table 3. Within each dataset, the number of ground-truth vibration samples for each fault type was fixed at 40, while CWGAN-GP generated 0, 40, 120, 360, and 740 samples, respectively, to form five training subsets with total training sizes of 40, 80, 160, 400, and 800 samples per fault type. Additionally, 200 samples from each group’s original data were allocated as the test set, with imbalance ratios decreasing through 20:1, 10:1, 5:1, 2:1, and 1:1. This design enables evaluation of the generated samples’ compensatory effect on expressing minority fault features while comparing diagnostic model classification accuracy under varying training set sizes and balancing conditions. During training, the AdamW optimizer was employed with a batch size of 64 and a learning rate of 0.0001.
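The imbalance ratios quoted above follow from the per-class training sizes. The short check below assumes 800 normal samples per task, as stated for the training set earlier, and assumes that subsets A to E correspond to the listed sizes in order.

```python
# Assumed mapping of subsets A-E to per-fault-class training sizes, in listed order.
n_normal = 800
subset_sizes = {"A": 40, "B": 80, "C": 160, "D": 400, "E": 800}

# Imbalance ratio: normal samples divided by samples per fault class.
ratios = {name: n_normal / size for name, size in subset_sizes.items()}
```

The ratios evaluate to 20, 10, 5, 2, and 1 for A through E, matching the sequence from 20:1 down to 1:1.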
Under sample imbalance, the improvement in classification performance achieved through sample generation is more pronounced. As shown in Figure 10, data augmentation of minority faults using CWGAN-GP significantly enhances the overall accuracy of fault diagnosis. As the number of generated samples increases, the model’s classification performance continues to improve: without synthetic data, the diagnostic accuracy for both bearing and gear tasks remained around 80%. With progressively more synthetic samples, the accuracy increased and stabilized at approximately 97.5% once the sample size reached a certain scale. These results indicate that the synthetic samples generated by CWGAN-GP effectively mitigate the scarcity of minority class samples, enhance the model’s learning capability for critical fault features, and thereby improve diagnostic accuracy and stability under imbalanced conditions.
After the imbalance ratio was reduced from 20:1 to 1:1 through augmentation, the confusion matrices for the bearing and gear tasks are shown in Figure 11, where Figure 11a and Figure 11b correspond to the diagnostic results for bearing and gear faults, respectively. Despite high overall recognition accuracy for both tasks, the model still faces identification challenges for specific fault types. In the bearing task, compound faults and outer ring faults were occasionally misclassified as normal conditions or other single fault types. This indicates that under conditions of multi-source coupling or less pronounced impact features, the distinguishing characteristics of highly complex faults may be affected by background noise or feature overlap. In the gear task, misclassifications primarily occurred between missing tooth faults and broken tooth faults, as well as root crack faults. This indicates that gear damage of varying degrees or forms exhibits certain similarities in time-frequency domain features. The above analysis indicates that while multi-channel feature fusion and synthetic data generation significantly enhance overall diagnostic performance, challenges persist in distinguishing faults with similar mechanisms or adjacent damage levels. This highlights potential areas for model improvement in practical engineering applications.
2. Ablation Experiment
To validate the effectiveness and necessity of the proposed multi-task learning model in bearing and gear fault diagnosis, multiple sets of ablation experiments were designed, with results shown in Table 4. First, single-task models trained exclusively for each component achieved diagnostic accuracies of 90.2% and 91.5%, respectively, demonstrating the model’s basic fault recognition capability under single-task conditions. In contrast, the multi-task learning model—which employs shared feature representations and simultaneously optimizes both bearing and gear tasks—achieved significant improvements in both tasks, with accuracies reaching 97.5%. This demonstrates that sharing underlying features and introducing cross-task information exchange effectively enhances the model’s ability to extract key fault features and improves its generalization performance.
Furthermore, to analyze the roles of different feature extraction channels and the attention fusion module, control models were constructed by removing specific components. After removing the CNN channel, the accuracy rates for bearings and gears decreased to 94.1% and 93.6%, respectively. After removing the BiLSTM channel, the accuracy rates for the two tasks were 95.8% and 95.7%, respectively. When the attention fusion module was disabled, performance similarly declined, with accuracies dropping to 95.1% and 95.8%, respectively. These results demonstrate that each channel and the fusion module play crucial roles in feature extraction and task coordination. The multi-task joint optimization mechanism effectively enhances overall fault identification performance by sharing feature representation spaces, validating the rationality and superiority of the proposed method in multi-component coupled fault diagnosis scenarios.
3. Comparison with Other Methods
To validate the effectiveness and feasibility of the proposed network model, this paper compares CWGAN-GP-MTL with four state-of-the-art models: MT-1DCNN [31], RI-MPCNN [32], MTCASN [24], and MSCNN [33]. The training and testing strategies for each network are identical across all evaluations. As shown in Table 5, the proposed CWGAN-GP-MTL model achieves high classification accuracy in both bearing and gear fault diagnosis tasks, significantly outperforming other comparison models. Compared to MT-1DCNN (which utilizes single-domain features), RI-MPCNN and MSCNN (which employ multi-scale convolutional structures), and MTCASN (which incorporates channel attention), CWGAN-GP-MTL effectively enhances the model’s ability to recognize complex coupled fault features by integrating time-frequency features extracted by CNN with time-domain features extracted by BiLSTM, and by utilizing an attention mechanism to achieve adaptive feature weighting. This demonstrates stronger generalization capabilities and task synergy advantages.

4.3. Case 2: Portal Crane Gearbox Dataset

The gearbox fault dataset used in this study originates from a portal crane gearbox at a port in Shandong Province, as shown in Figure 12. The signals were collected from the hoisting mechanism of the portal crane. During operation, bearing and gear data were captured at a frequency of 5000 Hz. Bearing data comprised four categories: normal condition data, rolling element failure data, inner ring failure data, and outer ring failure data. Gear data also included four categories: normal condition data, gear tooth surface wear, gear tooth cracks, and abnormal gear meshing. The majority of collected data represents normal conditions, with only a small portion indicating fault types, aligning with fault diagnosis under data imbalance scenarios. Detailed information on the training and testing set division is shown in Table 6. Bearing data labels range from 0 to 3, while gear data labels range from 4 to 7.
In the actual operating environment of port gantry cranes, the selection of sensors must comprehensively consider environmental adaptability and engineering feasibility. Temperature signals are susceptible to interference from changes in ambient temperature (such as diurnal temperature variations and the influence of sea breezes), while acoustic emission signals are easily affected by background noise in complex operational environments. Furthermore, torque measurement and multi-sensor fusion solutions typically entail high deployment and maintenance costs. In contrast, vibration signals can directly reflect gear meshing conditions and localized bearing damage characteristics, offering advantages such as fast response times, high sensitivity to early-stage failures, and ease of acquisition. Therefore, this paper selects vibration signals as the primary research focus to better align with practical engineering application requirements.
For the above datasets, both the bearing and gear datasets were segmented into samples of 1024 points each using a non-overlapping approach. The training and test sets were divided according to the format specified in the aforementioned table. Data augmentation was performed using the CWGAN-GP network to enhance diagnostic accuracy for rare fault data.
(1) Analysis of Data Generation Quality Assessment Results
Figure 13 illustrates the trend of loss values for the generator and discriminator in the CWGAN-GP model as training iterations progress. During the initial training phase, both curves exhibit significant fluctuations. Subsequently, they gradually stabilize with iterations and oscillate around the zero range, indicating that the adversarial process has reached a relative equilibrium. After completing 5000 training iterations, the model generates fault signals using the trained generator to achieve data balance.
As shown in Figure 14, the MMD values for bearing samples exhibit a gradual downward trend, indicating that the generator’s fit to the original data distribution improves across the bearing categories. Meanwhile, the MMD for gear samples reached its lowest point in category 2 and rebounded in category 3. This variation may be related to differences in complexity or sample size across categories, suggesting that the generator still has room for improvement in fitting certain gear categories. Overall, lower MMD values indicate that the generated data more closely approximate the global distribution statistics of the real data, demonstrating the model’s ability to capture the distributional features of the original data.
After extracting the same high-dimensional features from both real fault signals and generated signals, t-SNE and PCA projections were performed, as shown in Figure 15. The reduced-dimensional point clouds exhibit high overlap and intertwined distributions in the feature space, indicating that the generated samples effectively match the primary distribution characteristics of the real samples. This demonstrates a high degree of consistency at the feature level.
(2) Results and Analysis
1. Data Imbalanced Grouping Design
This study constructed five training subsets (denoted as A–E) based on the gearbox fault data of gantry cranes, as configured in Table 7. Specifically, the number of genuine vibration samples for each fault category was fixed at 40. CWGAN-GP was then employed to synthesize 0, 40, 120, 360, and 740 additional samples, respectively, yielding five training subsets with sample sizes of 40, 80, 160, 400, and 800 per category. Training employs the AdamW optimizer with a learning rate of 0.0001 and a batch size of 64.
Experimental results indicate that as the sample imbalance among different fault categories in the gearbox gradually diminishes, the classification performance for both bearing and gear diagnostic tasks shows a continuous improvement trend, as illustrated in Figure 16. After augmenting the minority class samples using generative adversarial networks (GANs) to adjust the sample distribution from a highly imbalanced state to a balanced one, the overall diagnostic accuracy for the bearing task and gear task increased to 97.63% and 99.75%, respectively. The corresponding confusion matrix results are shown in Figure 17. Figure 17a indicates that in the bearing task, the model accurately distinguishes between normal conditions, rolling element faults, and outer ring faults. However, inner ring faults are occasionally misclassified as normal in rare instances. This may stem from the weak early-stage characteristics and subtle impact components of inner ring faults, whose vibration responses exhibit similarities to normal operating conditions in time-domain features, thereby complicating discrimination. In contrast, outer ring faults are easier for the model to capture and identify due to their fixed excitation location and distinct periodic impact characteristics. In the gear task, as shown in Figure 17b, high recognition accuracy is achieved for normal conditions, tooth surface wear, and gear tooth cracks, while a small number of gear meshing anomaly samples are misclassified as normal. This primarily stems from the fact that abnormal meshing does not always accompany obvious localized damage in certain operating conditions. Its characteristics are more often reflected in subtle changes in the overall vibration pattern, leading to some overlap with the normal state in the feature space. 
Overall, while the generated samples significantly enhance the model’s diagnostic performance under unbalanced conditions, challenges remain in distinguishing categories with weak feature differences or similar failure mechanisms. This highlights potential areas for improvement in practical engineering applications.
2. Ablation Experiment
The ablation results are shown in Table 8. Attention fusion combined with the multi-task model achieved the best performance, indicating that the collaboration among modules is crucial for diagnostic capability. Compared to single-task training, the baseline improved by 11.25 percentage points for bearings and 5.87 percentage points for gears, demonstrating that multi-task sharing significantly enhances cross-task complementary information and sample category generalization ability. Removing attention fusion resulted in declines of approximately 2.03 and 2.25 percentage points, respectively, indicating that attention plays a key role in feature weighting and noise suppression. Using only CNN or only BiLSTM led to a significant decrease in the ability to distinguish certain fault categories, suggesting that CNN excels at extracting time-frequency features to differentiate gear faults, while BiLSTM is better suited for capturing temporal dynamics to identify bearing faults, with the two complementing each other. Therefore, the multi-channel (CNN+BiLSTM) architecture combined with attention and multi-task learning achieves superior diagnostic performance.
3. The Impact of Classification Loss Weighting on Improving CWGAN-GP Performance
To verify the impact of the classification loss term and its weight coefficient $\lambda_1$ on the performance of the improved CWGAN-GP, experiments were conducted with different values of $\lambda_1$ while keeping the gradient penalty coefficient $\lambda_2$ fixed at 10 (a common setting in WGAN-GP). The results are shown in Table 9. When $\lambda_1 = 0$, the model degenerates into a CWGAN-GP without classification constraints. In this case, the MMD values for the bearing and gear datasets were 0.095 and 0.083, respectively, both relatively high, with diagnostic accuracy rates of 94.6% and 98.8%. This indicates that relying solely on adversarial loss is insufficient to fully constrain the class discrimination characteristics of the generated samples, and discrepancies still exist between the generated distribution and the true distribution. When $\lambda_1 = 0.2$, the MMDs decrease to 0.093 and 0.081, respectively, while the accuracy rates improve to 95.6% and 99.2%. Compared to $\lambda_1 = 0$, both distribution consistency and classification performance have improved, but the extent of improvement is limited, indicating that the classification loss has a weaker effect at this weight. When $\lambda_1$ is increased to 0.5, the MMDs for bearings and gears are 0.090 and 0.080, respectively, both the lowest among the tested settings. The accuracy rates rise to 97.6% and 99.8%, both of which are the highest values, indicating that the classification loss and adversarial loss are best balanced at this point. When $\lambda_1 = 1.0$, the MMDs rise to 0.093 and 0.082, while the accuracy rates drop to 95.4% and 99.4%, indicating that overly strong classification constraints weaken the model’s ability to capture the true distribution, thereby affecting both generation quality and classification performance. In summary, setting the classification loss weight appropriately is crucial; in this experiment, the model achieved the best overall performance when $\lambda_1 = 0.5$.
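The comparison above can be reproduced from the reported numbers. The sketch below tabulates the values quoted in the text and picks the weight with the highest mean accuracy and the lowest mean MMD; averaging over the two tasks is an assumed selection rule, not one stated by the authors.

```python
# Values quoted in the text: (bearing, gear) MMD and accuracy (%) per lambda_1.
results = {
    0.0: {"mmd": (0.095, 0.083), "acc": (94.6, 98.8)},
    0.2: {"mmd": (0.093, 0.081), "acc": (95.6, 99.2)},
    0.5: {"mmd": (0.090, 0.080), "acc": (97.6, 99.8)},
    1.0: {"mmd": (0.093, 0.082), "acc": (95.4, 99.4)},
}

# Assumed criterion: average each metric over the two tasks.
best_by_acc = max(results, key=lambda lam: sum(results[lam]["acc"]) / 2)
best_by_mmd = min(results, key=lambda lam: sum(results[lam]["mmd"]) / 2)
```

Both criteria select the same weight of 0.5, consistent with the conclusion drawn in the text.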
4. The Impact of Feature Fusion Strategies on Fault Diagnosis Performance
To further validate the effectiveness of the introduced attention-based feature fusion mechanism in multi-task fault diagnosis, we conducted ablation experiments comparing different feature fusion strategies while maintaining the CNN–BiLSTM dual-channel architecture and the multi-task learning framework. Specifically, we compared attention-based feature fusion, feature concatenation, and mean-based feature fusion in the bearing and gear diagnosis tasks. The experimental results are shown in Table 10. The attention-based feature fusion method achieved the highest diagnostic accuracy for both bearing and gear tasks, at 97.6% and 99.8%, respectively, outperforming both feature concatenation and mean-based feature fusion. Feature concatenation achieved a relatively high accuracy for the gear task (93.1%), but its performance dropped significantly for the bearing task (84.4%); mean-based feature fusion performed relatively well on the bearing task (91.0%), but its diagnostic accuracy was only slightly higher on the gear task (92.5%). This indicates that simple feature fusion methods struggle to accommodate the differing requirements of multi-channel features across different tasks. In contrast, the attention-based feature fusion mechanism can adaptively assign importance weights to features across channels based on different tasks, highlighting key discriminative information while suppressing redundant features. This enables more effective feature representation in multi-task joint diagnosis, significantly improving diagnostic performance.
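To make the difference between the compared strategies concrete, the sketch below contrasts concatenation, mean fusion, and an input-dependent weighted fusion. The weighted variant is a deliberately simplified stand-in for attention, using a softmax over two scalar relevance scores; it is not the paper's multi-head module, and all names and values are hypothetical.

```python
import math


def concat_fusion(a, b):
    """Concatenation: keeps both vectors, doubling the feature dimension."""
    return list(a) + list(b)


def mean_fusion(a, b):
    """Mean fusion: fixed equal weights for both channels."""
    return [(x + y) / 2.0 for x, y in zip(a, b)]


def weighted_fusion(a, b, score_a, score_b):
    """Simplified attention-like fusion: softmax over two relevance scores."""
    m = max(score_a, score_b)
    ea, eb = math.exp(score_a - m), math.exp(score_b - m)
    wa, wb = ea / (ea + eb), eb / (ea + eb)
    return [wa * x + wb * y for x, y in zip(a, b)], (wa, wb)


# Hypothetical channel outputs for one sample.
cnn_feat = [1.0, 2.0, 3.0, 4.0]
lstm_feat = [3.0, 2.0, 1.0, 0.0]

fused_concat = concat_fusion(cnn_feat, lstm_feat)
fused_mean = mean_fusion(cnn_feat, lstm_feat)
fused_equal, w_equal = weighted_fusion(cnn_feat, lstm_feat, 0.0, 0.0)
fused_cnn, w_cnn = weighted_fusion(cnn_feat, lstm_feat, 4.0, 0.0)
```

With equal scores the weighted fusion reduces to the mean, whereas a higher score for one channel shifts the fused vector toward it; this per-input flexibility is what fixed concatenation or averaging lacks.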
5. Comparison with Other Methods
As shown in Table 11, the results of the comparative experiments indicate that the CWGAN-GP-MTL method proposed in this paper significantly outperforms the comparison methods in terms of bearing, gear, and average metrics. Its average performance is approximately 10 percentage points higher than that of the best-performing MSCNN model, demonstrating a clear advantage. At the same time, an analysis from the perspective of computational efficiency reveals differences in training time among the various models. Among them, MT-1DCNN, MSCNN, and MTCASN have relatively simple structures and thus shorter training times, whereas RI-MPCNN and the method proposed in this paper have relatively longer training times due to their higher structural complexity and the introduction of additional modules. Further comparison reveals that MT-1DCNN performs the worst, indicating that single-domain convolution is insufficient for distinguishing complex coupled faults; RI-MPCNN achieves good results for bearings but performs poorly for gears, while MTCASN performs strongly for gears but weakly for bearings, suggesting that different network architectures prioritize different sub-tasks and struggle to simultaneously address the classification requirements of both fault types. In contrast, CWGAN-GP-MTL achieves high accuracy on both tasks, demonstrating that its multi-channel feature extraction, task-sharing mechanism, and fusion strategy possess stronger representational capabilities in capturing the temporal features of bearings and the time-frequency features of gears. An analysis of overall accuracy and computation time reveals that, although the proposed method incurs some computational overhead during the training phase compared to certain lightweight models, it offers significant advantages in terms of accuracy improvement, demonstrating a favorable performance-efficiency trade-off. 
Furthermore, from an engineering implementation perspective, the two-stage diagnostic framework proposed in this paper is feasible at both the training and deployment levels. CWGAN-GP is used only in the offline stage to augment minority fault samples and remains fixed once its training is complete, so it imposes no additional burden on subsequent model training and deployment. During online diagnosis, only the multi-task fault diagnosis network needs to be deployed for forward inference, so the real-time computational load does not increase significantly. The approach therefore meets the basic requirements of industrial applications for diagnostic efficiency and resource consumption.

5. Conclusions

This paper proposes a fault diagnosis method for gantry crane gearboxes that integrates adversarial generation with multi-task learning. An improved CWGAN-GP augments the minority fault classes, and a multi-task neural network learns time-domain and time–frequency features simultaneously, so that imbalanced fault data are used effectively. MMD quantitative evaluation and t-SNE visualization both indicate statistical consistency between generated and real samples, showing that the generative model captures the characteristics of real fault data. In terms of diagnostic performance, the proposed method achieves an accuracy exceeding 97% on both diagnostic tasks, outperforming traditional single-channel and single-task approaches. Ablation experiments show that this improvement does not stem from a single feature channel or an independent network architecture. Instead, it results jointly from the synergistic modeling of time-domain and time–frequency multi-channel features and from the adaptive weighting of key information by the attention fusion mechanism. This synergy is the core advantage of the method. In summary, this paper provides an effective solution with strong engineering potential for multi-task fault diagnosis of gearboxes under sample imbalance. Future work will integrate cross-condition transfer learning and domain adaptation strategies to enhance the model’s generalization capability and engineering applicability. It will also optimize the model architecture and training workflow to reduce computational overhead and improve deployment efficiency in real-world industrial settings while maintaining diagnostic accuracy.

Author Contributions

Methodology, Y.Y., Z.L. and H.W.; Investigation, Y.Y.; Writing—Original Draft, Z.L. and H.W.; Writing—Review and Editing, Y.Y., Z.L. and H.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data involved in this study are presented in the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Assaad, B.; Eltabach, M.; Antoni, J. Vibration based condition monitoring of a multistage epicyclic gearbox in lifting cranes. Mech. Syst. Signal Process. 2014, 42, 351–367.
  2. Neupane, D.; Bouadjenek, M.R.; Dazeley, R.; Aryal, S. Data-driven machinery fault diagnosis: A comprehensive review. Neurocomputing 2025, 627, 129588.
  3. Luo, X.; Wang, H.; Han, T.; Zhang, Y. FFT-trans: Enhancing robustness in mechanical fault diagnosis with Fourier transform-based transformer under noisy conditions. IEEE Trans. Instrum. Meas. 2024, 73, 1–12.
  4. Yan, R.; Shang, Z.; Xu, H.; Wen, J.; Zhao, Z.; Chen, X.; Gao, R.X. Wavelet transform for rotary machine fault diagnosis: 10 years revisited. Mech. Syst. Signal Process. 2023, 200, 110545.
  5. Li, Y.; Zhou, J.; Li, H.; Meng, G.; Bian, J. A fast and adaptive empirical mode decomposition method and its application in rolling bearing fault diagnosis. IEEE Sens. J. 2022, 23, 567–576.
  6. Tao, H.; Shi, H.; Qiu, J.; Jin, G.; Stojanovic, V. Planetary gearbox fault diagnosis based on FDKNN-DGAT with few labeled data. Meas. Sci. Technol. 2023, 35, 025036.
  7. Meng, L.; Su, Y.; Kong, X.; Xu, T.; Lan, X.; Li, Y. Intelligent fault diagnosis of gearbox based on differential continuous wavelet transform-parallel multi-block fusion residual network. Measurement 2023, 206, 112318.
  8. Wang, Z.; Huang, H.; Wang, Y. Fault diagnosis of planetary gearbox using multi-criteria feature selection and heterogeneous ensemble learning classification. Measurement 2021, 173, 108654.
  9. Wan, A.; Zhang, F.; Khalil, A.B.; Cheng, X.; Ji, X.; Wang, J.; Shan, T. A novel GA-PSO-SVM model for compound fault diagnosis in gearboxes with limited data. IEEE Sens. J. 2025, 25, 30431–30443.
  10. Wei, Y.; Yang, Y.; Xu, M.; Huang, W. Intelligent fault diagnosis of planetary gearbox based on refined composite hierarchical fuzzy entropy and random forest. ISA Trans. 2021, 109, 340–351.
  11. Ravikumar, K.; Yadav, A.; Kumar, H.; Gangadharan, K.; Narasimhadhan, A. Gearbox fault diagnosis based on Multi-Scale deep residual learning and stacked LSTM model. Measurement 2021, 186, 110099.
  12. Zhang, J.; Zhang, Q.; Qin, X.; Sun, Y. Robust fault diagnosis of quayside container crane gearbox based on 2D image representation in frequency domain and CNN. Struct. Health Monit. 2024, 23, 324–342.
  13. Shi, J.; Peng, D.; Peng, Z.; Zhang, Z.; Goebel, K.; Wu, D. Planetary gearbox fault diagnosis using bidirectional-convolutional LSTM networks. Mech. Syst. Signal Process. 2022, 162, 107996.
  14. Wang, X.; Mao, D.; Li, X. Bearing fault diagnosis based on vibro-acoustic data fusion and 1D-CNN network. Measurement 2021, 173, 108518.
  15. Lv, Y.; Liu, Y.; Li, S.; Liu, J.; Wang, T. Enhancing marine shaft generator reliability through intelligent fault diagnosis of gearbox bearings via improved Bidirectional LSTM. Ocean. Eng. 2025, 337, 121860.
  16. Kang, J.; Zhu, X.; Shen, L.; Li, M. Fault diagnosis of a wave energy converter gearbox based on an Adam optimized CNN-LSTM algorithm. Renew. Energy 2024, 231, 121022.
  17. Yuan, B.; Li, Y.; Chen, S. Efficient gearbox fault diagnosis based on improved multi-scale CNN with lightweight convolutional attention. Sensors 2025, 25, 2636.
  18. Guo, Q.; Li, Y.; Song, Y.; Wang, D.; Chen, W. Intelligent fault diagnosis method based on full 1-D convolutional generative adversarial network. IEEE Trans. Ind. Inform. 2019, 16, 2044–2053.
  19. Lyu, P.; Cheng, Y.; Zhang, M.; Yu, W.; Xia, L.; Liu, C. GPSC-GAN: A data enhanced model for intelligent fault diagnosis. IEEE Trans. Instrum. Meas. 2024, 73, 3532116.
  20. Caruana, R. Multitask learning. Mach. Learn. 1997, 28, 41–75.
  21. Zhang, Y.; Yang, Q. An overview of multi-task learning. Natl. Sci. Rev. 2018, 5, 30–43.
  22. Thung, K.H.; Wee, C.Y. A brief review on multi-task learning. Multimed. Tools Appl. 2018, 77, 29705–29725.
  23. Niu, G.; Liu, E.; Wang, X.; Ziehl, P.; Zhang, B. Enhanced discriminate feature learning deep residual CNN for multitask bearing fault diagnosis with information fusion. IEEE Trans. Ind. Inform. 2022, 19, 762–770.
  24. Gao, L.; Huang, J.; Yu, D.; Liu, S. Cross-component fault diagnosis based on lightweight multitasking networks. IEEE Sens. J. 2024, 25, 2231–2243.
  25. Su, Y.; Meng, L.; Kong, X.; Xu, T.; Lan, X.; Li, Y. Small sample fault diagnosis method for wind turbine gearbox based on optimized generative adversarial networks. Eng. Fail. Anal. 2022, 140, 106573.
  26. Liang, P.; Deng, C.; Yuan, X.; Zhang, L. A deep capsule neural network with data augmentation generative adversarial networks for single and simultaneous fault diagnosis of wind turbine gearbox. ISA Trans. 2023, 135, 462–475.
  27. Guo, Z.; Pu, Z.; Du, W.; Wang, H.; Li, C. Improved adversarial learning for fault feature generation of wind turbine gearbox. Renew. Energy 2022, 185, 255–266.
  28. Mirza, M.; Osindero, S. Conditional generative adversarial nets. arXiv 2014, arXiv:1411.1784.
  29. Shao, X.; Ra, I.; Kim, C.S. DSMT-1DCNN: Densely supervised multitask 1DCNN for fault diagnosis. Knowl. Based Syst. 2024, 292, 111609.
  30. Shao, S.; McAleer, S.; Yan, R.; Baldi, P. Highly accurate machine fault diagnosis using deep transfer learning. IEEE Trans. Ind. Inform. 2018, 15, 2446–2455.
  31. Liu, Z.; Wang, H.; Liu, J.; Qin, Y.; Peng, D. Multitask learning based on lightweight 1DCNN for fault diagnosis of wheelset bearings. IEEE Trans. Instrum. Meas. 2020, 70, 1–11.
  32. Guo, S.; Yang, T.; Hua, H.; Cao, J. Coupling fault diagnosis of wind turbine gearbox based on multitask parallel convolutional neural networks with overall information. Renew. Energy 2021, 178, 639–650.
  33. Jiang, G.; He, H.; Yan, J.; Xie, P. Multiscale convolutional neural networks for fault diagnosis of wind turbine gearbox. IEEE Trans. Ind. Electron. 2018, 66, 3196–3207.
Figure 1. Conditional Generation Network Schematic Diagram.
Figure 2. CWGAN-GP-MTL Diagnostic Framework.
Figure 3. CWGAN-GP and CNN-BiLSTM-MTL Fault Diagnosis Process.
Figure 4. CWGAN-GP Architecture Diagram.
Figure 5. Structure of Residual Learning Units.
Figure 6. Southeast University Test Bench.
Figure 7. Generator and discriminator loss on Southeast University Dataset.
Figure 8. Quality Assessment of MMD Metric Based on Southeast University Dataset.
Figure 9. T-SNE Visualization Results for Real and Generated Data.
Figure 10. Accuracy Rates of Southeast University Dataset Under Different Generated Data.
Figure 11. Confusion Matrix for Bearing (a) and Gear (b) Task Results on Southeast University Dataset.
Figure 12. Port Portal Crane Gearbox.
Figure 13. Generator and discriminator loss on Portal Crane Datasets.
Figure 14. Quality Assessment of MMD Metrics Based on Portal Crane Datasets.
Figure 15. Visualization Results of T-SNE for Real and Generated Data.
Figure 16. Accuracy Rates of Port Portal Crane Datasets Under Different Generated Data.
Figure 17. Confusion Matrix for Bearing (a) and Gear (b) Task Results on Portal Crane Datasets.
Table 1. Network Configuration for the CWGAN-GP-MTL Architecture.

| Layer | Type | Kernel Size | Input | Output |
|---|---|---|---|---|
| 1 | Input1 | - | 128 × 128 × 1 | 128 × 128 × 1 |
| 2 | Conv2d | 3 × 3 | 128 × 128 × 1 | 128 × 128 × 32 |
| 3 | MaxPool2d | 2 × 2 | 128 × 128 × 32 | 64 × 64 × 32 |
| 4 | Conv2d | 3 × 3 | 64 × 64 × 32 | 64 × 64 × 64 |
| 5 | MaxPool2d | 2 × 2 | 64 × 64 × 64 | 32 × 32 × 64 |
| 6 | ResidualBlock2D | - | 32 × 32 × 64 | 32 × 32 × 64 |
| 7 | AdaptiveAvgPool2d | - | 32 × 32 × 64 | 32 × 32 × 64 |
| 8 | Flatten | - | 32 × 32 × 64 | 4096 |
| 9 | Linear | - | 4096 | 256 |
| 10 | Dropout | - | 256 | 256 |
| 11 | Linear | - | 256 | 256 |
| 12 | Input2 | - | 1024 × 1 | 1024 × 1 |
| 13 | BiLSTM | - | 1024 × 1 | 1024 × 512 |
| 14 | Dropout | - | 1024 × 512 | 1024 × 512 |
| 15 | Attention Weighting | - | 1024 × 512 | 512 |
| 16 | Linear | - | 512 | 128 |
| 17 | Linear | - | 128 | 128 |
| 18 | CrossModalAttention | - | 128 | 128 |
| 19 | Conv1d | 1 × 1 | 128 × 1 | 256 × 1 |
| 20 | ResidualBlock | - | 256 × 1 | 256 × 1 |

Task-Specific Layers

| Type | Kernel Size | Input | Output |
|---|---|---|---|
| Conv1d | 1 × 1 | 256 × 1 | 256 × 1 |
| Flatten | - | 256 × 1 | 256 |
| Linear | - | 256 | N |
| Conv1d | 1 × 1 | 256 × 1 | 256 × 1 |
| Flatten | - | 256 × 1 | 256 |
| Linear | - | 256 | N |

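The spatial sizes listed for layers 2 to 5 of Table 1 are consistent with the standard output-size formula for convolution and pooling layers. The sketch below assumes stride 1 and padding 1 for the 3 × 3 convolutions and stride 2 without padding for the 2 × 2 max pooling; these settings are not stated in the table and are inferred here from the listed shapes.

```python
def out_size(n, kernel, stride=1, padding=0):
    """Output length along one spatial axis: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - kernel) // stride + 1


size = 128  # side length of the input time-frequency image (layer 1)
trace = [size]
for layer in ("conv3x3", "pool2x2", "conv3x3", "pool2x2"):
    if layer == "conv3x3":
        # Assumed stride 1 and padding 1, which preserves the spatial size.
        size = out_size(size, kernel=3, stride=1, padding=1)
    else:
        # Assumed non-overlapping pooling, which halves the spatial size.
        size = out_size(size, kernel=2, stride=2)
    trace.append(size)

print(trace)  # [128, 128, 64, 64, 32]
```

Under these assumptions the side length goes from 128 to 32 after the second pooling layer, matching the 32 × 32 × 64 input of the residual block.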
Table 2. Detailed Information on Dataset Partitioning at Southeast University.

| Component Name | Type Label | Data Status | Training Set | Test Set |
|---|---|---|---|---|
| Bearing | 0 | Normal state | 800 | 200 |
| Bearing | 1 | Ball fault | 40 | 200 |
| Bearing | 2 | Inner ring fault | 40 | 200 |
| Bearing | 3 | Outer ring fault | 40 | 200 |
| Bearing | 4 | Compound fault | 40 | 200 |
| Gear | 5 | Normal state | 800 | 200 |
| Gear | 6 | Defective fault | 40 | 200 |
| Gear | 7 | Tooth breakage fault | 40 | 200 |
| Gear | 8 | Root crack fault | 40 | 200 |
| Gear | 9 | Tooth surface wear fault | 40 | 200 |

Table 3. Gearbox Failure Unbalance Data Set on Southeast University Dataset.

| Dataset | Real Samples | Generated Samples | Training Set | Test Set | Imbalance Factor |
|---|---|---|---|---|---|
| A | 40 | 0 | 40 | 200 | 20:1 |
| B | 40 | 40 | 80 | 200 | 10:1 |
| C | 40 | 120 | 160 | 200 | 5:1 |
| D | 40 | 360 | 400 | 200 | 2:1 |
| E | 40 | 740 | 800 | 200 | 1:1 |

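The imbalance factors in Table 3 appear to be the ratio of the 800 normal-state training samples (Table 2) to the size of each fault-class training set. The sketch below reproduces them under that assumption; the function and variable names are illustrative.

```python
from fractions import Fraction

NORMAL_TRAIN = 800  # normal-state training samples per component (Table 2)

# Training-set size per fault class for datasets A to E (Table 3).
train_sizes = {"A": 40, "B": 80, "C": 160, "D": 400, "E": 800}


def imbalance_factor(majority, minority):
    """Return the majority:minority ratio reduced to lowest terms."""
    ratio = Fraction(majority, minority)
    return f"{ratio.numerator}:{ratio.denominator}"


factors = {name: imbalance_factor(NORMAL_TRAIN, n) for name, n in train_sizes.items()}
print(factors)
```

Under this reading, dataset A (real samples only) is the most imbalanced at 20:1, and dataset E is fully balanced at 1:1.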
Table 4. Ablation Experiment on Southeast University Dataset.

| Ablation Study | Bearing Accuracy (%) | Gear Accuracy (%) |
|---|---|---|
| Baseline | 97.5 | 97.5 |
| Single-task-Bearing | 90.2 | - |
| Single-task-Gear | - | 91.5 |
| CNN-only | 95.8 | 95.7 |
| BiLSTM-only | 94.1 | 93.6 |
| No-Attention Fusion | 95.1 | 95.8 |

Table 5. Accuracy of Fault Diagnosis Using Different Methods on Southeast University Dataset.

| Method | Bearing | Gear | Average |
|---|---|---|---|
| CWGAN-GP-MTL | 97.5% | 97.5% | 97.5% |
| MT-1DCNN | 78.3% | 57% | 67.65% |
| RI-MPCNN | 90.1% | 83.1% | 86.6% |
| MTCASN | 72.1% | 72.1% | 72.1% |
| MSCNN | 88.1% | 80.5% | 84.3% |

Table 6. Detailed Information on the Partitioning of Portal Crane Datasets.

| Component Name | Type Label | Data Status | Training Set | Test Set |
|---|---|---|---|---|
| Bearing | 0 | Normal state | 800 | 200 |
| Bearing | 1 | Rolling element failure | 40 | 200 |
| Bearing | 2 | Inner ring failure | 40 | 200 |
| Bearing | 3 | Outer ring failure | 40 | 200 |
| Gear | 4 | Normal state | 800 | 200 |
| Gear | 5 | Gear tooth surface wear | 40 | 200 |
| Gear | 6 | Gear tooth crack | 40 | 200 |
| Gear | 7 | Abnormal gear meshing | 40 | 200 |

Table 7. Gearbox Failure Unbalance Data Set on Portal Crane Dataset.

| Dataset | Real Samples | Generated Samples | Training Set | Test Set | Imbalance Factor |
|---|---|---|---|---|---|
| A | 40 | 0 | 40 | 200 | 20:1 |
| B | 40 | 40 | 80 | 200 | 10:1 |
| C | 40 | 120 | 160 | 200 | 5:1 |
| D | 40 | 360 | 400 | 200 | 2:1 |
| E | 40 | 740 | 800 | 200 | 1:1 |

Table 8. Ablation Experiment on Portal Crane Dataset.

| Ablation Study | Bearing Accuracy (%) | Gear Accuracy (%) |
|---|---|---|
| Baseline | 97.63 | 99.75 |
| Single-task-Bearing | 86.38 | - |
| Single-task-Gear | - | 93.88 |
| CNN-only | 88.50 | 98.50 |
| BiLSTM-only | 93.75 | 96.25 |
| No-Attention Fusion | 95.60 | 97.50 |

Table 9. Performance Comparison of Improved CWGAN-GP Under Different Classification Loss Weighting.

| λ1 | λ2 | MMD (Bearing) | MMD (Gear) | Accuracy (Bearing) (%) | Accuracy (Gear) (%) |
|---|---|---|---|---|---|
| 0 | 10 | 0.095 | 0.083 | 94.6 | 98.8 |
| 0.2 | 10 | 0.093 | 0.081 | 95.6 | 99.2 |
| 0.5 | 10 | 0.09 | 0.08 | 97.6 | 99.8 |
| 1 | 10 | 0.093 | 0.082 | 95.4 | 99.4 |

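The MMD columns in Table 9 measure how closely the generated samples match the real ones, with lower values indicating closer distributions. As a minimal sketch of how such a score can be computed, the code below implements the biased estimator of the squared MMD with a Gaussian kernel. The kernel choice, the bandwidth `sigma`, and the toy two-dimensional samples are assumptions for illustration; the settings used in this paper are not given in this section.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """RBF kernel k(x, y) = exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mean_kernel(xs, ys, sigma):
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(gaussian_kernel(x, y, sigma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(real, fake, sigma=1.0):
    """Biased estimate of squared MMD: E[k(r,r')] + E[k(f,f')] - 2 E[k(r,f)]."""
    return (mean_kernel(real, real, sigma)
            + mean_kernel(fake, fake, sigma)
            - 2.0 * mean_kernel(real, fake, sigma))


# Toy feature vectors: one generated set close to the real data, one far away.
real = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2]]
close_fake = [[0.05, 0.1], [0.15, 0.05], [0.1, 0.15]]
far_fake = [[2.0, 2.1], [2.2, 2.0], [2.1, 2.2]]

mmd_close = mmd_squared(real, close_fake)
mmd_far = mmd_squared(real, far_fake)
```

In this toy example the generated set that overlaps the real samples yields a much smaller score than the distant one, which is the behavior the table relies on when comparing loss weightings.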
Table 10. Comparison of Ablation Experiment Results for Different Feature Fusion Strategies.

| Feature Fusion Method | Bearing Accuracy (%) | Gear Accuracy (%) |
|---|---|---|
| Attention-based Feature Fusion | 97.6 | 99.8 |
| Feature Concatenation Fusion | 84.4 | 93.1 |
| Mean-based Feature Fusion | 91.0 | 92.5 |

Table 11. Accuracy of Fault Diagnosis Using Different Methods on Portal Crane Dataset.

| Method | Bearing | Gear | Average | Time |
|---|---|---|---|---|
| CWGAN-GP-MTL | 97.63% | 99.75% | 98.69% | 476.77 s |
| MT-1DCNN | 78.25% | 81.50% | 79.88% | 88 s |
| RI-MPCNN | 90.88% | 81.63% | 86.26% | 472.03 s |
| MTCASN | 80.50% | 90.00% | 85.25% | 71.17 s |
| MSCNN | 86.25% | 91.00% | 88.63% | 86.62 s |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Yang, Y.; Liao, Z.; Wang, H. Fault Diagnosis of Portal Crane Gearboxes Based on Improved CWGAN-GP and Multi-Task Learning. Actuators 2026, 15, 223. https://doi.org/10.3390/act15040223
