Fault Diagnosis of Portal Crane Gearboxes Based on Improved CWGAN-GP and Multi-Task Learning

Yang, Yongsheng; Liao, Zuohuang; Wang, Heng

doi:10.3390/act15040223



by Yongsheng Yang *, Zuohuang Liao and Heng Wang

Institute of Logistics Science and Engineering, Shanghai Maritime University, Shanghai 201306, China

* Author to whom correspondence should be addressed.

Actuators 2026, 15(4), 223; https://doi.org/10.3390/act15040223

Submission received: 9 March 2026 / Revised: 12 April 2026 / Accepted: 14 April 2026 / Published: 16 April 2026

(This article belongs to the Special Issue Fault Diagnosis and Prognosis in Actuators)


Abstract

With increasing port automation and operational intensity, the gearboxes of gantry cranes widely used in bulk cargo terminals are prone to bearing and gear failures under prolonged heavy loads, intense vibrations, and complex operating conditions. Since fault samples often exhibit imbalanced distributions, this imposes two higher requirements on diagnostic methods—first, the ability to effectively address sample imbalance and, second, the capability to simultaneously identify multiple fault categories. To address these challenges, this paper proposes a joint diagnostic method integrating an improved Conditional Wasserstein Generative Adversarial Network with Gradient Penalty (CWGAN-GP) and Multi-Task Learning (MTL). First, the modified CWGAN-GP performs conditional augmentation for minority fault classes, evaluating synthetic sample authenticity and diversity through multiple metrics. Subsequently, a multi-channel diagnostic network is constructed, in which vibration signals are fed into two parallel sub-networks: time–frequency features are extracted from the Short-Time Fourier Transform (STFT)-based time–frequency representations via a residual-block Convolutional Neural Network (CNN), while temporal features are captured from the raw time-domain signal using a Bidirectional Long Short-Term Memory (Bi-LSTM) with an attention mechanism. An attention fusion layer then integrates these two feature types, enabling joint classification of bearings and gears within a multi-task learning framework. Experimental validation on public gearbox datasets and port gantry crane gearbox datasets demonstrates that this method achieves an average diagnostic accuracy exceeding 97%. The proposed method reduces the impact of class imbalance, thereby improving the accuracy and stability of multi-task fault identification.

Keywords: gantry crane; sample imbalance; multitask learning; Generative Adversarial Networks

1. Introduction

Gantry cranes serve as critical large-scale equipment in port handling, terminal operations, and industrial production, where their reliable operation is vital to productivity and safety. As the core component in the power transmission chain of gantry cranes, the gearbox converts the high-speed rotational input from the motor into a lower-speed output with greater torque, thereby enabling functions such as lifting and slewing. Cracks or wear in internal gears, or bearing damage within the gearbox, can lead to equipment shutdowns, production delays, and even personnel injury or property damage [1]. However, in actual operation, most collected vibration signals correspond to normal operating conditions, while fault samples are scarce and their class distribution is severely imbalanced. This makes it difficult for models to fully learn fault characteristics and increases the likelihood of missed detections. In addition, practical engineering applications often require the simultaneous diagnosis of both gears and bearings. This calls for a unified diagnostic model that can mitigate sample imbalance while sharing useful representations and reconciling task conflicts within a multi-task framework. Consequently, a fault diagnosis method that integrates minority-class sample augmentation with multi-objective joint identification is of both theoretical and practical importance.

Methods for mechanical fault diagnosis can be broadly categorized into two main types: traditional data-driven methods and novel data-driven methods [2]. Traditional data-driven methods primarily include signal analysis-based methods and traditional machine learning-based methods. Signal-based diagnostics are among the most widely applied methods for gearbox fault detection; they analyze machine operating conditions through signal characteristics such as vibration, sound, temperature, and current. Widely adopted signal processing techniques include Fast Fourier Transform (FFT) [3], Wavelet Transform (WT) [4], and Empirical Mode Decomposition (EMD) [5]. FFT effectively separates frequency components, enabling the identification of specific frequency signals caused by bearings or gears and the detection of fault characteristics such as early wear or cracks. Tao H et al. [6] employed FFT to process raw vibration signals from gearboxes, treating them as graph nodes, and proposed a k-nearest neighbor (KNN) graph construction method utilizing pooling for fuzzy distance calculation. However, FFT struggles with non-stationary signals under complex operating conditions. In contrast, wavelet transform and empirical mode decomposition provide joint time and frequency information, making them more suitable for handling nonlinear, non-stationary signals. Meng L et al. [7] addressed the challenges of feature extraction and low pattern recognition accuracy in gearbox fault diagnosis by applying first-order differentiation followed by continuous wavelet transform to signals, effectively enhancing the resolution of time-frequency feature images. On the other hand, traditional machine learning methods [8] (such as Support Vector Machines (SVM) [9] and Random Forests (RF) [10]), which combine manually designed features with classifiers, have achieved some success in classifying gearbox and bearing failures.
Overall, these methods offer advantages such as simplicity of implementation and high interpretability; however, under complex operating conditions, they still suffer from drawbacks including strong feature dependency, limited noise resistance, and a heavy reliance on human expertise.

With the rapid advancement of sensor technology and computing power, new data-driven fault diagnosis methods—particularly those based on deep learning—have begun to attract widespread attention. Compared to traditional methods, deep learning methods [11]—such as Convolutional Neural Networks (CNNs) [12] and Long Short-Term Memory (LSTM) networks [13]—enable end-to-end feature learning with superior expressive power and noise robustness. They have demonstrated superiority in fault diagnosis for equipment like bearings and gearboxes. Wang X et al. [14] integrated raw vibration and acoustic signals from bearings using a 1D-CNN network. Lv Y et al. [15] proposed a 2D BILSTM network to deeply extract and identify 2D time-frequency features. Kang J et al. [16] integrated long short-term memory with convolutional neural networks to effectively extract local signal features, enhancing their time-series analysis capabilities. In addition, Yuan et al. [17] proposed a gearbox fault diagnosis method based on empirical mode decomposition (EMD), multi-scale convolutional neural networks (MSCNN), and a lightweight convolutional attention mechanism. Through multi-scale feature extraction and attention modeling, this method effectively improved the model’s diagnostic performance under complex operating conditions. However, this method primarily addresses the diagnostic task for a single type of gear failure and does not address multi-component coupled failure scenarios. Furthermore, it does not account for the sample imbalance issue that is prevalent in real-world engineering applications. Although deep learning methods have made significant progress in feature extraction and single-task classification accuracy, their reliance on large amounts of labeled data, performance degradation under low-data-availability conditions, and lack of task-cooperative modeling capabilities in multi-component coupled fault diagnosis scenarios continue to limit their engineering applications. 
To mitigate these issues and enhance diagnostic accuracy and generalization capability, this paper introduces a multi-task learning (MTL) framework. By sharing representations and enabling collaborative learning across tasks, it improves learning efficiency under small sample conditions and promotes the coordinated fusion of multi-source information.

To address the challenges of small sample sizes and multi-component diagnostics in gearboxes, data augmentation and multi-task learning have emerged as critical research directions. On one hand, Generative Adversarial Networks (GANs) and their variants leverage adversarial training to learn data distributions from limited real samples, generating high-quality, diverse synthetic data to mitigate overfitting and performance degradation caused by data imbalance. Guo Q et al. [18] proposed the Multi-Label 1D Generative Adversarial Network (ML1-D-GAN) diagnostic framework to address low accuracy caused by insufficient fault data. Lyu P et al. [19] introduced a novel GAN-based data augmentation model, the Gradient Penalty Separation Classifier (GPSC). Compared to traditional GANs, this model can more efficiently generate synthetic samples fused with fault samples. On the other hand, multi-task learning (MTL) [20,21,22] achieves feature complementarity and regularization across tasks by sharing underlying features within a single network while configuring dedicated output branches for different tasks, thereby enhancing model generalization and robustness. Niu G et al. [23] proposed a deep residual convolutional neural network with enhanced discriminative feature learning and information fusion capabilities for multi-task bearing diagnosis. Gao L et al. [24] proposed a lightweight, multi-task convolutional explainable shared network (MTCASN) framework for cross-device fault diagnosis to handle failure data from different equipment components. While existing research has explored GANs for bearing or gear fault data augmentation and applied MTL to mechanical fault diagnosis, systematic approaches that integrate both methodologies for multi-component, multi-task diagnosis in portal crane gearboxes remain scarce.

Furthermore, addressing the sample imbalance issue in gearbox fault diagnosis, some studies have explored generative data augmentation approaches. Su Y et al. [25] proposed a method integrating an improved GAN with a dual-stream convolutional network for fault diagnosis of wind turbine gearboxes under small-sample conditions. By incorporating gradient boosting, KNN decision boundaries, and Mahalanobis distance constraints, they enhanced the generation quality and discriminative capability of minority fault samples. Liang P et al. [26] combined the Stockwell transform, data-augmented GANs, and capsule networks to achieve effective identification of both single and composite faults in wind turbine gearboxes, demonstrating robust diagnostic performance under conditions of small sample sizes and class imbalance. Guo Z et al. [27] addressed the uneven distribution of gearbox fault samples by proposing a fault feature generation method based on wavelet packet features and an improved WGAN-GP, validating the effectiveness of generative models in mitigating sample imbalance. Although the aforementioned methods have made some progress in small-sample fault diagnosis for gearboxes, most studies remain focused on single-component or single-task scenarios. There is insufficient consideration for joint modeling and task collaborative learning of coupled faults involving multiple components such as gears and bearings. Furthermore, the collaborative optimization mechanism between generative models and downstream diagnostic models requires further investigation.

To deal with the dual challenges of sample imbalance and multi-task diagnosis in the fault diagnosis of portal crane gearboxes, this paper proposes a joint solution combining an “Improved Conditional Wasserstein GAN (CWGAN-GP)” and a “Multi-channel Multi-task Diagnostic Network.” The primary contributions of this study can be outlined as follows:

(1) A CWGAN-GP optimization strategy combining adversarial loss with auxiliary classification loss was developed. Under the stable generation mechanism ensured by Wasserstein distance and gradient penalties, this approach enables the generator to simultaneously prioritize distribution fidelity and category discriminability during minority class sample generation. This approach effectively mitigates data skew caused by class imbalance in transmission failure datasets at the generative mechanism level, providing high-quality synthetic samples with enhanced discriminative power for subsequent diagnostic models.

(2) Based on data generation and balancing, a multi-task learning framework for integrated gearbox diagnostics has been established. This framework comprises a BiLSTM temporal encoding branch and a CNN spatiotemporal encoding branch, incorporating a cross-modal multi-head attention mechanism for feature fusion. It acquires discriminative shared representations by adaptively modeling the correlation between temporal and spatiotemporal features. Building upon this foundation, a task-sharing layer and task-specific prediction heads are employed to achieve joint learning for component identification and fault discrimination, effectively mitigating representation drift and negative transfer issues in multi-task settings.

(3) Systematic comparison and ablation experiments were conducted on vibration datasets from portal crane gearboxes, verifying that the proposed generative augmentation and multi-task fusion framework maintains leading diagnostic accuracy even under conditions of significant class imbalance. This further demonstrates the robustness and broad application potential of this method in real industrial scenarios.

2. Theoretical Basis

2.1. Improving the CWGAN-GP Network

Building upon the framework of the traditional Conditional Wasserstein GAN with Gradient Penalty (CWGAN-GP), where GP denotes gradient penalty used to stabilize generative adversarial training, this paper improves the loss function structure of CWGAN-GP to address the issue of insufficient discriminability in generated samples under conditions of class imbalance in gearbox failure data. Specifically, by introducing an auxiliary classification branch into the discriminator and jointly incorporating a classification loss term into the optimization objectives of both the generator and the discriminator, we construct an improved CWGAN-GP with category discrimination constraints to generate more distinguishable minority-class fault samples. Unlike the traditional CWGAN-GP, which constrains the distribution of generated samples solely through adversarial loss, our method jointly optimizes adversarial and classification losses. This approach explicitly enhances the category discriminability of the samples while maintaining the consistency of the generated sample distribution.

Conditional Generative Adversarial Networks (CGANs) introduce additional conditional information y based on GANs. By concatenating labels or other control variables with noise as input to the generator, and simultaneously using them as supplementary input to the discriminator, CGANs achieve targeted generation of data for specific categories [28]. The schematic of the conditional GAN is shown in Figure 1. Specifically, the generator receives the concatenated input [z, y] (noise vector z and conditional label y) and generates a sample x = G(z, y). The discriminator receives the sample and its corresponding label (x, y) as input and outputs the conditional probability that x is a real sample. Its objective function is formulated as:

$$\min_{G} \max_{D} V(D, G) = E_{x \sim p_{data}(x)} [\log D(x \mid y)] + E_{z \sim p_{z}(z)} [\log (1 - D(G(z \mid y)))] \tag{1}$$

In this equation, G denotes the generator network, which takes the latent noise vector $z \sim p_{z}(z)$ and the conditional label y as inputs to generate synthetic samples via the mapping function $G(z \mid y)$. D represents the discriminator network, which receives samples and their corresponding labels $(x, y)$ as inputs and outputs the conditional authenticity probability $D(x \mid y) \in (0, 1)$. Here, $D(x \mid y)$ measures the confidence that the sample is genuine given label y, while $1 - D(G(z \mid y))$ corresponds to the probability that the discriminator judges the generated sample $G(z \mid y)$ as fake. $E_{x \sim p_{data}(x)}[\cdot]$ and $E_{z \sim p_{z}(z)}[\cdot]$ denote expectation operations under the true data distribution $p_{data}(x)$ and noise distribution $p_{z}(z)$, respectively. The entire objective forms a minimax adversarial game: the discriminator D maximizes this log-likelihood sum to enhance conditional authenticity detection, while the generator G minimizes this function to learn generating samples consistent with the true data distribution under given condition y.
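As a small numerical illustration of Equation (1), the sketch below estimates V(D, G) from a batch of discriminator outputs. The function name `cgan_value` and the example probabilities are hypothetical and chosen only for illustration; they do not come from the paper.

```python
import math

def cgan_value(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) in Equation (1).

    d_real: values of D(x | y) on real samples, each in (0, 1).
    d_fake: values of D(G(z | y)) on generated samples, each in (0, 1).
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that cannot separate real from fake outputs 0.5 everywhere.
undecided = cgan_value([0.5, 0.5], [0.5, 0.5])
# A sharper discriminator scores real samples high and generated samples low.
confident = cgan_value([0.9, 0.8], [0.1, 0.2])
```

The second value is larger than the first, which matches the roles in the game: the discriminator pushes V up, and the generator pushes it back down by making its samples harder to reject.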

CGAN retains the adversarial training framework of the original GAN, incorporating conditional variables into the inputs of both the generator and discriminator. This enables the generated samples to reflect the information of the given category labels. In fault diagnosis, the conditional information can represent fault types or operational conditions. Through CGAN, synthetic data with specific fault characteristics can be generated to augment the dataset.

Traditional GANs often suffer from vanishing gradients due to the saturation of the objective function (e.g., JS divergence) when distributions differ significantly, resulting in training instability or mode collapse. To address this, we introduce the Wasserstein GAN framework, employing Earth-Mover distance (Wasserstein distance) as a metric for distribution divergence. We further adopt gradient penalties (WGAN-GP) to explicitly constrain the discriminator’s Lipschitz constant, significantly improving training stability and convergence behavior. Extending this framework to its conditional form (CWGAN-GP), we establish a foundation for both class-specific sample generation and more stable training.

Most generative adversarial models share a common training problem: mode collapse, in which the generator maps many different inputs to a small set of very similar outputs, so the generated samples lose diversity. Based on the improvement principles described above, this paper designs the generator and discriminator loss functions for the enhanced CWGAN-GP, as shown in Equations (2) and (3).

$$L(G) = E_{z \sim P_{z}} [D(G(z))] + \lambda_{1} E_{z \sim P_{z}} [\log P(y = c \mid G(z))] \tag{2}$$

$$
\begin{aligned}
L(D) = {} & E_{x \sim P_{r}} [D(x)] - E_{z \sim P_{z}} [D(G(z))] \\
& + \lambda_{2} E_{\hat{x} \sim P_{\hat{x}}} \left[ \left( \lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_{2} - 1 \right)^{2} \right] \\
& + \lambda_{1} E_{x \sim P_{r}} [\log P(y = c \mid x)]
\end{aligned} \tag{3}
$$

Equations (2) and (3) represent the loss functions for the generator and discriminator, respectively. $\lambda_{1}$ denotes the weight coefficient associated with classification loss, while $\lambda_{2}$ is the gradient penalty coefficient. Referencing existing WGAN-GP research, this paper sets $\lambda_{2}$ to 10. The value of $\lambda_{1}$ is analyzed and validated through ablation experiments, with relevant results presented in Section 4.2. Based on a comprehensive evaluation of sample quality and downstream diagnostic performance, $\lambda_{1} = 0.5$ and $\lambda_{2} = 10$ are ultimately selected as the default parameter configuration.
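To make the roles of the three loss terms concrete, the following sketch evaluates a critic loss and a generator loss on toy numbers. It is written under stated assumptions rather than as the paper's implementation: it uses the sign convention common in WGAN-GP code (the critic scores real samples high, and the classification term is a cross-entropy, that is, a negative log-likelihood), which may differ in sign from the way Equations (2) and (3) are typeset. The critic gradients at the interpolated points are passed in as plain lists, whereas a real implementation would obtain them by automatic differentiation. All function names are hypothetical.

```python
import math

def mean(values):
    return sum(values) / len(values)

def gradient_penalty(grads):
    """E[(||grad D(x_hat)||_2 - 1)^2] over critic gradients at interpolated points."""
    return mean([(math.sqrt(sum(g * g for g in grad)) - 1.0) ** 2 for grad in grads])

def cross_entropy(p_true_class):
    """Negative mean log-probability that the auxiliary head assigns to the correct class."""
    return -mean([math.log(p) for p in p_true_class])

def critic_loss(d_real, d_fake, grads, p_true_real, lam1=0.5, lam2=10.0):
    # Assumed convention: minimizing this raises D on real data and lowers it on fakes.
    wasserstein = mean(d_fake) - mean(d_real)
    return wasserstein + lam2 * gradient_penalty(grads) + lam1 * cross_entropy(p_true_real)

def generator_loss(d_fake, p_true_fake, lam1=0.5):
    # The generator wants high critic scores and correctly classifiable samples.
    return -mean(d_fake) + lam1 * cross_entropy(p_true_fake)

# Toy batch: unit-norm gradients give zero penalty; real samples are classified perfectly.
loss_d = critic_loss([1.0, 3.0], [0.0, 2.0], [[1.0, 0.0], [0.0, 1.0]], [1.0, 1.0])
loss_g = generator_loss([0.0, 2.0], [0.5, 0.5])
```

With λ1 = 0.5 and λ2 = 10 as defaults, the penalty term dominates whenever the gradient norm drifts away from 1, while the classification term acts as a milder pull toward class-separable samples.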

In the standard WGAN-GP framework, the generator approximates the overall distribution of real data solely through adversarial loss. Under class imbalance, it tends to favor high-frequency classes, neglecting minority-class patterns and triggering mode collapse. This paper introduces an auxiliary classification loss into CWGAN-GP. This mechanism constrains the generator with explicit class discrimination while optimizing distribution similarity, thereby encouraging distinct pattern distributions for different categories in the feature space. This joint optimization suppresses the generator’s tendency to collapse onto a single pattern or a small number of patterns, enhancing the diversity and coverage of generated samples across categories.

2.2. Multitask Learning Networks

Multi-task learning is a machine learning paradigm that simultaneously handles multiple related tasks within a unified model. By sharing underlying representations, it uncovers task-to-task correlations to enhance learning efficiency across all tasks [29]. The advantage of MTL lies in its shared layer parameters, which capture common features across different tasks while reducing overfitting risks. Meanwhile, dedicated layers for each task learn high-level representations tailored to their specific objectives. In the field of fault diagnosis, MTL can be applied to simultaneously diagnose multiple fault modes or severity levels of equipment. The tasks considered in this paper include two subtasks: “bearing fault diagnosis” and “gear fault diagnosis.” By constructing a multi-task deep network architecture featuring shared convolutional layers and task-specific output branches, the model can extract more comprehensive and effective feature information from limited samples, enabling collaborative identification of faults across multiple components. Common MTL strategies include hard parameter sharing (sharing partial layer parameters) and soft parameter sharing (fusing task-specific features through regularization or specialized modules). This paper adopts the hard parameter sharing approach, where common features are extracted in the convolutional layer before separately outputting classification results for each task.
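Hard parameter sharing can be sketched as one shared transformation followed by two task-specific heads. The example below is a minimal illustration with dense layers in place of the convolutional layers; the layer sizes, weights, and function names are hypothetical and not taken from the paper.

```python
import math

def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x, shared, bearing_head, gear_head):
    """Shared features are computed once and reused by both task heads."""
    h = relu(linear(x, *shared))
    return softmax(linear(h, *bearing_head)), softmax(linear(h, *gear_head))

# Hypothetical toy parameters: (weights, bias) per layer.
shared = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
bearing_head = ([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.0])  # 3 bearing classes
gear_head = ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0])                    # 2 gear classes

p_bearing, p_gear = forward([2.0, 1.0], shared, bearing_head, gear_head)
```

Because both heads read the same vector `h`, gradients from either task update the shared parameters, which is the mechanism behind the regularizing effect described above.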

3. Gearbox Fault Diagnosis

This paper proposes a multi-task fault diagnosis framework based on CWGAN-GP adversarial generation and multi-domain information fusion, as shown in Figure 2. The overall process comprises three stages: signal segmentation and preprocessing, data generation and augmentation, and diagnostic model training and classification. First, the original vibration signal is split into a training set and a test set according to the sampling conditions, followed by denoising and normalization. Subsequently, the CWGAN-GP generator synthesizes gear and bearing fault samples using random noise and fault labels as inputs to enhance dataset diversity, while the discriminator is trained adversarially against real samples to improve the quality of the generated samples. Finally, multi-domain feature representations are constructed for both real and synthetic samples. On one hand, the short-time Fourier transform is applied to the one-dimensional vibration signals, mapping them into two-dimensional time-frequency spectra that are fed into the 2D-CNN channel to extract local time-frequency features. On the other hand, the raw time-domain signals are fed in parallel into the Bi-LSTM channel to model the temporal dependencies of the vibration signals. A feature fusion module based on a cross-modal multi-head attention mechanism then adaptively models and fuses the key time-frequency features extracted by the CNN channel with the temporal features output by the Bi-LSTM channel. After further refinement through a shared representation layer, the fused features are fed into the fully connected layers and Softmax classifiers of the bearing and gear fault diagnosis task branches, respectively. Joint training with a weighted multi-task loss allows both fault types to be identified simultaneously with high precision.

The model primarily consists of data augmentation and fault diagnosis phases, with the diagnostic process illustrated in Figure 3.

A. Data Augmentation Phase

(1) Initial Data Collection. Gather imbalanced multi-class fault signal samples (e.g., gears, bearings) to form the original fault data training set.

(2) CWGAN-GP Model Training. Train CWGAN-GP on the original fault data to learn the distributional characteristics of various fault signals.

(3) Synthetic Fault Data Generation. Utilize the trained CWGAN-GP generator to synthesize new fault signal samples based on fault category labels.

(4) Fault Data Augmentation and Balancing. Merge synthetic samples with original samples to construct a training dataset with balanced category distribution, while the test set consists solely of original fault samples to evaluate the model’s diagnostic performance on real data.
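Step (4) can be illustrated by computing how many synthetic samples each class needs. The sketch below assumes that "balanced" means raising every class to the size of the largest class, which the text does not state explicitly; the label names and counts are hypothetical.

```python
from collections import Counter

def synthetic_quota(train_labels):
    """Synthetic samples needed per class so that every class matches the largest one."""
    counts = Counter(train_labels)
    target = max(counts.values())
    return {label: target - n for label, n in counts.items()}

# Hypothetical imbalanced training labels. The test set is not touched:
# it keeps only original samples, as described in step (4).
train_labels = ["normal"] * 100 + ["gear_crack"] * 12 + ["bearing_outer"] * 8
quota = synthetic_quota(train_labels)
```

The resulting quota would then be passed, class by class, to the trained generator as the conditional label.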

B. Fault Diagnosis Phase

(1) Model Construction and Parameter Tuning. A multi-task fusion model (CNN-BiLSTM-MTL) was developed to perform dual-channel feature extraction targeting both the time-frequency characteristics and time-domain sequence features of vibration signals. Key hyperparameters—such as convolution kernel size, Bi-LSTM hidden layer dimension, and task loss weighting—were optimized through a combination of empirical tuning and parameter search.

(2) Model Training and Validation. Perform multi-task joint training on the CNN-BiLSTM-MTL model using the augmented balanced training dataset. Validate model performance on an independent test set, evaluating classification accuracy metrics.

3.1. Data Augmentation

The network architecture is shown in Figure 4. The generator receives random noise z and a class index as input. It first maps the label to an embedding vector of the same dimension as z, then multiplies this embedding element-wise with z. The result is upsampled to an output dimension of 1024 through three fully connected layers. The final output activation uses tanh to match the data’s normalized range of [−1,1]. The discriminator multiplies the input signal element-wise with the category vector obtained via embedding, then feeds it into three fully connected layers. It ultimately produces two output branches: one outputs a real-valued score for WGAN adversarial training, while the other outputs category logits for auxiliary classification. Training employs the WGAN-GP adversarial objective with a gradient penalty coefficient of λ_{2} = 10, and the discriminator is updated n_critic = 5 times per generator update using the Adam optimizer. Label smoothing of 0.1 is applied for auxiliary classification. The trained CWGAN-GP generates diverse samples consistent with the distribution of real fault signals, enabling subsequent training of multi-task diagnostic networks while mitigating data scarcity and class imbalance issues.
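The conditioning step of the generator, an element-wise product between the noise vector and a label embedding, can be sketched as follows. The latent size, number of classes, and random embedding table are hypothetical, and the three fully connected layers are replaced by a simple scaling so that the example stays self-contained.

```python
import math
import random

def condition_noise(z, label, embedding_table):
    """Element-wise product of the noise vector and the embedding of the class index."""
    embedding = embedding_table[label]
    return [zi * ei for zi, ei in zip(z, embedding)]

def squash(values):
    """tanh output activation, keeping every output inside [-1, 1]."""
    return [math.tanh(v) for v in values]

rng = random.Random(0)
latent_dim, n_classes = 4, 3  # hypothetical sizes
embedding_table = [[rng.gauss(0.0, 1.0) for _ in range(latent_dim)] for _ in range(n_classes)]
z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]

h0 = condition_noise(z, 1, embedding_table)  # input to the first fully connected layer
out = squash([10.0 * v for v in h0])         # stand-in for the FC stack, followed by tanh
```

Multiplying rather than concatenating keeps the input dimension equal to that of z, so the same noise vector yields a different generator input for each class label.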

3.2. Residual Learning Unit

To address feature degradation and insufficient use of deep-layer information during multi-layer convolutional feature extraction from gearbox vibration signals, this paper introduces residual learning units into the time-frequency feature extraction channel. This enhances the deep convolutional network’s ability to model complex fault features. The residual units are not a standalone add-on but a critical component of the dual-channel feature extraction and multi-task learning framework proposed herein: they ensure that deep time-frequency features are represented effectively before cross-modal fusion.

As shown in Figure 5, the residual unit comprises two layers of 2D convolutions, batch normalization, and ReLU activation functions. Input and output features are directly summed via an identity mapping, preserving gradient stability during backpropagation while preventing shallow discriminative information from weakening as the network deepens. This architecture enhances feature reuse capability without significantly increasing computational complexity. It provides more robust high-level feature representations for subsequent cross-modal attention fusion and multi-task shared layers, thereby improving the model’s diagnostic accuracy and robustness under complex coupled fault conditions.
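The identity shortcut can be written in a few lines. In this sketch the convolution and batch normalization layers are abstracted into a function `branch`, and the final ReLU is applied after the addition, which follows the common ResNet layout and is an assumption about Figure 5 rather than a detail confirmed by the text.

```python
def relu(x):
    return [max(0.0, v) for v in x]

def residual_unit(x, branch):
    """y = ReLU(F(x) + x), where F stands for the conv + batch-norm branch."""
    fx = branch(x)
    if len(fx) != len(x):
        raise ValueError("identity shortcut needs matching shapes")
    return relu([a + b for a, b in zip(fx, x)])

# Hypothetical branches standing in for the two convolutional layers.
zero_branch = lambda x: [0.0 for _ in x]
scaled_branch = lambda x: [0.5 * v for v in x]

passthrough = residual_unit([1.0, -2.0], zero_branch)
boosted = residual_unit([2.0, -2.0], scaled_branch)
```

When the branch outputs zeros, the unit reduces to a ReLU of its input, which shows why adding such units does not make it harder for the network to preserve shallow features.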

3.3. Feature Extraction Based on Bi-LSTM

Given the distinct characteristics of fault features in gearbox vibration signals—exhibiting both pronounced temporal correlation and coexisting local transient features—this paper constructs a temporal modeling module based on a bidirectional long short-term memory (Bi-LSTM) network within the time-domain feature extraction channel. This enhances the model’s ability to perceive dynamically evolving features under complex operating conditions. Unlike traditional sequential models that rely solely on unidirectional temporal information, Bi-LSTM simultaneously models dependencies between historical and future time steps, providing a more comprehensive sequential feature representation for subsequent multi-task diagnostics.

Building upon Bi-LSTM, this paper further introduces a time-dimension-based attention weighting mechanism for adaptive modeling of temporal features across different time points. Specifically, linear mapping of Bi-LSTM outputs at each time step, combined with learnable context vectors, calculates attention weights to highlight key temporal features contributing significantly to fault discrimination while suppressing redundant or noisy information. Subsequently, these features are aggregated through weighted summation to form a discriminative global temporal representation. This temporal attention mechanism effectively enhances the model’s ability to perceive transient impact features and periodic fault patterns within gearbox vibration signals.
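The temporal attention pooling described above can be sketched as a score per time step, a softmax over steps, and a weighted sum. The tanh nonlinearity inside the score follows the common additive-attention form and is an assumption, as are the toy hidden states and parameters; in the network these would be Bi-LSTM outputs and learned weights.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def temporal_attention(hidden_states, weight, bias, context):
    """Pool h_1..h_T into one vector.

    score_t = context . tanh(W h_t + b); alpha = softmax(score); output = sum_t alpha_t h_t
    """
    scores = []
    for h in hidden_states:
        projected = [math.tanh(sum(w * v for w, v in zip(row, h)) + b)
                     for row, b in zip(weight, bias)]
        scores.append(sum(c * p for c, p in zip(context, projected)))
    alphas = softmax(scores)
    dim = len(hidden_states[0])
    pooled = [sum(a * h[d] for a, h in zip(alphas, hidden_states)) for d in range(dim)]
    return pooled, alphas

# Toy sequence in which the middle step carries a large transient.
H = [[0.1, 0.0], [2.0, 1.0], [0.2, 0.1]]
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
u = [1.0, 1.0]  # learnable context vector in the real model
pooled, alphas = temporal_attention(H, W, b, u)
```

In this toy case the middle step receives the largest weight, so the pooled vector is dominated by the transient rather than by the quieter steps around it.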

3.4. Fault Diagnosis Method Based on CWGAN-GP-MTL

Multitask learning uncovers latent correlations between bearing and gear failures by sharing network parameters. This approach enhances generalization and robustness for small-sample tasks while dynamically balancing learning progress across tasks, thereby achieving synchronized high-precision classification in gearbox fault diagnosis.

(1) Structure of the CWGAN-GP-MTL Model

For vibration data generated and augmented via CWGAN-GP, this paper proposes a dual-channel classification model based on multi-task learning. First, for both augmented and genuine vibration signals, a short-time Fourier transform is applied to map the one-dimensional time-domain signal into a two-dimensional time-frequency spectrum, which is then fed into the 2D-CNN channel to extract time-frequency features. Simultaneously, the augmented and original raw time-domain signals are fed into a Bi-LSTM channel to extract temporal features. Cross-modal multi-head attention mechanisms adaptively fuse features from both time-frequency and time domains. Subsequently, the fused features undergo further refinement through a shared layer before being fed into separate task branches for gear and bearing fault classification. These branches are jointly optimized using multi-task loss to achieve simultaneous high-precision diagnosis of both fault types. The network configuration of the CWGAN-GP-MTL architecture is shown in Table 1.
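The mapping from a one-dimensional signal to a two-dimensional time-frequency spectrum can be illustrated with a small short-time Fourier transform. This sketch uses a direct DFT with a Hann window for readability; the frame length, hop size, and test tone are hypothetical, and a practical pipeline would use an FFT-based library routine instead.

```python
import cmath
import math

def stft_magnitude(signal, frame_len=16, hop=8):
    """Magnitude spectrogram: rows are frames, columns are bins 0..frame_len // 2."""
    # Periodic Hann window to reduce spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len) for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        seg = [signal[start + n] * window[n] for n in range(frame_len)]
        spectrum = []
        for k in range(frame_len // 2 + 1):
            coeff = sum(seg[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                        for n in range(frame_len))
            spectrum.append(abs(coeff))
        frames.append(spectrum)
    return frames

# Test tone with 4 cycles per 16-sample frame, so its energy sits in bin 4.
tone = [math.sin(2 * math.pi * 4 * n / 16) for n in range(64)]
spec = stft_magnitude(tone)
```

Each row of `spec` is one time slice and each column one frequency bin, which is the image-like layout that a 2D-CNN channel consumes.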

This paper employs a dual-channel feature extraction strategy to separately mine complementary information from time-domain sequences and time-frequency spectra. The time-domain channel utilizes a bidirectional long short-term memory network as its backbone, incorporating an attention mechanism on its output to automatically emphasize temporal step features relevant to fault discrimination, thereby enhancing sensitivity to both periodic and transient events. The time-frequency channel converts one-dimensional vibration signals into time-frequency representations via short-time Fourier transform (STFT). A convolutional neural network then performs multi-layer spatial feature extraction on the time-frequency images. Residual learning units and adaptive pooling are incorporated within the network to enhance deep feature learning capabilities and ensure dimensional stability after feature flattening. Features learned by the two channels are fused through a cross-modal multi-head attention mechanism, enabling selective information exchange while preserving discriminative features from both modalities. This design allows the model to maintain intra-modal expressive power while dynamically allocating attention weights during fusion, thereby improving discrimination capability and robustness against complex coupled faults.
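The core of the cross-modal fusion is scaled dot-product attention between the two feature sets. The sketch below is a simplification under stated assumptions: it uses a single head without learned projections, and it lets the time-domain features act as queries over the time-frequency features, a direction the text does not specify. The token values are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all key/value pairs."""
    d = len(keys[0])
    fused = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        fused.append([sum(w * v[j] for w, v in zip(weights, values))
                      for j in range(len(values[0]))])
    return fused

# Hypothetical features: one Bi-LSTM token queries two CNN time-frequency tokens.
temporal_tokens = [[1.0, 0.0]]
tf_tokens = [[4.0, 0.0], [0.0, 4.0]]
fused = cross_attention(temporal_tokens, tf_tokens, tf_tokens)
```

The query aligns with the first time-frequency token, so the fused vector is weighted toward it. A multi-head version would repeat this computation on several learned projections and concatenate the results.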

Based on the fused features obtained, a “shared-dedicated” architecture was designed to enable multi-task collaborative learning. The shared network extracts general features from the fused features through channel mapping and residual augmentation modules, aiming to learn underlying representations valuable across all tasks and improve gradient flow during training, thereby reducing the risk of overfitting in individual tasks. Subsequently, a lightweight task branch is established for each subtask. Shared features are projected into task-specific discriminative vectors through channel remapping and small-scale fully connected mapping, preserving task-specific differences and enabling final classification predictions.

It should be noted that the multi-task learning framework proposed in this paper targets the joint fault diagnosis of two critical components: bearings and gears in gearboxes. In the experimental setup, the model’s input comprises two data streams: bearing signal samples and gear signal samples. Both streams correspond to the operational state of the same gearbox and are respectively labeled with bearing fault tags and gear fault tags. After feature extraction and fusion within the network, these two feature streams form a unified shared representation. This representation is simultaneously fed into both the bearing task branch and the gear task branch, each of which outputs a fault category prediction for its respective component.

During training, the output of the bearing task branch is only loss-calculated with the bearing fault labels, while the output of the gear task branch is only loss-calculated with the gear fault labels. The multi-task loss function combines backpropagation to constrain the shared layer learning to acquire a general feature representation with discriminative capabilities for both task types. During the testing phase, unknown samples undergo shared feature extraction and are fed in parallel to both task branches, yielding separate fault identification results for bearings and gears.

(2) Multiple Loss Functions

This paper employs the cross-entropy loss function as the optimization objective for both the bearing and gear fault diagnosis tasks. Cross-entropy measures the discrepancy between the model’s predicted class distribution and the true distribution, making it suitable for the single-label multi-class fault identification task addressed in this study. By minimizing the negative log-likelihood of the true class, the loss drives the output probabilities toward the true class, thereby improving classification accuracy and convergence stability. Two independent cross-entropy loss functions are defined for the bearing and gear tasks, expressed as follows:

L_{bearing} = - \sum_{j = 1}^{k_{b}} p_{j}^{b} \log (q_{j}^{b})

(4)

L_{gear} = - \sum_{j = 1}^{k_{g}} p_{j}^{g} \log (q_{j}^{g})

(5)

In Equations (4) and (5), L_{bearing} and L_{gear} denote the loss functions for the bearing and gear tasks, respectively; p^{b} and p^{g} denote the target distributions; q^{b} and q^{g} denote the estimated distributions; and k_{b} and k_{g} are the numbers of bearing and gear classes over which the sums run.

To enable the model to train collaboratively for both tasks, the sum of the losses for the bearing and gear components is directly used as the network’s final loss function. The total loss is expressed as follows:

Loss = L_{bearing} + L_{gear}

(6)
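To make Equations (4)–(6) concrete, the following sketch evaluates both cross-entropy terms and their sum for a single sample. The probability vectors are hypothetical values chosen for illustration, and the small `eps` term is an assumption added only to avoid taking the logarithm of zero.

```python
import math


def cross_entropy(target, predicted, eps=1e-12):
    """-sum_j p_j * log(q_j) for one sample."""
    return -sum(p * math.log(q + eps) for p, q in zip(target, predicted))


def multitask_loss(p_b, q_b, p_g, q_g):
    """Return (L_bearing, L_gear, Loss) with Loss = L_bearing + L_gear."""
    l_bearing = cross_entropy(p_b, q_b)
    l_gear = cross_entropy(p_g, q_g)
    return l_bearing, l_gear, l_bearing + l_gear


# Hypothetical one-hot targets and softmax outputs (five classes per task).
p_b = [0, 0, 1, 0, 0]
q_b = [0.05, 0.05, 0.80, 0.05, 0.05]
p_g = [1, 0, 0, 0, 0]
q_g = [0.60, 0.10, 0.10, 0.10, 0.10]

l_bearing, l_gear, total = multitask_loss(p_b, q_b, p_g, q_g)
print(round(l_bearing, 4), round(l_gear, 4), round(total, 4))
```

With one-hot targets each term reduces to the negative log-probability assigned to the true class, so the less confident gear prediction contributes the larger share of the total.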

(3) Cross-Modal Multi-Head Attention Fusion Module and Rationale

To fuse the temporal BiLSTM features with the time-frequency CNN features, this paper introduces a cross-modal multi-head attention mechanism as the feature interaction module within the multi-task learning framework. This mechanism uses the temporal features as the Query and the time-frequency features as both Key and Value. By modeling correlations across modalities through parallel multi-head subspaces, it adaptively assigns weights to both modalities within the shared representation, thereby highlighting key feature dimensions and suppressing redundant information.

Compared to traditional fusion methods (e.g., simple concatenation or weighted averaging), multi-head attention offers the following advantages: (1) it explicitly captures cross-modal correspondences, which suits time-domain and time-frequency features with significant modal differences; (2) its multi-head architecture learns complementary information across different subspaces, enhancing fusion expressiveness; (3) its relatively low parameter count allows integration into multi-task frameworks with manageable computational overhead; and (4) it improves model robustness and generalization under complex operating conditions.

Thus, the cross-modal multi-head attention mechanism enhances fusion quality and joint diagnostic performance, serving as a critical component in the fault diagnosis methodology developed in this study.
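The Query/Key/Value arrangement described in this subsection can be sketched as scaled dot-product attention split across heads. The version below is deliberately simplified: it omits the learned projection matrices that a full multi-head attention layer would apply (an assumption made to keep the example short), and the feature tokens are hypothetical numbers.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values, num_heads):
    """Queries attend over keys/values; heads split the feature dimension."""
    dim = len(queries[0])
    if dim % num_heads != 0:
        raise ValueError("feature dimension must be divisible by num_heads")
    head_dim = dim // num_heads
    fused = []
    for q in queries:
        out = []
        for h in range(num_heads):
            sl = slice(h * head_dim, (h + 1) * head_dim)
            # Scaled dot-product scores between this query and every key.
            scores = [sum(a * b for a, b in zip(q[sl], k[sl])) / math.sqrt(head_dim)
                      for k in keys]
            weights = softmax(scores)
            # Weighted sum of the value slices for this head.
            for d in range(head_dim):
                out.append(sum(w * v[sl][d] for w, v in zip(weights, values)))
        fused.append(out)
    return fused


# Hypothetical tokens: temporal features as Query, time-frequency as Key/Value.
temporal = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.2, 0.8]]
timefreq = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [0.5, 0.5, 0.5, 0.5]]
fused = cross_attention(temporal, timefreq, timefreq, num_heads=2)
print(fused)
```

Each output row has the same width as its query, and each head mixes the time-frequency tokens with its own set of weights, which is how different subspaces can emphasize different correspondences.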

4. Experimental Verification

To evaluate the performance of the proposed method, this section analyzes two case studies: the Southeast University gearbox and the port gearbox. All deep learning (DL) models were trained using PyTorch (version 2.7.0+cu128) in a Python (version 3.12.9) environment on a computer equipped with an RTX 4090 GPU and 64 GB of memory. For clarity and consistency, the evaluation criteria for data generation quality are explained first. The effectiveness of the proposed method is then validated and analyzed through the two case studies.

4.1. Data Generation Quality Assessment Criteria

To comprehensively evaluate the quality of generated data, this paper assesses the similarity between generated and original data from two perspectives: quantitative statistical analysis and qualitative feature visualization. Through multi-dimensional evaluation criteria, the performance of the generative model in terms of distributional consistency and feature preservation can be more systematically characterized.

(1) Quantitative Evaluation Indicators

This paper employs the Maximum Mean Discrepancy (MMD) metric in experiments to quantify the quality and diagnostic performance of generated samples. MMD is a statistical measure for quantifying the difference between two probability distributions, enabling quantitative evaluation of generated data quality through distribution-based similarity comparisons. Its functional form is:

MMD [F, P (X), P (G (X))] = \left[ \frac{1}{m (m - 1)} \sum_{i \neq j}^{m} k (x_{i}, x_{j}) - \frac{2}{m n} \sum_{i = 1}^{m} \sum_{j = 1}^{n} k (x_{i}, y_{j}) + \frac{1}{n (n - 1)} \sum_{i \neq j}^{n} k (y_{i}, y_{j}) \right]^{\frac{1}{2}}

(7)

In Equation (7), F represents the generalized Gaussian kernel function; P (X) and P (G (X)) denote the distributions of genuine fault data and generated fault samples, respectively; m and n represent the sample sizes of the two distributions; x_{i} and y_{i} denote sample points from P (X) and P (G (X)), respectively; and k represents the Gaussian kernel function. Multiple experiments were conducted by randomly sampling equal numbers of original samples and generated fault samples from each dataset.
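The estimator in Equation (7) can be sketched in a few lines. The example below assumes a Gaussian kernel with bandwidth `sigma = 1.0` and uses synthetic Gaussian samples in place of vibration features; both are illustrative choices, not settings taken from the paper. Because the finite-sample estimate inside the square root can dip slightly below zero, the sketch clamps it at zero, which is also an assumption of this example.

```python
import math
import random


def gaussian_kernel(x, y, sigma=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2 * sigma ** 2))


def mmd(real, generated, sigma=1.0):
    """Square root of the within/cross kernel-mean combination."""
    m, n = len(real), len(generated)
    xx = sum(gaussian_kernel(real[i], real[j], sigma)
             for i in range(m) for j in range(m) if i != j) / (m * (m - 1))
    yy = sum(gaussian_kernel(generated[i], generated[j], sigma)
             for i in range(n) for j in range(n) if i != j) / (n * (n - 1))
    xy = sum(gaussian_kernel(x, y, sigma)
             for x in real for y in generated) / (m * n)
    # Clamp: the finite-sample estimate may be marginally negative.
    return math.sqrt(max(xx - 2 * xy + yy, 0.0))


# Synthetic stand-ins for real and generated feature vectors.
random.seed(0)
real = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(50)]
similar = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(50)]
shifted = [[random.gauss(3.0, 1.0) for _ in range(4)] for _ in range(50)]

mmd_similar = mmd(real, similar)
mmd_shifted = mmd(real, shifted)
print(mmd_similar, mmd_shifted)
```

Samples drawn from the same distribution yield a value near zero, while the shifted set yields a clearly larger value, which is the behavior the metric relies on when comparing generated and real fault data.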

(2) Qualitative Feature Visualization Methods

To further visually assess the similarity between generated samples and real samples in the feature space, we employed dimensionality reduction visualization techniques such as t-SNE and PCA. t-SNE is a nonlinear dimensionality reduction method that maps high-dimensional data onto a two-dimensional plane, preserving the relative positions of similar samples in the low-dimensional plot. We first extracted high-dimensional features from both real fault signals and generated signals, then performed t-SNE and PCA projections, respectively. The distributions of both sample types were plotted within the same coordinate system.

4.2. Case 1: Southeast University Dataset

The gearbox fault data used in this study originates from Southeast University’s transmission system dynamic test bench based on the Drivetrain Dynamic Simulator (DDS) [30], as shown in Figure 6. This test bench acquires bearing and gear signals under two operating conditions (20 Hz–0 V and 30 Hz–2 V). Bearing signals encompass vibration data for one healthy condition and four fault states (ball fault, inner ring fault, outer ring fault, compound fault). Gear signals include one healthy condition and four fault states (defective fault, tooth breakage fault, root crack fault, surface wear fault). Each data type comprises 8 channel signals: Channel 1 is the motor vibration signal; Channels 2–4 represent the X, Y, and Z-axis vibrations of the planetary gearbox; Channel 5 is the motor output torque; Channels 6–8 are the three-axis vibrations of the parallel gearbox. The data sampling frequency is 12 kHz.

The experimental signals selected for this study are X-direction vibration signals from a parallel gearbox. Detailed information on the dataset partitioning is shown in Table 2. The dataset comprises 10 categories (labels 0–9), with bearing data labeled 0–4 and gear data labeled 5–9. Labels 0 and 5 represent the “normal state,” with 800 training samples and 200 test samples per category. The remaining fault categories (ball fault, inner ring fault, outer ring fault, compound fault, defective fault, tooth breakage fault, root crack fault, and surface wear fault) each have 40 training samples and 200 test samples. To simulate data imbalance, an imbalance factor β = N_{normal} / N_{fault} is defined. β quantifies the severity of imbalance, where N_{normal} represents the number of normal samples and N_{fault} denotes the number of samples per fault type. This study employs β = 20:1.

In this study, both bearing and gear vibration signals were segmented into non-overlapping samples of 1024 points each. All samples were divided proportionally into training and test sets, yielding 960 training samples and 1000 test samples for each of the bearing and gear tasks. Only the fault-type samples within the training set were used to train the CWGAN-GP. This augmentation strengthens the representation of minority faults, mitigates data imbalance, and provides diverse training inputs for the subsequent diagnostic model.
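The non-overlapping segmentation can be sketched as follows. The 1024-point sample length comes from the text; the 5000-point input is a hypothetical placeholder for a recorded channel, and discarding the incomplete tail segment is an assumption of this sketch.

```python
def segment(signal, length=1024):
    """Split a 1-D signal into consecutive non-overlapping samples."""
    n_segments = len(signal) // length  # incomplete tail is dropped
    return [signal[i * length:(i + 1) * length] for i in range(n_segments)]


# Hypothetical recording of 5000 points.
signal = list(range(5000))
samples = segment(signal)
print(len(samples), len(samples[0]), samples[1][0])
```

Because consecutive samples share no points, no segment straddles the boundary between the training and test sets when the split is made along the recording.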

(1) Analysis of Data Generation Quality Assessment Results

The CWGAN-GP model is trained to generate signals that resemble the fault signals of the Southeast University gearbox, thereby mitigating data imbalance. Figure 7 illustrates the trend of loss values for the generator and discriminator in the CWGAN-GP model across training iterations. The loss function is constructed based on the Wasserstein distance and incorporates a gradient penalty (GP) term to constrain the discriminator gradient, which addresses challenges in GAN training such as gradient vanishing/explosion and mode collapse. The figure shows significant loss fluctuations during early training, which gradually converge toward near-zero values as iterations progress. This indicates that the generator and discriminator progressively reach equilibrium through adversarial learning and supports the training stability of CWGAN-GP on complex generation tasks. The CWGAN-GP model completed training after 5000 iterations. Fault signals were then generated using the trained generator to achieve data balance. To assess whether the generated fault data meet quality standards, an evaluation was conducted from two perspectives: statistics and visualization.

As shown in Figure 8, the MMD values are low, indicating that the generated data are highly consistent with the original data in distribution. The MMD results show that the generative model simulates the original data distribution well and effectively captures the features of the original data.

The visualization results for real and generated data of bearings and gears are shown in Figure 9. Real and generated samples cluster closely and overlap in the reduced-dimensional feature space, with the point clouds of both datasets primarily intertwined. This indicates that the generated samples successfully capture the distributional characteristics of the real data. The substantial overlapping region between the two sets in the reduced-dimension plot indicates that the generative model captures the intrinsic structure of real samples at the feature level. Such qualitative visual analysis provides intuitive evidence for evaluating the model’s generative performance, validating the consistency of distribution between generated and real data across multidimensional feature spaces.

(2) Results and Analysis

1. Data Imbalanced Grouping Design

To systematically validate the robustness and generalization performance of the proposed method under varying degrees of data imbalance, this study designed five gearbox fault datasets (A–E), as shown in Table 3. Within each dataset, the number of ground-truth vibration samples for each fault type was fixed at 40, while CWGAN-GP generated 0, 40, 120, 360, and 740 samples to form five training subsets with total training sizes of 40, 80, 160, 400, and 800 samples per fault type, respectively. Additionally, 200 samples from each group’s original data were allocated as the test set. The imbalance ratio thus decreases from 20:1 through 10:1, 5:1, and 2:1 to 1:1. This design enables evaluation of the generated samples’ compensatory effect on expressing minority fault features while comparing the classification accuracy of the diagnostic model under varying training set sizes and balancing conditions. During training, the AdamW optimizer was employed with a batch size of 64 and a learning rate of 0.0001.
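The stated imbalance ratios follow directly from the imbalance factor β = N_{normal} / N_{fault}. The short check below uses the 800 normal training samples and the per-fault training totals quoted in the text; the mapping of labels A–E to those totals in the listed order is an assumption of this sketch.

```python
# Normal-class training samples per task, as stated for Case 1.
n_normal = 800

# Per-fault training totals for datasets A-E (order assumed).
fault_totals = {"A": 40, "B": 80, "C": 160, "D": 400, "E": 800}

# beta = N_normal / N_fault for each dataset.
ratios = {name: n_normal / n_fault for name, n_fault in fault_totals.items()}

for name, beta in ratios.items():
    print(f"Dataset {name}: {beta:g}:1")
```

Dataset A therefore corresponds to the original imbalance and dataset E to a fully balanced training set.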

In cases of sample imbalance, the improvement in classification performance achieved through sample generation is more pronounced. As shown in Figure 10, experimental results indicate that data augmentation of minority faults using CWGAN-GP significantly enhances the overall accuracy of fault diagnosis. As the number of generated samples gradually increases, the model’s classification performance continues to improve: without introduced synthetic data, the diagnostic accuracy for both bearing and gear tasks remained around 80%. With progressively more synthetic samples, the model’s accuracy significantly increased and stabilized at approximately 97.5% once the sample size reached a certain scale. These results demonstrate that the high-quality synthetic samples generated by CWGAN-GP effectively mitigate the scarcity of minority class samples, enhance the model’s learning capability for critical fault features, and thereby significantly improve diagnostic accuracy and stability under imbalanced conditions.

After reducing the data imbalance ratio from 20:1 to 1:1, the confusion matrices for the bearing and gear tasks are shown in Figure 11, where Figure 11a and Figure 11b correspond to the diagnostic results for bearing and gear faults, respectively. Despite the high overall recognition accuracy for both tasks, the model still has difficulty identifying certain fault types. In the bearing task, compound faults and outer ring faults were occasionally misclassified as normal conditions or other single fault types. This indicates that, under multi-source coupling or less pronounced impact features, the distinguishing characteristics of highly complex faults may be obscured by background noise or feature overlap. In the gear task, misclassifications primarily occurred between missing tooth faults and broken tooth faults, as well as root crack faults. This indicates that gear damage of varying degrees or forms exhibits certain similarities in time-frequency domain features. The above analysis indicates that while multi-channel feature fusion and synthetic data generation significantly enhance overall diagnostic performance, challenges persist in distinguishing faults with similar mechanisms or adjacent damage levels. This highlights potential areas for model improvement in practical engineering applications.

2. Ablation Experiment

To validate the effectiveness and necessity of the proposed multi-task learning model in bearing and gear fault diagnosis, multiple sets of ablation experiments were designed, with results shown in Table 4. First, single-task models trained exclusively for each component achieved diagnostic accuracies of 90.2% and 91.5%, respectively, demonstrating the model’s basic fault recognition capability under single-task conditions. In contrast, the multi-task learning model—which employs shared feature representations and simultaneously optimizes both bearing and gear tasks—achieved significant improvements in both tasks, with accuracies reaching 97.5%. This demonstrates that sharing underlying features and introducing cross-task information exchange effectively enhances the model’s ability to extract key fault features and improves its generalization performance.

Furthermore, to analyze the roles of different feature extraction channels and the attention fusion module, control models were constructed by removing specific components. After removing the CNN channel, the accuracy rates for bearings and gears decreased to 94.1% and 93.6%, respectively. After removing the BiLSTM channel, the accuracy rates for the two tasks were 95.8% and 95.7%, respectively. When the attention fusion module was disabled, performance similarly declined, with accuracies dropping to 95.1% and 95.8%, respectively. These results demonstrate that each channel and the fusion module play crucial roles in feature extraction and task coordination. The multi-task joint optimization mechanism effectively enhances overall fault identification performance by sharing feature representation spaces, validating the rationality and superiority of the proposed method in multi-component coupled fault diagnosis scenarios.

3. Comparison with Other Methods

To validate the effectiveness and feasibility of the proposed network model, this paper compares CWGAN-GP-MTL with four state-of-the-art models: MT-1DCNN [31], RI-MPCNN [32], MTCASN [24], and MSCNN [33]. The training and testing strategies for each network are identical across all evaluations. As shown in Table 5, the proposed CWGAN-GP-MTL model achieves high classification accuracy in both bearing and gear fault diagnosis tasks, significantly outperforming other comparison models. Compared to MT-1DCNN (which utilizes single-domain features), RI-MPCNN and MSCNN (which employ multi-scale convolutional structures), and MTCASN (which incorporates channel attention), CWGAN-GP-MTL effectively enhances the model’s ability to recognize complex coupled fault features by integrating time-frequency features extracted by CNN with time-domain features extracted by BiLSTM, and by utilizing an attention mechanism to achieve adaptive feature weighting. This demonstrates stronger generalization capabilities and task synergy advantages.

4.3. Case 2: Portal Crane Gearbox Dataset

The gearbox fault dataset used in this study originates from a portal crane gearbox at a port in Shandong Province, as shown in Figure 12. The signals were collected from the hoisting mechanism of the portal crane. During operation, bearing and gear data were captured at a frequency of 5000 Hz. Bearing data comprised four categories: normal condition data, rolling element failure data, inner ring failure data, and outer ring failure data. Gear data also included four categories: normal condition data, gear tooth surface wear, gear tooth cracks, and abnormal gear meshing. The majority of collected data represents normal conditions, with only a small portion indicating fault types, aligning with fault diagnosis under data imbalance scenarios. Detailed information on the training and testing set division is shown in Table 6. Bearing data labels range from 0 to 3, while gear data labels range from 4 to 7.

In the actual operating environment of port gantry cranes, the selection of sensors must comprehensively consider environmental adaptability and engineering feasibility. Temperature signals are susceptible to interference from changes in ambient temperature (such as diurnal temperature variations and the influence of sea breezes), while acoustic emission signals are easily affected by background noise in complex operational environments. Furthermore, torque measurement and multi-sensor fusion solutions typically entail high deployment and maintenance costs. In contrast, vibration signals can directly reflect gear meshing conditions and localized bearing damage characteristics, offering advantages such as fast response times, high sensitivity to early-stage failures, and ease of acquisition. Therefore, this paper selects vibration signals as the primary research focus to better align with practical engineering application requirements.

For the above datasets, both the bearing and gear datasets were segmented into samples of 1024 points each using a non-overlapping approach. The training and test sets were divided according to the format specified in the aforementioned table. Data augmentation was performed using the CWGAN-GP network to enhance diagnostic accuracy for rare fault data.

(1) Analysis of Data Generation Quality Assessment Results

Figure 13 illustrates the trend of loss values for the generator and discriminator in the CWGAN-GP model as training iterations progress. During the initial training phase, both curves exhibit significant fluctuations. They then gradually stabilize and oscillate around zero, indicating that the adversarial process has reached a relative equilibrium. After 5000 training iterations, the trained generator is used to generate fault signals to achieve data balance.

As shown in Figure 14, the MMD values for bearing samples exhibit a gradual downward trend across categories, indicating that the generator fits the original data distribution increasingly well for the different bearing categories. Meanwhile, the MMD of the gear samples reaches its lowest point in category 2 and rebounds in category 3. This variation may be related to differences in complexity or sample size across categories, suggesting that the generator still has room for improvement in fitting certain gear categories. Overall, lower MMD values indicate that the generated data more closely approximate the global distribution statistics of the real data, demonstrating the model’s ability to capture several distributional features of the original data.

After extracting the same high-dimensional features from both real fault signals and generated signals, t-SNE and PCA projections were performed, as shown in Figure 15. The reduced-dimensional point clouds exhibit high overlap and intertwined distributions in the feature space, indicating that the generated samples effectively match the primary distribution characteristics of the real samples. This demonstrates a high degree of consistency at the feature level.

(2) Results and Analysis

1. Data Imbalanced Grouping Design

This study constructed five training subsets (denoted as A–E) based on the gearbox fault data of gantry cranes, as configured in Table 7. Specifically, the number of genuine vibration samples for each fault category was fixed at 40. CWGAN-GP was then employed to synthesize 0, 40, 120, 360, and 740 additional samples, respectively, yielding five training subsets with sample sizes of 40, 80, 160, 400, and 800 per category. Training employs the AdamW optimizer with a learning rate of 0.0001 and a batch size of 64.

Experimental results indicate that as the sample imbalance among different fault categories in the gearbox gradually diminishes, the classification performance for both bearing and gear diagnostic tasks shows a continuous improvement trend, as illustrated in Figure 16. After augmenting the minority class samples using generative adversarial networks (GANs) to adjust the sample distribution from a highly imbalanced state to a balanced one, the overall diagnostic accuracy for the bearing task and gear task increased to 97.63% and 99.75%, respectively. The corresponding confusion matrix results are shown in Figure 17. Figure 17a indicates that in the bearing task, the model accurately distinguishes between normal conditions, rolling element faults, and outer ring faults. However, inner ring faults are occasionally misclassified as normal in rare instances. This may stem from the weak early-stage characteristics and subtle impact components of inner ring faults, whose vibration responses exhibit similarities to normal operating conditions in time-domain features, thereby complicating discrimination. In contrast, outer ring faults are easier for the model to capture and identify due to their fixed excitation location and distinct periodic impact characteristics. In the gear task, as shown in Figure 17b, high recognition accuracy is achieved for normal conditions, tooth surface wear, and gear tooth cracks, while a small number of gear meshing anomaly samples are misclassified as normal. This primarily stems from the fact that abnormal meshing does not always accompany obvious localized damage in certain operating conditions. Its characteristics are more often reflected in subtle changes in the overall vibration pattern, leading to some overlap with the normal state in the feature space. 
Overall, while the generated samples significantly enhance the model’s diagnostic performance under unbalanced conditions, challenges remain in distinguishing categories with weak feature differences or similar failure mechanisms. This highlights potential areas for improvement in practical engineering applications.

2. Ablation Experiment

The ablation results are shown in Table 8. Attention fusion combined with the multi-task model achieved the best performance, indicating that the collaboration among modules is crucial for diagnostic capability. Compared to single-task training, the baseline improved by 11.25 percentage points for bearings and 5.87 percentage points for gears, demonstrating that multi-task sharing significantly enhances cross-task complementary information and sample category generalization ability. Removing attention fusion resulted in declines of approximately 2.03 and 2.25 percentage points, respectively, indicating that attention plays a key role in feature weighting and noise suppression. Using only CNN or only BiLSTM led to a significant decrease in the ability to distinguish certain fault categories, suggesting that CNN excels at extracting time-frequency features to differentiate gear faults, while BiLSTM is better suited for capturing temporal dynamics to identify bearing faults, with the two complementing each other. Therefore, the multi-channel (CNN+BiLSTM) architecture combined with attention and multi-task learning achieves superior diagnostic performance.

3. The Impact of Classification Loss Weighting on Improving CWGAN-GP Performance

To verify the impact of the classification loss term and its weight coefficient λ_{1} on the performance of the improved CWGAN-GP, experiments were conducted with different values of λ_{1} while keeping the gradient penalty coefficient λ_{2} fixed at 10 (a common setting in WGAN-GP). The results are shown in Table 9. When λ_{1} = 0, the model degenerates into a CWGAN-GP without classification constraints. At this point, the MMD values for the bearing and gear datasets were 0.095 and 0.083, respectively—both at relatively high levels—with diagnostic accuracy rates of 94.6% and 98.8%. This indicates that relying solely on adversarial loss is insufficient to fully constrain the class discrimination characteristics of the generated samples, and discrepancies still exist between the generated distribution and the true distribution. When λ_{1} = 0.2, the MMDs decrease to 0.093 and 0.081, respectively, while the accuracy rates improve to 95.6% and 99.2%. Compared to λ_{1} = 0, both distribution consistency and classification performance have improved, but the extent of improvement is limited, indicating that the classification loss has a weaker effect at this weight. When λ_{1} is increased to 0.5, the MMDs for bearings and gears are 0.090 and 0.080, respectively, with bearings achieving the optimal value. The accuracy rates rise to 97.6% and 99.8%, both of which are the highest values, indicating that the classification loss and adversarial loss have reached an optimal balance at this point. When λ_{1} = 1.0, the MMDs rise to 0.093 and 0.082, while the accuracy rates drop to 95.4% and 99.4%, indicating that overly strong classification constraints weaken the model’s ability to capture the true distribution, thereby affecting both generation quality and classification performance. In summary, setting the classification loss weight appropriately is crucial; in this experiment, the model achieved the best overall performance when λ_{1} = 0.5.

4. The Impact of Feature Fusion Strategies on Fault Diagnosis Performance

To further validate the effectiveness of the attention-based feature fusion mechanism in multi-task fault diagnosis, we conducted ablation experiments comparing different feature fusion strategies while keeping the CNN–BiLSTM dual-channel architecture and the multi-task learning framework unchanged. Specifically, we compared attention-based feature fusion, feature concatenation, and mean-based feature fusion on the bearing and gear diagnosis tasks. The experimental results are shown in Table 10. The attention-based feature fusion method achieved the highest diagnostic accuracy for both the bearing and gear tasks, at 97.6% and 99.8%, respectively, outperforming both feature concatenation and mean-based feature fusion. Feature concatenation achieved a relatively high accuracy on the gear task (93.1%), but its performance dropped markedly on the bearing task (84.4%); mean-based feature fusion performed relatively well on the bearing task (91.0%), but its accuracy on the gear task (92.5%) was slightly lower than that of concatenation. This indicates that simple feature fusion methods struggle to accommodate the differing requirements of multi-channel features across different tasks. In contrast, the attention-based feature fusion mechanism can adaptively assign importance weights to features across channels based on different tasks, highlighting key discriminative information while suppressing redundant features. This enables more effective feature representation in multi-task joint diagnosis, significantly improving diagnostic performance.
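For reference, the two baseline fusion strategies compared here can be written in a few lines. This is a minimal sketch with hypothetical feature vectors; the function names are introduced for illustration and do not come from the paper.

```python
def concat_fusion(temporal, timefreq):
    """Stack both feature vectors; the output width is the sum of the inputs."""
    return list(temporal) + list(timefreq)


def mean_fusion(temporal, timefreq):
    """Element-wise average; both inputs must share one dimension."""
    if len(temporal) != len(timefreq):
        raise ValueError("mean fusion needs equal-length feature vectors")
    return [(a + b) / 2 for a, b in zip(temporal, timefreq)]


# Hypothetical channel outputs of equal width.
temporal = [0.2, 0.8, 0.4]
timefreq = [0.6, 0.0, 1.0]

concatenated = concat_fusion(temporal, timefreq)
averaged = mean_fusion(temporal, timefreq)
print(concatenated, averaged)
```

Both baselines combine the channels with fixed, input-independent weights, whereas an attention-based scheme recomputes its weights for every sample.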

5. Comparison with Other Methods

As shown in Table 11, the proposed CWGAN-GP-MTL method outperforms all comparison methods on the bearing, gear, and average metrics. Its average accuracy (98.69%) is approximately 10 percentage points higher than that of the best-performing comparison model, MSCNN (88.63%). Training time also differs across models. MT-1DCNN, MSCNN, and MTCASN have relatively simple structures and train in under 90 s, whereas RI-MPCNN (472.03 s) and the proposed method (476.77 s) take longer because of their higher structural complexity and additional modules. MT-1DCNN performs the worst overall, indicating that single-domain convolution is insufficient for distinguishing complex coupled faults. RI-MPCNN achieves good results for bearings but performs poorly for gears, while MTCASN performs well for gears but poorly for bearings, suggesting that different network architectures favor different sub-tasks and struggle to meet the classification requirements of both fault types at once. In contrast, CWGAN-GP-MTL achieves high accuracy on both tasks, which suggests that its multi-channel feature extraction, task-sharing mechanism, and fusion strategy are better at capturing the temporal features of bearings and the time-frequency features of gears. Although the proposed method incurs more training overhead than the lightweight models, the accuracy gain makes this a favorable performance-efficiency trade-off.
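The averages and the margin quoted in this comparison follow directly from the Table 11 entries. The short script below is only a worked check of that arithmetic; the accuracy and time values are copied from Table 11, while the variable and function names are illustrative.

```python
# Bearing accuracy (%), gear accuracy (%), and training time (s) from Table 11.
results = {
    "CWGAN-GP-MTL": (97.63, 99.75, 476.77),
    "MT-1DCNN": (78.25, 81.50, 88.00),
    "RI-MPCNN": (90.88, 81.63, 472.03),
    "MTCASN": (80.50, 90.00, 71.17),
    "MSCNN": (86.25, 91.00, 86.62),
}

def average_accuracy(bearing, gear):
    """Unweighted mean of the two task accuracies."""
    return (bearing + gear) / 2.0

averages = {name: average_accuracy(b, g) for name, (b, g, _) in results.items()}

proposed = "CWGAN-GP-MTL"
# Best comparison method by average accuracy (proposed method excluded).
best_baseline = max((n for n in averages if n != proposed), key=averages.get)

margin = averages[proposed] - averages[best_baseline]            # percentage points
time_ratio = results[proposed][2] / results[best_baseline][2]    # training-time factor

for name, avg in sorted(averages.items(), key=lambda kv: -kv[1]):
    print(f"{name:13s} avg = {avg:6.2f}%   time = {results[name][2]:7.2f} s")
print(f"margin over {best_baseline}: {margin:.2f} points, "
      f"training time ratio: {time_ratio:.2f}x")
```

On these figures the proposed method gains about 10 percentage points of average accuracy over MSCNN at roughly 5.5 times the training time.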

From an engineering implementation perspective, the two-stage diagnostic framework is feasible at both the training and deployment levels. CWGAN-GP is used only in the offline stage to augment minority fault samples, and the generator remains fixed once training is complete. During online diagnosis, only the multi-task fault diagnosis network is deployed for forward inference, so the augmentation stage does not add significantly to the real-time computational load. The approach therefore meets the basic requirements of industrial applications for diagnostic efficiency and resource consumption.

5. Conclusions

This paper proposes a fault diagnosis method for gantry crane gearboxes that integrates adversarial generation with multi-task learning. An improved CWGAN-GP augments the minority fault classes, and a multi-task neural network learns time-domain and time-frequency features simultaneously, so that imbalanced fault data are used effectively. Both the MMD quantitative evaluation and the t-SNE visualization indicate statistical consistency between generated and real samples, showing that the generative model captures the characteristics of real fault data. In terms of diagnostic performance, the proposed method achieves an overall accuracy exceeding 97% on both diagnostic tasks, outperforming the single-channel and single-task variants in the ablation experiments as well as the comparison methods. The ablation experiments further show that this improvement does not stem from a single feature channel or an independent network architecture. Instead, it results from the joint modeling of time-domain and time-frequency multi-channel features together with the adaptive weighting of key information by the attention fusion mechanism. This synergy is the core advantage of the method. In summary, this paper provides an effective solution with strong engineering potential for multi-task fault diagnosis of gearboxes under sample imbalance. Future work will integrate cross-condition transfer learning and domain adaptation strategies to enhance the model's generalization capability and engineering applicability. It will also optimize the model architecture and training workflow to reduce computational overhead and improve deployment efficiency in real-world industrial settings while maintaining diagnostic accuracy.

Author Contributions

Methodology, Y.Y., Z.L. and H.W.; Investigation, Y.Y.; Writing—Original Draft, Z.L. and H.W.; Writing—Review and Editing, Y.Y., Z.L. and H.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data supporting this study are presented in the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Assaad, B.; Eltabach, M.; Antoni, J. Vibration based condition monitoring of a multistage epicyclic gearbox in lifting cranes. Mech. Syst. Signal Process. 2014, 42, 351–367.
Neupane, D.; Bouadjenek, M.R.; Dazeley, R.; Aryal, S. Data-driven machinery fault diagnosis: A comprehensive review. Neurocomputing 2025, 627, 129588.
Luo, X.; Wang, H.; Han, T.; Zhang, Y. FFT-trans: Enhancing robustness in mechanical fault diagnosis with Fourier transform-based transformer under noisy conditions. IEEE Trans. Instrum. Meas. 2024, 73, 1–12.
Yan, R.; Shang, Z.; Xu, H.; Wen, J.; Zhao, Z.; Chen, X.; Gao, R.X. Wavelet transform for rotary machine fault diagnosis: 10 years revisited. Mech. Syst. Signal Process. 2023, 200, 110545.
Li, Y.; Zhou, J.; Li, H.; Meng, G.; Bian, J. A fast and adaptive empirical mode decomposition method and its application in rolling bearing fault diagnosis. IEEE Sens. J. 2022, 23, 567–576.
Tao, H.; Shi, H.; Qiu, J.; Jin, G.; Stojanovic, V. Planetary gearbox fault diagnosis based on FDKNN-DGAT with few labeled data. Meas. Sci. Technol. 2023, 35, 025036.
Meng, L.; Su, Y.; Kong, X.; Xu, T.; Lan, X.; Li, Y. Intelligent fault diagnosis of gearbox based on differential continuous wavelet transform-parallel multi-block fusion residual network. Measurement 2023, 206, 112318.
Wang, Z.; Huang, H.; Wang, Y. Fault diagnosis of planetary gearbox using multi-criteria feature selection and heterogeneous ensemble learning classification. Measurement 2021, 173, 108654.
Wan, A.; Zhang, F.; Khalil, A.B.; Cheng, X.; Ji, X.; Wang, J.; Shan, T. A novel GA-PSO-SVM model for compound fault diagnosis in gearboxes with limited data. IEEE Sens. J. 2025, 25, 30431–30443.
Wei, Y.; Yang, Y.; Xu, M.; Huang, W. Intelligent fault diagnosis of planetary gearbox based on refined composite hierarchical fuzzy entropy and random forest. ISA Trans. 2021, 109, 340–351.
Ravikumar, K.; Yadav, A.; Kumar, H.; Gangadharan, K.; Narasimhadhan, A. Gearbox fault diagnosis based on Multi-Scale deep residual learning and stacked LSTM model. Measurement 2021, 186, 110099.
Zhang, J.; Zhang, Q.; Qin, X.; Sun, Y. Robust fault diagnosis of quayside container crane gearbox based on 2D image representation in frequency domain and CNN. Struct. Health Monit. 2024, 23, 324–342.
Shi, J.; Peng, D.; Peng, Z.; Zhang, Z.; Goebel, K.; Wu, D. Planetary gearbox fault diagnosis using bidirectional-convolutional LSTM networks. Mech. Syst. Signal Process. 2022, 162, 107996.
Wang, X.; Mao, D.; Li, X. Bearing fault diagnosis based on vibro-acoustic data fusion and 1D-CNN network. Measurement 2021, 173, 108518.
Lv, Y.; Liu, Y.; Li, S.; Liu, J.; Wang, T. Enhancing marine shaft generator reliability through intelligent fault diagnosis of gearbox bearings via improved Bidirectional LSTM. Ocean Eng. 2025, 337, 121860.
Kang, J.; Zhu, X.; Shen, L.; Li, M. Fault diagnosis of a wave energy converter gearbox based on an Adam optimized CNN-LSTM algorithm. Renew. Energy 2024, 231, 121022.
Yuan, B.; Li, Y.; Chen, S. Efficient gearbox fault diagnosis based on improved multi-scale CNN with lightweight convolutional attention. Sensors 2025, 25, 2636.
Guo, Q.; Li, Y.; Song, Y.; Wang, D.; Chen, W. Intelligent fault diagnosis method based on full 1-D convolutional generative adversarial network. IEEE Trans. Ind. Inform. 2019, 16, 2044–2053.
Lyu, P.; Cheng, Y.; Zhang, M.; Yu, W.; Xia, L.; Liu, C. GPSC-GAN: A data enhanced model for intelligent fault diagnosis. IEEE Trans. Instrum. Meas. 2024, 73, 3532116.
Caruana, R. Multitask learning. Mach. Learn. 1997, 28, 41–75.
Zhang, Y.; Yang, Q. An overview of multi-task learning. Natl. Sci. Rev. 2018, 5, 30–43.
Thung, K.H.; Wee, C.Y. A brief review on multi-task learning. Multimed. Tools Appl. 2018, 77, 29705–29725.
Niu, G.; Liu, E.; Wang, X.; Ziehl, P.; Zhang, B. Enhanced discriminate feature learning deep residual CNN for multitask bearing fault diagnosis with information fusion. IEEE Trans. Ind. Inform. 2022, 19, 762–770.
Gao, L.; Huang, J.; Yu, D.; Liu, S. Cross-component fault diagnosis based on lightweight multitasking networks. IEEE Sens. J. 2024, 25, 2231–2243.
Su, Y.; Meng, L.; Kong, X.; Xu, T.; Lan, X.; Li, Y. Small sample fault diagnosis method for wind turbine gearbox based on optimized generative adversarial networks. Eng. Fail. Anal. 2022, 140, 106573.
Liang, P.; Deng, C.; Yuan, X.; Zhang, L. A deep capsule neural network with data augmentation generative adversarial networks for single and simultaneous fault diagnosis of wind turbine gearbox. ISA Trans. 2023, 135, 462–475.
Guo, Z.; Pu, Z.; Du, W.; Wang, H.; Li, C. Improved adversarial learning for fault feature generation of wind turbine gearbox. Renew. Energy 2022, 185, 255–266.
Mirza, M.; Osindero, S. Conditional generative adversarial nets. arXiv 2014, arXiv:1411.1784.
Shao, X.; Ra, I.; Kim, C.S. DSMT-1DCNN: Densely supervised multitask 1DCNN for fault diagnosis. Knowl.-Based Syst. 2024, 292, 111609.
Shao, S.; McAleer, S.; Yan, R.; Baldi, P. Highly accurate machine fault diagnosis using deep transfer learning. IEEE Trans. Ind. Inform. 2018, 15, 2446–2455.
Liu, Z.; Wang, H.; Liu, J.; Qin, Y.; Peng, D. Multitask learning based on lightweight 1DCNN for fault diagnosis of wheelset bearings. IEEE Trans. Instrum. Meas. 2020, 70, 1–11.
Guo, S.; Yang, T.; Hua, H.; Cao, J. Coupling fault diagnosis of wind turbine gearbox based on multitask parallel convolutional neural networks with overall information. Renew. Energy 2021, 178, 639–650.
Jiang, G.; He, H.; Yan, J.; Xie, P. Multiscale convolutional neural networks for fault diagnosis of wind turbine gearbox. IEEE Trans. Ind. Electron. 2018, 66, 3196–3207.

Figure 1. Conditional Generation Network Schematic Diagram.

Figure 2. CWGAN-GP-MTL Diagnostic Framework.

Figure 3. CWGAN-GP and CNN-BiLSTM-MTL Fault Diagnosis Process.

Figure 4. CWGAN-GP Architecture Diagram.

Figure 5. Structure of Residual Learning Units.

Figure 6. Southeast University Test Bench.

Figure 7. Generator and discriminator loss on Southeast University Dataset.

Figure 8. Quality Assessment of MMD Metric Based on Southeast University Dataset.

Figure 9. T-SNE Visualization Results for Real and Generated Data.

Figure 10. Accuracy Rates of Southeast University Dataset Under Different Generated Data.

Figure 11. Confusion Matrix for Bearing (a) and Gear (b) Task Results on Southeast University Dataset.

Figure 12. Port Portal Crane Gearbox.

Figure 13. Generator and discriminator loss on Portal Crane Datasets.

Figure 14. Quality Assessment of MMD Metrics Based on Portal Crane Datasets.

Figure 15. Visualization Results of T-SNE for Real and Generated Data.

Figure 16. Accuracy Rates of Port Portal Crane Datasets Under Different Generated Data.

Figure 17. Confusion Matrix for Bearing (a) and Gear (b) Task Results on Portal Crane Datasets.

Table 1. Network Configuration for the CWGAN-GP-MTL Architecture.

Layer	Type	Kernel Size	Input	Output
1	Input1	-	128 × 128 × 1	128 × 128 × 1
2	Conv2d	3 × 3	128 × 128 × 1	128 × 128 × 32
3	MaxPool2d	2 × 2	128 × 128 × 32	64 × 64 × 32
4	Conv2d	3 × 3	64 × 64 × 32	64 × 64 × 64
5	MaxPool2d	2 × 2	64 × 64 × 64	32 × 32 × 64
6	ResidualBlock2D	-	32 × 32 × 64	32 × 32 × 64
7	AdaptiveAvgPool2d	-	32 × 32 × 64	32 × 32 × 64
8	Flatten	-	32 × 32 × 64	4096
9	Linear	-	4096	256
10	Dropout	-	256	256
11	Linear	-	256	256
12	Input2	-	1024 × 1	1024 × 1
13	BiLSTM	-	1024 × 1	1024 × 512
14	Dropout	-	1024 × 512	1024 × 512
15	Attention Weighting	-	1024 × 512	512
16	Linear	-	512	128
17	Linear	-	128	128
18	CrossModalAttention	-	128	128
19	Conv1d	1 × 1	128 × 1	256 × 1
20	ResidualBlock	-	256 × 1	256 × 1
Task-Specific Layers
Type		Kernel size	Input	Output
Conv1d		1 × 1	256 × 1	256 × 1
Flatten		-	256 × 1	256
Linear		-	256	N
Conv1d		1 × 1	256 × 1	256 × 1
Flatten		-	256 × 1	256
Linear		-	256	N
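The spatial sizes listed for layers 1 to 5 of Table 1 can be reproduced with the standard output-size formulas for convolution and pooling. The sketch below assumes stride 1 and padding 1 for the 3 × 3 convolutions and a pooling stride equal to the kernel size; these settings are not stated in the table and are assumptions made for illustration. It also computes which square pooled map would give the flattened length of 4096 listed for layer 8 if the 64 channels are preserved.

```python
import math

def conv2d_out(size, kernel, stride=1, padding=0):
    """Output side length of a square 2D convolution."""
    return (size + 2 * padding - kernel) // stride + 1

def pool2d_out(size, kernel, stride=None):
    """Output side length of a square 2D max-pooling layer."""
    stride = kernel if stride is None else stride
    return (size - kernel) // stride + 1

# Layers 2-5 of Table 1 with the assumed stride and padding settings.
layers = [
    ("Conv2d 3x3", lambda s: conv2d_out(s, kernel=3, padding=1)),
    ("MaxPool2d 2x2", lambda s: pool2d_out(s, kernel=2)),
    ("Conv2d 3x3", lambda s: conv2d_out(s, kernel=3, padding=1)),
    ("MaxPool2d 2x2", lambda s: pool2d_out(s, kernel=2)),
]

size = 128              # side length of the 128 x 128 STFT input
sizes = [size]
for name, layer in layers:
    size = layer(size)
    sizes.append(size)
    print(f"{name:14s} -> {size} x {size}")

# Flatten length from Table 1 and the channel count of the last conv stage.
flat_len, channels = 4096, 64
pooled_side = math.isqrt(flat_len // channels)
print(f"a {flat_len}-element vector with {channels} channels "
      f"corresponds to a {pooled_side} x {pooled_side} map")
```

Under these assumptions the trace matches the 128, 128, 64, 64, 32 progression in the table, and a 4096-element flattened vector with 64 channels corresponds to an 8 × 8 pooled map.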

Table 2. Detailed Information on Dataset Partitioning at Southeast University.

Component Name	Type Label	Data Status	Training Set	Test Set
Bearing	0	Normal state	800	200
	1	Ball fault	40	200
	2	Inner ring fault	40	200
	3	Outer ring fault	40	200
	4	Compound fault	40	200
Gear	5	Normal state	800	200
	6	Defective fault	40	200
	7	Tooth breakage fault	40	200
	8	Root crack fault	40	200
	9	Tooth surface wear fault	40	200

Table 3. Gearbox Failure Unbalance Data Set on Southeast University Dataset.

	Real Samples	Generated Samples	Training Set	Test Set	Imbalance Factor
A	40	0	40	200	20:1
B	40	40	80	200	10:1
C	40	120	160	200	5:1
D	40	360	400	200	2:1
E	40	760	800	200	1:1
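The imbalance factors in Table 3 are the ratio of normal-state training samples to the training samples available for each fault class. The snippet below reproduces that column from the training-set sizes; the constant of 800 normal samples comes from Table 2, and the function and variable names are illustrative.

```python
from fractions import Fraction

NORMAL_TRAIN = 800  # normal-state training samples per component (Table 2)

# Fault-class training-set size for each dataset configuration in Table 3.
fault_train = {"A": 40, "B": 80, "C": 160, "D": 400, "E": 800}

def imbalance_ratio(majority, minority):
    """Return the majority:minority ratio in lowest terms, e.g. '20:1'."""
    ratio = Fraction(majority, minority)
    return f"{ratio.numerator}:{ratio.denominator}"

ratios = {name: imbalance_ratio(NORMAL_TRAIN, n) for name, n in fault_train.items()}
for name, ratio in ratios.items():
    print(f"dataset {name}: {fault_train[name]:3d} fault samples -> {ratio}")
```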

Table 4. Ablation Experiment on Southeast University Dataset.

Ablation Study	Bearing Accuracy (%)	Gear Accuracy (%)
Baseline	97.5	97.5
Single-task-Bearing	90.2	-
Single-task-Gear	-	91.5
CNN-only	95.8	95.7
BiLSTM-only	94.1	93.6
No-Attention Fusion	95.1	95.8

Table 5. Accuracy of Fault Diagnosis Using Different Methods on Southeast University Dataset.

Method	Bearing	Gear	Average
CWGAN-GP-MTL	97.5%	97.5%	97.5%
MT-1DCNN	78.3%	57.0%	67.65%
RI-MPCNN	90.1%	83.1%	86.6%
MTCASN	72.1%	72.1%	72.1%
MSCNN	88.1%	80.5%	84.3%

Table 6. Detailed Information on the Partitioning of Portal Crane Datasets.

Component Name	Type Label	Data Status	Training Set	Test Set
Bearing	0	Normal state	800	200
	1	Rolling Element failure	40	200
	2	Inner ring failure	40	200
	3	Outer ring failure	40	200
Gear	4	Normal state	800	200
	5	Gear tooth surface wear	40	200
	6	Gear tooth crack	40	200
	7	Abnormal gear meshing	40	200

Table 7. Gearbox Failure Unbalance Data Set on Portal Crane Dataset.

	Real Samples	Generated Samples	Training Set	Test Set	Imbalance Factor
A	40	0	40	200	20:1
B	40	40	80	200	10:1
C	40	120	160	200	5:1
D	40	360	400	200	2:1
E	40	760	800	200	1:1

Table 8. Ablation Experiment on Portal Crane Dataset.

Ablation Study	Bearing Accuracy (%)	Gear Accuracy (%)
Baseline	97.63	99.75
Single-task-Bearing	86.38	-
Single-task-Gear	-	93.88
CNN-only	88.50	98.50
BiLSTM-only	93.75	96.25
No-Attention Fusion	95.60	97.50

Table 9. Performance Comparison of Improved CWGAN-GP Under Different Classification Loss Weighting.

$λ_{1}$	$λ_{2}$	MMD (Bearing)	MMD (Gear)	Accuracy (Bearing) (%)	Accuracy (Gear) (%)
0	10	0.095	0.083	94.6	98.8
0.2	10	0.093	0.081	95.6	99.2
0.5	10	0.090	0.080	97.6	99.8
1	10	0.093	0.082	95.4	99.4
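The MMD columns in Table 9 measure how far the generated samples are from the real ones, with lower values indicating closer distributions. A minimal sketch of such a score is given below, assuming a Gaussian (RBF) kernel with a fixed bandwidth and the biased estimator of squared MMD. The kernel, bandwidth, and feature representation actually used for Table 9 are not stated in this section, and the toy data are synthetic, so the numbers it prints are not comparable with the table.

```python
import math
import random

def rbf_kernel(x, y, sigma):
    """Gaussian kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def mmd_squared(xs, ys, sigma=1.0):
    """Biased estimate of squared MMD between two sample sets."""
    def mean_kernel(group_a, group_b):
        total = sum(rbf_kernel(a, b, sigma) for a in group_a for b in group_b)
        return total / (len(group_a) * len(group_b))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)

def sample(rng, mean, n=50, dim=4):
    """Draw n synthetic feature vectors from an isotropic Gaussian."""
    return [[rng.gauss(mean, 1.0) for _ in range(dim)] for _ in range(n)]

rng = random.Random(0)
real = sample(rng, mean=0.0)          # stand-in for real fault features
good_fake = sample(rng, mean=0.0)     # same distribution as the real set
poor_fake = sample(rng, mean=2.0)     # shifted distribution

print("MMD^2 real vs good generator:", mmd_squared(real, good_fake))
print("MMD^2 real vs poor generator:", mmd_squared(real, poor_fake))
```

A generator that matches the real distribution gives a score near zero, while a shifted distribution gives a clearly larger one, which is the behavior the MMD columns rely on.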

Table 10. Comparison of Ablation Experiment Results for Different Feature Fusion Strategies.

Feature Fusion Method	Bearing Accuracy (%)	Gear Accuracy (%)
Attention-based Feature Fusion	97.6	99.8
Feature Concatenation Fusion	84.4	93.1
Mean-based Feature Fusion	91.0	92.5

Table 11. Accuracy of Fault Diagnosis Using Different Methods on Portal Crane Dataset.

Method	Bearing	Gear	Average	Training Time
CWGAN-GP-MTL	97.63%	99.75%	98.69%	476.77 s
MT-1DCNN	78.25%	81.50%	79.88%	88 s
RI-MPCNN	90.88%	81.63%	86.26%	472.03 s
MTCASN	80.50%	90.00%	85.25%	71.17 s
MSCNN	86.25%	91.00%	88.63%	86.62 s

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

