1. Introduction
Gearboxes are critical transmission components in industrial actuator drive systems and are widely used in fields such as industrial production and transportation. They are common in actuation-critical applications like robotic joint reducers, automotive transmissions, construction machinery, ships, and wind turbine pitch control systems [1,2,3], where they perform the core task of transmitting mechanical power to actuator outputs. However, during actual operation, gearboxes in actuation chains frequently encounter variable rotational speeds, fluctuating loads, and prolonged exposure to high-intensity alternating stresses. These demanding conditions heighten the risk of failures, including missing gear teeth and gear cracks, which can degrade actuator positioning accuracy or cause unexpected downtime. Such failures not only severely disrupt the operational continuity of equipment but also pose latent safety hazards to overall machinery integrity and actuation reliability. In recent years, fault diagnosis has become a critical method for monitoring actuation system condition and mitigating the risk of equipment failure.
Fault diagnosis is a critical technology in industrial production and equipment operation maintenance. Its core lies in precisely monitoring the operational status of equipment or systems and promptly identifying the type and location of faults when anomalies occur. Currently, fault diagnosis techniques are widely applied to various equipment, such as aircraft engines [4,5], gearboxes [6,7], and bearings [8,9]. Traditional fault diagnosis methods often rely on manual feature extraction and expert experience, with their performance constrained by the completeness of prior knowledge and generalization capabilities [10]. In recent years, with advancements in IoT sensing technology and computational capabilities, data-driven fault diagnosis methods based on deep learning have achieved significant progress. These methods overcome the limitations of traditional approaches by automatically learning complex fault feature representations from multi-source monitoring signals, including raw vibration, temperature, and acoustic emission. The application of deep reinforcement learning in modeling fault diagnosis as Markov decision processes [11], Long Short-Term Memory (LSTM) networks in capturing long-term dependencies of non-stationary fault signals [7], and Transformers in global feature extraction [12] has significantly enhanced fault recognition accuracy, driving a technological paradigm shift in this field.
However, deep learning-based fault diagnosis faces severe challenges in industrial scenarios. First, deep learning models typically require a large number of labeled training samples to achieve good performance [13,14], yet fault samples, especially those of rare or novel faults, are often difficult to obtain, leading to the sample scarcity problem [15]. This significantly degrades the diagnostic performance of traditional deep learning methods. Second, real-world fault diagnosis data often suffer from serious class imbalance, where the number of healthy-state or common fault samples far exceeds that of certain specific fault samples [16]. This imbalance biases general learning algorithms towards the majority class, reducing the recall rate for minority-class faults. Third, when a diagnostic model is transferred from a laboratory environment to real operating conditions, factors such as load variations, environmental noise, and sensor drift cause data distribution shifts [4], leading to significant degradation in model performance.
To address the issue of scarce fault samples, existing research has conducted preliminary explorations, yielding various tentative or representative methodological frameworks. These methods typically involve improvements and integrations across dimensions including data augmentation strategies, loss function modifications, and meta-learning strategies. Data augmentation strategies encompass approaches such as generating fault samples via Generative Adversarial Networks (GANs) [
17,
18,
19] and the Synthetic Minority Over-sampling Technique (SMOTE) [
16,
20]. Wang et al. [
21] incorporated an adaptive network into the feature extraction component of a generative adversarial network, improving the traditional variational autoencoder to expand the dataset. Gao et al. [
22] introduced multi-distribution mega-trend diffusion into generative adversarial networks, proposing an intelligent virtual sample method that demonstrated promising experimental results on real-world industrial datasets. Wang et al. [
23] combined transfer entropy with convolutional generative adversarial networks to augment training data for fault diagnosis in air conditioning systems. Yusoff et al. [
24] developed a comprehensive strategy integrating feature transformation and data-level resampling to address issues of sample overlap and class imbalance. Liu et al. [
25] proposed an adaptive correction method for imbalanced datasets by incorporating niche techniques and SMOTE, effectively enhancing model training outcomes. However, the physical interpretability and diversity of the samples produced by such methods remain questionable. Regarding loss function adjustments, deep reinforcement learning can flexibly leverage imbalanced distributions to construct reward functions, enhancing agents’ sensitivity to limited samples [
26,
27], yet its effectiveness is constrained in extreme imbalance scenarios. The core advantage of meta-learning methods lies in their few-shot learning capability, which minimizes reliance on extensive labeled data and overcomes traditional machine learning’s dependence on large sample sizes. Current research primarily focuses on metric-based meta-learning methods that aim to learn effective distance or similarity functions for classifying novel categories in few-shot tasks. For instance, Lin et al. [
28] embedded deep features into BiLSTM to enhance discriminability between features of different fault categories, subsequently using prototypes per fault class to identify similar samples; Wang et al. [
29] integrated a self-attentive multi-scale convolutional module to significantly enhance the classical MAML architecture, maintaining high accuracy with minimal fault samples. Chen et al. [
30] employed a multi-scale attention mechanism for feature extraction to intensify focus on critical features while refining prototype vector accuracy. Bi et al. [
31] adopted instance-guided matching and dimension-guided attention modules to dynamically adjust prototype features. Mu et al. [
32] developed a fine-grained feature learner and a task inequality metric index, enhancing the discrimination capability for subtle features. Dong et al. [
33] proposed an interpretable integration fusion time-frequency prototype contrastive learning method, enhancing the fusion of multi-sensor signals and selection of unlabeled samples, thereby improving the network’s decision-making capability through prototypes. Nevertheless, existing metric-based meta-learning methods emphasize increasing inter-sample distance metrics and feature extraction but lack sufficient exploration of global dependencies in deep features and causal relationship analysis. This deficiency renders the methods more vulnerable to sample scarcity and background noise.
To address the limitations of existing data augmentation methods, this paper proposes a frequency-domain weighted mixing (FWM) method for the augmentation of fault signals. This approach superimposes existing signals in the frequency domain to enhance sample diversity while preserving physical plausibility. Another key contribution is the design of a causal self-attention (CSA) mechanism, which guides the feature extraction network to focus on the causal characteristics of signals and the dependencies between channels in deep features. Finally, to improve network interpretability, we introduce a dual-stream wavelet convolution block (DWCB) that directly extracts time-frequency domain magnitude and phase information from signals and fuses both types of information.
Given the multimodal nature of fault diagnosis data, current research extensively explores diverse modalities, including vibration signals, audio signals [34], temperature data, and image data [35]. Accordingly, this study incorporates cross-domain experiments between vibration and acoustic signals, aiming to align and fuse information from disparate sensors or data sources to achieve more robust and comprehensive feature representations.
The proposed method is experimentally validated to enable effective early warning by monitoring actuator vibration or acoustic signals during incipient faults, such as gearbox wear and pitting. This helps prevent localized faults from escalating into system-level failures, reducing potential economic losses and personnel safety risks. Designed for cross-domain imbalanced-sample scenarios, the intelligent diagnostic model requires only a limited amount of data to train a well-performing model, thereby mitigating deployment challenges in practical applications.
The main contributions of this paper are summarized as follows:
(1) We propose the Multi-layer Feature Fusion Meta-Learning (MLFFML) framework for cross-domain class-imbalance scenarios, further incorporating cross-modal experimental scenarios.
(2) FWM is designed to synthesize training samples by fusing amplitude spectra of multiple signals while preserving phase characteristics, effectively addressing sample scarcity with physical plausibility.
(3) DWCB replaces standard CNN layers to process magnitude and phase information through parallel paths, significantly enhancing time-frequency feature discrimination for weak fault signatures.
(4) CSA is integrated to dynamically isolate domain-invariant features. It enforces independence constraints between noise-corrupted and clean feature representations, suppressing spurious correlations during prototype generation.
2. Materials and Methods
2.1. Overall Framework
Meta-learning, a novel network training paradigm also termed “learning to learn”, has been extensively studied for its superior few-shot learning capability and generalization performance. However, fault diagnosis imposes stringent requirements on feature extraction quality. This paper proposes MLFFML, which implements effective improvements at both shallow and deep feature extraction stages. At the shallow stage, the DWCB architecture processes amplitude and phase information in signal time-frequency representations separately, followed by feature fusion. At the deep stage, CSA leverages transformer encoder layers to enhance global feature dependencies, while a variable independence loss constrains the network to suppress redundant features and enhance feature sparsity. Additionally, FWM augments signal diversity by synthesizing spectral components across different samples, enriching training sample variability for improved model robustness. The overall framework of MLFFML is shown in Figure 1.
The MLFFML framework explicitly addresses three persistent challenges in planetary gearbox fault diagnosis: complex signal transmission paths from planet gears to sensors, weak fault characteristics for planet gear and sun gear-localized faults, and strong cross-modal heterogeneity. Our technical contributions directly target these issues: DWCB’s wavelet convolution adapts to non-stationary gear-meshing signals, extracting physically interpretable time-frequency representations. CSA’s causal constraints suppress noise-induced pseudo-features under variable operating conditions. FWM preserves fault-related spectral components while augmenting limited samples.
The proposed method’s training process is divided into meta-training and meta-testing phases, both leveraging known data from source and target domains to train the model. Since the model architecture, loss computation methodology, and dataset construction remain identical across these two phases, they can be described uniformly. As an example, meta-training begins with the random generation of meta-tasks from the available data. In each task, the set of N × K samples used for task-specific adaptation is called the support set, while the set of N × L samples used to evaluate task performance is termed the query set. Here, N denotes the number of classes, and K and L represent the number of samples per category in the support set and query set, respectively. The details of the proposed method are as follows.
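The task construction described above can be sketched as follows. This is a minimal illustration rather than the implementation used in this paper: the function name `sample_episode`, the toy label array, and the values N = 3, K = 5, and L = 10 are assumptions chosen for the example.

```python
import numpy as np

def sample_episode(labels, n_way, k_shot, l_query, rng):
    """Return (support_idx, query_idx) index arrays for one N-way meta-task."""
    labels = np.asarray(labels)
    # Pick N distinct classes for this task.
    classes = rng.choice(np.unique(labels), size=n_way, replace=False)
    support, query = [], []
    for c in classes:
        idx = rng.permutation(np.flatnonzero(labels == c))
        if idx.size < k_shot + l_query:
            raise ValueError(f"class {c} has too few samples")
        support.extend(idx[:k_shot])                  # K samples per class
        query.extend(idx[k_shot:k_shot + l_query])    # L disjoint samples per class
    return np.array(support), np.array(query)

# Toy dataset (hypothetical): 5 classes with 20 samples each.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(5), 20)
support_idx, query_idx = sample_episode(labels, n_way=3, k_shot=5, l_query=10, rng=rng)
```

Drawing the support and query indices from one permutation per class keeps the two sets disjoint, so the query set measures adaptation rather than memorization.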
First, during the data augmentation stage, FWM expands the number of samples for each fault category to match the count of healthy samples, thereby balancing the dataset. Subsequently, meta-training tasks are constructed from the expanded dataset, with Gaussian noise injected into the support set to create noised samples. The feature extraction stage then employs DWCB, followed by CNN and Transformer encoder layers. Notably, the dual-stream wavelet convolution replaces the first convolutional layer in conventional CNNs, utilizing trainable, complex-valued wavelet kernels optimized through gradient descent to extract interpretable time-frequency representations. Meanwhile, the multi-head self-attention mechanism within Transformer encoders enhances global feature dependencies. Following feature extraction, prototype vectors are computed per class by averaging support-set features. These prototypes serve two purposes: to enable the calculation of classification probability and to function as input to the triplet loss. Furthermore, a variable independence loss constrains feature representations between clean and noised support sets to enforce causal disentanglement and feature sparsity.
The remainder of Section 2 details the constituent modules of the proposed method. Specifically, Section 2.4 not only addresses the causal self-attention mechanism but also elaborates on the computation of training loss functions and classification probabilities.
2.2. Dual-Stream Wavelet Convolution Block
A novel dual-stream wavelet convolution block (DWCB) is designed to extract key frequency-band information from signals. Inspired by WaveletKernelNet [36] and the wavelet transform [37], it employs complex Morlet wavelet convolution for time-frequency decomposition of the signal. The wavelet basis function is the complex Morlet wavelet, which is characterized by its central frequency and by the scale parameter s; s also functions as a trainable parameter in the complex wavelet convolutional layer. By varying s, a set of complex Morlet wavelet filters is generated. The convolution process is divided into a real-part computation and an imaginary-part computation: the real part and the imaginary part of the wavelet are each convolved (denoted by *) with the input signal x.
As shown in Figure 2, the DWCB separates amplitude and phase information after complex Morlet wavelet convolution. This design is necessitated by the inherent differences in the nature of amplitude and phase information. Using the same feature extraction methods, especially the ReLU activation function, may cause phase information loss, thereby diminishing its value in subsequent processes. After extracting preliminary features, amplitude and phase information are integrated via channel-wise concatenation. Multi-dimensional feature weighting is then applied to adaptively adjust the convolutional features.
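The first step of the block, complex Morlet convolution followed by the amplitude/phase split, can be illustrated with the NumPy sketch below. It rests on several assumptions: a common textbook form of the scaled complex Morlet wavelet, exp(-u²/2)·exp(jω₀u)/√s with u = t/s, is used; the names `complex_morlet` and `wavelet_conv`, the central frequency ω₀ = 5, and the kernel half-width are hypothetical; and the scales are fixed here, whereas in the DWCB they are trainable.

```python
import numpy as np

def complex_morlet(scale, omega0=5.0, half_width=64):
    """Sampled complex Morlet kernel at one scale (assumed textbook form)."""
    n = np.arange(-half_width, half_width + 1)
    u = n / scale
    return np.exp(-0.5 * u**2) * np.exp(1j * omega0 * u) / np.sqrt(scale)

def wavelet_conv(x, scales, omega0=5.0, half_width=64):
    """Convolve x with the real and imaginary kernel parts separately,
    then return the magnitude and phase maps (one row per scale)."""
    mags, phases = [], []
    for s in scales:
        psi = complex_morlet(s, omega0, half_width)
        real = np.convolve(x, psi.real, mode="same")   # real-part stream
        imag = np.convolve(x, psi.imag, mode="same")   # imaginary-part stream
        mags.append(np.hypot(real, imag))              # amplitude information
        phases.append(np.arctan2(imag, real))          # phase information
    return np.stack(mags), np.stack(phases)

# Toy input: a sine at 0.05 cycles/sample, which matches omega0 / scale for scale = 16.
t = np.arange(1024)
x = np.sin(2 * np.pi * 0.05 * t)
magnitude, phase = wavelet_conv(x, scales=[4.0, 8.0, 16.0])
```

In this toy example the magnitude map is largest for the scale whose center frequency matches the input tone, which is the frequency-band selectivity the block relies on. The phase map is bounded in [-π, π] and signed, which is why passing it through the same ReLU-based path as the magnitude would discard information.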
2.3. Frequency-Domain Weighted Mixing Method
To address the issue of fault data scarcity, this paper proposes a sample augmentation method based on frequency-domain signal mixing, termed the frequency-domain weighted mixing (FWM) method. The method randomly selects three original signal samples and transforms each into its frequency-domain representation via the Fourier transform, from which the amplitude spectrum of each signal is obtained. Randomly weighted mixing is then applied to the amplitude spectra of the selected samples, while the phase information of a single reference signal is preserved to maintain the temporal characteristics of the waveform. The mixed amplitude and the preserved phase together give the amplitude and phase of the synthesized signal in the frequency domain.
Finally, the time-domain signal is reconstructed via the inverse Fourier transform and then normalized. This fusion strategy preserves the energy within fault-characteristic frequency bands while ensuring phase consistency to prevent temporal distortion in synthesized signals, thereby effectively expanding the diversity of fault samples.
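The FWM procedure can be sketched in a few lines of NumPy. The sketch makes assumptions that the text does not fix: the mixing weights are drawn from a Dirichlet distribution so that they are non-negative and sum to one, the first signal serves as the phase reference, and the normalization is zero-mean, unit-variance standardization. The function name `fwm` is hypothetical.

```python
import numpy as np

def fwm(signals, rng, ref=0):
    """Mix the amplitude spectra of several signals and keep the phase of one."""
    signals = np.asarray(signals, dtype=float)          # shape (3, T)
    spectra = np.fft.rfft(signals, axis=-1)             # frequency-domain representations
    amplitude = np.abs(spectra)                         # amplitude spectra
    phase = np.angle(spectra[ref])                      # phase of the reference signal
    weights = rng.dirichlet(np.ones(len(signals)))      # assumed: random weights summing to 1
    mixed_amplitude = weights @ amplitude               # weighted amplitude mixing
    mixed = np.fft.irfft(mixed_amplitude * np.exp(1j * phase), n=signals.shape[-1])
    return (mixed - mixed.mean()) / (mixed.std() + 1e-8)  # assumed: z-score normalization

# Toy usage with three random signals standing in for samples of one fault class.
rng = np.random.default_rng(0)
x1, x2, x3 = rng.standard_normal((3, 1024))
augmented = fwm([x1, x2, x3], rng)
```

Because only the amplitudes are mixed, each frequency bin of the synthesized signal carries a convex combination of the source energies while its timing structure follows the reference signal.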
2.4. Causal Self-Attention
The proposed method incorporates a Transformer encoder layer after the CNN layers. The Transformer encoder layer consists of two sub-layers: a multi-head self-attention mechanism [38] and a feed-forward neural network. Each sub-layer includes residual connections and layer normalization. This architecture establishes global dependencies across sequences while dynamically weighting and integrating contextual information. In the multi-head attention mechanism, the query, key, and value vectors (Q, K, and V) are computed from the input feature, the attention scores are scaled by the dimension of the key vector, and the encoder layer produces the output feature.
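For reference, the standard formulation of scaled dot-product attention and of an encoder layer from the original Transformer architecture is given below. The symbols X (input feature), Z (intermediate feature), Y (output feature), the projection matrices W, and the head count h are notation assumed for this illustration, as is the post-normalization ordering of the residual connections.

```latex
\begin{align}
\mathrm{Attention}(Q, K, V) &= \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V, \\
\mathrm{head}_i &= \mathrm{Attention}\big(X W_i^{Q},\, X W_i^{K},\, X W_i^{V}\big), \\
\mathrm{MultiHead}(X) &= \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^{O}, \\
Z &= \mathrm{LayerNorm}\big(X + \mathrm{MultiHead}(X)\big), \\
Y &= \mathrm{LayerNorm}\big(Z + \mathrm{FFN}(Z)\big).
\end{align}
```

Dividing by the square root of the key dimension d_k keeps the dot products in a range where the softmax does not saturate.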
A causal decomposition loss is introduced to enforce independence constraints between the variables of the original support-set features and those of the noise-injected support-set features. Combined with the feature transformations in the Transformer, this approach uncovers underlying causal associations and enhances the sparsity of feature representations. The loss is computed from the correlation matrix of the deep features of the original samples and the noise-added samples.
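One plausible reading of such a correlation-based independence constraint is sketched below. It is an assumption, not the loss defined in this paper: the feature dimensions are standardized over the batch, the cross-correlation matrix between clean and noised features is computed, and the squared off-diagonal entries are penalized so that different feature variables become decorrelated. The function names are hypothetical.

```python
import numpy as np

def correlation_matrix(a, b, eps=1e-8):
    """Cross-correlation between the feature dimensions of a and b, each (batch, dim)."""
    a = (a - a.mean(axis=0)) / (a.std(axis=0) + eps)
    b = (b - b.mean(axis=0)) / (b.std(axis=0) + eps)
    return a.T @ b / a.shape[0]

def independence_loss(clean, noised):
    """Sum of squared off-diagonal correlations (assumed form of the penalty)."""
    c = correlation_matrix(clean, noised)
    off_diag = c - np.diag(np.diag(c))
    return float(np.sum(off_diag**2))

# Toy check: independent feature variables versus fully redundant ones.
rng = np.random.default_rng(0)
clean = rng.standard_normal((256, 4))
noised = clean + 0.01 * rng.standard_normal((256, 4))
loss_independent = independence_loss(clean, noised)

redundant = np.repeat(clean[:, :1], 4, axis=1)   # four copies of the same variable
loss_redundant = independence_loss(redundant, redundant)
```

Under this form, redundant feature variables incur a large penalty and nearly independent ones a small penalty, which is consistent with the stated goal of suppressing redundant features.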
2.5. Classification Loss
The prototype vector for each class is calculated from the support-set samples, and a triplet loss is employed to reduce intra-class sample distance while increasing inter-class sample distance. In this triplet loss configuration, the anchors are the features of all samples from both the support set and the query set, the positive sample is the prototype feature of the anchor’s own class, and the negative samples are the heterogeneous (different-class) samples closest to the prototype vectors. This enhances intra-class compactness by aligning samples with their class prototypes, while improving inter-class discrimination through explicit contrast with the most confusable negatives.
The total loss consists of two components: causal disentanglement loss and classification loss.
The classification loss is calculated using the triplet loss, with distances measured by the Euclidean distance. The anchor sample is selected from all samples in both the support set and the query set. The positive sample is the prototype of the anchor sample’s category. The negative sample is the most similar sample from a class different from that of the anchor sample. This configuration effectively minimizes intra-class distances while maintaining separation between the closest inter-class samples. The margin a is a manually set distance parameter, set to 20 here. The classification outcome is determined by the distances between sample features and the prototype features of each category. This process is relatively simple and does not affect model training; thus, we will not elaborate on it here.
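The prototype computation, the triplet loss, and the distance-based classification can be illustrated as follows. This sketch assumes the standard hinge form max(0, d(anchor, positive) − d(anchor, negative) + a) with a = 20, and takes the negative to be the nearest sample of a different class to the anchor; the function names and the toy two-dimensional features are hypothetical.

```python
import numpy as np

def prototypes(support_feats, support_labels, n_classes):
    """Class prototype = mean of the support-set features of that class."""
    return np.stack([support_feats[support_labels == c].mean(axis=0)
                     for c in range(n_classes)])

def triplet_loss(feats, labels, protos, margin=20.0):
    """Anchor: every sample; positive: own-class prototype;
    negative: nearest sample from a different class (assumed selection rule)."""
    d_pos = np.linalg.norm(feats - protos[labels], axis=1)
    pair = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=2)
    pair[labels[:, None] == labels[None, :]] = np.inf   # exclude same-class pairs
    d_neg = pair.min(axis=1)
    return float(np.mean(np.maximum(0.0, d_pos - d_neg + margin)))

def classify(query_feats, protos):
    """Assign each query sample to the class of its nearest prototype."""
    dist = np.linalg.norm(query_feats[:, None, :] - protos[None, :, :], axis=2)
    return dist.argmin(axis=1)

# Toy features: three well-separated clusters in 2-D.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [100.0, 0.0], [0.0, 100.0]])
support_labels = np.repeat(np.arange(3), 5)
query_labels = np.repeat(np.arange(3), 10)
support_feats = centers[support_labels] + rng.standard_normal((15, 2))
query_feats = centers[query_labels] + rng.standard_normal((30, 2))

protos = prototypes(support_feats, support_labels, n_classes=3)
all_feats = np.concatenate([support_feats, query_feats])
all_labels = np.concatenate([support_labels, query_labels])
loss = triplet_loss(all_feats, all_labels, protos)
predicted = classify(query_feats, protos)
```

When every anchor is already more than the margin closer to its own prototype than to its nearest different-class sample, as in this toy example, the hinge is inactive and the loss is zero; it becomes positive only for samples that violate the margin.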