A Recursive Generative Adversarial Denoising Learning Method for Acoustic-Based Gear Fault Diagnosis Under Non-Stationary Noise Interference

E, Zhiqun; Ma, Xingjiang; Yao, Yong; Sun, Lei

doi:10.3390/acoustics7040076

¹ National Institute of Measurement and Testing Technology, Chengdu 610021, China

² College of Electronics and Information Engineering, Sichuan University, Chengdu 610065, China

* Author to whom correspondence should be addressed.

Acoustics 2025, 7(4), 76; https://doi.org/10.3390/acoustics7040076

Submission received: 31 August 2025 / Revised: 30 October 2025 / Accepted: 18 November 2025 / Published: 21 November 2025

Abstract

Acoustic-based diagnosis (ABD) technology demonstrates promising application prospects for rotating machinery such as gears. However, non-stationary background noise may obscure or distort the target acoustic signal, potentially resulting in misdiagnosis or inadequate diagnosis in practical applications. Therefore, preserving the inherent periodicity and sparsity features of mechanical sound signals under non-stationary background noise constitutes a critical challenge to the effective application of ABD in practical industrial environments. To address this challenge, this paper proposes an ABD method based on Recursive Generative Adversarial Denoising (RGAD). Specifically, a Global Window-aware Attention Module (GWAM)-based generator is first designed to reconstruct periodic structural features of gear rotational acoustic signals by adaptively representing non-stationary noise components and recursively capturing global dependencies in the time–frequency domain. Subsequently, a generative adversarial mechanism is established by developing a recursive discriminative architecture, which enables the model to alleviate vanishing gradients during adversarial learning and recover the texture details of gear acoustic features in a coarse-to-fine manner through progressive guidance. Finally, combined with a fault diagnosis network (FDN), a complete RGAD-based ABD framework is constructed. Experimental results demonstrate that the proposed method effectively suppresses noise components while simultaneously reconstructing the periodic characteristics and fine texture details of gear rotational acoustic signals, thereby significantly improving the accuracy and reliability of gear acoustic diagnosis in real industrial scenarios.

Keywords:

gear fault diagnosis; acoustic-based diagnosis; generative adversarial mechanism; recursive denoising learning

1. Introduction

Gears are an essential component of modern rotating machinery and are widely used in various fields such as wind power generation, aerospace, and transportation. Gear fault diagnosis has always been a challenging and highly researched topic, as the operating condition of gears directly affects the performance of the equipment and the personal safety of workers [1,2]. Although vibration-based gear fault diagnosis methods have been commonly used and have achieved good results, the installation of vibration sensors is often limited by working conditions and the complex structure of the equipment to be diagnosed. Moreover, vibration signals are difficult to measure in certain special environments [3].

In recent years, non-contact acoustic-based diagnosis (ABD) methods have attracted increasing attention due to their ability to complement traditional vibration-based techniques. Early studies by Rezaei et al. [4] investigated the effectiveness of ABD for bearing fault detection under varying speeds and load conditions. Peng et al. [5] proposed an acoustic signal-based anomaly detection framework for industrial machinery, enhancing detection accuracy through time–frequency feature perception, further confirming the feasibility and stability of acoustic signals for diagnosis. In the field of rotating machinery, Scanlon and Zhang et al. [6,7] applied non-contact microphone sensors for health monitoring of rotating machinery. Hassan et al. [8] employed a physics-informed deep learning approach to develop an acoustic-based engine fault diagnosis method. Furthermore, in the field of gear fault detection, Hou et al. [9] developed acoustic fault mode detection methods based on near-field acoustic holography, successfully capturing the sound field characteristics of gears in complex structures. Meanwhile, Yao et al. [10,11] proposed a series of air-coupled acoustic diagnostic approaches integrating deep learning and attention mechanisms, which improved gear fault recognition accuracy and real-time performance through automatic feature extraction. Glowacz et al. [12,13,14] introduced innovative acoustic features and applied ABD to the fault diagnosis of various devices, including induction motors, electric impact drills, coffee grinders, and commutator motors, demonstrating strong generalization capabilities. In addition, Ebrahimkhanlou et al. [15] integrated deep learning with acoustic technology and applied it to complex plate-like structures for acoustic emission source localization and characterization. 
Despite the promising performance of these ABD methods in rotating machinery diagnostics, most studies have been conducted under noise-free or ideal laboratory conditions, with insufficient consideration of background noise. This creates a performance gap between experimental results and practical applications. In real-world industrial scenarios, intense and highly non-stationary background noise presents a critical challenge for ABD tasks. It originates from machine collisions, abnormal bearing wear, and complex disturbances caused by load fluctuations in machining systems. These interferences severely degrade the separability of acoustic features and remain a major obstacle to the practical deployment of ABD methods [3]. To our knowledge, however, few studies have addressed the separation of acoustic characteristics from non-stationary background noise for precise diagnosis. Similar acoustic signal separation and enhancement methods have nonetheless been widely studied in the field of speech processing and can provide valuable insights for noise suppression in ABD applications.

In traditional speech processing, non-stationary noise suppression methods include spectral subtraction [16], filtering-based denoising [17], and minimum mean square error (MMSE) estimation [18]. Among these, filtering methods are widely applied. Haykin et al. [19] first proposed Least Mean Square (LMS), a traditional adaptive filtering algorithm based on Wiener filters. Subsequently, Martinek et al. [20] proposed Recursive Least Squares (RLS), a representative adaptive filtering algorithm, based on Kalman filtering theory [21]. However, the aforementioned methods rely on statistical properties between speech and noise and cannot guarantee signal reconstruction quality in highly non-stationary environments because they often introduce artifacts such as “musical noise”. To overcome this limitation, deep learning-based methods have been further developed for high-quality speech reconstruction in non-stationary environments, which are typically divided into masking-based and mapping-based approaches. Masking-based methods learn ideal time–frequency masks, such as the ideal binary mask (IBM) [22] and the ideal ratio mask (IRM) [23], but their performance drops significantly when faced with unseen noise. In contrast, mapping-based methods directly predict clean speech using deep neural networks [24], and they employ strategies like noise-aware training [25] and batch normalization to improve generalization. Although more robust to unknown noise, mapping methods often recover slightly less fine-grained detail than mask-based approaches [23]. To improve fine-grained speech reconstruction under unseen noise, the Generative Adversarial Network (GAN) has been introduced into speech processing tasks for texture detail enhancement. Pascual et al. [26] first investigated generative architectures, called SEGAN, for speech enhancement at the waveform level. Pandey et al. [27] comprehensively verified the superiority of adversarial training compared to regularization-based training for noise enhancement. Fu et al. [28] proposed MetricGAN, which improves speech enhancement performance by replacing traditional adversarial loss with surrogate functions to enhance speech intelligibility and quality. MetricGAN+ [29] further optimizes the training strategies based on this foundation, demonstrating better generalization capabilities across various speech enhancement tasks. Additionally, Wang et al. [30] proposed T-F Masking, which employs a generative adversarial mechanism to construct masks in the time–frequency domain, effectively suppressing noise components under non-stationary conditions. Although the existing generative architectures in the field of speech processing provide a useful reference for generative-based non-stationary noise suppression strategies in ABD tasks, speech-oriented generative frameworks may not be suitable for acoustic-based gear fault diagnosis due to the unique periodicity and sparsity characteristics of gear operating sounds.

To address the aforementioned challenges, a Recursive Generative Adversarial Denoising learning method (RGAD) is explored for acoustic-based gear fault diagnosis in real industrial noise environments. Inspired by the Swin Transformer for capturing global dependency in a sequence signal, the Global Window-aware Attention module (GWAM) is first introduced to restore the periodic structure of the gear acoustic signal on the time–frequency spectrum. Unlike the original design of the Window-based Multi-Head Self-Attention (W-MSA) that shifts at the patch level, the GWAM focuses on global patches of the spectrum for periodic feature reconstruction from non-stationary background noise. After repeatedly embedding the GWAM unit to form the generator, a recursive discriminative architecture is further developed to construct a basic adversarial mechanism, where discriminators with different capabilities are placed in different parts of the generator to interact with equivalent GWAM units for playing max-min games. This design effectively avoids vanishing gradients during the adversarial process and guides the generator to gradually eliminate noise components and progressively restore the texture details of the target clean spectrum in a coarse-to-fine manner. Finally, the RGAD is integrated with a well-trained fault classification module to construct a Recursive Generative Adversarial Denoising diagnosis framework for acoustic-based gear fault diagnosis in real industrial noise scenarios. Experimental results demonstrate the superiority of the proposed RGAD-based ABD framework in both noise suppression performance and diagnosis accuracy compared to conventional approaches.

The main contributions delivered in this work are summarized as follows:

A novel GWAM-based generator is first proposed to capture the periodic structure characteristics of gear acoustic signals under noise interference by adaptively representing non-stationary noise components and recursively modeling the global dependence of time–frequency features.
A new adversarial mechanism is further developed by constructing a recursive discriminative architecture, which enables the model to effectively avoid the vanishing gradient problem and significantly refine the detail reconstruction quality of acoustic features from the noise condition.
Building upon the above modules, a complete RGAD-based ABD framework is constructed to detect gear fault patterns in non-stationary noise conditions, which demonstrates the effectiveness of the proposed ABD framework in real industrial scenarios.

2. Background

2.1. Swin Transformer

The Transformer architecture was originally proposed by Vaswani et al. [31], featuring a Multi-Head Self-Attention mechanism as its core component. It has achieved remarkable success in both natural language processing and computer vision tasks. The traditional Transformer block consists of Layer Normalization (LN), Multi-Head Self-Attention (MSA), and a Multi-Layer Perceptron (MLP), where LN is applied before each MSA and MLP module. These core components are integrated via residual connections, which can be expressed as

\begin{matrix} {\hat{z}}^{l} & = MSA (LN (z^{(l - 1)})) + z^{(l - 1)}, \\ z^{l} & = MLP (LN ({\hat{z}}^{l})) + {\hat{z}}^{l}, \end{matrix}

(1)

where $z^{l}$ represents the output of the $l$-th Transformer block and ${\hat{z}}^{l}$ denotes the intermediate features after passing through the MSA module. The MSA mechanism captures long-range dependencies in input sequences by computing attention weights between queries (Q), keys (K), and values (V). Its core computation can be represented as

\begin{matrix} Q = z^{l} W_{Q}, K = z^{l} W_{K}, V = z^{l} W_{V}, \\ MSA (z^{l}) = Softmax (\frac{Q K^{T}}{\sqrt{d}}) V . \end{matrix}

(2)

The MSA is computed by first obtaining the dot product between the query matrix (Q) and the transposed key matrix ($K^{T}$), scaled by $1 / \sqrt{d}$, the reciprocal of the square root of the feature dimension $d$. This scaled attention score matrix is then normalized through the Softmax activation function along the sequence dimension to generate attention weights, which subsequently interact with the value matrix (V) via matrix multiplication to produce the context-aware output representation.
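As a minimal sketch of the computation in Equation (2), the following pure-Python example evaluates single-head scaled dot-product attention on a toy sequence. The helper names (`softmax`, `matmul`, `attention`), the three-token input, and the identity projection standing in for $W_{Q}$, $W_{K}$, and $W_{V}$ are illustrative assumptions, not values from this work.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Plain-Python product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def attention(q, k, v):
    """Softmax(Q K^T / sqrt(d)) V for a single head, as in Equation (2)."""
    d = len(q[0])                        # feature dimension of each token
    scores = matmul(q, transpose(k))     # pairwise dot products Q K^T
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return weights, matmul(weights, v)   # attention weights and context output

# Toy input (assumed values): 3 tokens with feature dimension d = 2.
z = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]      # stand-in for W_Q, W_K, W_V
q, k, v = matmul(z, identity), matmul(z, identity), matmul(z, identity)
weights, out = attention(q, k, v)
```

Each row of `weights` sums to one, so every output token is a convex combination of the value vectors.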

In the conventional Transformer, each input token is computed from its relationships to all other tokens through the standard MSA mechanism, so the computational complexity is quadratic in the number of tokens. This makes it impractical for high-resolution inputs and resource-constrained applications. To address this issue, Liu et al. [32] proposed the Shifted Window Transformer (Swin Transformer), which adopts a “window-MSA + shifted window-MSA” strategy to significantly reduce computational complexity while preserving modeling capability. The formulation for window-based MSA (W-MSA) can be represented as follows:

\begin{matrix} \text{W-MSA} (z^{l}) = Softmax (\frac{Q K^{T}}{\sqrt{d}} + B) V, \end{matrix}

(3)

where B is the relative positional bias term.

To alleviate the interaction limitation between windows, shifted-window MSA (SW-MSA) is introduced following the W-MSA structure for cross-window interaction without additional computation. Unlike W-MSA, SW-MSA employs a cyclic-shifting mechanism to aggregate non-adjacent sub-windows from feature maps and thereby promote information interaction. With this shifted window partitioning mechanism, SW-MSA can be written as

\begin{matrix} \text{SW-MSA} (z^{l}) = Softmax (\frac{Q K^{T}}{\sqrt{d}} + B + M) V, \end{matrix}

(4)

where M is a predefined binary mask matrix. It enforces the attention computation to be performed only within specific sub-windows after the shift operation by setting the attention scores between non-adjacent patches to negative infinity.
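The effect of the additive mask M in Equation (4) can be illustrated with a small sketch: adding negative infinity to a score before the Softmax drives the corresponding attention weight to exactly zero. The window layout, the score values, and the all-zero bias B below are assumptions chosen only for illustration.

```python
import math

NEG_INF = float("-inf")

def masked_softmax(scores, bias, mask):
    """Row-wise Softmax(scores + bias + mask), following Equation (4)."""
    out = []
    for s_row, b_row, m_row in zip(scores, bias, mask):
        logits = [s + b + m for s, b, m in zip(s_row, b_row, m_row)]
        peak = max(logits)                          # finite: each row keeps one unmasked entry
        exps = [math.exp(v - peak) for v in logits]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# Four patches in one shifted window (assumed layout): patches {0, 1} and
# {2, 3} come from different regions, so cross-group attention is masked.
groups = [0, 0, 1, 1]
mask = [[0.0 if gi == gj else NEG_INF for gj in groups] for gi in groups]
scores = [[0.5, 1.0, 2.0, 0.1] for _ in range(4)]   # illustrative Q K^T / sqrt(d)
bias = [[0.0] * 4 for _ in range(4)]                # relative position bias B, zeros here
weights = masked_softmax(scores, bias, mask)
```

Patch 0 has the largest raw score toward patch 2, yet its weight there is zero because the mask removes that pair before normalization.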

2.2. Generative Adversarial Network

Generative Adversarial Networks (GANs) were first proposed by Goodfellow et al. in 2014 [33]. The core idea of the GAN is to use two networks, a generator G and a discriminator D, which engage in adversarial training to model the distribution of data by generating realistic samples from random noise:

\begin{matrix} L_{GAN (G)} = E_{z \sim p_{z}} [log (1 - D (G (z)))], \\ L_{GAN (D)} = - E_{x \sim p_{data}} [log D (x)] - E_{z \sim p_{z}} [log (1 - D (G (z)))], \end{matrix}

(5)

where $x \sim p_{data}$ represents real samples from the data distribution and $z \sim p_{z}$ is the noise vector.

The discriminator attempts to maximize the ability to distinguish between real and fake samples, while the generator tries to generate fake data that fools the discriminator. In recent years, the GAN has shown significant potential in tasks such as image super-resolution and speech enhancement, where it demonstrates strong generative power for learning complex, high-dimensional features [26]. Compared to traditional denoising methods, the GAN is more robust to non-stationary noise and is capable of learning intricate noise patterns. Moreover, GANs can learn more complex features and background noise patterns, offering stronger modeling capabilities for highly dynamic and non-stationary acoustic signals. The GAN’s ability to model and separate these signals makes it a promising approach for tasks that involve complex interference or environmental noise [34].

Considering that the classical method is potentially affected by vanishing gradients due to the sigmoid cross-entropy loss used for training, the least-squares GAN (LSGAN) approach substitutes the cross-entropy loss by the least-squares function with binary coding. With this in mind, the formulation in Equation (5) changes to

\begin{matrix} L_{LSGAN (G)} = \frac{1}{2} E_{z \sim p_{z}} [{(D (G (z)) - α)}^{2}], \\ L_{LSGAN (D)} = \frac{1}{2} E_{x \sim p_{data}} [{(D (x) - α)}^{2}] + \frac{1}{2} E_{z \sim p_{z}} [{(D (G (z)) - β)}^{2}], \end{matrix}

(6)

where $α = 1$ and $β = 0$ denote the labels of real and fake samples, respectively.
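To make the vanishing-gradient argument concrete, the sketch below compares the gradient that reaches the generator under the two losses when the discriminator confidently rejects a generated sample. It assumes that `a` is the discriminator's raw (pre-activation) score for a generated sample, that the classical GAN applies a sigmoid to it, and that the least-squares variant uses the raw score directly with the real label as the generator's target, as in Equations (20)–(22). The function names are hypothetical.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def grad_saturating_gan(a):
    """d/da log(1 - sigmoid(a)): generator gradient under the loss in Equation (5)."""
    return -sigmoid(a)

def grad_lsgan(a, alpha=1.0):
    """d/da 0.5 * (a - alpha)^2: generator gradient under a least-squares loss."""
    return a - alpha

# The more negative the score, the more confidently the sample is rejected.
for a in (-8.0, -2.0, 0.0):
    print(f"a={a:5.1f}  cross-entropy grad={grad_saturating_gan(a):.6f}  "
          f"least-squares grad={grad_lsgan(a):.1f}")
```

With the cross-entropy loss the gradient magnitude shrinks toward zero as the score becomes more negative, whereas the least-squares gradient grows with the distance from the target label.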

3. Methodology

In this section, the GWAM, which is a core component of the generator responsible for the extraction of periodicity characteristics, is first introduced. Then, the proposed adversarial mechanism comprising a recursive discriminative architecture is further described. Subsequently, the overall architecture and optimization strategy of RGAD are elaborated sequentially. Finally, the complete RGAD-based ABD framework and corresponding diagnosis procedure are presented.

3.1. Global Window-Aware Attention Module

As the core component of the generator in the RGAD model, the designed Global Window-aware Attention Module (GWAM) consists mainly of Layer Normalization (LN), Window-based Multi-Head Self-Attention (W-MSA), residual connections, and a Multi-Layer Perceptron (MLP). It is responsible for capturing the periodic structural features of gear rotational acoustic signals under non-stationary noise by establishing long-range time–frequency (T-F) dependencies. The operational mechanism is illustrated in Figure 1. Let the original noisy time–frequency input signal be denoted as $x_{n} \in R^{T \times F \times 1}$, where T and F represent the numbers of time frames and frequency bins, respectively. The input spectrum is partitioned into P patches of size $T^{'} \times F^{'}$ to form the localized T-F representation $x_{n}^{p} \in R^{P \times T^{'} F^{'}}$, where $P = \frac{T}{T^{'}} \cdot \frac{F}{F^{'}}$. After $x_{n}^{p}$ is projected into the patch embedding space and normalized using LN, the patch embedding features are fed into the W-MSA module to explore T-F dependencies across global patches. Specifically, $x_{n}^{p}$ is projected through three linear layers to obtain higher-dimensional ($3 T^{'} F^{'}$) query (Q), key (K), and value (V) matrices, which are computed by

\begin{matrix} Q = x_{n}^{p} W_{Q}, \end{matrix}

(7)

\begin{matrix} K = x_{n}^{p} W_{K}, \end{matrix}

(8)

\begin{matrix} V = x_{n}^{p} W_{V}, \end{matrix}

(9)

where $W_{Q}$, $W_{K}$, and $W_{V}$ are learnable parameter matrices of size $[T^{'} F^{'}, 3 T^{'} F^{'}]$, and the dimensions of the output Q, K, and V are $[P, 3 T^{'} F^{'}]$. Next, we set the number of attention heads to h and perform attention splitting on Q, K, and V:

Q, K, V \overset{Split}{\to} Q_{i}, K_{i}, V_{i} \in R^{h \times P \times \frac{3 T^{'} F^{'}}{h}},

(10)

where i denotes the i-th attention head. The global attention weight $H_{i}$ for a single attention head is computed as follows:

H_{i} = Attention (Q_{i}, K_{i}, V_{i}) = Softmax (\frac{Q_{i} K_{i}^{T}}{\sqrt{d}}) V_{i},

(11)

where the superscript $T$ denotes the matrix transpose operation. By concatenating the h single-head attention weights, the final global attention weight $x_{n}^{A}$ of the W-MSA module is obtained:

x_{n}^{A} = ε (Linear (Concat (H_{1}, H_{2}, \ldots, H_{h}))),

(12)

where $ε (\cdot)$ denotes the nonlinear activation function and $Linear (\cdot)$ denotes the fully connected layer. The linear layer projects the concatenated features from the embedding space, $R^{P \times 3 T^{'} F^{'}}$, back to the original input dimension, so that $x_{n}^{A} \in R^{P \times T^{'} F^{'}}$. The output of W-MSA is then added to the original input $x_{n}^{p}$ via a residual connection, yielding an intermediate representation $x_{n}^{'}$. This design helps alleviate the vanishing gradient problem and enhances the training stability of deep networks. Next, $x_{n}^{'}$ undergoes a second normalization step and is fed into an MLP to further improve the representation of features. Finally, the output of the MLP is fused with $x_{n}^{'}$ via another residual connection, producing the final output $x_{n}^{''}$ of the GWAM.

As a fundamental building block of the generator in RGAD, the GWAM adaptively models global dependencies along the T-F dimension of gear rotation acoustic signals. This is beneficial to the capture of periodic structural features under non-stationary noise interference, which facilitates the perception of key discriminative features related to faults and provides a cleaner signal representation for subsequent diagnostic tasks.
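The patch partitioning step that feeds the GWAM can be sketched as follows. This is a minimal illustration under assumed sizes (T = 4, F = 6, patches of 2 × 3); the function name `patch_partition` and the row-major flattening order inside each patch are assumptions, since the source does not specify them.

```python
def patch_partition(spec, t_patch, f_patch):
    """Split a T x F spectrogram (list of rows) into P flattened patches.

    Returns a P x (t_patch * f_patch) list with
    P = (T / t_patch) * (F / f_patch).
    """
    t_len, f_len = len(spec), len(spec[0])
    if t_len % t_patch or f_len % f_patch:
        raise ValueError("T and F must be divisible by the patch size")
    patches = []
    for t0 in range(0, t_len, t_patch):          # walk over patch rows (time)
        for f0 in range(0, f_len, f_patch):      # walk over patch columns (frequency)
            patch = [spec[t][f]
                     for t in range(t0, t0 + t_patch)
                     for f in range(f0, f0 + f_patch)]
            patches.append(patch)
    return patches

# Illustrative sizes only: T = 4 frames, F = 6 bins, T' x F' = 2 x 3.
spec = [[t * 6 + f for f in range(6)] for t in range(4)]
patches = patch_partition(spec, 2, 3)
num_patches = len(patches)          # P
patch_dim = len(patches[0])         # T' F'
embed_dim = 3 * patch_dim           # width of Q, K, and V per patch
```

With these sizes the partition yields P = 4 patches of dimension 6, and the Q, K, and V projections would each have width 18.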

3.2. Recursive Adversarial Mechanism

In non-stationary noise environments, high-quality reconstruction of gear fault features is hard to achieve based solely on the complex nonlinearity model with conventional regularization loss. Therefore, an adversarial mechanism is introduced as a new path to cooperate with the regularization loss for the improvement of feature reconstruction ability under noise interference. Considering that conventional adversarial training based on a single discriminator is insufficient to effectively constrain the details of acoustic signals at the global level due to potential issues such as vanishing gradients, mode collapse, and unstable convergence, a recursive adversarial mechanism is proposed, in which multiple discriminators are embedded within the multi-stage structure of the generator to perform a max-min game with equivalent GWAM units. This design decomposes the conventional mapping process into multi-stage adversarial learning by constructing auxiliary sub-GANs, which not only alleviates vanishing gradients but also facilitates the stable reconstruction of acoustic feature texture details in a “coarse-to-fine” recursive manner. The specific structures are illustrated in Figure 2.

Three discriminators with different structures, $D_{1}$, $D_{2}$, and $D_{3}$, impose constraints on the outputs generated by the GWAM at different levels of the generator. They are mainly composed of convolutional layers with different configurations and fully connected (FC) layers. Given an input spectrogram $X \in R^{T \times F}$ and a convolution kernel $K \in R^{W \times H}$, the convolution calculation process with stride s and padding p is defined as follows:

\begin{matrix} {Conv}_{W \times H} (X_{(i, j)}) = φ (\sum_{m = 0}^{W - 1} \sum_{n = 0}^{H - 1} X (i + m, j + n) \cdot K (m, n)), \end{matrix}

(13)

where $(i, j)$ denotes the location coordinates of the convolution kernel on the input feature map, $φ (\cdot)$ denotes the LeakyReLU activation function, and K denotes the convolution kernel with a size of $W \times H$. To align with the recursive adversarial denoising process in a “coarse-to-fine” manner, the discriminators are designed following a progressive strategy from weak to strong, with the specific parameter configurations detailed in Table 1.
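The convolution in Equation (13) can be sketched in a few lines of plain Python. The example below omits padding for brevity, and the LeakyReLU negative slope of 0.2 is an assumed value, since it is not stated here; the function names are likewise illustrative.

```python
def leaky_relu(v, slope=0.2):
    """LeakyReLU; the slope of 0.2 is an assumption for illustration."""
    return v if v >= 0 else slope * v

def conv2d(x, kernel, stride=1, slope=0.2):
    """Unpadded 2-D convolution followed by LeakyReLU, per Equation (13)."""
    w, h = len(kernel), len(kernel[0])
    rows = (len(x) - w) // stride + 1
    cols = (len(x[0]) - h) // stride + 1
    out = []
    for r in range(rows):
        i = r * stride                      # kernel row position on the input
        out_row = []
        for c in range(cols):
            j = c * stride                  # kernel column position on the input
            acc = sum(x[i + m][j + n] * kernel[m][n]
                      for m in range(w) for n in range(h))
            out_row.append(leaky_relu(acc, slope))
        out.append(out_row)
    return out

# Toy 4 x 4 input with a 2 x 2 all-ones kernel and stride 2.
x = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [-9, -10, -11, -12],
     [-13, -14, -15, -16]]
y = conv2d(x, [[1, 1], [1, 1]], stride=2)
```

The top row of `y` passes through unchanged because the sums are positive, while the bottom row is scaled by the negative slope.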

$D_{1}$ is configured with a shallow nonlinear mapping that provides mild constraints, thereby stabilizing the early training stage, guiding the generator to eliminate prominent noise components, and preventing excessive penalization in the initial noise reduction phase:

\begin{matrix} D_{1} (X_{1}) & = FC ({Conv}_{2 \times 2} ({Conv}_{4 \times 4} (X_{1}))), \end{matrix}

(14)

where $X_{1}$ represents either the clean signal or the denoised signal generated by $G_{1}$, and ${Conv}_{2 \times 2}$ and ${Conv}_{4 \times 4}$ denote convolutional layers with kernel sizes of $2 \times 2$ and $4 \times 4$, respectively. $D_{2}$ further increases the network depth based on the $D_{1}$ structure and cooperates with $G_{2}$ to extract finer details and periodic structural features, thus enhancing the feedback on intermediate denoising results:

\begin{matrix} D_{2} (X_{2}) & = FC ({Conv}_{2 \times 2} ({Conv}_{4 \times 4} ({Conv}_{4 \times 4} (X_{2})))), \end{matrix}

(15)

where $X_{2}$ represents either the clean signal or the denoised signal generated by $G_{2}$. $D_{3}$, which has the deepest structure, enforces constraints on the residual noise at the final stage and thereby drives $G_{3}$ to restore the periodic structures and texture details along the T-F dimension:

\begin{matrix} D_{3} (X_{3}) & = FC ({Conv}_{1 \times 1} ({Conv}_{2 \times 2} ({Conv}_{4 \times 4} ({Conv}_{4 \times 4} (X_{3}))))), \end{matrix}

(16)

where $X_{3}$ represents either the clean signal or the denoised signal generated by $G_{3}$. Such a recursive discriminative architecture with progressively increasing intensity ensures that the difficulty of discrimination matches that of generation, thereby enhancing the stability of model convergence and enabling a coarse-to-fine denoising process.

Recursive Generative Adversarial Denoising learning is a strategy that progressively enhances the denoising performance through the multi-level optimization of generated outputs, where the preceding result serves as the input for the subsequent stage, thereby achieving coarse-to-fine feature reconstruction and error correction. The overall structure of the proposed RGAD is shown in Figure 3. Each generator and its corresponding discriminator, ${(G_{t}, D_{t})}_{t = 1}^{T}$ with $T = 3$ (here T denotes the number of stages rather than the number of time frames), jointly function as a basic adversarial unit, and the adversarial units are serially connected across stages to form the overall RGAD structure. Specifically, the output of generator $G_{t}$ at one stage is fed as the input into the subsequent stage to achieve recursive denoising, where each $G_{t}$ is mainly composed of two consecutive GWAMs, described in Section 3.1. At the same time, the output of $G_{t}$ is reconstructed from the patch level to the original T-F dimension and input into $D_{t}$ for adversarial discrimination and gradient feedback. The overall process can be expressed as

\begin{matrix} x_{g}^{t} = G_{t} (x_{g}^{t - 1}; θ_{G_{t}}), \end{matrix}

(17)

where $G_{t}$ receives the denoised output of the previous stage, $x_{g}^{t - 1}$, and generates the denoised signal $x_{g}^{t}$. When $t = 1$, the input is the original noisy signal $x_{n}^{p}$, which is obtained through the patch partitioning operation described in Section 3.1. $θ_{G_{t}}$ denotes the learnable parameters of the generator at the $t$-th stage in the recursive learning framework, where the parameters of the generator are entirely derived from the GWAM. The processed output $x_{g}^{t}$ is then passed into discriminator $D_{t}$, which evaluates its similarity to the target clean signal $x_{c}$:

\begin{matrix} d_{t}^{g} = D_{t} (x_{g}^{t}; θ_{D_{t}}), \\ d_{t}^{c} = D_{t} (x_{c}; θ_{D_{t}}), \end{matrix}

(18)

where $d_{t}^{g}$ denotes the output score of the discriminator for the denoised signal $x_{g}^{t}$, $d_{t}^{c}$ denotes the output score for the clean signal $x_{c}$, and $θ_{D_{t}}$ represents the learnable parameters of the discriminator. During the adversarial process, the clean spectrum $x_{c}$ is used as the real sample and fed into the current discriminator to be distinguished from the generated denoised results. In this way, the discriminator can learn the statistical features of the true clean spectrogram and guide the generator to progressively produce outputs closer to the real distribution of $x_{c}$ through discrimination error feedback. The final denoised output $x_{g}^{3}$ can thus be expressed as

\begin{matrix} x_{g}^{3} = G_{3} (x_{g}^{2}; θ_{G_{3}}), \\ x_{g}^{2} = G_{2} (x_{g}^{1}; θ_{G_{2}}), \\ x_{g}^{1} = G_{1} (x_{n}^{p}; θ_{G_{1}}), \\ x_{n}^{p} = Patch (x_{n}), \end{matrix}

(19)

where $Patch (\cdot)$ represents the patch partitioning operation introduced in Section 3.1.
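The data flow of Equations (17) and (19) can be sketched as a simple loop in which each stage consumes the previous stage's output and every intermediate result is kept for its own discriminator. The stand-in stages below merely scale their input and are not the GWAM-based generators; `run_recursive` and `make_stage` are hypothetical names introduced for illustration.

```python
def run_recursive(x_patches, generators):
    """Apply G_1 ... G_T in sequence and return every intermediate x_g^t."""
    outputs = []
    x = x_patches                 # x_g^0 is the patched noisy input x_n^p
    for g in generators:
        x = g(x)                  # x_g^t = G_t(x_g^{t-1})
        outputs.append(x)         # kept so that D_t can score x_g^t
    return outputs

def make_stage(scale):
    """Stand-in for a generator stage: scales every value (illustrative only)."""
    return lambda patches: [[scale * v for v in p] for p in patches]

stages = [make_stage(0.5), make_stage(0.5), make_stage(0.5)]   # T = 3 stages
x_n_p = [[8.0, 4.0]]                                           # one toy patch
x_g = run_recursive(x_n_p, stages)
final = x_g[-1]                                                # x_g^3
```

Keeping the list of intermediate outputs mirrors the design in which each $D_{t}$ receives the output of its own stage rather than only the final result.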

This recursive generative adversarial strategy enables the model to effectively mitigate the vanishing gradient problem and significantly enhance its capability to restore the texture details of acoustic features in noise interference. The loss calculation and optimization strategy of RGAD are elaborated in detail in Section 3.3.

3.3. Loss Functions and Optimization Strategies

Considering that vanilla GANs are prone to vanishing gradients and training oscillation due to the sigmoid cross-entropy loss, this work replaces the cross-entropy loss with the LSGAN loss and combines it with the Mean Squared Error (MSE) to form a hybrid optimization objective that provides a stable training process. The hybrid loss computation of the basic adversarial unit is illustrated in Figure 4. Specifically, the MSE is combined with the generator loss to impose a constraint with respect to the denoised spectrum in each adversarial unit. The modified loss functions for the three generators $G_{t}$ are formulated as

\begin{matrix} L_{G_{1}} = λ \frac{1}{N} \sum_{i = 1}^{N} {(x_{c, i} - G_{1} (x_{n, i}^{p}))}^{2} + \frac{1}{2} E_{x_{n}} [{(D_{1} (G_{1} (x_{n}^{p})) - α)}^{2}], \end{matrix}

(20)

\begin{matrix} L_{G_{2}} = λ \frac{1}{N} \sum_{i = 1}^{N} {(x_{c, i} - G_{2} (x_{g, i}^{1}))}^{2} + \frac{1}{2} E_{x_{g}^{1}} [{(D_{2} (G_{2} (x_{g}^{1})) - α)}^{2}], \end{matrix}

(21)

\begin{matrix} L_{G_{3}} = λ \frac{1}{N} \sum_{i = 1}^{N} {(x_{c, i} - G_{3} (x_{g, i}^{2}))}^{2} + \frac{1}{2} E_{x_{g}^{2}} [{(D_{3} (G_{3} (x_{g}^{2})) - α)}^{2}], \end{matrix}

(22)

where N represents the number of samples, $x_{c, i}$ denotes the i-th clean sample, $x_{n, i}^{p}$ indicates the i-th noisy sample, $x_{g, i}^{t}$ represents the i-th denoised sample from stage t, and $λ$ is a weighting parameter set to 100 in this work to balance the reconstruction (MSE) term against the adversarial term, following previous studies on GANs. Meanwhile, it is worth noting that the Label Smoothing strategy is further introduced into the adversarial loss, which sets the real label to $α = 0.9$. By assigning the target value of real samples to a constant slightly less than 1, the risk of overfitting is effectively reduced and the training stability can be further improved. Correspondingly, the loss functions for the three discriminators $D_{t}$ are as follows:

\begin{matrix} L_{D_{1}} = \frac{1}{2} E_{x_{c}} [{(D_{1} (x_{c}) - α)}^{2}] + \frac{1}{2} E_{x_{n}} [{(D_{1} (G_{1} (x_{n}^{p})) - β)}^{2}], \end{matrix}

(23)

\begin{matrix} L_{D_{2}} = \frac{1}{2} E_{x_{c}} [{(D_{2} (x_{c}) - α)}^{2}] + \frac{1}{2} E_{x_{g}^{1}} [{(D_{2} (G_{2} (x_{g}^{1})) - β)}^{2}], \end{matrix}

(24)

\begin{matrix} L_{D_{3}} = \frac{1}{2} E_{x_{c}} [{(D_{3} (x_{c}) - α)}^{2}] + \frac{1}{2} E_{x_{g}^{2}} [{(D_{3} (G_{3} (x_{g}^{2})) - β)}^{2}], \end{matrix}

(25)

where $β$ represents the fake label and is set to $β = 0$ in this work. The losses at each stage of the recursive three-stage training process are computed separately and optimized jointly in an end-to-end manner.
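A minimal numerical sketch of one adversarial unit's hybrid losses, following the form of Equations (20) and (23) with λ = 100, α = 0.9, and β = 0, is given below. It assumes that each sample is a flattened list of spectrum values and that the MSE is averaged over all elements, which is one possible reading of the per-sample squared error; the function names and toy values are illustrative.

```python
def mse(clean, denoised):
    """Mean squared error over all elements of all samples (assumed reduction)."""
    sq = [(c - g) ** 2
          for cs, gs in zip(clean, denoised)
          for c, g in zip(cs, gs)]
    return sum(sq) / len(sq)

def generator_loss(clean, denoised, d_fake, lam=100.0, alpha=0.9):
    """lam * MSE + 0.5 * E[(D(G(x)) - alpha)^2], the form of Equation (20)."""
    adv = 0.5 * sum((s - alpha) ** 2 for s in d_fake) / len(d_fake)
    return lam * mse(clean, denoised) + adv

def discriminator_loss(d_real, d_fake, alpha=0.9, beta=0.0):
    """0.5 * E[(D(x_c) - alpha)^2] + 0.5 * E[(D(G(x)) - beta)^2], as in Equation (23)."""
    real = 0.5 * sum((s - alpha) ** 2 for s in d_real) / len(d_real)
    fake = 0.5 * sum((s - beta) ** 2 for s in d_fake) / len(d_fake)
    return real + fake

# Toy values: one clean/denoised pair and one discriminator score for each.
clean, denoised = [[1.0, 0.0]], [[0.5, 0.0]]
d_real, d_fake = [0.8], [0.4]
loss_g = generator_loss(clean, denoised, d_fake)
loss_d = discriminator_loss(d_real, d_fake)
```

In this toy case the weighted MSE contributes 12.5 of the generator loss and the adversarial term only 0.125, which shows how strongly λ = 100 emphasizes reconstruction fidelity.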

This paper employs a recursive learning framework to optimize the aforementioned loss functions. The core idea of recursion is to replace the complex direct mapping with a multi-stage indirect mapping process, achieving progressive denoising through multiple simplified denoising modules. Specifically, we utilize the Adam optimizer to independently optimize the three sets of generators and discriminators ${(G_{t}, D_{t})}_{t = 1}^{T}$, $T = 3$. Within each group, $G_{t}$ and $D_{t}$ share an optimizer and are updated simultaneously to maintain the dynamic balance of the adversarial process. It should be noted that this recursive architecture not only prevents error accumulation but also helps mitigate the vanishing gradient problem caused by network depth. The specific update rules for the generator parameters are as follows:

\begin{matrix} θ_{G_{1}} \leftarrow θ_{G_{1}} - η_{adam} (\frac{\partial L_{G_{1}}}{\partial θ_{G_{1}}} + \frac{\partial L_{G_{2}}}{\partial θ_{G_{1}}} + \frac{\partial L_{G_{3}}}{\partial θ_{G_{1}}}), \\ θ_{G_{2}} \leftarrow θ_{G_{2}} - η_{adam} (\frac{\partial L_{G_{2}}}{\partial θ_{G_{2}}} + \frac{\partial L_{G_{3}}}{\partial θ_{G_{2}}}), \\ θ_{G_{3}} \leftarrow θ_{G_{3}} - η_{adam} (\frac{\partial L_{G_{3}}}{\partial θ_{G_{3}}}), \end{matrix}

(26)

where $η_{adam}$ represents the Adam optimizer with a fixed learning rate of $η = 0.0002$. This configuration plays a crucial role in maintaining the delicate balance between the generators and discriminators throughout the adversarial training process. Similarly, each discriminator also updates its parameters along the gradient descent direction, expressed as

\begin{matrix} θ_{D_{1}} \leftarrow θ_{D_{1}} - η_{adam} (\frac{\partial L_{D_{1}}}{\partial θ_{D_{1}}} + \frac{\partial L_{D_{2}}}{\partial θ_{D_{1}}} + \frac{\partial L_{D_{3}}}{\partial θ_{D_{1}}}), \\ θ_{D_{2}} \leftarrow θ_{D_{2}} - η_{adam} (\frac{\partial L_{D_{2}}}{\partial θ_{D_{2}}} + \frac{\partial L_{D_{3}}}{\partial θ_{D_{2}}}), \\ θ_{D_{3}} \leftarrow θ_{D_{3}} - η_{adam} (\frac{\partial L_{D_{3}}}{\partial θ_{D_{3}}}) . \end{matrix}

(27)
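
The summation pattern in the update rules, where an earlier stage receives gradients from its own loss and from every later stage's loss, can be illustrated with a toy example. The sketch below rests on strong simplifying assumptions that are not taken from the paper: each "generator" is a single scalar gain, only the squared-error part of the loss is used, gradients are estimated by finite differences, and plain gradient descent stands in for Adam. All function names are hypothetical.

```python
def stage_outputs(thetas, x_noisy):
    """Chain toy scalar 'generators': each stage rescales the previous output."""
    outputs, x = [], x_noisy
    for theta in thetas:
        x = theta * x
        outputs.append(x)
    return outputs


def stage_losses(thetas, x_noisy, x_clean):
    """Per-stage squared error; the adversarial term is omitted in this toy."""
    return [(x_clean - x) ** 2 for x in stage_outputs(thetas, x_noisy)]


def grad(thetas, x_noisy, x_clean, k, t, eps=1e-6):
    """Central finite difference of the stage-t loss w.r.t. the stage-k parameter."""
    upper = list(thetas)
    lower = list(thetas)
    upper[k] += eps
    lower[k] -= eps
    diff = (stage_losses(upper, x_noisy, x_clean)[t]
            - stage_losses(lower, x_noisy, x_clean)[t])
    return diff / (2 * eps)


def update(thetas, x_noisy, x_clean, lr=0.01):
    """One plain gradient step: stage k sums the gradients of losses k..T."""
    n = len(thetas)
    return [
        thetas[k] - lr * sum(grad(thetas, x_noisy, x_clean, k, t) for t in range(k, n))
        for k in range(n)
    ]
```

In this toy, the loss of an earlier stage does not depend on the parameters of a later stage, so `grad(..., k=1, t=0)` is zero. That is why each row of the update rule only sums over the current and later stages.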

3.4. Overall RGAD-Based ABD Framework

The proposed RGAD is integrated with the fault diagnosis network (FDN) to construct the complete RGAD-based ABD framework. The overall structure of the framework is shown in Figure 5. First, the original gear rotational acoustic signal collected under noisy conditions is transformed into a spectral representation $x_{n}$ and then fed into RGAD to progressively remove noise components and gradually reconstruct periodic structural features and texture details. Subsequently, the denoised spectrum $x_{g}^{3}$ output by the final-stage generator $G_{3}$ is input into the FDN for sequential fault feature extraction and classification.

As shown in the lower part of Figure 5, the FDN mainly consists of depthwise separable convolution, group convolution, multi-scale convolution, global pooling, and a fully connected layer. The first three components of the network form the feature extraction module, which is designed to capture the periodic pulse patterns and time–frequency harmonic distributions associated with fault characteristics, while the global pooling and fully connected layers are utilized for fault classification based on the extracted features.

3.5. Fault Diagnosis Procedure of the Proposed Framework

The fault diagnosis process of the RGAD-based ABD framework is shown in Figure 6. The original gear rotational acoustic signals are first collected in real factory environments and in a semi-anechoic chamber using the established acoustic measurement platform (for details, see Section 4.1.1). Subsequently, each acquired one-dimensional waveform is preprocessed into a two-dimensional spectrum through the method presented in Section 4.1.2, and the resulting spectra are used for training, validation, and testing. During the training stage, noise suppression and fault diagnosis are treated as entirely independent processes: the spectra containing non-stationary noise are used exclusively to train the RGAD model in a supervised manner, with the clean spectra serving as targets, while the clean spectra are independently used to train the FDN. During the testing phase, the trained RGAD model and the FDN are cascaded to perform gear fault pattern recognition on acoustic samples collected in real industrial environments.
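
The cascaded test-time procedure can be summarized as a simple function composition. The following is a minimal sketch under stated assumptions: the trained generators and the FDN are represented by arbitrary callables, and the name `diagnose` and the `Spectrum` alias are hypothetical, not part of the published method.

```python
from typing import Callable, List, Sequence

Spectrum = List[List[float]]


def diagnose(
    noisy_spectrum: Spectrum,
    generators: Sequence[Callable[[Spectrum], Spectrum]],
    classifier: Callable[[Spectrum], int],
) -> int:
    """Pass the spectrum through each trained generator in order, then classify."""
    denoised = noisy_spectrum
    for generator in generators:  # G1 -> G2 -> G3
        denoised = generator(denoised)
    return classifier(denoised)
```

Because the denoiser and the classifier are trained independently, either one could in principle be swapped out at this step without retraining the other, provided the spectrum format is unchanged.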

4. Experiments

In this section, we first present the experimental setup, including the experimental conditions, data acquisition process, and implementation details. Subsequently, a series of ablation studies and comparative experiments are conducted to evaluate the effectiveness of the proposed RGAD method against non-stationary noise interference and to further demonstrate the superiority of the RGAD-based ABD framework for gear fault diagnosis under real industrial scenarios.

4.1. Experimental Setup

4.1.1. Experiment System

The experimental investigations were conducted on a self-designed gearbox test platform, as illustrated in Figure 7a–d. The test system can be divided into three major components: (i) the gearbox test bench, (ii) the acoustic measurement setup, and (iii) the data acquisition and recording module. The gearbox test bench is driven by a variable-frequency motor controlled by a frequency inverter, and it is mechanically loaded through a tension controller and a magnetic brake. To capture the acoustic response, a four-channel microphone array is deployed around the gearbox. The microphones are mounted in either a hemispherical or rectangular enclosing configuration, with symmetric placement that complies with the ISO 3745:2012 standard. The acoustic signals are transmitted via Bayonet Nut Connectors (BNCs) to a multi-channel data acquisition instrument, ensuring stable and synchronized signal collection. Finally, the data are stored in .WAV format by dedicated recording software for subsequent analysis.

Accordingly, two experimental datasets were established, one comprising clean acoustic signals and the other comprising noise-contaminated acoustic signals. The datasets cover four gear health states under three different rotational speeds, including one normal condition and three representative fault types. For each health state, 35 four-channel audio recordings of 60 s were collected. The recordings were then divided into training, validation, and test sets with an approximate ratio of 70%, 15%, and 15%, respectively. To facilitate effective modeling in acoustic signal processing tasks, each 60 s audio file in .WAV format was further segmented into non-overlapping 1 s segments, thereby ensuring a sufficient number of samples in each dataset.
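
The segmentation step can be sketched as follows. This is an illustrative sketch only: the function name `split_into_segments` is hypothetical, a single channel is assumed to be available as a list of samples, and any trailing samples that do not fill a whole segment are assumed to be discarded.

```python
def split_into_segments(samples, sample_rate, segment_seconds=1.0):
    """Cut one channel into non-overlapping fixed-length segments."""
    segment_len = int(round(sample_rate * segment_seconds))
    if segment_len <= 0:
        raise ValueError("segment length must be at least one sample")
    n_segments = len(samples) // segment_len  # incomplete tail is dropped
    return [
        samples[i * segment_len:(i + 1) * segment_len]
        for i in range(n_segments)
    ]
```

At the 16 kHz sampling rate used in this work, one channel of a 60 s recording holds 960,000 samples and therefore yields 60 segments of 16,000 samples each.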

4.1.2. Data Preprocessing

In the experiments, the acoustic signals from both datasets were sampled at a frequency of 16 kHz. The signals were framed using a 64 ms Hamming window with 50% overlap between consecutive frames. For each frame, a Mel-frequency filter bank consisting of 40 filters was applied, producing a 40-dimensional feature vector. Consequently, each 1 s acoustic sample was transformed into a two-dimensional Mel-frequency spectrum, which provides a time–frequency (T-F) representation of the gear acoustic signals. To facilitate model training, the Mel-frequency spectra of both clean and noise-contaminated signals were normalized by scaling their amplitudes into the range $[-1, 1]$.
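
The framing and normalization steps can be sketched in plain Python. Several details here are assumptions rather than facts from the text: frames are taken without padding, the symmetric form of the Hamming window is used, and the scaling to $[-1, 1]$ is done by min-max normalization over the whole spectrum. The Mel filter bank itself is omitted, and all function names are hypothetical.

```python
import math


def frame_params(sample_rate=16000, window_ms=64, overlap=0.5):
    """Convert the window duration and overlap into sample counts."""
    frame_len = int(sample_rate * window_ms / 1000)
    hop = int(frame_len * (1 - overlap))
    return frame_len, hop


def hamming(n):
    """Symmetric Hamming window of length n (n >= 2)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def frame_signal(samples, frame_len, hop):
    """Slice the signal into overlapping windowed frames (no padding)."""
    if len(samples) < frame_len:
        return []
    window = hamming(frame_len)
    n_frames = 1 + (len(samples) - frame_len) // hop
    return [
        [samples[f * hop + i] * window[i] for i in range(frame_len)]
        for f in range(n_frames)
    ]


def normalize_to_unit_range(spectrum):
    """Min-max scale a 2-D spectrum into [-1, 1] (assumed normalization)."""
    flat = [v for row in spectrum for v in row]
    low, high = min(flat), max(flat)
    if high == low:
        return [[0.0 for _ in row] for row in spectrum]
    return [[2 * (v - low) / (high - low) - 1 for v in row] for row in spectrum]
```

Under these assumptions, a 64 ms window at 16 kHz is 1024 samples with a hop of 512 samples, so a 1 s segment produces 30 frames; with 40 Mel filters per frame this would give a 40 × 30 spectrum. A different padding convention would change the frame count.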

4.1.3. Evaluation Metrics

The primary objective of RGAD is to eliminate non-stationary noise components from the input acoustic signals through the proposed recursive adversarial strategy while simultaneously refining periodic structural features and texture details that may be obscured by noise, thereby facilitating improved subsequent diagnostic performance. To quantitatively evaluate the effectiveness of this strategy, two widely adopted similarity metrics, Mean Squared Error (MSE) and Peak Signal-to-Noise Ratio (PSNR), are introduced for assessing the signal-level denoising quality. Furthermore, the diagnostic accuracy is employed to evaluate the performance at the task level.

MSE quantifies the global statistical error by calculating the average squared difference between the denoised spectrum and the clean target spectrum. It provides a direct measure of the overall deviation, where a lower MSE value indicates a smaller error and reflects superior denoising performance by signifying a closer approximation to the target clean signal. The MSE between the denoised signal

x_{g}^{3}

and the clean reference

x_{c}

, both of size, is defined as

\begin{matrix} MSE = \frac{1}{N} \sum_{i = 1}^{N} {(x_{c, i} - x_{g, i}^{3})}^{2} . \end{matrix}

(28)

The PSNR is derived from the MSE and serves as a logarithmic measure of reconstruction fidelity. It is computed as the ratio between the maximum possible power of a signal and the power of the introduced noise (represented by the MSE). An increased PSNR value suggests that the noise power in the denoised signal is comparatively lower relative to the peak power of the clean signal, thus indicating enhanced denoising quality and better preservation of the signal’s structural integrity. The PSNR (in decibels, dB) is defined as

\begin{matrix} PSNR = 10 \cdot {log}_{10} (\frac{{MAX}^{2}}{MSE}), \end{matrix}

(29)

where $MAX$ represents the maximum possible value in the clean signal $x_{c}$ and is defined as 1 in the normalized spectrum.
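
Both similarity metrics are straightforward to compute. The sketch below assumes that the spectra have been flattened into equal-length lists of floats; the function names `mse` and `psnr` are illustrative.

```python
import math


def mse(clean, denoised):
    """Mean squared error between two equal-length flattened spectra."""
    if len(clean) != len(denoised) or not clean:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((c - g) ** 2 for c, g in zip(clean, denoised)) / len(clean)


def psnr(mse_value, max_value=1.0):
    """Peak signal-to-noise ratio in dB, with MAX = 1 for normalized spectra."""
    if mse_value == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse_value)
```

As a consistency check, `psnr(0.01644)` evaluates to about 17.84 dB, which agrees with the final-stage row of Table 2.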

While the MSE and PSNR evaluate the quality of the denoised signal itself, the ultimate objective is to improve the performance of downstream fault diagnosis tasks. Diagnostic accuracy is thus introduced to measure the effectiveness of the denoising process at the task level. It is defined as the ratio of correctly classified samples to the total number of samples tested, when the diagnosis is performed on the denoised signals. Higher accuracy signifies that the denoising process has more effectively retained or enhanced the discriminative features necessary for reliable classification. The accuracy is calculated as

\begin{matrix} Accuracy = \frac{T P + T N}{T P + T N + F P + F N} \times 100\%, \end{matrix}

(30)

where $TP$ (True Positives) represents the number of faulty samples correctly identified as faulty; $TN$ (True Negatives) represents the number of normal samples correctly identified as normal; $FP$ (False Positives) represents the number of normal samples incorrectly identified as faulty; and $FN$ (False Negatives) represents the number of faulty samples incorrectly identified as normal.
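
Equation (30) is written in binary terms, whereas the dataset has four health states. The text's definition of accuracy as correctly classified samples over all tested samples extends directly to the multi-class case as the sum of the confusion-matrix diagonal divided by the total count. The sketch below illustrates this; the function name and the example counts are hypothetical and are not results from the paper.

```python
def accuracy_percent(confusion):
    """Accuracy (%) from a square confusion matrix (rows: true, columns: predicted)."""
    total = sum(sum(row) for row in confusion)
    if total == 0:
        raise ValueError("confusion matrix contains no samples")
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return 100.0 * correct / total


# Hypothetical two-class counts, for illustration only.
example = [[9, 1],
           [2, 8]]
example_accuracy = accuracy_percent(example)  # 17 correct out of 20 samples
```

For two classes this reduces to Equation (30), since the diagonal holds the true positives and true negatives.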

4.2. Analysis of RGAD

The effectiveness of RGAD is evaluated at the statistical level using both the MSE and PSNR metrics, with the experimental results summarized in Table 2. Table 2 shows that the MSE between the spectrum output by each of the three adversarial units and the clean spectrum decreases from stage to stage (0.01734, 0.01654, and 0.01644), while the PSNR correspondingly increases (17.61, 17.81, and 17.84 dB) within the recursive adversarial framework. This verifies the effectiveness of the recursive strategy at the global statistical level.

To provide a more intuitive understanding of how the recursive adversarial strategy in RGAD removes non-stationary noise and restores the periodic and sparse features associated with fault patterns that are masked by noise, we conducted a visualization experiment on selected test samples, as shown in Figure 8. Comparing Figure 8a with Figure 8e shows that the meshing frequencies in the range of 200 to 500 Hz, along with their higher-order harmonics around 1000 to 1500 Hz, are obscured by non-stationary noise. The purpose of RGAD is to progressively eliminate such non-stationary industrial noise through the recursive adversarial framework and recover the obscured features in a coarse-to-fine manner. The outputs of adversarial units 1 to 3 (Figure 8b–d) show that, in the frequency domain, although the reconstruction performance of RGAD is limited for frequency components above 3000 Hz, the characteristics obscured by noise interference are progressively restored in the meshing-frequency and high-order harmonic regions. Meanwhile, the periodic structure of the gear acoustic signal along the time axis is clearly reconstructed. Further analysis incorporating time–frequency information reveals that the texture details present in the clean spectrum are also gradually recovered by RGAD. These observations support the effectiveness of RGAD in addressing non-stationary noise interference at the level of local spectral detail.

4.3. Ablation Experiments

As a core component of RGAD, the designed recursive discriminative architecture refines the denoised spectrum in a coarse-to-fine manner through progressive guidance. To further evaluate the effectiveness of this progressive adversarial mechanism, ablation studies are performed. Specifically, the three structurally different discriminators in the recursive discriminative architecture are either removed or configured with an identical structure (e.g., all adopting $D_{1}$), and the resulting models are compared with the proposed RGAD in terms of denoising performance and diagnostic accuracy. It can be clearly observed from Table 3 that an inappropriate discriminative architecture adversely affects the overall recursive process. In particular, when all adversarial units are equipped with a discriminator of the $D_{3}$ structure, the performance of adversarial unit 3 degrades, and its similarity to the clean target spectrum is lower than that of adversarial unit 2. This is because the output in the early stage is over-constrained, which blocks recursive propagation in the later stage. Conversely, once a shallow discriminator with the $D_{1}$ structure is adopted in the later stage, the final denoised spectrum produced by the recursive process differs significantly from the clean target spectrum in terms of the similarity metrics. Meanwhile, the medium-strength discriminator, configured with the $D_{2}$ structure throughout the recursive process, achieves intermediate performance between the two aforementioned cases. In contrast, constraining the recursive process solely through the regularization loss rather than the adversarial loss is beneficial for the recursive behavior and denoising results, but it suffers from over-smoothing, as illustrated in Figure 9. Combined with the accuracy values in Table 3, it can be found that the lack of spectral texture details causes the loss of diagnostic information, thereby resulting in reduced diagnostic accuracy compared to most recursive models based on the adversarial mechanism under the FDN with the same structure and parameter configuration. Furthermore, by comparing the sampled denoised spectra and the diagnostic accuracy from $D_{1}$ to $D_{3}$ in Figure 9 and Table 3, it can be seen that improved restoration quality of spectral texture details leads to a corresponding increase in diagnostic accuracy. The proposed RGAD method, benefiting from its superior capability in reconstructing spectral texture details, achieves the highest diagnostic accuracy. These experimental results confirm the positive correlation between the preservation of spectral texture details and diagnostic accuracy.

Considering that the theoretical computational complexity of the inference stage is an objective indicator of model efficiency, we focus exclusively on evaluating the computational cost after model training. As shown in Table 4, since the discriminators are not invoked during the testing phase, all methods exhibit identical FLOPs and single-sample inference time. Furthermore, although the proposed method has the highest parameter count among the compared approaches (3,980,684, versus 2,341,718 to 2,496,172 for the other configurations), it remains under 4M, reflecting a lightweight structure and indicating potential for deployment on edge devices and in practical applications. Meanwhile, the proposed method achieves the best performance at the cost of this increase in parameters, suggesting that the trade-off between complexity and performance is justified.

4.4. Comparison with Other Methods

To further evaluate the superiority of the proposed RGAD, a comparative analysis is conducted with the methods outlined in the Introduction. Specifically, the comparison encompasses the traditional denoising filtering approach based on statistical property modeling, as well as the classical generative adversarial denoising method widely adopted in speech enhancement tasks. A detailed description of these methods is provided as follows:

(1) Least Mean Square (LMS) [19] is a traditional adaptive filtering algorithm based on the Wiener filter, and it has been widely applied to non-stationary signal processing.

(2) Recursive Least Squares (RLS) [20] is a representative adaptive filtering algorithm derived from Kalman filtering theory [21], which provides superior noise tracking and estimation performance for non-stationary signals.

(3) TF-Masking [30] is a novel speech enhancement approach that innovatively employs the generative adversarial mechanism to construct masks in the time–frequency domain, thereby enabling the effective suppression of noise components under non-stationary conditions.

(4) GAN-$L_{1}$ [27] represents a classical generative adversarial denoising framework, which effectively integrates generative adversarial loss with $L_{1}$ regularization loss to achieve superior performance in suppressing non-stationary noise.
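
For readers unfamiliar with the adaptive filtering baselines, the textbook LMS update can be sketched as follows. This is a generic illustration and not the configuration used for the LMS baseline in the comparison: the tap count, the step size `mu`, and the function name `lms_filter` are all assumed values chosen for the example.

```python
def lms_filter(reference, desired, n_taps=4, mu=0.05):
    """Textbook LMS: adapt FIR weights so the filtered reference tracks `desired`.

    Returns the final weights and the per-sample error signal. In a noise
    cancellation setup, `reference` is a noise reference, `desired` is the
    noisy measurement, and the error is the cleaned output.
    """
    weights = [0.0] * n_taps
    errors = []
    for n in range(len(desired)):
        # Most recent n_taps reference samples, zero-padded before the start.
        x = [reference[n - k] if n - k >= 0 else 0.0 for k in range(n_taps)]
        y = sum(w * xi for w, xi in zip(weights, x))
        e = desired[n] - y
        # Stochastic gradient step on the instantaneous squared error.
        weights = [w + mu * e * xi for w, xi in zip(weights, x)]
        errors.append(e)
    return weights, errors
```

RLS follows the same filter structure but replaces the fixed step size with a recursively updated inverse correlation estimate, which typically gives faster tracking at a higher computational cost per sample.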

Table 5 presents the comparative results between RGAD and the aforementioned methods in denoising performance and diagnostic accuracy. It can be seen that the adversarial methods outperform the traditional filtering approaches based on statistical property modeling in terms of the global similarity metrics related to noise reduction performance. To further analyze from the perspective of spectral reconstruction, the denoising outcomes of these approaches are sampled for visual comparison, as shown in Figure 10. The spectra recovered by LMS, RLS, and TF-Masking maintain well-preserved texture details while exhibiting varying degrees of periodic structure loss. Specifically, LMS and TF-Masking fail to accurately reconstruct the periodic structural features of the signal in the high-order harmonic frequency band (approximately 1000–1500 Hz), whereas RLS demonstrates suboptimal performance in recovering periodic information around the meshing frequency (500 Hz). Taking into account the diagnostic accuracy presented in Table 5, it can be observed that RLS (79.78%), which demonstrates superior periodic structure reconstruction capability within the high-order harmonic frequency band, significantly outperforms LMS (31.99%) and TF-Masking (53.47%) in terms of diagnostic performance. This indicates that higher-order harmonics within the frequency range of approximately 1000 to 1500 Hz carry critical information associated with the gear’s health state, thereby having a significant impact on diagnostic accuracy. This is further supported by the performance of GAN-$L_{1}$, which achieves the second-highest diagnostic accuracy (95.41%) by preserving complete periodic structure information in the higher-order harmonic frequency band despite being affected by over-smoothing. In contrast, the proposed RGAD effectively balances the reconstruction of periodic structural features with the restoration of fine texture details, thereby achieving the best diagnostic performance (97.31%; see Figure 11). This further illustrates the superiority of RGAD for ABD tasks under non-stationary noise interference conditions in real industrial scenarios.

5. Conclusions

This article proposes a novel RGAD-based ABD framework for gear fault diagnosis in real industrial noise conditions. As the backbone of the model, the GWAM-based generator is capable of effectively reconstructing the periodic structural characteristics of gear acoustic signals under non-stationary noise interference by recursively capturing T-F global dependencies. The adversarial mechanism based on the recursive discriminator architecture significantly enhances the reconstruction quality of texture details for acoustic features while effectively preventing the vanishing gradient problem in adversarial learning. Through the above design, the non-stationary noise is effectively suppressed and high-quality reconstruction of the periodic structure and texture details of the gear rotation acoustic signal can be achieved in a coarse-to-fine manner. Finally, by integrating RGAD with the diagnosis module FDN, a complete RGAD-based ABD framework is established for gear fault diagnosis in real industrial scenarios. Experimental results under real industrial background noise conditions demonstrate that the proposed RGAD-based ABD framework achieves superior noise suppression and gear fault diagnosis performance.

Author Contributions

Conceptualization, Z.E. and Y.Y.; Methodology, Z.E.; Software, X.M.; Validation, L.S.; Formal analysis, Y.Y. and L.S.; Resources, Z.E.; Data curation, X.M.; Writing – original draft, Z.E. and X.M.; Writing – review & editing, Z.E., Y.Y. and L.S.; Visualization, X.M.; Supervision, Z.E. and L.S.; Project administration, Z.E. and Y.Y.; Funding acquisition, Z.E. and Y.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the China Postdoctoral Science Foundation Project under Grant 2024M753007 and in part by the Central Government-Guided Special Fund for Local Science and Technology Development Project under Grant 2024ZYD0254.

Data Availability Statement

The data supporting the conclusions of this study are included in the article. For privacy reasons, these data are not publicly available.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

ABD	Acoustic-Based Diagnosis
RGAD	Recursive Generative Adversarial Denoising
GWAM	Global Window-Aware Attention Module
FDN	Fault Diagnosis Network
IBM	Ideal Binary Mask
IRM	Ideal Ratio Mask
GAN	Generative Adversarial Network
W-MSA	Window-Based Multi-Head Self-Attention
LN	Layer Normalization
MSA	Multi-Head Self-Attention
MLP	Multi-Layer Perceptron
Swin Transformer	Shifted Window Transformer
SW-MSA	Shifted Window Multi-Head Self-Attention
LSGAN	Least-Squares Generative Adversarial Network
T-F	Time–Frequency
FC	Fully Connected
MSE	Mean Squared Error
BNC	Bayonet Nut Connector
WAV	Waveform Audio File Format
PSNR	Peak Signal-to-Noise Ratio
LMS	Least Mean Square
RLS	Recursive Least Squares
TF-Masking	Time–Frequency Masking
GAN- $L_{1}$	Generative Adversarial Loss with $L_{1}$ Regularization Loss

References

1. Chen, C.; Shen, F.; Xu, J.; Yan, R. Probabilistic latent semantic analysis-based gear fault diagnosis under variable working conditions. IEEE Trans. Instrum. Meas. 2019, 69, 2845–2857.
2. Yao, Y.; Gui, G.; Yang, S.; Zhang, S. A recursive multi-head self-attention learning for acoustic-based gear fault diagnosis in real-industrial noise condition. Eng. Appl. Artif. Intell. 2024, 133, 108240.
3. Yao, Y.; Gui, G.; Yang, S.; Zhang, S. A recursive denoising learning for gear fault diagnosis based on acoustic signal in real industrial noise condition. IEEE Trans. Instrum. Meas. 2021, 70, 3524015.
4. Rezaei, A.; Dadouche, A.; Wickramasinghe, V.; Dmochowski, W. A comparison study between acoustic sensors for bearing fault detection under different speed and load using a variety of signal processing techniques. Tribol. Trans. 2011, 54, 179–186.
5. Peng, B.; Li, D.; Wang, K.I.K.; Abdulla, W.H. Acoustic-Based Industrial Diagnostics: A Scalable Noise-Robust Multiclass Framework for Anomaly Detection. Processes 2025, 13, 544.
6. Scanlon, P.; Kavanagh, D.F.; Boland, F.M. Residual life prediction of rotating machines using acoustic noise signals. IEEE Trans. Instrum. Meas. 2012, 62, 95–108.
7. Zhang, D.; Stewart, E.; Entezami, M.; Roberts, C.; Yu, D. Intelligent acoustic-based fault diagnosis of roller bearings using a deep graph convolutional network. Measurement 2020, 156, 107585.
8. Hassan, A.; Hashem, A.F.; Sayed, A.; Kayed, M. Physics-guided deep learning for acoustic-based fault diagnosis. Int. J. Engine Res. 2025.
9. Hou, J.; Jiang, W.; Lu, W. Application of a near-field acoustic holography-based diagnosis technique in gearbox fault diagnosis. J. Vib. Control 2013, 19, 3–13.
10. Yao, Y.; Zhang, S.; Yang, S.; Gui, G. Learning attention representation with a multi-scale CNN for gear fault diagnosis under different working conditions. Sensors 2020, 20, 1233.
11. Yao, Y.; Wang, H.; Li, S.; Liu, Z.; Gui, G.; Dan, Y.; Hu, J. End-to-end convolutional neural network model for gear fault diagnosis based on sound signals. Appl. Sci. 2018, 8, 1584.
12. Glowacz, A. Acoustic based fault diagnosis of three-phase induction motor. Appl. Acoust. 2018, 137, 82–89.
13. Glowacz, A. Fault detection of electric impact drills and coffee grinders using acoustic signals. Sensors 2019, 19, 269.
14. Glowacz, A. Acoustic fault analysis of three commutator motors. Mech. Syst. Signal Process. 2019, 133, 106226.
15. Ebrahimkhanlou, A.; Dubuc, B.; Salamone, S. A generalizable deep learning framework for localizing and characterizing acoustic emission sources in riveted metallic panels. Mech. Syst. Signal Process. 2019, 130, 248–272.
16. Boll, S. Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. Acoust. Speech Signal Process. 2003, 27, 113–120.
17. Scalart, P. Speech enhancement based on a priori signal to noise estimation. In Proceedings of the 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings, Atlanta, GA, USA, 9 May 1996; Volume 2, pp. 629–632.
18. Ephraim, Y.; Malah, D. Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process. 2003, 32, 1109–1121.
19. Haykin, S.; Widrow, B. Least-Mean-Square Adaptive Filters; Wiley Online Library: Hoboken, NJ, USA, 2003.
20. Martinek, R.; Vanus, J.; Kelnar, M.; Bilik, P.; Zidek, J. Application of recursive least square algorithm to adaptive channel equalization. In Proceedings of the Measurement in Research and Industry, Shenzhen, China, 27–28 December 2015.
21. Diniz, P.S. Adaptive Filtering; Springer: Berlin/Heidelberg, Germany, 1997; Volume 4.
22. Shao, Y.; Srinivasan, S.; Jin, Z.; Wang, D. A computational auditory scene analysis system for speech segregation and robust speech recognition. Comput. Speech Lang. 2010, 24, 77–93.
23. Narayanan, A.; Wang, D. Investigation of speech separation as a front-end for noise robust speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2014, 22, 826–835.
24. Xu, Y.; Du, J.; Dai, L.R.; Lee, C.H. A regression approach to speech enhancement based on deep neural networks. IEEE/ACM Trans. Audio Speech Lang. Process. 2014, 23, 7–19.
25. Seltzer, M.L.; Yu, D.; Wang, Y. An investigation of deep neural networks for noise robust speech recognition. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; pp. 7398–7402.
26. Pascual, S.; Bonafonte, A.; Serra, J. SEGAN: Speech enhancement generative adversarial network. arXiv 2017, arXiv:1703.09452.
27. Pandey, A.; Wang, D. On adversarial training and loss functions for speech enhancement. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5414–5418.
28. Fu, S.W.; Liao, C.F.; Tsao, Y.; Lin, S.D. Metricgan: Generative adversarial networks based black-box metric scores optimization for speech enhancement. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 2031–2041.
29. Fu, S.W.; Yu, C.; Hsieh, T.A.; Plantinga, P.; Ravanelli, M.; Lu, X.; Tsao, Y. Metricgan+: An improved version of metricgan for speech enhancement. arXiv 2021, arXiv:2104.03538.
30. Wang, D. Time-frequency masking for speech separation and its potential for hearing aid design. Trends Amplif. 2008, 12, 332–353.
31. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
32. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2021; pp. 10012–10022.
33. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. Adv. Neural Inf. Process. Syst. 2014, 27, 2672–2680.
34. Gogate, M.; Dashtipour, K.; Hussain, A. Robust real-time audio-visual speech enhancement based on dnn and gan. IEEE Trans. Artif. Intell. 2024, 6, 2860–2869.

Figure 1. The proposed GWAM structure diagram, where ⨂ and ⨁ represent element-wise multiplication and element-wise addition, respectively.

Figure 2. Discriminator structure diagram: (a) $D_{1}$, (b) $D_{2}$, (c) $D_{3}$. The convolutional layers may have identical structures but different parameters.

Figure 3. The overall structure of the proposed Recursive Generative Adversarial Denoising (RGAD) learning method.

Figure 4. RGAD loss calculation strategy.

Figure 5. Schematic diagram of the overall structure of the proposed RGAD-based ABD system.

Figure 6. Complete flowchart of the proposed gear fault diagnosis method under noisy conditions.

Figure 7. Experimental system. (a) Gearbox test rig. (b) Measuring system. (c) Semi-anechoic condition. (d) Real industrial condition.

Figure 8. Visualization of the denoising process and results of RGAD. (a) Original input noisy spectrum. (b) Denoised spectrum from adversarial unit 1. (c) Denoised spectrum from adversarial unit 2. (d) Denoised spectrum from adversarial unit 3, which is also the final denoised output of RGAD. (e) Target clean spectrum.

Figure 9. Visualization of denoising results under different ablation settings of RGAD. (a) Clean spectrum. (b) MSE-only constraint. (c) All discriminators set to $D_{1}$. (d) All discriminators set to $D_{2}$. (e) All discriminators set to $D_{3}$. (f) Final output of RGAD.

Figure 10. Spectra after denoising using different methods and the target clean spectrum. (a) Target clean spectrum. (b) LMS [3]. (c) RLS [3]. (d) TF-Masking. (e) GAN-$L_{1}$. (f) Our proposed RGAD.

Figure 11. Confusion matrices of different methods. (a) LMS. (b) RLS. (c) TF-Masking. (d) GAN-L1. (e) Our proposed RGAD.

Table 1. Hyperparameters of discriminators.

Module	Structure	Hyperparameters
Discriminator-1	Conv 4 × 4	$W = 4$, $H = 4$, $s = 4$, $p = False$
	Conv 2 × 2	$W = 2$, $H = 2$, $s = 2$, $p = False$
	Fully Connected	$Layers = 1$
Discriminator-2	Conv 4 × 4	$W = 4$, $H = 4$, $s = 4$, $p = False$
	Conv 4 × 4	$W = 4$, $H = 4$, $s = 4$, $p = False$
	Conv 2 × 2	$W = 2$, $H = 2$, $s = 2$, $p = False$
	Fully Connected	$Layers = 1$
Discriminator-3	Conv 4 × 4	$W = 4$, $H = 4$, $s = 4$, $p = False$
	Conv 4 × 4	$W = 4$, $H = 4$, $s = 4$, $p = False$
	Conv 2 × 2	$W = 2$, $H = 2$, $s = 2$, $p = False$
	Conv 1 × 1	$W = 1$, $H = 1$, $s = 1$, $p = False$
	Fully Connected	$Layers = 1$
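
As a rough aid to reading Table 1, the following sketch estimates how far each discriminator shrinks its input before the fully connected layer. It assumes that $W$ and $H$ are the kernel width and height, that $s$ is the stride, and that $p = False$ means no padding. The 256 × 256 input size is a hypothetical value chosen only for illustration; none of these assumptions are stated in the table itself.

```python
# Sketch only: layer lists are copied from Table 1 as (kernel, stride) pairs.
# The 256 x 256 input size is a hypothetical value, not taken from the paper.
DISCRIMINATORS = {
    "Discriminator-1": [(4, 4), (2, 2)],
    "Discriminator-2": [(4, 4), (4, 4), (2, 2)],
    "Discriminator-3": [(4, 4), (4, 4), (2, 2), (1, 1)],
}


def conv_out(size, kernel, stride, padding=0):
    """Output length of one convolution along a single axis."""
    return (size + 2 * padding - kernel) // stride + 1


def feature_map_size(layers, size=256):
    """Apply the convolutions in order and return the final side length."""
    for kernel, stride in layers:
        size = conv_out(size, kernel, stride)
    return size


for name, layers in DISCRIMINATORS.items():
    print(f"{name}: 256 -> {feature_map_size(layers)} per axis")
```

Under these assumptions, Discriminator-1 keeps a 32 × 32 feature map, while Discriminator-2 and Discriminator-3 both reduce it to 8 × 8. The 1 × 1 convolution in Discriminator-3 adds depth without further downsampling.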

Table 2. MSE and PSNR values between RGAD-denoised acoustic signals and clean acoustic signals.

Method	MSE	PSNR (dB)
Original	0.07892	11.08
Adversarial Unit 1	0.01734	17.61
Adversarial Unit 2	0.01654	17.81
Adversarial Unit 3 (final output)	0.01644	17.84
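
The PSNR column appears consistent with the standard relation $\mathrm{PSNR} = 10 \log_{10}(\mathrm{MAX}^{2} / \mathrm{MSE})$ for spectra normalized to a peak value of 1. The peak value is an assumption here, not something stated in the table, so the sketch below is only a sanity check on the reported numbers.

```python
import math


def psnr_from_mse(mse, peak=1.0):
    """PSNR in dB; peak=1.0 is an assumed normalization, not given in the source."""
    return 10.0 * math.log10(peak ** 2 / mse)


# MSE values copied from Table 2.
table2_mse = {
    "Adversarial Unit 1": 0.01734,
    "Adversarial Unit 2": 0.01654,
    "Adversarial Unit 3 (final output)": 0.01644,
}

for name, mse in table2_mse.items():
    print(f"{name}: {psnr_from_mse(mse):.2f} dB")
```

With a peak of 1, this reproduces the three adversarial-unit rows to two decimals. The "Original" row comes out near 11.03 dB rather than the reported 11.08 dB; one possible explanation is that PSNR was averaged per sample rather than computed from the mean MSE.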

Table 3. RGAD ablation experiment results, where Accuracy represents the fault diagnosis accuracy rate (%).

Method	Unit 1-MSE	Unit 2-MSE	Unit 3-MSE	PSNR (dB)	Accuracy (%)
Regularization loss	0.01785	0.01669	0.01651	17.82	95.28
Equivalent discriminator $D_{1}$	0.01966	0.01821	0.01766	17.53	91.33
Equivalent discriminator $D_{2}$	0.02529	0.02368	0.01655	17.81	94.86
Equivalent discriminator $D_{3}$	0.01665	0.01647	0.01649	17.83	96.52
RGAD (ours)	0.01734	0.01654	0.01644	17.84	97.31

Table 4. Computational costs of different methods.

Method	FLOPs	Parameters	Inference time per sample
Regularization loss	1,569,138	2,341,718	0.30 ms
Equivalent discriminator $D_{1}$	1,569,138	2,344,972	0.30 ms
Equivalent discriminator $D_{2}$	1,569,138	2,420,524	0.30 ms
Equivalent discriminator $D_{3}$	1,569,138	2,496,172	0.30 ms
RGAD (ours)	1,569,138	3,980,684	0.30 ms

Table 5. MSE and PSNR between the denoised and target spectra for different methods, along with the fault diagnosis accuracy (%).

Method	MSE	PSNR (dB)	Accuracy (%)
LMS	0.05559	12.55	31.99
RLS	0.04535	13.43	79.78
TF-Masking	0.03892	14.10	53.47
GAN-L1	0.01766	17.53	95.41
RGAD (ours)	0.01644	17.84	97.31

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

