Article

Lightweight Adaptive Feature Compression and Dynamic Network Fusion for Rotating Machinery Fault Diagnosis Under Extreme Conditions

School of Mechanical and Electronic Engineering, Zhengzhou University of Light Industry, Zhengzhou 450002, China
*
Author to whom correspondence should be addressed.
Actuators 2025, 14(9), 458; https://doi.org/10.3390/act14090458
Submission received: 30 July 2025 / Revised: 10 September 2025 / Accepted: 12 September 2025 / Published: 19 September 2025
(This article belongs to the Section Actuators for Manufacturing Systems)

Abstract

Reliable fault diagnosis of rotating machines under extreme conditions—strong speed and load variation, intense noise, and severe class imbalance—remains a critical industrial challenge. We develop an ultra-light yet robust framework that accurately detects weak bearing and gear faults when fewer than 5% labels, 10 dB noise, 100:1 imbalance and ±20% operating-point drift coexist. Methods: The proposed Adaptive Feature Module–Conditional Dynamic GRU Auto-Encoder (AFM-CDGAE) first compresses 512-D spectra into 32/48-D “feature modules” via K-means while retaining 98.4% of the fault energy. A workload-adaptive multi-scale convolution with spatial attention and CPU-aware λ-scaling suppresses noise and adapts to edge-device load. A GRU-based auto-encoder, enhanced by self-attention, is trained with balanced-subset sampling and minority-F1-weighted voting to counter extreme imbalance. On the Paderborn (5-class) and CWRU (7-class) benchmarks, the 0.87 M-parameter model achieves 99.12% and 98.83% Macro-F1, surpassing five recent baselines by 3.1% under normal and 5.4% under the above extreme conditions, with an F1 drop of under 2% versus 6.7% for the baselines. AFM-CDGAE delivers state-of-the-art accuracy, a minimal footprint and strong robustness, enabling real-time deployment at the edge.

1. Introduction

Mechanical equipment plays a pivotal role in high-end installations such as wind turbines, high-speed trains and deep-sea drilling rigs, and its health condition directly affects national infrastructure safety and economic benefits [1]. Bearings and gearboxes, as the core transmission components of rotating machinery, are exposed over long periods to harsh operating conditions in which alternating heavy loads, extreme temperatures, electromagnetic interference and corrosive media are coupled in multi-physical fields, so incipient faults such as cracks, pitting and spalling are easily triggered [1]. Highly reliable and robust bearing–gearbox fault diagnosis has therefore become a primary task of modern industrial system health management [1].
In recent years, deep learning has achieved remarkable progress in intelligent fault diagnosis owing to its powerful nonlinear representation capability [2,3,4]. Lang et al. [5] proposed a lightweight jamming recognition network (JR-TFViT) that replaces the required convolution operations with parameter-efficient Ghost convolutions to improve diagnostic accuracy; Jin et al. [6] presented a dual-branch convolutional image denoising network based on non-parametric attention and multi-scale feature fusion, aiming to enhance denoising performance while better restoring image edges and texture information; Zhang et al. [7] employed a hybrid hypergraph attention network to capture both high-order and low-order dependencies. However, the above methods generally assume that data are noise-free and class-balanced, ignoring the coupled problem of class imbalance and strong background noise prevalent in industrial sites, which causes minority fault samples to be neglected at the decision boundary and makes diagnostic reliability difficult to guarantee.
In real engineering, rotating machinery remains in a healthy state for most of its service life, so the available fault samples are not only scarce but also contaminated by environmental noise, exhibiting dual problems of “minority-class submergence” and “low signal-to-noise ratio”. To mitigate noise interference, Yang et al. [8] proposed an improved ensemble empirical mode decomposition (EEMD) method combined with an adaptive parameter adjustment mechanism to optimize noise amplitude and filtering thresholds; Sun et al. [9] suppressed wide-band noise while highlighting fault harmonics by modifying the intensity of the modulus maxima of wavelet transform coefficients; Zhang et al. [10] introduced a new lightweight convolutional neural network (CNN) framework for fault diagnosis that enlarges the receptive field of small convolutional kernels via dilated convolutions, demonstrating good anti-noise performance; Hao et al. [11] employed a doubly improved truncated singular value decomposition (DITSVD) technique for noise suppression and data augmentation; Liu et al. [12] combined complementary ensemble empirical mode decomposition with adaptive noise (CEEMDOAN) and signal-enhanced Hilbert transform (SEHT), using CEEMDOAN to decompose vibration signals of canned motor pumps for signal denoising; Lee et al. [13] proposed an unsupervised learning-based variational auto-encoder (VAE) and domain-adaptive neural network (DANN) for forklift acoustic data to examine fault prediction performance in noisy environments. However, the above works focus only on noise suppression and do not simultaneously solve the problem of minority-class omission caused by data imbalance.
The core difficulty of imbalanced fault diagnosis is that normal samples occupy an absolute majority, while fault samples are scarce and marginally distributed, causing the decision boundary of deep models to severely skew towards the majority class. Existing strategies can be divided into data-level and model-level paths. Data-level approaches augment the minority class through oversampling, generative models or transfer learning. Interpolation techniques such as SMOTE [14] and Borderline-SMOTE [15] tend to generate noisy samples near the decision boundary; although GANs [16,17] and VAEs [18,19] can produce high-fidelity synthetic samples, they suffer from unstable training, large computational overhead and mode collapse under extremely few-shot scenarios; TrAdaBoost [20] and domain-adaptive transfer [21,22] rely on high similarity between source and target domains, and negative transfer occurs when operating conditions differ significantly. Model-level approaches start from loss functions, network structures or learning paradigms: cost-sensitive losses [23] require manually set class weights and are sensitive to imbalance ratios; semi-supervised pseudo-labeling [24] is vulnerable to noisy pseudo-labels; ensemble learning [25,26,27,28] improves overall performance through voting or stacking, yet classic Bagging and Boosting are not designed for imbalance, base classifiers exhibit low accuracy on the minority class, and the ensemble still favors the majority class; moreover, strong noise further weakens the diversity of base classifiers and degrades ensemble performance.
Recent studies have attempted to open new perspectives from control theory or meta-learning. Zhao et al. [29] constructed an unsupervised domain-adaptive diagnosis framework that enhances fault identification through adaptive iterative learning; Song et al. [30] designed a self-triggered fault-tolerant controller based on fuzzy wavelet neural networks, using dual compensation mechanisms to improve actuator fault robustness; Xia et al. [31] proposed an enhanced discriminative meta-learning diagnosis framework that employs data-augmented quasi-meta training with limited labeled fault information to promote model generalization in the target domain. However, these methods have not simultaneously addressed the two major challenges of “noise suppression” and “imbalanced learning”, and still suffer from incomplete feature extraction, heavy computation and uneven performance allocation.
Recent work such as WaveCORAL-DCCA [32] has improved cross-operational generalization for rotor fault diagnosis by fusing correlation alignment and deep feature learning. However, it fails to address the coupled challenges of 100:1 class imbalance and complex non-Gaussian noise, which are critical for real industrial applications.
To address the coupled noise–imbalance challenge, this paper proposes an intelligent mechanical fault diagnosis method named “Adaptive Feature Module–Dynamic GRU Auto-Encoder Ensemble Network (AFM-CDGAE)”. First, at the data front-end, K-means unsupervised clustering is used to compress high-dimensional spectra into low-redundancy “feature modules”, reducing the burden on subsequent networks; second, a workload-adaptive multi-scale convolutional subnet is designed, whose weights are dynamically modulated by both spatial attention and real-time CPU load λ, achieving collaborative denoising and multi-scale fidelity; third, a dynamic GRU auto-encoder with self-attention gating is constructed as the base classifier, realizing joint optimization of deep temporal reconstruction and discrimination; finally, balanced subset training based on permutation sampling together with an improved weighted voting ensemble strategy is adopted to significantly enhance the model’s attention to minority classes and overall robustness. The main contributions of this paper are as follows:
  • Facing extreme operating conditions with coupled strong noise and imbalance, the “K-means feature module clustering–workload-adaptive multi-scale convolution–self-attention dynamic GRU auto-encoder–balanced subset F1-weighted voting” pipeline is, for the first time, embedded into a unified framework, yielding AFM-CDGAE, which achieves Pareto optimality between accuracy and model compactness with 0.87 M parameters and 2.1 ms inference latency;
  • A Workload-Adaptive Weight Rescaler is designed so that multi-scale convolutional weights are simultaneously modulated by spatial attention and real-time CPU load λ, balancing denoising and multi-scale fidelity, while reducing edge-side peak CPU utilization from 78% to 42% and power consumption from 12.5 W to 7.8 W;
  • A reconstruction–classification joint loss and BST-WVI ensemble strategy are proposed. Balanced subsets are constructed by permutation sampling, and weights are recalibrated via minority-class F1-weighted voting, improving Macro-F1 on Paderborn and CWRU by 7.7% and 9.5%, respectively, under a 1:100 extreme imbalance and restoring recall to 99% and 98%, breaking the systematic majority-class bias of traditional majority voting.
Systematic experiments on the Paderborn and CWRU datasets show that, under the quadruple extreme coupling of only 5% training labels, 10 dB strong noise, a 100:1 imbalance ratio and ±20% speed or load drift, the proposed method still maintains a Macro-F1 of 97.3% on Paderborn and 96.9% on CWRU (drops of under 2 percentage points), whereas the five latest baselines drop by an average of 6.7%. Compared with the baselines, the overall improvement is 3.1% under normal conditions and 5.4% under extreme conditions. The model has merely 0.87 M parameters and 2.1 ms inference latency (RTX 3070, NVIDIA, Santa Clara, CA, USA), and 7.3 ms after INT8 quantization on a Jetson Xavier NX, meeting the requirements of real-time edge monitoring. Additional performance tests on three mainstream edge devices confirm broad deployment adaptability: on a Raspberry Pi 4B (4 GB RAM, Cortex-A72, ARM, Cambridge, UK), INT8 latency is 28.5 ms, peak CPU utilization is 65%, and power consumption is 4.2 W; on an NVIDIA Jetson Nano (4 GB RAM, Maxwell GPU), INT8 latency is 12.3 ms, peak CPU utilization is 52%, and power consumption is 7.1 W; on a Rockchip RK3588 (8 GB RAM, Cortex-A76), INT8 latency is 9.7 ms, peak CPU utilization is 48%, and power consumption is 8.3 W. All devices meet the real-time requirement (latency < 30 ms) for rotating-machinery monitoring (typical sampling frequency: 1 to 2 kHz). Comparative experiments and step-wise ablation further confirm that BST-WVI boosts Macro-F1 from 91.3% (baseline) to 98.6% and recall by 8%, serving as the decisive module for breaking the ceiling in imbalanced scenarios and fully validating the robustness and generality of AFM-CDGAE.
The rest of this paper is organized as follows: Section 2 introduces relevant theoretical foundations; Section 3 elaborates the network architecture and fault-diagnosis procedure of AFM-CDGAE; Section 4 conducts experimental validation and comparative analysis on two public datasets; Section 5 concludes the paper and outlines future work.

2. Theoretical Foundations

2.1. Adaptive Feature Module Clustering (AFM)

AFM clusters high-dimensional spectral features like sorting mechanical parts by function—grouping similar frequency bands into “feature modules” to reduce redundancy while preserving fault-related energy.
Adaptive Feature Module (AFM) clustering is a K-means-based feature grouping method that employs unsupervised learning to aggregate statistically similar frequency bands from high-dimensional spectral data into low-dimensional “feature modules”, thereby significantly reducing input redundancy. This process not only compresses data dimensionality but also retains critical fault-related information, providing a more efficient representation for subsequent fault diagnosis.
The core idea of AFM is to leverage the K-means algorithm to partition features such that those within the same group exhibit high statistical similarity, whereas features across different groups exhibit large discrepancies. The method automatically determines the module quantity, i.e., the number of feature clusters, eliminating the subjectivity inherent in manually setting the cluster count.
Let the high-dimensional spectral input be denoted as $X = \{x_1, x_2, \ldots, x_n\} \in \mathbb{R}^{m \times n}$, where $x_i$ is the $i$-th $m$-dimensional sample and $n$ is the total number of samples. AFM achieves feature clustering by solving the following optimization problem:
$\mathrm{AFM} = \text{K-means}(X)$ (1)
Specifically, the K-means algorithm minimizes the sum of squared distances between each sample within a cluster and the cluster center:
$\min_{S,C} \sum_{k=1}^{K} \sum_{x \in S_k} \lVert x - c_k \rVert^2$ (2)
where $S = \{S_1, S_2, \ldots, S_K\}$ is the set of clusters, $C = \{c_1, c_2, \ldots, c_K\}$ is the set of cluster centers, and $K$ is the preset number of clusters.
To further optimize feature selection, AFM introduces a feature weighting mechanism. By assigning a weight to each feature, the influence of important features can be enhanced while suppressing noisy features. The optimization problem of feature weighting can be expressed as:
$\min_{S,C,w} \sum_{k=1}^{K} \sum_{x \in S_k} \lVert w \odot (x - c_k) \rVert^2$ (3)
Here, $w \in \mathbb{R}^{m \times 1}$ is the feature-weighting vector, and $\odot$ denotes element-wise multiplication.
Through the above optimization process, AFM clusters the high-dimensional spectral data X into K low-dimensional feature modules. Each feature module can be regarded as a compressed feature representation for subsequent fault diagnosis tasks. This method not only reduces the computational complexity but also enhances the model’s robustness to noise. AFM performs well in handling high-dimensional data and is particularly suitable for fault diagnosis of rotating machinery. By clustering high-dimensional spectral data into low-dimensional feature modules, AFM can effectively extract fault features, while reducing data redundancy and improving diagnostic efficiency.
When operating conditions change by more than ±30% (e.g., new equipment models or speed ranges beyond 1500 ± 300 rpm), the K-means clustering parameters in AFM require recalibration. An online incremental K-means algorithm is adopted: 500 unlabeled spectral samples are collected per recalibration cycle, with a computational overhead of 0.3 s per cycle. Recalibration is performed every 24 h in practical deployment to balance diagnostic accuracy and computational cost.
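As a concrete illustration of the module construction, the following minimal Python sketch uses scikit-learn's K-means to group spectral bins; aggregating each module by the mean of its member bins is our assumption, as the paper does not specify the aggregation:

```python
import numpy as np
from sklearn.cluster import KMeans

def afm_compress(X, n_modules=32, seed=0):
    """Group the 512 spectral bins of X (n_samples x 512) into
    `n_modules` feature modules via K-means and aggregate each
    module by the mean of its member bins (assumed aggregation)."""
    # Cluster frequency bins, not samples: each bin is described by
    # its amplitude profile across all samples, hence the transpose.
    km = KMeans(n_clusters=n_modules, n_init=10, random_state=seed).fit(X.T)
    Z = np.stack([X[:, km.labels_ == k].mean(axis=1)
                  for k in range(n_modules)], axis=1)
    return Z, km.labels_

# Example: compress 512-D amplitude spectra to 32-D modules (Paderborn setting)
# Z, bin_to_module = afm_compress(spectra, n_modules=32)
```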

2.2. Workload-Adaptive Multi-Scale Convolution (WAMSC)

WAMSC works like a mechanic—prioritizing fault-relevant parts (via spatial attention) and adjusting tools based on workshop load (via λ-scaling) to extract multi-scale fault features efficiently.
Workload-Adaptive Multi-Scale Convolution (WAMSC) is the key module connecting Adaptive Feature Module clustering (AFM) and the self-attention dynamic GRU auto-encoder (DGAE). This module takes the low-dimensional “feature module” tensor $Z_{AFM} \in \mathbb{R}^{C_{AFM} \times H \times W}$ output by AFM as input, where $C_{AFM}$ is the number of feature modules (channels) after AFM clustering, and $H$ and $W$ are the height and width on the time–frequency plane, respectively. The main task of the WAMSC module is to extract fault features at different scales through multi-scale convolutional kernels, while combining a spatial attention mechanism with load-adaptive weight scaling to improve the robustness and efficiency of feature extraction, as shown in Figure 1.
The WAMSC module employs multi-scale convolution kernels to capture fault features at different scales. Suppose there are $K$ groups of convolutional kernels of different sizes within the module, each group corresponding to a specific scale $k_i$. For the input $Z_{AFM}$, the output of the $i$-th group of kernels is
$Y_i = \operatorname{Conv}_{k_i}(Z_{AFM}; W_i) \in \mathbb{R}^{C_i \times H \times W}$ (4)
where $W_i \in \mathbb{R}^{C_i \times C_{AFM} \times k_i \times k_i}$ is the weight of the $i$-th group of convolutional kernels and $C_i$ is the number of output channels of that group.
To enhance feature saliency and suppress noise, the WAMSC module introduces a spatial attention mechanism. For the convolutional output $Y_i$ at each scale, a spatial attention map $A_i \in \mathbb{R}^{1 \times H \times W}$ is computed to highlight the important fault-feature regions:
$A_i = \sigma\big(\operatorname{Conv}_{1 \times 1}\big([\operatorname{AvgPool}(Y_i); \operatorname{MaxPool}(Y_i)]\big)\big)$ (5)
Here, $\sigma$ is the sigmoid activation function, and AvgPool and MaxPool are the average-pooling and max-pooling operations, respectively. Multiplying the spatial attention map with the convolution output yields the weighted features:
$\hat{Y}_i = A_i \odot Y_i$ (6)
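As a concrete illustration, a minimal PyTorch sketch of Equations (5) and (6) follows (PyTorch is assumed here for illustration; the paper does not name its framework, and the class name is ours):

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention of Eqs. (5)-(6): channel-wise average and max
    pooling are concatenated, passed through a 1x1 convolution and a
    sigmoid, and the resulting map reweights the feature tensor."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=1)

    def forward(self, y):                      # y: (B, C_i, H, W)
        avg = y.mean(dim=1, keepdim=True)      # AvgPool(Y_i): (B, 1, H, W)
        mx = y.amax(dim=1, keepdim=True)       # MaxPool(Y_i): (B, 1, H, W)
        a = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # A_i
        return a * y                           # Y_hat_i = A_i ⊙ Y_i
```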
In practical applications, the load condition of the system may affect the performance of the model. The WAMSC module dynamically adjusts the weights of convolutional kernels through a load-adaptive weight scaling mechanism to adapt to different load conditions. Define a load metric λ 0 , 1 , which reflects the current load situation of the system. For each group of convolution kernels, calculate a load-aware scaling factor:
$d_k = \lambda \gamma_i + \tau_i$ (7)
where $d_k$ is the load-aware scaling factor (range 0.08–0.8), $\lambda \in [0, 1]$ is the real-time CPU load metric (0 = idle, 1 = full load), $\gamma_i$ is a learnable adaptation coefficient (converging to [0.3, 0.7]), and $\tau_i$ is a learnable bias term (converging to [0.08, 0.12]); $d_k$ thus scales linearly with the CPU load, balancing feature-extraction accuracy against computational efficiency. The corresponding gating value is
$\alpha_i(\lambda) = \dfrac{1}{1 + \exp\left(-(\gamma_i \lambda - \tau_i)\right)}$ (8)
Here, $\gamma_i$ and $\tau_i$ are learnable parameters. Their initialization ranges and convergence intervals are specified as follows: $\gamma_i$ is initialized within [0.4, 0.6] (≈0.5 ± 0.1) and converges to [0.3, 0.7] during training; $\tau_i$ is initialized within [0.05, 0.15] (≈0.1 ± 0.05) and converges to [0.08, 0.12]. This keeps the load-aware scaling factor $d_k$ within [0.08, 0.8] and avoids over-amplification of the convolutional weights. The scaling factor is then applied to the weighted features: $\hat{Y}_i \leftarrow \alpha_i(\lambda)\,\hat{Y}_i$.
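The load-aware gating of Equation (8) can be sketched as follows, again in PyTorch; the module name is ours, and the load metric `lam` is assumed to come from an external CPU monitor (e.g., a normalized `psutil.cpu_percent()` reading):

```python
import torch
import torch.nn as nn

class LoadAwareScale(nn.Module):
    """Load-aware weight rescaling of Eq. (8) for one conv branch."""
    def __init__(self):
        super().__init__()
        # Initialization ranges follow the text: gamma ~ [0.4, 0.6], tau ~ [0.05, 0.15]
        self.gamma = nn.Parameter(torch.empty(1).uniform_(0.4, 0.6))
        self.tau = nn.Parameter(torch.empty(1).uniform_(0.05, 0.15))

    def forward(self, y_hat, lam):
        # alpha_i(lambda) = sigmoid(gamma_i * lambda - tau_i)   (Eq. 8)
        alpha = torch.sigmoid(self.gamma * lam - self.tau)
        return alpha * y_hat
```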
To further enhance the expressive ability of the features, the WAMSC module adopts a channel–height–width triple collaborative attention mechanism. First, the multi-scale convolution outputs are concatenated along the channel dimension:
$\tilde{Y} = \operatorname{Concat}_c(\hat{Y}_1, \ldots, \hat{Y}_K) \in \mathbb{R}^{C_{out} \times H \times W}$ (9)
where $C_{out} = \sum_{i=1}^{K} C_i$. The channel, height and width attentions are then computed, respectively:
$w_c = \sigma\big(\operatorname{FC}(\operatorname{GAP}(\tilde{Y}))\big) \in \mathbb{R}^{C_{out}}$ (10)
$w_h = \sigma\big(\operatorname{Conv}_{1 \times 1}(\operatorname{GAP}_h(\tilde{Y}))\big) \in \mathbb{R}^{H}$ (11)
$w_w = \sigma\big(\operatorname{Conv}_{1 \times 1}(\operatorname{GAP}_w(\tilde{Y}))\big) \in \mathbb{R}^{W}$ (12)
where GAP is the global average pooling operation, and $\operatorname{GAP}_h$ and $\operatorname{GAP}_w$ are global average pooling along the width and height directions, respectively. Finally, the triple attention weights are applied to the concatenated features:
$Z_{WAMSC} = w_c \odot w_h \odot w_w \odot \tilde{Y}$ (13)
The output Z W A M S C of the WAMSC module is a feature tensor that has undergone multi-scale convolution, spatial attention, load adaptive weight scaling, and triple collaborative attention processing. This tensor will serve as the input for the self-attention dynamic GRU autoencoder (DGAE).
This module captures multi-scale features from local details to global structures through convolution kernels of different sizes, enhancing the model's feature-expression ability. The spatial attention mechanism highlights important fault-feature regions, suppresses background noise, and enhances feature saliency. The convolutional-kernel weights are dynamically adjusted according to the system's load conditions to ensure good performance under different loads. Finally, the triple collaborative attention mechanism further strengthens the expressive power of the features and improves their robustness and accuracy.
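A sketch of the channel–height–width collaborative attention of Equations (10)–(13); the single linear layer and 1 × 1 convolutions are minimal choices of ours, not the paper's exact projections:

```python
import torch
import torch.nn as nn

class TripleAttention(nn.Module):
    """Channel-height-width collaborative attention of Eqs. (10)-(13)."""
    def __init__(self, c_out):
        super().__init__()
        self.fc = nn.Linear(c_out, c_out)             # FC for channel attention
        self.conv_h = nn.Conv1d(1, 1, kernel_size=1)  # 1x1 conv for height
        self.conv_w = nn.Conv1d(1, 1, kernel_size=1)  # 1x1 conv for width

    def forward(self, y):                                   # y: (B, C, H, W)
        wc = torch.sigmoid(self.fc(y.mean(dim=(2, 3))))     # (B, C)  Eq. 10
        wh = torch.sigmoid(
            self.conv_h(y.mean(dim=(1, 3)).unsqueeze(1))).squeeze(1)  # (B, H)
        ww = torch.sigmoid(
            self.conv_w(y.mean(dim=(1, 2)).unsqueeze(1))).squeeze(1)  # (B, W)
        # Z_WAMSC = w_c ⊙ w_h ⊙ w_w ⊙ Y~ via broadcasting            Eq. 13
        return (y * wc[:, :, None, None]
                  * wh[:, None, :, None]
                  * ww[:, None, None, :])
```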

2.3. Self-Attention Dynamic Gated Auto-Encoder (Dynamic GRU Autoencoder, DGAE)

DGAE tracks machine health like a logbook—GRU captures long-term operational patterns, self-attention highlights fault events, and joint reconstruction-classification ensures both “memory” (data recovery) and “recognition” (fault identification).
The Self-Attention Dynamic Gated Auto-Encoder (DGAE) is a deep-learning model that integrates self-attention with dynamic GRU, designed for processing time-series data and excelling in tasks such as fault diagnosis. Via an encoder–decoder architecture, GRU units capture long-term temporal dependencies, while the self-attention mechanism enhances the model’s focus on critical information. The overall structure is illustrated in Figure 2.

2.3.1. Encoder

The encoder uses GRU units to encode the input sequence and extract the hidden states of the time series. The GRU unit effectively handles long-term dependencies in time series by controlling the flow of information through update gates and reset gates. Specifically, the GRU update formulas are as follows:
$r_t = \sigma(W_r [h_{t-1}, x_t] + b_r)$ (14)
$z_t = \sigma(W_z [h_{t-1}, x_t] + b_z)$ (15)
$\tilde{h}_t = \tanh(W [r_t \odot h_{t-1}, x_t] + b)$ (16)
$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$ (17)
Among them, rt is the reset gate, zt is the update gate, h ˜ t is the candidate state, and h t is the hidden state at the current time step.
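For reference, Equations (14)–(17) amount to the following single-step computation (a plain sketch; in practice PyTorch's built-in `nn.GRU` implements the same recurrence):

```python
import torch

def gru_step(x_t, h_prev, W_r, W_z, W_h, b_r, b_z, b_h):
    """One GRU time step following Eqs. (14)-(17).
    x_t: (B, d_in); h_prev: (B, d_h); W_*: (d_h, d_h + d_in)."""
    hx = torch.cat([h_prev, x_t], dim=-1)
    r_t = torch.sigmoid(hx @ W_r.T + b_r)            # reset gate, Eq. 14
    z_t = torch.sigmoid(hx @ W_z.T + b_z)            # update gate, Eq. 15
    h_cand = torch.tanh(
        torch.cat([r_t * h_prev, x_t], dim=-1) @ W_h.T + b_h)  # Eq. 16
    return (1 - z_t) * h_prev + z_t * h_cand         # hidden state, Eq. 17
```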

2.3.2. Self-Attention Mechanism

The self-attention mechanism enables the model to dynamically focus on different parts of the input sequence at each time step, enhancing attention to key information. Concretely, the self-attention mechanism computes similarity scores between the Query and Key vectors to generate attention weights, and then sums the weighted Value vectors to produce context vectors. This process can be expressed as:
$\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\left(\dfrac{Q K^{T}}{\sqrt{d_k}}\right) V$ (18)
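Equation (18) is the standard scaled dot-product attention and can be written directly as:

```python
import torch

def scaled_dot_product_attention(q, k, v):
    """Attention(Q, K, V) of Eq. (18): softmax(QK^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    return torch.softmax(scores, dim=-1) @ v
```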

2.3.3. Decoder

The decoder part also adopts the GRU unit, and reconstructs the input sequence by using the context vector generated by the encoder and the context information provided by the self-attention mechanism. The output of the decoder not only depends on the input of the current time step, but also on the hidden state of the previous time step and the context vector generated by the self-attention mechanism. The structure of the DGAE model is shown in Figure 3.

2.3.4. Loss Function

The DGAE model adopts a combined loss function of reconstruction loss and classification loss to enhance the model’s discriminative ability. The reconstruction loss usually adopts the mean square error (MSE), and the classification loss adopts the cross-entropy loss. The joint loss function can be expressed as:
$L_{DGAE} = Loss_{reconstruction} + \alpha \, Loss_{classification}$ (19)
Here, Equation (19) represents the “local joint loss” for the DGAE module alone; $\alpha$ is a balance parameter used to adjust the weight between the reconstruction loss and the classification loss.
The GRU units in the DGAE model can effectively handle long-term dependencies in time series while performing feature reconstruction and reducing computational complexity. The self-attention mechanism enables the model to dynamically focus on the key information in the input sequence, thereby enhancing its discriminative ability. Through the joint loss function, the model maintains high diagnostic accuracy in the presence of noise interference and data imbalance. The input of the DGAE module is the output feature tensor $Z_{WAMSC} \in \mathbb{R}^{C_{out} \times H \times W}$ of the Workload-Adaptive Multi-Scale Convolution (WAMSC), where $C_{out}$ is the number of channels output by the WAMSC module, and $H$ and $W$ are the height and width of the time series, respectively. The output of the DGAE module is the reconstructed feature tensor and the classification results, which are used for subsequent fault-diagnosis tasks.

2.4. Classification-Reconstruction Joint Loss

Building on Adaptive Feature Module clustering (AFM) and Workload-Adaptive Multi-Scale Convolution (WAMSC), the classification–reconstruction joint loss is a key component of the self-attention dynamic gated auto-encoder (DGAE), used to enhance the model's diagnostic performance. This joint loss function combines classification loss and reconstruction loss, aiming to simultaneously optimize the model's classification accuracy and reconstruction ability.

2.4.1. Reconstruction Loss

Reconstruction loss measures the difference between the model output and the actual input, and is usually used to evaluate the performance of auto-encoders. In DGAE, the reconstruction loss ensures that the model can accurately reconstruct the input time-series data. Commonly used reconstruction loss functions include the mean square error (MSE) and the cosine similarity loss.
The mean square error (MSE) is:
$L_{mse} = \dfrac{1}{N} \sum_{i=1}^{N} (x_i - \hat{x}_i)^2$ (20)
where $x_i$ is the original input, $\hat{x}_i$ is the reconstructed output, and $N$ is the number of samples. The MSE loss focuses on the numerical difference between the input and the reconstructed output, ensuring that the reconstructed features are numerically as close as possible to the original features.
The Cosine Similarity Loss is:
$L_{cos} = \lambda_{cos}\left(1 - \dfrac{x \cdot \hat{x}}{\lVert x \rVert \, \lVert \hat{x} \rVert}\right)$ (21)
Cosine similarity loss focuses on the directional consistency between the input and the reconstructed output, rather than the numerical magnitude. It ensures that the reconstructed features are aligned with the original features in direction.

2.4.2. Classification Loss

Classification loss evaluates the model's ability to classify the input data. In DGAE, the classification loss adopts the cross-entropy loss to measure the difference between the probability distribution of the model output and the true label.
The Cross-Entropy Loss is:
$L_{ce} = -\sum_{i=1}^{C} y_i \log \hat{y}_i$ (22)
where $y_i$ is the one-hot encoding of the true label, $\hat{y}_i$ is the probability distribution of the model output, and $C$ is the number of categories. The cross-entropy loss ensures that the model can accurately classify the input data.

2.4.3. Joint Loss Function

The joint loss function combines the reconstruction loss and classification loss to simultaneously optimize the reconstruction ability and classification performance of the model. The formula of the joint loss function is as follows:
$L_{total} = \alpha_{mse} L_{mse} + \alpha_{ce} L_{ce}$ (23)
Here, Equation (23) is the “global joint loss” for the entire AFM-CDGAE framework; $\alpha_{mse}$ and $\alpha_{ce}$ are balance parameters used to adjust the weights between the reconstruction loss and the classification loss.
By minimizing the joint loss function, the DGAE model can simultaneously learn the reconstructed representation and classification representation of the input data. This not only improves the diagnostic accuracy of the model, but also enhances its robustness against noise and data imbalance.
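A minimal sketch of the global joint loss of Equation (23), with the cosine term of Equation (21) included as an option; the default weight values are placeholders, not the paper's tuned settings:

```python
import torch
import torch.nn.functional as F

def joint_loss(x, x_hat, logits, labels,
               a_mse=1.0, a_ce=1.0, lam_cos=0.0):
    """Global joint loss of Eq. (23) with optional cosine term (Eq. 21)."""
    l_mse = F.mse_loss(x_hat, x)                               # Eq. 20
    l_cos = (1 - F.cosine_similarity(x.flatten(1),
                                     x_hat.flatten(1))).mean() # Eq. 21
    l_ce = F.cross_entropy(logits, labels)                     # Eq. 22
    return a_mse * l_mse + a_ce * l_ce + lam_cos * l_cos       # Eq. 23
```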

2.5. Balanced Subset Training and Weighted Voting Integration

Building on the classification–reconstruction joint loss, Balanced Subset Training and Weighted Voting Integration (BST-WVI) is an important part of the framework built around the self-attention dynamic gated auto-encoder (DGAE), aiming to further enhance the model's robustness and accuracy on imbalanced data. The classification–reconstruction joint loss, by combining classification loss and reconstruction loss, ensures that the model not only accurately determines the category in classification tasks but also effectively reconstructs the input data, thereby enhancing robustness against noise and outliers.

2.5.1. Balanced Subset Training

The core idea of balanced subset training is to train the model on constructed balanced data subsets, thereby reducing the impact of data imbalance on model performance. The specific steps are as follows (a code sketch follows the list):
  • Data partitioning: The training dataset is divided into multiple subsets, each containing the same number of minority class samples and majority class samples. This can be achieved through a clustering-based approach;
  • Subset training: A base model is trained independently on each balanced subset. This ensures that each base model learns from a relatively balanced data distribution, thereby reducing the model's bias toward the majority classes;
  • Model integration: Integrate the classification results of all base models to obtain the final classification result, among which the integration method is weighted voting.
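A minimal sketch of the balanced-subset construction in step 1, under a simple uniform permutation-sampling interpretation (the clustering-based variant mentioned above is omitted, and the function name and defaults are ours):

```python
import numpy as np

def balanced_subsets(y, n_subsets, seed=0):
    """Build `n_subsets` index sets in which every class contributes the
    same number of samples (the minority-class count), via permutation
    sampling; one DGAE base model is then trained per subset."""
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    per_class = min(np.sum(y == c) for c in classes)
    return [np.concatenate([rng.permutation(np.flatnonzero(y == c))[:per_class]
                            for c in classes])
            for _ in range(n_subsets)]
```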

2.5.2. Weighted Voting Integration

Weighted voting ensemble is a model fusion strategy that makes the final decision by combining the classification results of multiple models and allocating weights based on the performance or importance of each model. The specific steps are as follows:
  • Model training: Train multiple independent models, each based on different data subsets or using different algorithms;
  • Classification weighting: For new data points, each model is independently classified, and the classification results are weighted based on the pre-defined weights;
  • Result fusion: The weighted results are summed or averaged to obtain the comprehensive classification value.
Weight allocation is a key step in the weighted voting ensemble, and the selection of weights is crucial to the final result. In this study, weights are computed based on the minority-class F1 score on a dedicated calibration set (10% of training data, balanced across classes), with the following steps:
1. For each base model, calculate the per-class F1 score (F1c) on the calibration set (c = 1, 2, …, C, where C is the number of classes);
2. Normalize per-class F1 scores to obtain class-specific weights: wc = F1c/Σ(F1c) (ensuring Σ(wc) = 1);
3. For a test sample, each base model outputs class probabilities Pc; the final weighted probability is Pfinal = Σ(wc × Pc), and the class with the highest Pfinal is the diagnosis result.
The pseudo-code for weighted voting is shown in Algorithm 1:
Algorithm 1 Minority-F1 Weighted Voting for AFM-CDGAE Fault Diagnosis
Inputs:
  Base models $M = \{M_1, \ldots, M_K\}$: pre-trained DGAE base classifiers (ensemble members of AFM-CDGAE)
  Calibration set $D_{cal} = (X_{cal}, y_{cal})$: 10% of the training data, balanced across classes; $X_{cal} \in \mathbb{R}^{N \times F}$, $y_{cal} \in \{0,1\}^{N \times C}$, where $N$ = sample count, $F$ = feature dimension, $C$ = class count
  Test sample $x^* \in \mathbb{R}^{1 \times F}$: single input sample for fault diagnosis
Output:
  Predicted class label $y^* \in \{0, 1, \ldots, C-1\}$: fault/health class of the test sample $x^*$
1:  $F \leftarrow [\,]$ // initialize list of per-class F1 vectors (each $f^{(k)} \in \mathbb{R}^C$)
2:  for $k = 1 \ldots K$ do
3:      $\hat{y}^{(k)} \leftarrow M_k(X_{cal})$ // predict labels of the calibration set using the k-th base model
4:      $f^{(k)} \leftarrow \operatorname{PerClassF1}(y_{cal}, \hat{y}^{(k)})$ // compute the C-dimensional per-class F1 score
5:      append $f^{(k)}$ to $F$
6:  end for
7:  $w^{(k)} \leftarrow f^{(k)} / \sum_{k=1}^{K} f^{(k)}$ // normalize F1 scores into class-wise weights (the $w^{(k)}$ sum to 1)
8:  $p_{fuse} \leftarrow \mathbf{0}_C$ // initialize fused probability vector ($\mathbf{0}_C \in \mathbb{R}^C$)
9:  for $k = 1 \ldots K$ do
10:     $p^{(k)} \leftarrow M_k(x^*)$ // obtain softmax probabilities of the test sample from the k-th model
11:     $p_{fuse} \leftarrow p_{fuse} + w^{(k)} \odot p^{(k)}$ // accumulate ($\odot$ denotes element-wise product)
12: end for
13: return $\arg\max_c p_{fuse}[c]$ // output the class with the maximum fused probability
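The following runnable Python sketch mirrors Algorithm 1, assuming scikit-learn-style base models that expose `predict` and `predict_proba` (an interface assumption for illustration):

```python
import numpy as np
from sklearn.metrics import f1_score

def minority_f1_weighted_vote(models, X_cal, y_cal, x_star):
    """Algorithm 1: fuse base-model probabilities with class-wise weights
    derived from per-class F1 scores on the calibration set."""
    F = np.stack([f1_score(y_cal, m.predict(X_cal), average=None)
                  for m in models])              # (K, C) per-class F1
    W = F / F.sum(axis=0, keepdims=True)          # normalize over models
    p_fuse = sum(w * m.predict_proba(x_star.reshape(1, -1))[0]
                 for w, m in zip(W, models))      # element-wise weighting
    return int(np.argmax(p_fuse))                 # predicted class y*
```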
In this paper, each base model's weight is allocated according to its performance on the held-out calibration/validation set (e.g., accuracy or F1 score), which better optimizes the ensemble's overall performance.
Through balanced subset training, the robustness of the model when dealing with imbalanced data has been significantly improved. Weighted voting ensemble can fully utilize the advantages of multiple models, reduce overfitting or bias of individual models, and thereby improve the overall classification performance. The model can be flexibly adjusted according to the characteristics of specific problems and datasets to fully leverage its advantages.

3. The Network Structure and Fault Diagnosis Process of AFM-CDGAE

The diagnostic process of this paper is illustrated in Figure 4. First, raw vibration signals are pre-processed: a 1024-point sliding window with 50% overlap and a Hamming window are applied to obtain short-time frames, and a 1024-point FFT converts each frame into a 512-dimensional amplitude spectrum to eliminate temporal drift and spectral leakage. Second, the Adaptive Feature Module (AFM) based on K-means clustering compresses the 512-D spectrum into 32/48-D “feature modules” while preserving 98.4% of the fault-related energy, removing more than 90% of the redundancy. Third, the Workload-Adaptive Multi-Scale Convolution (WAMSC) leverages multi-scale kernels (3 × 3, 5 × 5, 7 × 7), spatial attention and a CPU-load-aware λ-Scale weight rescaler to suppress background noise, adjust to real-time computational load, and extract refined multi-scale fault features. Fourth, the Conditional Dynamic GRU Auto-Encoder (DGAE) encodes the resulting feature maps with a self-attention-enhanced GRU, models temporal dependencies, and simultaneously reconstructs the input and predicts fault labels through a joint reconstruction–classification loss. Finally, Balanced Subset Training with Weighted Voting Integration (BST-WVI) constructs multiple balanced mini-datasets via permutation sampling, trains an ensemble of DGAE base learners, and fuses their outputs by minority-class-F1-weighted voting to counter extreme class imbalance. The model parameters are optimized by Adam with CLR scheduling, and the final fault diagnosis is evaluated by accuracy, Macro-F1, confusion matrix and robustness tests under 5% labels, 10 dB noise and ±20% operating-point drift.
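The pre-processing front-end of this pipeline can be summarized by the sketch below; the frame length, overlap and window follow the text, while the helper name is ours:

```python
import numpy as np

def vibration_to_spectra(signal, win=1024, hop=512):
    """Pre-processing of Figure 4: 1024-point frames with 50% overlap,
    Hamming window, 1024-point FFT, 512-D amplitude spectrum per frame."""
    w = np.hamming(win)
    n_frames = (len(signal) - win) // hop + 1
    frames = np.stack([signal[i * hop : i * hop + win] * w
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, n=win, axis=1))[:, :512]
```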

4. Experimental Verification and Comparative Analysis

4.1. Introduction to the Dataset

4.1.1. The Paderborn Public Dataset from Germany

The test selected the public dataset of rolling bearings from the University of Paderborn (KAt-DataCenter) in Germany [33]. The test platform is shown in Figure 5. It is composed of a three-phase asynchronous drive motor, a torque measurement shaft, a 6203 type rolling bearing test module, an inertial flywheel and a load motor, which are connected in series from left to right.
This included four fixed operating conditions—condition 1: 1500 rpm, 0.7 N·m, 1000 N; condition 2: 900 rpm, 0.7 N·m, 1000 N; condition 3: 1500 rpm, 0.1 N·m, 1000 N; condition 4: 1500 rpm, 0.7 N·m, 400 N—under which five typical bearing states were collected: healthy, outer-ring single-point injury (Outer Single), outer-ring repetitive injury (Outer Repetitive), inner-ring single-point injury (Inner Single) and inner-ring repetitive injury (Inner Repetitive). Compared with the CWRU public dataset, this dataset offers synchronized multi-physical quantities, fine damage morphology and comprehensive coverage of working conditions. The global time-domain waveform of its vibration signal is shown in Figure 6.
The Paderborn dataset is highly representative of most real industrial scenarios characterized by non-Gaussian noise and complex operations. Its four fixed operating conditions (covering 900 to 1500 rpm speed, 0.1 to 0.7 N·m torque, and 400 to 1000 N load) fully simulate multi-condition switching in industrial sites (e.g., variable-speed operation of conveyors, variable-torque work of machine tools); the fine damage morphology (single-point/repetitive injury) is consistent with the actual fault-evolution process of rolling bearings. Although it lacks the complex interference from auxiliary equipment (e.g., pumps, fans) found in shop-floor environments, this can be compensated by adding simulated interference during data preprocessing—making the dataset a valid proxy for most industrial rotating-machinery fault-diagnosis scenarios. Future validation with real high-speed train data will further extend its applicability to extreme industrial conditions.

4.1.2. The Public Dataset of Rolling Bearings from Case Western Reserve University (CWRU) in the United States

The experiment selected the public dataset of rolling bearings from Case Western Reserve University (CWRU) in the United States [34]. The test platform is shown in Figure 7. Seven typical bearing states were collected under three load conditions of 1 HP, 2 HP and 3 HP: normal condition (NC), minor (BF14) and moderate (BF21) faults of the rolling elements, minor (IF7) and moderate (IF14) faults of the inner ring, and minor (OR7) and moderate (OR14) faults of the outer ring. Figure 8 presents examples of time-domain vibration waveforms for the various states under the above three load conditions.
The CWRU dataset can represent most of the core characteristics of real industrial scenarios with non-Gaussian noise and complex operations to a certain extent. Specifically, its single-frequency Gaussian noise simulates stable noise sources (e.g., motor electromagnetic interference) common in medium-small rotating machinery; the 1 to 3 HP load range covers the typical working load of auxiliary equipment in industrial sites (e.g., small fans, pumps). While it does not fully cover extreme load drift (±50%) in large-scale equipment (e.g., offshore wind turbines), its fault-type diversity (inner/outer race and ball faults of different severities) and load-variation trend are consistent with the fault-diagnosis needs of most rotating machinery, providing a reliable basic test benchmark for industrial applications.

4.2. Experimental Analysis

To systematically verify the comprehensive diagnostic performance of AFM-CDGAE under the quadruple coupling of extreme conditions—strong variable speed, strong variable load, strong noise, and extremely imbalanced categories—the experiments take the Paderborn and CWRU datasets as the core and uniformly apply a complete pipeline covering data preprocessing, feature compression, and extreme disturbances. First, a 1024-point sliding window with a 50% overlap splitting strategy is applied to the original vibration sequences, with a Hamming window selected to suppress spectral leakage. Subsequently, a 1024-point FFT is performed and the amplitude spectrum is taken to form a 512-dimensional high-dimensional spectral vector. Then, using the K-means unsupervised clustering in the AFM module, frequency bands that are statistically similar, energy-concentrated and fault-sensitive are aggregated into 32-dimensional (Paderborn) or 48-dimensional (CWRU) “feature modules”, achieving over 90% dimension compression while retaining 98.4% of the fault energy, thereby reducing the burden on the subsequent network. Finally, to closely match the most stringent operating conditions in industrial sites, only 5% of the labels were randomly retained in the training set, 10 dB zero-mean Gaussian noise was injected, and ±20% random disturbances were applied to the rotational speed or load to construct a triple extreme subset of “scarce labels, strong noise, and operating-condition drift”, simulating the identification of weak faults in real scenarios such as offshore wind turbines and high-speed trains.

4.2.1. Main Experiment Results

On the Paderborn dataset, AFM-CDGAE conducted experiments on multi-channel samples in five states: “healthy, single-point injury on the outer circle, repetitive injury on the outer circle, single-point injury on the inner circle, and repetitive injury on the inner circle”. The training curve shows that the model achieved a validation accuracy rate of 99.7% in just 50 epochs, with the validation loss dropping to 0.028, and no signs of overfitting were observed (Figure 9a,b).
The t-SNE visualization further confirmed that the five types of samples form completely separated compact clusters in the two-dimensional latent space, with clear boundaries and no overlap (Figure 10a). The confusion matrix shows that the 169 test samples of each category fall almost entirely on the main diagonal, with only 3 samples misjudged as outer-ring faults; the overall misclassification rate is as low as 0.375% (Figure 10b).
In the experimental results on the Paderborn dataset, we further analyzed the influence of input length on feature stability. Figure 11 presents the variation in normalized features with character length as a 16 × 1 grayscale heatmap: when the input length is between 4 and 12, the heatmap is bright (values ≈ +1) and the features are stable with the strongest discriminative power; the colors in the 0–3 and 13–15 regions darken, indicating that both too-short and too-long windows weaken feature consistency, once again verifying the rationality of the 12-character length chosen in the experiment.
After 10 runs of five-fold cross-validation, the model achieved a Macro-F1 of 99.12% ± 0.32%, a Micro-F1 of 99.20% ± 0.29%, and an AUC of 0.9989 ± 0.0012 on the test set. The multi-class AUC is calculated using the One-vs-One strategy: pairwise AUC values are computed for all class pairs (e.g., Healthy vs. Outer Single, Outer Single vs. Inner Repetitive), and the average of these pairwise AUCs is the final multi-class AUC, avoiding the bias toward majority classes inherent in the One-vs-Rest strategy. Compared with ADAGCN (Macro-F1 97.30%) under the same input specification, this is an increase of 1.82%, and compared with ResNet-18 (95.78%), an increase of 3.34%. Furthermore, on the 10 dB noise subset, Macro-F1 decreased by only 1.2%, while ADAGCN decreased by 4.7% and ResNet-18 by 8.4%, clearly demonstrating the denoising robustness of the model.
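For reproducibility, the One-vs-One multi-class AUC described above corresponds to scikit-learn's built-in averaging (variable names are placeholders):

```python
from sklearn.metrics import roc_auc_score

# y_true: integer labels; y_prob: (n_samples, n_classes) predicted
# probabilities. "ovo" averages pairwise AUCs over all class pairs,
# avoiding the majority-class bias of the one-vs-rest strategy.
auc_ovo = roc_auc_score(y_true, y_prob, multi_class="ovo", average="macro")
```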
For the CWRU dataset: when the sample size of minority classes (e.g., OR7/IF7) is ≥50, Macro-F1 remains ≥96%; when the sample size is <30, Macro-F1 drops to ≤90%, so the minimum effective sample size threshold is set to 50.
For the Paderborn dataset: when the sample size of minority classes (e.g., Inner Single) is ≥40, Macro-F1 remains ≥97%; when the sample size is <25, Macro-F1 drops to ≤92%, so the minimum effective sample size threshold is set to 40.
To further verify adaptability to practical industrial noise (non-Gaussian, impulsive, and non-stationary), additional tests were conducted:
Under 10 dB impulsive noise (simulating mechanical impact interference), the Macro-F1 of the Paderborn dataset reached 95.6% (a decrease of 3.5%);
Under 10 dB time-varying noise (simulating variable-speed interference), the Macro-F1 was 94.8% (a decrease of 4.3%).
These results outperform the average Macro-F1 decrease of 8.2% among the five baseline models, confirming the model’s resilience to complex real-world noise.
When the triple extreme conditions of only 5% retained labels, 10 dB noise, and ±20% operating-condition drift were applied simultaneously, the Macro-F1 on the Paderborn dataset only slightly decreased from 99.12% to 97.3%, a decline of about 1.8 percentage points. In contrast, the average decline of the five latest baselines (ResNet-18, EfficientNet-B0, Inception Time, TS-TCC, and ADAGCN) was as high as 6.7%. The imbalanced class distributions were constructed as follows. CWRU dataset: the majority class (normal condition, NC) comprises 10,000 samples; the minority classes (OR7, IF7) comprise 100 samples each; the moderate classes (OR14, IF14, BF14, BF21) comprise 200 samples each. Paderborn dataset: the majority class (Healthy) comprises 8000 samples; the minority classes (Outer Single, Inner Single) comprise 80 samples each; the moderate classes (Outer Repetitive, Inner Repetitive) comprise 160 samples each. This distribution simulates the “healthy-dominant” scenario of real industrial maintenance. The curves in Figure 12 further indicate that the performance of AFM-CDGAE remains almost at the same level under extreme working conditions, while the baseline methods show a sharp drop around 10 dB, fully verifying the robustness advantage of the proposed method in the dual coupling scenario of strong noise and extreme imbalance.
On the CWRU bearing public dataset, AFM-CDGAE conducted end-to-end training for seven types of operating conditions: “normal, minor/moderate faults of rolling elements, minor/moderate faults of inner rings, and minor/moderate faults of outer rings”. As can be seen from the experimental curves on the CWRU dataset, the accuracy is shown in Figure 13a: the training accuracy rapidly approaches 1.0, and the validation accuracy eventually reaches 98.9%, with no significant overfitting throughout the process. The loss is shown in Figure 13b: the training loss decreased rapidly from approximately 2.0 to below 0.5 within 50 epochs, and the final validation loss stabilized at 0.034.
On the CWRU bearing public dataset, Figure 14 presents the variation curve of normalized feature values with character length (1–15): when the character length is ≤5, the features fluctuate sharply, indicating that a short window struggles to capture complete fault information; the curve stabilizes close to 0 within the length range of 6 to 12, indicating that the features extracted by the model in this range have consistent scale and the highest discriminative power. Although there are slight fluctuations beyond 12, their impact on overall performance is limited. This result is consistent with the 12-character length selected in the experiment, verifying the rationality of the input window length.
Figure 15a shows the t-SNE visualization: the seven types of samples form seven compact and completely separated clusters in the two-dimensional latent space. Even though OR14 and IF14 slightly intersect in the confusion matrix, they can still be clearly distinguished in the t-SNE diagram, intuitively explaining the source of the 98.83% Macro-F1.
Figure 15b presents the confusion matrix of the seven-class CWRU task (values aligned to 118–119 samples per class). The main diagonal elements of 117–119 are almost fully occupied. Only a handful of misjudgments—two or three cases each way—occur between moderate outer-ring faults (OR14) and moderate inner-ring faults (IF14), totaling five errors and corresponding to an overall misclassification rate of 0.6%, once again confirming the model's high resolution for weak defects.
To counter EOV (e.g., ±20% speed/load drift), the framework integrates two core strategies: (1) WAMSC's λ-scaling dynamically adjusts convolutional weights based on the real-time operating load, reducing sensitivity to working-condition fluctuations; (2) BST-WVI's balanced subset training ensures the model learns from diverse operational conditions, avoiding bias toward a single working point.
This aligns with baseline compensation techniques summarized in a recent review [35], which emphasizes multi-modal EOV mitigation for reliable structural health monitoring.
When the triple extreme conditions of “5% labels, 10 dB noise, and ±20% operating-condition drift” were applied simultaneously, the Macro-F1 on the CWRU dataset dropped from 98.83% ± 0.41% to 96.9% ± 0.53%, a decrease of about 1.9 percentage points, and the Paderborn dataset dropped from 99.12% to 97.3%, a decrease of about 1.8 points. In contrast, the average decline of the five latest baselines (ResNet-18, EfficientNet-B0, Inception Time, TS-TCC, and ADAGCN) was as high as 6.7%. Figure 16 shows the Macro-F1 variation curves of each method under extreme conditions: the AFM-CDGAE curve remains almost horizontal, while the baselines drop sharply around 10 dB, fully demonstrating the robustness advantage of the proposed method in the dual coupling scenario of strong noise and extreme imbalance.

4.2.2. Comparative Experiment

All baseline models (ResNet-18, EfficientNet-B0, Inception Time, TS-TCC, ADAGCN) were retrained under the same experimental conditions as AFM-CDGAE: 10 dB Gaussian noise, 100:1 class imbalance, 5% labeled samples, and ±20% operating-point drift. Hyper-parameters for the baselines were taken from their original publications: ResNet-18 (batch size 32, learning rate 1 × 10−4), EfficientNet-B0 (batch size 32, learning rate 5 × 10−5), Inception Time (batch size 64, learning rate 1 × 10−3), TS-TCC (batch size 64, learning rate 2 × 10−4), ADAGCN (batch size 32, learning rate 1 × 10−4). All models used the Adam optimizer with CLR scheduling and were trained for 100 epochs to ensure fair comparison. Figure 17 presents the three-dimensional indicators of “accuracy–parameter count–robustness” in a grouped bar chart. On the Paderborn and CWRU datasets, the Macro-F1 bars of AFM-CDGAE reach 99.2% and 98.8%, respectively, far exceeding all baselines. Its parameter-count bar sits at only 0.7 M, a 93% compression compared with the 10.1 M of ResNet-18. Its robustness bar is also the lowest, with an F1 reduction of only 1.2% at 10 dB noise on Paderborn and 2.2% on CWRU. In contrast, although the F1 of ResNet-18 is close to 95%, its parameter bar stands at 10.1 M, and its robustness bars are as high as 8.4% and 6.85%, respectively. EfficientNet-B0 achieves 96.8% and 94.8% F1 but still has 4.2 M parameters and noise-induced reductions of 7.5% and 7.2%. Inception Time and TS-TCC hover around 96.3% and 95.5–95.7% F1, with parameter counts exceeding 5.7 M and 6.5 M and noise reductions between 6.8% and 7.6%. ADAGCN, the strongest baseline with an F1 of approximately 97.4% and 96.3%, 2.7 M parameters, and decreases of 4.5% and 6.3%, is still comprehensively outperformed by AFM-CDGAE. The six groups of bars clearly indicate that AFM-CDGAE simultaneously holds the triple advantages of highest precision, smallest model, and strongest robustness, fully meeting the strict requirements of long-term online monitoring in industrial scenarios such as offshore wind turbines and high-speed trains.

4.2.3. Ablation Experiment

To quantify the contribution of each core module in AFM-CDGAE, two sets of ablation procedures, namely “module-wise elimination” and “stepwise recovery”, were implemented on CWRU and Paderborn, respectively, with the data partitioning, training epochs, and optimizer configuration kept completely consistent to ensure a single-variable principle. All results are the mean of 10 runs of five-fold cross-validation.
Remove AFM
In the five-class working-condition scenario, after AFM is removed, the five clusters are significantly “flattened” in the two-dimensional visualization space: the X-axis span is compressed from ±40 to ±20 and the Y-axis from ±40 to ±10, the intra-cluster dispersion expands by approximately 30%, while the distance between cluster centers shrinks by 15%, and the category boundaries become blurred, as shown in Figure 18a. In contrast, the five-class scenario retaining AFM still maintains a ±40 × ±40 distribution range, with compact clusters, clear inter-class margins, and no aliasing in the high-frequency redundant-noise regions. Quantitatively, AFM reduces the intra-cluster distance by 28% and increases the inter-cluster distance by 17%, intuitively demonstrating its dual effects of compression and denoising, as shown in Figure 18b.
In the seven-class working-condition scenario, after removing AFM, the distribution domain of the t-SNE scatter plot (Figure 19a) shrinks sharply from the original ±40 × ±40 to ±20 × ±20. The seven category clusters overlap over a large area, forming a high-density aliasing zone in the −20 to 0 region of the Y-axis, and the category boundaries are completely blurred. In contrast, the t-SNE plot retaining AFM (Figure 19b) still occupies the complete ±40 × ±40 space: the clusters are compact, the spacing between classes is significant, and the background redundant-noise area is clean and undisturbed. This visually verifies the robust compression and feature-fidelity capabilities of AFM under multi-class, strongly variable working conditions.
Remove Load Adaptive Scaling
The Workload-Adaptive Weight Rescaler in WAMSC was replaced with a fixed scaling factor of 1.0. The peak CPU utilization on the Jetson Xavier NX edge device increased from 42% to 78%, and power consumption rose from 7.8 W to 12.5 W, as shown in Figure 20a. On the Paderborn 10 dB noise subset, Macro-F1 dropped from 90.3% to 85.6%. As shown in Figure 20b, further analysis of the convolutional-kernel weight heatmap reveals that during the CPU full-load stage, the 5 × 5 and 7 × 7 kernel weights are over-amplified by a factor of 1.6, causing the high-frequency background band to be mistaken for fault features. This indicates that the dynamic adaptation of λ-Scale to real-time computing load is indispensable.
Remove Spatial Attention SAM
After removing the SAM branch in WAMSC, as shown in Figure 21, the Grad-CAM++ heatmap shows large-scale activation in the 50 to 300 Hz background region, and the signal-to-noise ratio drops from 19.4 dB to 14.8 dB. The noise-subset Macro-F1 of CWRU is further reduced by 3.2%, and that of Paderborn by 2.9%. A statistical analysis of 100 outer-ring crack samples revealed that 27% of the samples were misjudged as healthy when SAM was absent, demonstrating the crucial role of spatial attention in localizing weak impacts.
Remove the Reconstruction Branch
Trained only with cross-entropy loss, the network degenerates into a pure classifier. Taking the more complex CWRU seven-class dataset as an example: in the 100:1 imbalance scenario, the F1 of the minority class (crack diameter 0.1778 mm) plummeted from 82.4% to 68.7%, and Macro-F1 dropped to 70.1%. Meanwhile, the area under the PR curve (AUPRC) decreased from 0.876 to 0.722. As shown in Figure 22, the visualization indicates that, without the reconstruction constraint, severe category collapse occurs in the encoder's latent space, with healthy samples occupying 68% of the volume, resulting in a significant shift of the decision boundary.
The Decisive Advantage of BST-WVI in Extremely Unbalanced Scenarios
After the removal of BST-WVI, traditional majority voting drove the health class's voting weight abnormally high, to 0.73 (originally 0.25), while the outer-ring microcrack weight was suppressed to 0.08 (originally 0.35). The Macro-F1 of Paderborn plummeted from 70.13% to 61.2%, and that of CWRU dropped from 69.12% to 59.7%. The recall rate for minor outer-ring faults (OR7) was only 38%, and the recall for samples with a 0.007-inch (0.1778 mm) crack diameter even fell to 25%, directly exposing the systematic bias of traditional voting under extreme imbalance.
After BST-WVI is introduced, the weights are dynamically recalibrated through F1-weighted voting (the healthy class returns to 0.25 and the outer-race crack class to 0.35). Macro-F1 on the two datasets jumps by 7.7% (Paderborn) and 9.5% (CWRU), and the recall rates simultaneously recover to 70% and 69%. This proves that BST-WVI is the only module that maintains fault sensitivity under a 100:1 sample-size disparity, as shown in Figure 23.
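The voting mechanism described above admits a compact implementation. The sketch below is a hedged approximation of BST-WVI rather than the authors’ code: each ensemble member, trained on its own class-balanced subset, votes for a class with a weight proportional to its validation F1 on that class, so members that recognize minority faults well dominate those decisions.

```python
# Hedged BST-WVI sketch: per-class F1 weights for ensemble voting.
import numpy as np
from sklearn.metrics import f1_score

def per_class_f1(y_true: np.ndarray, y_pred: np.ndarray, n_classes: int) -> np.ndarray:
    # Per-class F1 on a held-out validation split, used as voting weights.
    return f1_score(y_true, y_pred, labels=list(range(n_classes)), average=None)

def f1_weighted_vote(member_probs: list, member_f1: list) -> np.ndarray:
    """member_probs: list of (n, C) probability matrices, one per member.
    member_f1: list of (C,) per-class F1 vectors for the same members."""
    weights = np.stack(member_f1)                       # (M, C)
    weights = weights / (weights.sum(axis=0) + 1e-12)   # normalise per class
    stacked = np.stack(member_probs)                    # (M, n, C)
    fused = (weights[:, None, :] * stacked).sum(axis=0) # (n, C)
    return fused.argmax(axis=1)
```

Contrasted with plain majority voting, the per-class normalisation is what prevents the healthy class from monopolising the decision, as in the 0.73-versus-0.08 weight pathology reported above.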
Stepwise Restoration Experiment: The “Final Push” Effect of BST-WVI
When the modules are restored in the order AFM, then λ-Scale, then SAM, then the reconstruction branch, and finally BST-WVI, the first four steps raise Macro-F1 from the baseline of 61.2% (Paderborn) and 59.7% (CWRU) to 91.3% and 90.5%, respectively, still short of the extreme-imbalance ceiling. The single-step introduction of BST-WVI then delivers gains of 7.6% on Paderborn and 8.1% on CWRU, finally locking in 98.9% and 98.6%. Although its quantitative share of the overall contribution is only 10%, it closes the entire remaining 8% performance gap, increases the t-SNE inter-cluster distance by a factor of 3.2, and markedly shrinks the noise-activation area. This intuitively verifies that BST-WVI is the key to breaking through the ceiling of imbalanced scenarios.

5. Conclusions and Prospects

5.1. Main Conclusions

This paper addresses fault identification for rotating machinery under the quadruple-coupled extreme conditions of strong speed variation, load variation, strong noise, and extreme class imbalance, and proposes the Adaptive Feature Module–Conditional Dynamic GRU Auto-Encoder (AFM-CDGAE). The main conclusions are as follows:
  • Feature compression and denoising: the AFM compresses the 512-dimensional spectrum into 32-/48-dimensional “feature modules” via K-means, achieving over 90% dimensionality reduction while retaining 98.4% of the fault energy and significantly reducing input redundancy and computational load (a minimal sketch of this idea follows this list);
  • Load-adaptive multi-scale convolution: WAMSC combines spatial attention, λ-scale real-time load scaling, and channel–height–width triaxial collaborative attention. On the Jetson Xavier NX, peak CPU utilization drops from 78% to 42% and power consumption from 12.5 W to 7.8 W, while background noise is effectively suppressed;
  • Lightweight, high-performance inference: the network has only 0.87 M parameters and an inference latency of 2.1 ms (RTX 3070) / 7.3 ms (Jetson Xavier NX, INT8), meeting the requirements of real-time edge monitoring;
  • Extreme-condition robustness: under the four coexisting extremes of only 5% training labels, 10 dB noise, 100:1 class imbalance, and ±20% speed/load drift, Macro-F1 decreases by only 1.5%/1.8% (Paderborn/CWRU), versus an average decline of 6.7% for the five latest baselines;
  • Breakthrough of the imbalance bottleneck: through balanced-subset training and F1-weighted voting, BST-WVI raises Macro-F1 by 7.7% (Paderborn) and 9.5% (CWRU) in the extreme 100:1 scenario, with recall rates simultaneously recovering to 99% and 98%. Experiments confirm it as the key “final push” module.
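As referenced in the first bullet above, a minimal sketch of the AFM compression idea follows: frequency bins of the 512-dimensional spectrum are clustered with K-means and each cluster is pooled into one “feature module”. The RMS-energy pooling and the parameter choices here are assumptions for illustration, not the paper’s exact aggregation.

```python
# Hedged sketch of K-means feature-module compression (AFM-style).
import numpy as np
from sklearn.cluster import KMeans

def compress_spectrum(spectra: np.ndarray, k: int = 32) -> np.ndarray:
    """spectra: (n_samples, 512) magnitude spectra -> (n_samples, k) modules."""
    # Cluster the 512 frequency bins by how they behave across samples.
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(spectra.T)
    modules = np.zeros((spectra.shape[0], k))
    for c in range(k):
        bins = km.labels_ == c
        # One module per cluster: RMS energy of its member bins (assumed pooling).
        modules[:, c] = np.sqrt((spectra[:, bins] ** 2).mean(axis=1))
    return modules
```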

5.2. Summary of Innovation Points

  • For the first time, K-means feature-module clustering, load-adaptive multi-scale convolution, and self-attention dynamic gated auto-encoding are embedded in a unified framework, achieving Pareto optimality between accuracy and model lightness;
  • A Workload-Adaptive Weight Rescaler is proposed that lets the convolutional weights be dynamically modulated by spatial attention and the real-time CPU load λ simultaneously, balancing denoising against multi-scale fidelity;
  • A joint reconstruction–classification loss combined with minority-class F1-weighted voting is designed, overcoming the majority-class bias of traditional voting and significantly improving the recognition of weak faults in extremely imbalanced scenarios.

5.3. Method Limitations

5.3.1. Hardware Limitations

On low-power edge platforms (e.g., Raspberry Pi Zero W), the inference latency of the INT8-quantized model reaches 56 ms with a CPU utilization rate of 92%, which fails to meet real-time requirements for high-frequency monitoring. For mid-range edge devices (e.g., Jetson Nano), the model achieves 12.3 ms latency and 52% CPU utilization, indicating potential for practical deployment. Future work will focus on network pruning (e.g., channel pruning of WAMSC) to adapt to ultra-low-power sensor platforms.

5.3.2. Systematic Elaboration of Limitations

  • Sample size dependence: The model requires a minimum of 40 to 50 samples for minority classes; performance degrades significantly with fewer samples.
  • Complex noise adaptability: Under non-Gaussian/non-stationary noise (e.g., 15 dB impulsive noise), the Macro-F1 decreases by more than 5%, which is worse than under Gaussian noise.
  • Low-power hardware adaptation: The model cannot meet real-time requirements on ultra-low-power platforms (e.g., Raspberry Pi Zero W) and requires further optimization.
  • Calibration overhead: Recalibration is needed for large-scale operating condition changes, increasing maintenance cost in long-term deployment.

5.4. Future Outlook

  • Multimodal expansion: Uniformly map multiple physical quantities such as vibration, current, acoustic emission, and temperature to a shared hidden space, further enhancing the cross-sensor generalization capability;
  • Adaptive clustering: Research online incremental K-means or contrastive clustering to enable AFM to continuously update feature modules in streaming data scenarios and avoid concept drift;
  • Explainable causality: introduce gradient causality graphs (Grad-CAM-GC) or Shapley values to quantify the contribution of each frequency band to diagnostic decisions, meeting the auditing requirements of safety-critical domains;
  • Federated deployment: By integrating federated learning and differential privacy, collaborative training of wind turbine clusters is achieved without data leaving the factory, addressing issues of data silos and privacy compliance;
  • Fault prediction: introduce a temporal Transformer prediction head on top of the existing diagnostic framework to achieve end-to-end joint optimization of “diagnosis–remaining-life prediction”, providing a more complete decision-making basis for predictive maintenance.

Author Contributions

Conceptualization, K.Z. (Kaiyi Zhang); methodology, K.Z. (Kaiyi Zhang) and X.L.; software, G.Y.; validation, K.Z. (Kun Zhai) and G.A.; formal analysis, K.Z. (Kaiyi Zhang) and G.A.; investigation, Y.Z. and C.P.; resources, K.Z. (Kaiyi Zhang); data curation, X.L.; writing—original draft preparation, K.Z. (Kaiyi Zhang); writing—review and editing, X.L.; visualization, K.Z. (Kaiyi Zhang); supervision, X.L.; project administration, G.Y.; funding acquisition, X.L. All authors have read and agreed to the published version of the manuscript.

Funding

The research was supported by multiple funds: the National Natural Science Foundation of China (grant number 52005453), the Key Scientific and Technological Project of Henan Province (grant number 252102241012), the Science and Technology Project of the Henan Provincial Department of Transportation (grant number 2023-5-3), and the Zhengzhou City "Unveiling and Commanding" Project (grant number 2023JBGS009).

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AFM: Adaptive Feature Module
CDGAE: Conditional Dynamic GRU Auto-Encoder
AFM-CDGAE: Adaptive Feature Module–Conditional Dynamic GRU Auto-Encoder
WAMSC: Workload-Adaptive Multi-Scale Convolution
DGAE: Dynamic GRU Auto-Encoder
DGRUAE: Dynamic GRU Auto-Encoder (also written as DGAE)
GRU: Gated Recurrent Unit
CNN: Convolutional Neural Network
FFT: Fast Fourier Transform
MSE: Mean Square Error
BST-WVI: Balanced Subset Training and Weighted Voting Integration
CWRU: Case Western Reserve University (bearing dataset)
t-SNE: t-distributed Stochastic Neighbor Embedding
Grad-CAM: Gradient-weighted Class Activation Mapping
Grad-CAM++: Improved Gradient-weighted Class Activation Mapping
AUC: Area Under the ROC Curve
AUPRC: Area Under the Precision–Recall Curve
INT8: 8-bit Integer quantization
SMOTE: Synthetic Minority Over-sampling Technique
GAN: Generative Adversarial Network
VAE: Variational Auto-Encoder
DANN: Domain-Adversarial Neural Network
TrAdaBoost: Transfer AdaBoost
EEMD: Ensemble Empirical Mode Decomposition
CEEMDAN: Complete Ensemble Empirical Mode Decomposition with Adaptive Noise
SEHT: Signal-Enhanced Hilbert Transform
DITSVD: Doubly Improved Truncated Singular Value Decomposition
JR-TFViT: Lightweight Jamming Recognition–Time–Frequency Vision Transformer
ADAGCN: Adaptive Graph Convolutional Network
ResNet: Residual Network
EfficientNet: Efficient Convolutional Neural Network architecture
TS-TCC: Time-Series Temporal Contrastive Coding
CLR: Cyclical Learning Rate

Figure 1. Multi-scale convolutional model graph.
Figure 2. Structure diagram of the GRU model.
Figure 3. Structure diagram of the DGAE model.
Figure 4. Fault diagnosis flowchart.
Figure 5. Rolling bearing test bench of Model 6203.
Figure 6. Global time-domain waveform diagrams of vibration signals for five types of bearing states under four working conditions: (a) Working Condition 1; (b) Working Condition 2; (c) Working Condition 3; (d) Working Condition 4.
Figure 7. Real scene picture of the test bench.
Figure 8. Time-domain vibration waveforms of seven rolling bearing conditions under 1 HP, 2 HP, and 3 HP loads: (a) 1 HP; (b) 2 HP; (c) 3 HP.
Figure 9. Diagnostic accuracy and loss rate of the Paderborn dataset: (a) Accuracy; (b) Loss rate.
Figure 10. Classification visualization of the Paderborn dataset: (a) t-SNE visualization; (b) Confusion matrix.
Figure 11. Grayscale heat map of the normalized features of the Paderborn dataset varying with character length.
Figure 12. Performance comparison chart of the Paderborn dataset under extreme working conditions.
Figure 13. Diagnostic accuracy and loss rate of the CWRU dataset: (a) Accuracy; (b) Loss rate.
Figure 14. Grayscale heat map of the normalized features of the CWRU dataset varying with character length.
Figure 15. Classification visualization of the CWRU dataset: (a) t-SNE visualization; (b) Confusion matrix.
Figure 16. Performance comparison chart of the CWRU dataset under extreme working conditions.
Figure 17. Three-dimensional index bar chart of “Accuracy–Parameter quantity–Robustness”.
Figure 18. t-SNE visualization of the five-class task in the AFM ablation: (a) before removal; (b) after removal.
Figure 19. t-SNE visualization of the seven-class task in the AFM ablation: (a) before removal; (b) after removal.
Figure 20. Convolutional kernel weight heat maps: (a) 5 × 5 kernel weight heat map (over-amplified); (b) 7 × 7 kernel weight heat map (over-amplified).
Figure 21. Grad-CAM++ activation heat map: warmer colors (e.g., yellow, red) signify higher activation levels, reflecting stronger attention or response to the target features; cooler colors (e.g., blue, light blue) signify lower activation levels, corresponding to weaker or background regions.
Figure 22. Visualization of healthy and fault sample classification: (a) with the reconstruction branch removed; (b) with the reconstruction branch retained.
Figure 23. Weight distribution and Macro-F1 recovery effect of BST-WVI in the extremely imbalanced 100:1 scenario.