1. Introduction
Acoustic-based fault diagnosis has increasingly leveraged frequency-domain representations to capture the spectral characteristics of industrial signals [1, 2, 3, 4, 5]. However, unlike vibration signals, acoustic data is highly susceptible to non-stationary environmental noise, leading to significant domain shifts that degrade model generalization. While frequency-domain augmentation has gained momentum owing to its ability to manipulate spectral components [6, 7, 8], a critical gap remains: most current techniques rely on simple linear interpolation [9] or keep phase information fixed while only exchanging amplitude. Such approaches often fail to capture the intrinsic decoupling of amplitude (energy) and phase (structure) [10], resulting in “semantic drift”, where fault-specific features are inadvertently altered. To address these challenges, this paper proposes a novel Amplitude-Phase Collaborative Augmentation Network (AP-CANet). Motivated by the need for label-consistent diversification, AP-CANet collaboratively leverages both amplitude and phase features across multiple source domains to learn robust representations. Our method offers distinct benefits over existing domain generalization (DG) frameworks by introducing a Frequency–Spatial Interaction Module (FSIM) to bridge global spectral cues with local patterns, alongside a manifold triplet loss that reinforces category boundaries in a non-Euclidean manifold space. This problem-driven design ensures that the augmented features remain category-invariant while suppressing domain-specific interference.
Despite the success of Fourier-based augmentation, its application to industrial acoustics is limited by two gaps. First, it ignores physical decoupling where environmental noise affects amplitude while fault transients reside in phase, leading to “semantic drift” during random mixing; second, it lacks an adaptive mechanism to align these frequency components across domains. To bridge these gaps, our AP-CANet employs problem-driven Amplitude-Phase Collaborative Augmentation (APCA). By performing label-consistent augmentation and frequency–spatial interaction, APCA actively exploits the complementary nature of amplitude and phase to suppress domain interference while preserving category-invariant features.
The rest of this paper is organized as follows: Section 3 details the proposed AP-CANet architecture, Section 4 presents the experimental results and ablation analysis, and Section 5 concludes the study.
The main contributions of this paper are summarized as follows:
- We propose a novel framework named AP-CANet that leverages the complementary properties of amplitude and phase spectra to perform multi-source frequency-domain augmentation for robust acoustic fault diagnosis.
- We design the FSIM module, which fuses global spectral and local spatial features in both amplitude and phase subspaces, enhancing feature discriminability and cross-domain transferability.
- We introduce a manifold triplet loss with a non-linear distance metric to improve the model’s ability to mine hard samples and learn discriminative representations under domain-shift conditions.
- We conduct comprehensive experiments on two public acoustic datasets (GPLA-12 and MIMII-DG), and the results demonstrate that AP-CANet outperforms state-of-the-art domain generalization methods in unseen target domains.
3. Method
3.1. Motivation and Background
In acoustic fault diagnosis, the data distribution is highly sensitive to variations in industrial factors such as the working load, vibration frequency, and background noise. These variations lead to domain shifts that significantly degrade diagnostic performance. While frequency-domain analysis has been shown to alleviate such shifts, prior research has seldom explored discriminative feature extraction from frequency-transformed acoustic signals. To bridge this gap, we propose AP-CANet, which reconstructs both amplitude and phase representations in the Fourier domain. This enables the model to learn more robust, generalizable features that capture the underlying structure of acoustic signals across diverse working conditions.
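First, we revisit the definition and properties of the Fourier transform. The Fourier transform ($\mathcal{F}$) of a signal ($x$) with a shape of $C \times H \times W$ is formulated as follows:
$$\mathcal{F}(x)(u, v) = \sum_{h=0}^{H-1} \sum_{w=0}^{W-1} x(h, w)\, e^{-j 2\pi \left( \frac{h}{H} u + \frac{w}{W} v \right)},$$
where $C$, $H$, and $W$ are the channel, height, and width values of the signal, respectively, and the transform is applied to each channel independently.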
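Let $\mathcal{F}(x)$ denote the complex-valued Fourier transform of the acoustic signal ($x$). We define the amplitude spectrum ($\mathcal{A}$) and phase spectrum ($\mathcal{P}$) as follows:
$$\mathcal{A}(\mathcal{F}(x)) = \sqrt{R^2(x) + I^2(x)}, \qquad \mathcal{P}(\mathcal{F}(x)) = \arctan\left[\frac{I(x)}{R(x)}\right],$$
where $R(x)$ and $I(x)$ denote the real and imaginary parts of $\mathcal{F}(x)$. After decoupling the amplitude and phase, we perform label-consistent augmentation. The augmented frequency-domain feature ($\hat{\mathcal{F}}$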
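) is formulated as follows:
$$\hat{\mathcal{F}} = \left[\lambda_a\, \mathcal{A}(\mathcal{F}_i) + (1 - \lambda_a)\, \mathcal{A}(\mathcal{F}_j)\right] e^{\,j \left[\lambda_p\, \mathcal{P}(\mathcal{F}_i) + (1 - \lambda_p)\, \mathcal{P}(\mathcal{F}_j)\right]},$$
where $\mathcal{F}_i$ and $\mathcal{F}_j$ denote the complex frequency-domain representations of two different samples from the source domains, and $\mathcal{A}(\cdot)$ and $\mathcal{P}(\cdot)$ extract their amplitude and phase spectra. $\lambda_a$ and $\lambda_p$ are mixing coefficients sampled from a Dirichlet or Beta distribution that control the fusion ratio of the amplitude (style-related) and phase (structure-related) components. Finally, the augmented feature in the spatial domain is obtained by applying the Inverse Fast Fourier Transform (IFFT): $\hat{x} = \mathcal{F}^{-1}(\hat{\mathcal{F}})$.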
According to Fourier theory, the amplitude component ($\mathcal{A}$) reflects the stylistic characteristics of the signal in the frequency domain, while the phase component ($\mathcal{P}$) encodes structure-related information that is closely tied to the signal category. Empirical analysis reveals that amplitude varies significantly across domains, even within the same class, whereas the phase remains relatively consistent, indicating its robustness to domain shifts. Motivated by this observation, we propose to enhance intra-class diversity and cross-domain consistency by progressively extracting and reconstructing both amplitude and phase components. This strategy enables the model to learn more generalized representations, thereby reducing domain discrepancies and improving robustness under unseen conditions.
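The decoupling and recombination described above can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: it uses a naive one-dimensional DFT instead of an FFT library, the function names (`dft`, `idft`, `mix_amplitude_phase`) are hypothetical, and the convention that each mixing weight is the share given to the first signal is an assumption. Phase values are interpolated linearly without unwrapping, which is a simplification.

```python
import cmath
import math


def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse of dft(); returns a list of complex samples."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def mix_amplitude_phase(x_i, x_j, lam_amp, lam_pha):
    """Blend the amplitude and phase spectra of two same-length real signals.

    lam_amp and lam_pha lie in [0, 1] and are the weights given to x_i
    (an assumed convention; the weights for x_j are 1 - lam).
    """
    mixed = []
    for c_i, c_j in zip(dft(x_i), dft(x_j)):
        amp = lam_amp * abs(c_i) + (1.0 - lam_amp) * abs(c_j)
        pha = lam_pha * cmath.phase(c_i) + (1.0 - lam_pha) * cmath.phase(c_j)
        # Recombine amplitude and phase into one complex frequency bin.
        mixed.append(cmath.rect(amp, pha))
    # The inputs are real, so only the real part of the inverse transform is kept.
    return [value.real for value in idft(mixed)]
```

With `lam_amp = 0.0` and `lam_pha = 1.0` the result carries the amplitude (style) of `x_j` and the phase (structure) of `x_i`. A Beta-distributed weight could be drawn with `random.betavariate(a, b)`; suitable shape parameters are not given in the text and would have to be chosen.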
3.2. AP-CANet
Based on the above analysis, we introduce a simple yet effective AP-CANet framework, as shown in Figure 1. The overall network consists of an amplitude sub-network and a phase sub-network, both of which are designed to generate augmented amplitude and phase representations. Each sub-network incorporates a frequency–spatial interaction module (FSIM) as a fundamental building block, which integrates global and local spatial information to enhance feature extraction and learning.
In this process, a subset ($X_A$) from domain A is selected as an input feature. AP-CANet adaptively aligns its amplitude and phase representations to the corresponding features ($X_B$) in domain B. In the reverse direction, a sample subset from domain B is taken, and AP-CANet aligns its amplitude and phase components to approach the subset $X_A$ in domain A. This bidirectional cross-domain matching enhances feature consistency and improves model generalization under diverse domain conditions. We take the alignment from $X_A$ to $X_B$ as an example to describe the processing flow of the two sub-networks in AP-CANet.
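In the amplitude sub-network, a sample from $X_A$ is selected as input ($x_A$) and is passed through five FSIM-based encoder–decoder blocks to generate the output feature ($\hat{x}_{amp}$). Meanwhile, the corresponding sample ($x_B$) from $X_B$ under the same category is used as the target. In amplitude space, the three signals ($x_A$, $\hat{x}_{amp}$, and $x_B$) form a constraint. To guide the learning of amplitude components, we extract the amplitude spectrum ($\mathcal{A}(\mathcal{F}(x_B))$) as a soft label and constrain the output ($\hat{x}_{amp}$). The amplitude loss ($\mathcal{L}_{amp}$) is defined as follows:
$$\mathcal{L}_{amp} = \left\| \mathcal{A}\big(\mathcal{F}(G_{amp}(x_A))\big) - \mathcal{A}\big(\mathcal{F}(x_B)\big) \right\|_1,$$
where $\|\cdot\|_1$ denotes the mean absolute error, $G_{amp}$ is the amplitude sub-network (so that $\hat{x}_{amp} = G_{amp}(x_A)$), $\mathcal{F}$ is the Fourier transform operator, and $\mathcal{A}$ represents amplitude extraction from the spectrum.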
In the phase sub-network, we use four FSIM blocks specialized for phase representation learning. The first two blocks receive residual information from the amplitude sub-network to preserve cross-modal consistency. To ensure consistency of the input, we do not feed the source sample $x_A$ directly into the phase sub-network. Instead, we use the reconstructed amplitude ($\mathcal{A}(\mathcal{F}(\hat{x}_{amp}))$, taken from the amplitude sub-network output $\hat{x}_{amp}$) and the original phase ($\mathcal{P}(\mathcal{F}(x_A))$) as inputs:
$$x_{pha} = \mathcal{F}^{-1}\left( \mathcal{A}\big(\mathcal{F}(\hat{x}_{amp})\big)\, e^{\,j\, \mathcal{P}(\mathcal{F}(x_A))} \right).$$
To recover time-domain signals, we adopt the Inverse Fast Fourier Transform (IFFT) for signal reconstruction, treating the processed signals as complex-valued inputs. Considering the sensitivity of phase components to structural and environmental changes in real-world scenarios, we further model the residual phase differences between the input and the target to guide alignment. In implementation, the residual features are concatenated with the phase features, followed by a convolution for integration. The phase loss is defined as follows:
$$\mathcal{L}_{pha} = \left\| \mathcal{P}\big(\mathcal{F}(G_{pha}(x_{pha}))\big) - \mathcal{P}\big(\mathcal{F}(x_B)\big) \right\|_1,$$
where $G_{pha}$ is the phase sub-network, $x_B$ is the target sample, and $\mathcal{P}$ represents the phase extraction function. Finally, the total augmentation loss of AP-CANet is computed as follows:
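$$\mathcal{L}_{aug} = \alpha\, \mathcal{L}_{amp} + \beta\, \mathcal{L}_{pha},$$
where $\alpha$ and $\beta$ are weighting factors for the amplitude loss ($\mathcal{L}_{amp}$) and the phase loss ($\mathcal{L}_{pha}$), which, in this paper, we set to 0.2 and 0.4, respectively.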
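As a rough illustration of how the two spectral losses combine, the sketch below computes a mean absolute error between amplitude spectra and between phase spectra and then applies the 0.2 and 0.4 weights quoted above. It is a simplified stand-in, not the paper's training code: the spectra are passed in as plain lists of complex numbers, the function names are hypothetical, and phase differences are compared without wrapping, which is an assumption.

```python
import cmath


def mean_absolute_error(a, b):
    """Mean absolute difference between two equal-length sequences of floats."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def augmentation_loss(amp_output, pha_output, target, w_amp=0.2, w_pha=0.4):
    """Weighted sum of an amplitude loss and a phase loss.

    amp_output: spectrum of the amplitude sub-network output (complex list)
    pha_output: spectrum of the phase sub-network output (complex list)
    target:     spectrum of the same-class target sample (complex list)
    Returns (total, amplitude_loss, phase_loss).
    """
    amp_loss = mean_absolute_error(
        [abs(c) for c in amp_output], [abs(c) for c in target]
    )
    pha_loss = mean_absolute_error(
        [cmath.phase(c) for c in pha_output], [cmath.phase(c) for c in target]
    )
    return w_amp * amp_loss + w_pha * pha_loss, amp_loss, pha_loss
```

Keeping the two terms separate in the return value makes it easy to log how much each sub-network contributes when the weights are varied.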
The selection of AP-CANet is driven by the need to decouple amplitude (style) and phase (structure), a capability standard spatial-domain CNNs lack but that is vital for acoustic domain generalization. By bridging global spectral cues and local patterns via FSIM, the model achieves superior robustness. While the dual-branch structure and FFT operations increase computational complexity, this overhead is justified by the significant gains in diagnostic reliability under varying industrial conditions, offering a more resilient solution than traditional single-stream architectures.
3.3. Frequency–Spatial Interaction Module
According to the theoretical analysis in [20], frequency-domain operations allow models to capture global representations, while convolutional layers are mainly focused on extracting local spatial features. Inspired by this observation, we introduce the Frequency–Spatial Interaction Module (FSIM) as a fusion block in both the amplitude and phase sub-networks to enhance their joint representation learning capabilities.
As illustrated in Figure 2, FSIM consists of two branches: a frequency branch and a spatial branch. First, the input source feature ($X$) is fed into FSIM. In the spatial branch, a series of convolutional residual blocks are applied to extract spatial information, producing output $X_s$.
Simultaneously, the frequency branch processes $X$ using a convolution to obtain $X_{f0}$, which is then transformed into the frequency domain via the Fourier transform, yielding spectrum representation $\mathcal{F}(X_{f0})$. Amplitude and phase components are constructed in the frequency domain. To further extract amplitude information, we introduce a convolutional encoder module. The final output of the frequency branch is reconstructed via inverse Fourier transform as follows:
$$X_f = \mathcal{F}^{-1}\left( C\big(\mathcal{A}(\mathcal{F}(X_{f0}))\big)\, e^{\,j\, \mathcal{P}(\mathcal{F}(X_{f0}))} \right),$$
where $C(\cdot)$ denotes the convolution operation.
Next, the two branch outputs ($X_s$ and $X_f$) are passed through convolution layers to encourage feature interaction between the two branches. The updated features are calculated as follows:
$$X_s' = X_s + W(X_f), \qquad X_f' = X_f + W(X_s),$$
where $W(\cdot)$ denotes the convolution operation.
As shown in Figure 2, the outputs ($X_s'$ and $X_f'$) capture complementary features through interaction, enhancing model discriminability. This process is repeated in subsequent FSIM layers. The final output of FSIM is denoted as $X_{out}$. Similarly, in the phase branch, the amplitude function ($\mathcal{A}$) is replaced by the phase function ($\mathcal{P}$), while the remaining operations remain unchanged.
3.4. Manifold Triplet Loss
In traditional machine learning, it is typically assumed that the distance between samples reflects their similarity in Euclidean space, where data is linearly and uniformly distributed. However, in real-world applications such as rotating machinery fault diagnosis, the acquired signals are often affected by installation, environmental, and operating conditions, resulting in non-linear and complex data distributions that are hard to model using Euclidean assumptions.
Manifold space refers to a geometric space that appears Euclidean locally but may exhibit curved structure globally. As shown in Figure 3, Euclidean distances fail to reflect true sample similarity paths, whereas manifold distances measured along the curved data structure preserve more intrinsic relationships. Motivated by this, we propose a manifold triplet loss that captures intra-class and inter-class relationships via semantic distances on the manifold.
To model this, we replace Euclidean distance with shortest path distance on a manifold graph. Specifically, a graph is constructed where each node represents a sample, and edges connect each node to its $K$ nearest neighbors. The edge weights are defined by pairwise local Euclidean distances. A sparse adjacency matrix ($D$) is built such that
$$D_{ij} = \min\left(D_{ij},\; D_{ik} + D_{kj}\right),$$
where $D_{ij}$ is the shortest path between nodes $i$ and $j$ and $k$ denotes any intermediate node. The Floyd–Warshall algorithm is used to compute shortest paths among all sample pairs.
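The graph construction and shortest-path step can be sketched in a few lines of plain Python. This is an illustrative sketch only: the helper names `knn_graph` and `floyd_warshall` are hypothetical, the graph is symmetrized (an assumption), and the cubic-time loop is practical only for mini-batch sized inputs.

```python
import math


def knn_graph(points, k):
    """Symmetric K-nearest-neighbour graph with Euclidean edge weights.

    Returns an n x n matrix; missing edges are math.inf, the diagonal is 0.
    """
    n = len(points)
    dist = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        dist[i][i] = 0.0
        neighbours = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )[:k]
        for j in neighbours:
            w = math.dist(points[i], points[j])
            dist[i][j] = min(dist[i][j], w)
            dist[j][i] = min(dist[j][i], w)
    return dist


def floyd_warshall(dist):
    """All-pairs shortest paths: D[i][j] = min(D[i][j], D[i][k] + D[k][j])."""
    n = len(dist)
    d = [row[:] for row in dist]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d
```

For points laid out along a horseshoe, the geodesic distance between the two ends follows the curve and is therefore larger than the straight-line Euclidean distance, which is the effect the manifold metric is meant to capture. If the neighbourhood size is too small, the graph can be disconnected and some entries stay infinite.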
To emphasize the importance of hard boundary samples, we define a nonlinear distance rescaling function:
where
r is the mean intra-class distance in the current mini-batch and
k is a scaling coefficient controlling the strength of expansion and contraction (set to 3). Finally, the manifold triplet loss is defined as follows:
$$\mathcal{L}_{mtl} = \max\left(0,\; d_{ap} - d_{an} + m\right),$$
where $d_{ap}$ and $d_{an}$ denote the manifold distances of the hardest positive and negative pairs within a mini-batch and $m$ is the margin. This approach enables the model to better capture intra-class cohesion and inter-class separation in non-Euclidean data structures.
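A minimal sketch of hard-sample mining on a precomputed distance matrix is given below. It assumes the common per-anchor "batch-hard" formulation averaged over anchors, which may differ from the exact variant used here, and it omits the nonlinear rescaling step. The function name is hypothetical; the default margin of 0.3 follows the experimental setting reported later in the paper.

```python
def batch_hard_triplet_loss(dist, labels, margin=0.3):
    """Batch-hard triplet loss over a precomputed pairwise distance matrix.

    For each anchor, the farthest same-class sample (hardest positive) and
    the closest different-class sample (hardest negative) are selected.
    Anchors without a positive or a negative are skipped.
    """
    n = len(labels)
    losses = []
    for a in range(n):
        pos = [dist[a][j] for j in range(n) if j != a and labels[j] == labels[a]]
        neg = [dist[a][j] for j in range(n) if labels[j] != labels[a]]
        if not pos or not neg:
            continue
        losses.append(max(0.0, max(pos) - min(neg) + margin))
    return sum(losses) / len(losses) if losses else 0.0
```

Passing geodesic distances instead of Euclidean ones is the only change needed to turn this into a manifold-aware variant, since the loss itself only consumes the distance matrix.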
3.5. Loss Function
During the classification stage, both the original input ($x$) and the augmented sample ($\hat{x}$) are used to construct the training dataset to enhance the model’s ability to identify different fault types. The softmax function is used to calculate the predicted probability ($p_i$), which is defined as follows:
$$p_i = \frac{\exp(z_i)}{\sum_{j=1}^{N} \exp(z_j)},$$
where $z_i$ represents the logit score of the $i$-th fault class and $N$ denotes the number of fault classes. The prediction with the highest probability is selected as the final classification result. The model is optimized by minimizing the cross-entropy loss between the predicted results and the ground truth:
$$\mathcal{L}_{ce} = -\sum_{i=1}^{N} y_i \log p_i,$$
where $y_i$ is the one-hot ground-truth label.
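By combining all loss terms, the total training objective of AP-CANet is expressed as
$$\mathcal{L}_{total} = \mathcal{L}_{ce} + \mathcal{L}_{aug} + \gamma\, \mathcal{L}_{mtl},$$
where $\mathcal{L}_{ce}$, $\mathcal{L}_{aug}$, and $\mathcal{L}_{mtl}$ are the cross-entropy, augmentation, and manifold triplet losses, and $\gamma$ is a weighting coefficient used to balance the contribution of the manifold triplet loss.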
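The classification terms can be checked numerically with a small sketch. The functions below are generic textbook softmax and cross-entropy for a single sample, written with the standard library; the names are hypothetical, and the max-subtraction is a standard numerical-stability trick rather than something stated in the text.

```python
import math


def softmax(logits):
    """Convert a list of logit scores into class probabilities."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Cross-entropy of one sample whose true class is target_index."""
    return -math.log(softmax(logits)[target_index])


def predict(logits):
    """Index of the class with the highest predicted probability."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)
```

For two equal logits the probabilities are 0.5 each and the loss equals log 2, which is a convenient sanity check when wiring up a training loop.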
4. Experiment
To verify the effectiveness and generalization capability of the proposed AP-CANet in acoustic fault diagnosis, comprehensive experiments are conducted on two publicly available acoustic fault diagnosis datasets: the Pipeline Leak Acoustic Dataset (GPLA-12) and the Electrical Sound Dataset (MIMII-DG). These datasets cover a variety of operating conditions and simulate domain shifts commonly encountered in industrial environments.
All models are trained using the same backbone network for fairness, and performance is measured using classification accuracy, confusion matrix analysis, and feature visualization (e.g., t-SNE). Furthermore, ablation studies are conducted to investigate the contributions of amplitude-phase augmentation and manifold-aware metric learning.
4.1. Dataset Description
The inclusion of the MIMII-DG dataset alongside GPLA-12 is intended to evaluate the model’s cross-scenario generalization. While GPLA-12 focuses on localized pipeline leakages, MIMII-DG provides acoustic data from rotating industrial fans under varying SNR conditions. By achieving consistent performance across these two fundamentally different acoustic environments, we demonstrate that AP-CANet is not overfit to a specific hardware setup but is robust to diverse industrial soundscapes.
GPLA-12 [21]: GPLA-12 is an open-source acoustic dataset specifically designed for gas pipeline leak detection. The real gas pipeline system with detailed components shown in
Figure 4 is used to collect the GPLA-12 dataset. The dataset consists of 684 samples categorized into 12 classes, each representing different leakage intensities, pressure levels, and background noise conditions. To evaluate the generalization performance of the proposed method under domain-shift scenarios, we divide the dataset into domains based on sound pressure levels (SPLs), simulating realistic acoustic variations caused by changes in environmental energy. Specifically, the dataset is partitioned into three domains corresponding to pressure levels of 0.2 MPa, 0.4 MPa, and 0.5 MPa (expressed as
,
, and
), with each domain containing four distinct classes. Each sample is a one-dimensional acoustic signal of size
.
MIMII-DG [22]: The MIMII-DG dataset is designed for acoustic monitoring of industrial equipment including pumps, fans, solenoid valves, and slide rails, containing machine operation sounds under varying background noise conditions.
Figure 5 depicts the recording setup for each machine’s orientation and distance relative to the microphone array. In this study, we adopt a domain-partitioning strategy based on noise levels, using three signal-to-noise ratios (SNRs) to simulate different noise environments: −6 dB, 0 dB, and 6 dB (expressed as
,
, and
). Each domain includes two classes: normal and anomalous, with anomalies covering faults such as leakage, contamination, and component damage. Experiments are conducted using the fan device data. The model is trained on two source domains and evaluated on an unseen target domain, without accessing its data during training. Audio was recorded using an 8-channel microphone array at 16 kHz. Each sample has a shape of
.
As shown in Table 1, the GPLA-12 dataset consists of 12 classes recorded in three spatial domains representing different sensor-to-leak distances. The MIMII-DG dataset (fan subset) includes industrial fan sounds mixed with ambient factory noise at various signal-to-noise ratios (SNRs). To ensure reliability, we adopt a “leave-one-domain-out” strategy: the model is trained on two domains (with an 80/20 split for internal validation) and evaluated on the entirely unseen third domain to simulate real-world deployment.
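The leave-one-domain-out protocol can be expressed as a short generator. This is a sketch under stated assumptions: the domain names and the `leave_one_domain_out` helper are hypothetical, samples are represented by arbitrary Python objects, and the 80/20 split is done by a seeded random shuffle of the pooled source samples (stratification by class is not shown).

```python
import random


def leave_one_domain_out(domains, val_fraction=0.2, seed=0):
    """Yield (target_name, train, val, test) once per held-out domain.

    domains: dict mapping a domain name to a list of samples.
    The held-out domain is never mixed into train or val.
    """
    rng = random.Random(seed)
    for target in domains:
        source = [
            sample
            for name, samples in domains.items()
            if name != target
            for sample in samples
        ]
        rng.shuffle(source)
        n_val = int(round(len(source) * val_fraction))
        yield target, source[n_val:], source[:n_val], list(domains[target])
```

Each iteration corresponds to one cross-domain transfer task, so three domains produce three tasks whose accuracies can then be averaged.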
The labeling procedure follows the physical configurations of the data collection process. For GPLA-12, each acoustic sample is assigned a label ($y$) corresponding to one of 12 distinct leakage scenarios (varying by hole size and location). For MIMII-DG, we adopt a binary labeling scheme where ‘0’ represents normal operational sound and ‘1’ indicates anomalous fan behavior. All labels are cross-verified with the ground-truth mechanical states recorded during the experiments to ensure high data fidelity before the training phase.
4.2. Experimental Setting
Our proposed method is implemented using the PyTorch 2.1.0 framework [23] on Python 3.9 and a single NVIDIA 2080 Ti GPU. The framework is trained for 50 epochs with a batch size of 128. The SGD optimizer [24] is employed to update the overall framework with a momentum of 0.9. Due to the different sensitivity of the Amplitude-Phase Collaborative Augmentation Network, the initial learning rates are set to 0.001 and 0.01. To train the manifold triplet loss, each batch is constructed by a PK sampler [25, 26], which draws four classes with 32 instances each (4 × 32 = 128 samples per batch). The constant threshold ($r$) is calculated as the mean distance in a batch, and the margin in the manifold triplet loss is 0.3. The weighting coefficient of the manifold triplet loss is set to 0.01 in the training procedure. To ensure the reliability and effectiveness of the experimental results, each method is run five times on each task.
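A PK sampler of the kind described above can be sketched as follows. This is a generic illustration, not the cited implementation: `pk_batch` is a hypothetical name, the defaults P = 4 and K = 32 mirror the stated batch layout, and sampling with replacement for classes that have fewer than K samples is an assumption.

```python
import random
from collections import defaultdict


def pk_batch(labels, p=4, k=32, rng=None):
    """Return dataset indices for one batch of p classes with k samples each.

    labels: class label of every sample in the dataset, in dataset order.
    """
    rng = rng or random.Random()
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    batch = []
    for cls in rng.sample(sorted(by_class), p):
        pool = by_class[cls]
        if len(pool) >= k:
            batch.extend(rng.sample(pool, k))  # without replacement
        else:
            batch.extend(rng.choices(pool, k=k))  # with replacement (assumed)
    return batch
```

Guaranteeing several samples per class in every batch is what makes hardest-positive and hardest-negative mining well defined for each anchor.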
To quantitatively evaluate the performance of AP-CANet, we employ accuracy and standard deviation as the primary metrics. Accuracy is defined as the ratio of correctly predicted samples to the total number of samples:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$$
where $TP$, $TN$, $FP$, and $FN$ represent true positives, true negatives, false positives, and false negatives, respectively. Since domain generalization involves multiple trials across different target domains, we also report the standard deviation (Std) of the accuracy to demonstrate the model’s stability and reliability. A high average accuracy combined with a low standard deviation indicates that the proposed model is not only accurate but also consistently robust against diverse acoustic domain shifts.
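The two reported metrics can be computed with the standard library as sketched below. The function names are hypothetical, and the use of the sample standard deviation (`statistics.stdev`) rather than the population form is an assumption, since the text does not say which one is reported.

```python
import statistics


def accuracy(tp, tn, fp, fn):
    """Fraction of correctly classified samples."""
    return (tp + tn) / (tp + tn + fp + fn)


def mean_and_std(values):
    """Mean and sample standard deviation of repeated accuracy measurements."""
    return statistics.mean(values), statistics.stdev(values)
```

For example, three runs scoring 80.0, 82.0, and 84.0 would be summarized as 82.0 ± 2.0 under this convention.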
4.3. Comparative Experiment
In this experiment, we compared our proposed method with two categories of benchmark models: a classical domain adaptation (DA) method, i.e., MMD [27], and several widely adopted DG methods, including SNR [28], Mixup [29], and MixStyle [30]. The average classification accuracies and standard deviations of all methods are reported in Table 2.
Table 2 provides a comprehensive comparison of the domain generalization performance of various methods on two challenging acoustic datasets: GPLA-12 and MIMII-DG (fan device). The reported metrics are the average classification accuracy and standard deviation (mean ± std), which, together, reflect both predictive performance and stability across three cross-domain transfer tasks.
AP-CANet consistently outperforms all the compared methods on both datasets. On the GPLA-12 dataset, which involves complex inter-task generalization, AP-CANet achieves the best results in all three transfer tasks, with an average accuracy of 81.09% and a standard deviation of only 1.86, indicating both superior recognition capability and stable model behavior under domain shift. On the MIMII-DG dataset, where the domain gap arises mainly from different background noise levels, AP-CANet again demonstrates excellent robustness. It achieves accuracies of 85.76%, 84.68%, and 82.41% across the three tasks, resulting in an average of 84.28% ± 1.74.
In contrast, the ResNet18 backbone exhibits the weakest generalization performance, with average accuracies of 68.07% ± 4.35 on GPLA-12 and 70.69% ± 2.61 on MIMII-DG, confirming its limited robustness under domain shift. The MMD method, which reduces domain discrepancies via distribution alignment, performs well on GPLA-12 (79.12% ± 1.89) but achieves only 80.90% ± 2.68 on MIMII-DG, indicating that its effectiveness may be limited in real-world noisy scenarios. The SNR-based method benefits from explicit modeling of the signal-to-noise ratio, which aligns with the domain definitions in MIMII-DG, yielding 79.77% ± 2.57. However, its performance drops to 76.73% ± 2.57 on GPLA-12, reflecting limited adaptability to more abstract domain shifts.
Feature-level augmentation methods such as Mixup and MixStyle show moderate performance. Mixup achieves 72.04% and 78.00% average accuracies on GPLA-12 and MIMII-DG respectively, while MixStyle reaches accuracies of 70.98% and 78.80%. Despite occasional improvements, their relatively large standard deviations (e.g., MixStyle’s 3.49 on GPLA-12) indicate inconsistency and potential sensitivity to varying domain compositions.
As shown in the confusion matrices in Figure 6, AP-CANet achieves the best overall performance on the GPLA-12 dataset, with over 90% accuracy in three out of four classes and the lowest overall misclassification rate. This demonstrates its superior generalization ability and stability across tasks. In contrast, ResNet18 and Mixup suffer from severe misclassification between classes 2 and 3. MixStyle and SNR show some improvements but still exhibit class boundary ambiguity. MMD performs poorly in distinguishing class 3. Overall, AP-CANet exhibits stronger discriminative capability and robustness under complex acoustic conditions.
Overall, AP-CANet leverages amplitude-phase collaborative augmentation to effectively capture both amplitude-invariant and phase-sensitive representations in the frequency domain. This design enables robust and accurate fault detection under diverse domain shifts, demonstrating strong generalization capability across both synthetic and real-world acoustic scenarios.
4.4. Discussion and Implications
The experimental results across the GPLA-12 and MIMII-DG datasets yield several major findings. First, the consistent superiority of AP-CANet confirms that frequency-domain decoupling is more effective for acoustic signals than standard spatial-domain augmentations. As noted in [10], phase information encodes the essential structural patterns of acoustic wavefronts. By preserving phase consistency while diversifying amplitude, our model successfully mitigates domain shift without losing fault-specific semantics.
Second, the performance gain in low-SNR scenarios (MIMII-DG) suggests that the FSIM module effectively filters environmental noise by bridging global spectral trends with local temporal transients. This aligns with the observations in [5] regarding the importance of multi-scale feature interaction in complex soundscapes. The implication for industrial practice is significant: AP-CANet reduces the need for extensive manual data re-labeling when deploying diagnostic systems to new factories with different background noise levels.
Finally, the success of the manifold triplet loss indicates that mapping acoustic features to a non-Euclidean manifold space better captures the underlying geometry of fault distributions than traditional Euclidean metrics. This finding reinforces the theory that high-dimensional acoustic data often resides on low-dimensional manifolds, as discussed in recent DG literature. In summary, these results demonstrate that integrating acoustic physical priors (amplitude and phase) with advanced metric learning provides a more reliable and interpretable pathway for robust machine health monitoring.
4.5. Ablation Experiments on Core Modules
To verify the effectiveness of the proposed components in AP-CANet, we conduct a detailed ablation study on the GPLA-12 dataset. The results are summarized in Table 3, where we systematically evaluate the contribution of Amplitude-Phase Collaborative Augmentation (APCA), the two sub-branches (amplitude (amp) and phase (pha)) of the Frequency–Spatial Interaction Module (FSIM), and the Manifold Triplet Loss (MTL).
- (1) Effectiveness of APCA: The baseline model (without any proposed modules) achieves an average accuracy of 75.82%. With the incorporation of APCA, the performance increases to 78.16%. This improvement confirms that frequency-guided augmentation effectively addresses the domain-shift problem by enriching the diversity of the training samples while preserving their semantic consistency.
- (2) Contribution of FSIM Branches: We further investigate the impact of the dual-branch interaction within the FSIM. Adding only an amplitude branch (amp) or phase branch (pha) to the augmented baseline improves the accuracy to 78.68% and 79.06%, respectively. The best intermediate result (80.35%) is achieved when both branches are integrated, demonstrating that the collaborative interaction between global spectral information and local spatial details is essential for the extraction of discriminative features that are robust to environmental noise.
- (3) Impact of MTL: The inclusion of MTL as the final constraint brings the overall performance to its peak of 81.09%. This incremental gain validates that our manifold-based metric learning effectively refines the feature space by narrowing intra-class distances on the data manifold, providing superior generalization to unseen target domains.
In conclusion, the ablation results demonstrate a clear performance ladder, confirming that each design choice—from data-level augmentation to architectural feature fusion and loss-level regularization—plays a distinct and vital role in the success of AP-CANet.
4.6. Ablation Study on the Hyperparameters of the Augmentation Network
To systematically evaluate the impact of the regularization weight factors ($\alpha$ and $\beta$, the amplitude and phase loss weights of the augmentation network) on model performance, this study includes an ablation experiment using a set of proportional parameter combinations $\alpha/\beta$: (0.1/0.2, 0.2/0.4, 0.3/0.6, 0.4/0.8, 0.5/1.0, 0.6/1.2, 0.7/1.4). As shown in
Figure 7, when
, the model achieves the highest diagnostic accuracies of 83.91% on the GPLA-12 dataset and 86.76% on the MIMII-DG dataset across different sub-tasks. Compared to the second-best combination (
), this setting improves accuracy by 1.70 and 3.66 percentage points, respectively.
In the low-parameter range, insufficient regularization leads to model overfitting, causing the GPLA-12 test accuracy to drop to 79.11%. Conversely, in the high-parameter range, excessive regularization results in feature degradation, with the MIMII-DG accuracy dropping sharply by 6.76%. Notably, maintaining a 1:2 ratio between the two weight factors enables a dynamic balance between the amplitude and phase sub-networks, effectively promoting multi-source domain-feature learning and augmentation and optimizing cross-domain feature disentanglement.
4.7. Industrial Applications
The proposed AP-CANet architecture offers significant practical value for modern industrial systems. Specifically, it can be deployed in smart manufacturing plants for the continuous health monitoring of rotating components (e.g., industrial fans and pumps), where its robustness to environmental noise ensures high diagnostic accuracy without frequent recalibration. Furthermore, in gas pipeline infrastructure, the model can be integrated into automated leak detection systems to provide real-time alerts, potentially preventing costly energy losses and environmental hazards. By leveraging the model’s domain generalization capability, enterprises can significantly reduce the costs associated with data acquisition and expert labeling when scaling diagnostic solutions across different geographic sites or equipment generations.
5. Conclusions
This paper addresses the challenge of domain generalization in acoustic fault diagnosis under complex working conditions and domain shifts, especially when target-domain data is unavailable. To tackle this, AP-CANet is proposed in this paper. The network adaptively aligns amplitude and phase features across multiple source domains and introduces a frequency–spatial interaction module, along with a manifold triplet loss to enhance feature discrimination while maintaining semantic consistency. Experimental evaluations on the GPLA-12 and MIMII-DG datasets demonstrate that the proposed method achieves superior performance in unseen domains and under noisy conditions. Despite its performance, this study has limitations. First, the dual-branch interaction increases computational complexity, potentially hindering real-time deployment on edge devices. Second, the reliance on amplitude-phase decoupling may be less effective for signals with extreme non-stationary noise. Future work will focus on developing lightweight architectures and adaptive frequency selection. Additionally, extending AP-CANet to open-set scenarios—where unseen fault categories emerge in the target domain—remains a promising direction for the enhancement of industrial reliability.
In the long term, the principles of amplitude-phase collaborative interaction established in this study can provide a foundation for the development of more resilient acoustic diagnostic frameworks. Future research will explore the extension of AP-CANet to multimodal fusion scenarios, where acoustic data is combined with vibration or thermal imagery for more comprehensive system health assessment. Additionally, we aim to investigate self-supervised pre-training techniques to further reduce the dependency on labeled source-domain data, making the model more adaptable to extreme environments where fault samples are scarce. Ultimately, these advancements will contribute to the realization of truly autonomous and self-evolving industrial maintenance platforms.