The experiments were implemented in PyTorch 1.10.0 and Python 3.8.5 on a platform with an AMD Ryzen 9 7950X CPU @ 4.50 GHz (64 GB RAM) and an RTX 4070Ti GPU. The meta-learning framework is trained for 200 epochs, with 100 episodes randomly sampled per epoch and a meta-batch size of four episodes per gradient update. The initial learning rates are 0.01 for the feature extractor and 0.005 for the similarity measurer; both are decayed by a factor of 0.5 at epochs 50, 100, and 150. The multi-layer loss balancing parameter β is fixed at 0.1 unless stated otherwise. These settings are applied consistently across all subsequent experiments.
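The step-decay schedule described above can be sketched in a few lines of plain Python. This is only an illustration of the stated settings, not the training code itself: the function names, the dictionary keys, and the 0-indexed epoch convention are assumptions, and the update count assumes that each sampled episode is used exactly once per epoch.

```python
# Step decay as stated in the text: rates are halved at epochs 50, 100, and 150.
MILESTONES = (50, 100, 150)
GAMMA = 0.5

# Initial rates from the text; the key names are illustrative only.
INITIAL_LR = {"feature_extractor": 0.01, "similarity_measurer": 0.005}


def learning_rate(initial_lr, epoch, milestones=MILESTONES, gamma=GAMMA):
    """Learning rate in effect at a 0-indexed epoch (assumed convention)."""
    n_decays = sum(1 for m in milestones if epoch >= m)
    return initial_lr * gamma ** n_decays


def updates_per_epoch(episodes_per_epoch=100, meta_batch_size=4):
    """Gradient updates per epoch, assuming each episode is used once."""
    return episodes_per_epoch // meta_batch_size


if __name__ == "__main__":
    for epoch in (0, 49, 50, 100, 150, 199):
        rates = {k: learning_rate(v, epoch) for k, v in INITIAL_LR.items()}
        print(epoch, rates)
    print("gradient updates per epoch:", updates_per_epoch())
```

In PyTorch, the same behavior would typically be obtained with `torch.optim.lr_scheduler.MultiStepLR` using `milestones=[50, 100, 150]` and `gamma=0.5`, with one optimizer parameter group per module.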
4.1. Experiment 1: Sensitivity Analysis of Training Sample Size
This experiment uses the public benchmark dataset provided by the Bearing Data Center of Case Western Reserve University (CWRU) [24]. Owing to its reliability and wide recognition, this dataset has become an important benchmark for bearing fault diagnosis research.
Data acquisition settings: The sampling frequency is 12 kHz. Sensor arrangement: The vibration signal is collected by an accelerometer mounted vertically on the drive-end bearing housing of the motor. Experimental setup: The schematic diagram of the experimental equipment is shown in Figure 4. Fault simulation method: Bearing faults are seeded by electric discharge machining (EDM). Single-point faults of three diameters (0.007, 0.014, and 0.021 inches) are introduced on the inner race, the outer race, and the rolling element (ball) of the bearing.
Fault categories: As shown in Table 2, the dataset contains 10 states in total: nine fault states covering the combinations of fault location (inner race, outer race, rolling element) and damage size, plus the normal state. Data preprocessing and sample construction: The raw signal is a time-domain vibration signal. A fixed-length segmentation strategy is adopted: the continuous vibration signal is divided into non-overlapping segments of 1024 consecutive data points, and each segment forms an independent data sample.
Dataset division: After applying the above labeling (10 categories) and segmentation strategy, the CWRU dataset is organized into 10 categories, each containing 400 samples of equal length (1024 points per sample). The dataset is thus clearly partitioned and class-complete, providing a solid experimental basis for evaluating few-shot fault diagnosis methods.
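The fixed-length, non-overlapping segmentation can be sketched as follows. This is a minimal illustration in plain Python; the function name is hypothetical, and discarding the trailing remainder shorter than one segment is an assumption, since the text does not say how leftover points are handled.

```python
def segment_signal(signal, segment_length=1024):
    """Split a 1-D sequence into non-overlapping segments of fixed length.

    Trailing points that do not fill a whole segment are dropped
    (an assumption; the text does not specify this).
    """
    n_segments = len(signal) // segment_length
    return [
        signal[i * segment_length:(i + 1) * segment_length]
        for i in range(n_segments)
    ]


if __name__ == "__main__":
    # Synthetic stand-in for one recorded vibration signal.
    raw = list(range(400 * 1024 + 100))
    samples = segment_signal(raw)
    print(len(samples), len(samples[0]))
```

Because the segments do not overlap, no data point is shared between two samples, which avoids leakage between the samples drawn into support and query sets.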
The fault characteristic frequencies are calculated according to Equations (3)–(6) in Section 2.3 and serve as the theoretical basis for the “Fault Type” and “Label” classifications presented in Table 2. The signal processing framework in this paper (Section 3) is designed to extract and learn features that are discriminative of these underlying physical phenomena, even from very few samples. The severity of the fault (e.g., 0.007″, 0.014″, 0.021″) modulates the amplitude and bandwidth of these excitations.
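As an illustration of how such fault characteristic frequencies are obtained, the sketch below uses the standard kinematic formulas for a bearing with a stationary outer race. Equations (3)–(6) are not reproduced in this section, so the notation here may differ from theirs. The geometry values (9 balls, ball diameter 0.3126 in, pitch diameter 1.537 in, zero contact angle) and the shaft speed of 1797 rpm are assumptions chosen for the example, not values stated in the text.

```python
import math


def fault_frequency_orders(n_balls, ball_diameter, pitch_diameter,
                           contact_angle_deg=0.0):
    """Fault frequencies as multiples of the shaft rotation frequency.

    Standard kinematic formulas, assuming a stationary outer race and
    no slip. Returns orders for BPFO, BPFI, BSF, and FTF.
    """
    ratio = (ball_diameter / pitch_diameter) * math.cos(
        math.radians(contact_angle_deg))
    return {
        "BPFO": 0.5 * n_balls * (1.0 - ratio),   # outer race defect
        "BPFI": 0.5 * n_balls * (1.0 + ratio),   # inner race defect
        "BSF": 0.5 * (pitch_diameter / ball_diameter) * (1.0 - ratio ** 2),
        "FTF": 0.5 * (1.0 - ratio),              # cage frequency
    }


if __name__ == "__main__":
    # Assumed example geometry and speed (not given in the text).
    orders = fault_frequency_orders(9, 0.3126, 1.537)
    shaft_hz = 1797 / 60.0
    for name, order in orders.items():
        print(f"{name}: {order:.4f} x shaft = {order * shaft_hz:.2f} Hz")
```

Expressing the results as multiples of the shaft frequency keeps them independent of operating speed, so the same orders apply when the motor speed changes with load.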
This section addresses the first key issue: how the number of training samples affects model performance. To this end, we designed a comparative experiment with 4, 8, and 12 training samples per fault category. Under the meta-learning framework, the normal-state, inner race fault, and rolling element fault data form the meta-training set, while the outer race fault data form the meta-testing set, which evaluates the generalization ability of the model on unseen categories. Two typical few-shot learning scenarios are evaluated: three-way one-shot (the support set of each task contains three categories with one sample each) and three-way five-shot (the support set of each task contains three categories with five samples each). The parameter configuration is as follows: β is set to 0.1, the learning rate of the feature extraction network is 0.01, the learning rate of the similarity measurement module is 0.005, and the optimizer is Adam.
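The construction of one N-way K-shot task can be sketched as below. This is a generic episode sampler, not the authors' implementation: the function name, the class labels in the example, and the number of query samples per class (`n_query`) are assumptions, since the text only specifies the support set.

```python
import random


def sample_episode(data_by_class, n_way, k_shot, n_query, rng):
    """Draw one N-way K-shot episode from a {class_name: [samples]} pool.

    Returns (support, query) lists of (sample, episode_label) pairs, with
    no sample shared between the two sets.
    """
    classes = rng.sample(sorted(data_by_class), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        picked = rng.sample(data_by_class[cls], k_shot + n_query)
        support += [(x, label) for x in picked[:k_shot]]
        query += [(x, label) for x in picked[k_shot:]]
    return support, query


if __name__ == "__main__":
    # Hypothetical pool: three classes with 12 sample identifiers each.
    pool = {
        f"class_{c}": [f"class_{c}_sample_{i}" for i in range(12)]
        for c in ("a", "b", "c")
    }
    rng = random.Random(0)
    support, query = sample_episode(pool, n_way=3, k_shot=5, n_query=3, rng=rng)
    print(len(support), "support samples;", len(query), "query samples")
```

Class labels are re-indexed from 0 within each episode, so the model must rely on the support set rather than on memorized global labels.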
To comprehensively evaluate the proposed method, we designed a systematic comparative experiment. The few-shot bearing diagnosis framework based on the Siamese network (SN) [26] serves as the benchmark model. In addition, the classic meta-learning methods Relation Network (RN) [20] and Model-Agnostic Meta-Learning (MAML) [27], together with the Time-Frequency Convolutional Feature Pyramid Network (TFCFPN) [28] and Dual-Channel Feature Fusion Meta-Learning (DCFFML) [29], were introduced as comparison methods. Each method was tested with five different random seeds, and the results are reported as mean ± standard deviation.
To ensure the statistical reliability of the results, for each training set size (i.e., 4, 8, or 12 samples per class) and each few-shot setting (one-shot and five-shot), we independently repeated the complete algorithm training and testing process five times. This approach effectively mitigates the influence of randomness in data sampling and model initialization. The mean diagnostic accuracies and their standard deviations (SDs) for each method under different settings are summarized in
Table 3, with an intuitive comparison presented in the bar chart of
Figure 5.
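The mean ± SD aggregation over repeated runs can be sketched as follows. The accuracy values in the example are hypothetical placeholders, not results from Table 3, and the use of the sample standard deviation (n − 1 denominator) is an assumption, since the text does not state which estimator was used.

```python
from statistics import mean, stdev


def summarize_runs(accuracies):
    """Return (mean, sample SD, formatted string) for repeated-run accuracies."""
    m = mean(accuracies)
    s = stdev(accuracies)  # sample SD with n - 1 denominator (assumed)
    return m, s, f"{m:.2f} ± {s:.2f}"


if __name__ == "__main__":
    # Hypothetical accuracies (%) from five independent runs.
    runs = [95.0, 96.0, 97.0, 96.0, 96.0]
    print(summarize_runs(runs)[2])
```

With only five runs, the sample and population standard deviations differ noticeably, so the chosen estimator should be stated alongside the results.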
The analysis of the experimental results (
Table 3 and
Figure 5) leads to the following key observations. In the three-way few-shot learning scenario, the Siamese network benchmark performs the weakest, primarily due to its reliance on a fixed Euclidean distance metric, which lacks the adaptability to capture complex, multi-modal bearing fault features. The optimization-based MAML method shows better performance but suffers from significant computational overhead due to its bi-level optimization.
More importantly, the comparison with recent advanced methods further highlights the advantage of the proposed method. RN achieves competitive results, yet its performance remains consistently lower than ours across all settings, indicating the advantage of our learnable similarity metric over traditional distance-based meta-learning. TFCFPN and DCFFML, which incorporate feature fusion or customized meta-learning strategies, improve on the Siamese network and MAML. However, our method outperforms all of them, achieving the highest mean accuracy and the lowest standard deviation in nearly all cases. This stable advantage indicates that the proposed framework, which combines multi-layer feature fusion with an adaptive similarity measurer, not only achieves state-of-the-art diagnostic accuracy under extremely limited data conditions but is also more robust to training randomness. The learnable parametric similarity measurer is crucial: it dynamically optimizes the similarity criterion, alleviating the constraints imposed by fixed-metric functions and enabling the feature extractor to mine more discriminative fault characteristics.
The metric-based meta-learning method proposed in this paper therefore shows clear advantages. The learned metric dynamically optimizes the similarity criterion among samples, which alleviates the inherent constraints that fixed-metric functions (such as Euclidean distance) place on the learning ability of the feature extractor and enables it to mine and represent the discriminative features of complex fault data more effectively.
4.2. Experiment 2: Cross-Load Domain Generalization Capability Verification
To verify the generalization capability of the proposed model under load changes, the CWRU inner race fault data labeled 0.1778A are used as the training and validation sets, and the inner race fault data labeled 0.3556B and 0.5334C are used as the test sets (the two tasks are denoted A-B and A-C). The data used for training and testing therefore come from different loads. The proposed method was compared with RN, CNN, MobileNet-V1, InceptionResNet, and DenseNet. The bearing fault diagnosis accuracy under variable working conditions is shown in Figure 6.
The convergence curves in Figure 6 indicate that the proposed method converges considerably faster than the comparison models. Specifically, it reaches the highest accuracy in this experiment within only 50 iterations, demonstrating high training efficiency.
The convergence behavior of the other models is analyzed as follows.
MobileNet-V1 model: Because its design centers on a lightweight model (parameter simplification), its accuracy improves slowly under cross-load conditions and ultimately fails to reach the highest accuracy achieved by the proposed model. This indicates that the pursuit of a lightweight design may sacrifice feature extraction ability and convergence potential in complex cross-domain scenarios.
DenseNet model: Thanks to its dense connection mechanism, the convergence curve of this model is generally slightly higher than that of the basic CNN model. This result is consistent with the view that dense connections promote feature reuse and information flow between network layers, thereby enhancing the ability to identify fault features.
InceptionResNet model: This model integrates multi-scale feature extraction (Inception module) with gradient propagation optimization (residual structure). Its accuracy rises faster than that of DenseNet in the initial stage. However, its convergence process is considerably more volatile and less stable, suggesting potential difficulties in the model structure or the optimization process.
Based on the above results, it can be concluded that the proposed method not only converges markedly faster and quickly reaches high diagnostic accuracy but, more importantly, achieves stable and efficient high-accuracy fault diagnosis on bearing vibration data collected under different load conditions (i.e., cross-domain conditions). This demonstrates the strong generalization performance of the method and its ability to adapt to the complex and changeable working environments found in actual industrial scenarios.
4.3. Experiment 3: Robustness Test of Anti-Noise Performance
The vibration signals of rolling bearings in industrial systems are complex and strongly disturbed by environmental noise. The signal-to-noise ratio (SNR) is used to quantify the relative intensity of signal and noise. In this experiment, Additive White Gaussian Noise (AWGN) with SNRs ranging from −4 dB to 12 dB is added to the signals in the dataset to form noisy composite signals. Figure 7 shows the construction of a composite signal with an SNR of −4 dB, in which the periodic impacts of the original signal are significantly masked, which is unfavorable for subsequent fault diagnosis.
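The noise injection step can be sketched as follows, using the usual definition SNR (dB) = 10 log10(P_signal / P_noise). This is a minimal illustration in plain Python; the function names and the synthetic test signal are assumptions and do not come from the paper.

```python
import math
import random


def signal_power(x):
    """Mean squared amplitude of a sequence."""
    return sum(v * v for v in x) / len(x)


def add_awgn(signal, snr_db, rng):
    """Add white Gaussian noise so that the result has the target SNR in dB."""
    noise_power = signal_power(signal) / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [v + rng.gauss(0.0, sigma) for v in signal]


def measured_snr_db(clean, noisy):
    """Empirical SNR of a noisy signal relative to its clean version."""
    noise = [n - c for c, n in zip(clean, noisy)]
    return 10 * math.log10(signal_power(clean) / signal_power(noise))


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic stand-in for a vibration segment.
    clean = [math.sin(2 * math.pi * 0.01 * i) for i in range(20000)]
    for snr in (-4, 0, 4, 8, 12):
        noisy = add_awgn(clean, snr, rng)
        print(snr, round(measured_snr_db(clean, noisy), 2))
```

At −4 dB the noise power is roughly 2.5 times the signal power, which is why the periodic impacts are no longer visible in the time domain.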
Figure 8 shows how the diagnostic performance of the different models varies with the SNR in a noisy environment. The following key conclusions can be drawn.
MobileNet performs poorly under strong noise: When SNR = −4 dB (very strong noise), the diagnostic accuracy of the MobileNet model is relatively low. This mainly stems from its lightweight design: reducing the number of parameters and the computational load limits the feature extraction ability of the model, making it difficult to learn the weak fault features submerged in high-intensity noise.
The anti-noise ability of the CNN model is limited: The basic CNN model is a relatively shallow network and does not integrate any targeted noise suppression mechanism. Its accuracy therefore improves slowly on noisy input signals, indicating insufficient learning and generalization efficiency in a noisy environment.
The proposed model demonstrates outstanding anti-noise performance: Across the wide SNR range from −4 dB to 12 dB, the proposed model stably outperforms the other four comparison models. This result indicates that the model has a strong ability to extract fault features and can identify key fault information against a complex noise background.
The proposed model is also highly robust. An important advantage is that it retains its anti-interference ability without any specialized denoising preprocessing, which indicates that the architecture itself is inherently robust and insensitive to noise disturbances in the input signal. The overall trend in Figure 8 shows that the diagnostic accuracy of all five models gradually improves as the SNR increases, which is consistent with expectations. Notably, when the SNR exceeds 6 dB, the accuracy of each model tends to stabilize. This indicates that at medium and higher SNRs the room for improving diagnostic accuracy is relatively limited. In this regime, the performance bottleneck lies mainly in the ability of a model to extract and utilize effective fault features rather than in its noise suppression ability. The sustained lead of the proposed model in this regime further confirms the effectiveness of its core feature extraction module.
4.4. Experiment 4: Generalization Exploration Across Damage Mechanisms
To more comprehensively evaluate the generalization ability of the proposed method in actual industrial scenarios, this study further applies it to the Paderborn bearing dataset. This choice is crucial because, unlike the CWRU dataset which contains only artificially seeded defects (EDM, drilling), the Paderborn dataset provides samples of naturally evolved faults from accelerated life tests. These natural faults (e.g., fatigue pitting, plastic indentation) are the direct result of the degradation of contacting surfaces, a process intrinsically linked to the effectiveness and condition of the lubricant.
The Paderborn University (PU) bearing dataset is specifically designed for research on bearing condition monitoring and fault diagnosis. Its core contains 32 type 6203 bearings, with the specific composition as follows: artificially defective bearings: 12 (manufactured through specific process simulation), naturally faulty bearings: 14 (obtained through accelerated life test operation until failure), healthy (normal) bearings: 6. This dataset standardizes and classifies the severity of faults (based on damage length): Level 1: damage length < 2 mm, Level 2: damage length ≥ 2 mm.
For the artificially defective bearings, three different processing modes were adopted to simulate different types of damage. The naturally damaged bearings, on the other hand, were generated in strictly controlled run-to-failure tests, which are closer to the failure process in actual service. All bearings were installed on a unified modular test bench for standardized testing. During the experiment, multi-source sensing signals, including motor current signals and vibration signals, were collected simultaneously.
In this case study, we focus on representative samples under a specific working condition (N09_M07_F10): one healthy bearing, eight artificially defective bearings covering different processing modes and severities, and four naturally faulty bearings. In total, 13 bearings were used to construct the few-shot learning task. The detailed configuration of the selected bearings is summarized in Table 4. The schematic diagram of the experimental setup used for collecting the Paderborn dataset is shown in Figure 9.
The dataset composition and selection criteria (
Table 4) are designed to explicitly test the hypothesis: Can a model trained on artificial faults (clean, geometrically precise damage) generalize to natural faults (rough, irregular, tribologically evolved damage)?
To further verify the generalization ability of the proposed method in the key scenario of transferring from “artificially simulated defects” to “real natural degradation”, a rigorous comparative experiment was designed. We selected several advanced few-shot learning (FSL) and transfer learning algorithms as benchmarks, namely Direct Training Network (DTN) [30], Feature Transfer Network (FTN) [30], and Model-Agnostic Meta-Learning (MAML). To ensure a fair comparison and reliable conclusions, all methods follow the same unified settings. Dataset: all methods use the Paderborn University dataset. Data preprocessing: all methods apply the Fast Fourier Transform (FFT) to convert the original vibration signal to the frequency domain as the model input. Composition of the meta-training set: it contains only the data of one healthy bearing and eight artificially damaged bearings. The core of this design is to train the model using only artificial fault data and to test its performance on unseen natural faults. Composition of the meta-test set: evaluation is conducted using data from four naturally deteriorated bearings. Evaluation task: the four-way K-shot classification task is performed on the meta-test set, where K = 1, 5, 10. Each task therefore requires four states to be identified simultaneously, with the support set providing K samples for each category.
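The frequency-domain preprocessing step can be illustrated with the sketch below, which computes a one-sided magnitude spectrum of one segment. It is an assumption-laden example rather than the authors' pipeline: the pure-Python radix-2 FFT stands in for a library routine, and the segment length of 1024, the 1/N normalization, and the choice to keep only the first half of the spectrum are all illustrative.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 FFT; the input length must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a positive power of two")
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def magnitude_spectrum(segment):
    """One-sided magnitude spectrum, normalized by the segment length."""
    n = len(segment)
    return [abs(c) / n for c in fft(segment)[: n // 2]]


if __name__ == "__main__":
    # Synthetic segment: a unit sine completing 50 cycles in 1024 points.
    segment = [math.sin(2 * math.pi * 50 * i / 1024) for i in range(1024)]
    spectrum = magnitude_spectrum(segment)
    peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
    print("input length:", len(segment), "spectrum length:", len(spectrum))
    print("peak bin:", peak_bin)
```

For a real-valued signal the second half of the spectrum mirrors the first, so keeping only the first N/2 bins halves the input size without losing magnitude information.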
The fault classification accuracy obtained by each method on the Paderborn dataset is recorded in detail in
Table 5, with an intuitive performance comparison shown in
Figure 10. Experimental reproducibility and statistical robustness: For the benchmark methods Direct Training Net and Feature Transfer Net, the results were obtained by running the open-source code provided by the authors. To reduce the effect of the inherent randomness in data sampling and task construction, all accuracy values reported in Table 5 and Figure 10 are the means of 10 independent tests, accompanied by their standard deviations (SDs).
As shown in Table 5, our proposed method achieves the highest mean accuracy across all few-shot settings (one-shot, five-shot, and ten-shot), while also yielding the smallest standard deviations in most cases. In the challenging one-shot scenario, our method attains 92.89% ± 1.32%, outperforming MAML (89.16% ± 1.43%), Feature Transfer Net (87.16% ± 1.20%), and Direct Training Net (78.10% ± 1.33%). This advantage is maintained as the shot count increases. For the five-shot task, our method achieves 98.04% ± 0.63%, compared to 94.51% ± 0.83% for MAML and 92.25% ± 1.08% for Feature Transfer Net. In the ten-shot setting, our method reaches near-perfect accuracy (99.60% ± 0.13%), demonstrating strong robustness and generalization stability when transferring from artificial to natural fault mechanisms. The low standard deviations of our method in the five-shot and ten-shot tasks further support its reliability and insensitivity to random sampling variations.
To further examine the representation quality of the different algorithms in the feature space, we adopt t-SNE (t-distributed Stochastic Neighbor Embedding) [31]. The deep features learned by the four compared algorithms in the five-shot learning scenario were reduced in dimensionality and visualized (the results are shown in Figure 11). The following conclusions can be drawn from the visualization:
Poor feature separability of the Direct Training Net: As shown in
Figure 11, the feature distribution of this method presents significant overlap and confusion. This intuitively reflects the serious insufficiency of its generalization ability—the model, trained only on artificially simulated fault data, lacks sufficient discriminability when tested on unseen naturally degraded fault data, resulting in blurred boundaries between different categories.
Limited improvement in the performance of the Feature Transfer Net: Compared with the directly trained network, the feature distribution of the Feature Transfer Net (
Figure 11) shows a certain degree of separation trend. However, the distances between feature clusters of different categories are still relatively close, with obvious local overlapping areas. This indicates that its feature transfer process fails to fully adapt to the characteristics of the target domain (natural faults), and the improvement of the model’s generalization performance is relatively limited.
The MAML algorithm demonstrates the advantages of meta-learning: Thanks to its meta-learning framework, the feature distribution learned by the MAML algorithm (
Figure 11) shows a better clustering structure. The specific manifestations are as follows: the compactness of intra-class features has improved, while the separation degree of inter-class features has been enhanced. This verifies the effectiveness of meta-learning in enhancing the model’s ability to quickly adapt to new tasks, such as identifying natural faults.
The method proposed in this paper achieves the best feature representation: The most striking result is that of the proposed method (Figure 11). Its feature distribution is highly compact within each class, with sample points of the same category forming tight and clear clusters, and well separated between classes, with the clusters of different classes lying far apart, showing clear boundaries and almost no overlap. This contrast indicates that the proposed method learns highly discriminative and robust feature representations. By enforcing multi-layer feature constraints and learning an adaptive metric, the model looks past the superficial textural differences between EDM cuts and fatigue spalls and focuses instead on the underlying, physically meaningful vibration patterns (e.g., impulse periodicity corresponding to BPFO/BPFI) that are common to both damage mechanisms. This combination of high intra-class similarity and large inter-class difference in the feature space is the intrinsic reason for its strong diagnostic performance, particularly in the challenging scenario of generalizing from artificial to natural faults.
4.5. Sensitivity Analysis of the Balancing Parameter β
To thoroughly investigate the influence of the multi-level loss balancing parameter β (introduced in Equation (15)) on the diagnostic performance of the proposed method, a dedicated sensitivity analysis is conducted. The parameter β controls the contribution of the high-level semantic features (Layer L) relative to the shallower, detail-rich features (Layer L−1) in the total loss function. Understanding its impact is crucial for optimal model configuration and for demonstrating the robustness of the multi-layer feature fusion strategy.
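The role of β can be illustrated with a small sketch. Equation (15) is not reproduced in this section, so the additive form used here, in which the Layer L−1 loss is kept at unit weight and the Layer L loss is scaled by β, is only an assumption consistent with the description above. The function names and the loss values are hypothetical.

```python
def total_loss(loss_shallow, loss_deep, beta=0.1):
    """Assumed combination: Layer L-1 loss plus beta times Layer L loss."""
    return loss_shallow + beta * loss_deep


def deep_share(loss_shallow, loss_deep, beta):
    """Fraction of the total loss contributed by the deep (Layer L) term."""
    return beta * loss_deep / total_loss(loss_shallow, loss_deep, beta)


if __name__ == "__main__":
    # Hypothetical per-layer loss values, chosen only for illustration.
    shallow, deep = 1.0, 1.0
    for beta in (0.01, 0.1, 0.5, 1.0):
        share = deep_share(shallow, deep, beta)
        print(f"beta={beta}: deep-layer share = {share:.3f}")
```

Under this assumed form, a very small β leaves the deep-layer term with almost no influence on the gradients, while β = 1.0 weights both layers equally.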
This experiment utilizes the CWRU dataset under the three-way five-shot task with eight training samples per class. The value of β is varied across a representative range: {0.01, 0.05, 0.1, 0.15, 0.2, 0.3, 0.5, 0.8, 1.0}. The default learning rates and optimizer settings remain unchanged from Section 4.1.
As shown in
Table 6, the diagnostic accuracy exhibits a clear trend with respect to β. When β is set to a very small value (β = 0.01), the contribution of the deep layer loss (Layer L) is heavily suppressed. In this case, the model relies predominantly on the shallower features (Layer L−1), which, while rich in local textures and edge details, lack the necessary semantic abstraction for robust fault classification. Consequently, the diagnostic accuracy is relatively low (87.43%).
As β increases from 0.01 to 0.1, the accuracy improves significantly, peaking at 96.33% when β = 0.1. This indicates that incorporating an appropriate contribution from high-level semantic features (Layer L) effectively complements the shallow details, leading to a more discriminative and robust feature representation. The optimal balance is achieved when the deep semantic features are given moderate weight, allowing the model to leverage both local and global information synergistically.
When β continues to increase beyond 0.1 (from 0.15 to 1.0), a gradual decline in accuracy is observed, dropping to 84.45% at β = 1.0. This suggests that over-emphasizing the deep layer loss can be detrimental. An excessively large β forces the model to prioritize high-level semantic abstraction while potentially neglecting the fine-grained, discriminative details present in the shallower layers.