Few-Shot Bearing Fault Diagnosis Based on Multi-Layer Feature Fusion and Similarity Measurement

Deng, Changyong; Dong, Dawei; Wang, Sipeng; Zhang, Hongsheng; Feng, Li

doi:10.3390/lubricants14040172


by Changyong Deng ¹,*, Dawei Dong ¹, Sipeng Wang ², Hongsheng Zhang ³ and Li Feng ⁴

¹ School of Mechanical Engineering, Southwest Jiaotong University, Chengdu 610031, China
² School of Civil Engineering, Chongqing Jiaotong University, Chongqing 400074, China
³ School of Mechatronics and Vehicle Engineering, Chongqing Jiaotong University, Chongqing 400074, China
⁴ School of Artificial Intelligence and Electronics, Chongqing Telecom Vocational College, Chongqing 400900, China
* Author to whom correspondence should be addressed.

Lubricants 2026, 14(4), 172; https://doi.org/10.3390/lubricants14040172

Submission received: 27 February 2026 / Revised: 16 April 2026 / Accepted: 16 April 2026 / Published: 17 April 2026

(This article belongs to the Special Issue Advances in Wear Life Prediction of Bearings)


Abstract

The running reliability of rolling bearings depends on the effective lubrication state, and poor lubrication will induce abnormal vibration. Therefore, vibration-based fault diagnosis is an important means to evaluate the health of bearings through vibration characteristics. However, the lack of fault samples in actual working conditions seriously restricts the generalization ability and accuracy of an intelligent diagnosis model. A novel few-shot diagnosis method integrating multi-layer feature fusion and adaptive similarity measurement is proposed. This method adopts a meta-learning framework to simulate sample scarcity through numerous N-way K-shot diagnostic tasks. An efficient feature extractor with a cross-task feature stitching mechanism is designed to fuse features from support and query sets. To overcome the limitation of fixed-distance metrics in existing meta-learners, a learnable similarity scheduler adaptively generates optimal pseudo-distance functions. In particular, a multi-layer feature fusion strategy is introduced to compute adaptive similarities at multiple network depths, which significantly enhances feature robustness against operational variations. Experimental results demonstrate the method achieves stable diagnostic accuracy above 90% under extremely few-shot conditions and maintains over 90% accuracy when transferring from laboratory-simulated faults to natural operational faults, validating its strong potential for practical industrial applications where annotated fault data is scarce.

Keywords:

rolling bearing; fault diagnosis; few-shot learning; meta-learning

1. Introduction

Rolling bearings play an indispensable role in rotating machinery, and their fault diagnosis [1,2,3,4,5,6] is vital for ensuring operational continuity and safety. Conditions such as inadequate lubrication, lubricant contamination, or degradation can precipitate or accelerate surface fatigue, pitting, and wear, directly altering the vibration signature. Therefore, vibration analysis, the most common diagnostic technique, directly reflects the dynamic forces within a bearing, which are profoundly influenced by the state of the lubricating film [7,8].

However, a fundamental challenge in practical applications is the scarcity of labeled fault data. Collecting sufficient vibration data for every conceivable fault state under varying lubrication regimes is costly, time-consuming, and often poses safety risks [9,10,11,12,13]. This data scarcity severely restricts the deployment of deep learning, which typically requires large, balanced datasets. Therefore, there is an urgent need for algorithms that can efficiently learn from limited data to produce a bearing fault classifier with strong generalization ability.

Existing strategies to mitigate data scarcity primarily fall into two categories. The first is data augmentation. For instance, Dai et al. [14] utilized autoencoders (AEs) within a generative adversarial framework for anomaly detection. Zhou et al. [15] designed an improved GAN to generate fault samples and alleviate data imbalance. However, while generative models can synthesize data, they suffer from inherent drawbacks such as training instability, mode collapse, and gradient vanishing [16]. More importantly, the generated samples often lack the physical fidelity of real degradation processes (e.g., fatigue spalling versus artificial indentation), and their introduction can mislead the classifier, limiting performance in authentic industrial scenarios.

The second strategy is meta-learning. Zhang et al. [17] applied a matching network (MatchNet) to motor bearing fault diagnosis, using pseudo-labels to reduce distribution differences. Feng et al. [18] developed a domain adversarial similarity network (DASMN) for cross-domain diagnosis and later proposed a squeeze-excitation meta-learning network (SEMN) [19] to refine prototype features. However, most meta-learning approaches, including optimization-based ones, suffer from complex training procedures, high computational overhead, and slow convergence. Furthermore, nearly all existing metric-based meta-learning methods for fault diagnosis rely on predefined, fixed distance functions. This reliance is a fundamental limitation, because no single metric is universally optimal for the complex, non-linear feature spaces of bearing vibrations, especially when features are corrupted by noise or by changes in operating conditions related to lubrication.

In summary, for few-shot bearing fault diagnosis, there are the following core limitations:

•: Insufficient physical fidelity of data augmentation. Although augmentation methods based on generative adversarial networks (GANs) can generate new samples, these samples deviate from the characteristics of real physical damage (such as fatigue spalling) and lack the fidelity of the physical process. The resulting artifacts can easily mislead the classifier and limit model performance in real industrial scenarios.
•: Reliance of metric-based meta-learning on a fixed similarity function. Most existing metric-based meta-learning methods still rely on predefined, fixed distance functions (such as Euclidean distance) when dealing with complex fault features. Such a preset metric lacks generality, and it is difficult to find the optimal similarity criterion in the complex bearing vibration feature space, especially in the presence of noise and changing working conditions.
•: Single-layer feature utilization. When calculating sample similarity, existing metric learning methods usually rely only on the high-level semantic features of the last layer of the feature extractor and ignore the discriminative details contained in the shallow layers (such as local texture and edge features), which are very important for distinguishing subtle fault modes.

To address the aforementioned limitations, this work introduces a novel few-shot diagnostic framework inspired by the relational reasoning paradigm pioneered by the Relation Network (RN) [20]. While the foundational Relation Network learns a non-linear similarity metric via feature concatenation, its direct application to bearing vibration signals yields suboptimal performance because it relies on single-depth semantic features and is agnostic to the physical characteristics of tribological excitation. To bridge this gap, we make the following distinct contributions:

•: Multi-layer Relational Fusion for Vibration Physics: Unlike the standard Relation Network that computes relations solely on final high-level embeddings, we introduce a parallel multi-layer fusion mechanism. This is specifically designed to preserve the fine-grained local impulsive textures (shallow layers) while leveraging global fault semantics (deep layers), which is crucial for distinguishing subtle differences in defect sizes under strong lubrication film noise.
•: Task-Adaptive Metric Refinement: We enhance the relational module by integrating it within a meta-learning episodic framework and applying a weighted hierarchical loss. This forces the relational scorer to adapt its pseudo-distance function not just to the class, but to the specific depth of feature abstraction, thereby significantly improving robustness under cross-load and cross-damage mechanism transfers.
•: Domain-Specific Architecture Validation: We provide extensive experimental validation on the transition from artificial defects (EDM) to natural tribological failures (fatigue pitting), demonstrating that the proposed multi-layer relational architecture captures the underlying physical vibration patterns more effectively than the standard single-layer Relation Network or fixed-metric meta-learners.

Two bearing datasets, one from Case Western Reserve University (CWRU) and one from Paderborn University (PU), are used to verify the effectiveness of the proposed method. The remainder of this paper is organized as follows.

Section 2 presents the theoretical background. Section 3 describes the proposed method in detail. Section 4 reports the experiments and analyses. Section 5 concludes the paper and discusses limitations and directions for further research.

2. Theoretical Backgrounds

2.1. Meta-Learning for Few-Shot Diagnosis

To address data scarcity, this work adopts the meta-learning paradigm, also known as “learning to learn.” Unlike conventional supervised learning, which optimizes a model for a fixed dataset, meta-learning optimizes the model’s ability to adapt rapidly to new tasks with limited samples. This is achieved through episodic training, where each training episode simulates a few-shot scenario.

Specifically, an episode comprises a support set S and a query set Q, sampled from a source domain dataset. In an N-way K-shot task, S contains N classes with K samples per class, while Q contains additional samples from the same N classes. The model is trained across numerous episodes to minimize the generalization error on Q after observing S. During meta-testing, the trained model is evaluated on entirely unseen classes from a target domain, thereby assessing its cross-task generalization capability. This episodic training strategy explicitly exposes the model to sample scarcity, fostering robust feature representations that transfer effectively to novel fault classes [21].
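The episode construction described above can be illustrated with a minimal sketch. The function name `sample_episode`, the toy class names, and the placeholder samples are assumptions made for illustration only; they are not taken from the paper.

```python
import random

def sample_episode(dataset, n_way, k_shot, n_query, rng):
    """Draw one N-way K-shot episode from a {label: [samples]} mapping."""
    classes = rng.sample(sorted(dataset), n_way)      # pick N classes
    support, query = [], []
    for label in classes:
        # draw K support samples plus n_query disjoint query samples
        picked = rng.sample(dataset[label], k_shot + n_query)
        support += [(x, label) for x in picked[:k_shot]]
        query += [(x, label) for x in picked[k_shot:]]
    return support, query

# Toy data: four hypothetical fault classes with ten placeholder samples each.
toy = {c: [f"{c}_{i}" for i in range(10)]
       for c in ["normal", "inner", "outer", "ball"]}
support, query = sample_episode(toy, n_way=3, k_shot=1, n_query=2,
                                rng=random.Random(0))
```

Sampling support and query samples from one draw per class keeps the two sets disjoint, so the query loss measures generalization rather than memorization of the support samples.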

2.2. Metric-Based Meta-Learning and Relation Networks

Metric-based meta-learning is widely applied in few-shot classification because of its simple model structure, high computational efficiency, and low memory footprint. Its core classification mechanism is as follows: the distance or similarity between the embedding of a query sample and the embeddings of the support set samples is calculated, and the category of the query sample is inferred by the nearest-neighbor principle, as shown in Figure 1.

The representative feature (prototype) $P_{C}$ for class $C$ is obtained by averaging the embedded feature vectors of that class in the support set, as shown below:

$$P_{C} = \frac{1}{n_{S}} \sum_{(x_{i}, y_{i}) \in S,\ y_{i} = C} f_{M}(x_{i}) \tag{1}$$

where $f_{M}(x_{i})$ is the abstract feature of sample $x_{i}$ embedded by the meta-model, $C$ is the class label, and $n_{S}$ is the number of support samples being averaged (those belonging to class $C$).

The similarity between the support set features and the query set features is then calculated with a chosen distance criterion. The predicted probability that a query sample $x_{j}$ belongs to class $k$ can be expressed as follows:

$$p(y = k \mid x_{j} \in Q) = \frac{\exp[-F_{M}(P_{k}, f_{M}(x_{j}))]}{\sum_{k' = 1}^{N} \exp[-F_{M}(P_{k'}, f_{M}(x_{j}))]} \tag{2}$$

where $F_{M}$ represents the distance function, $k$ is the class label, and the sum in the denominator runs over the $N$ classes of the episode. Once the label prediction for the query sample is obtained, the loss is defined as the cross-entropy between the true labels and the predicted labels of the query samples, and the model parameters are trained through backpropagation.
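A minimal sketch of Equations (1) and (2) follows. It assumes a squared Euclidean distance for the distance function and uses hand-written two-dimensional embeddings; both choices are illustrative assumptions, not the paper's configuration.

```python
import math

def prototypes(support):
    """Average the embeddings of each class, as in Equation (1)."""
    sums, counts = {}, {}
    for emb, label in support:
        acc = sums.setdefault(label, [0.0] * len(emb))
        for d, v in enumerate(emb):
            acc[d] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [v / counts[c] for v in acc] for c, acc in sums.items()}

def sq_euclidean(a, b):
    """Assumed fixed metric; any distance function could be passed instead."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def class_probabilities(protos, query_emb, dist=sq_euclidean):
    """Softmax over negative distances, as in Equation (2)."""
    logits = {c: -dist(p, query_emb) for c, p in protos.items()}
    top = max(logits.values())                 # subtract max for stability
    exps = {c: math.exp(v - top) for c, v in logits.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}

# Hypothetical embeddings for a 2-way 2-shot episode.
support = [([0.0, 0.0], "inner"), ([0.2, 0.0], "inner"),
           ([1.0, 1.0], "outer"), ([1.2, 1.0], "outer")]
protos = prototypes(support)
probs = class_probabilities(protos, [0.1, 0.1])
```

The query embedding lies closer to the "inner" prototype, so that class receives the larger probability. Swapping `dist` for another function changes the decision geometry, which is the sensitivity that a learnable measurer is meant to remove.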

A significant advancement in metric-based meta-learning is the Relation Network [20], which replaces fixed-distance functions with a learnable deep neural network (Relation Module) that takes concatenated support-query features and outputs a relation score. This architecture established the foundational paradigm of using a secondary network to learn a non-linear similarity metric. However, the standard Relation Network is designed for image classification and operates exclusively on the final layer feature maps. For bearing fault diagnosis under variable lubrication states, this single-depth approach often fails to capture the multi-scale nature of vibration signatures, where high-frequency transient edges and low-frequency modulation sidebands coexist. Our proposed method builds upon this relational reasoning backbone but introduces multi-layer feature constraints to explicitly preserve the physical discriminability required for few-shot tribological fault detection.

2.3. Bearing Vibration Excitation Sources and Fault Feature Frequencies

Vibration in rolling element bearings arises from several distinct sources, even in a healthy state. These include varying compliance due to the cyclic passage of rolling elements, geometric imperfections, and, most significantly, localized defects. The dynamic response of the rotor-bearing system to these excitations is complex and non-linear. To provide a physically meaningful basis for the signal processing and diagnostic tasks in this study, it is essential to define the characteristic frequencies at which these defects excite the system.

The fundamental sources of bearing-induced vibration, including the effects of manufacturing errors and localized defects, have been rigorously formulated [22,23]. For a bearing with a fixed outer ring and a rotating inner ring, the key kinematic relationships are:

Fundamental Train Frequency (FTF): The rotational frequency of the cage.

$$\mathrm{FTF} = \frac{f_{r}}{2} \left(1 - \frac{B_{d}}{P_{d}} \cos \varphi\right) \tag{3}$$

where $f_{r}$ is the shaft rotational frequency, $B_{d}$ is the ball diameter, $P_{d}$ is the pitch diameter, and $\varphi$ is the contact angle.

Ball Pass Frequency of Outer Race (BPFO): The frequency at which rolling elements pass a defect on the stationary outer race.

$$\mathrm{BPFO} = n \cdot \mathrm{FTF} = \frac{n f_{r}}{2} \left(1 - \frac{B_{d}}{P_{d}} \cos \varphi\right) \tag{4}$$

where $n$ is the number of rolling elements.

Ball Pass Frequency of Inner Race (BPFI): The frequency at which rolling elements pass a defect on the rotating inner race.

$$\mathrm{BPFI} = n (f_{r} - \mathrm{FTF}) = \frac{n f_{r}}{2} \left(1 + \frac{B_{d}}{P_{d}} \cos \varphi\right) \tag{5}$$

Ball (Rolling Element) Spin Frequency (BSF): The frequency at which a rolling element spins about its own axis. A defect on the ball contacts both races during each spin revolution, generating a characteristic frequency of $2 \times \mathrm{BSF}$.

$$\mathrm{BSF} = \frac{P_{d} f_{r}}{2 B_{d}} \left(1 - \left(\frac{B_{d}}{P_{d}} \cos \varphi\right)^{2}\right) \tag{6}$$
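Equations (3) to (6) can be evaluated directly, as in the sketch below. The geometry and shaft speed in the usage lines are illustrative assumptions and are not values reported in the paper.

```python
import math

def bearing_frequencies(f_r, n, ball_d, pitch_d, contact_angle_rad=0.0):
    """Characteristic frequencies for a fixed outer ring, Equations (3)-(6).

    f_r is the shaft frequency in Hz; ball_d and pitch_d share one length unit.
    """
    ratio = (ball_d / pitch_d) * math.cos(contact_angle_rad)
    ftf = 0.5 * f_r * (1.0 - ratio)                               # Eq. (3)
    bpfo = n * ftf                                                # Eq. (4)
    bpfi = n * (f_r - ftf)                                        # Eq. (5)
    bsf = (pitch_d * f_r) / (2.0 * ball_d) * (1.0 - ratio ** 2)   # Eq. (6)
    return {"FTF": ftf, "BPFO": bpfo, "BPFI": bpfi, "BSF": bsf}

# Assumed example: 9 balls, 7.94 mm ball diameter, 39.04 mm pitch diameter,
# zero contact angle, 30 Hz shaft frequency.
freqs = bearing_frequencies(f_r=30.0, n=9, ball_d=7.94, pitch_d=39.04)
```

A quick consistency check follows from Equations (4) and (5): BPFO and BPFI always sum to $n f_{r}$, whatever the geometry.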

3. Proposed Method

This section details the proposed few-shot bearing fault diagnosis method based on multi-layer feature fusion and adaptive similarity measurement. Figure 2 shows the overall architecture of the proposed model. The left half of Figure 2 depicts the overall structure, whose core contains two kinds of sub-networks: a shared feature extractor and two parallel similarity measurers. The right half of Figure 2 shows the internal structure of the similarity measurer. The specific configuration of each sub-network is detailed in Table 1.

3.1. Feature Extractor

To accelerate the convergence speed of gradient descent during the model training process and indirectly improve the final discrimination accuracy of the model, all input data needs to undergo standardization preprocessing before being fed into the neural network. This processing aims to eliminate the dimensional influence and keep the features within a similar numerical range. Its calculation process is as follows:

$$X_{i} = \frac{x_{i} - \tilde{x}}{S} \tag{7}$$

$$\tilde{x} = \frac{1}{n} \sum_{i = 1}^{n} x_{i} \tag{8}$$

$$S = \sqrt{\frac{1}{n - 1} \sum_{i = 1}^{n} (x_{i} - \tilde{x})^{2}} \tag{9}$$

where $X_{i}$ is the standardized value of the $i$-th data point, $n$ is the number of data points in each sample, $x_{i}$ denotes the value of the $i$-th data point, $\tilde{x}$ is the mean value of each sample, and $S$ is the standard deviation of each sample.
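A minimal sketch of the per-sample standardization in Equations (7) to (9) is given below. The short signal is a made-up placeholder; real inputs would be vibration segments.

```python
import statistics

def standardize(sample):
    """Zero-mean, unit-variance scaling of one sample (Equations (7)-(9))."""
    mean = statistics.fmean(sample)     # Eq. (8)
    std = statistics.stdev(sample)      # Eq. (9), uses the n - 1 denominator
    return [(x - mean) / std for x in sample]

signal = [0.12, -0.40, 0.95, 0.03, -0.61, 0.27]   # placeholder values
z = standardize(signal)
```

`statistics.stdev` is the sample standard deviation, which matches the $n - 1$ denominator in Equation (9); `statistics.pstdev` would instead divide by $n$.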

The feature extractor mainly consists of convolutional layers, pooling layers, and activation functions. Its specific structure is shown in Table 1. The convolutional layer operation can be computed as follows:

$$Y_{j}^{l} = \mathrm{ReLU}(X_{j}^{l}) = \mathrm{ReLU}(X_{i}^{l - 1} \otimes w^{l} + b_{j}^{l}) \tag{10}$$

$$\mathrm{ReLU}(X_{j}^{l}) = \max\{0, X_{j}^{l}\} \tag{11}$$

where $Y_{j}^{l}$ represents the output of the convolutional layer, $\otimes$ denotes the convolution operation, $w^{l}$ represents the weight matrix of layer $l$, $b_{j}^{l}$ denotes the bias added to each output, and ReLU is the rectified linear unit activation function.

Additionally, the max pooling operation can be described as follows:

$$Y_{j}^{l'} = \mathrm{Max}(F(Y_{j}^{l}) + b_{j}^{l}) \tag{12}$$

where $b_{j}^{l}$ is the bias term, and $F(\cdot)$ is the sampling function for finding the maximum value.
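The convolution, ReLU, and max pooling operations can be traced on a toy signal with the sketch below. It assumes a single channel, no padding, a hand-picked two-tap kernel, and a pooling step without a bias term; the actual layer sizes are those listed in Table 1.

```python
def conv1d_relu(x, kernel, bias=0.0):
    """Valid 1-D sliding-window product plus bias, then ReLU (Eqs. (10)-(11)).

    As in most deep learning libraries, the kernel is not flipped.
    """
    k = len(kernel)
    out = []
    for start in range(len(x) - k + 1):
        pre = sum(x[start + t] * kernel[t] for t in range(k)) + bias
        out.append(max(0.0, pre))          # ReLU
    return out

def max_pool1d(x, size=2, stride=2):
    """Keep the maximum of each window (bias-free form of Eq. (12))."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, stride)]

# Toy signal with one impulse; the difference kernel responds to rising edges.
signal = [0.0, 0.1, 0.0, 2.0, -1.5, 0.4, 0.0, 0.1, 0.0, 0.0]
feature = conv1d_relu(signal, kernel=[-1.0, 1.0])
pooled = max_pool1d(feature)
```

The impulse at index 3 produces the largest activation, and pooling halves the length while keeping that peak, which illustrates the down-sampling and translation tolerance mentioned for the pooling layers.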

The feature extractor operates on raw or minimally processed time-domain vibration signals. Its convolutional filters learn to detect patterns that correspond to the impulsive responses excited at the characteristic frequencies detailed in Section 2.3. The hierarchical nature of the network allows shallow layers to capture local, high-frequency transient details, while deeper layers aggregate these into global, semantic representations of the fault mode and severity.

The design rationale of the feature extractor architecture (Table 1) is as follows. The first convolutional layer adopts a relatively large kernel size of 10 × 1 to capture the transient impulsive signatures characteristic of bearing localized defects (e.g., the initial impact event whose frequency content spans a wide range). Subsequent convolutional layers use smaller kernels (3 × 1) to progressively extract finer hierarchical patterns while reducing parameter count. Max pooling layers (stride = 2) are inserted after the first two convolutional blocks to down-sample the feature maps, increase the receptive field, and enhance translational invariance. Two consecutive 3 × 1 convolutional layers without intermediate pooling are stacked to deepen the network without overly compressing the temporal resolution, allowing the model to learn more abstract feature interactions. Finally, an average pooling layer with kernel size 25 aggregates the temporal dimension, producing a fixed-length feature vector that summarizes the entire input segment. This combination of large initial kernels, cascaded small kernels, and mixed pooling strategies is specifically tailored to vibration signals, where both high-frequency impact details and low-frequency modulation patterns are crucial for fault discrimination under few-shot conditions.

3.2. Similarity Measurer

In standard few-shot learning frameworks, the comparison between a support prototype and a query is governed by a static distance function. While computationally efficient, these functions impose a rigid geometric structure on the embedding space—specifically, they assume that features are best separated by isotropic spherical or angular margins. For bearing vibration signals under varying lubrication regimes, the feature manifold is often high-dimensional, non-isotropic, and heavily distorted by stochastic noise. In such cases, a fixed-distance metric acts as a bottleneck, restricting the expressiveness of the feature extractor.

To overcome this, we replace the hand-crafted function with a learnable deep neural network. Instead of calculating a generic distance, it ingests the cross-task concatenated features and projects them through a series of non-linear transformations to output a calibrated similarity score. This allows the network to learn that, for instance, certain frequency sidebands are more important than others when matching fault modes, a nuance that Euclidean distance cannot capture.

One of the core innovations of this paper is to abandon the traditional similarity measurement paradigm, which relies on manually preset distance functions (such as Euclidean distance, cosine similarity, or Mahalanobis distance). Although these predefined functions may be effective in specific scenarios, their universality is limited: different distance functions perform significantly differently on the same dataset, and there is no universal criterion for determining in advance the optimal distance metric for a specific dataset. The suboptimality and uncertainty introduced by this manual selection limit the generalization ability and diagnostic accuracy of the model on complex and diverse bearing fault data.

To address these challenges, this paper designs a learnable similarity measurer. The core idea of this module is to use the representation learning ability of deep neural networks to learn a pseudo-distance metric directly from the data. It operates as follows: the similarity measurer receives the features output by the feature extractor, calculates the high-order feature interactions between the support set samples and the query set samples through a deep convolutional network, and outputs a normalized similarity score. This score directly determines the category of the query sample (the higher the score, the greater the category similarity).

The similarity measurer is mainly composed of stacked convolutional layers, pooling layers, and fully connected layers (for the specific architecture, see Table 1). The mathematical models of the convolutional layer and the pooling layer have been defined in Equations (10)–(12). At the end of the network, the final fully connected layer outputs a scalar value, which is passed through the Sigmoid activation function (σ(·)) so that the similarity score is normalized to the interval [0, 1], thereby quantifying the similarity of a sample pair (1 indicates high similarity, 0 indicates high dissimilarity).

To explicitly associate the features of the support set samples with those of the query set samples, and to provide paired context information for the similarity measurer, the input of the similarity measurer is not a single feature vector but the concatenation of the two feature vectors. The operation is defined as follows:

$$Z = f_{\phi}(X_{S}) \, \Theta \, f_{\phi}(X_{Q}) \tag{13}$$

where $Z$ is the concatenated feature vector, $f_{\phi}(X_{S})$ represents the extracted features of the support set samples, $f_{\phi}(X_{Q})$ represents the extracted features of the query set samples, and $\Theta$ denotes the concatenation operation.
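The pairing in Equation (13) and the Sigmoid-bounded output can be sketched as follows. The single linear layer stands in for the convolutional measurer of Table 1, and the feature values and weights are hypothetical placeholders rather than trained parameters.

```python
import math

def concatenate(support_feat, query_feat):
    """Join a support feature and a query feature, as in Equation (13)."""
    return list(support_feat) + list(query_feat)

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def toy_similarity(z, weights, bias=0.0):
    """Stand-in for the learnable measurer: one linear layer plus Sigmoid."""
    return sigmoid(sum(w * v for w, v in zip(weights, z)) + bias)

f_support = [0.8, 0.1, 0.3]                   # hypothetical extractor outputs
f_query = [0.7, 0.2, 0.3]
z = concatenate(f_support, f_query)
weights = [0.5, -0.2, 0.1, 0.5, -0.2, 0.1]    # hypothetical, not trained
score = toy_similarity(z, weights)
```

Because the scorer sees both feature vectors at once, its weights can emphasize some feature dimensions over others when comparing a pair, which a fixed distance function cannot do.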

3.3. Multi-Layer Feature Fusion

Existing metric-based methods usually rely only on the final layer of the feature extractor (high-level semantic features) to calculate the similarity between samples. To address this common limitation, this paper proposes a multi-level feature fusion strategy. Relying on a single high-level feature tends to ignore the discriminative detail captured by shallow convolutional layers (such as local textures and edge features), and this information is crucial for accurately measuring the similarity between samples, especially for differentiating subtle failure modes when only a few samples are available.

To make fuller use of the information contained in the limited samples, especially the fine-grained cues in low-level convolutional features, and to improve the representation robustness and anti-interference ability of the model, this paper introduces a hierarchical feature joint constraint mechanism. Its core idea is to use not only high-level semantic features but also the complementary information contained in middle-level and shallow features. Because multi-level features jointly participate in similarity calculation and loss supervision, the network is subject to a more comprehensive and stricter constraint. This cross-level collaboration exploits the intrinsic correlations among features at different levels (high-level semantic abstraction and low-level detail richness), which enhances the robustness of the model to input changes and improves the generalization performance of the overall framework.

In terms of specific implementation, the model in this paper does not fuse all levels of the entire feature extractor, but focuses on the last two layers that are rich in information and have strong complementarity (such as Layer L−1 and Layer L). For the features extracted from these two layers:

First, similarity calculation and loss generation are carried out: at each level l (l ∈ {L−1, L}), the similarity score of each support-query sample pair is calculated independently by the corresponding similarity measurer, and the prediction loss of this level is calculated accordingly.

Second, the level loss is calculated: the loss value of each level l is computed with the mean squared error (MSE) loss function:

$$L_{n} = \frac{1}{m} \sum_{i = 1}^{m} (y_{i} - D_{i})^{2} \tag{14}$$

where $m$ is the batch size, $y_{i}$ is the true similarity label of the $i$-th sample pair, and $D_{i}$ is the predicted similarity value of the $i$-th sample pair, that is, the regression prediction result for the current level. The total loss combines the loss values of the last two layers.

Finally, multi-level loss fusion: The final supervision signal of the model is the weighted sum of the loss values of the above two levels:

$$\mathrm{Loss} = L_{n} + \beta L_{n - 1} \tag{15}$$

where $L_{n}$ and $L_{n - 1}$ are the losses of the final layer and the penultimate layer, respectively, and $\beta$ is an adjustable weighting coefficient that balances the relative contribution of the two losses to the overall optimization objective (for example, $\beta = 1.0$ can be set based on experience or on validation performance to weight both levels equally).

During training, the total loss defined by Equation (15) is minimized, and backpropagation simultaneously updates all learnable parameters of the feature extractor (from shallow to deep layers) and of the two similarity measurers attached to layer L−1 and layer L.

This hierarchical joint optimization mechanism forces the network to learn and retain the detailed features that are crucial for similarity discrimination at the shallower (L−1) level, and to learn more semantically abstract discriminative features at the deep (L) level. Meanwhile, the feature representations learned at different levels must remain consistent within the similarity measurement space. Ultimately, this collaboration improves the diagnostic accuracy and robustness of the model in few-shot and noisy environments.
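The two-level objective of Equations (14) and (15) reduces to a few lines, sketched here with made-up similarity scores. The value β = 0.1 matches the default stated in the experimental settings; the score lists are hypothetical.

```python
def mse(labels, scores):
    """Mean squared error between pair labels and scores (Equation (14))."""
    return sum((y - d) ** 2 for y, d in zip(labels, scores)) / len(labels)

def fused_loss(labels, scores_last, scores_penultimate, beta):
    """Loss of layer L plus beta times the loss of layer L-1 (Equation (15))."""
    return mse(labels, scores_last) + beta * mse(labels, scores_penultimate)

labels = [1.0, 0.0, 0.0, 1.0]          # 1 = same class, 0 = different class
scores_L = [0.9, 0.2, 0.1, 0.7]        # hypothetical scores from layer L
scores_Lm1 = [0.6, 0.4, 0.3, 0.5]      # hypothetical scores from layer L-1
loss = fused_loss(labels, scores_L, scores_Lm1, beta=0.1)
```

Here the final-layer loss is 0.0375 and the penultimate-layer loss is 0.165, so the fused loss is 0.0375 + 0.1 × 0.165 = 0.054. A small β keeps the deeper level dominant while still passing a gradient signal through the shallower measurer.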

The choice of the penultimate layer (L−1) and the final layer (L) for fusion is motivated by the hierarchical nature of feature abstraction in convolutional networks. Shallow layers typically capture low-level, localized patterns such as instantaneous impulses and high-frequency noise components. While these signals contain discriminative information, they are also highly sensitive to operating condition variations and measurement noise. Conversely, deep layers synthesize global, semantically rich representations that are more robust to domain shifts but may lack the fine-grained details necessary to differentiate subtle fault severities.

The penultimate layer represents a transition point where mid-level features—aggregated local patterns with partial semantic abstraction—are preserved. By fusing features from L−1 and L, the network simultaneously leverages the robust semantic invariance of deep features and the discriminative detail retention of intermediate features. A preliminary ablation study (detailed in Section 4.5) further validates that this two-layer configuration yields the optimal trade-off between accuracy and computational overhead, whereas including shallower layers (e.g., L−3) introduces noise sensitivity without commensurate performance gain.

3.4. Diagnosis Process

To effectively address the core challenges faced in bearing fault identification under the condition of few samples—extremely scarce training data and diverse fault modes—this paper designs and implements a few-shot fault diagnosis framework based on the meta-learning paradigm. The overall workflow of this framework is shown in Figure 3. Its core lies in simulating the few-shot scenarios of the target through episodic training, and combining the innovative adaptive similarity measurement and multi-layer feature fusion mechanism to achieve a fault diagnosis model with strong generalization ability from a limited number of samples. The specific implementation steps are as follows:

(1): Data preprocessing and dataset construction: Firstly, the collected original bearing vibration/acoustic signals are preprocessed in a standardized manner, and combined with feature enhancement methods such as time-frequency analysis, they are transformed into structured datasets suitable for deep network processing.
(2): Meta-task generation: Following the standard episodic training paradigm of meta-learning, the dataset is dynamically divided into a large number of meta-tasks. Each meta-task contains a support set (S) and a query set (Q). Among them, the support set contains N fault categories, and each category contains only K sample examples; the query set contains samples to be classified from the same N categories. This step generates the meta-training task set for model training and the meta-test task set for the final evaluation.
(3): Task batch sampling: During the meta-training phase, a batch of meta-training tasks is randomly sampled as the input for the current iteration. This batch processing improves the training efficiency and introduces task diversity, which is conducive to the model learning more universal generalization ability.
(4): Feature extraction and cross-task feature association: The batch of tasks sampled in step 3 is fed into the shared feature extractor, which processes the support set samples and query set samples of all tasks in parallel and extracts their high-dimensional feature representations. The cross-task feature concatenation operation is then performed: for each support-query sample pair in a task, the corresponding feature vectors are concatenated along the feature dimension to form a joint representation vector. This operation explicitly associates the support samples with the query samples and provides paired context information for the subsequent similarity measurement.
(5): Adaptive similarity measurement and multi-layer feature fusion: The concatenated feature vector Z_ij generated in step 4 is fed into the learnable similarity measurer (R) proposed in this paper. The multi-level feature fusion mechanism is introduced in this step: instead of using only the features of the final layer, the features output by the last two layers of the feature extractor (Layer L−1 and Layer L) are extracted simultaneously, and each is scored by its own similarity measurer.
(6): Loss calculation and parameter optimization: Based on the per-layer similarity scores obtained in step 5, the loss is computed hierarchically. For each layer l, the mean squared error (MSE) loss (Equation (14)) measures the difference between the predicted similarity score and the true similarity label (1 for same-class pairs and 0 for different-class pairs). The total loss of the model is then defined as the weighted sum of the losses of the last two layers (Equation (15)).
(7): Model evaluation and optimal weight selection: During the meta-testing phase, an independent meta-test task set is used to evaluate the model snapshots saved during training. The weights that achieve the highest diagnostic accuracy on this set are selected as the final model. The selected weights generalize well and can be applied directly to other few-shot bearing fault diagnosis scenarios with similar data distributions.
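
Steps (2) to (4) can be illustrated with a minimal sketch, given here under stated assumptions rather than as the implementation used in this paper. The function names `sample_episode` and `concat_pairs`, the toy class names, and the use of flat Python lists in place of the feature extractor output are all hypothetical. The sketch draws one N-way K-shot episode and then concatenates every support–query feature pair, attaching the similarity label (1 for the same class, 0 otherwise) that the MSE loss in step (6) compares against.

```python
import random


def sample_episode(data, n_way, k_shot, n_query, rng):
    """Draw one N-way K-shot episode from a {label: [feature_vector, ...]} dict."""
    classes = rng.sample(sorted(data), n_way)
    support, query = [], []
    for label in classes:
        # Support and query samples of a class are disjoint within an episode
        picked = rng.sample(data[label], k_shot + n_query)
        support += [(x, label) for x in picked[:k_shot]]
        query += [(x, label) for x in picked[k_shot:]]
    return support, query


def concat_pairs(support, query):
    """Concatenate each support/query feature pair along the feature axis.

    Returns (z_ij, target) tuples; target is 1.0 for same-class pairs, else 0.0.
    """
    pairs = []
    for s_feat, s_label in support:
        for q_feat, q_label in query:
            z_ij = list(s_feat) + list(q_feat)
            pairs.append((z_ij, 1.0 if s_label == q_label else 0.0))
    return pairs


# Toy stand-in for extractor output: 4 classes, 10 feature vectors of length 8 each
rng = random.Random(0)
toy = {
    name: [[rng.random() for _ in range(8)] for _ in range(10)]
    for name in ["normal", "inner", "ball", "outer"]
}

support, query = sample_episode(toy, n_way=3, k_shot=1, n_query=2, rng=rng)
pairs = concat_pairs(support, query)
print(len(support), len(query), len(pairs), len(pairs[0][0]))
```

In a three-way one-shot episode with two query samples per class, this yields 3 support samples, 6 query samples, and 18 joint vectors, of which 6 carry the label 1.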

Figure 3. Network model diagnosis process.

Through the process described above, the framework constructs a high-precision and robust bearing fault diagnosis model under the condition of an extremely small number of fault samples.

4. Experimental Analysis

To validate the proposed method, a series of experiments were conducted focusing on four key aspects: (1) sensitivity to limited training samples, (2) cross-load domain generalization, (3) robustness to noise, and (4) generalization from artificial to natural damage mechanisms. The experimental setups utilize the CWRU [24] and Paderborn University (PU) bearing datasets [25].

The experiments were implemented in PyTorch 1.10.0 and Python 3.8.5, running on an AMD Ryzen 9 7950X CPU @ 4.50 GHz (64 GB RAM) and an RTX 4070 Ti GPU platform. The meta-learning framework is trained for a total of 200 epochs, with 100 episodes randomly sampled per epoch. A meta-batch size of four episodes is used per gradient update. The initial learning rates are set to 0.01 for the feature extractor and 0.005 for the similarity measurer, and are decayed by a factor of 0.5 at epochs 50, 100, and 150. The multi-layer loss balancing parameter β is fixed at 0.1 unless stated otherwise. These settings are applied consistently across all subsequent experiments.
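
The step decay described above can be written out as a short sketch. The helper name `lr_at_epoch` is hypothetical, and the count of gradient updates assumes that every sampled episode is used in exactly one update, which the text does not state explicitly. In PyTorch, the same schedule corresponds to `torch.optim.lr_scheduler.MultiStepLR` with `milestones=[50, 100, 150]` and `gamma=0.5`; whether that scheduler was actually used here is an assumption.

```python
import math


def lr_at_epoch(base_lr, epoch, milestones=(50, 100, 150), gamma=0.5):
    """Step decay: multiply base_lr by gamma once for every milestone already reached."""
    n_decays = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** n_decays


# Training budget implied by the stated settings
epochs, episodes_per_epoch, meta_batch_size = 200, 100, 4
updates_per_epoch = episodes_per_epoch // meta_batch_size
total_updates = epochs * updates_per_epoch

for epoch in (0, 49, 50, 100, 150, 199):
    print(epoch, lr_at_epoch(0.01, epoch), lr_at_epoch(0.005, epoch))
print(updates_per_epoch, total_updates)
```

Under these assumptions, the feature extractor learning rate falls from 0.01 to 0.00125 over the run, the similarity measurer learning rate falls from 0.005 to 0.000625, and training performs 25 updates per epoch.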

4.1. Experiment 1: Sensitivity Analysis of Training Sample Size

This experiment uses the public benchmark dataset provided by the Bearing Data Center of Case Western Reserve University (CWRU) in the United States [24]. Owing to its reliability and wide recognition, this dataset has become an important benchmark for bearing fault diagnosis research.

Data acquisition settings: Sampling frequency: 12 kHz. Sensor arrangement: The vibration signal is collected by an accelerometer mounted vertically on the drive-end, fan-side bearing housing of the motor. Experimental setup: The schematic diagram of the experimental equipment is shown in Figure 4. Fault simulation method: Bearing faults are seeded by electric discharge machining (EDM). Single-point faults with diameters of 0.007, 0.014, and 0.021 inches are introduced separately on the inner race, outer race, and rolling element (ball) of the bearing.

Fault categories: As shown in Table 2, the dataset contains a total of 10 health states: the normal state and nine fault states that combine different fault locations (inner ring, outer ring, rolling elements) with different damage sizes. Data preprocessing and sample construction: The raw data are time-domain vibration signals. A fixed-length segmentation strategy is adopted: the continuous vibration signal is divided into non-overlapping segments of 1024 consecutive data points, each forming an independent data sample.

Dataset division: After applying the above classification and segmentation strategy, the CWRU dataset is organized into 10 categories. Each category contains 400 data samples of equal length (1024 points per sample). The clear division and complete category coverage provide a solid experimental basis for evaluating few-shot fault diagnosis methods.
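
The fixed-length segmentation can be sketched as follows. The function name `segment_signal` and the placeholder signal are hypothetical, and discarding the trailing points that do not fill a complete segment is an assumption, since the text does not say how the remainder is handled.

```python
def segment_signal(signal, segment_length=1024):
    """Split a 1-D signal into non-overlapping, equal-length segments.

    Trailing points that do not fill a whole segment are dropped (assumption).
    """
    n_segments = len(signal) // segment_length
    return [
        signal[i * segment_length:(i + 1) * segment_length]
        for i in range(n_segments)
    ]


# Placeholder signal: 5000 points -> 4 full segments, 904 points dropped
signal = [float(i) for i in range(5000)]
segments = segment_signal(signal)
print(len(segments), len(segments[0]))
```

Because the segments do not overlap, no data point is shared between two samples, which avoids leakage between samples drawn into the support and query sets.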

The characteristic fault frequencies calculated according to Equations (3)–(6) in Section 2.3 serve as the theoretical basis for the “Fault Type” and “Label” classifications presented in Table 2. The signal processing framework in this paper (Section 3) is designed to extract and learn features that are discriminative of these underlying physical phenomena, even from very few samples. The severity of the fault (e.g., 0.007″, 0.014″, 0.021″) modulates the amplitude and bandwidth of these excitations.

This section addresses the first key issue: the impact of the number of training samples on model performance. For this purpose, we designed a comparative experiment with training sample sizes of 4, 8, and 12 per fault category. Under the meta-learning framework, the data of the normal state, inner ring faults, and rolling element faults are selected as the meta-training set, while the data of outer ring faults are used as the meta-testing set to evaluate the generalization ability of the model on unseen categories. The experiment evaluates two typical few-shot learning scenarios: three-way one-shot (the support set of each task contains three categories with one sample each) and three-way five-shot (the support set of each task contains three categories with five samples each). The experimental parameters are configured as follows: the model parameter β is set to 0.1; the learning rate of the feature extraction network is set to 0.01; the learning rate of the similarity measurement module is set to 0.005; and the Adam optimizer is used.

To comprehensively evaluate the performance of the proposed method, we designed a systematic comparative experiment. The few-shot bearing diagnosis framework based on the Siamese network (SN) [26] was adopted as the benchmark model. In addition, the classic meta-learning methods Relation Module (RN) [20] and Model-Agnostic Meta-Learning (MAML) [27], together with the Time-Frequency Convolutional Feature Pyramid Network (TFCFPN) [28] and Dual-Channel Feature Fusion Meta-Learning (DCFFML) [29], were introduced as comparison methods. Each method was tested using five different random seeds, and the results are presented as the mean ± standard deviation.

To ensure the statistical reliability of the results, for each training set size (i.e., 4, 8, or 12 samples per class) and each few-shot setting (one-shot and five-shot), we independently repeated the complete algorithm training and testing process five times. This approach effectively mitigates the influence of randomness in data sampling and model initialization. The mean diagnostic accuracies and their standard deviations (SDs) for each method under different settings are summarized in Table 3, with an intuitive comparison presented in the bar chart of Figure 5.
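
The mean ± SD reporting over five repetitions amounts to the following computation. The accuracy values are placeholders chosen only for illustration and are not taken from Table 3, and the use of the sample standard deviation (n − 1 in the denominator) is an assumption, since the text does not specify which estimator was used.

```python
import statistics

# Placeholder accuracies (%) from five hypothetical runs; NOT values from Table 3
placeholder_accuracies = [90.0, 92.0, 91.0, 93.0, 89.0]

mean_acc = statistics.mean(placeholder_accuracies)
# Sample standard deviation (n - 1 denominator); statistics.pstdev gives the population form
sd_acc = statistics.stdev(placeholder_accuracies)

print(f"{mean_acc:.2f} ± {sd_acc:.2f}")
```

With only five runs, the two estimators differ noticeably, so stating which one is reported makes the SD columns easier to compare across papers.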

The analysis of the experimental results (Table 3 and Figure 5) leads to the following key observations. In the three-way few-shot learning scenario, the Siamese network benchmark performs the weakest, primarily due to its reliance on a fixed Euclidean distance metric, which lacks the adaptability to capture complex, multi-modal bearing fault features. The optimization-based MAML method shows better performance but suffers from significant computational overhead due to its bi-level optimization.

More importantly, by comparing with recent advanced methods, the superiority of our proposed method is further highlighted. RN achieves competitive results, yet its performance remains consistently lower than ours across all settings, indicating the advantage of our learnable similarity metric over traditional distance-based meta-learning. TFCFPN and DCFFML, which incorporate feature fusion or customized meta-learning strategies, show improved performance over Siamese and MAML. However, our method consistently outperforms all of them, achieving the highest mean accuracy and the lowest standard deviation in nearly all cases. This significant and stable advantage demonstrates that our proposed framework, combining multi-layer feature fusion and an adaptive similarity scheduler, not only achieves state-of-the-art diagnostic accuracy under extremely limited data conditions but also exhibits superior robustness against training randomness. The learnable parametric similarity scaler is crucial, as it dynamically optimizes the similarity criterion, effectively alleviating the constraints imposed by fixed-metric functions and enabling the feature extractor to mine more discriminative fault characteristics.

The metric-based meta-learning method proposed in this paper therefore shows significant advantages. Its learnable metric dynamically learns and optimizes the similarity criterion among samples, which alleviates the constraints that fixed-metric functions (such as Euclidean distance) impose on the feature extractor and enables it to mine and represent the discriminative features of complex fault data more effectively.

4.2. Experiment 2: Cross-Load Domain Generalization Capability Verification

To verify the generalization capability of the proposed model when the load changes, the CWRU data labeled Fault Type-Inner Ring Fault (0.1778A) are used as the training and validation sets, and the data labeled Fault Type-Inner Ring Fault (0.3556B/0.5334C) are used as the test set (denoted as A-B and A-C). The training and test data therefore come from different loads. The proposed method was compared with RN, CNN, MobileNet-V1, InceptionResNet, and DenseNet. The bearing fault diagnosis accuracy under variable working conditions is shown in Figure 6.

The convergence curves in Figure 6 indicate that the proposed method converges markedly faster than the comparison models. Specifically, it reaches the highest accuracy in this experiment after only 50 iterations, demonstrating high training efficiency.

The convergence behavior of the other models is analyzed as follows.

MobileNet-V1 model: Because its design centers on model lightweighting (parameter reduction), its accuracy improves slowly under cross-load conditions and ultimately fails to reach the highest accuracy achieved by the proposed model. This indicates that pursuing a lightweight design may sacrifice feature extraction ability and convergence potential in complex cross-domain scenarios.

DenseNet model: Owing to its dense connection mechanism, the convergence curve of this model is generally slightly higher than that of the basic CNN model. This result suggests that dense connections promote feature reuse and information flow between network layers, thereby enhancing the ability to identify fault features.

InceptionResNet model: This model combines multi-scale feature extraction (Inception module) with gradient propagation optimization (residual structure). Its accuracy rises faster than that of the DenseNet model in the initial stage, but its convergence shows considerable volatility and relatively poor stability, suggesting potential challenges in the model structure or optimization process.

Based on the above experimental results, the proposed method not only converges fastest and quickly reaches high diagnostic accuracy but also achieves stable, high-precision fault diagnosis on bearing vibration data under different load conditions (i.e., cross-domain conditions). This demonstrates the strong generalization performance of the method and its ability to adapt to the complex and changeable working environments of actual industrial scenarios.

4.3. Experiment 3: Robustness Test of Anti-Noise Performance

The vibration signals of rolling bearings in industrial systems are complex and strongly disturbed by environmental noise. The signal-to-noise ratio (SNR), defined in Equation (16), quantifies the relative strength of signal and noise. In this experiment, additive white Gaussian noise (AWGN) with different SNRs (in the range of −4 dB to 12 dB) is added to the signals of the experimental dataset to form noisy composite signals. Figure 7 shows the construction of a composite signal with an SNR of −4 dB, in which the periodic impacts of the original signal are significantly masked, which is unfavorable for subsequent fault diagnosis.

\mathrm{SNR} = 10 \lg \left( \frac{P_{\mathrm{signal}}}{P_{\mathrm{noise}}} \right) \qquad (16)
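
A minimal sketch of adding AWGN at a prescribed SNR according to Equation (16) is given below. The function names `mean_power`, `add_awgn`, and `measure_snr_db` and the 50 Hz test tone are hypothetical. The sketch rescales the drawn noise by its empirical power so that the realized SNR matches the target exactly; this is one common convention and is assumed here, not taken from the paper.

```python
import math
import random


def mean_power(x):
    """Average power of a discrete signal (mean of squared samples)."""
    return sum(v * v for v in x) / len(x)


def add_awgn(signal, target_snr_db, rng):
    """Add white Gaussian noise so that 10*log10(P_signal / P_noise) equals target_snr_db."""
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    target_noise_power = mean_power(signal) / (10 ** (target_snr_db / 10))
    # Rescale using the empirical noise power so the target SNR is met exactly
    scale = math.sqrt(target_noise_power / mean_power(noise))
    noise = [scale * n for n in noise]
    noisy = [s + n for s, n in zip(signal, noise)]
    return noisy, noise


def measure_snr_db(signal, noise):
    """Equation (16): SNR in dB from signal power and noise power."""
    return 10 * math.log10(mean_power(signal) / mean_power(noise))


rng = random.Random(0)
# Placeholder clean signal: a 50 Hz tone sampled at 12 kHz, one 1024-point segment
clean = [math.sin(2 * math.pi * 50 * t / 12000) for t in range(1024)]
noisy, noise = add_awgn(clean, target_snr_db=-4, rng=rng)
print(round(measure_snr_db(clean, noise), 6))
```

At −4 dB the noise power is about 2.5 times the signal power, which is why the periodic impacts are no longer visible in the time-domain waveform.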

Figure 8 shows how the diagnostic performance of the different models varies with the SNR in a noisy environment. The following key conclusions can be drawn.

MobileNet performs poorly under strong noise: When SNR = −4 dB (the noise is extremely strong), the diagnostic accuracy of the MobileNet model is relatively low. This mainly stems from its lightweight design: reducing the number of parameters and the computational load limits the feature extraction ability of the model, making it difficult to learn the weak fault features submerged in high-intensity noise.

The anti-noise ability of the CNN model is limited: The basic CNN model is a relatively shallow network without targeted noise suppression mechanisms. Its accuracy therefore improves slowly on noisy input signals, indicating insufficient learning and generalization efficiency in noisy environments.

The proposed model demonstrates outstanding anti-noise performance: Across the wide noise range of SNR from −4 dB to 12 dB, the proposed model consistently and stably outperforms the other four comparison models. This result indicates that the model has a strong ability to extract fault features and can identify key fault information against a complex noise background.

The proposed model is also inherently robust: it maintains this anti-interference ability without any specialized denoising preprocessing, which indicates that the architecture itself is insensitive to noise disturbances in the input signal. The overall trend in Figure 8 shows that the diagnostic accuracy of all five models gradually improves as the SNR increases, which is consistent with expectations. Notably, when the SNR exceeds 6 dB, the accuracy of each model tends to stabilize. This indicates that, at medium and higher SNRs, the room for improving diagnostic accuracy is limited, and the performance bottleneck of a model lies mainly in its ability to extract and use effective fault features rather than in its noise suppression ability. The continued lead of the proposed model in this regime further confirms the effectiveness of its core feature extraction module.

4.4. Experiment 4: Generalization Exploration Across Damage Mechanisms

To more comprehensively evaluate the generalization ability of the proposed method in actual industrial scenarios, this study further applies it to the Paderborn bearing dataset. This choice is crucial because, unlike the CWRU dataset which contains only artificially seeded defects (EDM, drilling), the Paderborn dataset provides samples of naturally evolved faults from accelerated life tests. These natural faults (e.g., fatigue pitting, plastic indentation) are the direct result of the degradation of contacting surfaces, a process intrinsically linked to the effectiveness and condition of the lubricant.

The Paderborn University (PU) bearing dataset is specifically designed for research on bearing condition monitoring and fault diagnosis. It contains 32 type 6203 bearings: 12 artificially damaged bearings (with damage introduced by specific machining processes), 14 naturally damaged bearings (obtained by running accelerated life tests until failure), and 6 healthy (normal) bearings. The dataset classifies fault severity by damage length: Level 1, damage length < 2 mm; Level 2, damage length ≥ 2 mm.

For the artificially defective bearings, three different processing modes were adopted to simulate different types of damage. The naturally faulty bearings, in contrast, were generated in strictly controlled run-to-failure tests, which are closer to the failure process in actual service. All bearings were installed on a unified modular test bench for standardized testing. During the experiments, motor current signals and vibration signals were collected simultaneously.

In this case study, we focus on representative samples under a specific working condition (N09_M07_F10): one healthy bearing, eight artificially defective bearings covering different processing modes and severities, and four naturally faulty bearings. A total of 13 bearings were used to construct the few-shot learning task. The detailed configuration information of these selected bearings is summarized in Table 4. The schematic diagram of the experimental setup used for collecting the Paderborn dataset is shown in Figure 9.

The dataset composition and selection criteria (Table 4) are designed to explicitly test the hypothesis: Can a model trained on artificial faults (clean, geometrically precise damage) generalize to natural faults (rough, irregular, tribologically evolved damage)?

To further verify the generalization ability of the proposed method in the key scenario of transferring from artificially simulated defects to naturally degraded faults, a comparative experiment was designed. Several representative few-shot learning (FSL) and transfer learning algorithms were selected as benchmarks: Direct Training Network (DTN) [30], Feature Transfer Network (FTN) [30], and Model-Agnostic Meta-Learning (MAML) [27]. To ensure a fair comparison and reliable conclusions, all methods follow the same settings. Dataset: all methods use the Paderborn University dataset. Data preprocessing: all methods apply the Fast Fourier Transform (FFT) to convert the original vibration signal to the frequency domain as the model input. Meta-training set: it contains only the data of one healthy bearing and eight artificially damaged bearings, so that the model is trained using only artificial fault data and tested on unseen natural faults. Meta-test set: evaluation is conducted using data from four naturally damaged bearings. Evaluation task: four-way K-shot classification on the meta-test set, with K = 1, 5, 10; each task requires identifying four states simultaneously, and the support set provides K samples per category.
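
The frequency-domain preprocessing can be illustrated with a small sketch. It evaluates the discrete Fourier transform directly in O(n²), which yields the same values that an FFT computes faster (for example `numpy.fft.rfft` in practice). The function name `magnitude_spectrum`, the one-sided magnitude output, and the 1/n normalization are assumptions, since the text states only that the FFT is applied.

```python
import cmath
import math


def magnitude_spectrum(x):
    """One-sided DFT magnitude spectrum, evaluated directly in O(n^2).

    An FFT returns the same values faster; this direct form is for illustration only.
    """
    n = len(x)
    return [
        abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))) / n
        for k in range(n // 2 + 1)
    ]


# Placeholder segment: a pure tone completing exactly 8 cycles in 64 samples
n = 64
segment = [math.sin(2 * math.pi * 8 * t / n) for t in range(n)]
spectrum = magnitude_spectrum(segment)
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
print(peak_bin, round(spectrum[peak_bin], 6))
```

For a bearing signal, the periodic impacts of a fault appear as energy concentrated around specific bins, which is the structure the feature extractor receives as input in this experiment.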

The fault classification accuracy obtained by each method on the Paderborn dataset is reported in Table 5, with a visual comparison in Figure 10. Experimental reproducibility and statistical robustness: for the benchmark methods Direct Training Net and Feature Transfer Net, the results were obtained by running the open-source code provided by the authors. To reduce the effect of randomness in data sampling and task construction, all accuracy values reported in Table 5 and Figure 10 are means over 10 independent tests, accompanied by their standard deviations (SDs).

As shown in Table 5, our proposed method consistently achieves the highest mean accuracy across all few-shot settings (one-shot, five-shot, and ten-shot), while also yielding the smallest standard deviations in most cases. In the challenging one-shot scenario, our method attains 92.89% ± 1.32%, outperforming MAML (89.16% ± 1.43%), Feature Transfer Net (87.16% ± 1.20%), and Direct Training Net (78.10% ± 1.33%). The advantage persists as the shot count increases. For the five-shot task, our method achieves 98.04% ± 0.63%, compared to 94.51% ± 0.83% for MAML and 92.25% ± 1.08% for Feature Transfer Net. In the ten-shot setting, our method reaches near-perfect accuracy (99.60% ± 0.13%), demonstrating robustness and generalization stability when transferring from artificial to natural fault mechanisms. The standard deviation of our method also shrinks from 1.32% to 0.13% as the shot count increases, indicating low sensitivity to random sampling variations.

To examine the representation quality of the different algorithms in the feature space, we adopt t-SNE (t-distributed Stochastic Neighbor Embedding) [31]. The deep features learned by the four algorithms in the five-shot learning scenario were reduced in dimensionality and visualized (Figure 11). The following conclusions can be drawn from the visualization:

Poor feature separability of the Direct Training Net: As shown in Figure 11, the feature distribution of this method presents significant overlap and confusion. This intuitively reflects the serious insufficiency of its generalization ability—the model, trained only on artificially simulated fault data, lacks sufficient discriminability when tested on unseen naturally degraded fault data, resulting in blurred boundaries between different categories.

Limited improvement in the performance of the Feature Transfer Net: Compared with the Direct Training Net, the feature distribution of the Feature Transfer Net (Figure 11) shows a certain degree of separation. However, the feature clusters of different categories remain close to each other, with obvious local overlap. This indicates that its feature transfer process fails to fully adapt to the characteristics of the target domain (natural faults), so the improvement in generalization performance is limited.

The MAML algorithm demonstrates the advantages of meta-learning: Thanks to its meta-learning framework, the feature distribution learned by the MAML algorithm (Figure 11) shows a better clustering structure. The specific manifestations are as follows: the compactness of intra-class features has improved, while the separation degree of inter-class features has been enhanced. This verifies the effectiveness of meta-learning in enhancing the model’s ability to quickly adapt to new tasks, such as identifying natural faults.

The method proposed in this paper achieves the best feature representation: The result of the proposed method (Figure 11) is the most distinct. Its feature distribution is highly compact within each class, with sample points of the same category forming tight, clear clusters, and well separated between classes, with the clusters of different classes lying far apart with clear boundaries and almost no overlap. This contrast indicates that the proposed method learns feature representations with strong discriminability and robustness. By enforcing multi-layer feature constraints and learning an adaptive metric, the model looks past the superficial textural differences between EDM cuts and fatigue spalls and focuses on the underlying, physically meaningful vibration patterns (e.g., impulse periodicity corresponding to BPFO/BPFI) that are common to both failure generation mechanisms. This maximization of intra-class similarity and inter-class difference in the feature space is the intrinsic reason for its diagnostic performance, particularly in this challenging scenario of generalizing from artificial to natural faults.

4.5. Sensitivity Analysis of the Balancing Parameter β

To thoroughly investigate the influence of the multi-level loss balancing parameter β (introduced in Equation (15)) on the diagnostic performance of the proposed method, a dedicated sensitivity analysis is conducted. The parameter β controls the contribution of the high-level semantic features (Layer L) relative to the shallower, detail-rich features (Layer L−1) in the total loss function. Understanding its impact is crucial for optimal model configuration and for demonstrating the robustness of the multi-layer feature fusion strategy.
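
The role of β can be made concrete with a small sketch. Equation (15) is not reproduced in this section, so the form used below, total loss = loss(Layer L−1) + β · loss(Layer L), is an assumption chosen to be consistent with the description that β scales the contribution of the deep-layer loss; the actual weighting in Equation (15) may differ. The function names and the similarity scores are hypothetical.

```python
import math


def mse(pred, target):
    """Mean squared error between predicted similarity scores and 0/1 similarity labels."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def total_loss(scores_shallow, scores_deep, targets, beta=0.1):
    """Assumed fusion: loss(Layer L-1) + beta * loss(Layer L); Equation (15) may differ."""
    return mse(scores_shallow, targets) + beta * mse(scores_deep, targets)


# Hypothetical similarity scores for three support-query pairs (one same-class, two not)
targets = [1.0, 0.0, 0.0]
scores_shallow = [0.8, 0.3, 0.1]   # from the measurer attached to Layer L-1
scores_deep = [0.9, 0.2, 0.0]      # from the measurer attached to Layer L

for beta in (0.01, 0.1, 1.0):
    print(beta, round(total_loss(scores_shallow, scores_deep, targets, beta), 6))
```

Under this assumed form, β = 0.01 leaves the gradient almost entirely to the Layer L−1 loss, while β = 1.0 weights both layers equally.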

This experiment utilizes the CWRU dataset under the three-way five-shot task with eight training samples per class. The value of β is varied across a representative range: {0.01, 0.05, 0.1, 0.15, 0.2, 0.3, 0.5, 0.8, 1.0}. The default learning rates and optimizer settings remain unchanged from Section 4.1.

As shown in Table 6, the diagnostic accuracy exhibits a clear trend with respect to β. When β is set to a very small value (β = 0.01), the contribution of the deep layer loss (Layer L) is heavily suppressed. In this case, the model relies predominantly on the shallower features (Layer L−1), which, while rich in local textures and edge details, lack the necessary semantic abstraction for robust fault classification. Consequently, the diagnostic accuracy is relatively low (87.43%).

As β increases from 0.01 to 0.1, the accuracy improves significantly, peaking at 96.33% when β = 0.1. This indicates that incorporating an appropriate contribution from high-level semantic features (Layer L) effectively complements the shallow details, leading to a more discriminative and robust feature representation. The optimal balance is achieved when the deep semantic features are given moderate weight, allowing the model to leverage both local and global information synergistically.

When β continues to increase beyond 0.1 (from 0.15 to 1.0), a gradual decline in accuracy is observed, dropping to 84.45% at β = 1.0. This suggests that over-emphasizing the deep layer loss can be detrimental. An excessively large β forces the model to prioritize high-level semantic abstraction while potentially neglecting the fine-grained, discriminative details present in the shallower layers.

4.6. Ablation Study on Multi-Layer Feature Fusion Configurations

To rigorously investigate the influence of the number and depth of fused feature layers on diagnostic performance, an ablation experiment was conducted.

All experiments were performed on the CWRU dataset under the three-way five-shot setting with eight training samples per class, and the loss weights for each fused layer were set equal for consistency. The results are summarized in Table 7.

The results demonstrate that the two-layer fusion achieves the highest accuracy. Extending fusion to three or four layers yields marginally lower accuracy, which can be attributed to the inclusion of shallow features that are more susceptible to noise and operating condition variations. Moreover, the computational cost increases with the number of fused layers due to the parallel similarity measurers. The two-layer configuration thus provides the optimal balance between diagnostic precision and computational efficiency, justifying its selection in the proposed framework.

5. Conclusions

This paper addressed the critical challenge of data scarcity in rolling bearing fault diagnosis, a problem that severely limits the practical deployment of deep learning models in industrial settings. The work is grounded in the fundamental principle that lubrication state directly influences vibration signatures, thereby motivating the development of a diagnostic framework capable of learning from extremely limited labeled samples while maintaining robustness to real-world operational variations.

To overcome the limitations of existing methods, a novel few-shot diagnosis framework was proposed. The approach employs a meta-learning paradigm with episodic training to explicitly simulate sample scarcity. Its core innovations include a learnable similarity measurer that adaptively generates optimal pseudo-distance metrics, replacing traditional fixed-distance functions. Furthermore, a multi-layer feature fusion mechanism was introduced to jointly constrain the network with both shallow discriminative details and deep semantic features, significantly enhancing feature robustness against noise and operational changes.

Systematic experimental validation on the CWRU and PU bearing datasets substantiates the effectiveness of the proposed method. The key findings are as follows: (1) The framework achieves stable diagnostic accuracy above 90% under extremely few-shot conditions (e.g., three-way one-shot and five-shot tasks). (2) It maintains high accuracy with minimal fluctuation under cross-load domain shifts, demonstrating strong load invariance. (3) The model exhibits excellent inherent noise immunity without requiring signal denoising preprocessing. (4) Most critically, the method successfully generalizes from laboratory-simulated artificial faults to naturally evolved fatigue failures, confirming its ability to learn underlying physical vibration patterns rather than superficial defect geometries.

Despite these promising results, limitations remain. The current framework relies solely on vibration data and does not explicitly incorporate prior physical knowledge of the bearing degradation process. Future research will focus on integrating tribological and dynamic physical information—such as surface fatigue evolution models, lubrication film thickness dynamics, or spall propagation rates—into the meta-learning architecture. This physics-informed machine learning fusion has the potential to further enhance the model’s extrapolation capability across different bearing geometries and failure stages, paving the way for more predictive and interpretable condition monitoring systems.

Author Contributions

Conceptualization, C.D. and S.W.; methodology, C.D.; software, C.D.; validation, S.W., H.Z. and L.F.; resources, C.D.; data curation, C.D.; writing—original draft preparation, C.D.; writing—review and editing, D.D. and S.W.; project administration, C.D.; funding acquisition, C.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Technologies R&D Program of Liangping Science and Technology Bureau of Chongqing of China, grant number LpKj20240008, and the Natural Science Foundation of Sichuan Province of China, grant number 2022NSFSC0416.

Data Availability Statement

The data used in this article can be obtained from references [24,25].

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Nomenclature

AE	Autoencoder
ALT	Accelerated Life Testing
AWGN	Additive White Gaussian Noise
BFxx	Ball Fault with Fault Size
BPFI	Ball Pass Frequency of Inner Race
BPFO	Ball Pass Frequency of Outer Race
BSF	Ball Spin Frequency
CNN	Convolutional Neural Network
CWRU	Case Western Reserve University
DASMN	Domain Adversarial Similarity Meta-learning Network
DNN	Deep Neural Network
EDM	Electric Discharge Machining
FFT	Fast Fourier Transform
FSL	Few-Shot Learning
FTF	Fundamental Train Frequency
GAN	Generative Adversarial Network
IFxx	Inner Ring Fault with Fault Size
L	Layer
MAML	Model-Agnostic Meta-Learning
MatchNet	Matching Network
MSE	Mean Squared Error
OFxx	Outer Ring Fault with Fault Size
PU	Paderborn University
Q	Query Set
ReLU	Rectified Linear Unit
S	Support Set
SEMN	Squeeze-Excitation Meta-learning Network
SNR	Signal-to-Noise Ratio
t-SNE	t-Distributed Stochastic Neighbor Embedding

References

1. Su, X.; Han, J.; Chen, C.; Lu, J.; Ma, W.; Dai, X. Intelligent Workshop Bearing Fault Diagnosis Method Based on Improved Convolutional Neural Network. Lubricants 2025, 13, 521.
2. Yang, J.; Bai, Y.; Ma, X.; Yang, J.; Chai, L.; Meng, X. A Complex Multi-Working-Condition Bearing Fault Diagnosis Model Based on Sparse Representation Classification. Lubricants 2026, 14, 27.
3. Pan, C.; Shang, Z.; Li, W.; Liu, F.; Tang, L. Bearing fault diagnosis based on high-confidence pseudo-labels and dual-view multi-adversarial sparse joint attention network under variable working conditions. Eng. Appl. Artif. Intell. 2024, 133, 108625.
4. Jia, W.; Dong, Z.; Shi, H.; Zhao, Y.; Wang, Z. A zero-shot bearing fault diagnosis framework utilizing spatial relationships among primary features under variable working conditions. Mech. Syst. Signal Process. 2025, 238, 113103.
5. Tian, S.; Zhen, D.; Li, H.; Feng, G.; Zhang, H.; Gu, F. Adaptive resonance demodulation semantic-induced zero-shot compound fault diagnosis for railway bearings. Measurement 2024, 235, 115040.
6. Zhu, L.; Wang, J.; Chen, M.; Liu, L. Fusion-driven fault diagnosis based on adaptive tuning feature mode decomposition and synergy graph enhanced transformer for bearings under noisy conditions. Expert Syst. Appl. 2025, 260, 125441.
7. Chen, Y.; Pu, X.; Li, G.; Bai, Y.; Hao, L. Few-Shot Fault Diagnosis of Rolling Bearings Using Generative Adversarial Networks and Convolutional Block Attention Mechanisms. Lubricants 2025, 13, 515.
8. Deng, D.; Li, W.; Liu, J.; Qin, Y. Node-Incremental-Based Multisource Domain Adaptation for Fault Diagnosis of Rolling Bearings with Limited Data. Machines 2026, 14, 71.
9. Zhang, Q.; Zhang, Y.; Qin, J.; Duan, J.; Zhou, Y. Dynamic MAML with Efficient Multi-Scale Attention for Cross-Load Few-Shot Bearing Fault Diagnosis. Entropy 2025, 27, 1063.
10. Xu, M.; Pan, H.; Wang, S.; Sun, S. Transformer-Embedded Task-Adaptive-Regularized Prototypical Network for Few-Shot Fault Diagnosis. Electronics 2025, 14, 3838.
11. Zhu, P.; Deng, L.; Tang, B.; Yang, Q.; Li, Q. Digital twin-enabled entropy regularized wavelet attention domain adaptation network for gearboxes fault diagnosis without fault data. Adv. Eng. Inform. 2025, 64, 103055.
12. Pei, X.; Li, X.; Li, J.; Gao, Y.; Gao, L. Threshold alignment indicator driven two-phase nonlinear degradation model for remaining useful life prediction of rolling bearing. Adv. Eng. Inform. 2025, 67, 103507.
13. Mao, W.; Liu, Y.; Ding, L. Imbalanced fault diagnosis of rolling bearing based on generative adversarial network: A comparative study. IEEE Access 2019, 7, 9515–9530.
14. Dai, J.; Wang, J. Anomaly detection of mechanical systems based on generative adversarial network and auto-encoder. Chin. J. Sci. Instrum. 2019, 40, 16–26.
15. Zhou, F.; Yang, S.; Fujita, H. Deep learning fault diagnosis method based on global optimization GAN for unbalanced data. Knowl.-Based Syst. 2020, 187, 104837.
16. Arjovsky, M.; Bottou, L. Towards principled methods for training generative adversarial networks. arXiv 2017, arXiv:1701.04862.
17. Zhang, K.; Chen, J.; Zhang, T. Intelligent fault diagnosis of mechanical equipment under varying working condition via iterative matching network augmented with selective signal reuse strategy. J. Manuf. Syst. 2020, 57, 400–415.
18. Feng, Y.; Chen, J.; Yang, Z. Similarity-based meta-learning network with adversarial domain adaptation for cross-domain fault identification. Knowl.-Based Syst. 2021, 217, 106829.
19. Feng, Y.; Chen, J.; Zhang, T. Semi-supervised meta-learning networks with squeeze-and-excitation attention for few-shot fault diagnosis. ISA Trans. 2022, 120, 383–401.
20. Sung, F.; Yang, Y.; Zhang, L.; Xiang, T.; Torr, P.H.; Hospedales, T.M. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1199–1208.
21. Li, F.; Liu, Y. A Survey on Recent Advances in Meta-Learning. Chin. J. Comput. 2021, 44, 422–446.
22. Lynagh, N.; Rahnejat, H.; Ebrahimi, M.; Aini, R. Bearing induced vibration in precision high speed routing spindles. Int. J. Mach. Tools Manuf. 2000, 40, 561–577.
23. Vafaei, S.; Rahnejat, H.; Aini, R. Vibration monitoring of high speed spindles using spectral analysis techniques. Int. J. Mach. Tools Manuf. 2002, 42, 1223–1234.
24. Case Western Reserve University. Case Western Reserve University (CWRU) Bearing Data Center. Available online: https://engineering.case.edu/bearingdatacenter/download-data-file (accessed on 3 April 2026).
25. Zhu, Z.; Peng, G.; Chen, Y.; Gao, H. A convolutional neural network based on a capsule network with strong generalization for bearing fault diagnosis. Neurocomputing 2019, 323, 62–75.
26. Zhang, A.; Li, S. Limited data rolling bearing fault diagnosis with few-shot learning. IEEE Access 2019, 7, 110895–110904.
27. Ravi, S.; Larochelle, H. Optimization as a model for few-shot learning. In Proceedings of the International Conference on Learning Representations, San Juan, Puerto Rico, 2–4 May 2016.
28. Liu, D.; Deng, L.; Zhao, C.; Yang, D.; Zhang, Y.; Wang, G. A noise-robust and cross-domain few-shot fault diagnosis method of rolling bearings based on TFC-FPN. Meas. Sci. Technol. 2025, 36, 046127.
29. Xie, Z.; Zhan, H.; Wang, Y.; Zhan, C.; Wang, Z.; Jia, N. Meta-learning-based fault diagnosis method for rolling bearings under cross-working conditions. Meas. Sci. Technol. 2025, 36, 016218.
30. Wu, J.; Zhao, Z.; Sun, C. Few-shot transfer learning for intelligent fault diagnosis of machine. Measurement 2020, 166, 108202.
31. van der Maaten, L.; Hinton, G. Visualizing Data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605.

Figure 1. Metric-based meta-learning process.

Figure 2. Network model based on multi-layer feature fusion and similarity measurement.

Figure 4. CWRU testbed.

Figure 5. One-shot diagnostic accuracy for each training sample size.

Figure 6. Schematic diagram of the cross-load test results of CWRU.

Figure 7. The original signal, the noise signal, and the composite signal at a signal-to-noise ratio of −4 dB.

Figure 8. A comparison diagram of the CWRU noise verification experiment.

Figure 9. PU testbed.

Figure 10. Diagnostic accuracy of different methods.

Figure 11. Feature visualization results of the different methods.
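
The −4 dB composite signal of Figure 7 can be reproduced in outline with the sketch below. It is a minimal illustration, assuming that the additive white Gaussian noise (AWGN) is scaled from the measured power of the clean signal; the function names, the fixed seed, and the 100 Hz test tone sampled at 12 kHz are hypothetical and are not taken from the experiments.

```python
import math
import random


def mean_power(x):
    """Average power of a sequence of samples."""
    return sum(v * v for v in x) / len(x)


def add_awgn(signal, snr_db, rng):
    """Return (noisy, noise), with the noise variance set from the target SNR.

    SNR_dB = 10 * log10(P_signal / P_noise), hence
    P_noise = P_signal / 10 ** (SNR_dB / 10).
    """
    noise_power = mean_power(signal) / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    noise = [rng.gauss(0.0, sigma) for _ in signal]
    noisy = [s + n for s, n in zip(signal, noise)]
    return noisy, noise


def measured_snr_db(signal, noise):
    """SNR in dB computed from the realized signal and noise samples."""
    return 10 * math.log10(mean_power(signal) / mean_power(noise))


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, for repeatability only
    # Hypothetical test tone, not a bearing signal: 100 Hz sine sampled at 12 kHz.
    clean = [math.sin(2 * math.pi * 100 * n / 12000) for n in range(12000)]
    noisy, noise = add_awgn(clean, snr_db=-4.0, rng=rng)
    print(f"measured SNR: {measured_snr_db(clean, noise):.2f} dB")
```

At −4 dB the noise carries roughly 2.5 times the power of the signal, which is why the composite waveform is visually dominated by noise.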

Table 1. Network layer structure.

Structure	Operation	Kernel Size	Stride	Padding	Number of Kernels	Activation Function
Feature Extractor	Input	/	/	/	/	/
	Conv1d	10 × 1	1	Same	64	ReLU
	Max Pooling	2	2	Valid	/	/
	Conv1d	3 × 1	1	Same	64	ReLU
	Max Pooling	2	2	Valid	/	/
	Conv1d	3 × 1	1	Same	64	ReLU
	Conv1d	3 × 1	1	Same	64	ReLU
	Average Pooling	25	1	Valid	/	/
Similarity Measurer	Conv1d	3 × 1	1	Same	64	ReLU
	Max Pooling	2	2	Valid	/	/
	Conv1d	3 × 1	1	Same	64	ReLU
	Max Pooling	2	2	Valid	/	/
	Feature Concatenation	/	/	/	64 (32 support + 32 query)	/
	Connected	/	/	/	8	ReLU
	Connected	/	/	/	1	Sigmoid
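
As a rough check on Table 1, the sketch below traces the temporal length of a signal through the feature extractor. It assumes a hypothetical input length of 1024 points, since the table does not state one. It also reads "Same" padding as length-preserving at stride 1 and "Valid" as no padding. All function and variable names are illustrative.

```python
# Minimal shape trace for the feature extractor rows of Table 1.
# Assumption (not stated in the table): the input length of 1024 points.


def conv_same(length, kernel, stride=1):
    """'Same' padding: output length is ceil(length / stride)."""
    return -(-length // stride)


def pool_valid(length, kernel, stride):
    """'Valid' pooling: no padding, partial windows are dropped."""
    return (length - kernel) // stride + 1


# (operation, kernel size, stride), copied from Table 1.
FEATURE_EXTRACTOR = [
    ("conv", 10, 1),
    ("maxpool", 2, 2),
    ("conv", 3, 1),
    ("maxpool", 2, 2),
    ("conv", 3, 1),
    ("conv", 3, 1),
    ("avgpool", 25, 1),
]


def trace_lengths(input_length, layers=FEATURE_EXTRACTOR):
    """Return the sequence length after every layer."""
    lengths = []
    length = input_length
    for op, kernel, stride in layers:
        if op == "conv":
            length = conv_same(length, kernel, stride)
        else:
            length = pool_valid(length, kernel, stride)
        if length < 1:
            raise ValueError("input too short for this layer stack")
        lengths.append(length)
    return lengths


if __name__ == "__main__":
    for (op, kernel, stride), n in zip(FEATURE_EXTRACTOR, trace_lengths(1024)):
        print(f"{op:8s} kernel={kernel:<2d} stride={stride} -> length {n}")
```

Under these assumptions, the two max-pooling layers reduce the length by a factor of four, and the final average pooling with a window of 25 and stride 1 removes a further 24 points.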

Table 2. Description of the CWRU bearing experimental data.

Fault Type	Size/mm	Total Samples	Label	Remarks
Normal	-	400	0	NO
Inner Ring Fault	0.1778	400	1	IF07
	0.3556	400	2	IF14
	0.5334	400	3	IF21
Outer Ring Fault	0.1778	400	4	OF07
	0.3556	400	5	OF14
	0.5334	400	6	OF21
Rolling Element Fault	0.1778	400	7	BF07
	0.3556	400	8	BF14
	0.5334	400	9	BF21
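
To make the N-way K-shot task construction concrete, the following sketch draws one episode from the label set in Table 2 (10 classes, 400 samples each). It is only an illustration under stated assumptions: the number of query samples per class, the uniform sampling, and the function name are hypothetical, and samples are represented by their indices rather than by real vibration segments.

```python
import random

# Class count and samples per class, taken from Table 2.
NUM_CLASSES = 10
SAMPLES_PER_CLASS = 400


def sample_episode(n_way, k_shot, n_query, rng):
    """Draw one N-way K-shot task.

    Returns (support, query), each a list of (label, sample_index) pairs.
    Support and query indices for a class never overlap.
    """
    classes = rng.sample(range(NUM_CLASSES), n_way)
    support, query = [], []
    for label in classes:
        idx = rng.sample(range(SAMPLES_PER_CLASS), k_shot + n_query)
        support += [(label, i) for i in idx[:k_shot]]
        query += [(label, i) for i in idx[k_shot:]]
    return support, query


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, for repeatability only
    # Three-way one-shot task; 5 query samples per class is an assumption.
    support, query = sample_episode(n_way=3, k_shot=1, n_query=5, rng=rng)
    print("support:", support)
    print("query size:", len(query))
```

Repeating this draw many times yields the stream of small diagnostic tasks on which a meta-learner is trained and evaluated.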

Table 3. Diagnostic accuracy of Experiment 1.

Training Sample Size	4	4	8	8	12	12
Method	1-Shot	5-Shot	1-Shot	5-Shot	1-Shot	5-Shot
SN	52.53% ± 1.29%	/	60.24% ± 1.30%	65.00% ± 1.35%	67.10% ± 1.12%	71.84% ± 1.17%
RN	77.20% ± 1.36%	/	82.64% ± 1.22%	87.54% ± 0.63%	90.56% ± 1.14%	95.54% ± 0.77%
MAML	77.73% ± 1.59%	/	85.03% ± 1.11%	90.25% ± 1.04%	93.44% ± 1.23%	97.94% ± 0.81%
TFCFPN	80.09% ± 1.51%	/	85.85% ± 1.34%	88.57% ± 0.92%	91.17% ± 1.25%	96.57% ± 0.79%
DCFFML	79.46% ± 1.12%	/	84.29% ± 1.29%	89.03% ± 1.12%	92.65% ± 1.30%	97.44% ± 0.75%
Ours	82.65% ± 0.82%	/	90.82% ± 1.30%	96.61% ± 0.88%	98.02% ± 0.84%	99.09% ± 0.52%

Table 4. Bearing fault types selected from the PU dataset.

Name	Fault Location	Processing Method	Fault Severity
KA01	Outer Ring Fault	Electrical Discharge Machining (EDM)	1
KA03	Outer Ring Fault	Electro-etching	2
KA05	Outer Ring Fault	Electro-etching	1
KA07	Outer Ring Fault	Drilling	1
KA08	Outer Ring Fault	Drilling	2
KI01	Inner Ring Fault	Electrical Discharge Machining (EDM)	1
KI03	Inner Ring Fault	Electro-etching	1
KI05	Inner Ring Fault	Electro-etching	2
K001	Healthy	/	/
KA04	Outer Ring Fault	Fatigue: Pitting	1
KB23	Outer Ring Fault + Inner Ring Fault	Fatigue: Pitting	2
KB27	Outer Ring Fault + Inner Ring Fault	Plastic Deformation: Indentation	1
KI04	Inner Ring Fault	Fatigue: Pitting	1

Table 5. Diagnostic accuracy of Experiment 4.

Method	1-Shot	5-Shot	10-Shot
DTN	78.10% ± 1.33%	84.01% ± 1.07%	89.98% ± 1.16%
FTN	87.16% ± 1.20%	92.25% ± 1.08%	95.46% ± 0.76%
MAML	89.16% ± 1.43%	94.51% ± 0.83%	96.48% ± 0.89%
Ours	92.89% ± 1.32%	98.04% ± 0.63%	99.60% ± 0.13%

Table 6. Diagnostic accuracy for different values of the balancing parameter β.

β value	0.01	0.05	0.1	0.15	0.2	0.3	0.5	0.8	1
Accuracy	87.43%	91.97%	96.33%	94.45%	93.55%	91.48%	88.93%	86.21%	84.45%

Table 7. Diagnostic accuracy of different feature fusion configurations.

Method	Accuracy	Training Time per Epoch
Single-layer (L only)	89.41%	18.2 s
Two-layer (L−1 + L)	96.33%	22.5 s
Three-layer (L−2 + L−1 + L)	94.18%	28.7 s
Four-layer (L−3 + L−2 + L−1 + L)	92.87%	36.4 s
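
The trade-off in Table 7 can be read off with a little arithmetic. The sketch below copies the table values and expresses each configuration as an accuracy gain (in percentage points) and a training-time overhead (in percent) relative to the single-layer baseline. The choice of baseline and the helper name are illustrative.

```python
# Values copied from Table 7: (configuration, accuracy in %, seconds per epoch).
FUSION_RESULTS = [
    ("Single-layer (L only)", 89.41, 18.2),
    ("Two-layer (L-1 + L)", 96.33, 22.5),
    ("Three-layer (L-2 + L-1 + L)", 94.18, 28.7),
    ("Four-layer (L-3 + L-2 + L-1 + L)", 92.87, 36.4),
]


def compare_to_baseline(results, baseline_index=0):
    """Accuracy gain (points) and time overhead (%) relative to one row."""
    _, base_acc, base_time = results[baseline_index]
    rows = []
    for name, acc, time_s in results:
        rows.append({
            "name": name,
            "accuracy_gain_pts": round(acc - base_acc, 2),
            "time_overhead_pct": round(100.0 * (time_s / base_time - 1.0), 1),
        })
    return rows


if __name__ == "__main__":
    for row in compare_to_baseline(FUSION_RESULTS):
        print(f"{row['name']:34s} "
              f"{row['accuracy_gain_pts']:+.2f} pts, "
              f"{row['time_overhead_pct']:+.1f}% time")
    best = max(FUSION_RESULTS, key=lambda r: r[1])
    print("highest accuracy:", best[0])
```

On these numbers, the two-layer configuration gains 6.92 points for about 23.6% more training time per epoch, whereas the four-layer configuration doubles the training time for a gain of 3.46 points.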

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

