Article

ProtoFed: Prototype-Enhanced Federated Meta-Learning for Few-Shot Rolling Bearing Fault Diagnosis

1
Southampton International College (Wrexham College), Dalian Polytechnic University, Dalian 116034, China
2
School of Mechanical Engineering and Automation, Dalian Polytechnic University, Dalian 116034, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(11), 5277; https://doi.org/10.3390/app16115277
Submission received: 28 April 2026 / Revised: 17 May 2026 / Accepted: 18 May 2026 / Published: 25 May 2026
(This article belongs to the Special Issue AI-Based Machinery Health Monitoring)

Abstract

Rolling bearing fault diagnosis is essential for ensuring the safety and reliability of rotating machinery. Although deep learning-based methods have achieved promising performance, they usually require sufficient labeled data, which is difficult to obtain in practical industrial scenarios where fault samples are scarce and data sharing across sites is restricted by privacy and confidentiality constraints. Federated learning enables collaborative model training without transmitting raw data, but existing federated fault diagnosis methods often degrade under few-shot conditions. Moreover, current federated meta-learning approaches mainly focus on model-level adaptation and lack explicit class-level representation alignment, leading to prototype drift across heterogeneous operating conditions. To address these challenges, this paper proposes ProtoFed, a prototype-enhanced federated meta-learning framework for few-shot rolling bearing fault diagnosis. ProtoFed converts raw vibration signals into time–frequency representations using continuous wavelet transform and performs local episodic learning with prototypical networks. A Global Prototype Calibration mechanism aggregates local class prototypes into stable global prototypes with exponential moving average smoothing, while a Prototype-Distance Aware Aggregation strategy adaptively adjusts client aggregation weights according to local–global prototype divergence. Experiments on the CWRU and Paderborn University bearing datasets under non-IID 5-shot and 10-shot settings show that ProtoFed consistently outperforms standard federated learning, prototype-based federated learning, and federated meta-learning baselines. Under the 5-shot setting, ProtoFed achieves 95.63% and 91.35% accuracy on CWRU and PU, respectively, approaching centralized few-shot upper-bound performance while preserving the federated training paradigm.

1. Introduction

Rolling bearings are among the most critical components in rotating machinery, and their operational health directly determines the safety and reliability of industrial systems such as wind turbines, aero-engines, and manufacturing spindles [1]. Unexpected bearing failures can result in costly unplanned downtime, equipment damage, and even catastrophic safety incidents. Over the past decade, data-driven fault diagnosis methods, particularly those based on deep learning, have demonstrated remarkable capability in automatically extracting discriminative features from raw vibration signals, substantially outperforming traditional approaches that rely on hand-crafted time- or frequency-domain features [2].
However, the success of deep learning is predicated on the availability of large volumes of labeled training data. In practice, acquiring sufficient fault samples is often infeasible: newly deployed equipment may operate normally for extended periods before faults manifest, rare fault modes occur infrequently, and expert annotation is both expensive and time-consuming [3]. Few-shot learning addresses this data scarcity by enabling rapid adaptation to novel fault categories with only a handful of labeled examples [4,5]. Nevertheless, existing few-shot fault diagnosis methods assume centralized data access, requiring all training samples to be aggregated at a single site—an assumption that conflicts with the privacy regulations, intellectual property concerns, and data governance policies prevalent in modern industrial environments [6]. Federated learning (FL) offers a natural remedy by enabling collaborative model training across distributed clients without sharing raw data [7,8].
While FL has been actively explored for fault diagnosis [9,10,11], existing methods uniformly assume that each client possesses sufficient labeled samples. When only a few labeled examples per fault category are available—as is common for newly commissioned equipment or rare failure modes—standard FL approaches degrade sharply. Recent attempts to integrate meta-learning into federated fault diagnosis, such as FedMeta-FFD [12] and REFML [13], have achieved promising results but share a critical limitation: the absence of explicit cross-client prototype alignment. Under heterogeneous operating conditions, the feature distributions of the same fault category can differ substantially across clients, causing the globally aggregated model to suffer from prototype drift and degraded generalization [14].
We observe that class prototypes—the mean feature representations of each category—offer a natural bridge between local few-shot learning and global federated collaboration. Prototypes are compact, aligned with the metric-based classification paradigm, and expose only class-level aggregated statistics rather than raw data. However, existing prototype-based FL methods such as FedProto [15] and FedSA [16] are designed for data-rich supervised settings and do not address the amplified prototype drift that arises when each client holds only a handful of labeled samples under distinct operating conditions. To this end, we propose ProtoFed, a prototype-enhanced federated meta-learning framework specifically designed for few-shot rolling bearing fault diagnosis.
Before presenting our specific contributions, we review the closely related literature on federated fault diagnosis, few-shot fault diagnosis, and prototype-based federated learning to contextualize the technical gaps that motivate ProtoFed.

1.1. Federated Learning for Fault Diagnosis

Federated learning has attracted considerable attention in the fault diagnosis community as it naturally addresses data isolation and privacy constraints in industrial environments [9]. Zhang et al. [10] proposed a federated framework incorporating self-supervised learning and dynamic validation to improve diagnostic performance across heterogeneous clients, while FedCAE [17] adopts a two-stage paradigm in which edge clients extract features via federated convolutional autoencoders before a centralized classifier is trained on the aggregated representations.
A major challenge in federated fault diagnosis is the non-IID nature of data across clients, arising from varying operating conditions, sensor configurations, and equipment types [18]. To mitigate this heterogeneity, Zhang et al. [19] proposed similarity-based collaborative aggregation, Zhang and Li [20] introduced adversarial alignment into federated transfer learning, and Zhao and Shen [11] developed a federated domain generalization approach that achieves cross-condition diagnosis without access to target domain data. More recently, FedRad [21] introduced dynamic aggregation based on Rademacher complexity to improve generalization under non-IID settings.
Despite these advances, existing FL-based fault diagnosis methods uniformly assume sufficient labeled samples on each client—an assumption frequently violated for newly deployed equipment or rare fault types. Maintaining diagnostic accuracy under such data-scarce federated scenarios remains an open challenge.

1.2. Few-Shot Fault Diagnosis

Few-shot fault diagnosis aims to achieve reliable fault identification with only a limited number of labeled samples per category—a common scenario in industrial settings where new fault modes emerge infrequently and annotation requires domain expertise. Existing approaches fall into two main paradigms. In the optimization-based camp, Zhang et al. [2] applied MAML to bearing fault diagnosis, demonstrating that a well-initialized model can rapidly adapt to new fault categories with minimal data. Subsequent work has focused on improving training stability through task-sequencing [22] and adaptive learning rates [23].
In the metric-based paradigm, models learn an embedding space where classification relies on comparing query samples against class-representative features. Wang et al. [5] developed a metric-based meta-learning model achieving competitive few-shot performance under multiple limited-data conditions. Prototypical networks [4], which classify by distance to class-mean embeddings, have become a widely adopted baseline due to their simplicity and effectiveness. Recently, Rezazadeh et al. [24] proposed a prototype-attention domain adaptation framework for explainable bearing fault diagnosis, demonstrating that prototype-based class alignment combined with attention-driven sample weighting can achieve strong cross-domain generalization and interpretability on the CWRU and Paderborn benchmarks; however, their framework operates in a centralized single-source-to-target paradigm and does not address the multi-client federated scenario. To handle cross-domain scenarios, Wu et al. [3] combined few-shot learning with transfer learning, Lei et al. [25] embedded prior domain knowledge into meta-transfer learning, and Li et al. [26] incorporated attention mechanisms for fine-grained discrimination. Li et al. [27] specifically investigated meta-learning under complex operating conditions, demonstrating improved robustness with diverse task distributions. A comprehensive survey by Liang et al. [1] summarizes recent trends including GAN-based augmentation, enhanced feature extractors, and novel classifier architectures.
Despite their success, these methods all operate under a centralized paradigm, implicitly assuming all training data can be collected at a single site—conflicting with the data-sharing restrictions of distributed industrial systems.

1.3. Federated Few-Shot Learning and Prototype-Based FL

Federated few-shot learning (FFSL) lies at the intersection of the two aforementioned research directions, aiming to learn models that generalize from limited labeled data across privacy-preserving distributed clients. Zhao et al. [28] formalized the personalized FFSL problem and proposed a framework that decouples global and local feature learning to handle non-IID few-shot tasks in cross-silo settings. Wang et al. [29] introduced a general FFSL framework at the task level, enabling federated aggregation of meta-knowledge extracted from episodic training on each client. Other approaches tackle FFSL from complementary angles: Huang et al. [30] developed a model-agnostic federated few-shot method, Yang et al. [14] proposed grouping clients by distribution similarity and performing intra-group meta-learning, and Tian et al. [31] combined personalized knowledge distillation with few-shot federated learning to improve performance on small-scale benchmarks.
In the specific domain of fault diagnosis, only a few works have explored the FFSL paradigm. FedMeta-FFD [12] trains a global meta-learner on the server side and distributes it to clients for rapid local adaptation with minimal labeled data, with an auxiliary adaptive learning rate module to enhance convergence stability. REFML [13] addresses both domain discrepancy and data scarcity by combining a shared encoder with meta-updated predictors and an adaptive interpolation mechanism that balances global generalization and local personalization. While these methods represent important progress, neither incorporates explicit cross-client prototype alignment, leaving the prototype drift problem—where the same fault category develops divergent representations under different operating conditions—unaddressed.
In parallel, prototype-based federated learning has emerged as a promising direction for handling data heterogeneity in general classification tasks. FedProto [15] pioneered the communication of class prototypes instead of model parameters, enabling heterogeneous model architectures across clients while maintaining non-IID tolerance. FedProc [32] introduced a global prototypical contrastive loss that pulls local features toward their corresponding global prototypes. More recently, FedSA [16] proposed semantic anchors as unified prototypes to break the vicious cycle of representation inconsistency and classifier bias, and FedDP [33] learned domain-invariant prototypes through an information bottleneck to align both representation and parameter spaces across clients with feature shift.
However, these prototype-based FL methods are designed for standard supervised settings with abundant labeled data, where local prototypes computed from hundreds of samples per class are statistically stable. In the few-shot regime, local prototypes derived from only a handful of samples are highly sensitive to operating condition noise, amplifying prototype drift beyond what these methods were designed to handle. ProtoFed bridges this gap by integrating prototypical metric learning with global prototype calibration and distance-aware aggregation, specifically tailored for few-shot bearing fault diagnosis under federated heterogeneity.
While the individual building blocks of ProtoFed (CWT, prototypical networks, EMA smoothing, and distance-based weighting) are well established, the principal novelty lies not in their independent use but in their integration and the mechanisms that emerge from this combination. Specifically, GPC transforms per-episode local prototypes, which are inherently noisy in the few-shot regime, into temporally smoothed global anchors that no single client could produce on its own. These calibrated anchors, in turn, enable PDAA to compute geometrically meaningful client weights in the prototype space, a signal that is far more stable than the gradient- or parameter-level similarity used by prior personalized FL methods when local datasets contain only a handful of samples. The calibration loss $\mathcal{L}_{\text{cal}}$ closes the loop by pulling local embeddings toward the global anchors during training, creating a bidirectional alignment mechanism that jointly reduces prototype drift and improves aggregation quality across rounds. This tightly coupled prototype–aggregation–regularization loop constitutes the core methodological contribution, and its effectiveness is validated by the consistently smaller centralized-to-federated performance gap observed across all experimental conditions.
The main contributions of this paper are summarized as follows:
  • We propose ProtoFed, a prototype-enhanced federated meta-learning framework for few-shot rolling bearing fault diagnosis that enables collaborative diagnosis across distributed clients without sharing raw vibration data. The framework introduces a tightly coupled prototype–aggregation–regularization loop that jointly addresses prototype drift and client heterogeneity—mechanisms absent in existing federated meta-learning methods.
  • We design a Global Prototype Calibration (GPC) mechanism that constructs stable global class prototypes through cross-client aggregation and EMA temporal smoothing, providing condition-invariant classification anchors that mitigate the amplified prototype drift inherent in few-shot federated settings.
  • We propose a Prototype-Distance Aware Aggregation (PDAA) strategy that adaptively weights client models based on local–global prototype divergence, reducing the negative influence of poorly aligned local representations and providing inherent robustness to noisy or outlier clients.
  • We conduct extensive experiments on the CWRU and Paderborn University bearing datasets under 5-shot and 10-shot non-IID settings. Comprehensive evaluations including cross-condition generalization, noise robustness, hyperparameter sensitivity, statistical significance tests, and differential privacy analysis demonstrate the effectiveness, robustness, and practical viability of ProtoFed.

2. Methodology

2.1. Problem Formulation

We consider a federated few-shot fault diagnosis scenario involving a central server and $K$ geographically distributed clients $\{C_1, C_2, \ldots, C_K\}$, where each client $C_k$ operates under distinct working conditions (e.g., different loads, rotational speeds, or environmental settings). Each client possesses a local dataset $\mathcal{D}_k = \{(x_i^k, y_i^k)\}_{i=1}^{N_k}$, where $x_i^k \in \mathbb{R}^{L}$ denotes a one-dimensional vibration signal of length $L$ and $y_i^k \in \{1, 2, \ldots, C\}$ is the corresponding fault category label. The key constraint is that each client has access to only a few labeled samples per class: specifically, each client holds at most $S$ samples per category, where $S$ is small (e.g., $S = 5$ or $S = 10$). Raw data $\mathcal{D}_k$ cannot be shared across clients or uploaded to the server due to privacy and confidentiality requirements.
Following the episodic meta-learning paradigm, the local training on each client is organized into episodes. In each episode, a $C$-way $S$-shot task $\mathcal{T}$ is constructed by sampling a support set $\mathcal{S} = \{(x_j, y_j)\}_{j=1}^{C \times S}$ and a query set $\mathcal{Q} = \{(x_j, y_j)\}_{j=1}^{C \times Q}$ from $\mathcal{D}_k$, where $Q$ denotes the number of query samples per class. The support set is used to build class representations, and the query set is used to evaluate and update the model parameters.
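The episode construction described above can be sketched as follows. This is a minimal illustration rather than the authors' sampler: the function name `sample_episode` and its arguments are hypothetical, and it assumes that a client holds at least `n_shot + n_query` labeled segments per class so that the support and query sets are disjoint, a detail the text does not specify.

```python
import random
from collections import defaultdict


def sample_episode(dataset, n_way, n_shot, n_query, rng=None):
    """Draw one C-way S-shot episode from a client's local dataset.

    dataset: list of (sample, label) pairs held by a single client.
    Returns (support, query), each a list of (sample, label) pairs.
    """
    rng = rng or random.Random()
    by_class = defaultdict(list)
    for sample, label in dataset:
        by_class[label].append(sample)

    # Assumption: support and query are disjoint, so a class needs
    # at least n_shot + n_query samples to be usable in an episode.
    eligible = [c for c, xs in by_class.items() if len(xs) >= n_shot + n_query]
    if len(eligible) < n_way:
        raise ValueError("not enough classes with n_shot + n_query samples")

    support, query = [], []
    for c in rng.sample(sorted(eligible), n_way):
        picked = rng.sample(by_class[c], n_shot + n_query)
        support += [(x, c) for x in picked[:n_shot]]
        query += [(x, c) for x in picked[n_shot:]]
    return support, query
```

With four classes, `n_shot=5`, and `n_query=3`, each call returns 20 support pairs and 12 query pairs.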
The objective of ProtoFed is to collaboratively learn, across all $K$ clients, a shared feature extractor $f_\theta: \mathbb{R}^{1 \times H \times W} \to \mathbb{R}^{d}$ that maps time–frequency representations of vibration signals into a $d$-dimensional embedding space, such that the resulting model generalizes well to few-shot fault classification tasks under previously unseen operating conditions, without requiring any client to share its raw data.

2.2. Framework Overview

The overall architecture of ProtoFed is illustrated in Figure 1. The framework operates through iterative communication rounds between the server and K clients. In each round, clients first transform raw vibration signals into time–frequency representations via CWT, then perform episodic meta-training with prototypical networks to learn class prototypes and update the local model. Each client uploads its round-level local prototypes and model parameters to the server, where GPC aggregates them into unified global prototypes and PDAA computes adaptive aggregation weights for the global model update. The updated global model and global prototypes are then broadcast back to all clients. Throughout this process, only model parameters and class-level prototypes are communicated; no raw data or individual sample features are transmitted.

2.3. Time–Frequency Feature Extraction via CWT

Raw one-dimensional vibration signals contain rich fault-related information but suffer from noise contamination and lack explicit time–frequency structure. To obtain a more informative input representation, we employ continuous wavelet transform (CWT) to convert each raw signal $x \in \mathbb{R}^{L}$ into a two-dimensional time–frequency map. The CWT of a signal $x(t)$ with respect to a mother wavelet $\psi(t)$ is defined as:
$$W(a, b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{+\infty} x(t)\, \psi^{*}\!\left(\frac{t - b}{a}\right) dt, \tag{1}$$
where $a > 0$ is the scale parameter controlling frequency resolution, $b$ is the translation parameter controlling temporal localization, and $\psi^{*}(\cdot)$ denotes the complex conjugate of the mother wavelet. In this work, we adopt the Morlet wavelet as the mother wavelet due to its good balance between time and frequency resolution:
$$\psi(t) = \frac{1}{\sqrt[4]{\pi}}\, e^{j \omega_0 t}\, e^{-t^{2}/2}, \tag{2}$$
where $\omega_0$ is the central frequency, typically set to 6. In our implementation, the complete signal-to-scalogram pipeline proceeds as follows. Each raw vibration signal (at 12 kHz for both datasets after downsampling PU) is first segmented into non-overlapping windows of length $L = 1024$ samples. For each segment, the CWT is computed using the Morlet wavelet ($\omega_0 = 6$) over 64 logarithmically spaced scales in the range $a \in [2, 170]$.
The scale bounds are determined by two physical constraints: the upper bound ensures that the effective wavelet support ($\approx 6a = 1020$ samples at $a = 170$) remains within the segment length, while the lower bound avoids pseudo-frequencies above the Nyquist limit. By the Morlet pseudo-frequency relation $f = f_c \cdot f_s / a$ (where $f_c = \omega_0 / 2\pi \approx 0.955$), this scale range corresponds to an approximate frequency band of $[67, 5730]$ Hz at $f_s = 12$ kHz, covering the characteristic bearing fault frequencies (ball pass frequencies, cage frequencies) for the rotational speeds in both datasets.
The absolute values of the resulting complex-valued CWT coefficient matrix are computed to obtain the scalogram magnitude. The scalogram is then resized to a fixed spatial resolution of $H \times W = 64 \times 64$ through bilinear interpolation and normalized to the range $[0, 1]$ via min-max normalization per sample. The resulting time–frequency representation $I \in \mathbb{R}^{1 \times H \times W}$ is treated as a single-channel grayscale image that serves as input to the feature extractor. This transformation preserves both temporal dynamics and frequency characteristics of bearing vibration signals, enabling the convolutional network to exploit two-dimensional spatial patterns associated with different fault types. The complete preprocessing pipeline is illustrated in Figure 2, and the preprocessing scripts will be publicly released upon acceptance to ensure full reproducibility.
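The signal-to-scalogram pipeline can be approximated with a short NumPy sketch. It is an illustration under stated assumptions, not the authors' released preprocessing code: the function names are hypothetical, the Morlet wavelet is truncated to a support of about $6a$ samples, the CWT integral is replaced by a discrete sum, and the bilinear resize is replaced by linear interpolation along the time axis only (the scale axis already has 64 rows).

```python
import numpy as np


def morlet_scalogram(segment, n_scales=64, a_min=2.0, a_max=170.0, omega0=6.0):
    """Magnitude scalogram of one segment with a truncated Morlet wavelet."""
    x = np.asarray(segment, dtype=float)
    scales = np.geomspace(a_min, a_max, n_scales)   # logarithmically spaced
    rows = []
    for a in scales:
        half = int(np.ceil(3.0 * a))                # support of about 6a samples
        u = np.arange(-half, half + 1) / a          # (t - b) / a on the sample grid
        psi = np.pi ** -0.25 * np.exp(1j * omega0 * u) * np.exp(-u ** 2 / 2)
        # W(a, b) = a^(-1/2) * sum_t x(t) * conj(psi((t - b) / a)).
        # This is a correlation, so the kernel is reversed for np.convolve.
        kernel = np.conj(psi)[::-1] / np.sqrt(a)
        rows.append(np.abs(np.convolve(x, kernel, mode="same")))
    return scales, np.stack(rows)


def pseudo_frequencies(scales, fs=12_000.0, omega0=6.0):
    """Morlet pseudo-frequency f = f_c * f_s / a with f_c = omega0 / (2 * pi)."""
    return omega0 / (2 * np.pi) * fs / np.asarray(scales)


def to_input_image(magnitude, width=64):
    """Resample the time axis to `width` columns and min-max normalize to [0, 1].

    Assumption: linear interpolation along time stands in for bilinear resizing.
    """
    n = magnitude.shape[1]
    cols = np.linspace(0, n - 1, width)
    resized = np.stack([np.interp(cols, np.arange(n), row) for row in magnitude])
    lo, hi = resized.min(), resized.max()
    return (resized - lo) / (hi - lo + 1e-12)
```

For a 1024-sample segment, `morlet_scalogram` returns a 64 × 1024 magnitude matrix and `to_input_image` reduces it to 64 × 64. `pseudo_frequencies` reproduces the band quoted above, from roughly 5730 Hz at $a = 2$ down to roughly 67 Hz at $a = 170$.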

2.4. Prototypical Network for Local Few-Shot Classification

On each client $C_k$, the time–frequency representations are fed into a convolutional feature extractor $f_\theta$ that maps each input image $I$ to a $d$-dimensional embedding vector $z = f_\theta(I) \in \mathbb{R}^{d}$. Given an episodic task with support set $\mathcal{S}$, the prototype for class $c$ is computed as the mean embedding of all support samples belonging to that class:
$$p_c^{k,e} = \frac{1}{|\mathcal{S}_c|} \sum_{(I_j, y_j) \in \mathcal{S}_c} f_\theta(I_j), \tag{3}$$
where $\mathcal{S}_c = \{(I_j, y_j) \in \mathcal{S} \mid y_j = c\}$ denotes the subset of support samples from class $c$, and the superscripts $k$ and $e$ indicate that this prototype is computed locally on client $C_k$ at the $e$-th episode. Within a single client’s local training, the client index $k$ is fixed; we retain it for notational consistency with the subsequent federated aggregation formulations. For a query sample $I_q$ with embedding $z_q = f_\theta(I_q)$, the classification probability over all $C$ classes is obtained via a softmax over the negative squared Euclidean distances:
$$p(y = c \mid I_q) = \frac{\exp\left(-\lVert z_q - p_c^{k,e} \rVert^{2}\right)}{\sum_{c'=1}^{C} \exp\left(-\lVert z_q - p_{c'}^{k,e} \rVert^{2}\right)}. \tag{4}$$
The local prototypical classification loss on client $C_k$ is then defined as the negative log-probability of the correct class labels over all query samples:
$$\mathcal{L}_{\text{proto}}^{k} = -\frac{1}{|\mathcal{Q}|} \sum_{(I_q, y_q) \in \mathcal{Q}} \log p(y = y_q \mid I_q). \tag{5}$$
This metric-based classification paradigm is particularly well-suited for few-shot settings, as it avoids training a parametric classifier head and instead relies on the geometric structure of the embedding space.
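The three quantities above (class-mean prototypes, the distance-based softmax, and the prototypical loss) can be written in a few lines of NumPy. The sketch below operates on precomputed embeddings and leaves out the feature extractor and gradient computation. The function names are hypothetical, and labels are assumed to be integers in $0, \ldots, C-1$ rather than $1, \ldots, C$.

```python
import numpy as np


def class_prototypes(support_emb, support_labels, n_classes):
    """Mean support embedding per class, shape (C, d)."""
    return np.stack([support_emb[support_labels == c].mean(axis=0)
                     for c in range(n_classes)])


def proto_log_probs(query_emb, prototypes):
    """Log-softmax over negative squared Euclidean distances, shape (n_query, C)."""
    diff = query_emb[:, None, :] - prototypes[None, :, :]
    logits = -np.sum(diff ** 2, axis=-1)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    return logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))


def proto_loss(query_emb, query_labels, prototypes):
    """Mean negative log-probability of the true class over the query set."""
    logp = proto_log_probs(query_emb, prototypes)
    return -logp[np.arange(len(query_labels)), query_labels].mean()
```

Because the classifier is defined entirely by distances to the prototypes, the only trainable parameters are those of the embedding network that produces `support_emb` and `query_emb`.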

2.5. Global Prototype Calibration (GPC)

When clients operate under heterogeneous conditions, the local prototypes for the same fault category can occupy different regions of the embedding space across clients—a phenomenon we refer to as prototype drift. As illustrated in Figure 3a, local prototypes of the same class diverge across clients due to differences in load, speed, and environmental factors, resulting in inconsistent class representations. If left unaddressed, this drift causes the globally aggregated model to learn an ambiguous decision boundary, degrading its generalization to unseen conditions. The Global Prototype Calibration mechanism is designed to mitigate this problem by producing unified global prototypes on the server side, as depicted in Figure 3b.

2.5.1. Round-Level Local Prototype

Because a single episode uses only $S$ support samples per class, the per-episode prototype $p_c^{k,e}$ is statistically noisy. To obtain a more reliable estimate of client $C_k$’s class representation at round $t$, we aggregate across all $E$ local episodes:
$$p_c^{k,(t)} = \frac{1}{E} \sum_{e=1}^{E} p_c^{k,e,(t)}, \tag{6}$$
where $p_c^{k,e,(t)}$ denotes the per-episode prototype of class $c$ computed in the $e$-th local episode of round $t$. It is $p_c^{k,(t)}$ that each client uploads to the server.

2.5.2. Server-Side Aggregation

At the end of each communication round $t$, every client $C_k$ uploads its set of round-level local prototypes $\{p_c^{k,(t)}\}_{c=1}^{C}$ to the server. For each class $c$, the server computes a global prototype $g_c^{(t)}$ as a weighted average of the corresponding local prototypes:
$$g_c^{(t)} = \sum_{k=1}^{K} \alpha_c^{k}\, p_c^{k,(t)}, \quad \text{where } \alpha_c^{k} = \frac{N_c^{k}}{\sum_{k'=1}^{K} N_c^{k'}}, \tag{7}$$
where $N_c^{k}$ denotes the number of samples of class $c$ on client $C_k$, and $\alpha_c^{k}$ is the corresponding weight. In the balanced few-shot setting where all clients have the same number of samples per class, this reduces to a simple arithmetic mean.

2.5.3. Temporal Smoothing

To promote temporal stability and prevent abrupt shifts in the global prototypes across communication rounds, we apply an exponential moving average (EMA) update:
$$\tilde{g}_c^{(t)} = \beta\, \tilde{g}_c^{(t-1)} + (1 - \beta)\, g_c^{(t)}, \tag{8}$$
where $\beta \in [0, 1)$ is the momentum coefficient. At the very first round ($t = 1$), no prior global prototypes exist; we therefore initialize $\tilde{g}_c^{(0)} = g_c^{(1)}$ (i.e., the first-round aggregated prototype is used directly without smoothing) and set $\lambda = 0$ in the local loss during round 1 so that the calibration regularization is inactive before meaningful global prototypes become available. From round $t = 2$ onward, the EMA update and the full training objective with $\lambda > 0$ are applied. The smoothed global prototypes $\{\tilde{g}_c^{(t)}\}_{c=1}^{C}$ are then broadcast to all clients, serving as stable classification anchors for local training in the subsequent round.
By aggregating prototypes from clients with diverse operating conditions, GPC captures a condition-invariant class representation more robust than any single client’s local prototypes. Since only class-level mean vectors are communicated, the additional overhead is C × d floating-point values per client per round—negligible relative to the model parameter count.
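The server-side part of GPC amounts to a count-weighted mean followed by an EMA step, which can be sketched as below. The function names and array layouts are hypothetical, and the sketch assumes that every class is present on at least one client so that the normalizing counts are nonzero.

```python
import numpy as np


def aggregate_prototypes(local_protos, class_counts):
    """Count-weighted mean of round-level local prototypes.

    local_protos: array of shape (K, C, d).
    class_counts: array of shape (K, C), samples of class c on client k.
    Returns global prototypes of shape (C, d).
    """
    counts = np.asarray(class_counts, dtype=float)
    alpha = counts / counts.sum(axis=0, keepdims=True)   # each column sums to 1
    return np.einsum("kc,kcd->cd", alpha, np.asarray(local_protos, dtype=float))


def ema_update(prev_smoothed, aggregated, beta):
    """EMA smoothing; in round 1 (no previous value) the aggregate is used as is."""
    if prev_smoothed is None:
        return aggregated.copy()
    return beta * prev_smoothed + (1.0 - beta) * aggregated
```

Passing `None` as `prev_smoothed` mirrors the first-round initialization described above; a larger `beta` makes the anchors change more slowly between rounds.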

2.6. Prototype-Distance Aware Aggregation (PDAA)

Standard federated aggregation (e.g., FedAvg) assigns client weights in proportion to local dataset sizes, irrespective of representation quality. Under few-shot conditions, where these sizes are nearly identical, clients operating in extreme or atypical regimes may develop poorly aligned feature spaces, and giving them the same influence as well-aligned clients degrades the global model. PDAA addresses this by dynamically computing aggregation weights based on the alignment between each client’s local prototypes and the global prototypes. For client $C_k$ at round $t$, the average prototype distance is:
$$D_k^{(t)} = \frac{1}{C} \sum_{c=1}^{C} \left\lVert p_c^{k,(t)} - \tilde{g}_c^{(t)} \right\rVert_{2}. \tag{9}$$
A smaller $D_k^{(t)}$ indicates that client $C_k$’s local representations are well-aligned with the global prototypes, suggesting that its local model has learned more generalizable features. The aggregation weight for client $C_k$ is then computed via a softmax with a temperature parameter $\tau > 0$:
$$w_k^{(t)} = \frac{\exp\left(-D_k^{(t)} / \tau\right)}{\sum_{k'=1}^{K} \exp\left(-D_{k'}^{(t)} / \tau\right)}. \tag{10}$$
The global model parameters are then updated as:
$$\theta^{(t+1)} = \sum_{k=1}^{K} w_k^{(t)}\, \theta_k^{(t)}, \tag{11}$$
where $\theta_k^{(t)}$ denotes the local model parameters of client $C_k$ after round $t$. The temperature parameter $\tau$ controls the sharpness of the weight distribution.
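A minimal NumPy sketch of the PDAA computation follows. The function names are hypothetical, the prototype distance is taken to be the non-squared Euclidean norm (an assumption), and model parameters are represented as a dictionary of arrays purely for illustration.

```python
import numpy as np


def prototype_distances(local_protos, global_protos):
    """Mean Euclidean distance per client: (K, C, d) vs (C, d) -> (K,)."""
    diff = np.asarray(local_protos) - np.asarray(global_protos)[None]
    return np.linalg.norm(diff, axis=-1).mean(axis=1)


def pdaa_weights(distances, tau):
    """Softmax over -D_k / tau; closer clients receive larger weights."""
    logits = -np.asarray(distances, dtype=float) / tau
    logits -= logits.max()                 # numerical stability
    w = np.exp(logits)
    return w / w.sum()


def aggregate_models(client_params, weights):
    """Weighted average of per-client parameter dicts (name -> array)."""
    return {name: sum(w * p[name] for w, p in zip(weights, client_params))
            for name in client_params[0]}
```

As $\tau$ grows, the weights approach the uniform value $1/K$; as $\tau \to 0$, nearly all weight goes to the client with the smallest prototype distance.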

Relation to Existing Similarity-Based Aggregation

PDAA belongs to the broader family of similarity-driven personalized FL aggregation schemes, which includes FedAMP [34], pFedMe [35], and the similarity-based collaboration framework of Zhang et al. [19]. The key distinctions of PDAA are as follows. First, prior methods compute client similarity from model parameters or gradient directions, which are high-dimensional and noisy under few-shot training; PDAA instead uses the class-level prototype distance $D_k^{(t)}$, a low-dimensional quantity measured directly in the embedding space that ProtoFed seeks to align. Second, PDAA is computed against a calibrated global anchor produced by GPC rather than against pairwise client comparisons, which reduces the $O(K^{2})$ pairwise similarity computation to $O(K)$ and makes the weighting self-consistent with the calibration target. These differences make PDAA particularly well-suited to few-shot settings where parameter-level similarity signals are unstable.

2.7. Episodic Meta-Training with Calibration Regularization

At the beginning of each communication round, client $C_k$ downloads the global model parameters $\theta^{(t)}$ and global prototypes $\{\tilde{g}_c^{(t-1)}\}_{c=1}^{C}$ from the server, then performs $E$ episodic meta-training steps. In each episode, a $C$-way $S$-shot task is sampled from $\mathcal{D}_k$ and the prototypical classification loss $\mathcal{L}_{\text{proto}}^{k}$ is computed via Equation (5). To regularize local training toward the global consensus, we introduce a calibration loss penalizing the discrepancy between per-episode local prototypes and global prototypes:
$$\mathcal{L}_{\text{cal}}^{k} = \frac{1}{C} \sum_{c=1}^{C} \left\lVert p_c^{k,e} - \tilde{g}_c^{(t-1)} \right\rVert_{2}. \tag{12}$$
The total loss for each episode on client $C_k$ is:
$$\mathcal{L}_{\text{total}}^{k} = \mathcal{L}_{\text{proto}}^{k} + \lambda\, \mathcal{L}_{\text{cal}}^{k}, \tag{13}$$
where $\lambda \geq 0$ is a hyperparameter that controls the strength of the calibration regularization. When $\lambda = 0$, the training reduces to standard local prototypical network training without global guidance. As $\lambda$ increases, the local prototypes are pulled more strongly toward the global prototypes, promoting cross-client consistency at the potential cost of local adaptability.
This regularization mechanism creates a bidirectional alignment loop within ProtoFed: the global prototypes guide local training through the calibration loss (top-down), while the locally updated prototypes contribute to refining the global prototypes in the next round through GPC (bottom-up). This iterative process progressively reduces prototype drift across clients and converges toward a shared, condition-invariant embedding space.
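The per-episode objective can be sketched as follows. The function names are hypothetical, the calibration term is taken to be the non-squared Euclidean norm (an assumption; a squared norm would be a one-line change), and the prototypical loss is passed in as a precomputed scalar.

```python
import numpy as np


def calibration_loss(local_protos, global_protos):
    """Mean Euclidean distance between per-episode and global prototypes, (C, d) each."""
    return np.linalg.norm(local_protos - global_protos, axis=-1).mean()


def total_loss(proto_loss, local_protos, global_protos, lam, round_idx):
    """L_proto + lam * L_cal, with calibration inactive in round 1."""
    if round_idx == 1 or global_protos is None:
        return proto_loss
    return proto_loss + lam * calibration_loss(local_protos, global_protos)
```

In an actual training loop, both terms would be computed inside an automatic-differentiation framework so that the gradient of the calibration term flows back through the prototypes into the feature extractor.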

2.8. Overall Algorithm

The complete training procedure of ProtoFed is summarized in Algorithm 1.
The computational complexity of ProtoFed on each client is dominated by the episodic meta-training, which is comparable to standard prototypical network training. The additional overhead introduced by GPC and PDAA is minimal: GPC requires aggregating $K \times C$ prototype vectors of dimension $d$, and PDAA computes $K$ scalar distances followed by a softmax normalization. The communication cost per round is $|\theta| + C \times d$ per client, where $|\theta|$ is the model parameter count and $C \times d$ accounts for the uploaded prototypes. Since $C \times d \ll |\theta|$ in practice, the prototype communication overhead is negligible relative to the standard model parameter exchange in FL.
Algorithm 1 ProtoFed: prototype-enhanced federated meta-learning
Require: $K$ clients with local datasets $\{D_k\}_{k=1}^{K}$; rounds $T$; episodes $E$; $C$-way $S$-shot; momentum $\beta$; weight $\lambda$; temperature $\tau$
Ensure: Global model parameters $\theta^{(T)}$
 1: Initialize global model $\theta^{(0)}$; the global prototypes $\{\tilde{g}_c^{(0)}\}_{c=1}^{C}$ are determined after round 1
 2: for $t = 1, 2, \ldots, T$ do
 3:     Server broadcasts $\theta^{(t-1)}$ and $\{\tilde{g}_c^{(t-1)}\}_{c=1}^{C}$ to all clients
 4:     for each client $C_k$ in parallel do
 5:         $\theta_k \leftarrow \theta^{(t-1)}$
 6:         Initialize buffer $\{\bar{p}_c^{k} \leftarrow 0\}_{c=1}^{C}$
 7:         for $e = 1, 2, \ldots, E$ do
 8:             Sample a $C$-way $S$-shot task $(\mathcal{S}, \mathcal{Q})$ from $D_k$
 9:             Apply CWT to obtain time–frequency representations
10:             Compute per-episode prototypes $\{p_c^{k,e}\}_{c=1}^{C}$ via Equation (3)
11:             $\bar{p}_c^{k} \leftarrow \bar{p}_c^{k} + p_c^{k,e} / E$   ▷ accumulate
12:             Compute $\mathcal{L}_{\mathrm{proto}}^{k}$ via Equation (5) and $\mathcal{L}_{\mathrm{cal}}^{k}$ via Equation (12)
13:             $\mathcal{L}_{\mathrm{total}}^{k} \leftarrow \mathcal{L}_{\mathrm{proto}}^{k} + \lambda \mathcal{L}_{\mathrm{cal}}^{k}$   ▷ $\lambda = 0$ when $t = 1$
14:             Update $\theta_k$ by SGD on $\mathcal{L}_{\mathrm{total}}^{k}$
15:         end for
16:         Set $p_c^{k,(t)} \leftarrow \bar{p}_c^{k}$ for each class $c$   ▷ Equation (6)
17:         Upload $\theta_k^{(t)}$ and $\{p_c^{k,(t)}\}_{c=1}^{C}$ to the server
18:     end for
19:     Server: Global Prototype Calibration (GPC)
20:     Compute $g_c^{(t)}$ for each class $c$ via Equation (7)
21:     Update $\tilde{g}_c^{(t)} \leftarrow \beta \tilde{g}_c^{(t-1)} + (1 - \beta) g_c^{(t)}$ via Equation (8)
22:     Server: Prototype-Distance Aware Aggregation (PDAA)
23:     Compute $D_k^{(t)}$ for each client via Equation (9)
24:     Compute $w_k^{(t)}$ for each client via Equation (10)
25:     $\theta^{(t)} \leftarrow \sum_{k=1}^{K} w_k^{(t)} \theta_k^{(t)}$   ▷ Equation (11)
26: end for
27: return $\theta^{(T)}$, $\{\tilde{g}_c^{(T)}\}_{c=1}^{C}$
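The server-side steps of Algorithm 1 (lines 19 to 25) can be sketched in plain Python. Equations (7), (9), and (10) are not reproduced in this section, so the sketch assumes a uniform mean for the fresh global prototypes, a mean Euclidean local-global distance for the drift measure, and a softmax over negative distances divided by τ for the weights. The function names, the toy inputs, and the round-1 initialization (using the fresh prototypes directly) are likewise illustrative assumptions rather than the authors' implementation.

```python
import math

def mean_vectors(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(vals) / n for vals in zip(*vectors)]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def server_round(client_params, client_protos, prev_global, beta=0.9, tau=1.0):
    """One server step under the stated assumptions.

    client_params: list of K flat parameter vectors
    client_protos: list of K lists, each holding C prototype vectors
    prev_global:   list of C smoothed prototypes, or None in round 1
    """
    num_classes = len(client_protos[0])
    # GPC: average local prototypes per class (assumed form of Eq. 7).
    fresh = [mean_vectors([protos[c] for protos in client_protos])
             for c in range(num_classes)]
    # EMA smoothing (Eq. 8); in round 1 there is nothing to smooth against,
    # so the fresh prototypes are used directly (assumption).
    if prev_global is None:
        smoothed = fresh
    else:
        smoothed = [[beta * g_old + (1 - beta) * g_new
                     for g_old, g_new in zip(old, new)]
                    for old, new in zip(prev_global, fresh)]
    # PDAA: mean local-global prototype distance per client (assumed Eq. 9).
    dists = [sum(euclidean(p, g) for p, g in zip(protos, smoothed)) / num_classes
             for protos in client_protos]
    # Softmax over negative distances with temperature tau (assumed Eq. 10).
    logits = [-dist / tau for dist in dists]
    top = max(logits)
    exps = [math.exp(value - top) for value in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted parameter aggregation (Eq. 11).
    new_params = [sum(w * p[i] for w, p in zip(weights, client_params))
                  for i in range(len(client_params[0]))]
    return new_params, smoothed, weights, dists

# Toy example: three clients, one class, 2-D embeddings; client 3 has drifted.
params = [[1.0, 1.0], [1.0, 1.0], [5.0, 5.0]]
protos = [[[0.0, 0.0]], [[0.2, 0.0]], [[4.0, 0.0]]]
new_params, g, w, d = server_round(params, protos, prev_global=None)
_, g2, _, _ = server_round(params, protos, prev_global=g)
```

In this toy run the drifted third client receives the smallest aggregation weight, which is the qualitative behavior the text attributes to PDAA.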

3. Experiments

3.1. Experimental Setup

Datasets. We evaluate ProtoFed on two widely used rolling bearing fault diagnosis benchmarks that represent distinct levels of complexity. The Case Western Reserve University (CWRU) dataset [36] contains vibration signals collected from a motor drive-end bearing under four health states: normal, inner race fault, outer race fault, and ball fault. Signals are recorded at a sampling rate of 12 kHz under four load conditions (0, 1, 2, and 3 HP), which naturally induce distribution shifts across operating regimes. The test rig consists of a 2 HP Reliance Electric motor, a torque transducer, and a dynamometer; accelerometers are mounted at the drive end and fan end of the motor housing. Single-point faults with diameters of 0.007, 0.014, and 0.021 inches are introduced via electro-discharge machining. We simulate a federated scenario with K = 4 clients, each corresponding to one load condition. The Paderborn University (PU) dataset [37] provides vibration data from bearings operating under systematically varied combinations of rotational speed (900–1500 rpm), load torque (0.1–0.7 Nm), and radial force (400–1000 N), encompassing both artificially seeded and naturally developed faults. Vibration signals are recorded by accelerometers mounted on the bearing housing at a native sampling rate of 64 kHz; following standard practice in bearing fault diagnosis [1], we downsample the PU signals to 12 kHz to match the CWRU sampling rate and ensure a consistent CWT preprocessing pipeline across both datasets. We construct K = 4 clients from four distinct operating conditions to simulate realistic cross-condition heterogeneity. Compared to CWRU, the PU dataset poses a more challenging diagnostic task due to its greater intra-class variability and subtler inter-class distinctions.
Data partition protocol. The train/test splits are constructed to evaluate cross-condition generalization rather than within-condition interpolation. For CWRU, each of the four load conditions constitutes a separate client; the episodic support and query sets within each client are randomly sampled from that client’s data using disjoint sample indices. No condition-level held-out protocol is applied in the default setting, so that all clients can participate in federated training. For PU, the four operating condition groups are similarly assigned to separate clients. Within each client, we randomly partition the available samples into a training pool (80%) and a test pool (20%) prior to episodic sampling; the test pool is reserved exclusively for final evaluation and is never used during episodic support/query construction. To further substantiate cross-condition generalization, we additionally report leave-one-load-out results on the CWRU dataset in Section 3.11, where three load conditions are used for training and the held-out condition serves as the test client.
Few-shot and non-IID configuration. For each client, we construct C-way S-shot episodes with C = 4 fault categories. We report results under two few-shot settings: S = 5 (5-shot) and S = 10 (10-shot), with Q = 15 query samples per class. To simulate realistic data heterogeneity, class distributions across clients are generated using a Dirichlet distribution with concentration parameter α = 1.0 (unless otherwise specified), where smaller α values produce more severe non-IID partitions. We note that when the total number of samples per class is small (e.g., S = 5 ), the absolute class-count differences induced by Dirichlet partitioning are necessarily limited. In our setup, the Dirichlet distribution primarily governs which operating conditions (and thus which distributional characteristics) dominate each client’s data, rather than producing large imbalances in absolute sample counts. This design reflects realistic industrial scenarios where the heterogeneity arises from distinct operating regimes rather than from purely quantitative class imbalance.
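The effect of the concentration parameter α can be illustrated with a small standard-library sketch that draws Dirichlet proportions across K = 4 clients by normalizing Gamma samples. The function name `dirichlet_proportions` and the fixed seed are illustrative assumptions; the paper does not describe its sampling code.

```python
import random

def dirichlet_proportions(num_clients, alpha, rng):
    """Draw one symmetric Dirichlet(alpha) vector by normalizing Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_clients)]
    total = sum(draws)
    return [value / total for value in draws]

rng = random.Random(0)
shares_by_alpha = {}
for alpha in (0.1, 1.0, 10.0):
    # Each vector gives the share of one class assigned to each of the 4 clients.
    shares_by_alpha[alpha] = dirichlet_proportions(4, alpha, rng)
    print(alpha, [round(share, 3) for share in shares_by_alpha[alpha]])
```

Small α tends to concentrate most of the mass on one or two clients, while large α gives shares close to 1/K, matching the non-IID to near-IID range explored in Section 3.5.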
Implementation details. Raw vibration segments of length L = 1024 are transformed into time–frequency representations of size 64 × 64 via CWT with a Morlet wavelet (ω₀ = 6). The feature extractor f_θ is a four-block convolutional network, where each block consists of a 3 × 3 convolutional layer with 64 filters, batch normalization, ReLU activation, and 2 × 2 max pooling, yielding a 256-dimensional embedding (d = 256). Training is conducted over T = 50 communication rounds with E = 10 local episodes per round. The local optimizer is SGD with a learning rate of 10⁻³. The EMA momentum is set to β = 0.9, the calibration weight to λ = 0.5, and the PDAA temperature to τ = 1.0. All experiments are repeated five times with different random seeds, and we report the mean ± standard deviation of accuracy and macro-averaged F1-score.
Baselines. We compare ProtoFed against eleven representative methods spanning four categories. Standard federated baselines: FedAvg [38] and FedProx [39], both equipped with the same CNN backbone, represent standard federated optimization; FedTL + MMD is a federated transfer learning baseline that applies maximum mean discrepancy-based domain alignment within the federated framework, implemented by us following the standard federated transfer learning paradigm [20]; and FedRad [21] is a recent dynamic aggregation method for fault diagnosis. Prototype-based federated baselines: FedProto [15] pioneers prototype-only communication, and FedSA [16] introduces semantic anchors as unified prototypes. Centralized few-shot baselines: Prototypical network (ProtoNet) [4], MAML [40], and Matching Network [41] are trained on the union of all clients’ data with the same episodic protocol. Since these methods have access to the pooled data from all K clients, they effectively observe K × S samples per class during episode construction (e.g., 20 samples per class with K = 4 and S = 5 ), providing a strictly more favorable data condition than any individual federated client. They therefore serve as centralized upper-bound references representing the best achievable performance when data sharing is unrestricted, and are not included in the federated ranking. Federated meta-learning baselines: FedMeta-FFD [12] and REFML [13] represent the state-of-the-art in federated few-shot fault diagnosis. All methods share the same CNN backbone and CWT preprocessing for fair comparison.
Baseline implementation details. To ensure a fair comparison, we detail the implementation and hyperparameter configuration of all baselines. FedAvg and FedProx use the official algorithms with the same 4-block CNN backbone; FedProx uses μ = 0.01 (tuned over {0.001, 0.01, 0.1, 1.0}). FedTL + MMD is our own implementation following [20], with the MMD kernel bandwidth selected by the median heuristic. FedRad uses the official implementation with the Rademacher penalty coefficient tuned over {0.01, 0.1, 1.0} (selected: 0.1). FedProto uses the official code with the prototype loss weight tuned over {0.1, 0.5, 1.0, 5.0} (selected: 1.0). FedSA uses the official code with the semantic anchor dimension matching our embedding size (d = 256) and the anchor update rate tuned over {0.01, 0.05, 0.1} (selected: 0.05). ProtoNet, MAML, and MatchingNet use standard implementations with inner-loop learning rate 0.01 for MAML (tuned over {0.001, 0.01, 0.1}). FedMeta-FFD uses the official implementation with the adaptive learning rate module enabled; REFML uses the official code with the interpolation coefficient γ tuned over {0.1, 0.3, 0.5, 0.7} (selected: 0.5). All federated methods use the same number of communication rounds (T = 50), local episodes (E = 10), and learning rate (10⁻³). For baselines with official open-source implementations, we use the released code directly; for those without available code (FedTL + MMD), we re-implement following the original paper’s algorithmic description and verify convergence on a centralized validation split before federated deployment.

3.2. Main Results

Table 1 presents the overall comparison across both datasets and both few-shot settings.
The results reveal a clear performance hierarchy across method categories, and several key observations can be drawn. The standard federated methods (FedAvg, FedProx, FedTL + MMD, FedRad), which lack explicit few-shot learning mechanisms, consistently rank at the bottom. FedAvg achieves only 72.41% on CWRU and 65.73% on PU under the 5-shot setting, over 23 and 25 percentage points below ProtoFed, respectively. Even FedRad, which introduces dynamic aggregation based on Rademacher complexity, trails ProtoFed by 12.90% on CWRU and 13.94% on PU, which is consistent with complexity-based weighting losing effectiveness when local datasets are too small for reliable complexity estimation. The prototype-based FL methods, FedProto and FedSA, fare considerably better. FedSA in particular reaches 90.18% on CWRU 5-shot by leveraging semantic anchors, making it the strongest non-meta-learning baseline. ProtoFed still surpasses FedSA by 5.45% on CWRU and 5.27% on PU, however, which we attribute to FedSA’s anchor updates depending on client classifiers that remain under-trained when only five support samples are available.
Among the federated meta-learning approaches, which are ProtoFed’s most direct competitors, REFML slightly edges out FedMeta-FFD on both datasets, owing to its adaptive interpolation between global and local model initialization. ProtoFed outperforms REFML by 4.29% on CWRU and 3.87% on PU under 5-shot, a margin approximately five times the standard deviation of either method. This advantage stems from the synergy between GPC and PDAA: GPC provides calibrated global prototypes as stable classification anchors, while PDAA prevents clients with heavily drifted representations from diluting the aggregated model; neither FedMeta-FFD nor REFML possesses these mechanisms. To verify statistical significance rigorously, we conduct paired t-tests on the per-seed accuracy values across identical data partitions and random seeds. ProtoFed significantly outperforms REFML on both CWRU (t(4) = 13.15, p < 0.001) and PU (t(4) = 11.82, p < 0.001) under 5-shot, and similarly outperforms FedSA (p < 0.001 on both datasets). The 95% confidence intervals for the mean accuracy improvement over REFML are [3.39%, 5.19%] on CWRU and [2.96%, 4.78%] on PU, confirming that the observed advantages are both statistically significant and practically meaningful. As expected, the centralized upper-bound references achieve the highest overall accuracy, since they have unrestricted access to all clients’ pooled data. Centralized ProtoNet reaches 96.82% on CWRU and 92.74% on PU under 5-shot, surpassing ProtoFed by only 1.19% and 1.39%, respectively. This modest gap is substantially smaller than the 5–6% gap separating the second-best federated method (REFML) from the centralized ceiling, which indicates that ProtoFed’s prototype calibration and distance-aware aggregation recover most of the information available under centralized training. Under the 10-shot setting, this gap further shrinks to 0.53% on CWRU and 0.94% on PU, approaching the centralized optimum as more local data becomes available.
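As a consistency check on the reported statistics, the confidence intervals can be recovered from the mean improvement and the paired t statistic alone, since the standard error equals the mean difference divided by t. The sketch below uses only the summary values quoted in the text; the function name is illustrative, and 2.776 is the standard two-sided 95% critical value of Student's t with 4 degrees of freedom (five seeds).

```python
T_CRIT_DF4 = 2.776  # two-sided 95% critical value, Student's t, 4 degrees of freedom

def ci_from_paired_t(mean_diff, t_stat, t_crit=T_CRIT_DF4):
    """Recover the 95% confidence interval of a paired mean difference."""
    std_err = mean_diff / t_stat      # t = mean_diff / std_err
    half_width = t_crit * std_err
    return mean_diff - half_width, mean_diff + half_width

# Reported improvements over REFML (percentage points) and t statistics.
cwru_ci = ci_from_paired_t(4.29, 13.15)
pu_ci = ci_from_paired_t(3.87, 11.82)
print([round(v, 2) for v in cwru_ci], [round(v, 2) for v in pu_ci])
```

The recovered intervals agree with the reported ones to within rounding of the t statistics.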
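It is also worth noting that ProtoFed’s margin over the standard federated baselines is larger on PU than on CWRU (25.62 vs. 23.22 percentage points over FedAvg, and 13.94 vs. 12.90 over FedRad), whereas its margins over FedSA (5.27 vs. 5.45) and REFML (3.87 vs. 4.29) are slightly smaller on PU. Since the PU dataset exhibits greater inter-condition variability, the first trend is consistent with the design rationale of GPC and PDAA, namely that prototype calibration is most beneficial when cross-client heterogeneity is severe, although the comparison with the stronger baselines does not show the same pattern.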

3.3. Ablation Study

To quantify the contribution of each proposed component, we conduct ablation experiments by systematically removing individual modules from the full ProtoFed framework. Four ablation variants are evaluated: (1) w/o GPC, which disables global prototype calibration and uses uniform prototype averaging without EMA smoothing; (2) w/o PDAA, which replaces distance-aware aggregation with standard sample-count-based weighting; (3) w/o CWT, which feeds raw one-dimensional vibration signals directly into a one-dimensional CNN instead of CWT time–frequency maps; (4) w/o calibration, which removes the calibration regularization term ( λ = 0 ) from the local training objective.
The results in Table 2 and the accompanying visual comparison in Figure 4 reveal that every component contributes meaningfully to the overall performance, though their relative importance differs.
Removing CWT incurs the largest performance drop (−6.15% accuracy on CWRU, −7.23% on PU under 5-shot), underscoring the importance of time–frequency representations as the input modality. One-dimensional vibration signals, while containing the same information in principle, lack the explicit two-dimensional structure that allows the convolutional feature extractor to capture joint time–frequency patterns characteristic of different fault types. The calibration regularization term is the second most influential component: its removal leads to a 5.31% drop on CWRU and a 5.77% drop on PU, confirming that the bidirectional alignment between local and global prototypes is essential for maintaining cross-client consistency.
Among the two server-side mechanisms, GPC has a slightly greater impact than PDAA. Disabling GPC causes a 3.85% drop on CWRU and a 4.71% drop on PU, whereas disabling PDAA leads to drops of 2.98% and 3.49%, respectively. This ranking is intuitive: GPC directly addresses the prototype drift problem by producing calibrated global prototypes, while PDAA provides a complementary but indirect benefit by adjusting aggregation weights. The two mechanisms are also designed to cooperate: PDAA measures each client’s divergence from the GPC-calibrated global prototypes, so better-calibrated prototypes should yield more meaningful distance-based weights. Because the variants above remove one component at a time, however, they do not by themselves quantify this interaction.

3.4. Impact of Few-Shot Size

Figure 5 examines how the number of support samples per class (S) affects diagnostic performance across all methods.
Two trends are evident. First, all methods benefit from increasing S, but the rate of improvement varies substantially. ProtoFed achieves 72.18% on CWRU with just a single shot per class and reaches 95.63% at S = 5 —a steep improvement curve that plateaus around S = 10 . In contrast, FedAvg + CNN requires S = 20 to reach 91.38%, a level that ProtoFed exceeds at S = 5 . This finding demonstrates that the combination of prototypical metric learning and global prototype calibration enables ProtoFed to extract discriminative representations from extremely limited data.
Second, the advantage of ProtoFed is most pronounced in the extreme low-shot regime. At S = 1 , ProtoFed outperforms REFML by 7.06% on CWRU and 6.82% on PU; at S = 20 , the gap narrows to 1.20% and 1.30%. This diminishing-return pattern aligns with the expectation that prototype calibration is most valuable when local prototypes are computed from very few samples and are inherently noisy—as S increases, local prototypes converge toward their true class means, reducing prototype drift and the relative benefit of GPC.
A natural concern with few-shot settings, particularly at S = 1 or S = 3, is the risk of overfitting to the extremely limited support samples. ProtoFed mitigates this risk through several complementary mechanisms. First, the prototypical network architecture avoids training a parametric classifier head, instead relying on the geometric structure of the embedding space, which inherently constrains the hypothesis space. Second, the calibration loss $\mathcal{L}_{\mathrm{cal}}$ acts as a regularizer that pulls local prototypes toward the global consensus, preventing the feature extractor from overfitting to client-specific distributional artifacts. Third, the episodic training protocol itself provides implicit data augmentation by constructing diverse support/query partitions across episodes. The observation that ProtoFed’s standard deviation remains below 1% even at S = 1 (compared to over 2% for FedAvg) further indicates that these mechanisms collectively stabilize the learning process against overfitting in the extreme low-shot regime.
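The first mechanism, classification without a parametric head, can be sketched as follows. This follows the standard prototypical network formulation (class prototype as the mean support embedding, prediction by nearest prototype in Euclidean distance); Equations (3) and (5) are not reproduced in this section, and the 2-D embeddings and class names are made-up toy values rather than outputs of the paper's feature extractor.

```python
import math

def class_prototypes(support):
    """support: dict label -> list of embedding vectors. Returns label -> mean vector."""
    return {label: [sum(col) / len(vecs) for col in zip(*vecs)]
            for label, vecs in support.items()}

def predict(query, prototypes):
    """Assign the query embedding to the class with the nearest prototype."""
    return min(prototypes, key=lambda label: math.dist(query, prototypes[label]))

# Toy 2-way 2-shot episode with hypothetical 2-D embeddings.
support = {
    "normal": [[0.0, 0.0], [0.2, 0.1]],
    "inner_race": [[1.0, 1.0], [0.8, 1.2]],
}
prototypes = class_prototypes(support)
prediction = predict([0.9, 0.9], prototypes)
```

The only learned component is the embedding function; the decision rule itself has no trainable parameters, which is what limits the hypothesis space when S is very small.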

3.5. Robustness to Non-IID Heterogeneity

We investigate the sensitivity of ProtoFed to varying degrees of data heterogeneity by adjusting the Dirichlet concentration parameter α from 0.1 (highly non-IID) to 10.0 (approximately IID) under the 5-shot setting.
As shown in Figure 6, all methods degrade as α decreases, but the rate of degradation varies markedly. FedAvg + CNN suffers a 24.42% accuracy drop on CWRU when α shifts from 10.0 to 0.1, while ProtoFed’s decline is limited to 6.27%; on PU, the corresponding drops are 25.24% versus 7.02%. This robustness stems from GPC producing stable global prototypes that absorb distributional idiosyncrasies of individual clients, and PDAA down-weighting clients whose local distributions have drifted furthest from the consensus.
Notably, ProtoFed maintains a clear advantage even in the near-IID setting ( α = 10.0 ), suggesting that prototype calibration provides additional regularization benefits beyond non-IID mitigation.

3.6. Visualization Analysis

We visualize the embedding space using t-SNE and examine classification boundaries through confusion matrices to provide intuitive evidence for the effectiveness of GPC.
As shown in Figure 7, the uncalibrated embedding space (Figure 7a) exhibits clear condition-specific fragmentation: samples from the same fault category but different clients form distinct sub-clusters, indicating that the feature extractor has learned condition-dependent rather than condition-invariant representations. After GPC calibration (Figure 7b), these sub-clusters coalesce into compact, well-separated groups with cross-client overlap within each class, confirming that GPC successfully unifies the local feature distributions.
The confusion matrices in Figure 8 corroborate this finding. REFML exhibits non-negligible off-diagonal entries between inner race, outer race, and ball faults, categories known to produce overlapping frequency signatures. ProtoFed substantially reduces these misclassifications, raising per-class accuracy for the outer race (OR) and ball categories from 90.0% to 95.0% and 94.0%, respectively, indicating that the calibrated global prototypes sharpen inter-class discrimination for fault types with subtle distinguishing features.

3.7. Convergence Analysis

Figure 9 presents the evolution of prototype drift and test accuracy over communication rounds.
ProtoFed rapidly reduces inter-client prototype divergence, reaching near-convergence within approximately 20 rounds (Figure 9a). The variant without GPC retains substantially higher drift even after 50 rounds, stabilizing at roughly 3.9 compared to ProtoFed’s 0.5, which directly accounts for its lower final accuracy (Figure 9b). The variant without PDAA shows intermediate behavior, with prototype drift decreasing more slowly and converging to a higher residual level. In terms of convergence speed measured in communication rounds, ProtoFed reaches 90% accuracy within approximately 15 rounds, compared to around 25 for REFML and 30 for FedMeta-FFD.

3.8. Sensitivity to Temperature Parameter τ

The PDAA temperature parameter τ controls the sharpness of the client weighting distribution: small τ values concentrate aggregation weight on the most aligned client, while large τ values approach uniform weighting. Table 3 reports the accuracy under the 5-shot setting as τ varies from 0.1 to 5.0 on both datasets.
The results show a clear inverted-U pattern. At τ = 0.1, the weighting distribution becomes nearly one-hot, assigning almost all weight to a single client and discarding useful information from others. At τ = 5.0, the distribution approaches uniformity, so PDAA degenerates toward plain averaging (equivalent to sample-count-based aggregation when clients hold equal amounts of data) and loses the benefit of distance-aware weighting. The optimal τ = 1.0 provides a balanced trade-off, allowing well-aligned clients to contribute more while still incorporating diverse prototype information from moderately divergent clients. The performance difference between τ = 0.5 and τ = 2.0 is relatively small (within 0.5–0.6% on both datasets), indicating that the method is not overly sensitive to τ within a reasonable range around the optimum.
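The two extremes described above can be reproduced with a temperature-scaled softmax over negative prototype distances. This is a sketch under the assumption that Equation (10), which is not shown in this section, has this softmax form; the distance values are hypothetical.

```python
import math

def pdaa_weights(distances, tau):
    """Softmax over -distance / tau (assumed form of the PDAA weighting)."""
    logits = [-dist / tau for dist in distances]
    top = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(value - top) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

distances = [0.4, 0.6, 0.9, 2.0]            # hypothetical local-global prototype distances
sharp = pdaa_weights(distances, tau=0.1)    # close to one-hot on the best-aligned client
balanced = pdaa_weights(distances, tau=1.0)
flat = pdaa_weights(distances, tau=5.0)     # close to uniform
for name, weights in (("0.1", sharp), ("1.0", balanced), ("5.0", flat)):
    print(f"tau={name}:", [round(w, 3) for w in weights])
```

With τ = 0.1 the best-aligned client dominates the aggregate, while with τ = 5.0 even the most divergent client keeps a weight near 1/K.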

3.9. Sensitivity to λ and β

The calibration weight λ and EMA momentum β jointly control the strength and stability of prototype-based regularization. We conduct a grid search over λ ∈ {0.1, 0.3, 0.5, 0.7, 1.0} and β ∈ {0.70, 0.80, 0.90, 0.95, 0.99} under the 5-shot setting. Table 4 reports the resulting accuracy on the CWRU dataset.
The optimal combination is λ = 0.5 and β = 0.9. Several patterns emerge. First, the performance surface is smooth and unimodal, indicating that ProtoFed is not overly sensitive to the exact hyperparameter values. The region λ ∈ [0.3, 0.7] and β ∈ [0.80, 0.95] consistently achieves accuracy above 94%, providing a wide operational range for practitioners. Second, extreme values of either parameter degrade performance: λ = 0.1 provides insufficient calibration, while λ = 1.0 over-constrains local adaptation; similarly, β = 0.99 makes the EMA too sluggish to track evolving prototype distributions, while β = 0.70 introduces excessive jitter in the global prototypes. Third, the interaction between λ and β is cooperative rather than independent: increasing λ (stronger calibration pull) is more effective when β is moderate (0.80–0.95) because a stable global anchor makes the calibration signal more informative. We observe a similar pattern on the PU dataset (best: λ = 0.5, β = 0.9, 91.35%; worst corner: λ = 0.1, β = 0.99, 85.92%), confirming the generality of the selected hyperparameters.
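The "sluggish" versus "jittery" behavior follows directly from the EMA recursion in Equation (8): a prototype contributed n rounds ago is down-weighted by βⁿ, so its influence halves after ln(0.5)/ln(β) rounds. A short worked computation (the helper name is illustrative):

```python
import math

def ema_half_life(beta):
    """Rounds after which a past prototype's weight in the EMA has halved."""
    return math.log(0.5) / math.log(beta)

for beta in (0.70, 0.90, 0.99):
    print(f"beta={beta:.2f}: half-life of about {ema_half_life(beta):.1f} rounds")
```

With β = 0.99 the half-life exceeds the T = 50 training rounds, so early, poorly trained prototypes never fade out; with β = 0.70 it is under two rounds, so the global prototypes track per-round noise closely.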

3.10. Robustness to Noisy Clients

In real-world federated deployments, some clients may produce corrupted labels due to sensor malfunction, incorrect maintenance logs, or annotation errors. To evaluate robustness under such conditions, we simulate a noisy client scenario by randomly flipping 20% of the labels on one of the four clients, while keeping the remaining three clients clean. Table 5 compares ProtoFed, its variant without PDAA, REFML, and FedAvg under the 5-shot setting.
ProtoFed exhibits strong resilience to client-level label noise, with only a 1.45% accuracy drop on CWRU and 1.63% on PU compared to the clean setting. In contrast, REFML drops by 3.72% on CWRU and 4.30% on PU, and FedAvg drops by 3.96% and 4.35%, respectively. The key to this robustness is PDAA: the noisy client develops distorted local prototypes that diverge substantially from the global anchors, resulting in a high prototype distance $D_k^{(t)}$ and a correspondingly low aggregation weight. Indeed, the ablated variant without PDAA suffers a noise-induced degradation of 2.41% on CWRU and 3.00% on PU (relative to its own clean setting accuracy of 92.65% and 87.86%), roughly 1.7 to 1.8 times ProtoFed’s degradation, which supports the view that PDAA’s distance-aware weighting effectively isolates the corrupted client. Under the noisy condition, the performance gap between ProtoFed and its w/o PDAA variant widens to 3.94% on CWRU and 4.86% on PU (compared to 2.98% and 3.49% in the clean ablation), further demonstrating that PDAA becomes increasingly beneficial in the presence of client-level corruption.
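The noisy-client setup can be sketched as a label-flipping routine applied to one client's labels. The text states only that 20% of the labels are randomly flipped; replacing each selected label with a uniformly chosen different class, as done below, is an assumption, and the function name and toy label list are illustrative.

```python
import random

def flip_labels(labels, num_classes, flip_fraction, rng):
    """Return a copy in which flip_fraction of the labels are replaced by a different class."""
    noisy = list(labels)
    num_flips = round(flip_fraction * len(noisy))
    for idx in rng.sample(range(len(noisy)), num_flips):
        alternatives = [c for c in range(num_classes) if c != noisy[idx]]
        noisy[idx] = rng.choice(alternatives)
    return noisy

rng = random.Random(0)
clean = [i % 4 for i in range(200)]          # toy labels for 4 fault classes
noisy = flip_labels(clean, num_classes=4, flip_fraction=0.2, rng=rng)
changed = sum(a != b for a, b in zip(clean, noisy))
```

Because every selected label is forced to change class, exactly 20% of the labels differ from the clean version.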

3.11. Leave-One-Load-Out Cross-Condition Generalization

To evaluate cross-condition generalization more stringently, we conduct a leave-one-load-out protocol on the CWRU dataset: three load conditions are used for federated training (one client per load), and the held-out load condition serves as the test client, whose data is never seen during training. Table 6 reports the accuracy for each held-out condition.
ProtoFed achieves a mean cross-condition accuracy of 93.43%, outperforming REFML by 5.40% and FedSA by 6.73%. Notably, ProtoFed’s cross-condition performance (93.43%) is only 2.20% below its within-condition federated accuracy (95.63%), whereas REFML exhibits a larger gap of 3.31% (91.34% to 88.03%). This smaller generalization gap indicates that the calibrated global prototypes learned by GPC capture condition-invariant fault representations rather than condition-specific artifacts. The 3 HP condition is the most challenging held-out case across all methods, likely because the highest-load operating regime produces the most distinct vibration characteristics. Nevertheless, ProtoFed still achieves 92.14% on this most difficult split.
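The leave-one-load-out protocol amounts to four train/test splits over the CWRU load conditions, which can be enumerated as follows (the generator name is illustrative):

```python
def leave_one_load_out(loads):
    """Yield (training_loads, held_out_load) pairs, one per load condition."""
    for held_out in loads:
        yield [load for load in loads if load != held_out], held_out

# CWRU load conditions in HP; each split trains on three clients and tests on the fourth.
splits = list(leave_one_load_out([0, 1, 2, 3]))
for train_loads, test_load in splits:
    print(f"train on {train_loads} HP, test on {test_load} HP")
```

The reported mean cross-condition accuracy is the average over these four held-out conditions.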

3.12. Noise Robustness Under Varying SNR

To assess the robustness of ProtoFed to signal-level noise—a practical concern in industrial environments with electromagnetic interference and sensor noise—we inject additive Gaussian white noise at varying signal-to-noise ratios (SNR) into the raw vibration signals before CWT preprocessing. Table 7 reports accuracy under the 5-shot setting for SNR values ranging from 2 dB (heavily corrupted) to 10 dB (mildly noisy), as well as the clean baseline (∞).
ProtoFed maintains the highest accuracy across all noise levels. At the severe 2 dB condition, ProtoFed achieves 88.42% on CWRU and 83.28% on PU, retaining 92.5% and 91.2% of its clean setting performance, respectively. In contrast, REFML retains only 88.9% and 87.4% of its own clean accuracy at the same noise level. The absolute performance gap between ProtoFed and the baselines widens as noise increases: at SNR = ∞, the gap over REFML is 4.29% on CWRU, but at SNR = 2 dB, it grows to 7.18% on CWRU and 6.86% on PU. This amplified advantage under noise arises because noisy CWT representations produce noisier per-episode prototypes, amplifying prototype drift across clients. GPC’s temporal smoothing via EMA absorbs this additional variance, while PDAA down-weights clients that are most affected by noise-induced prototype distortion. These results suggest that ProtoFed’s prototype calibration mechanisms provide built-in noise resilience beyond what the baselines can achieve.
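The noise injection step can be sketched with the usual definition SNR(dB) = 10·log₁₀(P_signal / P_noise): the noise variance is set from the measured signal power and the target SNR. The toy sinusoid stands in for a vibration segment, and the function names are illustrative; the paper does not give its implementation.

```python
import math
import random

def add_awgn(signal, snr_db, rng):
    """Add white Gaussian noise so that signal power / noise power = 10^(snr_db / 10)."""
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]

def measured_snr_db(clean, noisy):
    """Empirical SNR of the noisy signal relative to the clean one."""
    signal_power = sum(x * x for x in clean) / len(clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(signal_power / noise_power)

rng = random.Random(0)
# One second of a 50 Hz toy tone sampled at 12 kHz.
clean = [math.sin(2 * math.pi * 50 * n / 12000) for n in range(12000)]
noisy = add_awgn(clean, snr_db=2.0, rng=rng)
print(round(measured_snr_db(clean, noisy), 2))
```

At 2 dB the noise power is about 63% of the signal power, which gives a sense of how heavily corrupted the most severe condition is.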

3.13. Communication Cost Analysis

A practical concern for federated deployment is the communication overhead introduced by prototype sharing. Table 8 compares the per-round per-client communication cost across representative methods.
ProtoFed transmits both model parameters and class prototypes, incurring a total upload/download cost of |θ| + C × d per direction. With C = 4 and d = 256, the prototype payload amounts to 1024 float32 values, or approximately 4 KB, representing a mere 0.38% overhead relative to the model parameter transmission. Over 50 communication rounds with K = 4 clients, ProtoFed’s total communication volume is approximately 418 MB, compared to 416 MB for FedAvg, a negligible difference. FedProto achieves dramatically lower communication cost (1.6 MB total) by transmitting only prototypes, but at the expense of roughly 10 percentage points lower accuracy (85.62% vs. 95.63% on CWRU 5-shot). ProtoFed therefore achieves the best accuracy–communication trade-off among all compared methods, adding less than 0.4% communication overhead over standard FL while recovering most of the centralized-to-federated performance gap.
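These figures can be cross-checked with a few lines of arithmetic. The sketch assumes that 1 MB denotes 2²⁰ bytes and that each round involves one upload and one download per client; the per-direction model size is not stated in the text and is inferred here from the reported 416 MB FedAvg total.

```python
BYTES_PER_FLOAT32 = 4
MIB = 1024 ** 2

num_classes, embed_dim = 4, 256        # C and d
num_clients, num_rounds = 4, 50        # K and T

proto_bytes = num_classes * embed_dim * BYTES_PER_FLOAT32   # per client, per direction
transfers = num_rounds * num_clients * 2                    # uploads plus downloads
proto_total_mib = proto_bytes * transfers / MIB             # prototype-only traffic

fedavg_total_mib = 416.0                                    # reported FedAvg total
model_bytes = fedavg_total_mib * MIB / transfers            # inferred per-direction model size
overhead_pct = 100 * proto_bytes / model_bytes
protofed_total_mib = fedavg_total_mib + proto_total_mib

print(proto_bytes, round(proto_total_mib, 2), round(overhead_pct, 2), round(protofed_total_mib, 1))
```

Under these assumptions the prototype payload is 4096 bytes, the prototype-only traffic matches FedProto's 1.6 MB total, and the relative overhead and ProtoFed total round to the reported 0.38% and 418 MB.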
The computational overhead of GPC and PDAA on the server side is also minimal. GPC requires aggregating K × C = 16 prototype vectors of dimension d = 256 and performing one EMA update, amounting to O ( K × C × d ) operations. PDAA computes K = 4 distance scalars and a softmax normalization. Both operations complete in under 1 ms on a single CPU core and are negligible compared to the local training time (approximately 45 s per client per round on a single NVIDIA RTX 3090 GPU).

3.14. Differential Privacy Analysis

Although class prototypes are more compact than raw data or full model gradients, they can still leak aggregate distributional statistics of a client’s data. An adversary with access to the uploaded prototypes could, in principle, infer statistical properties of the local class distributions through prototype inversion or membership inference attacks [6]. To evaluate the feasibility of strengthening ProtoFed’s privacy posture, we apply client-level differential privacy (DP) by adding calibrated Gaussian noise to the prototype vectors before uploading them to the server:
$$\hat{p}_c^{k,(t)} = p_c^{k,(t)} + \mathcal{N}(0, \sigma^2 I),$$
where the noise scale σ is calibrated according to the desired privacy budget ε via the Gaussian mechanism, with δ = 10⁻⁵ and a sensitivity bound derived from the ℓ₂-norm clipping of prototype vectors (clip norm = 10.0). Table 9 reports the accuracy under the 5-shot setting for varying privacy budgets.
The results reveal a favorable privacy–accuracy trade-off. At a moderate privacy budget ε = 10 , the accuracy loss relative to the non-private baseline is only 0.25% on CWRU and 0.27% on PU—practically negligible. Even under the stringent ε = 1 regime, ProtoFed retains 92.14% on CWRU and 87.42% on PU, still surpassing the best non-private federated baseline (REFML at 91.34% and 87.48%) on CWRU while remaining competitive on PU. This resilience arises because the GPC aggregation over K clients and EMA temporal smoothing jointly average out the injected noise across both clients and time steps, effectively denoising the global prototypes. The model parameters are transmitted without DP noise in this experiment; extending DP to the full parameter exchange via DP-SGD remains a direction for future investigation.
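The prototype perturbation step can be sketched as clip-then-add-noise. The noise scale below uses the classical Gaussian mechanism calibration σ = Δ·√(2 ln(1.25/δ))/ε, which is formally derived for ε < 1; the text does not state which calibration is used for larger budgets such as ε = 10, and taking the sensitivity Δ equal to the clip norm is an assumption. All function names are illustrative.

```python
import math
import random

def clip_l2(vector, clip_norm):
    """Scale the vector down so its L2 norm does not exceed clip_norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in vector]

def gaussian_sigma(epsilon, delta, sensitivity):
    """Classical Gaussian-mechanism noise scale (assumed calibration)."""
    return sensitivity * math.sqrt(2 * math.log(1.25 / delta)) / epsilon

def privatize(prototype, epsilon, rng, delta=1e-5, clip_norm=10.0):
    """Clip a prototype and add isotropic Gaussian noise before upload."""
    clipped = clip_l2(prototype, clip_norm)
    sigma = gaussian_sigma(epsilon, delta, sensitivity=clip_norm)  # assumption: sensitivity = clip norm
    return [x + rng.gauss(0.0, sigma) for x in clipped]

rng = random.Random(0)
prototype = [0.1] * 256                      # hypothetical d = 256 prototype
noisy_proto = privatize(prototype, epsilon=10.0, rng=rng)
```

Since σ scales as 1/ε under this calibration, tightening the budget from ε = 10 to ε = 1 increases the per-coordinate noise tenfold, which is the regime in which the averaging over clients and rounds described above matters most.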

4. Discussion

4.1. Practical Implications

From an industrial deployment perspective, ProtoFed’s ability to achieve over 91% accuracy on the challenging PU dataset with only five labeled samples per fault class substantially lowers the data barrier for deploying intelligent fault diagnosis across distributed industrial fleets. The communication architecture carries additional practical advantages: since only class-level prototypes and model parameters are transmitted, the framework avoids direct exposure of client-side measurements—a baseline requirement in many industrial data-sharing agreements. We note, however, that prototype communication does not constitute a formal privacy guarantee, as class-level statistics can still leak aggregate distributional information (see Section 4.3). The PDAA mechanism also provides an inherent form of quality control: clients experiencing sensor degradation or anomalous operating transients are automatically down-weighted during aggregation, reducing the risk of corrupted local models contaminating the global diagnostic system.

4.2. Scalability with Respect to the Number of Clients

The main experiments adopt K = 4 clients, which corresponds to the number of distinct operating conditions in each benchmark dataset. In practice, industrial deployments may involve a larger number of participating sites. To evaluate the scalability of ProtoFed, we simulate federated scenarios with K ∈ {4, 8, 12, 16} clients by partitioning the available data among increasing numbers of participants, each receiving a proportionally smaller share of the total samples. When K > 4 (the number of original operating conditions in each dataset), data from the same operating condition is split across multiple clients via Dirichlet-based random partitioning; this means that for K = 8, 12, and 16, inter-client heterogeneity arises from a mixture of cross-condition distributional shifts and intra-condition random splits, rather than purely from distinct operating regimes. This setup simulates practical deployments where multiple machines at different sites may operate under similar but not identical conditions. All experiments use the 5-shot setting with Dirichlet non-IID partitioning (α = 1.0). Table 10 reports the accuracy and F1-score for ProtoFed, REFML, and FedAvg + CNN on both datasets.
As shown in Table 10, all methods degrade as the number of clients increases, but the rate of degradation differs substantially. From K = 4 to K = 16, ProtoFed’s accuracy drops by only 2.39 percentage points on CWRU and 2.64 on PU, compared with 5.16 and 5.62 for REFML and 6.49 and 6.58 for FedAvg + CNN. We attribute this graceful degradation to two factors: GPC accommodates a growing pool of diverse local prototypes by aggregating them into stable global anchors, while PDAA automatically down-weights outlier clients that may emerge as the participant pool expands.
Notably, even with K = 16 clients, ProtoFed achieves 93.24% on CWRU and 88.71% on PU, exceeding the performance of REFML with K = 4 clients (91.34% and 87.48%). This result suggests that ProtoFed can support realistic multi-site deployments without substantial performance degradation.
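
To make the roles of GPC and PDAA concrete, the following is a minimal sketch under stated assumptions. The exact update equations are defined in the method section of the paper and are not reproduced here; this sketch assumes that GPC is an EMA with momentum β over a weighted mean of local prototypes, and that PDAA applies a softmax with temperature τ to negative local–global prototype distances. The function names `gpc_update`, `pdaa_weights`, and `euclidean` are hypothetical.

```python
import math


def gpc_update(prev_global, local_protos, client_weights, beta=0.9):
    """EMA-smoothed global prototype for one class (assumed form).

    prev_global: list of floats; local_protos: one list of floats per client.
    """
    total = sum(client_weights)
    dim = len(prev_global)
    aggregated = [
        sum(w * p[j] for w, p in zip(client_weights, local_protos)) / total
        for j in range(dim)
    ]
    return [beta * g + (1.0 - beta) * a for g, a in zip(prev_global, aggregated)]


def pdaa_weights(distances, tau=1.0):
    """Softmax over negative local-global prototype distances (assumed form)."""
    scaled = [-d / tau for d in distances]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    norm = sum(exps)
    return [e / norm for e in exps]


def euclidean(u, v):
    """Euclidean distance between two equally sized vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


if __name__ == "__main__":
    global_proto = [0.0, 0.0]
    # Toy 2-D prototypes; the last client drifts far from the others.
    local = [[1.0, 0.0], [0.9, 0.1], [1.1, -0.1], [5.0, 5.0]]
    new_global = gpc_update(global_proto, local, [1, 1, 1, 1], beta=0.9)
    dists = [euclidean(p, new_global) for p in local]
    print(pdaa_weights(dists, tau=1.0))
```

In this toy run the drifting client receives the smallest aggregation weight, which is the down-weighting behaviour the discussion attributes to PDAA. A smaller τ sharpens the weighting, while a larger τ moves it toward uniform averaging.
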

4.3. Limitations and Future Directions

Several limitations should be acknowledged. The experiments use benchmark datasets with a fixed set of fault categories; in practice, unseen fault types may emerge, necessitating open-set or incremental diagnosis capabilities.
Dataset representativeness. Both CWRU and PU are laboratory-collected benchmark datasets recorded under controlled conditions with minimal background noise, standardized sensor placement, and isolated single-fault modes. While our noise robustness experiments (Section 3.12) demonstrate that ProtoFed maintains strong performance under synthetically injected Gaussian white noise down to SNR = 2 dB, real industrial environments exhibit more complex noise characteristics including non-stationary interference, multi-fault coupling, variable rotational speeds, and sensor degradation over time. These factors may produce distributional shifts beyond what additive Gaussian noise can simulate. Validation on industrial-grade datasets with naturally occurring noise and compound faults (e.g., XJTU-SY, IMS) remains an important direction for establishing practical generalizability.
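
For reference, synthetic noise injection at a target SNR is usually done by scaling white Gaussian noise to the measured signal power. The sketch below shows that standard recipe; the helper names `add_awgn` and `measured_snr_db` and the toy sinusoid are assumptions for illustration, not the authors' preprocessing code.

```python
import math
import random


def add_awgn(signal, snr_db, rng=None):
    """Return the signal plus white Gaussian noise at the requested SNR (dB)."""
    if rng is None:
        rng = random.Random(0)
    signal_power = sum(x * x for x in signal) / len(signal)
    # SNR_dB = 10 * log10(P_signal / P_noise), solved for P_noise.
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def measured_snr_db(clean, noisy):
    """Empirical SNR of a noisy copy relative to its clean signal."""
    signal_power = sum(x * x for x in clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy))
    return 10.0 * math.log10(signal_power / noise_power)


if __name__ == "__main__":
    # Toy 1024-sample sinusoid standing in for a vibration segment.
    clean = [math.sin(2 * math.pi * 50 * t / 1024) for t in range(1024)]
    for snr in (0, 2, 6, 10):
        noisy = add_awgn(clean, snr, random.Random(42))
        print(snr, round(measured_snr_db(clean, noisy), 2))
```

Because the noise here is stationary and independent of the signal, it cannot reproduce the non-stationary interference or compound faults mentioned above, which is exactly the gap this limitation points to.
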
Non-IID modeling scope. The Dirichlet-based partitioning adopted in our experiments primarily models label distribution skew, where different clients hold different proportions of fault categories. In practice, federated heterogeneity in fault diagnosis also arises from feature distribution shift (e.g., different vibration signatures under varying speeds and loads) and operating condition skew (e.g., clients operating under systematically different regimes). While our leave-one-load-out experiment (Section 3.11) partially addresses operating condition skew by evaluating generalization to entirely unseen load conditions, a fully disentangled evaluation that separately controls label skew, feature skew, and condition skew would provide a more complete picture. Designing such disentangled non-IID protocols for fault diagnosis benchmarks is a promising avenue for future work.
Privacy considerations. The prototype communication mechanism, while avoiding direct exposure of raw data or individual sample features, is not immune to inference attacks. Class prototypes encode the centroid of each class’s feature distribution, from which an adversary could potentially infer statistical properties of the local data, for instance by estimating class variance, detecting the presence of outlier operating conditions, or performing membership inference to determine whether a specific sample contributed to a prototype. Our differential privacy experiments (Section 3.14) demonstrate that adding calibrated Gaussian noise to prototype uploads preserves competitive accuracy even at stringent privacy budgets (ε = 1), offering a practical mitigation strategy. However, the current DP analysis applies noise only to prototypes; extending differential privacy guarantees to the full model parameter exchange via DP-SGD, and formally analyzing the composite privacy budget across communication rounds under both channels, remain important directions for deployment in privacy-critical industrial settings.
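
A minimal sketch of prototype perturbation is given below, assuming the usual clip-then-add-noise pattern of the Gaussian mechanism. This section does not state the clipping bound, the value of δ, or how the noise scale is calibrated to ε, so the sketch leaves the noise scale as a free parameter; `clip_l2`, `privatize_prototype`, and `noise_multiplier` are hypothetical names.

```python
import math
import random


def clip_l2(vector, max_norm):
    """Scale the vector down so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm <= max_norm or norm == 0.0:
        return list(vector)
    return [x * max_norm / norm for x in vector]


def privatize_prototype(prototype, max_norm, noise_multiplier, rng):
    """Clip a class prototype and add isotropic Gaussian noise.

    Assumption: the noise standard deviation is noise_multiplier * max_norm.
    Mapping this value to an (epsilon, delta) budget requires a privacy
    accountant that is outside the scope of this sketch.
    """
    clipped = clip_l2(prototype, max_norm)
    sigma = noise_multiplier * max_norm
    return [x + rng.gauss(0.0, sigma) for x in clipped]


if __name__ == "__main__":
    rng = random.Random(0)
    prototype = [3.0, 4.0]  # toy 2-D prototype with L2 norm 5
    print(privatize_prototype(prototype, max_norm=1.0, noise_multiplier=0.5, rng=rng))
```

Clipping bounds the influence of any single upload, which is what makes the added noise meaningful. Composing the budget over many communication rounds, as noted above, still needs a separate analysis.
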
Additionally, apart from the leave-one-load-out experiment (Section 3.11), our evaluation covers conditions represented in the training distribution, so generalization to entirely unseen operating regimes is only partially addressed; scalability beyond K = 16 clients with asynchronous participation also remains to be investigated.
Future work could extend ProtoFed to open-set fault diagnosis for detecting novel categories, integrate label noise robustness for field-annotated data, and validate the framework in real multi-factory deployments with heterogeneous sensor configurations and communication latencies. Replacing the lightweight CNN backbone with pre-trained foundation models could further improve few-shot generalization, provided that the computational constraints of edge devices are carefully managed.

5. Conclusions

This paper proposed ProtoFed, a prototype-enhanced federated meta-learning framework for few-shot rolling bearing fault diagnosis. ProtoFed combines CWT-based time–frequency feature extraction with prototypical networks for local episodic learning, and introduces two server-side mechanisms: Global Prototype Calibration (GPC), which unifies cross-client class representations via EMA-smoothed aggregation, and Prototype-Distance Aware Aggregation (PDAA), which dynamically weights client contributions based on local–global prototype alignment. Experiments on the CWRU and PU bearing datasets demonstrated that ProtoFed achieves 95.63 ± 0.68% and 91.35 ± 0.72% accuracy under the 5-shot setting, outperforming the strongest federated baseline (REFML) by 4.29 and 3.87 percentage points while closely approaching the centralized upper bound (within 1.19 and 1.39 percentage points). Ablation studies, t-SNE visualizations, and scalability analysis up to 16 clients supported the contribution and robustness of each proposed component. Additional evaluations, including hyperparameter sensitivity analyses for τ, λ, and β; leave-one-load-out cross-condition generalization; noise robustness under varying SNR levels; noisy client resilience; communication cost analysis; and differential privacy experiments, further substantiated the practical viability of ProtoFed for real-world federated fault diagnosis deployment.

Author Contributions

Conceptualization, Y.J. and J.S.; methodology, Y.J.; software, Y.J.; validation, Y.J., Y.L., X.L. and Y.F.; formal analysis, Y.J. and Y.L.; investigation, Y.J., X.L. and Y.F.; resources, J.S.; data curation, Y.J. and Y.L.; writing—original draft preparation, Y.J.; writing—review and editing, Y.L., X.L., Y.F. and J.S.; visualization, Y.J.; supervision, J.S.; project administration, J.S.; funding acquisition, J.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Liaoning Province Education Department Project (No. JYTMS20230430).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy or ethical restrictions. The preprocessing scripts and implementation code will be publicly released upon acceptance of this manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

CWT: Continuous Wavelet Transform
CNN: Convolutional Neural Network
CWRU: Case Western Reserve University
DP: Differential Privacy
EMA: Exponential Moving Average
FL: Federated Learning
FFSL: Federated Few-Shot Learning
GPC: Global Prototype Calibration
IID: Independent and Identically Distributed
MAML: Model-Agnostic Meta-Learning
MMD: Maximum Mean Discrepancy
PDAA: Prototype-Distance Aware Aggregation
PU: Paderborn University
SGD: Stochastic Gradient Descent
SNR: Signal-to-Noise Ratio
t-SNE: t-Distributed Stochastic Neighbor Embedding

References

  1. Liang, X.; Zhang, M.; Feng, G.; Wang, D.; Xu, Y.; Gu, F. Few-Shot Learning Approaches for Fault Diagnosis Using Vibration Data: A Comprehensive Review. Sustainability 2023, 15, 14975.
  2. Zhang, S.; Ye, F.; Wang, B.; Habetler, T. Few-Shot Bearing Fault Diagnosis Based on Model-Agnostic Meta-Learning. IEEE Trans. Ind. Appl. 2021, 57, 4754–4764.
  3. Wu, J.; Zhao, Z.; Sun, C.; Yan, R.; Chen, X. Few-Shot Transfer Learning for Intelligent Fault Diagnosis of Machine. Measurement 2020, 166, 108202.
  4. Snell, J.; Swersky, K.; Zemel, R. Prototypical Networks for Few-Shot Learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 4077–4087.
  5. Wang, D.; Zhang, M.; Xu, Y.; Lu, W.; Yang, J.; Zhang, T. Metric-Based Meta-Learning Model for Few-Shot Fault Diagnosis under Multiple Limited Data Conditions. Mech. Syst. Signal Process. 2021, 155, 107510.
  6. Mothukuri, V.; Parizi, R.; Pouriyeh, S.; Huang, Y.; Dehghantanha, A.; Srivastava, G. A Survey on Security and Privacy of Federated Learning. Future Gener. Comput. Syst. 2021, 115, 619–640.
  7. Abdulrahman, S.; Tout, H.; Ould-Slimane, H.; Mourad, A.; Talhi, C.; Guizani, M. A Survey on Federated Learning: The Journey From Centralized to Distributed On-Site Learning and Beyond. IEEE Internet Things J. 2021, 8, 5476–5497.
  8. Li, Q.; Wen, Z.; Wu, Z.; He, B. A Survey on Federated Learning Systems: Vision, Hype and Reality for Data Privacy and Protection. IEEE Trans. Knowl. Data Eng. 2023, 35, 3347–3366.
  9. Berghout, T.; Benbouzid, M.; Bentrcia, T.; Lim, W.; Amirat, Y. Federated Learning for Condition Monitoring of Industrial Processes: A Review on Fault Diagnosis Methods, Challenges, and Prospects. Electronics 2022, 12, 158.
  10. Zhang, W.; Li, X.; Ma, H.; Luo, Z.; Li, X. Federated Learning for Machinery Fault Diagnosis with Dynamic Validation and Self-Supervision. Knowl.-Based Syst. 2021, 213, 106679.
  11. Zhao, C.; Shen, W. Federated Domain Generalization: A Secure and Robust Framework for Intelligent Fault Diagnosis. IEEE Trans. Ind. Inform. 2024, 20, 2662–2670.
  12. Chen, J.; Tang, J.; Li, W. Industrial Edge Intelligence: Federated-Meta Learning Framework for Few-Shot Fault Diagnosis. IEEE Trans. Netw. Sci. Eng. 2023, 10, 3561–3573.
  13. Cui, J.; Li, J.; Mei, Z.; Wei, K.; Wei, S.; Ding, M.; Chen, W.; Guo, S. Federated Meta-Learning for Few-Shot Fault Diagnosis with Representation Encoding. IEEE Trans. Instrum. Meas. 2023, 72, 3536812.
  14. Yang, L.; Huang, J.; Lin, W.; Cao, J. Personalized Federated Learning on Non-IID Data via Group-Based Meta-Learning. ACM Trans. Knowl. Discov. Data 2023, 17, 49.
  15. Tan, Y.; Long, G.; Liu, L.; Zhou, T.; Lu, Q.; Jiang, J.; Zhang, C. FedProto: Federated Prototype Learning across Heterogeneous Clients. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Virtual, 22 February–1 March 2022; Volume 36, pp. 8432–8440.
  16. Zhou, Y.; Qu, X.; You, C.; Zhou, J.; Tang, J.; Zheng, X.; Cai, C.; Wu, Y. FedSA: A Unified Representation Learning via Semantic Anchors for Prototype-Based Federated Learning. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 23009–23017.
  17. Yu, Y.; Guo, L.; Gao, H.; He, Y.; You, Z.; Duan, A. FedCAE: A New Federated Learning Framework for Edge-Cloud Collaboration Based Machine Fault Diagnosis. IEEE Trans. Ind. Electron. 2024, 71, 4108–4119.
  18. Ye, M.; Fang, X.; Du, B.; Yuen, P.; Tao, D. Heterogeneous Federated Learning: State-of-the-Art and Research Challenges. ACM Comput. Surv. 2023, 56, 79.
  19. Zhang, Y.; Xue, X.; Zhao, X.; Wang, L. Federated Learning for Intelligent Fault Diagnosis Based on Similarity Collaboration. Meas. Sci. Technol. 2023, 34, 035114.
  20. Zhang, W.; Li, X. Federated Transfer Learning for Intelligent Fault Diagnostics Using Deep Adversarial Networks with Data Privacy. IEEE/ASME Trans. Mechatron. 2022, 27, 430–439.
  21. Xiao, D.; Diao, J.; Ding, J.; Jiang, L.; Zhao, C. FedRad: A Dynamic Low-Cost Federated Learning for Fault Diagnosis of Rotating Machinery. IEEE Trans. Instrum. Meas. 2025, 74, 3551914.
  22. Hu, Y.; Liu, R.; Li, X.; Chen, D.; Hu, Q. Task-Sequencing Meta Learning for Intelligent Few-Shot Fault Diagnosis with Limited Data. IEEE Trans. Ind. Inform. 2022, 18, 3894–3904.
  23. Chang, L.; Lin, Y. Meta-Learning with Adaptive Learning Rates for Few-Shot Fault Diagnosis. IEEE/ASME Trans. Mechatron. 2022, 27, 5948–5958.
  24. Rezazadeh, N.; Caputo, F.; Aversano, A.; Lamanna, G.; De Luca, A.; Perfetto, D. Prototype-Attention Domain Adaptation for Explainable Bearing Fault Diagnosis. Procedia Struct. Integr. 2026, 80, 411–417.
  25. Lei, Z.; Zhang, P.; Chen, Y.; Feng, K.; Wen, G.; Liu, Z.; Yan, R.; Chen, X.; Yang, C. Prior Knowledge-Embedded Meta-Transfer Learning for Few-Shot Fault Diagnosis under Variable Operating Conditions. Mech. Syst. Signal Process. 2023, 200, 110491.
  26. Li, C.; Li, S.; Wang, H.; Gu, F.; Ball, A. Attention-Based Deep Meta-Transfer Learning for Few-Shot Fine-Grained Fault Diagnosis. Knowl.-Based Syst. 2023, 264, 110345.
  27. Li, C.; Li, S.; Zhang, A.; He, Q.; Liao, Z.; Hu, J. Meta-Learning for Few-Shot Bearing Fault Diagnosis under Complex Working Conditions. Neurocomputing 2021, 439, 197–211.
  28. Zhao, Y.; Yu, G.; Wang, J.; Domeniconi, C.; Guo, M.; Zhang, X.; Cui, L. Personalized Federated Few-Shot Learning. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 2534–2544.
  29. Wang, S.; Fu, X.; Ding, K.; Chen, C.; Chen, H.; Li, J. Federated Few-Shot Learning. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), Long Beach, CA, USA, 6–10 August 2023; pp. 2374–2385.
  30. Huang, W.; Ye, M.; Du, B.; Gao, X. Few-Shot Model Agnostic Federated Learning. In Proceedings of the 30th ACM International Conference on Multimedia (MM), Lisboa, Portugal, 10–14 October 2022; pp. 7309–7316.
  31. Tian, J.; Chen, X.; Wang, S. Few-Shot Federated Learning: A Federated Learning Model for Small-Sample Scenarios. Appl. Sci. 2024, 14, 3919.
  32. Mu, X.; Shen, Y.; Cheng, K.; Geng, X.; Fu, J.; Zhang, T.; Zhang, Z. FedProc: Prototypical Contrastive Federated Learning on Non-IID Data. Future Gener. Comput. Syst. 2023, 143, 93–104.
  33. Fu, L.; Huang, S.; Lai, Y.; Zhang, C.; Dai, H.; Zheng, Z.; Chen, C. Federated Domain-Independent Prototype Learning with Alignments of Representation and Parameter Spaces for Feature Shift. IEEE Trans. Mob. Comput. 2025, 24, 9004–9019.
  34. Huang, Y.; Chu, L.; Zhou, Z.; Wang, L.; Liu, J.; Pei, J.; Zhang, Y. Personalized Cross-Silo Federated Learning on Non-IID Data. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Virtual, 2–9 February 2021; Volume 35, pp. 7865–7873.
  35. Dinh, C.T.; Tran, N.H.; Nguyen, T.D. Personalized Federated Learning with Moreau Envelopes. In Proceedings of the 34th International Conference on Neural Information Processing Systems (NeurIPS), Virtual, 6–12 December 2020; pp. 21394–21405.
  36. Loparo, K.A. Case Western Reserve University Bearing Data Center. Available online: https://engineering.case.edu/bearingdatacenter (accessed on 15 April 2026).
  37. Lessmeier, C.; Kimotho, J.K.; Zimmer, D.; Sextro, W. Condition Monitoring of Bearing Damage in Electromechanical Drive Systems by Using Motor Current Signals of Electric Motors: A Benchmark Data Set for Data-Driven Classification. In Proceedings of the European Conference of the PHM Society, Bilbao, Spain, 5–8 July 2016; pp. 1–17.
  38. McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; y Arcas, B.A. Communication-Efficient Learning of Deep Networks from Decentralized Data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), Fort Lauderdale, FL, USA, 20–22 April 2017; pp. 1273–1282.
  39. Li, T.; Sahu, A.K.; Zaheer, M.; Sanjabi, M.; Talwalkar, A.; Smith, V. Federated Optimization in Heterogeneous Networks. In Proceedings of the Machine Learning and Systems Conference (MLSys), Austin, TX, USA, 2–4 March 2020; pp. 429–450.
  40. Finn, C.; Abbeel, P.; Levine, S. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), Sydney, Australia, 6–11 August 2017; pp. 1126–1135.
  41. Vinyals, O.; Blundell, C.; Lillicrap, T.; Kavukcuoglu, K.; Wierstra, D. Matching Networks for One Shot Learning. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NeurIPS), Barcelona, Spain, 5–10 December 2016; pp. 3630–3638.
Figure 1. Overall framework of the proposed ProtoFed. In each communication round, clients perform local episodic meta-training with prototypical networks on CWT-transformed time–frequency representations, upload class prototypes and model parameters to the server, where GPC produces unified global prototypes and PDAA computes adaptive aggregation weights for the global model update.
Figure 2. CWT preprocessing pipeline. (a) Raw vibration signal of length L = 1024 samples. (b) Morlet CWT scalogram with magnitude extraction (ω₀ = 6, 64 scales, a ∈ [2, 170]), producing a 64 × 1024 time–frequency map that reveals periodic impulse patterns characteristic of bearing faults. (c) Bilinear resizing to 64 × 64 with min-max normalization to [0, 1]. (d) Final single-channel grayscale time–frequency representation I ∈ ℝ^(1 × 64 × 64) serving as input to the feature extractor.
Figure 3. Conceptual illustration of prototype drift and GPC calibration. (a) Without GPC, local prototypes of the same fault class diverge across clients due to heterogeneous operating conditions, leading to inconsistent class representations in the embedding space. (b) With GPC, local prototypes are calibrated toward a unified global prototype g̃_c via weighted aggregation and EMA smoothing, yielding a consistent class representation across all clients.
Figure 4. (a) CWRU, 5-shot; (b) PU, 5-shot. Ablation study results under the 5-shot setting with standard deviation error bars. Solid bars denote accuracy and translucent bars denote F1-score.
Figure 5. Classification accuracy as a function of the number of shots (S) on (a) CWRU and (b) PU datasets.
Figure 6. (a) CWRU, 5-shot; (b) PU, 5-shot. Classification accuracy under different degrees of non-IID heterogeneity controlled by the Dirichlet concentration parameter α, evaluated in the 5-shot setting.
Figure 7. (a) Before GPC; (b) After GPC. t-SNE visualization of learned embeddings on the CWRU dataset (5-shot). Different colors represent fault types and different markers represent clients. Before GPC, samples from the same class but different clients form scattered sub-clusters; after GPC, intra-class representations converge across clients.
Figure 8. (a) REFML; (b) ProtoFed (Ours). Normalized confusion matrices (%) on the CWRU dataset (5-shot) for REFML and ProtoFed.
Figure 9. (a) Prototype drift; (b) Convergence (CWRU, 5-shot). Evolution of average prototype drift and test accuracy over communication rounds on the CWRU dataset (5-shot).

Table 1. Classification performance (%, mean ± std over 5 seeds) on CWRU and PU datasets under 5-shot and 10-shot settings. The best federated results are highlighted in bold, and the second-best are underlined. † denotes centralized methods trained on pooled data (upper-bound references, not included in the bold/underline ranking).

| Method | CWRU 5-Shot Acc | CWRU 5-Shot F1 | CWRU 10-Shot Acc | CWRU 10-Shot F1 | PU 5-Shot Acc | PU 5-Shot F1 | PU 10-Shot Acc | PU 10-Shot F1 |
|---|---|---|---|---|---|---|---|---|
| FedAvg + CNN | 72.41 ± 1.54 | 71.68 ± 1.61 | 83.58 ± 1.22 | 82.94 ± 1.15 | 65.73 ± 1.78 | 64.82 ± 1.92 | 78.36 ± 1.48 | 77.62 ± 1.41 |
| FedProx + CNN | 74.92 ± 1.38 | 74.18 ± 1.44 | 85.31 ± 1.08 | 84.65 ± 1.14 | 68.15 ± 1.72 | 67.34 ± 1.63 | 80.23 ± 1.31 | 79.48 ± 1.42 |
| FedTL + MMD | 78.85 ± 1.28 | 78.31 ± 1.18 | 87.42 ± 0.95 | 86.78 ± 1.02 | 73.24 ± 1.45 | 72.41 ± 1.38 | 83.65 ± 1.18 | 83.12 ± 1.28 |
| FedRad | 82.73 ± 1.15 | 82.14 ± 1.22 | 89.76 ± 0.88 | 89.18 ± 0.82 | 77.41 ± 1.32 | 76.83 ± 1.24 | 85.74 ± 1.12 | 85.18 ± 1.21 |
| FedProto | 85.62 ± 1.08 | 85.11 ± 0.98 | 91.18 ± 0.85 | 90.62 ± 0.78 | 80.57 ± 1.18 | 79.92 ± 1.12 | 88.26 ± 1.05 | 87.63 ± 0.98 |
| FedSA | 90.18 ± 0.88 | 89.65 ± 0.82 | 94.65 ± 0.78 | 94.18 ± 0.72 | 86.08 ± 0.95 | 85.42 ± 0.88 | 91.84 ± 0.85 | 91.26 ± 0.78 |
| ProtoNet † | 96.82 ± 0.58 | 96.34 ± 0.62 | 98.35 ± 0.42 | 97.86 ± 0.51 | 92.74 ± 0.65 | 92.08 ± 0.74 | 96.08 ± 0.52 | 95.58 ± 0.62 |
| MAML † | 94.53 ± 0.72 | 93.95 ± 0.78 | 96.68 ± 0.62 | 96.14 ± 0.55 | 90.36 ± 0.82 | 89.72 ± 0.75 | 94.24 ± 0.65 | 93.68 ± 0.72 |
| MatchingNet † | 92.84 ± 0.85 | 92.26 ± 0.78 | 95.42 ± 0.68 | 94.88 ± 0.62 | 88.28 ± 0.88 | 87.65 ± 0.82 | 92.48 ± 0.72 | 91.86 ± 0.82 |
| FedMeta-FFD | 88.54 ± 0.92 | 87.86 ± 1.02 | 93.45 ± 0.85 | 92.94 ± 0.78 | 83.72 ± 1.08 | 83.05 ± 0.98 | 89.76 ± 0.95 | 89.18 ± 0.88 |
| REFML | 91.34 ± 0.85 | 90.82 ± 0.78 | 95.18 ± 0.75 | 94.72 ± 0.68 | 87.48 ± 0.92 | 86.78 ± 0.85 | 92.38 ± 0.82 | 91.84 ± 0.75 |
| ProtoFed (Ours) | 95.63 ± 0.68 | 95.14 ± 0.72 | 97.82 ± 0.52 | 97.43 ± 0.48 | 91.35 ± 0.72 | 90.74 ± 0.68 | 95.14 ± 0.62 | 94.64 ± 0.72 |


Table 2. Ablation study results (%, mean ± std over 5 seeds) on CWRU and PU datasets. Each row removes one component from the full ProtoFed framework.

| Variant | CWRU 5-Shot Acc | CWRU 5-Shot F1 | CWRU 10-Shot Acc | CWRU 10-Shot F1 | PU 5-Shot Acc | PU 5-Shot F1 | PU 10-Shot Acc | PU 10-Shot F1 |
|---|---|---|---|---|---|---|---|---|
| ProtoFed (Full) | 95.63 ± 0.68 | 95.14 ± 0.72 | 97.82 ± 0.52 | 97.43 ± 0.48 | 91.35 ± 0.72 | 90.74 ± 0.68 | 95.14 ± 0.62 | 94.64 ± 0.72 |
| w/o GPC | 91.78 ± 0.92 | 91.24 ± 0.85 | 95.08 ± 0.78 | 94.56 ± 0.72 | 86.64 ± 0.98 | 86.08 ± 0.92 | 91.52 ± 0.85 | 90.94 ± 0.78 |
| w/o PDAA | 92.65 ± 0.85 | 92.18 ± 0.78 | 95.82 ± 0.72 | 95.34 ± 0.68 | 87.86 ± 0.92 | 87.24 ± 0.85 | 92.36 ± 0.82 | 91.82 ± 0.75 |
| w/o CWT | 89.48 ± 1.05 | 88.86 ± 0.98 | 93.42 ± 0.92 | 92.85 ± 0.85 | 84.12 ± 1.15 | 83.48 ± 1.08 | 89.76 ± 0.98 | 89.14 ± 0.92 |
| w/o Calibration | 90.32 ± 0.98 | 89.75 ± 0.92 | 94.18 ± 0.82 | 93.65 ± 0.78 | 85.58 ± 1.08 | 84.92 ± 1.02 | 90.62 ± 0.92 | 90.02 ± 0.85 |


Table 3. Sensitivity to PDAA temperature τ (%, 5-shot, mean ± std over 5 seeds).

| τ | CWRU Acc | CWRU F1 | PU Acc | PU F1 |
|---|---|---|---|---|
| 0.1 | 93.28 ± 0.82 | 92.74 ± 0.88 | 88.72 ± 0.95 | 88.14 ± 0.92 |
| 0.5 | 94.86 ± 0.72 | 94.32 ± 0.78 | 90.48 ± 0.82 | 89.86 ± 0.78 |
| 1.0 | 95.63 ± 0.68 | 95.14 ± 0.72 | 91.35 ± 0.72 | 90.74 ± 0.68 |
| 2.0 | 94.42 ± 0.78 | 93.86 ± 0.82 | 89.86 ± 0.88 | 89.24 ± 0.85 |
| 5.0 | 92.85 ± 0.92 | 92.28 ± 0.88 | 87.94 ± 1.02 | 87.32 ± 0.98 |


Table 4. Accuracy (%) on CWRU (5-shot) for varying λ and β. The best result is highlighted in bold.

| β | λ = 0.1 | λ = 0.3 | λ = 0.5 | λ = 0.7 | λ = 1.0 |
|---|---|---|---|---|---|
| 0.70 | 91.24 | 93.18 | 93.82 | 93.14 | 91.68 |
| 0.80 | 92.06 | 94.28 | 94.86 | 94.32 | 92.54 |
| 0.90 | 92.84 | 94.92 | 95.63 | 95.08 | 93.36 |
| 0.95 | 92.18 | 94.45 | 95.12 | 94.76 | 93.08 |
| 0.99 | 90.72 | 93.24 | 94.18 | 93.82 | 92.14 |


Table 5. Robustness to a noisy client (%, 5-shot, 20% label noise on one client).

| Method | CWRU Acc | CWRU F1 | PU Acc | PU F1 |
|---|---|---|---|---|
| FedAvg + CNN | 68.45 ± 1.82 | 67.58 ± 1.78 | 61.38 ± 2.05 | 60.42 ± 1.98 |
| REFML | 87.62 ± 1.12 | 87.08 ± 1.05 | 83.18 ± 1.22 | 82.54 ± 1.15 |
| ProtoFed w/o PDAA | 90.24 ± 0.98 | 89.68 ± 0.92 | 84.86 ± 1.08 | 84.22 ± 1.02 |
| ProtoFed | 94.18 ± 0.78 | 93.62 ± 0.82 | 89.72 ± 0.85 | 89.08 ± 0.82 |


Table 6. Leave-one-load-out generalization on CWRU (%, 5-shot, mean ± std).

| Held-Out Load | ProtoFed | REFML | FedSA | FedAvg |
|---|---|---|---|---|
| 0 HP | 93.82 ± 0.82 | 88.56 ± 1.08 | 87.24 ± 1.15 | 69.18 ± 1.72 |
| 1 HP | 94.28 ± 0.78 | 89.14 ± 1.02 | 87.86 ± 1.12 | 70.42 ± 1.68 |
| 2 HP | 93.46 ± 0.85 | 87.92 ± 1.12 | 86.58 ± 1.18 | 68.24 ± 1.78 |
| 3 HP | 92.14 ± 0.92 | 86.48 ± 1.18 | 85.12 ± 1.25 | 66.86 ± 1.85 |
| Mean | 93.43 ± 0.84 | 88.03 ± 1.10 | 86.70 ± 1.18 | 68.68 ± 1.76 |


Table 7. Noise robustness under varying SNR (%, 5-shot, mean accuracy over 5 seeds; standard deviations omitted for readability).

| Dataset | Method | SNR −2 dB | SNR 0 dB | SNR 2 dB | SNR 6 dB | SNR 10 dB | |
|---|---|---|---|---|---|---|---|
| CWRU | FedSA | 79.86 | 82.54 | 85.28 | 89.42 | 89.94 | 90.18 |
| CWRU | REFML | 81.24 | 84.18 | 87.52 | 90.64 | 91.12 | 91.34 |
| CWRU | ProtoFed | 88.42 | 90.86 | 93.14 | 95.08 | 95.52 | 95.63 |
| PU | FedSA | 74.86 | 77.82 | 81.48 | 85.14 | 85.86 | 86.08 |
| PU | REFML | 76.42 | 79.36 | 83.24 | 86.58 | 87.24 | 87.48 |
| PU | ProtoFed | 83.28 | 85.94 | 88.62 | 90.82 | 91.18 | 91.35 |


Table 8. Communication cost comparison per client per round under the default configuration (C = 4 classes, d = 256 embedding dimension, 4-block CNN backbone with |θ| ≈ 0.26 M parameters).

| Method | Upload | Download | Total |
|---|---|---|---|
| FedAvg/FedProx | \|θ\| (1.04 MB) | \|θ\| (1.04 MB) | 2.08 MB |
| FedProto | C × d (4 KB) | C × d (4 KB) | 8 KB |
| FedSA | \|θ\| + C × d (1.04 MB) | \|θ\| + C × d (1.04 MB) | 2.09 MB |
| ProtoFed | \|θ\| + C × d (1.04 MB) | \|θ\| + C × d (1.04 MB) | 2.09 MB |

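
The payload sizes in Table 8 can be reproduced with a short calculation. The sketch below assumes 32-bit floats (4 bytes per value), which the table caption does not state explicitly, and rounds |θ| to 260,000 parameters; the constant and function names are illustrative.

```python
def payload_bytes(num_values, bytes_per_value=4):
    """Size of a float32 payload; 4 bytes per value is an assumption."""
    return num_values * bytes_per_value


NUM_PARAMS = 260_000   # |theta| of roughly 0.26 M, rounded as in Table 8
NUM_CLASSES = 4        # C
EMBED_DIM = 256        # d

model_bytes = payload_bytes(NUM_PARAMS)
proto_bytes = payload_bytes(NUM_CLASSES * EMBED_DIM)

# Upload plus download per client per round.
costs = {
    "FedAvg/FedProx": 2 * model_bytes,
    "FedProto": 2 * proto_bytes,
    "ProtoFed": 2 * (model_bytes + proto_bytes),
}

for name, total in costs.items():
    print(f"{name}: {total / 1e6:.3f} MB per client per round")
```

Under these assumptions the prototype payload adds about 4 KB in each direction on top of the model parameters, so ProtoFed's per-round traffic is less than 1% higher than FedAvg's.
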

Table 9. Accuracy under differential privacy (%, 5-shot, mean ± std over 5 seeds).

| Privacy Budget ε | CWRU Acc | CWRU F1 | PU Acc | PU F1 |
|---|---|---|---|---|
| 1 | 92.14 ± 0.98 | 91.58 ± 1.02 | 87.42 ± 1.12 | 86.78 ± 1.08 |
| 5 | 94.86 ± 0.75 | 94.32 ± 0.78 | 90.52 ± 0.82 | 89.88 ± 0.78 |
| 10 | 95.38 ± 0.72 | 94.86 ± 0.75 | 91.08 ± 0.78 | 90.46 ± 0.72 |
| ∞ (no DP) | 95.63 ± 0.68 | 95.14 ± 0.72 | 91.35 ± 0.72 | 90.74 ± 0.68 |


Table 10. Scalability analysis (%, mean ± std over 5 seeds) under the 5-shot setting with varying numbers of clients K. ProtoFed exhibits graceful degradation as the number of clients increases, maintaining a substantial advantage over baselines.

| K | Method | CWRU Acc | CWRU F1 | PU Acc | PU F1 |
|---|---|---|---|---|---|
| 4 | FedAvg + CNN | 72.41 ± 1.54 | 71.68 ± 1.61 | 65.73 ± 1.78 | 64.82 ± 1.92 |
| 4 | REFML | 91.34 ± 0.85 | 90.82 ± 0.78 | 87.48 ± 0.92 | 86.78 ± 0.85 |
| 4 | ProtoFed | 95.63 ± 0.68 | 95.14 ± 0.72 | 91.35 ± 0.72 | 90.74 ± 0.68 |
| 8 | FedAvg + CNN | 70.54 ± 1.68 | 69.78 ± 1.72 | 63.58 ± 1.94 | 62.71 ± 1.88 |
| 8 | REFML | 89.72 ± 0.98 | 89.14 ± 0.92 | 85.91 ± 1.05 | 85.28 ± 0.95 |
| 8 | ProtoFed | 95.12 ± 0.72 | 94.58 ± 0.68 | 90.82 ± 0.78 | 90.15 ± 0.72 |
| 12 | FedAvg + CNN | 67.83 ± 1.95 | 67.05 ± 1.88 | 60.84 ± 2.18 | 59.92 ± 2.05 |
| 12 | REFML | 87.45 ± 1.08 | 86.78 ± 1.02 | 83.42 ± 1.15 | 82.75 ± 1.08 |
| 12 | ProtoFed | 93.86 ± 0.85 | 93.18 ± 0.78 | 89.53 ± 0.92 | 88.82 ± 0.85 |
| 16 | FedAvg + CNN | 65.92 ± 2.12 | 65.14 ± 2.05 | 59.15 ± 2.28 | 58.28 ± 2.15 |
| 16 | REFML | 86.18 ± 1.18 | 85.45 ± 1.12 | 81.86 ± 1.25 | 81.12 ± 1.15 |
| 16 | ProtoFed | 93.24 ± 0.92 | 92.62 ± 0.85 | 88.71 ± 0.98 | 88.02 ± 0.92 |

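
The degradation figures discussed in Section 4.2 follow directly from the K = 4 and K = 16 rows of Table 10. The short script below re-derives them as absolute differences in percentage points; the dictionary layout and the helper name `accuracy_drop` are illustrative choices.

```python
# Mean accuracy (%) at K = 4 and K = 16, copied from Table 10.
ACCURACY = {
    "ProtoFed": {"CWRU": (95.63, 93.24), "PU": (91.35, 88.71)},
    "REFML": {"CWRU": (91.34, 86.18), "PU": (87.48, 81.86)},
    "FedAvg + CNN": {"CWRU": (72.41, 65.92), "PU": (65.73, 59.15)},
}


def accuracy_drop(at_k4, at_k16):
    """Absolute drop in percentage points, rounded to two decimals."""
    return round(at_k4 - at_k16, 2)


drops = {
    method: {dataset: accuracy_drop(*pair) for dataset, pair in per_dataset.items()}
    for method, per_dataset in ACCURACY.items()
}

for method, per_dataset in drops.items():
    print(method, per_dataset)
```

These are differences between mean accuracies only; the standard deviations reported in Table 10 are not propagated here.
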
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Jin, Y.; Luo, Y.; Liu, X.; Fan, Y.; Shi, J. ProtoFed: Prototype-Enhanced Federated Meta-Learning for Few-Shot Rolling Bearing Fault Diagnosis. Appl. Sci. 2026, 16, 5277. https://doi.org/10.3390/app16115277
