1. Introduction
Rolling bearings are among the most critical components in rotating machinery, and their operational health directly determines the safety and reliability of industrial systems such as wind turbines, aero-engines, and manufacturing spindles [1]. Unexpected bearing failures can result in costly unplanned downtime, equipment damage, and even catastrophic safety incidents. Over the past decade, data-driven fault diagnosis methods, particularly those based on deep learning, have demonstrated remarkable capability in automatically extracting discriminative features from raw vibration signals, substantially outperforming traditional approaches that rely on hand-crafted time- or frequency-domain features [2].
However, the success of deep learning is predicated on the availability of large volumes of labeled training data. In practice, acquiring sufficient fault samples is often infeasible: newly deployed equipment may operate normally for extended periods before faults manifest, rare fault modes occur infrequently, and expert annotation is both expensive and time-consuming [3]. Few-shot learning addresses this data scarcity by enabling rapid adaptation to novel fault categories with only a handful of labeled examples [4, 5]. Nevertheless, existing few-shot fault diagnosis methods assume centralized data access, requiring all training samples to be aggregated at a single site, an assumption that conflicts with the privacy regulations, intellectual property concerns, and data governance policies prevalent in modern industrial environments [6]. Federated learning (FL) offers a natural remedy by enabling collaborative model training across distributed clients without sharing raw data [7, 8].
While FL has been actively explored for fault diagnosis [9, 10, 11], existing methods uniformly assume that each client possesses sufficient labeled samples. When only a few labeled examples per fault category are available, as is common for newly commissioned equipment or rare failure modes, standard FL approaches degrade sharply. Recent attempts to integrate meta-learning into federated fault diagnosis, such as FedMeta-FFD [12] and REFML [13], have achieved promising results but share a critical limitation: the absence of explicit cross-client prototype alignment. Under heterogeneous operating conditions, the feature distributions of the same fault category can differ substantially across clients, causing the globally aggregated model to suffer from prototype drift and degraded generalization [14].
We observe that class prototypes (the mean feature representations of each category) offer a natural bridge between local few-shot learning and global federated collaboration. Prototypes are compact, aligned with the metric-based classification paradigm, and expose only class-level aggregated statistics rather than raw data. However, existing prototype-based FL methods such as FedProto [15] and FedSA [16] are designed for data-rich supervised settings and do not address the amplified prototype drift that arises when each client holds only a handful of labeled samples under distinct operating conditions. To this end, we propose ProtoFed, a prototype-enhanced federated meta-learning framework specifically designed for few-shot rolling bearing fault diagnosis.
Before presenting our specific contributions, we review the closely related literature on federated fault diagnosis, few-shot fault diagnosis, and prototype-based federated learning to contextualize the technical gaps that motivate ProtoFed.
1.1. Federated Learning for Fault Diagnosis
Federated learning has attracted considerable attention in the fault diagnosis community as it naturally addresses data isolation and privacy constraints in industrial environments [9]. Zhang et al. [10] proposed a federated framework incorporating self-supervised learning and dynamic validation to improve diagnostic performance across heterogeneous clients, while FedCAE [17] adopts a two-stage paradigm in which edge clients extract features via federated convolutional autoencoders before a centralized classifier is trained on the aggregated representations.
A major challenge in federated fault diagnosis is the non-IID nature of data across clients, arising from varying operating conditions, sensor configurations, and equipment types [18]. To mitigate this heterogeneity, Zhang et al. [19] proposed similarity-based collaborative aggregation, Zhang and Li [20] introduced adversarial alignment into federated transfer learning, and Zhao and Shen [11] developed a federated domain generalization approach that achieves cross-condition diagnosis without access to target domain data. More recently, FedRad [21] introduced dynamic aggregation based on Rademacher complexity to improve generalization under non-IID settings.
Despite these advances, existing FL-based fault diagnosis methods uniformly assume sufficient labeled samples on each client—an assumption frequently violated for newly deployed equipment or rare fault types. Maintaining diagnostic accuracy under such data-scarce federated scenarios remains an open challenge.
1.2. Few-Shot Fault Diagnosis
Few-shot fault diagnosis aims to achieve reliable fault identification with only a limited number of labeled samples per category, a common scenario in industrial settings where new fault modes emerge infrequently and annotation requires domain expertise. Existing approaches fall into two main paradigms. In the optimization-based camp, Zhang et al. [2] applied MAML to bearing fault diagnosis, demonstrating that a well-initialized model can rapidly adapt to new fault categories with minimal data. Subsequent work has focused on improving training stability through task-sequencing [22] and adaptive learning rates [23].
In the metric-based paradigm, models learn an embedding space where classification relies on comparing query samples against class-representative features. Wang et al. [5] developed a metric-based meta-learning model achieving competitive few-shot performance under multiple limited-data conditions. Prototypical networks [4], which classify by distance to class-mean embeddings, have become a widely adopted baseline due to their simplicity and effectiveness. Recently, Rezazadeh et al. [24] proposed a prototype-attention domain adaptation framework for explainable bearing fault diagnosis, demonstrating that prototype-based class alignment combined with attention-driven sample weighting can achieve strong cross-domain generalization and interpretability on the CWRU and Paderborn benchmarks; however, their framework operates in a centralized single-source-to-target paradigm and does not address the multi-client federated scenario. To handle cross-domain scenarios, Wu et al. [3] combined few-shot learning with transfer learning, Lei et al. [25] embedded prior domain knowledge into meta-transfer learning, and Li et al. [26] incorporated attention mechanisms for fine-grained discrimination. Li et al. [27] specifically investigated meta-learning under complex operating conditions, demonstrating improved robustness with diverse task distributions. A comprehensive survey by Liang et al. [1] summarizes recent trends including GAN-based augmentation, enhanced feature extractors, and novel classifier architectures.
Despite their success, these methods all operate under a centralized paradigm, implicitly assuming all training data can be collected at a single site—conflicting with the data-sharing restrictions of distributed industrial systems.
1.3. Federated Few-Shot Learning and Prototype-Based FL
Federated few-shot learning (FFSL) lies at the intersection of the two aforementioned research directions, aiming to learn models that generalize from limited labeled data across privacy-preserving distributed clients. Zhao et al. [28] formalized the personalized FFSL problem and proposed a framework that decouples global and local feature learning to handle non-IID few-shot tasks in cross-silo settings. Wang et al. [29] introduced a general FFSL framework at the task level, enabling federated aggregation of meta-knowledge extracted from episodic training on each client. Other approaches tackle FFSL from complementary angles: Huang et al. [30] developed a model-agnostic federated few-shot method, Yang et al. [14] proposed grouping clients by distribution similarity and performing intra-group meta-learning, and Tian et al. [31] combined personalized knowledge distillation with few-shot federated learning to improve performance on small-scale benchmarks.
In the specific domain of fault diagnosis, only a few works have explored the FFSL paradigm. FedMeta-FFD [12] trains a global meta-learner on the server side and distributes it to clients for rapid local adaptation with minimal labeled data, with an auxiliary adaptive learning rate module to enhance convergence stability. REFML [13] addresses both domain discrepancy and data scarcity by combining a shared encoder with meta-updated predictors and an adaptive interpolation mechanism that balances global generalization and local personalization. While these methods represent important progress, neither incorporates explicit cross-client prototype alignment, leaving the prototype drift problem (where the same fault category develops divergent representations under different operating conditions) unaddressed.
In parallel, prototype-based federated learning has emerged as a promising direction for handling data heterogeneity in general classification tasks. FedProto [15] pioneered the communication of class prototypes instead of model parameters, enabling heterogeneous model architectures across clients while maintaining non-IID tolerance. FedProc [32] introduced a global prototypical contrastive loss that pulls local features toward their corresponding global prototypes. More recently, FedSA [16] proposed semantic anchors as unified prototypes to break the vicious cycle of representation inconsistency and classifier bias, and FedDP [33] learned domain-invariant prototypes through an information bottleneck to align both representation and parameter spaces across clients with feature shift.
However, these prototype-based FL methods are designed for standard supervised settings with abundant labeled data, where local prototypes computed from hundreds of samples per class are statistically stable. In the few-shot regime, local prototypes derived from only a handful of samples are highly sensitive to operating condition noise, amplifying prototype drift beyond what these methods were designed to handle. ProtoFed bridges this gap by integrating prototypical metric learning with global prototype calibration and distance-aware aggregation, specifically tailored for few-shot bearing fault diagnosis under federated heterogeneity.
While the individual building blocks of ProtoFed (the continuous wavelet transform (CWT), prototypical networks, exponential moving average (EMA) smoothing, and distance-based weighting) are well-established, the principal novelty lies not in their independent use but in their integration and in the mechanisms that emerge from it. Specifically, Global Prototype Calibration (GPC) transforms per-episode local prototypes, which are inherently noisy in the few-shot regime, into temporally smoothed global anchors that no single client could produce on its own. These calibrated anchors, in turn, enable Prototype-Distance Aware Aggregation (PDAA) to compute geometrically meaningful client weights in the prototype space, a signal that is far more stable than the gradient- or parameter-level similarity used by prior personalized FL methods when local datasets contain only a handful of samples. The calibration loss closes the loop by pulling local embeddings toward the global anchors during training, creating a bidirectional alignment mechanism that jointly reduces prototype drift and improves aggregation quality across rounds. This tightly coupled prototype–aggregation–regularization loop constitutes the core methodological contribution, and its effectiveness is validated by the consistently smaller centralized-to-federated performance gap observed across all experimental conditions.
The main contributions of this paper are summarized as follows:
We propose ProtoFed, a prototype-enhanced federated meta-learning framework for few-shot rolling bearing fault diagnosis that enables collaborative diagnosis across distributed clients without sharing raw vibration data. The framework introduces a tightly coupled prototype–aggregation–regularization loop that jointly addresses prototype drift and client heterogeneity—mechanisms absent in existing federated meta-learning methods.
We design a Global Prototype Calibration (GPC) mechanism that constructs stable global class prototypes through cross-client aggregation and EMA temporal smoothing, providing condition-invariant classification anchors that mitigate the amplified prototype drift inherent in few-shot federated settings.
We propose a Prototype-Distance Aware Aggregation (PDAA) strategy that adaptively weights client models based on local–global prototype divergence, reducing the negative influence of poorly aligned local representations and providing inherent robustness to noisy or outlier clients.
We conduct extensive experiments on the CWRU and Paderborn University bearing datasets under 5-shot and 10-shot non-IID settings. Comprehensive evaluations including cross-condition generalization, noise robustness, hyperparameter sensitivity, statistical significance tests, and differential privacy analysis demonstrate the effectiveness, robustness, and practical viability of ProtoFed.
2. Methodology
2.1. Problem Formulation
We consider a federated few-shot fault diagnosis scenario involving a central server and $K$ geographically distributed clients indexed by $k \in \{1, \dots, K\}$, where each client operates under distinct working conditions (e.g., different loads, rotational speeds, or environmental settings). Each client $k$ possesses a local dataset $\mathcal{D}_k = \{(x_i^k, y_i^k)\}_{i=1}^{N_k}$, where $x_i^k \in \mathbb{R}^{L}$ denotes a one-dimensional vibration signal of length $L$ and $y_i^k \in \{1, \dots, C\}$ is the corresponding fault category label. The key constraint is that each client has access to only a few labeled samples per class: specifically, each client holds at most $S$ samples per category, where $S$ is small (e.g., $S = 5$ or $S = 10$). Raw data cannot be shared across clients or uploaded to the server due to privacy and confidentiality requirements.
Following the episodic meta-learning paradigm, the local training on each client is organized into episodes. In each episode, a $C$-way $S$-shot task is constructed by sampling a support set $\mathcal{S} = \{(x_i, y_i)\}_{i=1}^{C \times S}$ and a query set $\mathcal{Q} = \{(x_j, y_j)\}_{j=1}^{C \times Q}$ from $\mathcal{D}_k$, where $Q$ denotes the number of query samples per class. The support set is used to build class representations, and the query set is used to evaluate and update the model parameters.
The objective of ProtoFed is to collaboratively learn, across all $K$ clients, a shared feature extractor $f_\theta$ that maps time–frequency representations of vibration signals into a $d$-dimensional embedding space, such that the resulting model generalizes well to few-shot fault classification tasks under previously unseen operating conditions, without requiring any client to share its raw data.
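The episodic sampling described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function name `sample_episode`, the toy dataset, and the values chosen for `n_way`, `n_shot`, and `n_query` are assumptions made for the example.

```python
import random
from collections import defaultdict


def sample_episode(dataset, n_way, n_shot, n_query, rng):
    """Draw one C-way S-shot episode from a list of (sample, label) pairs.

    Returns (support, query); the two sets share no samples, and every
    chosen class contributes n_shot support and n_query query samples.
    """
    by_class = defaultdict(list)
    for sample, label in dataset:
        by_class[label].append(sample)

    # Only classes with enough samples can supply both sets.
    eligible = [c for c, xs in by_class.items() if len(xs) >= n_shot + n_query]
    if len(eligible) < n_way:
        raise ValueError("not enough classes with n_shot + n_query samples")

    support, query = [], []
    for c in rng.sample(sorted(eligible), n_way):
        picked = rng.sample(by_class[c], n_shot + n_query)
        support += [(x, c) for x in picked[:n_shot]]
        query += [(x, c) for x in picked[n_shot:]]
    return support, query


# Toy local dataset (hypothetical): 4 classes, 20 placeholder segments each.
toy = [(f"seg_{c}_{i}", c) for c in range(4) for i in range(20)]
support, query = sample_episode(
    toy, n_way=4, n_shot=5, n_query=10, rng=random.Random(0)
)
```

Drawing the support and query samples of a class from a single shuffled pick keeps the two sets disjoint by construction.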
2.2. Framework Overview
The overall architecture of ProtoFed is illustrated in Figure 1. The framework operates through iterative communication rounds between the server and $K$ clients. In each round, clients first transform raw vibration signals into time–frequency representations via CWT, then perform episodic meta-training with prototypical networks to learn class prototypes and update the local model. Each client uploads its round-level local prototypes and model parameters to the server, where GPC aggregates them into unified global prototypes and PDAA computes adaptive aggregation weights for the global model update. The updated global model and global prototypes are then broadcast back to all clients. Throughout this process, only model parameters and class-level prototypes are communicated; no raw data or individual sample features are transmitted.
2.3. Time–Frequency Feature Extraction via CWT
Raw one-dimensional vibration signals contain rich fault-related information but suffer from noise contamination and lack explicit time–frequency structure. To obtain a more informative input representation, we employ the continuous wavelet transform (CWT) to convert each raw signal $x(t)$ into a two-dimensional time–frequency map. The CWT of a signal $x(t)$ with respect to a mother wavelet $\psi(t)$ is defined as:
$$W_x(a, b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{+\infty} x(t)\, \psi^{*}\!\left(\frac{t - b}{a}\right) dt \tag{1}$$
where $a > 0$ is the scale parameter controlling frequency resolution, $b$ is the translation parameter controlling temporal localization, and $\psi^{*}(\cdot)$ denotes the complex conjugate of the mother wavelet. In this work, we adopt the Morlet wavelet as the mother wavelet due to its good balance between time and frequency resolution:
$$\psi(t) = \pi^{-1/4}\, e^{j \omega_0 t}\, e^{-t^{2}/2} \tag{2}$$
where $\omega_0$
is the central frequency, typically set to 6. In our implementation, the complete signal-to-scalogram pipeline proceeds as follows. Each raw vibration signal (at 12 kHz for both datasets after downsampling PU) is first segmented into non-overlapping windows of length
samples. For each segment, the CWT is computed using the Morlet wavelet ($\omega_0 = 6$) over 64 logarithmically spaced scales in the range
. The scale bounds are determined by two physical constraints: the upper bound ensures that the effective wavelet support (≈
samples at
) remains within the segment length, while the lower bound avoids pseudo-frequencies above the Nyquist limit. By the Morlet pseudo-frequency relation
(where
), this scale range corresponds to an approximate frequency band of
Hz at
kHz, covering the characteristic bearing fault frequencies (ball pass frequencies, cage frequencies) for the rotational speeds in both datasets. The absolute values of the resulting complex-valued CWT coefficient matrix are computed to obtain the scalogram magnitude. The scalogram is then resized to a fixed spatial resolution of
through bilinear interpolation and normalized to the range $[0, 1]$ via per-sample min-max normalization. The resulting time–frequency representation $X$ is treated as a single-channel grayscale image that serves as input to the feature extractor. This transformation preserves both temporal dynamics and frequency characteristics of bearing vibration signals, enabling the convolutional network to exploit two-dimensional spatial patterns associated with different fault types. The complete preprocessing pipeline is illustrated in Figure 2, and the preprocessing scripts will be publicly released upon acceptance to ensure full reproducibility.
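The scale grid and the final normalization step of this pipeline can be illustrated with a short Python sketch. It does not compute the wavelet transform itself, which would normally be delegated to a wavelet library. The scale bounds `A_MIN` and `A_MAX` are hypothetical placeholders rather than the values used in the paper, and the scale-to-frequency conversion uses the common Morlet approximation $f \approx \omega_0 f_s / (2\pi a)$ with $a$ measured in samples.

```python
import math


def log_spaced_scales(a_min, a_max, n_scales):
    """Return n_scales values spaced geometrically from a_min to a_max."""
    ratio = (a_max / a_min) ** (1.0 / (n_scales - 1))
    return [a_min * ratio ** i for i in range(n_scales)]


def morlet_pseudo_frequency(scale, fs, omega0=6.0):
    """Approximate centre frequency (Hz) of a Morlet wavelet at a scale in samples."""
    return omega0 * fs / (2.0 * math.pi * scale)


def min_max_normalize(rows):
    """Rescale a 2-D list of magnitudes to [0, 1]; a constant map becomes all zeros."""
    lo = min(min(r) for r in rows)
    hi = max(max(r) for r in rows)
    span = hi - lo
    if span == 0:
        return [[0.0 for _ in r] for r in rows]
    return [[(v - lo) / span for v in r] for r in rows]


FS = 12_000.0                # sampling rate stated in the text
A_MIN, A_MAX = 2.0, 128.0    # hypothetical scale bounds, for illustration only

scales = log_spaced_scales(A_MIN, A_MAX, 64)
f_high = morlet_pseudo_frequency(scales[0], FS)    # smallest scale, highest frequency
f_low = morlet_pseudo_frequency(scales[-1], FS)    # largest scale, lowest frequency

# Tiny stand-in for a scalogram magnitude matrix.
normalized = min_max_normalize([[1.0, 3.0], [2.0, 5.0]])
```

With these placeholder bounds the smallest scale maps to a pseudo-frequency just below the 6 kHz Nyquist limit, which is the kind of constraint the lower scale bound is meant to enforce.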
2.4. Prototypical Network for Local Few-Shot Classification
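On each client $k$, the time–frequency representations are fed into a convolutional feature extractor $f_\theta$ that maps each input image $X$ to a $d$-dimensional embedding vector $\mathbf{z} = f_\theta(X) \in \mathbb{R}^{d}$. Given an episodic task with support set $\mathcal{S}$, the prototype for class $c$ is computed as the mean embedding of all support samples belonging to that class:
$$\mathbf{p}_c^{k,e} = \frac{1}{|\mathcal{S}_c|} \sum_{(X_i, y_i) \in \mathcal{S}_c} f_\theta(X_i) \tag{3}$$
where $\mathcal{S}_c$ denotes the subset of support samples from class $c$, and the superscripts $k$ and $e$ indicate that this prototype is computed locally on client $k$ at the $e$-th episode. Within a single client’s local training, the client index $k$ is fixed; we retain it for notational consistency with the subsequent federated aggregation formulations. For a query sample $X_q$ with embedding $\mathbf{z}_q = f_\theta(X_q)$, the classification probability over all $C$ classes is obtained via a softmax over the negative squared Euclidean distances:
$$p(y = c \mid X_q) = \frac{\exp\!\left(-\left\| \mathbf{z}_q - \mathbf{p}_c^{k,e} \right\|_2^{2}\right)}{\sum_{c'=1}^{C} \exp\!\left(-\left\| \mathbf{z}_q - \mathbf{p}_{c'}^{k,e} \right\|_2^{2}\right)} \tag{4}$$
The local prototypical classification loss on client $k$ is then defined as the negative log-probability of the correct class labels over all query samples:
$$\mathcal{L}_{\text{proto}}^{k} = -\frac{1}{|\mathcal{Q}|} \sum_{(X_q, y_q) \in \mathcal{Q}} \log p(y = y_q \mid X_q) \tag{5}$$
This metric-based classification paradigm is particularly well-suited for few-shot settings, as it avoids training a parametric classifier head and instead relies on the geometric structure of the embedding space.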
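To make the metric-based classification concrete, the sketch below computes class-mean prototypes, the softmax over negative squared Euclidean distances, and the resulting loss for hand-written two-dimensional embeddings. The embeddings stand in for the output of the feature extractor, the function names are illustrative, and averaging the loss over the query set is an assumption.

```python
import math


def class_prototypes(support):
    """Mean embedding per class from a list of (embedding, label) pairs."""
    sums, counts = {}, {}
    for z, c in support:
        if c not in sums:
            sums[c] = [0.0] * len(z)
            counts[c] = 0
        sums[c] = [s + v for s, v in zip(sums[c], z)]
        counts[c] += 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def class_probabilities(z_query, prototypes):
    """Softmax over negative squared distances to each class prototype."""
    logits = {c: -sq_dist(z_query, p) for c, p in prototypes.items()}
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {c: math.exp(v - m) for c, v in logits.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}


def proto_loss(query, prototypes):
    """Mean negative log-probability of the true class over the query set."""
    nll = [-math.log(class_probabilities(z, prototypes)[c]) for z, c in query]
    return sum(nll) / len(nll)


# Hand-written 2-D embeddings standing in for feature-extractor outputs.
support = [([0.0, 0.0], 0), ([0.0, 2.0], 0), ([4.0, 0.0], 1), ([4.0, 2.0], 1)]
query = [([0.5, 1.0], 0), ([3.5, 1.0], 1)]

protos = class_prototypes(support)
probs = class_probabilities(query[0][0], protos)
loss = proto_loss(query, protos)
```

In an actual training loop the same computation would be expressed in an automatic-differentiation framework so that the loss can be backpropagated through the feature extractor.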
2.5. Global Prototype Calibration (GPC)
When clients operate under heterogeneous conditions, the local prototypes for the same fault category can occupy different regions of the embedding space across clients, a phenomenon we refer to as prototype drift. As illustrated in Figure 3a, local prototypes of the same class diverge across clients due to differences in load, speed, and environmental factors, resulting in inconsistent class representations. If left unaddressed, this drift causes the globally aggregated model to learn an ambiguous decision boundary, degrading its generalization to unseen conditions. The Global Prototype Calibration mechanism is designed to mitigate this problem by producing unified global prototypes on the server side, as depicted in Figure 3b.
2.5.1. Round-Level Local Prototype
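Because a single episode uses only $S$ support samples per class, the per-episode prototype $\mathbf{p}_c^{k,e}$ is statistically noisy. To obtain a more reliable estimate of client $k$’s class representation at round $t$, we aggregate across all $E$ local episodes:
$$\bar{\mathbf{p}}_c^{k,t} = \frac{1}{E} \sum_{e=1}^{E} \mathbf{p}_c^{k,e} \tag{6}$$
where $\mathbf{p}_c^{k,e}$ denotes the per-episode prototype of class $c$ computed in the $e$-th local episode of round $t$. It is this round-level prototype $\bar{\mathbf{p}}_c^{k,t}$ that each client uploads to the server.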
2.5.2. Server-Side Aggregation
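At the end of each communication round $t$, every client $k$ uploads its set of round-level local prototypes $\{\bar{\mathbf{p}}_c^{k,t}\}_{c=1}^{C}$ to the server. For each class $c$, the server computes a global prototype $\hat{\mathbf{p}}_c^{t}$ as a weighted average of the corresponding local prototypes:
$$\hat{\mathbf{p}}_c^{t} = \sum_{k=1}^{K} w_c^{k}\, \bar{\mathbf{p}}_c^{k,t}, \qquad w_c^{k} = \frac{n_c^{k}}{\sum_{j=1}^{K} n_c^{j}} \tag{7}$$
where $n_c^{k}$ denotes the number of samples of class $c$ on client $k$, and $w_c^{k}$ is the corresponding weight. In the balanced few-shot setting where all clients have the same number of samples per class, this reduces to a simple arithmetic mean.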
2.5.3. Temporal Smoothing
To promote temporal stability and prevent abrupt shifts in the global prototypes across communication rounds, we apply an exponential moving average (EMA) update:
$$\mathbf{g}_c^{t} = \beta\, \mathbf{g}_c^{t-1} + (1 - \beta)\, \hat{\mathbf{p}}_c^{t} \tag{8}$$
where $\beta \in [0, 1)$ is the momentum coefficient. At the very first round ($t = 1$), no prior global prototypes exist; we therefore initialize $\mathbf{g}_c^{1} = \hat{\mathbf{p}}_c^{1}$ (i.e., the first-round aggregated prototype is used directly without smoothing) and set the calibration weight $\lambda = 0$ (defined in Section 2.7) in the local loss during round 1, so that the calibration regularization is inactive before meaningful global prototypes become available. From round $t = 2$ onward, the EMA update and the full training objective with $\lambda > 0$ are applied. The smoothed global prototypes $\{\mathbf{g}_c^{t}\}_{c=1}^{C}$ are then broadcast to all clients, serving as stable classification anchors for local training in the subsequent round.
By aggregating prototypes from clients with diverse operating conditions, GPC captures a condition-invariant class representation more robust than any single client’s local prototypes. Since only class-level mean vectors are communicated, the additional overhead is $C \times d$ floating-point values per client per round, which is negligible relative to the model parameter count.
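The two server-side steps of GPC, sample-count-weighted aggregation followed by EMA smoothing, can be sketched as follows. The function names, the dictionary-based data layout, and the convention that the momentum multiplies the previous global prototype are assumptions made for illustration, as is the value `beta=0.9`.

```python
def aggregate_prototypes(local_protos, class_counts):
    """Sample-count-weighted average of client prototypes for each class.

    local_protos[k][c] is client k's round-level prototype of class c;
    class_counts[k][c] is the number of class-c samples held by client k.
    """
    classes = set().union(*(p.keys() for p in local_protos))
    aggregated = {}
    for c in classes:
        holders = [k for k, p in enumerate(local_protos) if c in p]
        total = sum(class_counts[k][c] for k in holders)
        dim = len(local_protos[holders[0]][c])
        aggregated[c] = [
            sum(class_counts[k][c] * local_protos[k][c][i] for k in holders) / total
            for i in range(dim)
        ]
    return aggregated


def ema_update(prev_global, aggregated, beta):
    """Blend the previous global prototypes with the new aggregate.

    When prev_global is None (first round) the aggregate is copied unchanged.
    """
    if prev_global is None:
        return {c: list(v) for c, v in aggregated.items()}
    smoothed = {}
    for c, new in aggregated.items():
        old = prev_global.get(c, new)
        smoothed[c] = [beta * o + (1.0 - beta) * n for o, n in zip(old, new)]
    return smoothed


counts = [{0: 5}, {0: 5}]  # two clients, five samples of class 0 each

# Round 1: no previous global prototypes, so the aggregate is used directly.
round1 = aggregate_prototypes([{0: [0.0, 0.0]}, {0: [2.0, 2.0]}], counts)
g1 = ema_update(None, round1, beta=0.9)

# Round 2: the new aggregate is blended with the round-1 anchor.
round2 = aggregate_prototypes([{0: [2.0, 2.0]}, {0: [4.0, 4.0]}], counts)
g2 = ema_update(g1, round2, beta=0.9)
```

In this toy run the class-0 anchor moves from `[1.0, 1.0]` only a small step toward the new aggregate `[3.0, 3.0]`, which is the damping effect the temporal smoothing is intended to provide.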
2.6. Prototype-Distance Aware Aggregation (PDAA)
Standard federated aggregation (e.g., FedAvg) assigns client weights proportional to dataset sizes and therefore ignores the quality of each client’s learned representation. Under few-shot conditions, clients operating in extreme or atypical regimes may develop poorly aligned feature spaces, and giving them the same influence as well-aligned clients degrades the global model. PDAA addresses this by dynamically computing aggregation weights based on the alignment between each client’s local prototypes and the global prototypes. For client $k$ at round $t$, the average prototype distance is:
$$d_k^{t} = \frac{1}{C} \sum_{c=1}^{C} \left\| \bar{\mathbf{p}}_c^{k,t} - \mathbf{g}_c^{t} \right\|_2 \tag{9}$$
A smaller $d_k^{t}$ indicates that client $k$’s local representations are well-aligned with the global prototypes, suggesting that its local model has learned more generalizable features. The aggregation weight for client $k$ is then computed via a softmax with a temperature parameter $\tau$:
$$\alpha_k^{t} = \frac{\exp\!\left(-d_k^{t} / \tau\right)}{\sum_{j=1}^{K} \exp\!\left(-d_j^{t} / \tau\right)} \tag{10}$$
The global model parameters are then updated as:
$$\theta^{t} = \sum_{k=1}^{K} \alpha_k^{t}\, \theta_k^{t} \tag{11}$$
where $\theta_k^{t}$ denotes the local model parameters of client $k$ after round $t$. The temperature parameter $\tau$ controls the sharpness of the weight distribution: a small $\tau$ concentrates the weight on the best-aligned clients, whereas a large $\tau$ approaches uniform averaging.
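A minimal sketch of the PDAA weighting follows, assuming that the prototype distance is the plain Euclidean norm averaged over classes and that model parameters are flattened into a single vector per client. The function names and the toy values, including `tau=1.0`, are illustrative.

```python
import math


def mean_prototype_distance(local, global_protos):
    """Average Euclidean distance between local and global prototypes over shared classes."""
    shared = [c for c in local if c in global_protos]
    return sum(math.dist(local[c], global_protos[c]) for c in shared) / len(shared)


def pdaa_weights(distances, tau):
    """Softmax over negative distances: a smaller distance gives a larger weight."""
    logits = [-d / tau for d in distances]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_models(client_params, weights):
    """Weighted average of flat parameter vectors, one vector per client."""
    dim = len(client_params[0])
    return [sum(w * p[i] for w, p in zip(weights, client_params)) for i in range(dim)]


global_protos = {0: [0.0, 0.0]}
local_protos = [{0: [0.3, 0.4]}, {0: [3.0, 4.0]}]  # client 1 is far from the anchor

distances = [mean_prototype_distance(p, global_protos) for p in local_protos]
weights = pdaa_weights(distances, tau=1.0)
theta = aggregate_models([[1.0, 1.0], [3.0, 3.0]], weights)
```

Here the poorly aligned second client receives roughly 1% of the total weight, so the aggregated parameters stay close to those of the first client.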
Relation to Existing Similarity-Based Aggregation
PDAA belongs to the broader family of similarity-driven personalized FL aggregation schemes, which includes FedAMP [34], pFedMe [35], and the similarity-based collaboration framework of Zhang et al. [19]. The key distinctions of PDAA are as follows. First, prior methods compute client similarity from model parameters or gradient directions, which are high-dimensional and noisy under few-shot training; PDAA instead uses the class-level prototype distance $d_k^{t}$, a low-dimensional quantity computed directly in the embedding space that ProtoFed seeks to align. Second, PDAA is computed against a calibrated global anchor produced by GPC rather than through pairwise client comparisons, which reduces the $O(K^{2})$ pairwise similarity computation to $O(K)$ and makes the weighting self-consistent with the calibration target. These differences make PDAA particularly well-suited to few-shot settings where parameter-level similarity signals are unstable.
2.7. Episodic Meta-Training with Calibration Regularization
At the beginning of each communication round, client $k$ downloads the global model parameters $\theta^{t-1}$ and global prototypes $\{\mathbf{g}_c^{t-1}\}_{c=1}^{C}$ from the server, then performs $E$ episodic meta-training steps. In each episode, a $C$-way $S$-shot task is sampled from $\mathcal{D}_k$ and the prototypical classification loss $\mathcal{L}_{\text{proto}}^{k}$ is computed via Equation (5). To regularize local training toward the global consensus, we introduce a calibration loss penalizing the discrepancy between per-episode local prototypes and global prototypes:
$$\mathcal{L}_{\text{cal}}^{k} = \frac{1}{C} \sum_{c=1}^{C} \left\| \mathbf{p}_c^{k,e} - \mathbf{g}_c^{t-1} \right\|_2^{2} \tag{12}$$
The total loss for each episode on client $k$ is:
$$\mathcal{L}^{k} = \mathcal{L}_{\text{proto}}^{k} + \lambda\, \mathcal{L}_{\text{cal}}^{k} \tag{13}$$
where $\lambda \geq 0$ is a hyperparameter that controls the strength of the calibration regularization. When $\lambda = 0$, the training reduces to standard local prototypical network training without global guidance. As $\lambda$ increases, the local prototypes are pulled more strongly toward the global prototypes, promoting cross-client consistency at the potential cost of local adaptability.
This regularization mechanism creates a bidirectional alignment loop within ProtoFed: the global prototypes guide local training through the calibration loss (top-down), while the locally updated prototypes contribute to refining the global prototypes in the next round through GPC (bottom-up). This iterative process progressively reduces prototype drift across clients and drives them toward a shared, condition-invariant embedding space.
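The calibration term and the combined per-episode objective can be illustrated numerically. The sketch assumes the calibration loss is the squared Euclidean distance between local and global prototypes averaged over classes; the function names and the values of `lam` and the classification loss are hypothetical.

```python
def calibration_loss(local_protos, global_protos):
    """Mean squared Euclidean distance between local and global prototypes."""
    shared = [c for c in local_protos if c in global_protos]
    if not shared:
        return 0.0
    total = sum(
        sum((a - b) ** 2 for a, b in zip(local_protos[c], global_protos[c]))
        for c in shared
    )
    return total / len(shared)


def episode_loss(proto_loss, local_protos, global_protos, lam):
    """Classification loss plus the lam-weighted calibration term.

    With global_protos=None (first round) or lam == 0 only the
    classification loss remains.
    """
    if global_protos is None or lam == 0:
        return proto_loss
    return proto_loss + lam * calibration_loss(local_protos, global_protos)


local = {0: [1.0, 0.0], 1: [0.0, 1.0]}
anchors = {0: [0.0, 0.0], 1: [0.0, 0.0]}

cal = calibration_loss(local, anchors)                      # (1 + 1) / 2
total = episode_loss(0.5, local, anchors, lam=0.1)          # calibrated rounds
first_round = episode_loss(0.5, local, None, lam=0.1)       # no anchors yet
```

The `None` branch mirrors the first-round behaviour described in the text, where the calibration term is switched off until global prototypes exist.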
2.8. Overall Algorithm
The complete training procedure of ProtoFed is summarized in Algorithm 1.
The computational complexity of ProtoFed on each client is dominated by the episodic meta-training, which is comparable to standard prototypical network training. The additional overhead introduced by GPC and PDAA is minimal: GPC requires aggregating $K \times C$ prototype vectors of dimension $d$, and PDAA computes $K$ scalar distances followed by a softmax normalization. The communication cost per round is $O(|\theta| + Cd)$ per client, where $|\theta|$ is the model parameter count and $Cd$ accounts for the uploaded prototypes. Since $Cd \ll |\theta|$ in practice, the prototype communication overhead is negligible relative to the standard model parameter exchange in FL.
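A back-of-the-envelope calculation shows how small the prototype payload is compared with the model itself. The sketch counts the parameters of a four-block, 64-filter convolutional backbone with batch normalization, assuming 3×3 kernels and a single input channel; the class count `C = 10` and embedding size `d = 64` are hypothetical values chosen only to make the comparison concrete.

```python
def conv_block_params(in_ch, out_ch, kernel=3):
    """Conv weights and bias, plus batch-norm scale and shift."""
    conv = in_ch * out_ch * kernel * kernel + out_ch
    batch_norm = 2 * out_ch
    return conv + batch_norm


def backbone_params(n_blocks=4, in_ch=1, width=64, kernel=3):
    """Total trainable parameters of a stack of identical-width conv blocks."""
    total, ch = 0, in_ch
    for _ in range(n_blocks):
        total += conv_block_params(ch, width, kernel)
        ch = width
    return total


C, d = 10, 64                   # hypothetical class count and embedding size
n_params = backbone_params()    # model values exchanged per round
n_proto_values = C * d          # prototype values exchanged per round
ratio = n_proto_values / n_params
```

Under these assumptions the prototypes add well under 1% to the per-round payload.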
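Algorithm 1 ProtoFed: prototype-enhanced federated meta-learning
Require: $K$ clients with local datasets $\{\mathcal{D}_k\}_{k=1}^{K}$; rounds $T$; episodes $E$; $C$-way $S$-shot; momentum $\beta$; weight $\lambda$; temperature $\tau$
Ensure: Global model parameters $\theta^{T}$
1: Initialize global model $\theta^{0}$; set global prototypes $\{\mathbf{g}_c\}$ to be determined after round 1
2: for $t = 1, \dots, T$ do
3:   Server broadcasts $\theta^{t-1}$ and $\{\mathbf{g}_c^{t-1}\}$ to all clients
4:   for each client $k$ in parallel do
5:     $\theta_k^{t} \leftarrow \theta^{t-1}$
6:     Initialize buffer $\mathbf{b}_c \leftarrow \mathbf{0}$ for each class $c$
7:     for $e = 1, \dots, E$ do
8:       Sample a $C$-way $S$-shot task from $\mathcal{D}_k$
9:       Apply CWT to obtain time–frequency representations
10:       Compute per-episode prototypes $\mathbf{p}_c^{k,e}$ via Equation (3)
11:       $\mathbf{b}_c \leftarrow \mathbf{b}_c + \mathbf{p}_c^{k,e}$   ▷ accumulate
12:       Compute $\mathcal{L}_{\text{proto}}^{k}$ via Equation (5) and $\mathcal{L}_{\text{cal}}^{k}$ via Equation (12)
13:       $\mathcal{L}^{k} \leftarrow \mathcal{L}_{\text{proto}}^{k} + \lambda\, \mathcal{L}_{\text{cal}}^{k}$   ▷ $\lambda = 0$ when $t = 1$
14:       Update $\theta_k^{t}$ by SGD on $\mathcal{L}^{k}$
15:     end for
16:     Set $\bar{\mathbf{p}}_c^{k,t} \leftarrow \mathbf{b}_c / E$ for each class $c$   ▷ Equation (6)
17:     Upload $\theta_k^{t}$ and $\{\bar{\mathbf{p}}_c^{k,t}\}$ to the server
18:   end for
19:   Server: Global Prototype Calibration (GPC)
20:   Compute $\hat{\mathbf{p}}_c^{t}$ for each class $c$ via Equation (7)
21:   Update $\mathbf{g}_c^{t}$ via Equation (8)
22:   Server: Prototype-Distance Aware Aggregation (PDAA)
23:   Compute $d_k^{t}$ for each client $k$ via Equation (9)
24:   Compute $\alpha_k^{t}$ for each client $k$ via Equation (10)
25:   $\theta^{t} \leftarrow \sum_{k=1}^{K} \alpha_k^{t}\, \theta_k^{t}$   ▷ Equation (11)
26: end for
27: return $\theta^{T}$, $\{\mathbf{g}_c^{T}\}$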
3. Experiments
3.1. Experimental Setup
Datasets. We evaluate ProtoFed on two widely used rolling bearing fault diagnosis benchmarks that represent distinct levels of complexity. The Case Western Reserve University (CWRU) dataset [36] contains vibration signals collected from a motor drive end bearing under four health states: normal, inner race fault, outer race fault, and ball fault. Signals are recorded at a sampling rate of 12 kHz under four load conditions (0, 1, 2, and 3 HP), which naturally induce distribution shifts across operating regimes. The test rig consists of a 2-HP Reliance Electric motor, a torque transducer, and a dynamometer; accelerometers are mounted at the drive end and fan end of the motor housing. Single-point faults with diameters of 0.007, 0.014, and 0.021 inches are introduced via electro-discharge machining. We simulate a federated scenario with $K = 4$ clients, each corresponding to one load condition. The Paderborn University (PU) dataset [37] provides vibration data from bearings operating under systematically varied combinations of rotational speed (900–1500 rpm), load torque (0.1–0.7 Nm), and radial force (400–1000 N), encompassing both artificially seeded and naturally developed faults. Vibration signals are recorded by accelerometers mounted on the bearing housing at a native sampling rate of 64 kHz; following standard practice in bearing fault diagnosis [1], we downsample the PU signals to 12 kHz to match the CWRU sampling rate and ensure a consistent CWT preprocessing pipeline across both datasets. We construct $K = 4$ clients from four distinct operating conditions to simulate realistic cross-condition heterogeneity. Compared to CWRU, the PU dataset poses a more challenging diagnostic task due to its greater intra-class variability and subtler inter-class distinctions.
Data partition protocol. The train/test splits are constructed to evaluate cross-condition generalization rather than within-condition interpolation. For CWRU, each of the four load conditions constitutes a separate client; the episodic support and query sets within each client are randomly sampled from that client’s data using disjoint sample indices, but no condition-level held-out protocol is applied in the default setting to ensure all clients participate in federated training. For PU, the four operating condition groups are similarly assigned to separate clients. Within each client, we randomly partition the available samples into a training pool (80%) and a test pool (20%) prior to episodic sampling; the test pool is reserved exclusively for final evaluation and is never used during episodic support/query construction. To further substantiate cross-condition generalization, we additionally report leave-one-load-out results on the CWRU dataset in Section 3.11, where three load conditions are used for training and the held-out condition serves as the test client.
Few-shot and non-IID configuration. For each client, we construct C-way S-shot episodes with fault categories. We report results under two few-shot settings: (5-shot) and (10-shot), with query samples per class. To simulate realistic data heterogeneity, class distributions across clients are generated using a Dirichlet distribution with concentration parameter (unless otherwise specified), where smaller values produce more severe non-IID partitions. We note that when the total number of samples per class is small (e.g., ), the absolute class-count differences induced by Dirichlet partitioning are necessarily limited. In our setup, the Dirichlet distribution primarily governs which operating conditions (and thus which distributional characteristics) dominate each client’s data, rather than producing large imbalances in absolute sample counts. This design reflects realistic industrial scenarios where the heterogeneity arises from distinct operating regimes rather than from purely quantitative class imbalance.
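For readers unfamiliar with Dirichlet-based heterogeneity, the sketch below shows the common label-skew recipe in which each class's samples are split across clients according to proportions drawn from a symmetric Dirichlet distribution. It is an illustrative stand-in rather than the exact protocol of this paper (which ties heterogeneity to operating conditions), and the concentration value `alpha=0.5`, the class count, and the per-class sample count are hypothetical.

```python
import random


def dirichlet_proportions(n_parts, alpha, rng):
    """Sample from a symmetric Dirichlet by normalising Gamma(alpha, 1) draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(n_parts)]
    total = sum(draws)
    return [g / total for g in draws]


def partition_class_counts(n_classes, n_clients, samples_per_class, alpha, seed=0):
    """Split each class's samples across clients with Dirichlet proportions.

    Returns counts[k][c]; largest-remainder rounding keeps every class
    total exactly equal to samples_per_class.
    """
    rng = random.Random(seed)
    counts = [[0] * n_classes for _ in range(n_clients)]
    for c in range(n_classes):
        props = dirichlet_proportions(n_clients, alpha, rng)
        raw = [p * samples_per_class for p in props]
        base = [int(r) for r in raw]
        leftover = samples_per_class - sum(base)
        # Hand the remaining samples to the clients with the largest remainders.
        order = sorted(range(n_clients), key=lambda k: raw[k] - base[k], reverse=True)
        for k in order[:leftover]:
            base[k] += 1
        for k in range(n_clients):
            counts[k][c] = base[k]
    return counts


counts = partition_class_counts(
    n_classes=4, n_clients=4, samples_per_class=20, alpha=0.5, seed=0
)
```

Smaller values of `alpha` concentrate each class on fewer clients, while large values approach an even split.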
Implementation details. Raw vibration segments of length are transformed into time–frequency representations of size via CWT with a Morlet wavelet (). The feature extractor is a four-block convolutional network, where each block consists of a convolutional layer with 64 filters, batch normalization, ReLU activation, and max pooling, yielding a -dimensional embedding. Training is conducted over communication rounds with local episodes per round. The local optimizer is SGD with a learning rate of . The EMA momentum is set to , the calibration weight to , and the PDAA temperature to . All experiments are repeated five times with different random seeds, and we report the mean ± standard deviation of accuracy and macro-averaged F1-score.
Baselines. We compare ProtoFed against eleven representative methods spanning four categories.
Standard federated baselines: FedAvg [38] and FedProx [39], both equipped with the same CNN backbone, represent standard federated optimization; FedTL + MMD is a federated transfer learning baseline that applies maximum mean discrepancy-based domain alignment within the federated framework, implemented by us following the standard federated transfer learning paradigm [20]; and FedRad [21] is a recent dynamic aggregation method for fault diagnosis.
Prototype-based federated baselines: FedProto [15] pioneers prototype-only communication, and FedSA [16] introduces semantic anchors as unified prototypes.
Centralized few-shot baselines: Prototypical network (ProtoNet) [4], MAML [40], and Matching Network [41] are trained on the union of all clients’ data with the same episodic protocol. Since these methods have access to the pooled data from all K clients, they effectively observe K·S samples per class during episode construction (e.g., 20 samples per class with K = 4 and S = 5), providing a strictly more favorable data condition than any individual federated client. They therefore serve as centralized upper-bound references representing the best achievable performance when data sharing is unrestricted, and are not included in the federated ranking.
Federated meta-learning baselines: FedMeta-FFD [12] and REFML [13] represent the state-of-the-art in federated few-shot fault diagnosis. All methods share the same CNN backbone and CWT preprocessing for fair comparison.
Baseline implementation details. To ensure a fair comparison, we detail the implementation and hyperparameter configuration of all baselines. FedAvg and FedProx use the official algorithms with the same 4-block CNN backbone; the FedProx proximal coefficient is tuned over a candidate grid. FedTL + MMD is our own implementation following [20], with the MMD kernel bandwidth selected by the median heuristic. FedRad uses the official implementation with the Rademacher penalty coefficient tuned over a candidate grid (selected: 0.1). FedProto uses the official code with the prototype loss weight tuned over a candidate grid (selected: 1.0). FedSA uses the official code with the semantic anchor dimension matching our embedding size and the anchor update rate tuned over a candidate grid (selected: 0.05). ProtoNet, MAML, and MatchingNet use standard implementations, with an inner-loop learning rate of 0.01 for MAML (selected by tuning). FedMeta-FFD uses the official implementation with the adaptive learning rate module enabled; REFML uses the official code with the interpolation coefficient tuned over a candidate grid (selected: 0.5). All federated methods use the same number of communication rounds, local episodes, and learning rate. For baselines with official open-source implementations, we use the released code directly; for those without available code (FedTL + MMD), we re-implement following the original paper’s algorithmic description and verify convergence on a centralized validation split before federated deployment.
3.2. Main Results
Table 1 presents the overall comparison across both datasets and both few-shot settings.
The results reveal a clear performance hierarchy across method categories, and several key observations can be drawn. The standard federated methods (FedAvg, FedProx, FedTL + MMD, FedRad), which lack explicit few-shot learning mechanisms, consistently rank at the bottom. FedAvg achieves only 72.41% on CWRU and 65.73% on PU under the 5-shot setting—over 23 and 25 percentage points below ProtoFed, respectively. Even FedRad, which introduces dynamic aggregation based on Rademacher complexity, trails ProtoFed by 12.90% on CWRU and 13.94% on PU, confirming that complexity-based weighting loses effectiveness when local datasets are too small for reliable complexity estimation. The prototype-based FL methods, FedProto and FedSA, fare considerably better. FedSA in particular reaches 90.18% on CWRU 5-shot by leveraging semantic anchors, making it the strongest non-meta-learning baseline. ProtoFed still surpasses FedSA by 5.45% on CWRU and 5.27% on PU, however, because FedSA’s anchor updates depend on client classifiers that remain under-trained when only five support samples are available.
Among the federated meta-learning approaches—ProtoFed’s most direct competitors—REFML slightly edges out FedMeta-FFD on both datasets, owing to its adaptive interpolation between global and local model initialization. ProtoFed outperforms REFML by 4.29% on CWRU and 3.87% on PU under 5-shot, a margin approximately five times the standard deviation of either method, indicating statistical significance. This advantage stems from the synergy between GPC and PDAA: GPC provides calibrated global prototypes as stable classification anchors, while PDAA prevents clients with heavily drifted representations from diluting the aggregated model—mechanisms that neither FedMeta-FFD nor REFML possess. To rigorously verify statistical significance, we conduct paired t-tests on the per-seed accuracy values across identical data partitions and random seeds. ProtoFed significantly outperforms REFML on both CWRU (, ) and PU (, ) under 5-shot, and similarly outperforms FedSA ( on both datasets). The 95% confidence intervals for the mean accuracy improvement over REFML are on CWRU and on PU, confirming that the observed advantages are both statistically significant and practically meaningful. As expected, the centralized upper-bound references achieve the highest overall accuracy, since they have unrestricted access to all clients’ pooled data. Centralized ProtoNet reaches 96.82% on CWRU and 92.74% on PU under 5-shot, surpassing ProtoFed by only 1.19% and 1.39%, respectively. This modest gap—substantially smaller than the 5–6% gap separating the second-best federated method (REFML) from the centralized ceiling—validates that ProtoFed’s prototype calibration and distance-aware aggregation recover most of the information available under centralized training. Under the 10-shot setting, this gap further shrinks to 0.53% on CWRU and 0.94% on PU, approaching the centralized optimum as more local data becomes available.
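The paired test on per-seed accuracies can be sketched with the standard library alone. The accuracy lists below are placeholders and not values from this study, and the function name `paired_t` is hypothetical. The constant 2.776 is the two-sided 95% critical value of Student's t distribution with 4 degrees of freedom, which matches five seeds.

```python
import math
import statistics

T_CRIT_95_DF4 = 2.776  # two-sided 95% critical value, Student's t, 4 degrees of freedom

def paired_t(acc_a, acc_b):
    """Return (t statistic, degrees of freedom, mean difference, standard error)."""
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    # Sample standard deviation of the differences; zero variance would divide by zero.
    std_err = statistics.stdev(diffs) / math.sqrt(n)
    return mean_diff / std_err, n - 1, mean_diff, std_err

# Placeholder per-seed accuracies (illustrative only), same seeds and partitions for both methods.
method_a = [90.0, 91.0, 92.0, 93.0, 94.0]
method_b = [88.5, 90.0, 90.0, 91.5, 92.5]

t_stat, dof, mean_diff, std_err = paired_t(method_a, method_b)
ci_low = mean_diff - T_CRIT_95_DF4 * std_err
ci_high = mean_diff + T_CRIT_95_DF4 * std_err
```

Pairing by seed removes the variance that both methods share (partition and initialization), which is why the test uses per-seed differences rather than two independent samples. Converting the statistic to a p-value needs the t distribution's CDF; `scipy.stats.ttest_rel` provides that if SciPy is available.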
It is also worth noting that the performance advantage of ProtoFed over baselines is consistently larger on PU than on CWRU. Since the PU dataset exhibits greater inter-condition variability, this observation confirms the design rationale of GPC and PDAA: prototype calibration is most beneficial precisely when cross-client heterogeneity is severe.
3.3. Ablation Study
To quantify the contribution of each proposed component, we conduct ablation experiments by systematically removing individual modules from the full ProtoFed framework. Four ablation variants are evaluated: (1) w/o GPC, which disables global prototype calibration and uses uniform prototype averaging without EMA smoothing; (2) w/o PDAA, which replaces distance-aware aggregation with standard sample-count-based weighting; (3) w/o CWT, which feeds raw one-dimensional vibration signals directly into a one-dimensional CNN instead of CWT time–frequency maps; (4) w/o calibration, which removes the calibration regularization term () from the local training objective.
The results in Table 2 and the accompanying visual comparison in Figure 4 reveal that every component contributes meaningfully to the overall performance, though their relative importance differs.
Removing CWT incurs the largest performance drop (−6.15% accuracy on CWRU, −7.23% on PU under 5-shot), underscoring the importance of time–frequency representations as the input modality. One-dimensional vibration signals, while containing the same information in principle, lack the explicit two-dimensional structure that allows the convolutional feature extractor to capture joint time–frequency patterns characteristic of different fault types. The calibration regularization term is the second most influential component: its removal leads to a 5.31% drop on CWRU and a 5.77% drop on PU, confirming that the bidirectional alignment between local and global prototypes is essential for maintaining cross-client consistency.
Among the two server-side mechanisms, GPC has a slightly greater impact than PDAA. Disabling GPC causes a 3.85% drop on CWRU and a 4.71% drop on PU, whereas disabling PDAA leads to drops of 2.98% and 3.49%, respectively. This ranking is intuitive: GPC directly addresses the prototype drift problem by producing calibrated global prototypes, while PDAA provides a complementary but indirect benefit by adjusting aggregation weights. Notably, the two mechanisms exhibit a cooperative relationship—the combined improvement of GPC and PDAA (full model vs. the worse of the two ablations) exceeds the sum of their individual contributions, suggesting that calibrated prototypes enable PDAA to compute more meaningful distance-based weights.
3.4. Impact of Few-Shot Size
Figure 5 examines how the number of support samples per class (S) affects diagnostic performance across all methods.
Two trends are evident. First, all methods benefit from increasing S, but the rate of improvement varies substantially. ProtoFed achieves 72.18% on CWRU with just a single shot per class and reaches 95.63% at —a steep improvement curve that plateaus around . In contrast, FedAvg + CNN requires to reach 91.38%, a level that ProtoFed exceeds at . This finding demonstrates that the combination of prototypical metric learning and global prototype calibration enables ProtoFed to extract discriminative representations from extremely limited data.
Second, the advantage of ProtoFed is most pronounced in the extreme low-shot regime. At , ProtoFed outperforms REFML by 7.06% on CWRU and 6.82% on PU; at , the gap narrows to 1.20% and 1.30%. This diminishing-return pattern aligns with the expectation that prototype calibration is most valuable when local prototypes are computed from very few samples and are inherently noisy—as S increases, local prototypes converge toward their true class means, reducing prototype drift and the relative benefit of GPC.
A natural concern with few-shot settings—particularly at or —is the risk of overfitting to the extremely limited support samples. ProtoFed mitigates this risk through several complementary mechanisms. First, the prototypical network architecture avoids training a parametric classifier head, instead relying on the geometric structure of the embedding space, which inherently constrains the hypothesis space. Second, the calibration loss acts as a regularizer that pulls local prototypes toward the global consensus, preventing the feature extractor from overfitting to client-specific distributional artifacts. Third, the episodic training protocol itself provides implicit data augmentation by constructing diverse support/query partitions across episodes. The observation that ProtoFed’s standard deviation remains below 1% even at (compared to over 2% for FedAvg) further indicates that these mechanisms collectively stabilize the learning process against overfitting in the extreme low-shot regime.
3.5. Robustness to Non-IID Heterogeneity
We investigate the sensitivity of ProtoFed to varying degrees of data heterogeneity by adjusting the Dirichlet concentration parameter from 0.1 (highly non-IID) to 10.0 (approximately IID) under the 5-shot setting.
As shown in Figure 6, all methods degrade as the concentration parameter decreases, but the rate of degradation varies markedly. FedAvg + CNN suffers a 24.42% accuracy drop on CWRU when the concentration parameter shifts from 10.0 to 0.1, while ProtoFed’s decline is limited to 6.27%; on PU, the corresponding drops are 25.24% versus 7.02%. This robustness stems from GPC producing stable global prototypes that absorb distributional idiosyncrasies of individual clients, and PDAA down-weighting clients whose local distributions have drifted furthest from the consensus.
Notably, ProtoFed maintains a clear advantage even in the near-IID setting (concentration parameter of 10.0), suggesting that prototype calibration provides additional regularization benefits beyond non-IID mitigation.
3.6. Visualization Analysis
We visualize the embedding space using t-SNE and examine classification boundaries through confusion matrices to provide intuitive evidence for the effectiveness of GPC.
As shown in Figure 7, the uncalibrated embedding space (Figure 7a) exhibits clear condition-specific fragmentation: samples from the same fault category but different clients form distinct sub-clusters, indicating that the feature extractor has learned condition-dependent rather than condition-invariant representations. After GPC calibration (Figure 7b), these sub-clusters coalesce into compact, well-separated groups with cross-client overlap within each class, confirming that GPC successfully unifies the local feature distributions.
The confusion matrices in Figure 8 corroborate this finding. REFML exhibits non-negligible off-diagonal entries between inner race, outer race, and ball faults, categories known to produce overlapping frequency signatures. ProtoFed substantially reduces these misclassifications, raising per-class accuracy for the OR and Ball categories from 90.0% to 95.0% and 94.0%, respectively, indicating that the calibrated global prototypes sharpen inter-class discrimination for fault types with subtle distinguishing features.
3.7. Convergence Analysis
Figure 9 presents the evolution of prototype drift and test accuracy over communication rounds.
ProtoFed rapidly reduces inter-client prototype divergence, reaching near-convergence within approximately 20 rounds (Figure 9a). The variant without GPC retains substantially higher drift even after 50 rounds, stabilizing at roughly 3.9 compared to ProtoFed’s 0.5, which directly accounts for its lower final accuracy (Figure 9b). The variant without PDAA shows intermediate behavior, with prototype drift decreasing more slowly and converging to a higher residual level. In terms of convergence speed measured in communication rounds, ProtoFed reaches 90% accuracy within approximately 15 rounds, compared to around 25 for REFML and 30 for FedMeta-FFD.
3.8. Sensitivity to Temperature Parameter
The PDAA temperature parameter controls the sharpness of the client weighting distribution: small values concentrate aggregation weight on the most aligned client, while large values approach uniform weighting. Table 3 reports the accuracy under the 5-shot setting as the temperature varies from 0.1 to 5.0 on both datasets.
The results show a clear inverted-U pattern. At , the weighting distribution becomes nearly one-hot, assigning almost all weight to a single client and discarding useful information from others. At , the distribution approaches uniformity, reducing PDAA to standard sample-count-based aggregation and losing the benefit of distance-aware weighting. The optimal provides a balanced trade-off, allowing well-aligned clients to contribute more while still incorporating diverse prototype information from moderately divergent clients. The performance difference between and is relatively small (within 0.5–0.6% on both datasets), indicating that the method is not overly sensitive to within a reasonable range around the optimum.
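The one-hot versus uniform behavior at the two ends of the temperature range can be reproduced with a minimal sketch. It assumes that the weights are a softmax over negative prototype distances scaled by the temperature; this functional form, the function name `pdaa_weights`, and the distance values are assumptions for illustration, and any sample-count factor the actual method may include is omitted.

```python
import math

def pdaa_weights(distances, temperature):
    """Softmax over negative distances: closer clients receive larger weights."""
    scaled = [-d / temperature for d in distances]
    shift = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical prototype distances for four clients (client 0 is the best aligned).
distances = [0.5, 1.0, 1.5, 3.0]

sharp = pdaa_weights(distances, temperature=0.1)    # nearly one-hot on client 0
flat = pdaa_weights(distances, temperature=100.0)   # nearly uniform
middle = pdaa_weights(distances, temperature=1.0)   # graded weighting
```

With an intermediate temperature, the weights remain ordered by alignment while every client still contributes, which is the trade-off the paragraph above describes.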
3.9. Sensitivity to Calibration Weight and EMA Momentum
The calibration weight and EMA momentum jointly control the strength and stability of prototype-based regularization. We conduct a grid search over both parameters under the 5-shot setting. Table 4 reports the resulting accuracy on the CWRU dataset.
The optimal combination is and . Several patterns emerge. First, the performance surface is smooth and unimodal, indicating that ProtoFed is not overly sensitive to the exact hyperparameter values. The region and consistently achieves accuracy above 94%, providing a wide operational range for practitioners. Second, extreme values of either parameter degrade performance: provides insufficient calibration, while over-constrains local adaptation; similarly, makes the EMA too sluggish to track evolving prototype distributions, while introduces excessive jitter in the global prototypes. Third, the interaction between and is cooperative rather than independent: increasing (stronger calibration pull) is more effective when is moderate (0.80–0.95) because a stable global anchor makes the calibration signal more informative. We observe a similar pattern on the PU dataset (best: , , 91.35%; worst corner: , , 85.92%), confirming the generality of the selected hyperparameters.
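The sluggish versus jittery behavior attributed to the EMA momentum follows directly from the update rule, sketched below for a single class. This is an illustrative sketch under stated assumptions: the momentum multiplies the previous global prototype, the per-round aggregate is a weighted mean of client prototypes, and the names `aggregate_prototypes` and `ema_update` are hypothetical.

```python
def aggregate_prototypes(client_protos, weights=None):
    """Weighted mean of per-client prototype vectors for one class (uniform if no weights)."""
    k = len(client_protos)
    if weights is None:
        weights = [1.0 / k] * k
    dim = len(client_protos[0])
    return [sum(w * p[j] for w, p in zip(weights, client_protos)) for j in range(dim)]

def ema_update(global_proto, aggregated, momentum):
    """Blend the previous global prototype with the current round's aggregate."""
    return [momentum * g + (1.0 - momentum) * a for g, a in zip(global_proto, aggregated)]

# Hypothetical 2-D prototypes from two clients for one class.
aggregated = aggregate_prototypes([[1.0, 1.0], [3.0, 3.0]])

previous = [0.0, 0.0]
slow = ema_update(previous, aggregated, momentum=0.99)   # barely moves toward the aggregate
balanced = ema_update(previous, aggregated, momentum=0.9)
fast = ema_update(previous, aggregated, momentum=0.1)    # almost copies the noisy aggregate
```

A momentum near 1 keeps the global prototype close to its history, so it tracks genuine changes slowly; a momentum near 0 lets each round's few-shot noise pass through almost unfiltered.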
3.10. Robustness to Noisy Clients
In real-world federated deployments, some clients may produce corrupted labels due to sensor malfunction, incorrect maintenance logs, or annotation errors. To evaluate robustness under such conditions, we simulate a noisy client scenario by randomly flipping 20% of the labels on one of the four clients, while keeping the remaining three clients clean.
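One plausible way to realize this corruption is shown below. The sketch flips exactly the requested fraction of labels to a different, uniformly chosen class; whether the study samples flips exactly or per-sample with probability 0.2 is not stated, so this is an assumption, and the function name `flip_labels` and the label layout are hypothetical.

```python
import random

def flip_labels(labels, num_classes, flip_rate, seed=0):
    """Return a copy of labels in which a fixed fraction is replaced by a different class."""
    rng = random.Random(seed)
    n_flip = round(flip_rate * len(labels))
    noisy = list(labels)
    for i in rng.sample(range(len(labels)), n_flip):
        # Choose uniformly among the other classes so the label always changes.
        noisy[i] = rng.choice([c for c in range(num_classes) if c != labels[i]])
    return noisy

# Hypothetical client with 100 samples over 4 classes; only this client is corrupted.
clean = [i % 4 for i in range(100)]
noisy = flip_labels(clean, num_classes=4, flip_rate=0.2, seed=7)
num_changed = sum(a != b for a, b in zip(clean, noisy))
```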
Table 5 compares ProtoFed, its variant without PDAA, REFML, and FedAvg under the 5-shot setting.
ProtoFed exhibits strong resilience to client-level label noise, with only a 1.45% accuracy drop on CWRU and 1.63% on PU compared to the clean setting. In contrast, REFML drops by 3.72% on CWRU and 4.30% on PU, and FedAvg drops by 3.96% and 4.35%, respectively. The key to this robustness is PDAA: the noisy client develops distorted local prototypes that diverge substantially from the global anchors, resulting in a high prototype distance and a correspondingly low aggregation weight. Indeed, the ablated variant without PDAA suffers a noise-induced degradation of 2.41% on CWRU and 3.00% on PU (relative to its own clean setting accuracy of 92.65% and 87.86%), nearly double ProtoFed’s degradation, confirming that PDAA’s distance-aware weighting effectively isolates the corrupted client. Under the noisy condition, the performance gap between ProtoFed and its w/o PDAA variant widens to 3.94% on CWRU and 4.86% on PU (compared to 2.98% and 3.49% in the clean ablation), further demonstrating that PDAA becomes increasingly beneficial in the presence of client-level corruption.
3.11. Leave-One-Load-Out Cross-Condition Generalization
To evaluate cross-condition generalization more stringently, we conduct a leave-one-load-out protocol on the CWRU dataset: three load conditions are used for federated training (one client per load), and the held-out load condition serves as the test client, whose data is never seen during training.
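The protocol amounts to enumerating one split per held-out load, as in the short sketch below. The load values 0 to 3 HP are the standard CWRU motor loads; the function name `leave_one_load_out` is hypothetical.

```python
def leave_one_load_out(loads):
    """Return (training loads, held-out load) pairs, one per load condition."""
    return [([l for l in loads if l != held_out], held_out) for held_out in loads]

# CWRU motor loads in horsepower; each training load acts as one federated client.
splits = leave_one_load_out([0, 1, 2, 3])
for train_loads, test_load in splits:
    print(f"train clients: {train_loads} HP, held-out test client: {test_load} HP")
```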
Table 6 reports the accuracy for each held-out condition.
ProtoFed achieves a mean cross-condition accuracy of 93.43%, outperforming REFML by 5.40% and FedSA by 6.73%. Notably, ProtoFed’s cross-condition performance (93.43%) is only 2.20% below its within-condition federated accuracy (95.63%), whereas REFML exhibits a larger gap of 3.31% (91.34% to 88.03%). This smaller generalization gap indicates that the calibrated global prototypes learned by GPC capture condition-invariant fault representations rather than condition-specific artifacts. The 3 HP condition is the most challenging held-out case across all methods, likely because the highest-load operating regime produces the most distinct vibration characteristics. Nevertheless, ProtoFed still achieves 92.14% on this most difficult split.
3.12. Noise Robustness Under Varying SNR
To assess the robustness of ProtoFed to signal-level noise—a practical concern in industrial environments with electromagnetic interference and sensor noise—we inject additive Gaussian white noise at varying signal-to-noise ratios (SNR) into the raw vibration signals before CWT preprocessing.
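Noise injection at a target SNR is typically done by scaling the noise variance to the measured signal power, as in the following sketch. The function name `add_awgn`, the synthetic sine signal, and the chosen SNR are illustrative assumptions; the noise is added to the raw signal, before any time-frequency transform.

```python
import math
import random

def add_awgn(signal, snr_db, seed=0):
    """Add white Gaussian noise so that signal power / noise power matches snr_db."""
    rng = random.Random(seed)
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]

# Synthetic stand-in for a vibration segment.
clean = [math.sin(0.01 * i) for i in range(20000)]
noisy = add_awgn(clean, snr_db=0.0, seed=1)

# Empirical check of the achieved SNR in dB.
p_signal = sum(x * x for x in clean) / len(clean)
p_noise = sum((y - x) ** 2 for x, y in zip(clean, noisy)) / len(clean)
measured_snr_db = 10.0 * math.log10(p_signal / p_noise)
```

At 0 dB the noise carries as much power as the signal, and each further 10 dB reduction multiplies the noise power by ten.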
Table 7 reports accuracy under the 5-shot setting for SNR values ranging from the lowest, heavily corrupted level up to 10 dB (mildly noisy), as well as the clean baseline (∞).
ProtoFed maintains the highest accuracy across all noise levels. At the severe dB condition, ProtoFed achieves 88.42% on CWRU and 83.28% on PU, retaining 92.5% and 91.2% of its clean setting performance, respectively. In contrast, REFML retains only 88.9% and 87.4% of its own clean accuracy at the same noise level. The absolute performance gap between ProtoFed and the baselines widens as noise increases: at SNR = ∞, the gap over REFML is 4.29% on CWRU, but at SNR = dB, it grows to 7.18% on CWRU and 6.86% on PU. This amplified advantage under noise arises because noisy CWT representations produce noisier per-episode prototypes, amplifying prototype drift across clients. GPC’s temporal smoothing via EMA absorbs this additional variance, while PDAA down-weights clients that are most affected by noise-induced prototype distortion. These results suggest that ProtoFed’s prototype calibration mechanisms provide built-in noise resilience beyond what the baselines can achieve.
3.13. Communication Cost Analysis
A practical concern for federated deployment is the communication overhead introduced by prototype sharing.
Table 8 compares the per-round per-client communication cost across representative methods.
ProtoFed transmits both model parameters and class prototypes, incurring a total upload/download cost of per direction. With and , the prototype payload amounts to 1024 float32 values, or approximately 4 KB—representing a mere 0.38% overhead relative to the model parameter transmission. Over 50 communication rounds with clients, ProtoFed’s total communication volume is approximately 418 MB, compared to 416 MB for FedAvg—a negligible difference. FedProto achieves dramatically lower communication cost (1.6 MB total) by transmitting only prototypes, but at the expense of 10% lower accuracy (85.62% vs. 95.63% on CWRU 5-shot). ProtoFed therefore achieves the best accuracy–communication trade-off among all compared methods, adding less than 0.4% communication overhead over standard FL while recovering most of the centralized-to-federated performance gap.
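The figures quoted above can be cross-checked with a few lines of arithmetic. The sketch assumes that the totals count both upload and download for all four clients over 50 rounds, and that KB and MB are binary multiples; under those assumptions the per-direction model size is inferred from the 416 MB FedAvg total rather than taken from the text.

```python
PROTO_VALUES = 1024        # float32 values in the prototype payload (from the text)
BYTES_PER_FLOAT32 = 4
ROUNDS = 50
CLIENTS = 4                # assumption: the four clients of the default setup
DIRECTIONS = 2             # assumption: upload and download are both counted
FEDAVG_TOTAL_MB = 416.0    # total FedAvg volume quoted in the text

proto_kb = PROTO_VALUES * BYTES_PER_FLOAT32 / 1024     # 4.0 KB per direction
transfers = ROUNDS * CLIENTS * DIRECTIONS              # 400 transfers in total
model_mb = FEDAVG_TOTAL_MB / transfers                 # inferred model size per transfer

overhead_pct = 100.0 * (proto_kb / 1024) / model_mb    # prototype overhead vs. model
protofed_total_mb = transfers * (model_mb + proto_kb / 1024)
fedproto_total_mb = transfers * proto_kb / 1024        # prototypes only

print(f"prototype payload: {proto_kb:.1f} KB, overhead: {overhead_pct:.2f}%")
print(f"ProtoFed total: {protofed_total_mb:.0f} MB, FedProto total: {fedproto_total_mb:.1f} MB")
```

Under these assumptions the computed overhead, the ProtoFed total, and the FedProto total agree with the quoted 0.38%, 418 MB, and 1.6 MB after rounding.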
The computational overhead of GPC and PDAA on the server side is also minimal. GPC requires aggregating prototype vectors of dimension and performing one EMA update, amounting to operations. PDAA computes distance scalars and a softmax normalization. Both operations complete in under 1 ms on a single CPU core and are negligible compared to the local training time (approximately 45 s per client per round on a single NVIDIA RTX 3090 GPU).
3.14. Differential Privacy Analysis
Although class prototypes are more compact than raw data or full model gradients, they can still leak aggregate distributional statistics of a client’s data. An adversary with access to the uploaded prototypes could, in principle, infer statistical properties of the local class distributions through prototype inversion or membership inference attacks [6]. To evaluate the feasibility of strengthening ProtoFed’s privacy posture, we apply client-level differential privacy (DP) by adding calibrated Gaussian noise to the prototype vectors before uploading them to the server. The noise scale is calibrated according to the desired privacy budget via the Gaussian mechanism, with a fixed δ and a sensitivity bound derived from the ℓ2-norm clipping of prototype vectors (clip norm = 10.0).
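A clip-then-perturb step of this kind can be sketched as follows. The noise calibration uses the classical Gaussian-mechanism bound, σ = Δ·sqrt(2 ln(1.25/δ))/ε, which is only guaranteed for ε below 1; whether the study uses this bound or a tighter one is not stated. The values of ε and δ, the choice of sensitivity as twice the clip norm, and all function names are assumptions made for illustration.

```python
import math
import random

def gaussian_sigma(epsilon, delta, sensitivity):
    """Classical Gaussian-mechanism noise scale (guaranteed for 0 < epsilon < 1)."""
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon

def clip_l2(vec, clip_norm):
    """Rescale vec so that its L2 norm does not exceed clip_norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [v * scale for v in vec]

def privatize_prototype(proto, epsilon, delta, sensitivity, clip_norm=10.0, rng=None):
    """Clip a prototype vector, then add i.i.d. Gaussian noise to each coordinate."""
    rng = rng or random.Random()
    sigma = gaussian_sigma(epsilon, delta, sensitivity)
    return [v + rng.gauss(0.0, sigma) for v in clip_l2(proto, clip_norm)]

# Hypothetical prototype and privacy parameters.
proto = [30.0, 40.0]          # L2 norm 50, so it is clipped down to norm 10
clipped = clip_l2(proto, 10.0)
noisy = privatize_prototype(
    proto, epsilon=0.5, delta=1e-5,
    sensitivity=2 * 10.0,     # assumption: worst case when one clipped vector replaces another
    rng=random.Random(0),
)
```

Clipping is what makes the sensitivity finite: without a norm bound, a single prototype could change the upload by an arbitrary amount and no finite noise scale would suffice.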
Table 9 reports the accuracy under the 5-shot setting for varying privacy budgets.
The results reveal a favorable privacy–accuracy trade-off. At a moderate privacy budget , the accuracy loss relative to the non-private baseline is only 0.25% on CWRU and 0.27% on PU—practically negligible. Even under the stringent regime, ProtoFed retains 92.14% on CWRU and 87.42% on PU, still surpassing the best non-private federated baseline (REFML at 91.34% and 87.48%) on CWRU while remaining competitive on PU. This resilience arises because the GPC aggregation over K clients and EMA temporal smoothing jointly average out the injected noise across both clients and time steps, effectively denoising the global prototypes. The model parameters are transmitted without DP noise in this experiment; extending DP to the full parameter exchange via DP-SGD remains a direction for future investigation.