1. Introduction
Motors are key actuating components in industrial transmission equipment and intelligent manufacturing systems, and their operating conditions directly affect production safety, operational continuity, and maintenance cost [1,2]. During long-term service, motors are subject to load fluctuations, start–stop impacts, installation deviations, component aging, and environmental disturbances, which may lead to various faults, such as bearing damage, rotor unbalance, misalignment, broken rotor bars, winding short circuits, and voltage imbalance [3]. However, motor fault diagnosis in engineering applications is not a simple classification problem but is jointly affected by data decentralization, class scarcity, distribution heterogeneity, privacy constraints, and limited resources for edge deployment.
In multi-client collaborative diagnosis scenarios, differences in motor types, operating conditions, sensor arrangements, and maintenance states lead to significant inconsistencies among local datasets. Some clients may operate mainly under normal or slightly degraded conditions for long periods, resulting in insufficient fault samples. Incipient and rare faults are usually characterized by weak impacts, strong noise interference, ambiguous class boundaries, and limited sample coverage [4]. Existing studies still have three main limitations in balancing rare-fault preservation and lightweight deployment. First, conventional federated aggregation and federated distillation tend to emphasize global accuracy or knowledge consistency, but rare-fault knowledge from small-sample clients can still be weakened during server-side fusion. Second, most distillation methods rely mainly on output responses or feature representations, which are insufficient to jointly describe soft decision boundaries, inter-class fault structures, and weak-fault sensitivity. Third, existing federated compression or slimming methods mainly reduce parameters or channels, but they do not explicitly prevent the removal of weak yet critical representations related to rare-fault boundaries. As a result, rare-fault features can be easily overwhelmed by majority classes and highly repetitive samples, making it difficult for diagnostic models to preserve sensitivity to minority faults under imbalanced data conditions [5]. Centralized deep learning methods can train diagnostic models by aggregating data from different devices [6], but direct data sharing is often constrained by enterprise privacy policies, equipment security, communication costs, and data management regulations. Federated learning provides a feasible raw-data-free solution for multi-client collaborative modeling by keeping raw monitoring data locally and exchanging model parameters or knowledge descriptors [7]. It should be noted that raw-data-free collaboration does not indicate a formal privacy guarantee because exchanged parameters or descriptors may still contain statistical information about local data. In addition, due to non-IID data distributions, conventional federated aggregation tends to be dominated by clients with larger sample sizes, more complete fault categories, and higher data quality. Consequently, rare-fault knowledge from minority clients may be weakened in the global model [8]. Meanwhile, the global model may also absorb repetitive, low-quality, and weakly task-related knowledge from multiple clients. Such knowledge usually makes a limited contribution to rare-fault discrimination but may increase model complexity and weaken minority-class boundaries. It may arise from repetitive feature responses of large-sample clients, majority-class-dominated decision information, or low-contribution channels that are insensitive to rare and incipient faults. As a result, the global model may contain redundant decision boundaries, leading to higher deployment cost and reduced edge-side applicability.
Pruning, quantization, knowledge distillation, and federated distillation have been widely used to reduce model complexity, storage cost, communication burden, and inference latency in mechanical fault diagnosis [9,10,11,12]. However, most existing lightweighting methods are designed for centralized or single-client scenarios, and their objectives mainly focus on parameter compression or teacher-output approximation [13]. Under multi-client imbalanced conditions, minority-fault and incipient-fault responses are usually weak and can be removed during pruning or slimming. Meanwhile, existing federated distillation methods mainly transfer soft labels, feature representations, class prototypes, or generated knowledge to improve model fusion, cross-domain generalization, or communication efficiency, with insufficient attention to rare-fault knowledge preservation during compression [14]. Relying only on output probabilities or conventional feature distillation cannot fully represent inter-class fault structures, input-sensitive weak responses, and complex rare-fault boundaries. Therefore, lightweight federated diagnosis should not only reduce model size but also preserve effective diagnostic knowledge and suppress redundant, low-quality, and majority-class-dominated knowledge.
More specifically, existing related techniques address only part of this requirement. Federated prototype learning aligns feature structures through class prototypes, but it cannot describe soft decision boundaries or input-sensitive weak responses. Federated knowledge distillation transfers output-level knowledge, while output responses alone are insufficient to preserve structural relations among similar faults. Rare-class reweighting increases the contribution of minority samples during local training, but it cannot evaluate the reliability of rare-fault knowledge during federated fusion. Model slimming reduces structural redundancy, but weak channels associated with rare-fault boundaries may still be removed without explicit knowledge-preservation constraints. Therefore, rare-fault-preserving lightweight federated diagnosis requires a unified framework that jointly considers decision-level discrimination, class-level structural relations, input-level sensitivity responses, rare-fault-aware knowledge fusion, and redundancy-suppressed model slimming.
To address these issues, this paper proposes a rare-fault-preserving federated dynamic model slimming method based on relational knowledge. The core novelty lies in using multi-granularity relational knowledge as a unified constraint for both rare-fault-aware federated fusion and gated student slimming, rather than treating relational distillation, federated weighting, and model slimming as independent modules. Specifically, output-discriminative knowledge, class-prototype relations, and input-sensitive relations are extracted at local clients to describe diagnostic knowledge from the decision, structure, and sensitivity levels. At the federated center, rare-fault-aware weighting is introduced to enhance reliable minority-fault knowledge from clients with scarce samples and heterogeneous data quality. Then, a relation-constrained gated teacher-to-student slimming strategy is constructed to reduce deployment complexity while preserving rare-fault recognition capability. The main contributions of this paper are summarized as follows:
- (1)
A rare-fault-preserving federated dynamic slimming framework is proposed for multi-client imbalanced motor fault diagnosis. Different from conventional federated distillation or federated compression methods, the proposed framework jointly considers federated knowledge fusion, rare-fault boundary preservation, and redundant channel suppression.
- (2)
A multi-granularity relational knowledge representation mechanism is developed by integrating output-discriminative knowledge, class-prototype relations, and input-sensitive relations. This design preserves soft decision boundaries, fault-class structural relationships, and weak-fault response sensitivity simultaneously.
- (3)
A rare-fault-aware knowledge weighting strategy and a relation-constrained gated slimming strategy are designed. The former reduces the dominance of majority clients during federated fusion, while the latter prevents the lightweight student model from removing key channels associated with rare-fault recognition.
3. Federated Dynamic Slimming Algorithms with Relation Knowledge for Rare-Fault Preservation
The main notation is unified as follows. denotes the number of clients, denotes the number of fault classes, denotes the local dataset of client , and denotes the labeled calibration set at the server. , , and denote the local model, global teacher model, and student model, respectively. , , and denote the output knowledge, class-prototype relation, and input-sensitive relation. denotes the class-level fusion weight, and denotes the channel-retention threshold.
3.1. Overall Framework with Output Discrimination and Relation Knowledge Collaboration
To address the weakening of rare-fault knowledge, the accumulation of redundant knowledge in the global model, and the high deployment complexity at the edge side in multi-client motor fault diagnosis, this study constructs a relation knowledge-based federated dynamic slimming framework, as shown in Figure 1. The framework consists of four stages: local knowledge mining, federated knowledge fusion, global teacher model optimization, and gated dynamic slimming of the student model. Different from methods that only upload model parameters or distill output soft labels, the proposed method represents local diagnostic knowledge using output-discriminative knowledge, class-prototype relations, and input-sensitive relations, and further performs rare-fault-aware fusion at the federated center.
Assume that the federated system consists of
local clients. The local fault dataset of the
-th client is denoted as
, where
is the number of fault classes. The local diagnostic model is composed of a feature extractor
and a classifier
, and the predicted probability is formulated as
where
denotes the classification logits of the
-th sample,
is the corresponding predicted probability, and
represents the local model parameters.
Unlike methods that directly upload model parameters or raw data, each client only uploads diagnostic knowledge descriptors extracted from its local model. The federated center then uses these descriptors to update the global teacher model and subsequently performs dynamic slimming through the gated student model . The key distinction of the proposed framework is that relational knowledge is used throughout the whole process, including local knowledge description, server-side rare-fault-aware fusion, teacher model optimization, and gated student slimming. Therefore, rare-fault boundaries are preserved during both knowledge aggregation and lightweight compression.
Remark 1. The proposed method keeps raw monitoring signals at local clients and only uploads compact knowledge descriptors, including softened outputs, class prototypes, relation matrices, and class-level statistics. This design reduces the exposure of raw vibration, acoustic, or current signals. Therefore, the proposed method should be understood as a raw-data-free and privacy-aware diagnostic framework rather than a method with a formal privacy guarantee. The upload cost of one client in each communication round is mainly determined by the descriptor size: , where is the number of bytes of one floating-point value, is the number of calibration samples, is the number of classes, and is the prototype dimension. The four terms correspond to output responses, class prototypes, two relation matrices, and class-level statistics, respectively. In comparison, parameter-based federated learning uploads model parameters with the cost . Since , , and are usually much smaller than the number of local signal samples or model parameters, the proposed descriptor-based transmission is communication-efficient.
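The cost comparison in Remark 1 can be illustrated with a short calculation. The exact expression did not survive in the text above, so the sketch below rebuilds it only from the four terms the remark names. The argument names, the assumption of three statistics per class, and the example sizes are all hypothetical and are not taken from the paper.

```python
def descriptor_bytes(n_cal, n_classes, proto_dim,
                     bytes_per_float=4, stats_per_class=3):
    """Approximate per-round upload size of one client's knowledge descriptor.

    Assumption: the four terms follow the verbal description in Remark 1.
    stats_per_class=3 (sample count, reliability, dispersion) is a guess.
    """
    outputs = n_cal * n_classes             # softened outputs on calibration samples
    prototypes = n_classes * proto_dim      # one prototype vector per class
    relations = 2 * n_classes * n_classes   # class-prototype and input-sensitive matrices
    stats = stats_per_class * n_classes     # class-level statistics
    return bytes_per_float * (outputs + prototypes + relations + stats)


def parameter_bytes(n_params, bytes_per_float=4):
    """Upload size when full model parameters are transmitted instead."""
    return bytes_per_float * n_params


if __name__ == "__main__":
    # Illustrative sizes only; they do not come from the experiments.
    d = descriptor_bytes(n_cal=200, n_classes=10, proto_dim=128)
    p = parameter_bytes(n_params=500_000)
    print(f"descriptor: {d} B, parameters: {p} B, ratio: {d / p:.4f}")
```

Under these illustrative sizes the descriptor is dominated by the output-response and prototype terms, which is why the number of calibration samples and the prototype dimension are the quantities to keep small.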
3.2. Local Model Training and Multi-Granularity Diagnostic Knowledge Mining
In each federated iteration, the federated center first distributes the current global teacher model parameters to the local clients. The
-th client then updates its model using its local dataset. Considering that rare-fault classes may be insufficiently represented in local data, a class-balancing factor is introduced to regularize the local loss function:
where
denotes the number of samples belonging to the
-th fault class in the
-th client,
is a small constant used to avoid division by zero, and
controls the compensation strength for rare classes. This design enables the local training stage to pay more attention to minority-fault classes, thereby preventing the subsequent federated fusion process from being dominated by majority-class samples. After model updating, the local clients do not upload raw monitoring signals. Instead, three types of diagnostic knowledge are extracted from the output layer, feature layer, and sensitivity-response layer, as shown in
Figure 2.
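The class-balancing idea can be sketched as follows. The loss equation itself is missing from the text above, so this is only one plausible reading: each class weight is an inverse power of its local sample count, with the small constant guarding the division and an exponent controlling the compensation strength. The function names, the normalization to an average weight of 1, and the handling of absent classes are assumptions.

```python
import math


def class_balance_weights(class_counts, gamma=0.5, eps=1e-8):
    """Inverse-frequency class weights (assumed form, not the paper's exact formula).

    Classes absent on this client get weight 0 so they do not distort the
    normalization; present classes are scaled to average 1.
    """
    raw = [(1.0 / (n + eps)) ** gamma if n > 0 else 0.0 for n in class_counts]
    present = sum(1 for n in class_counts if n > 0)
    scale = present / sum(raw)
    return [r * scale for r in raw]


def weighted_cross_entropy(probs, labels, weights):
    """Class-weighted cross-entropy over predicted probability vectors."""
    total, norm = 0.0, 0.0
    for p, y in zip(probs, labels):
        total += -weights[y] * math.log(max(p[y], 1e-12))
        norm += weights[y]
    return total / norm


if __name__ == "__main__":
    w = class_balance_weights([100, 4])
    print(w)  # the rare class receives the larger weight
    print(weighted_cross_entropy([[0.9, 0.1], [0.4, 0.6]], [0, 1], w))
```

A larger exponent shifts more of the loss toward rare classes; setting it to 0 recovers the unweighted loss.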
First, output-discriminative knowledge is used to describe the soft decision boundaries of the model among different fault classes. The output distribution is softened using the temperature factor
, which is defined as
Compared with hard labels,
preserves the similarity information between different fault classes and provides finer-grained discriminative constraints for the subsequent student model. Second, class-prototype relation knowledge is used to characterize the structural relationships among different fault classes in the feature space. Let
denote the deep feature extracted by the local model. The local class prototype of the
-th fault class is then defined as
The class-relation matrix is constructed based on the class prototypes as follows:
This matrix reflects the relative distances and similarity relationships among different fault classes, preventing the student model from learning only sample-level outputs while ignoring the underlying fault-class structure during the slimming process. Finally, input-sensitive relation knowledge is introduced to describe the model response to weak fault features and boundary samples. For the
-th fault class, its sensitivity prototype is defined as
The input-sensitive relation matrix is further constructed as
Each component in the proposed pipeline is designed for a specific limitation in rare-fault-preserving federated slimming. Output knowledge preserves soft decision boundaries and global classification responses. Class-prototype relations preserve the structural distribution among fault classes, which is important for distinguishing rare faults from adjacent majority faults. Input-sensitive relations describe the response of the model to weak fault features and boundary samples, which helps retain rare-fault-sensitive channels. Rare-fault-aware weighting is used at the server to prevent reliable minority-fault knowledge from being suppressed by large-sample clients. Gated slimming further removes low-contribution channels while preserving relation-constrained diagnostic knowledge. Therefore, these components are complementary rather than simply stacked modules.
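As a rough illustration of the descriptors described in this subsection, the sketch below computes temperature-softened outputs, per-class mean prototypes, and a class-relation matrix. The defining equations are missing from the text above, so cosine similarity is an assumed choice of relation measure and all function names are hypothetical. The same `relation_matrix` helper could be applied to sensitivity prototypes (for example, class-mean input gradients), which is also an assumption rather than the paper's stated definition.

```python
import math


def soften(logits, temperature=2.0):
    """Temperature-scaled softmax over one sample's logits."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp((z - peak) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def class_prototypes(features, labels, n_classes):
    """Mean feature vector per class; None for classes absent on this client."""
    dim = len(features[0])
    sums = [[0.0] * dim for _ in range(n_classes)]
    counts = [0] * n_classes
    for feat, y in zip(features, labels):
        counts[y] += 1
        for j, v in enumerate(feat):
            sums[y][j] += v
    return [[v / counts[c] for v in sums[c]] if counts[c] else None
            for c in range(n_classes)]


def cosine(u, v):
    """Cosine similarity with a small constant guarding zero-norm vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-12)


def relation_matrix(protos):
    """Pairwise similarity between prototypes; None where a class is missing."""
    n = len(protos)
    return [[None if protos[i] is None or protos[j] is None
             else cosine(protos[i], protos[j]) for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    feats = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]
    protos = class_prototypes(feats, [0, 0, 1], n_classes=3)
    print(soften([2.0, 0.0], temperature=4.0))
    print(relation_matrix(protos))
```

Marking absent classes with `None` keeps the descriptor honest about which class pairs a client can actually describe, which matters when the server later restricts fusion to clients that hold both classes.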
3.3. Rare-Fault-Aware Federated Relation Knowledge Fusion and Teacher Model Optimization
The knowledge uploaded by local clients may have different levels of reliability. Clients with large sample sizes may contribute substantial redundant knowledge and low-quality clients may produce low-confidence knowledge, whereas clients containing rare faults, despite having fewer samples, may carry critical class-boundary information. Therefore, the federated center constructs class-level knowledge weights according to sample scarcity, output confidence, and intra-class dispersion, as shown in
Figure 3.
For the
-th fault class in the
-th client, its knowledge reliability is defined as
where
denotes the information entropy. A larger
indicates a more stable output distribution for the corresponding class. The intra-class dispersion is further defined as
Based on this, the class-level federated knowledge weight is constructed as
where
denotes the set of clients containing knowledge of the
-th fault class;
represents the class scarcity degree; and
,
, and
control the effects of class scarcity, output reliability, and intra-class dispersion, respectively. In the validation stage,
and
were selected from
and
was selected from
. The final setting was
. This setting increases the contribution of reliable rare-fault knowledge while avoiding excessive amplification of unstable minority samples or excessive suppression of dispersed rare-fault features.
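One way to picture the class-level weighting is sketched below. The reliability and weight equations are missing from the text above, so the functional forms are assumptions: reliability is taken as one minus the normalized entropy of the mean class output, scarcity as the inverse local sample count, and dispersion enters through an exponential penalty. The default exponents are placeholders, not the paper's final setting.

```python
import math


def reliability(mean_probs):
    """1 minus normalized entropy of a mean output distribution (assumed form)."""
    n = len(mean_probs)
    entropy = -sum(p * math.log(p) for p in mean_probs if p > 0.0)
    return 1.0 - entropy / math.log(n)


def fusion_weights(counts, reliabilities, dispersions,
                   alpha=0.5, beta=1.0, gamma=1.0):
    """Normalized weights over clients for ONE fault class.

    counts[k], reliabilities[k], dispersions[k] describe client k.
    A client with count 0 does not hold the class and gets weight 0.
    """
    raw = []
    for n, r, d in zip(counts, reliabilities, dispersions):
        if n == 0:
            raw.append(0.0)
            continue
        scarcity = 1.0 / n  # fewer local samples -> larger scarcity (assumption)
        raw.append(scarcity ** alpha * max(r, 0.0) ** beta * math.exp(-gamma * d))
    total = sum(raw)
    if total == 0.0:
        # Fall back to uniform weights over the clients that hold the class.
        holders = sum(1 for n in counts if n > 0)
        return [1.0 / holders if n > 0 else 0.0 for n in counts]
    return [x / total for x in raw]


if __name__ == "__main__":
    # Two clients hold the class with equal reliability; the third lacks it.
    print(fusion_weights([200, 10, 0], [0.9, 0.9, 0.0], [0.2, 0.2, 0.0]))
```

With equal reliability and dispersion, the small-sample client receives the larger weight, while an unreliable or widely dispersed client is down-weighted regardless of its size.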
Based on
, the federated center fuses the class prototypes and relation knowledge as follows:
where
denotes the set of clients that contain relation knowledge between class
and class
, and
is obtained by normalizing
and
. The federated center then optimizes the global teacher model using the fused knowledge, with the objective function defined as
where
denotes the supervised loss computed on the public anchor set or a small calibration set at the federated center. If labeled anchor data are unavailable, this term can be removed, and the teacher model can be updated only through distillation and relation constraints.
represents the weighted fusion result of output-discriminative knowledge from multiple clients.
and
denote the class-prototype relations and input-sensitive relations obtained by the teacher model on the anchor samples, respectively. This process enables the global teacher model not only to inherit the classification capability of local clients, but also to preserve the fault-class structure and the sensitivity responses to weak fault features.
3.4. Redundancy-Gated Dynamic Slimming of the Student Model
The global teacher model integrates diagnostic knowledge from multiple clients, but it may also contain redundant knowledge. Redundant knowledge mainly refers to majority-class-biased responses, repetitive feature representations from similar clients, and low-contribution channels that have weak effects on rare-fault discrimination. If such information is directly transferred to the student model, it may increase model complexity and weaken rare-fault boundaries. Therefore, this study introduces learnable gates to suppress low-contribution channels, while relation constraints are used to retain channels associated with output discrimination, class-prototype structure, and input-sensitive rare-fault responses, as shown in
Figure 4.
Let
denote the intermediate feature of the
-th layer in the student model. A learnable gating vector
is introduced to control the retention degree of channels or neurons in this layer:
where
denotes the gating parameter,
is the sigmoid function, and
represents element-wise multiplication. The training objective of the student model consists of output distillation, class-relation preservation, sensitivity relation preservation, and gate sparsity regularization. A smaller gate value indicates that the corresponding channel contributes less to the relation-preserving objective and is more likely to carry redundant or weakly task-related information.
where the first four terms enable the student model to inherit the output-discriminative capability, class structure relations, and input-sensitive responses of the teacher model. The term
is used to compress redundant channels, while
encourages the gating values to converge toward either 0 or 1, thereby avoiding ambiguous structure selection during training. For the student objective in Equation (14), the output distillation weight is used as the reference term and is set to
. The prototype relation and sensitivity relation weights are selected from
, and both are set to
. This moderate setting preserves fault-class structures and weak-fault responses without over-constraining the student model. The sparsity and binary regularization weights are set to
and
, respectively, to encourage compact and clear channel selection without causing excessive pruning under composite imbalance conditions.
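The gating mechanism and its two regularizers can be sketched in a few lines. The student objective is missing from the text above, so the penalty forms here are common choices rather than the paper's confirmed definitions: the sparsity term is the mean gate value, and the binarization term is the mean of g(1 - g), which vanishes only when gates sit at 0 or 1. All names are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gates(theta):
    """Map learnable gate parameters to retention degrees in (0, 1)."""
    return [sigmoid(t) for t in theta]


def apply_gates(feature, g):
    """Scale each channel (a list of activations) by its gate value."""
    return [[g_c * v for v in channel] for channel, g_c in zip(feature, g)]


def sparsity_penalty(g):
    """Mean gate value; pushing it down closes low-contribution channels."""
    return sum(g) / len(g)


def binary_penalty(g):
    """Mean of g * (1 - g); zero only when every gate is exactly 0 or 1."""
    return sum(x * (1.0 - x) for x in g) / len(g)


if __name__ == "__main__":
    g = gates([4.0, 0.0, -4.0])
    print(g, sparsity_penalty(g), binary_penalty(g))
    print(apply_gates([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], g))
```

In training, these two penalties would be added to the distillation and relation-preservation terms with their own weights, so a channel is closed only when doing so costs little on the relation-preserving objective.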
After training, the retained structure is determined according to the mean gating value:
where
is the structure-retention threshold, and
denotes the retained structural set of the student model after gate-based dynamic slimming. To prevent excessive compression from degrading rare-fault recognition performance, a performance-preservation criterion is introduced:
where
represents the final slimmed student model constructed according to
,
denotes the recall of rare-fault classes, and
and
are the allowable degradation ranges of the overall accuracy and rare-fault recall, respectively. It should be noted that this criterion is used only during the offline slimming and model selection stage. In this stage, a small labeled calibration set, historical fault records, or laboratory-collected rare-fault samples can be used to prevent excessive compression from damaging rare-fault recognition. During online deployment, the slimmed model performs inference directly and does not require rare-fault labels.
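The offline selection step can be sketched as a threshold search. The retention rule and the criterion are missing as equations above, so this assumes that a channel is kept when its mean gate is at least the threshold and that the criterion bounds the drop in accuracy and rare-fault recall relative to a reference model. The `evaluate` callback is a hypothetical stand-in for fine-tuning and scoring the slimmed student on the calibration data.

```python
def retained_channels(mean_gates, tau):
    """Indices of channels whose mean gate value is at least tau."""
    return [i for i, g in enumerate(mean_gates) if g >= tau]


def passes_criterion(ref, cand, eps_acc, eps_rec):
    """ref and cand are (accuracy, rare_recall) pairs; drops must stay in tolerance."""
    return ref[0] - cand[0] <= eps_acc and ref[1] - cand[1] <= eps_rec


def select_structure(mean_gates, evaluate, ref, tau=0.5, step=0.05,
                     max_relax=10, eps_acc=0.03, eps_rec=0.05):
    """Relax the threshold until the slimmed model meets the criterion.

    evaluate(kept) is a hypothetical callback returning (accuracy, rare_recall)
    for the student restricted to the kept channels. Returns (tau, kept),
    or None if no threshold within max_relax steps is acceptable.
    """
    for k in range(max_relax + 1):
        t = tau - k * step
        kept = retained_channels(mean_gates, t)
        if passes_criterion(ref, evaluate(kept), eps_acc, eps_rec):
            return t, kept
    return None


if __name__ == "__main__":
    # Toy scores indexed by the number of retained channels (illustrative only).
    table = {2: (0.80, 0.60), 3: (0.84, 0.78), 4: (0.85, 0.80)}
    result = select_structure([0.9, 0.6, 0.47, 0.1],
                              lambda kept: table[len(kept)],
                              ref=(0.85, 0.80))
    print(result)
```

In this toy run the first threshold keeps two channels and fails on both metrics, so the threshold is relaxed once and the third channel is restored.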
3.5. Algorithm Implementation
The overall training and slimming procedure is summarized in Algorithm 1.
Algorithm 1. Relation knowledge-guided federated dynamic slimming
Input: Local datasets, calibration set, global teacher model, student model, maximum communication rounds.
Output: Slimmed student model.
Initialize the global teacher model, local client models, student model, and gate parameters.
For each communication round:
    The server broadcasts the current global teacher model to all clients.
    For each client:
        Update the local model on the local dataset using the class-balanced loss.
        Extract output knowledge, class-prototype relation, and input-sensitive relation.
        Calculate the class sample number, output reliability, and intra-class dispersion.
        Upload the knowledge descriptor to the server.
    End for.
    The server calculates the rare-fault-aware fusion weight.
    Fuse the multi-client output knowledge, class-prototype relations, and input-sensitive relations.
    Update the global teacher model on the calibration set.
    Stop teacher training if the stopping criterion is satisfied.
End for.
Train the gated student model under the guidance of the global teacher model.
Update the student model parameters and gate parameters by back-propagation.
Calculate the mean gate value of each channel.
Retain the channels whose mean gate values are not smaller than the retention threshold.
Construct the slimmed student model.
If the slimmed student model satisfies the accuracy and rare-fault recall preservation criterion, output it. Otherwise, relax the retention threshold and fine-tune the student model.
In Algorithm 1, the calibration set is only used for server-side teacher optimization and is not included in local client training or final testing.
4. Experimental Results and Analysis
The effectiveness of the proposed method is evaluated on two public datasets. The CWRU bearing fault dataset is used to assess class absence, low-quality small-sample conditions, quality–scale imbalance, and sample–scale imbalance under single-modal vibration signals. The HUST multi-modal motor fault dataset contains both vibration and acoustic signals and is used to evaluate multi-modal fragmentation, cross-condition generalization, and modality missing at the deployment stage. These two datasets are complementary in terms of signal modality, fault pattern, and experimental setting.
4.1. Experimental Setup and Comparison Methods
The CWRU dataset is provided by the Case Western Reserve University Bearing Data Center [33], and its experimental platform is shown in Figure 5. In this study, the data collected under a 1 HP load, a rotational speed of 1797 r/min, and a sampling frequency of 48 kHz are selected to construct the diagnosis task. Signals of each class are segmented into samples using a fixed-length window and then normalized by zero-mean standardization. To simulate multi-client federated diagnosis, the dataset is divided among three clients, and three scenarios are designed: composite imbalance, quality–scale imbalance, and sample–scale imbalance. The experimental setup is shown in Table 1.
The HUST motor dataset is a public multi-modal motor fault dataset, and its experimental platform is shown in Figure 6 [34]. The platform was built based on the Spectra Quest Mechanical Fault Simulator, and the monitored object is a motor system under different health conditions. The dataset contains six operating states, namely, healthy condition, bearing fault, rotor bow, broken rotor bar, rotor misalignment, and voltage unbalance. The operating frequencies are 5, 10, 20, and 30 Hz, and the sampling frequency is 25.6 kHz. Each file contains 163,840 data points. In this study, the dataset is used to construct three experimental scenarios: modality-fragmented rare-fault diagnosis, cross-condition generalization, and modality missing at the deployment stage. The experimental setup is shown in Table 2.
The details of the network structure and parameter settings under different datasets during the experiment are shown in
Table 3.
The CWRU experiments are designed to highlight single-modal class imbalance and the influence of model compression. Therefore, in addition to general federated baselines, quantization, pruning, and knowledge distillation methods related to model compression and communication efficiency are introduced. The HUST motor experiments focus on multi-modal and cross-condition heterogeneous scenarios. Accordingly, FedBN, FedDF, and FedSlim are further included on the basis of general federated baselines, since these methods are able to address distribution shift, server-side ensemble distillation, and model-structure differences, as shown in
Table 4.
To ensure fairness, all compared methods used the same preprocessing procedure, input length, training/testing split, and 1D-CNN backbone whenever applicable. Compression-based methods were compared under comparable parameter budgets. Hyperparameters were selected only on the validation set, with the same search budget used for all methods. Each experiment was repeated five times with different random seeds and regenerated client partitions. Statistical significance between the proposed method and the strongest baseline was evaluated using a paired t-test, with p < 0.05 considered significant.
The evaluation metrics include average accuracy, Macro-F1, rare-class recall, the number of parameters, FLOPs, inference time, and GPU memory usage. In the CWRU dataset, OR14 is defined as the primary rare-fault class, while in the HUST motor dataset, the broken rotor bar fault, BRB, is defined as the primary rare-fault class. Macro-F1 is used to prevent the performance of minority classes from being masked by majority classes, and rare-class recall is used to directly evaluate the core objective of the proposed method.
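To make the role of the two class-sensitive metrics concrete, the sketch below computes Macro-F1 and the recall of a single designated rare class from label lists. It is a plain reference implementation of the standard definitions; the toy labels are illustrative and are not experimental data.

```python
def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of per-class F1 scores (classes with no support score 0)."""
    scores = []
    for c in range(n_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n_classes


def class_recall(y_true, y_pred, cls):
    """Fraction of true samples of class cls that were predicted as cls."""
    support = sum(1 for t in y_true if t == cls)
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    return hits / support if support else float("nan")


if __name__ == "__main__":
    # Class 1 plays the role of the rare fault: 2 of 6 samples.
    y_true = [0, 0, 0, 0, 1, 1]
    y_pred = [0, 0, 0, 0, 0, 1]
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    print(accuracy, macro_f1(y_true, y_pred, 2), class_recall(y_true, y_pred, 1))
```

In this toy case a single missed rare sample leaves accuracy at 5/6 but halves the rare-class recall and pulls Macro-F1 down to 7/9, which is the masking effect the two metrics are meant to expose.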
4.2. Results Under the CWRU Composite Imbalance Scenario and Rare-Fault Recognition Analysis
The composite imbalance scenario is the most challenging setting in the CWRU experiments. Since OR14 is provided only by Client 1, which contains a limited number of samples and is contaminated by noise, conventional federated aggregation tends to regard its information as weak local knowledge or low-confidence knowledge, causing it to be overwhelmed by the information from the majority of clients.
Table 5 and
Figure 7 present the quantitative results under this scenario.
As shown in
Table 5, conventional single-client models and standard compression methods exhibit poor performance on Client 1, indicating that a small sample size and low signal-to-noise ratio weaken the feature learning capability of local models. FedAvg improves the average accuracy to 64.0%, but its OR14 recall remains only 34.0%, suggesting that parameter averaging is easily dominated by clients with larger sample sizes and more complete class distributions. Fed-Prox alleviates client drift through a proximal constraint and increases the average accuracy to 68.8%; however, its optimization objective does not explicitly emphasize rare-class preservation. Fed-Proto mitigates partial inter-class structural shift by using class prototypes, increasing the OR14 recall to 56.0%, but it lacks input-sensitive relations and gated redundancy suppression. Fed-KD further enhances global consistency by exploiting soft output knowledge, yet its OR14 recall remains 58.0%. In contrast, the proposed method achieves an average accuracy, Macro-F1, and OR14 recall of 84.9%, 84.0%, and 80.0%, respectively, demonstrating that multi-granularity relation knowledge and rare-fault-aware weighting can significantly improve minority-class boundary preservation.
Combining
Table 5 and the confusion matrices, the differences between the methods are reflected not only in the average accuracy results but more importantly in the misclassification of the rare OR14 class. DNN, quantization, and pruning can recognize several majority fault classes, but they show severe misclassification for OR14, which is mainly confused with the adjacent outer-race faults OR07 and OR21. This indicates that single-client training and conventional compression methods struggle to preserve rare-fault boundaries under small-sample and low-SNR conditions. FedAvg strengthens the overall diagonal pattern and improves the average accuracy to 64.0%, but its OR14 recall remains only 34.0%, suggesting that conventional parameter aggregation is easily dominated by majority clients and majority classes. Fed-Prox and Fed-Proto improve the results through proximal regularization and class-prototype alignment, respectively, but evident confusion still exists between OR07, OR14, and OR21. Fed-KD further improves global consistency using soft output knowledge, yet its OR14 recall is only 58.0%, showing that output distillation alone is insufficient to preserve fine-grained boundaries between similar faults.
In contrast, the proposed method achieves an average accuracy of 84.9%, a Macro-F1 of 84.0%, and an OR14 recall of 80.0%, while substantially reducing the misclassification between OR14 and its adjacent classes OR07 and OR21. This demonstrates that class-prototype relations preserve the structural distribution of similar faults, input-sensitive relations enhance weak-response boundaries, rare-fault-aware weighting increases the contribution of reliable OR14 knowledge from the small-sample client, and the gated student model retains key diagnostic channels after compression. These results confirm that the proposed method can effectively preserve rare-fault recognition capability under composite imbalance conditions.
To further visualize the effect of input-sensitive relation knowledge, feature-response heatmaps were generated for rare OR14 test samples. The response intensity was calculated from the channel-wise feature activation weighted by the sensitivity of the rare-fault output to the corresponding feature channel and then normalized to
. Fed-KD, the proposed method without input-sensitive relation, and the complete proposed method were compared, as shown in
Figure 8.
In
Figure 8, Fed-KD produces scattered feature responses, indicating that output-level distillation alone cannot clearly focus on weak rare-fault-related channels. After removing input-sensitive relation knowledge, the response region becomes clearer than Fed-KD, but the activation is still not sufficiently concentrated. In contrast, the complete proposed method produces stronger and more stable responses on a group of rare-fault-sensitive channels. This indicates that input-sensitive relation knowledge helps the student model retain weak-response patterns associated with OR14 and reduces the confusion between OR14 and adjacent outer-race faults.
4.3. Robustness and Deployment Efficiency Analysis Under CWRU Scenarios
To further verify the stability of the proposed method under different imbalance intensities,
Figure 9 presents radar chart comparisons under three scenarios: composite imbalance, quality–scale imbalance, and sample–scale imbalance. Each subplot includes four metrics: Client 1 accuracy, average accuracy, Macro-F1, and OR14 recall. To ensure figure readability, only six representative methods are shown, namely, DNN, FedAvg, Fed-Prox, Fed-Proto, Fed-KD, and the proposed method.
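For reference, Macro-F1 and single-class recall such as OR14 recall can be computed from a confusion matrix as sketched below. The sketch assumes the standard definitions, with rows as true classes and columns as predicted classes; the function names and the toy matrix are illustrative and are not taken from the experiments.

```python
from typing import List

Matrix = List[List[int]]


def class_recall(cm: Matrix, k: int) -> float:
    """Recall of class k: correctly predicted samples over all true samples."""
    total_true = sum(cm[k])
    return cm[k][k] / total_true if total_true else 0.0


def macro_f1(cm: Matrix) -> float:
    """Unweighted mean of per-class F1 scores (rows = true, cols = predicted)."""
    n = len(cm)
    f1_scores = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                       # true k, predicted otherwise
        fp = sum(cm[i][k] for i in range(n)) - tp  # predicted k, true otherwise
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / n


if __name__ == "__main__":
    # Toy two-class confusion matrix, made up for illustration only.
    toy = [[8, 2],
           [1, 9]]
    print(class_recall(toy, 0), macro_f1(toy))
```

Because Macro-F1 averages the per-class scores without weighting by class size, a poorly recognized rare class lowers it as much as a poorly recognized majority class would, which is why it is reported alongside average accuracy.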
As shown in
Figure 9, the overall performance of all methods improves as the degree of imbalance decreases, while the proposed method maintains the largest radar coverage area across all three scenarios. In the composite imbalance scenario, the advantage is most evident in terms of OR14 recall, indicating that the proposed method can effectively address the rare-fault suppression problem caused by the coexistence of missing fault categories and low-quality few-shot samples. In the quality–scale imbalance scenario, the proposed method still achieves superior performance in Client 1 accuracy and Macro-F1, demonstrating that the knowledge reliability evaluation can select effective knowledge from noisy few-shot clients. In the sample–scale imbalance scenario, the proposed method continues to outperform Fed-Proto and Fed-KD, suggesting that the gated student model not only compresses parameters but also suppresses the propagation of repetitive knowledge from large-sample clients.
To make the deployment efficiency comparison reproducible, all models used the same input length, preprocessing procedure, and batch size. The inference latency was measured with a batch size of 1 to simulate online diagnosis. Each model was first executed for 100 warm-up runs and then timed over 1000 repeated inference runs. The reported latency is the average single-sample inference time over the timed runs. FLOPs and parameter counts were calculated using the same profiling tool; the specific settings are shown in
Table 6.
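The latency protocol above can be sketched as a small timing helper. This is an illustrative sketch, not the measurement script used in the experiments: `mean_latency_ms` and its parameters are hypothetical names, `infer` stands for any single-sample inference callable, and treating the first 100 runs as untimed warm-up is an assumption about how those runs were used.

```python
import time
from typing import Any, Callable


def mean_latency_ms(infer: Callable[[Any], Any],
                    sample: Any,
                    warmup: int = 100,
                    runs: int = 1000) -> float:
    """Average single-sample inference time in milliseconds.

    `infer` is a hypothetical callable wrapping the model's forward pass
    for one sample (batch size 1).
    """
    if runs <= 0:
        raise ValueError("runs must be positive")

    # Untimed runs so that caches and lazy initialization do not skew timing.
    for _ in range(warmup):
        infer(sample)

    start = time.perf_counter()
    for _ in range(runs):
        infer(sample)
    elapsed = time.perf_counter() - start

    return elapsed / runs * 1000.0


if __name__ == "__main__":
    # Stand-in workload; a real measurement would call the deployed model.
    print(mean_latency_ms(lambda x: sum(x), list(range(1024))))
```

If the model runs on a GPU through a framework with asynchronous kernel launches, `infer` would also need to wait for the device to finish (for example, via `torch.cuda.synchronize()` in PyTorch) before returning; otherwise the timer only captures the launch overhead.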
Under the above deployment setup,
Figure 10 shows that FedAvg and Fed-Prox require a higher computational cost and GPU memory because they retain the complete global model structure.
Fed-Proto and Fed-KD outperform FedAvg in knowledge transfer, but their compression capability remains limited. In contrast, the proposed method selectively preserves key channels and parameters through the gated student network, thereby reducing redundant computation while retaining relation knowledge. Consequently, it achieves the highest overall deployment efficiency. The CWRU results demonstrate that the proposed method can simultaneously improve rare-fault recall and deployment-oriented efficiency under the tested hardware and measurement protocol.
4.4. Multi-Modal Federated Diagnosis Results on HUST Motor
To verify the applicability of the proposed method under multi-modal and cross-condition settings, further experiments are conducted on the HUST motor dataset. The main challenge of this dataset is not simple sample–scale imbalance but modality fragmentation, operating-condition variation, and missing modalities at the deployment stage. Therefore, FedBN and FedDF are additionally included as baseline methods in this section, and BRB is defined as the primary rare-fault class.
Table 7 presents the quantitative results under the modality-fragmented rare-fault scenario.
In
Table 7, Local-DNN achieves an average accuracy of only 62.8% and a BRB recall of 35.0%, indicating that a single-client model struggles to learn multi-modal information and rare-fault features simultaneously. FedAvg improves the overall performance, but its BRB recall remains limited, suggesting that simple parameter aggregation is easily dominated by majority modalities and majority classes. FedBN alleviates cross-condition and cross-modal feature shifts by retaining local normalization statistics, while FedProto improves structural consistency through class prototypes; both methods outperform FedAvg. FedDF and Fed-KD further enhance global consistency using distillation knowledge, but they mainly rely on output responses or ensemble outputs, making it difficult to fully characterize inter-modal class structures and input-sensitive responses. The proposed method achieves 88.4% average accuracy, 87.8% Macro-F1, and 84.0% BRB recall, outperforming Fed-KD by 7.2, 7.8, and 18.0 percentage points, respectively. This indicates that the proposed method can preserve rare-fault knowledge under modality-fragmented and client-heterogeneous conditions.
Figure 11 shows that Local-DNN and FedAvg can identify majority classes such as H and BF to some extent, but they suffer from severe misclassification for the BRB class, which is mainly confused with BF and UNBAL. This phenomenon is consistent with the mechanism of broken rotor bar faults: a broken rotor bar induces electromagnetic torque fluctuations, whose responses are similar to those caused by voltage unbalance, and it may also trigger mechanical vibration variations. FedBN and FedProto reduce part of the cross-modal confusion, but evident overlap between BRB and UNBAL remains. Fed-KD improves the overall diagonal structure, yet the rare-class boundary is still not sufficiently clear. In contrast, the proposed method presents the clearest diagonal structure, with a substantially increased number of correctly identified BRB samples and significantly reduced misclassification among adjacent classes. This indicates that relation knowledge fusion improves boundary separability among similar fault categories.
4.5. Cross-Condition Generalization Results on the HUST Motor Dataset
The 5, 10, and 20 Hz operating conditions are used as local training conditions, while the 30 Hz condition is treated as an unseen testing condition. This setting is designed to examine whether federated knowledge fusion can learn fault-discriminative relations that remain stable across different operating frequencies.
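The split described above can be sketched as follows. The record layout (a `freq_hz` field per sample) and the function name are hypothetical, and the one-client-per-training-frequency assignment is an assumed reading of the setup.

```python
from typing import Dict, List, Tuple

TRAIN_FREQS_HZ = (5, 10, 20)  # local training conditions
TEST_FREQ_HZ = 30             # unseen testing condition

Record = Dict[str, object]


def split_by_condition(
    records: List[Record],
) -> Tuple[Dict[int, List[Record]], List[Record]]:
    """Group training records by operating frequency and hold out 30 Hz.

    Assumption: each record carries a hypothetical `freq_hz` field, and
    each training frequency is mapped to one client.
    """
    clients: Dict[int, List[Record]] = {f: [] for f in TRAIN_FREQS_HZ}
    test: List[Record] = []
    for record in records:
        freq = record["freq_hz"]
        if freq == TEST_FREQ_HZ:
            test.append(record)
        elif freq in clients:
            clients[freq].append(record)
        # Records at any other frequency are ignored in this sketch.
    return clients, test


if __name__ == "__main__":
    # Placeholder records, made up for illustration only.
    data = [{"freq_hz": f, "label": "BRB"} for f in (5, 10, 20, 30, 30)]
    clients, test = split_by_condition(data)
    print({f: len(v) for f, v in clients.items()}, len(test))
```

Keeping the 30 Hz records out of every client guarantees that no information from the test condition reaches the server through the fused knowledge.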
Table 8 presents the cross-condition generalization results.
Cross-condition testing significantly degrades the performance of all methods, indicating that operating-frequency variation leads to feature distribution shifts. FedBN outperforms FedProx in this scenario, suggesting that retaining local normalization statistics helps alleviate feature shifts across different operating conditions. FedDF and Fed-KD achieve average accuracies of 75.6% and 76.8%, respectively, showing that distillation mechanisms can improve global consistency under unseen conditions. The proposed method achieves 84.3% average accuracy, 83.5% Macro-F1, and 78.0% BRB recall, while Fed-KD achieves 76.8% average accuracy, 75.1% Macro-F1, and 61.0% BRB recall. These results demonstrate that the proposed method remains effective under cross-condition and cross-device-like federated heterogeneity.
Figure 12 further illustrates the feature distributions on the unseen-condition test set. FedAvg shows evident overlap among different fault classes, especially with blurred boundaries among BF, BRB, and UNBAL. FedProto improves the structure of class centers, but the intra-class dispersion remains relatively large. Fed-KD enhances class compactness through output knowledge distillation, yet overlap between BRB and UNBAL still exists. In contrast, the proposed method forms more compact intra-class distributions and clearer inter-class separations, indicating that multi-granularity relation knowledge improves the structural consistency of the feature space and enables the model to maintain strong fault discriminability under unseen operating frequencies.
In practical deployment, the edge side may only have access to a single-modal signal due to sensor failure, communication constraints, or installation limitations. Therefore, the diagnostic capability of the model is further evaluated under vibration-only, acoustic-only, and random modality-missing conditions. This scenario is designed to verify whether the gated dynamic slimming mechanism of the student model can preserve effective diagnostic knowledge during lightweight deployment, as shown in
Table 9.
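The three input conditions can be emulated at test time with a simple masking step, sketched below. This is one possible realization under stated assumptions: the missing modality is replaced by zeros, and the random condition drops exactly one modality chosen uniformly at random. The function and mode names are hypothetical.

```python
import random
from typing import List, Optional, Tuple


def mask_modalities(vibration: List[float],
                    acoustic: List[float],
                    mode: str,
                    rng: Optional[random.Random] = None,
                    ) -> Tuple[List[float], List[float]]:
    """Return (vibration, acoustic) with the missing modality zero-filled.

    Assumptions: zero-filling stands in for a missing sensor, and the
    "random" mode drops one modality chosen uniformly at random.
    """
    if mode == "random":
        if rng is None:
            rng = random.Random()
        mode = rng.choice(["vibration_only", "acoustic_only"])

    if mode == "vibration_only":
        return list(vibration), [0.0] * len(acoustic)
    if mode == "acoustic_only":
        return [0.0] * len(vibration), list(acoustic)
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    # Toy signals, made up for illustration only.
    print(mask_modalities([0.1, 0.2], [0.3, 0.4], "vibration_only"))
```

Passing a seeded `random.Random` instance makes the random modality-missing condition repeatable across the compared methods.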
Figure 13 shows that acoustic single-modality diagnosis generally performs worse than vibration single-modality diagnosis, indicating that vibration signals contain more direct discriminative information for most mechanical faults, whereas acoustic signals are more susceptible to propagation paths and environmental noise.
After model size reduction, FedSlim achieves a slightly lower accuracy than Fed-KD, suggesting that conventional model slimming may lead to the loss of cross-modal knowledge. The proposed method achieves the best results under all three input conditions, with an accuracy of 89.2% for vibration-only input, 84.6% for acoustic-only input, and 87.4% under random modality-missing conditions. These results indicate that input-sensitive relations help the student model learn the different contributions of each modality to fault discrimination, while the gating mechanism preserves critical channels and suppresses redundant ones, thereby improving robustness at the deployment end. The cross-condition setting used in this section also provides a cross-device-like validation because different operating frequencies are assigned to different clients and the 30 Hz condition is completely unseen during training.
To further clarify the rare-fault recognition capability of the proposed method, the rare-fault results under different challenging scenarios are summarized in
Table 10. These scenarios include composite imbalance in the CWRU dataset, modality-fragmented rare-fault diagnosis in the HUST motor dataset, and unseen operating-condition testing on the HUST motor dataset. Fed-KD is selected as the main reference method because it is the strongest distillation-based baseline in most rare-fault settings.
As shown in
Table 10, the proposed method consistently outperforms Fed-KD in rare-fault recall under different challenging settings. In the CWRU composite imbalance scenario, the OR14 recall is improved from 58.0% to 80.0%. In the HUST modality-fragmented rare-fault scenario, the BRB recall is improved from 66.0% to 84.0%. Under the unseen 30 Hz operating condition, the BRB recall is improved from 61.0% to 78.0%. These results indicate that the proposed method improves not only average accuracy, but also minority-fault recognition under sample imbalance, modality fragmentation, and unseen-condition generalization.
To verify that the reported improvements are not caused by a specific random initialization or client partition, the proposed method and the strongest baseline were compared over five repeated runs. The results are summarized in
Table 11.
In
Table 11, the proposed method consistently outperforms Fed-KD over repeated runs. The improvements in rare-fault recall and Macro-F1 are statistically significant under the main challenging scenarios, indicating that the gains are not caused by random initialization or a single client partition.
4.6. Lightweight and Deployment Performance Analysis
In addition to diagnostic performance, the model complexity and deployment cost of the main federated methods are further compared. Since the HUST motor dataset contains dual-modal inputs, its deployment complexity is higher than that of the single-modal CWRU experiment. Therefore, this experiment better reflects the practical significance of the dynamic slimming mechanism. The experimental results are shown in
Table 12.
Table 12 further shows that FedDF has the largest number of parameters and the highest GPU memory consumption because server-side ensemble distillation requires maintaining relatively complex models. Fed-KD reduces model complexity to some extent, but its student model still needs to preserve relatively complete feature channels to fit the teacher outputs. FedSlim achieves lower complexity; however, due to the lack of a relational knowledge preservation mechanism, its accuracy decreases under modality-missing conditions. The proposed method outperforms Fed-KD in terms of parameter size, FLOPs, inference time, and GPU memory consumption, and it is slightly better than FedSlim, while maintaining the highest diagnostic performance. These results indicate that the gated student model does not simply remove parameters but selectively preserves critical diagnostic knowledge under the constraint of relational distillation, thereby achieving a balance between performance and efficiency.
4.7. Ablation and Sensitivity Analysis
To isolate the contribution of each module, ablation studies were conducted by removing one component from the complete model at a time. The evaluated components included output-discriminative knowledge, class-prototype relations, input-sensitive relations, rare-fault-aware weighting, and gated slimming. The CWRU composite imbalance scenario was used as the main ablation setting because it contains class absence, low-quality rare samples, and majority-class dominance simultaneously. The results are shown in
Table 13.
As shown in
Table 13, removing any component leads to a decrease in diagnostic performance or deployment efficiency. Removing output knowledge reduces the overall accuracy and Macro-F1, indicating that soft decision responses are important for federated knowledge transfer. Removing prototype relations decreases OR14 recall from 80.0% to 72.0%, showing that class structure preservation is necessary for distinguishing adjacent fault classes. Removing input-sensitive relations also reduces OR14 recall, indicating that sensitivity responses help preserve weak rare-fault features. The largest drop in rare-fault recall is observed when rare-fault-aware weighting is removed, confirming that minority-fault knowledge can be suppressed during conventional federated fusion. When gated slimming is removed, the model achieves slightly higher accuracy but requires a full parameter structure. Therefore, the gated slimming module improves deployment efficiency while maintaining rare-fault recognition performance.
To further verify the weighting design in Equation (10), a compact sensitivity analysis was conducted on the CWRU composite imbalance scenario, as shown in
Table 14.
As shown in
Table 14, removing the scarcity or reliability term clearly reduces OR14 recall, indicating that both minority-class enhancement and confidence-based reliability are important for rare-fault preservation. A weak dispersion penalty may allow unstable local knowledge to enter the global model, while a strong dispersion penalty may suppress useful rare-fault knowledge from small-sample clients. The selected setting achieves the best balance among average accuracy, Macro-F1, and rare-fault recall.
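To make the roles of the three terms concrete, the sketch below shows one plausible way a scarcity term, a reliability term, and a dispersion penalty could be combined into normalized client weights. It is not a reproduction of Equation (10): the multiplicative form, the exponential penalty, the parameter name `penalty`, and the function name are all assumptions made for illustration.

```python
import math
from typing import List


def fusion_weights(scarcity: List[float],
                   reliability: List[float],
                   dispersion: List[float],
                   penalty: float = 1.0) -> List[float]:
    """Normalized client weights for one class.

    Assumed form (NOT the paper's Equation (10)):
        raw_k = scarcity_k * reliability_k * exp(-penalty * dispersion_k)
    followed by normalization so that the weights sum to 1.
    """
    if not (len(scarcity) == len(reliability) == len(dispersion)):
        raise ValueError("all term lists must have one entry per client")

    raw = [s * r * math.exp(-penalty * d)
           for s, r, d in zip(scarcity, reliability, dispersion)]
    total = sum(raw)
    if total <= 0.0:
        raise ValueError("weights are undefined when every raw score is zero")
    return [x / total for x in raw]


if __name__ == "__main__":
    # Toy scores for two clients, made up for illustration only.
    print(fusion_weights([1.0, 2.0], [0.5, 0.5], [0.0, 0.0]))
```

In this assumed form, a small `penalty` lets high-dispersion clients keep most of their weight, while a large `penalty` can shrink the weight of a small-sample client even when its scarcity score is high, which mirrors the trade-off discussed above.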
5. Conclusions and Future Work
This paper proposes a rare-fault-preserving relation knowledge-based federated dynamic slimming method to address data heterogeneity, weakened minority-class knowledge, model redundancy, and high edge-deployment costs in multi-client motor fault diagnosis. The method constructs multi-granularity diagnostic knowledge using output-discriminative knowledge, class-prototype relations, and input-sensitive relations, and it introduces rare-fault-aware weighting to enhance the contribution of reliable minority-class knowledge during federated fusion. For model lightweighting, a redundancy-gated student model slimming mechanism is designed to retain key diagnostic channels under relation distillation constraints, while a rare-fault recall preservation criterion prevents performance degradation caused by excessive compression. Experiments on the CWRU and HUST motor datasets show that the proposed method outperforms the compared methods under composite imbalance, cross-condition generalization, and modality-missing deployment scenarios. Confusion matrices and t-SNE results further verify its effectiveness in reducing rare-fault misclassification and improving feature separability, while complexity analysis demonstrates reduced parameter count, inference time, and GPU memory usage, indicating good potential for edge deployment.
The current validation is limited to a small number of clients and controlled non-IID settings. In ultra-large-scale or extreme non-IID scenarios, relation knowledge fusion may face higher communication costs and reduced reliability due to incomplete fault categories and biased operating conditions. In online federated learning, new faults and data drift may further affect the stability of learned relations. Future work will focus on scalable aggregation, uncertainty-aware weighting, drift-aware updating, and validation on in-house or industrial multi-client motor datasets.