1. Introduction
In recent years, mobile devices have emerged as the primary computational infrastructure for billions of users worldwide, and their number is expected to exceed tens of billions in the coming years. In this context, smartphones and wearable devices continuously generate massive amounts of data [1,2]. The continual expansion and dynamic variability of these data enable AI systems to achieve rapid performance gains. Traditional machine learning transfers user data to a centralized server for centralized training; however, much of the data are inherently privacy-sensitive, so this process carries a serious risk of privacy leakage [3]. The sharing of privacy-sensitive data [4,5] across clients and platforms is increasingly restricted by data protection regulations such as the GDPR, in response to mounting privacy concerns [6]. Balancing open data ecosystems with individual privacy rights therefore requires innovative approaches. Federated learning (FL), first proposed by McMahan et al. [7], has become an effective way to resolve this conflict and has been widely applied across many fields. Among FL algorithms, FedAvg [8] is the most representative federated averaging algorithm; its accuracy in Google keyboard prediction deviates by a mere 1.5% from the centralized training baseline. The successful deployment of FedAvg in production applications demonstrates that FL can strike a trade-off between privacy preservation and model efficacy. Although FL is effective at safeguarding data privacy, it still faces many problems.
The challenge of data heterogeneity, where data are not independent and identically distributed (IID), continues to hinder the performance of FL systems, particularly in real-world applications with diverse data sources. Reference [9] attributes heterogeneity in data distributions across client devices to the fact that the data on each device correspond to a specific user, geographic location, or time window. For example, FedAvg, a simple weighted-averaging approach, assumes that the data are IID across clients and that the model architecture is homogeneous, yet in practice client data are rarely IID. Karimireddy et al. [10] demonstrated that client drift occurs when the data are heterogeneous, leading to poor performance, and introduced an algorithmic solution that uses control variates to correct for client-specific variations during local updates, at the cost of increased communication between devices. Liao et al. [11] proposed a federated uniform representation enhancement framework that aims to generate uniform representations for federated unsupervised learning on non-independent and/or non-identically distributed (non-IID) data: a client-side flexible uniformity regularizer avoids representation collapse by dispersing samples uniformly, and a server-side efficient uniformity aggregator promotes global representation consistency by constraining the uniformity of client model updates. These methods mitigate the drift between the global and local models to some extent; nevertheless, they are insufficient to completely eliminate the adverse effects of data heterogeneity on convergence behavior and model accuracy. Hence, addressing data heterogeneity is imperative for improving the effectiveness of FL systems.
The presence of model heterogeneity in federated learning creates barriers to knowledge transfer between participants. In data-heterogeneous federated learning, client data often have different features and distributions [12], different hardware capabilities [13], or different tasks [14], so the suitable model structure and parameter settings also differ. However, in the traditional federated learning algorithm FedAvg, the global model is aggregated from the averaged weights of local models, which cannot meet the demand for customized models across diverse scenarios and tasks; as a result, each client prefers to design its local model independently. For example, Makhija et al. [15] proposed Federated Heterogeneous Neural Networks, which allow each client to build personalized models. In real-world scenarios, heterogeneous models [16] do not match in the parameter space during aggregation, and the global gradient drifts, making it difficult to transfer knowledge between clients. Li et al. [17] proposed a decentralized federated learning framework leveraging knowledge distillation to tackle the challenge of client-side knowledge transfer, allowing federated learning to be applied across independently designed models; however, this approach relies on a public dataset and lacks a continuous global model update mechanism, which restricts its scalability for new participants whose data characteristics may deviate from existing models and potentially degrade overall performance. To address model heterogeneity, Diao et al. [18] introduced a parameter subset selection strategy based on device capabilities to mitigate parameter space mismatch, although this method may cause weight imbalance because unshared model components are insufficiently trained on limited data subsets. These limitations highlight the critical need to overcome the constraints of traditional aggregation paradigms and knowledge transfer mechanisms and to develop more efficient and robust heterogeneous federated learning frameworks that accommodate diverse model architectures and data distributions while maintaining computational efficiency on resource-constrained devices.
The substantial communication burden of federated learning remains a critical issue: it reduces the efficiency of model training and has become a major bottleneck for system scaling. Federated learning requires frequent exchanges of model parameters [19], and when large-scale pre-trained models are deployed, the high communication overhead puts enormous pressure on bandwidth-constrained clients, directly restricting the practical application of large models in federated systems [20,21]. To reduce communication overhead, current research focuses on two main directions: gradient compression and co-distillation. Gradient compression [22] reduces the amount of transmitted data but is prone to significant performance degradation at high compression ratios and may weaken the model’s ability to handle data heterogeneity [23]. The co-distillation paradigm [24] reduces communication overhead by sharing local model predictions rather than transmitting model parameters, and it is typically adopted when the local model’s architecture exceeds the representational capacity of the public dataset. However, highly privacy-sensitive data [25] cannot be shared and exchanged in real-world scenarios. Itahara et al. [26] proposed a semi-supervised distillation method that reduces communication overhead but still lacks scalability in data-heterogeneous scenarios. The FedKD framework proposed by Wu et al. [27] removes the reliance on public data through bidirectional knowledge transfer, but its performance is limited by the quality of the teacher models. Therefore, balancing communication overhead and model performance remains a key challenge for federated learning.
In summary, although federated learning protects data privacy, it still suffers from three major problems: (1) data heterogeneity leads to significant model performance degradation; (2) model heterogeneity creates obstacles to knowledge transfer; and (3) high communication overhead reduces training efficiency. To address these problems, this paper proposes communication-efficient federated learning via knowledge distillation and ternary compression, named FedDT. The method dynamically combines knowledge distillation and ternary quantization in federated learning, and each client adopts a multi-round local update strategy to compress the model at two levels. First, an adaptive knowledge distillation mechanism is employed on the client side to transfer knowledge from the teacher model to a lightweight student model. Second, the distilled student model undergoes per-layer quantization and is further trained into a ternary-weight network, effectively reducing the parameter count through structured sparsity. Finally, the compressed ternary model is aggregated on the server after multiple rounds of local refinement. This approach not only preserves model performance through iterative optimization but also achieves substantial parameter reduction via progressive compression, offering particular advantages in large-scale distributed learning environments where data heterogeneity poses significant challenges.
This study makes the following key contributions:
(1) Personalized Federated Distillation Framework: We introduce a novel algorithm that enables clients to train customized teacher models tailored to their local data distributions. This personalized approach effectively alleviates the adverse effects of data heterogeneity, which is a critical challenge in federated learning, thereby enhancing both model performance and generalization capability.
(2) This paper addresses the challenge of knowledge transfer across heterogeneous local models by employing a student model as a unified intermediary. Specifically, the global model, initialized by the server, functions as a unified student model shared among all participating clients. During local updates, each client performs knowledge distillation from its personalized teacher model into the homogeneous student model, thereby eliminating cross-client knowledge transfer barriers.
(3) This paper proposes a two-level compression strategy that combines knowledge distillation with ternary quantization to reduce model parameters. In federated distillation, communication costs are fundamentally governed by the size of the student model being transmitted. By leveraging ternary quantization, the continuous weight parameters of the student model are mapped to a discrete space, significantly reducing communication costs while maintaining model performance.
(4) The experimental evaluation indicates that FedDT surpasses three baseline methods by 7.85% in model accuracy under both IID and non-IID conditions on the MNIST and Cifar10 datasets, while simultaneously reducing communication overhead by an average of 78%.
The remainder of this paper is organized as follows: The next section reviews prevailing research in the federated learning domain, providing critical context for our work. Section 3 introduces the preliminary definitions and the system architecture of FedDT. Section 4 presents the detailed implementation of the two modules within the FedDT framework, along with the overall implementation of the proposed method. Section 5 conducts comparative experiments between FedDT and three baseline methods, analyzing both model performance and communication overhead; complementary ablation studies further verify the individual contributions of each module. Finally, Section 6 summarizes the key findings and outlines promising avenues for future research.
4. Method
In this section, we present the client-side implementation of the knowledge distillation module and the ternary quantization module in FedDT, as well as the overall implementation of the FedDT framework.
4.1. Knowledge Distillation Module
To address the challenges posed by data and model heterogeneity in FL, this study introduces a novel personalized FL approach. Specifically, each client trains a customized heterogeneous teacher model tailored to its unique data distribution, effectively mitigating the adverse effects of data heterogeneity on model optimization. The server maintains a global model that serves as a unified student model across all clients. Through knowledge distillation, clients transfer knowledge from their personalized teacher models to this shared student model, thereby masking local model heterogeneity while preserving global consistency. This mechanism leverages the inherent properties of knowledge distillation, with the student model acting as an intermediary that harmonizes diverse local models. The detailed local update mechanism is illustrated in Figure 2.
The local knowledge distillation loss function is composed of three distinct components: task loss function, distillation loss function, and adaptive hidden loss.
(1) Task loss function: The discrepancy between the model’s predictions and the ground-truth labels is measured by the cross-entropy loss, which refines the model’s classification accuracy. For an input sample pair $(x_i, y_i)$, the soft probabilities predicted for sample $x_i$ by the teacher model and the student model are denoted as $p_i^t$ and $p_i^s$, respectively. The task loss is the cross-entropy between these predictions and the ground-truth label $y_i$.
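For reference, a standard per-sample cross-entropy formulation over $C$ classes is sketched below; the per-class notation $y_{i,c}$ and $p_{i,c}$, and the choice of which model’s predictions it is applied to, are illustrative assumptions rather than the paper’s exact equation:

$$ \mathcal{L}_{task}(p_i, y_i) = -\sum_{c=1}^{C} y_{i,c}\,\log p_{i,c}, \qquad p_i \in \{\,p_i^{s},\; p_i^{t}\,\} $$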
(2) Distillation loss function: The training objective combines the teacher model’s soft labels (from its temperature-softened softmax output) with the student model’s own predictions. The distillation loss is computed as the KL divergence between the two soft label distributions, enabling the student to mimic the teacher’s behavior and approximate its output distribution.
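For illustration, a temperature-scaled KL formulation of this loss is sketched below; the temperature $T$, the logits $z^{m}_{i}$, and the softened distributions $\tilde{p}^{t}_{i}$, $\tilde{p}^{s}_{i}$ are assumed notation rather than the paper’s exact symbols:

$$ \tilde{p}^{m}_{i,c} = \frac{\exp\!\left(z^{m}_{i,c}/T\right)}{\sum_{j}\exp\!\left(z^{m}_{i,j}/T\right)},\quad m\in\{t,s\}, \qquad \mathcal{L}_{KD} = T^{2}\,\mathrm{KL}\!\left(\tilde{p}^{t}_{i}\,\Vert\,\tilde{p}^{s}_{i}\right) $$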
In federated knowledge distillation, the teacher–student training dynamics directly influence the loss balance. When models converge effectively, the distillation loss takes precedence, effectively curbing overfitting while possibly impairing the student model’s ability to predict real labels accurately. In contrast, during early training stages or when data noise is pronounced, inadequate prediction reliability causes the task loss to prevail, obstructing efficient knowledge transfer. To overcome these issues, this paper introduces an adaptive distillation mechanism predicated on soft label quality perception as follows.
The adaptive mechanism controls the distillation intensity by dynamically adjusting the loss weights based on the prediction correctness of the teacher and student models. When the correct rate is high, the distillation loss weight is reduced and the focus shifts to task learning; when the correct rate is low, the distillation loss weight is increased to enhance knowledge transfer. The distillation loss is balanced against the task loss through temperature scaling and weighting, which facilitates student model training.
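One plausible instantiation of this adaptive weighting, written as an illustrative assumption rather than the paper’s exact rule, ties the distillation weight to the models’ batch accuracy:

$$ \lambda = 1 - \mathrm{acc}, \qquad \mathcal{L} = (1-\lambda)\,\mathcal{L}_{task} + \lambda\,\mathcal{L}_{KD} $$

where $\mathrm{acc}$ is the fraction of the current batch that the teacher (or student) predicts correctly, so a high correct rate shrinks the distillation term and a low correct rate strengthens it, matching the behavior described above.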
(3) Adaptive hidden loss: Because the teacher model’s hidden states and attention heat maps contain key features of the data together with their contextual dependencies, an additional adaptive hidden loss is added on top of the traditional knowledge distillation losses. The student model learns a more robust feature extraction capability by matching these representations. The corresponding hidden-state and attention losses are defined below.
The mean squared error (MSE) is utilized as the objective for these terms. Let $H_i^t$, $H_i^s$, $A_i^t$, and $A_i^s$ denote the hidden states and attention heat maps of the $i$th local teacher and student models, respectively, and let $W_h$ be a parameterized linear transformation matrix. We design an adaptive scaling method for the hidden loss, dynamically adjusted by the teacher–student prediction accuracy. In summary, the unified loss function for the local updates of the teacher and student models on each client is formulated as described below.
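Using the notation above, one common way to write these terms is sketched below; the scaling factor $\beta$ and the exact placement of the projection $W_h$ are assumptions rather than the paper’s equations:

$$ \mathcal{L}_{hid} = \beta\left(\mathrm{MSE}\!\left(W_h H_i^{s},\, H_i^{t}\right) + \mathrm{MSE}\!\left(A_i^{s},\, A_i^{t}\right)\right), \qquad \mathcal{L}_{total} = \mathcal{L}_{task} + \lambda\,\mathcal{L}_{KD} + \mathcal{L}_{hid} $$

with $\beta$ adjusted by the teacher–student prediction accuracy in the spirit of the adaptive scaling described above.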
The comprehensive loss function is constructed by aggregating the distillation loss, task loss, and adaptive hidden loss. Through backpropagation, the cumulative loss is minimized to improve learning efficiency; the gradient of the student model on the $i$th client is derived from $\mathcal{L}_{total}$ with respect to $\theta_i^s$, where $\theta_i^s$ denotes the parameter set of the student model. The local teacher model of each client is likewise updated with the local gradient obtained from the same loss function. In FL with knowledge-distillation-based model compression, the communication load depends on the size of the transmitted student model. By capitalizing on the properties of knowledge distillation, the method lets the student model conceal the heterogeneity of the local models, which helps to mitigate both device heterogeneity and the statistical heterogeneity of the data.
4.2. Ternary Quantization Module
To reduce communication overhead, a two-level compression strategy is used; this section describes the ternary quantization phase of the local model update, and the local ternary quantization process is shown in Figure 3. For each local client, the distilled student model is trained a second time on labeled data, and during this training the student model’s weights are mapped layer by layer to three discrete values (typically −1, 0, and +1) to simplify computation and storage. By quantizing the model into a ternary-weight network during training, this method significantly reduces model complexity while preserving performance as much as possible, making it particularly suitable for deep learning models.
First, the weights of the student model are normalized. Let $W$ denote the full-precision weight matrix of the student model and $\tilde{W} = g(W)$ the normalized weight matrix, where $g$ is the normalization function that maps the weights into a common range so that the weight distributions of different layers become closer, making the subsequent quantization more stable. On top of normalization, the continuous weights of the student model are discretized into three values (−1, 0, +1) by threshold division, which significantly reduces storage and computation overhead. The thresholds are determined by generating uniformly distributed random numbers based on the sparsity of the weights, where $W_k$ denotes the parameter set of client $k$, $\Delta^{*}$ signifies the adaptive optimal value, $d$ indicates the number of layers, and the threshold coefficient is set to 0.7.
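For orientation, a common way to instantiate the normalization and the adaptive threshold in ternary-weight networks is sketched below; the z-score form of $g$ and the use of the mean absolute weight are assumptions consistent with the 0.7 coefficient above, not the paper’s exact equations:

$$ \tilde{W}_l = g(W_l) = \frac{W_l - \mu_l}{\sigma_l}, \qquad \Delta_l = 0.7\cdot\mathbb{E}\!\left(\left|\tilde{W}_l\right|\right) $$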
Subsequently, layer-by-layer weight quantization is performed, in which quantization accuracy is balanced by adaptive thresholds that depend on the number of layers together with global scaling factors, overcoming the limitations of traditional fixed thresholds.
In this context, $\mathbb{1}[\cdot]$ denotes a step function, the Hadamard product $\odot$ applies a thresholding operation that sets elements to 1 when their absolute values exceed the threshold, $\alpha_l$ is a layer-specific quantization factor that is trained layer by layer together with the local model’s weights, and $\hat{W}_l$ represents the quantized ternary weights. Therefore, $\hat{W}_l$ can be expressed as the concatenation of the positive indicator matrix $I_l^{+}$ and the negative indicator matrix $I_l^{-}$.
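A sketch of how such a decomposition is often written, using the notation above (splitting $\alpha_l$ into separate positive and negative factors is an assumption based on the two quantization factors mentioned in Section 4.3):

$$ I_l^{+} = \mathbb{1}\!\left[\tilde{W}_l > \Delta_l\right], \quad I_l^{-} = \mathbb{1}\!\left[\tilde{W}_l < -\Delta_l\right], \qquad \hat{W}_l = \alpha_l^{+}\, I_l^{+} - \alpha_l^{-}\, I_l^{-} $$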
Upon completing the quantization of the entire network, the loss function is calculated, the error is propagated backward via backpropagation, and the gradient with respect to the latent full-precision model is computed accordingly.
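A common way to realize this backward pass, given here only as an assumed sketch (the straight-through estimator is standard practice for quantized networks but is not named explicitly here), passes the gradient of the quantized weights through to the latent full-precision weights:

$$ \frac{\partial \mathcal{L}}{\partial W_l} \approx \frac{\partial \mathcal{L}}{\partial \hat{W}_l} $$

with the layer-wise factors $\alpha_l^{\pm}$ updated by accumulating the gradients of the quantized weights they scale.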
Quantizing the distilled student model reduces both upstream and downstream communication overhead, and it also enhances privacy preservation in FL, since the low-precision weights make it harder to reverse-engineer sensitive data from the transmitted model.
The local model ternary quantization Algorithm 1 is as follows:
Algorithm 1: Local model ternary quantization.
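The following Python sketch gives one plausible reading of the per-layer ternary quantization described above; the function name `ternary_quantize_layer`, the z-score normalization, and the 0.7 coefficient applied to the mean absolute weight are illustrative assumptions, not the authors’ implementation.

```python
import numpy as np

def ternary_quantize_layer(w: np.ndarray, coeff: float = 0.7):
    """Quantize one layer's weights to {-alpha, 0, +alpha} (illustrative sketch)."""
    # Normalize so that the weight distributions of different layers are comparable.
    w_norm = (w - w.mean()) / (w.std() + 1e-8)
    # Adaptive threshold proportional to the mean absolute (normalized) weight.
    delta = coeff * np.abs(w_norm).mean()
    # Positive and negative indicator masks.
    pos = w_norm > delta
    neg = w_norm < -delta
    # Layer-specific scaling factor: mean magnitude of the retained weights.
    kept = np.abs(w_norm[pos | neg])
    alpha = kept.mean() if kept.size > 0 else 0.0
    # Ternary weights take values in {+alpha, 0, -alpha}.
    w_ternary = alpha * pos.astype(w.dtype) - alpha * neg.astype(w.dtype)
    return w_ternary, alpha, delta

# Example: quantize every layer of a (toy) student model.
layers = [np.random.randn(128, 64), np.random.randn(64, 10)]
quantized = [ternary_quantize_layer(w) for w in layers]
```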
4.3. FedDT
The FedDT method uses a multi-round local update strategy that combines the two modules, knowledge distillation and ternary quantization. The primary aim is to boost the global model’s accuracy through successive training iterations, while compressing the model during training minimizes storage consumption. The client-side local update process of the FedDT method is illustrated in Figure 4.
The FedDT-specific training process is as follows:
In the context of FL, the server first initializes a global model $w^{g}$ with random weights before the training process begins. Clients subsequently download this model from the server and utilize it as their local student model $w^{s}_{k}$. Acknowledging the potential association between client data distributions and model parameters, this study introduces a personalized teacher model $w^{t}_{k}$, which is pre-trained for each client on its labeled private data so as to match the specific patterns of its local data distribution.
(1) Client-Side Local Model Distillation Training. This study employs three distinct loss functions—task loss (Equation (4)), distillation loss (Equation (8)), and hidden loss (Equation (10))—to facilitate reciprocal knowledge transfer between the teacher and student models during training. Dynamic weight allocation is performed throughout the knowledge distillation process, with the weights adjusted based on the intensity of either the task loss or the distillation loss to enhance student model training. Furthermore, an adaptive hidden loss function is introduced to enable the student model to acquire knowledge from the teacher model’s hidden states (Ht) and attention heatmaps (At).
(2) Local Model Quantization on the Client Side. To further decrease communication overhead and improve model accuracy, the locally labeled data are employed for retraining the student model. Normalization techniques are applied to the student model’s weights during training. On top of normalization, the weights are quantized from full-precision floating-point numbers to a ternary representation (−1, 0, 1) with reduced bit-width, compressing the student model into an adaptive ternary quantized weight network.
(3) Small Model Uploading by the Client. Each client participating in the current round of training uploads its quantized local model to the server.
(4) The central server aggregates the compressed models. During each communication cycle, the server converts the uploaded quantized local models into continuous counterparts and fuses them into the global model.
(5) The server quantizes the global model using the threshold $\Delta$ and the two quantization factors $\alpha^{+}$ and $\alpha^{-}$, as defined in Equations (11) and (12). After completing the aggregation process, the server propagates the quantized global model to the clients.
(6) Clients obtain the global model. Each participating client in the training phase downloads the quantized global model to perform local model updates. This cycle repeats until the student model and the global model converge.
Algorithm 2 shows the detailed process of knowledge distillation and ternary quantization.
Algorithm 2: FedDT.
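To make the round structure of steps (1)–(6) concrete, the toy Python sketch below walks through one simplified FedDT-style training loop; the models are plain weight vectors, the “distillation” step is reduced to a pull toward the teacher, and all names are placeholders, so this is an illustrative sketch rather than the authors’ Algorithm 2.

```python
import numpy as np

rng = np.random.default_rng(0)

def ternary(w, coeff=0.7):
    """Per-layer ternary quantization (same illustrative rule as the Section 4.2 sketch)."""
    delta = coeff * np.abs(w).mean()
    pos, neg = w > delta, w < -delta
    alpha = np.abs(w[pos | neg]).mean() if (pos | neg).any() else 0.0
    return alpha * pos - alpha * neg

def feddt_sketch(num_clients=4, dim=32, rounds=3, lr=0.1):
    """Toy end-to-end loop mirroring steps (1)-(6); simplified purely for illustration."""
    global_w = rng.normal(size=dim)                               # server initializes the global model
    teachers = [rng.normal(size=dim) for _ in range(num_clients)] # personalized teacher per client
    for _ in range(rounds):
        uploads = []
        for t_w in teachers:
            student = global_w.copy()                # client downloads the global model as its student
            for _ in range(5):                       # (1) local distillation epochs
                student -= lr * (student - t_w)      #     stand-in for the task + KD + hidden losses
            uploads.append(ternary(student))         # (2)-(3) quantize the student and upload it
        global_w = np.mean(uploads, axis=0)          # (4) server fuses the (dequantized) uploads
        global_w = ternary(global_w)                 # (5) server quantizes the global model
    return global_w                                  # (6) clients download it in the next round

print(feddt_sketch())
```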
7. Conclusions
This paper proposes a communication-efficient FL framework, named FedDT, which synergizes knowledge distillation and ternary compression techniques. Specifically, we design client-specific heterogeneous teacher models to reduce the negative impact of non-IID data, thereby improving both model performance and generalization capability. To facilitate knowledge transfer across diverse local models, we introduce a shared student model initialized by the server, enabling consistent knowledge aggregation. During local updates, each client distills personalized knowledge from its teacher model into the homogeneous student model, effectively eliminating cross-client knowledge transfer barriers. Furthermore, we propose a two-level compression strategy that combines adaptive knowledge distillation with ternary quantization to minimize communication overhead. By replacing continuous weight parameters with discrete ternary values, the student model’s size is significantly reduced, thereby lowering communication costs without sacrificing performance. Experimental results demonstrate that, compared to baseline methods on the MNIST and Cifar10 datasets, FedDT achieved significant performance gains in two critical metrics: model accuracy and communication efficiency. Notably, under high-data-heterogeneity scenarios, the multi-round local update strategy substantially improves overall system performance. The proposed approach effectively reduces communication costs while providing a lightweight solution for FL in distributed large-scale data environments.
Although the FedDT method significantly cuts communication overhead while safeguarding model accuracy, it introduces additional computational demands on client devices. Systems applying FL need to consider overall performance; in future work, we will dynamically adjust the model training strategy according to each client’s data distribution, computational power, and network state. Furthermore, we aim to explore more efficient compression algorithms to enhance model generalization in FL, especially under complex scenarios.