Article

Heterogeneous Federated Learning via Knowledge Transfer Guided by Global Pseudo Proxy Data

1 Information Center, Dalian Party Institute of Communist Party of China, No. 75 Binhai West Road, Xigang District, Dalian 116016, China
2 School of Computer Science and Artificial Intelligence, Liaoning Normal University, No. 1 Liushu South Street, Ganjingzi District, Dalian 116081, China
* Author to whom correspondence should be addressed.
Future Internet 2026, 18(1), 36; https://doi.org/10.3390/fi18010036
Submission received: 12 December 2025 / Revised: 6 January 2026 / Accepted: 7 January 2026 / Published: 8 January 2026

Abstract

Federated learning with data-free knowledge distillation enables effective and privacy-preserving knowledge aggregation by employing generators to produce local pseudo samples during client-side model migration. However, in practical applications, data distributions across different institutions are often non-independent and identically distributed (Non-IID), which introduces bias in local models and consequently impedes the effective transfer of knowledge to the global model. In addition, insufficient local training can further exacerbate model bias, undermining overall performance. To address these challenges, we propose a heterogeneous federated learning framework that enhances knowledge transfer through guidance from global proxy data. Specifically, a noise filter is incorporated into the training of local generators to mitigate the negative impact of low-quality pseudo proxy samples on local knowledge distillation. Furthermore, a global generator is introduced to produce global pseudo proxy samples, which, together with local pseudo proxy data, are used to construct a cross-attention matrix. This design effectively alleviates overfitting and underfitting issues in local models caused by data heterogeneity. Extensive experiments on publicly available datasets with heterogeneous data distributions demonstrate the superiority of the proposed framework. Results show that when the Dirichlet distribution coefficient is 0.05, our method achieves an average accuracy improvement of 5.77% over popular baselines; when the coefficient is 0.1, the improvement reaches 6.54%. Even under uniformly distributed sample classes, our model still achieves an average accuracy improvement of 7.07% compared to other methods.

1. Introduction

With the rapid advancements in the internet, mobile devices, and high-performance computing, artificial intelligence (AI) has become a powerful tool across various domains. However, concerns about data ownership, privacy, and cybersecurity pose significant barriers to sharing intelligent models, particularly in sensitive areas. To address these issues, Google introduced Federated Learning (FL) in 2016 [1], enabling distributed model training without directly accessing raw data. This approach allows decentralized clients to collaboratively train a global model and has been widely applied in healthcare, biometrics, and natural language processing [2,3]. FL can be categorized into horizontal and vertical types based on the consistency of local model structures. In horizontal FL, client models with the same architecture are trained on local datasets and aggregated using weighted averaging to form a global model. This practical and scalable approach has gained traction in industry [4]. Nevertheless, FL still faces challenges arising from disparities in data collection, preprocessing, and model construction across clients. A critical issue, particularly in industrial settings, is the non-IID nature of user data, which leads to locally biased models and degrades global model performance and generalization. Local data heterogeneity remains a major obstacle to achieving high-accuracy federated models [5,6].
Knowledge distillation techniques have therefore been proposed to transfer knowledge by having the student model mimic the teacher model’s output distribution, reducing the impact of data heterogeneity without altering model parameters [7]. In conventional FL with knowledge distillation, clients act as teacher models, transmitting soft predictions to the server, which aggregates them to update the global model and distills the knowledge back to clients. However, this approach relies on high-quality proxy datasets, which conflicts with the FL principles of data privacy and local data isolation. Furthermore, the aggregated global knowledge may not effectively guide client-side training. A major challenge is achieving knowledge aggregation without accessible proxy datasets. To overcome this, recent work has introduced data-free knowledge distillation for FL, enabling knowledge transfer between clients and the server without requiring actual data, thus enhancing privacy [8]. The first architecture for data-free distillation, DAFL, was proposed by Chen et al. in 2019 [8], using a generative adversarial network (GAN) to synthesize data. Following this, Pan et al. developed Meta-KD in 2020 [9], integrating meta-learning to optimize the data generation distribution. These efforts sparked a surge in research on replacing real datasets with generated proxy data in FL. Zhu et al. proposed FedGen [10], combining data-free distillation with FL by training a generator at the server based on client predictions, which is then broadcast to clients to assist local model updates. While FedGen advances data-free distillation, it remains vulnerable to noise in generated pseudo data, which can hinder knowledge transfer and degrade global model performance. In addition, in scenarios with significant local data heterogeneity, reliance on local labels can introduce model bias, limiting global to local knowledge transfer.
To address these limitations, we propose a novel federated learning framework for heterogeneous data, FedKDG (Federated Knowledge Distillation with Global Pseudo Data Guidance). This framework leverages the global model’s knowledge to guide local model training, reducing the impact of data heterogeneity on local model bias. First, the global server uses real local labels as auxiliary data to train a generator that integrates knowledge from multiple clients. Clients then receive the global model and generator to train local generators, incorporating filters to suppress noisy generated samples. A cross-attention matrix is constructed between global and local synthetic samples to enhance the diversity and effectiveness of pseudo data in global-to-local distillation. Finally, a global distillation loss function is introduced, utilizing local real sample labels to further facilitate knowledge transfer from the global model.
The main contributions of this paper are summarized as follows:
(1)
Global-to-local knowledge transfer to mitigate bias induced by data heterogeneity: We propose a global knowledge-guided local model optimization module that effectively transfers knowledge from the global model to local models using global pseudo-data, thereby addressing the classification bias caused by data heterogeneity.
(2)
Noise-filtered generation for robust pseudo-data construction: We design an optimization and filtering mechanism for pseudo-data generation, which mitigates the negative impact of noisy samples and ensures the fidelity of transferred knowledge.
(3)
Extensive empirical validation under heterogeneous settings: We validate the proposed approach on widely used benchmark datasets and demonstrate superior performance in terms of federated classification accuracy compared to state-of-the-art models under non-IID data distributions.
The remainder of this paper is organized as follows: Section 2 reviews and analyzes classic and widely adopted related work. Section 3 describes the three components of the proposed FedKDG framework. Section 4 presents the comparative experimental results and the analysis of the FedKDG algorithm parameters. Section 5 concludes the paper.

2. Related Work

Federated Learning (FL) trains local models on client devices and aggregates their outputs to form a global model. Classical models, like Federated Averaging (FedAvg) [2], focus on weight-level aggregation but face weight divergence issues. PFNM [11] addresses this by aligning client parameters via weight matching and Bayesian nonparametrics for improved global integration. FedMA [12] enhances aggregation through CNN-LSTM combinations and alignment mechanisms. Recent approaches explore feature-level and hybrid aggregation. FedBE [13] integrates Bayesian ensembles with FL to aggregate high-quality models, while Fed2 [14] adapts feature structures and employs model fusion to improve convergence by capturing data distributions.

2.1. Heterogeneous Federated Learning

While existing methods can effectively aggregate local models into a global model, data heterogeneity arising from diverse edge devices poses significant challenges to knowledge transfer. Heterogeneity in federated learning can be categorized into data, model, and system heterogeneity, with data heterogeneity being the most extensively addressed. Early solutions combined multi-task learning with Bayesian frameworks. For instance, MOCHA [3] applied alternating optimization for multi-task FL but is limited to convex models. To handle non-convex settings, VIRTUAL [15] introduced a hierarchical Bayesian network-based approach. Jiang et al. [16] integrated meta-learning to optimize local learning rates and server-side optimizers for personalized models. Astraea [17] improved model discriminability through data augmentation but relies on access to client data distributions, increasing vulnerability to security risks such as backdoor attacks. With the growing prominence of network attacks, achieving stable and secure knowledge aggregation has become an emerging and critical challenge in federated learning system design [18,19].
To balance global collaboration and local personalization, regularization and contrastive learning approaches have been explored. Acar et al. proposed Personalized Federated Learning (PFL) [20], which personalizes local training via gradient correction but is prone to catastrophic forgetting. Shoham et al. introduced FedCurv [21], leveraging Elastic Weight Consolidation (EWC) to preserve learned knowledge. Li et al. proposed FedProx [22], which adds a proximal term to FedAvg to improve convergence. Dinh et al. presented pFedMe [23], using the Moreau envelope to decouple personalized and global model optimization. Li et al. further introduced MOON [24], employing model contrastive learning to align local training via representation similarity. Building upon this, Mu et al. proposed FedProc [25], applying prototype-based contrastive learning to further mitigate heterogeneity. However, transmitting unprocessed outputs from client models poses privacy risks. To address this, Yoon et al. proposed FedMix [26], employing Mixup-based data augmentation to enhance privacy and data diversity. Xu et al. [27] used K-means clustering to filter malicious client data based on label trustworthiness.

2.2. Federated Learning with Knowledge Distillation

Previous approaches typically adjust model parameters to mitigate heterogeneity, but this can lead to knowledge loss or noise. Knowledge distillation (KD) has gained attention as an effective solution for knowledge transfer [7]. Sattler et al. proposed CFD [28] to reduce communication overhead using KD. He et al. introduced FedGKT [29], which transfers knowledge from lightweight CNNs on edge devices to larger global CNNs, easing computational load. Fang et al. proposed RHFL [30], using KD on unrelated public data, though it may not fully leverage local predictions. Huang et al. [31] introduced a latent embedding module with adversarial learning to distinguish private and public domains for better domain adaptation. To address data heterogeneity, Li et al. developed FedMD [32], combining KD with transfer learning to derive global knowledge from local predictions. Chang et al. proposed Cronus [33] to improve FL accuracy on public datasets, while Ozkara et al. [34] integrated KD to enhance training efficiency across heterogeneous data sources.
Despite progress, traditional KD methods rely heavily on the availability of high-quality proxy datasets, which may be impractical in real-world scenarios. To address this, Lin et al. proposed FedDF [35], an ensemble distillation framework that accelerates convergence via entropy minimization. Zhu et al. introduced FedGen [10], eliminating the need for real data by training a generator on client predictions. Zhang et al. developed FedFTG [36], a conditional generator-based approach to minimize global-local loss for effective distillation. DENSE [37], a two-stage data-free FL method, enables server-side training without data exchange. To tackle heterogeneity, Heinbaugh et al. proposed FedCVAE-Ens and FedCVAE-KD [38], using conditional variational autoencoders (CVAE) to reconstruct local tasks. Wu et al. introduced FedDKC [39], reducing inter-client knowledge variance through refinement strategies. However, many methods face catastrophic forgetting during iterative knowledge transfer. Zhang et al. proposed TARGET [40], mitigating forgetting by simulating global data transfer from past tasks. Zhu et al. introduced FedTAD [41], a topology-aware data-free KD framework ensuring consistency in knowledge transfer, while Wang et al. presented DFRD [42], using an exponential moving average of generators to counter forgetting and preserve client knowledge.
Although promising, these methods still rely on generator-based pseudo-data, which can cause instability and hinder collaboration. Zhao et al. proposed FedF2DG [43], a generator-independent pseudo-data strategy that adaptively adjusts label distributions to reduce bias. However, it faces challenges with noisy pseudo-data and semantic mismatches. Additionally, most methods treat the server as the sole student model, overlooking model heterogeneity across clients.

3. Methodology

Existing data-free knowledge distillation methods in federated learning are constrained by the quality of the generator, which limits effective knowledge transfer. In addition, pseudo proxy datasets mainly focus on aggregating knowledge from local models into a global model, while the potential for the global model to support local updates is largely overlooked. This limitation reduces the ability to mitigate issues arising from Non-IID and heterogeneous data distributions. To address these challenges, we propose a heterogeneous federated learning framework that leverages globally generated pseudo proxy data to guide knowledge transfer. By training a global generator, the server produces pseudo-proxy data to facilitate global to local knowledge migration, enabling the server to play an active role that goes beyond simple parameter aggregation.
Our approach enables collaborative optimization between global and local models, eliminates reliance on real proxy data, and enhances knowledge transfer. It also reduces local model bias and mitigates global performance degradation caused by data heterogeneity. The model architecture is shown in Figure 1.

3.1. Problem Definition

The proposed federated learning framework consists of K clients, indexed by $k = 1, 2, \ldots, K$. Each client k holds a local dataset $D_k = \{(x_i, y_i)\}_{i=1}^{n_k}$, where $x_i \in \mathcal{X} \subseteq \mathbb{R}^d$, $y_i \in \mathcal{Y} = \{0, 1, \ldots, C\}$ is the class label for a classification task, and $n_k$ is the number of samples held by client k. Given the heterogeneous nature of local data distributions in this setting, each client's label distribution follows a non-IID distribution $\hat{p}(y_k) \sim \mathrm{Dir}(\alpha)$, where $\alpha$ controls the sparsity of the Dirichlet distribution. The global data distribution is denoted by $p_g(x, y)$, and the heterogeneity across clients can be formally expressed as $\forall k \neq j,\ \hat{p}(y_k) \neq \hat{p}(y_j)$ and $\hat{p}(y_k) \neq \hat{p}(y_g)$.
In each communication round, the central server receives local model parameters $\theta_k$ and the true labels Y of client data. A global generator $G_g$ is then trained to synthesize pseudo samples, which are used to facilitate knowledge transfer across clients. This results in an updated global server state and a globally aggregated model that captures rich shared information. Subsequently, the server distributes the updated global model parameters $\theta_g$ and the generator $G_g$ to all clients. These are used to refine the local model parameters $\theta_k$, which are biased due to training on heterogeneous data. The federated learning process then proceeds iteratively, gradually optimizing toward a global model with improved generalization.
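To make the round structure concrete, the following is a minimal, illustrative sketch of one FedKDG communication round as described above. It is not the authors' implementation; every function and attribute name (e.g., train_global_generator, optimize_local_model_with_global_knowledge) is a hypothetical placeholder for the steps detailed in Sections 3.2–3.4.

```python
# Illustrative pseudocode of one FedKDG communication round (names are assumptions).

def fedkdg_round(server, clients):
    # 1. Server: train the global generator G_g from uploaded parameters and true
    #    label distributions (only labels are shared, never raw data), then
    #    synthesize the global pseudo proxy set D_g (Section 3.2).
    server.train_global_generator(
        thetas=[c.theta for c in clients],
        label_dists=[c.label_distribution for c in clients])
    D_g = server.generate_pseudo_proxy_samples()

    # 2. Broadcast global model parameters, generator, and pseudo data;
    #    local models are initialized with theta_k^t = theta_g^t.
    for c in clients:
        c.receive(server.theta_g, server.G_g, D_g)
        c.theta = server.theta_g

    # 3. Clients: train noise-filtered local generators, build the global-local
    #    cross-attention samples, and optimize with the combined loss L_k (Section 3.3).
    for c in clients:
        c.train_local_generator_with_filter()
        c.optimize_local_model_with_global_knowledge()

    # 4. Server: sample-weighted aggregation of the updated local models (Section 3.4).
    server.aggregate(thetas=[c.theta for c in clients],
                     weights=[c.num_samples for c in clients])
```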

3.2. Localized Personalized Knowledge Transfer

Due to data privacy constraints, the global server cannot access the local data distributions used for training client models. To enable effective knowledge transfer from the global model to local models, we adopt a data-free knowledge distillation approach based on the DAFL framework. Specifically, a global generator $G_g$ corresponding to the global model is constructed and optimized iteratively on the server to produce pseudo-proxy samples with refined feature distributions, denoted as $D_g = \{d_g^1, d_g^2, \ldots, d_g^n\}$. These pseudo samples serve as the basis for knowledge distillation between the global model $M_g$ and each client model $M_k$, facilitating the extraction and transfer of global knowledge. To further enhance sample quality and reduce redundancy, a noise filter $F_g$ is introduced to eliminate irrelevant or noisy information during the generation process. To ensure the effectiveness of the generator $G_g$, the global server is allowed to observe the label distributions of the training samples used by each client. Importantly, only the labels Y are shared, not the raw data $D_k$, thereby preserving privacy while enabling the server to condition the generator on diverse local label distributions. The global knowledge transfer process is illustrated in Figure 2.
First, a random noise vector $z_g \sim \mathcal{N}(0, I)$ is introduced as the input to the generator, and the distribution of the true labels $p(Y)$ from users' local data is used as supervision to train the global generator $G_g$. This phase adopts the DAFL training paradigm to produce a pseudo sample set $D_g = G_g(z_g, p(Y))$. To ensure diversity in the generated pseudo samples $D_g$, a diversity loss function $L_{div}^{g}$ is introduced. By minimizing the distributional discrepancy among pseudo samples generated from different noise inputs, this loss promotes sample diversity while preventing excessive discriminative bias. The formulation is given as follows:
$L_{div}^{g} = \mathbb{E}_{z_g^1, z_g^2}\left[ \left\| G_g(z_g^1, p(Y)) - G_g(z_g^2, p(Y)) \right\|_2^2 \right]$
where $z_g^1$ and $z_g^2$ denote different noise inputs. For each client k, the pseudo samples $D_g$ generated by the global generator $G_g$ are fed into the local model for training. A cross-entropy loss is then constructed between the local model's predictions $p_{\theta_k}$ and the client's true labels $y_i$, denoted as $L_{ce}$. In practice, each client's cross-entropy loss is further weighted by its sample proportion to account for data imbalance. The overall cross-entropy loss $L_{ce}^{g}$ is defined as the weighted sum of the individual client losses, as formulated below:
$L_{ce}^{g} = \sum_{k=1}^{K} \frac{n_k}{N} \, \mathbb{E}_{D_g, y_i}\left[ -\log p_{\theta_k}(y_i \mid D_g) \right]$
The global generator $G_g$ is trained using a combined loss function that incorporates both the diversity loss $L_{div}^{g}$ and the cross-entropy loss $L_{ce}^{g}$. The total loss function $L_{G_g}$ is defined as follows:
$L_{G_g} = L_{div}^{g} + L_{ce}^{g}$
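For concreteness, a minimal PyTorch-style sketch of this combined generator objective is given below. It is not the authors' implementation: it assumes a conditional generator interface G_g(z, y), treats the frozen client models as fixed critics, and the batch size, latent dimension, and all names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def global_generator_loss(G_g, client_models, label_dist, batch_size=64, z_dim=100):
    """Sketch of L_Gg = L_div^g + L_ce^g for one generator update.

    G_g           -- assumed conditional generator taking (noise, labels)
    client_models -- list of (model_k, n_k_over_N) pairs; models are kept frozen here
    label_dist    -- 1-D tensor with the shared class distribution p(Y)
    """
    y = torch.multinomial(label_dist, batch_size, replacement=True)  # labels ~ p(Y)
    z1 = torch.randn(batch_size, z_dim)
    z2 = torch.randn(batch_size, z_dim)
    x1, x2 = G_g(z1, y), G_g(z2, y)

    # Diversity term: squared L2 distance between samples from different noise inputs.
    loss_div = (x1 - x2).flatten(1).pow(2).sum(dim=1).mean()

    # Sample-weighted cross-entropy term over the client models.
    loss_ce = sum(w_k * F.cross_entropy(model_k(x1), y)
                  for model_k, w_k in client_models)

    return loss_div + loss_ce
```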
Second, since the pseudo samples $D_g$ generated by the global generator $G_g$ may contain substantial noise, a pseudo sample filter $F_g$ is introduced to ensure the effectiveness of the generated data. This selector filters out noisy or irrelevant information, making the retained pseudo samples $D_g' = \{d_g'^1, d_g'^2, \ldots, d_g'^n\}$ more representative of the clients' real data $D_k$. The selection process is formulated as follows:
$F_g = \frac{f(D_g) - \min f(D_g)}{\max f(D_g) - \min f(D_g)}$
where $f(\cdot)$ is the gradient function of the model, and $f(D_g) = \{f(d_g^1), f(d_g^2), \ldots, f(d_g^n)\}$. Simultaneously, the global generator $G_g$ is trained as follows:
$L_{ce}^{g} = \sum_{k=1}^{K} \frac{n_k}{N} \, \mathbb{E}_{D_g', y_i}\left[ -\log p_{\theta_k}(y_i \mid D_g') \right]$
where $D_g' = F_g^{T} \cdot D_g^{T}$. At this stage, the generator training loss can be expressed as:
$L_{G_g} = L_{div}^{g} + L_{ce}^{g}$
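The sketch below illustrates one plausible reading of this filtering step, under stated assumptions: the unspecified gradient function f(·) is taken to be the norm of the loss gradient with respect to each pseudo sample, and $D_g' = F_g^{T} \cdot D_g^{T}$ is interpreted as re-weighting each sample's contribution to the cross-entropy term. All names are illustrative.

```python
import torch
import torch.nn.functional as F

def min_max_normalize(scores, eps=1e-8):
    # F_g = (f(D_g) - min f(D_g)) / (max f(D_g) - min f(D_g))
    return (scores - scores.min()) / (scores.max() - scores.min() + eps)

def filtered_cross_entropy(model_k, pseudo_x, pseudo_y):
    """Sketch of the noise-filtered CE term; the gradient-norm score is an assumption."""
    # Score pass on a detached copy: f(d_g^i) = norm of the loss gradient w.r.t. the sample.
    x_probe = pseudo_x.detach().clone().requires_grad_(True)
    probe_ce = F.cross_entropy(model_k(x_probe), pseudo_y, reduction="none")
    grads = torch.autograd.grad(probe_ce.sum(), x_probe)[0]
    weights = min_max_normalize(grads.flatten(1).norm(dim=1))   # F_g values in [0, 1]

    # Weighted cross-entropy on the original samples (gradients can still reach G_g).
    per_sample_ce = F.cross_entropy(model_k(pseudo_x), pseudo_y, reduction="none")
    return (weights.detach() * per_sample_ce).mean()
```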
Finally, in the current communication round, the server broadcasts the trained global generator $G_g$, the global model parameters $\theta_g$, and the generated proxy dataset $D_g$ to each local client k. The global model parameters $\theta_g$ are directly assigned to initialize the local models, serving as the starting point for local training: $\theta_k^t = \theta_g^t$.

3.3. Local Model Optimization Guided by Global Knowledge

Each client k receives the global parameters $\theta_g$ along with the trained global generator $G_g$. To ensure effective extraction and transfer of global knowledge, a local pseudo sample set $D_l = \{d_l^1, d_l^2, \ldots, d_l^n\}$ is generated by training a local generator $G_l$ using client-specific data. This local pseudo dataset, together with the global pseudo samples generated by $G_g$, is used to construct a cross-attention matrix that optimizes the distribution of global pseudo samples and facilitates global knowledge transfer. Simultaneously, a noise filter $F_l$ is incorporated during the local generator training to remove redundant or noisy information, further refining the distribution of the locally generated pseudo dataset. The detailed process is illustrated in Figure 3 and described as follows:
(1) Global Knowledge Transfer to Mitigate Local Data Heterogeneity.
First, a local generator $G_l$ is constructed for each client. Random noise vectors $z_l \sim \mathcal{N}(0, I)$ and the distribution of the client's local true labels $p(Y)$ are fed into the local generator for training, resulting in a local pseudo sample set $D_l = G_l(z_l, p(Y))$. To ensure diversity in the generated samples $D_l$, a diversity loss function $L_{div}^{l}$ is introduced, defined as follows:
$L_{div}^{l} = \mathbb{E}_{z_l^1, z_l^2}\left[ \left\| G_l(z_l^1, p(Y)) - G_l(z_l^2, p(Y)) \right\|_2^2 \right]$
where $z_l^1$ and $z_l^2$ represent different initial noise vectors. The pseudo samples $D_l$ generated by the local generator $G_l$ are fed into the local model for training, and a cross-entropy loss $L_{ce}^{l}$ is constructed between the local model's predictions $p_{\theta_k}$ and the user's true labels $y_k$, as formulated below:
$L_{ce}^{l} = \mathbb{E}_{D_l, y_k}\left[ -\log p_{\theta_k}(y_k \mid D_l) \right]$
The local generator $G_l$ is trained using a combined loss function that integrates the diversity loss $L_{div}^{l}$ and the cross-entropy loss $L_{ce}^{l}$. The total loss function $L_{G_l}$ is defined as follows:
$L_{G_l} = L_{div}^{l} + L_{ce}^{l}$
Since the pseudo samples $D_l$ generated by the local generator $G_l$ contain substantial noise, a pseudo sample filter $F_l$ is introduced to ensure the authenticity of the generated samples. This selector filters out noisy information, making the pseudo samples $D_l' = \{d_l'^1, d_l'^2, \ldots, d_l'^n\}$ generated by $G_l$ more representative of the user's real data, where $D_l' = F_l^{T} \cdot D_l^{T}$. The local generator $G_l$ is retrained as follows:
$L_{ce}^{l} = \mathbb{E}_{D_l', y_k}\left[ -\log p_{\theta_k}(y_k \mid D_l') \right]$
At this stage, the generator training loss can be expressed as:
$L_{G_l} = L_{div}^{l} + L_{ce}^{l}$
After training, the local generator $G_l$ produces a local pseudo sample set $D_l'$, which is used to facilitate knowledge transfer from the global model to the local model. Second, to enable global knowledge transfer and mitigate bias caused by local data heterogeneity during local model training, the local model inputs a batch of the user's true label distribution $p(y_b)$ into the global generator $G_g$ to obtain pseudo samples $P_i = G_g(z_i, p(y_b))$. These pseudo samples $P_i$ are then fed into the trained local model. A KL divergence loss is constructed between the local model's predictions $\ddot{p}_{\theta_k}$ and the global model's predictions $p_{\theta_g}$ on the generated pseudo samples. This loss $L_{lat}$ serves as a latent constraint for training the local model, guiding it with knowledge extracted from the global model on the server side via the generated pseudo samples. The formulation is as follows:
$L_{lat} = \frac{1}{B} \sum_{i=1}^{B} D_{KL}\left( \ddot{p}_{\theta_k}(\cdot \mid P_i) \,\|\, p_{\theta_g}(\cdot \mid P_i) \right)$
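A minimal PyTorch sketch of this latent constraint is shown below. It follows the KL direction in the equation above (local distribution first); the epsilon stabilizer and function names are implementation assumptions rather than details from the paper.

```python
import torch
import torch.nn.functional as F

def latent_kl_loss(local_model, global_model, pseudo_batch, eps=1e-8):
    """Sketch of L_lat: mean KL(p_local || p_global) over global pseudo samples P_i."""
    p_local = F.softmax(local_model(pseudo_batch), dim=1)
    with torch.no_grad():                       # the global (teacher) side stays fixed
        p_global = F.softmax(global_model(pseudo_batch), dim=1)
    kl = (p_local * (torch.log(p_local + eps) - torch.log(p_global + eps))).sum(dim=1)
    return kl.mean()
```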
Simultaneously, to adapt the global model to the local data distribution, each user samples from their true label set $y_i$ and inputs the resulting label distribution $p(y_i)$ into the global generator $G_g$ to generate pseudo samples $P_j = G_g(z_j, p(y_i))$. These pseudo samples $P_j$ are then used to train the local model by constructing a cross-entropy loss $L_{tea}$, which measures the discrepancy between the local model's predictions $p_{\theta_k}$ and the user's true labels $y_i$. The formulation is as follows:
$L_{tea} = -\sum_{c=1}^{C} p(y_i = c) \log p_{\theta_k}(y_i = c \mid P_j)$
We enhance the personalized representation of global pseudo samples at the local level by leveraging cross-attention between globally and locally generated pseudo samples, thereby facilitating global knowledge transfer. Specifically, the true label distribution $p(y_b)$ of each user batch is projected onto a uniform distribution $p(y_a)$. A batch of labels $y_a^b$ sampled from this uniform distribution is then input separately into both the global generator $G_g$ and the local generator $G_l$, producing global pseudo sample outputs $P_g = G_g(y_a^b)$ and local pseudo sample outputs $P_l = G_l(y_a^b)$, respectively. A global-local cross-attention matrix is then computed as $A = (W_l \cdot P_l)^{T} (W_g \cdot P_g)$, where $W_l$ and $W_g$ are linear projection matrices used to align the dimensions of the output matrices. The normalized matrix is then obtained as:
$Nor = \mathrm{softmax}\!\left( \frac{A}{\sqrt{d}} \right)$
where the Softmax function is defined as $\mathrm{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{N} e^{x_j}}$. The normalized matrix $Nor$ is used as the final attention matrix, which is multiplied by the output of the global generator $P_g$ and combined through a residual connection. The resulting output $P_A = Nor \cdot P_g + P_g$ is then fed into the prediction model. Inspired by the attention mechanism, this design assigns attention weights to the outputs of the generative models, enabling dynamic focus on critical global information while disregarding less relevant parts. This helps the attended samples closely approximate the global generator's outputs, thereby enhancing the accuracy and efficiency of the prediction model.
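The following is a minimal PyTorch sketch of this global-local cross-attention step under stated assumptions: the pseudo samples are flattened to vectors, d is taken to be the shared projection dimension used for scaling, and the class name, dimensions, and module layout are illustrative rather than the authors' implementation.

```python
import math
import torch
import torch.nn as nn

class GlobalLocalCrossAttention(nn.Module):
    """Sketch of the global-local cross-attention described above: local and global
    pseudo samples are linearly projected, their scaled dot-product scores are
    softmax-normalized, and the result re-weights the global samples with a
    residual connection, P_A = Nor . P_g + P_g."""

    def __init__(self, local_dim, global_dim, proj_dim):
        super().__init__()
        self.W_l = nn.Linear(local_dim, proj_dim, bias=False)   # projection of P_l
        self.W_g = nn.Linear(global_dim, proj_dim, bias=False)  # projection of P_g
        self.d = proj_dim

    def forward(self, P_l, P_g):
        # P_l: (B, local_dim) flattened local pseudo samples
        # P_g: (B, global_dim) flattened global pseudo samples
        A = self.W_l(P_l) @ self.W_g(P_g).transpose(0, 1)       # (B, B) attention scores
        Nor = torch.softmax(A / math.sqrt(self.d), dim=-1)      # normalized matrix
        return Nor @ P_g + P_g                                   # residual output P_A
```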
Finally, the samples $P_A$ are input into the global model and the local model, respectively, for training. The server acts as the teacher model, while client k acts as the student model. A KL divergence loss is constructed between the local model's predictions $\tilde{p}_{\theta_k}$ and the global model's predictions $p_{\theta_g}$, serving as the distillation loss for training the local model. This allows the powerful global model to guide the training of the local model. The formulation is as follows:
$L_{dis}^{g} = D_{KL}\left( \tilde{p}_{\theta_k}(\cdot \mid P_A) \,\|\, p_{\theta_g}(\cdot \mid P_A) \right)$
(2) Construction of Personalized Loss for the Local Model. Each client inputs a batch of real local data $x_b$ and the corresponding true labels $y_i$ into local model k for training. The negative log-likelihood (NLL) loss of the local model's predictions $\dot{p}_{\theta_k}$ on the true labels serves as the prediction loss $L_{pre}$, which enhances both the training speed and stability of the model. The loss is defined as follows:
$L_{pre} = -\frac{1}{B} \sum_{i=1}^{B} \log \dot{p}_{\theta_k}(y_i \mid x_b)$
where B denotes the batch size.
Meanwhile, a batch of the user's real data $x_b$ is simultaneously input into both the global model and the local model for training. In this process, the server functions as the teacher model, and the client serves as the student model. A KL divergence loss is constructed between the prediction outputs of the local model $p_{\theta_k}$ and the global model $p_{\theta_g}$, serving as the distillation loss for training the local model. This facilitates knowledge transfer from the server to the client using each batch of real user data for auxiliary training. The formulation is as follows:
$L_{dis} = \frac{1}{B} \sum_{j=1}^{B} D_{KL}\left( p_{\theta_k}(\cdot \mid x_b) \,\|\, p_{\theta_g}(\cdot \mid x_b) \right)$
(3) Unified Training of the Local Model. To obtain a personalized local model that mitigates the effects of data heterogeneity, we employ a combination of the prediction loss $L_{pre}$, latent constraint loss $L_{lat}$, cross-entropy loss $L_{tea}$, generative distillation loss $L_{dis}^{g}$, and distillation loss $L_{dis}$. These components are integrated into a final total loss function $L_k$ used to train local model k, as formulated below:
$L_k = L_{pre} + L_{lat} + L_{tea} + L_{dis}^{g} + L_{dis}$
The overall loss function in this work is composed of multiple components, as shown in the above equation. During practical implementation of the algorithm, it is necessary to ensure that all loss components are maintained at a comparable scale, so as to prevent the training process from being dominated by any single loss term.
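As an illustration, the total objective with scaling factors might look like the sketch below. The explicit weights are assumptions rather than part of the formulation above (which uses an unweighted sum); the 0.01 values on the two distillation terms mirror the weights mentioned later in the ablation study, and the remaining weights are placeholders.

```python
def total_local_loss(l_pre, l_lat, l_tea, l_dis_g, l_dis,
                     w_lat=1.0, w_tea=1.0, w_dis_g=0.01, w_dis=0.01):
    """Sketch of L_k with scaling factors to keep the loss terms at comparable scales."""
    return l_pre + w_lat * l_lat + w_tea * l_tea + w_dis_g * l_dis_g + w_dis * l_dis
```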

3.4. Global Aggregation of Local Models

Each local user shares their true label distribution $y_i$ and the optimized local model parameters $\theta_k^{t+1}$ with the server S. The global model parameters $\theta_g$ on the server are updated by performing a weighted average of all local models, where the weights are determined by the proportion of samples held by each user k. The update rule is defined as follows:
$\theta_g^{t+1} = \sum_{k=1}^{K} \frac{n_k}{N} \theta_k^{t+1}$
where $N = \sum_{k=1}^{K} n_k$.
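A minimal PyTorch sketch of this sample-weighted aggregation is given below; it assumes all clients share the same architecture (ResNet-18 in the experiments), and the function name is illustrative.

```python
import torch

def aggregate_global_model(global_model, local_state_dicts, num_samples):
    """Sketch of theta_g^{t+1} = sum_k (n_k / N) * theta_k^{t+1}."""
    N = float(sum(num_samples))
    averaged = {}
    for key in local_state_dicts[0]:
        # Weighted sum of each parameter/buffer; integer buffers are cast back on load.
        averaged[key] = sum((n / N) * state[key].float()
                            for state, n in zip(local_state_dicts, num_samples))
    global_model.load_state_dict(averaged)
    return global_model
```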
Meanwhile, if any local model is updated, knowledge is transferred from the local model to the global model using the data-free distillation method described in Section 3.2. This enables the iterative process illustrated in Figure 1.

4. Experiments

4.1. Dataset Description

(1) 
Benchmark Datasets
To evaluate the effectiveness of the proposed model, four benchmark datasets widely used in the field of federated learning are selected: MNIST (Modified National Institute of Standards and Technology), EMNIST (Extended MNIST), CIFAR-10 (Canadian Institute for Advanced Research), and CIFAR-100. These datasets are representative in terms of data complexity, application scenarios, and experimental design. All experimental datasets employed in this paper are classic public datasets.
a. MNIST. MNIST is a benchmark dataset for handwritten digit recognition, containing 60,000 training images and 10,000 test images. Each 28 × 28 grayscale image represents one of 10 digit classes (0–9). At about 170 MB, MNIST is ideal for quickly validating federated learning algorithms. It is often used to simulate non-IID scenarios through random partitioning or label-specific allocation.
b. EMNIST. EMNIST extends MNIST by adding 62 classes of handwritten characters, including digits and both uppercase and lowercase letters. It contains 814,255 samples, with the same 28 × 28 grayscale image format as MNIST. The visual similarity between certain letter classes, such as uppercase ‘C’ and lowercase ‘c’, increases classification difficulty, offering a more challenging task for evaluating model robustness.
c. CIFAR-10. CIFAR-10 contains 60,000 32 × 32 RGB images across 10 object categories, with a 5:1 train-test split. It introduces greater variability in pose, background, and lighting, increasing challenges for feature extraction. As a result, complex models like Convolutional Neural Networks (CNNs) are often necessary for effective learning.
d. CIFAR-100. CIFAR-100, released by the Canadian Institute for Advanced Research in 2009, contains 100 fine-grained object categories, including animals, plants, and vehicles. It comprises 60,000 RGB images (32 × 32 pixels), with 50,000 for training (500 per class) and 10,000 for testing (100 per class). The 100 classes are grouped into 20 superclasses (e.g., mammals, flowers), enabling multi-granularity classification tasks. Table 1 summarizes the roles of these datasets in federated learning research.
(2) 
Dirichlet-based Non-IID Datasets
In federated learning, data distribution modeling and partitioning are crucial for model performance. The Dirichlet distribution, a multivariate probability distribution, is widely used to simulate non-IID scenarios due to its ability to capture uncertainty in probability vectors. The concentration parameter α controls the distribution skewness: a small α (e.g., α → 0) results in high heterogeneity, while a larger α (e.g., α ≥ 1) yields a more uniform distribution, approximating IID.
In this study, we evaluate the impact of three different Dirichlet concentration parameters α on the partitioning of the MNIST, EMNIST, CIFAR-10, and CIFAR-100 datasets among 20 users. The comparative experimental results under these settings are illustrated in Figure 4, Figure 5 and Figure 6, which visualize the label distributions across users for varying α values. In each figure, subfigure (a) corresponds to a small value (α = 0.05), subfigure (b) to a moderate value (α = 0.1), and subfigure (c) to a large value (α = 1), showing the progressive reduction in data heterogeneity. The horizontal axis represents user IDs (1–20), and the vertical axis denotes label IDs (0–9 for MNIST and CIFAR-10, 0–25 for EMNIST, where a 26-class subset is used). The size of each dot indicates the quantity of samples a user holds for a particular label: the larger the dot, the more samples assigned. Overall, when α is small (α = 0.05), the data distribution is extremely non-IID: some users may hold hundreds of times more samples for a given label than others, while some labels may be entirely absent for certain users. This high degree of heterogeneity poses a significant challenge for federated learning. When α is larger (e.g., α = 1), the heterogeneity is mitigated to some extent, though the distribution still deviates from a fully IID scenario.
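For reference, a minimal sketch of class-wise Dirichlet partitioning is shown below. The exact procedure used in the paper is not specified, so this follows the common approach of drawing one Dir(α) proportion vector per class over the 20 users; the function and parameter names are illustrative.

```python
import numpy as np

def dirichlet_partition(labels, num_clients=20, alpha=0.1, seed=0):
    """Sketch of Dirichlet-based non-IID partitioning: for each class, sample a
    proportion vector over clients from Dir(alpha) and split that class's sample
    indices accordingly. Smaller alpha gives more skewed (heterogeneous) splits."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        # Cumulative proportions give the split points inside this class's index list.
        split_points = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client_id, part in enumerate(np.split(idx, split_points)):
            client_indices[client_id].extend(part.tolist())
    return client_indices
```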
In all experiments, 65% of each dataset was allocated as the training set, and 35% as the testing set, to be partitioned among users under the different Dirichlet settings.

4.2. Experimental Setting and Evaluation Metrics

The experimental environment and parameter configuration in this study strictly adhere to the latest research standards in the field of federated learning, ensuring reproducibility and scientific rigor. The hardware platform is equipped with an NVIDIA GeForce RTX 4070 GPU, which provides substantial acceleration for parallel computation in deep learning. The operating system is Windows 10, and PyTorch 1.10 is adopted as the deep learning framework, offering flexible support for distributed training in federated learning. GPU acceleration is enabled through CUDA and cuDNN. The source code for the proposed model has been made publicly available at: https://github.com/LNNU-computer-research-526/FedKDG, accessed on 11 December 2025.
In this study, the ResNet-18 architecture serves as the base model for network construction. A total of 20 clients participate in the federated training process. Detailed training parameters are presented in Table 1.
To fairly evaluate the effectiveness of the proposed model, classification accuracy is employed as the primary evaluation metric for performance assessment and comparative analysis. Classification accuracy is a commonly used metric that measures a model’s ability to correctly classify samples. It is defined as the ratio of the number of correctly predicted samples to the total number of samples in the test dataset. This can be expressed by the following formula:
$\mathrm{accuracy} = \frac{TP + TN}{TP + FP + FN + TN}$
where TP (true positive) denotes samples predicted as positive that are actually positive, FP (false positive) denotes samples predicted as positive that are actually negative, FN (false negative) denotes samples predicted as negative that are actually positive, and TN (true negative) denotes samples predicted as negative that are actually negative.
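In the multi-class setting this reduces to the fraction of correctly classified test samples, as in the evaluation sketch below; the loader and device arguments are illustrative assumptions.

```python
import torch

def test_accuracy(model, test_loader, device="cpu"):
    """Sketch of the accuracy metric: correctly classified samples / total test samples."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for x, y in test_loader:
            pred = model(x.to(device)).argmax(dim=1)
            correct += (pred == y.to(device)).sum().item()
            total += y.numel()
    return correct / total
```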

4.3. Comparative Analysis of Experimental Results

To evaluate the effectiveness of the proposed federated learning method FedKDG, five representative and widely adopted federated learning approaches (FedAvg, FedProx, FedDistill, FedEnsemble, and FedGen) are selected for comparative analysis. Table 2 summarizes the classification accuracy results of FedKDG and the five baseline methods under three different data heterogeneity scenarios on the MNIST dataset. As shown in the comparison, the proposed FedKDG method effectively mitigates local model heterogeneity by incorporating globally balanced distribution information. Even under severe data heterogeneity, FedKDG achieves a classification accuracy of 97.08%, a substantial improvement over the widely used FedAvg approach. Moreover, under scenarios with moderate and low levels of local heterogeneity, FedKDG maintains high accuracy rates of 98.83% and 99.27%, respectively. These results highlight the model's strong ability to overcome the challenges posed by data heterogeneity, outperforming other popular federated learning methods.
This study also reports comparative experimental results on the EMNIST dataset, which features more similar sample characteristics, as shown in Table 3. On this extended version of the MNIST dataset, the proposed method continues to demonstrate strong performance. Under a high degree of data heterogeneity, FedKDG achieves an accuracy of 88.47%. The EMNIST dataset contains a larger number of hard samples with similar feature representations across different classes, posing a greater challenge for model training. As a result, the classification accuracy of the five baseline methods drops significantly. In contrast, the proposed method, which incorporates global knowledge from the global model, effectively distinguishes between different classes.
Under moderate heterogeneity, FedKDG reaches an accuracy of 93.00%, outperforming previous methods by nearly 20 percentage points. Even in scenarios with balanced data distribution, where all models show improved performance, FedKDG still achieves the highest accuracy of 94.61%. These results demonstrate that the proposed method not only overcomes the challenges of severe data heterogeneity but also performs competitively on datasets with a larger number of classes.
To thoroughly evaluate and analyze the effectiveness of the proposed model, this study compares the accuracy of federated learning models under three different data heterogeneity scenarios on the color image dataset CIFAR-10, as shown in Table 4. As observed in the table, FedKDG achieves accuracy rates of 45.85% and 51.21% under two heterogeneous settings, both outperforming other popular methods. Furthermore, under the third scenario, FedKDG attains an accuracy of 64.48%, demonstrating superior capability in federated knowledge transfer.
In addition, we validate and compare the effectiveness of the proposed model on the more challenging CIFAR-100 dataset, which includes a larger number of categories, as shown in Table 5. The results demonstrate that under a Dirichlet coefficient of 0.05, FedKDG achieves a classification accuracy of 19.76%, outperforming other models in terms of recognition accuracy. When the coefficient increases to 0.1, FedKDG attains an accuracy of 22.42%, and under the more balanced setting with a coefficient of 1, the accuracy improves significantly to 28.04%, representing an average improvement of approximately 7% over baseline methods.
These findings indicate that FedKDG exhibits strong robustness in multi-channel color image classification tasks. Even in the presence of highly non-IID data, the model maintains stable knowledge transfer capability and generalization performance.
Based on the overall statistical results, FedKDG demonstrates outstanding performance across all datasets. Specifically, it achieves an average accuracy of 98.34% under three levels of data heterogeneity on the MNIST dataset, 91.98% on the EMNIST dataset, 53.74% on the CIFAR-10 dataset, and 23.41% on the CIFAR-100 dataset. FedKDG consistently delivers satisfactory results across all evaluated scenarios.
To facilitate intuitive observation and analysis, convergence curves under various heterogeneity settings across different datasets are plotted. Figure 7 illustrates the performance of FedKDG and the reproduced results of the five baseline federated learning methods under varying degrees of heterogeneity for each dataset. As the training rounds progress, the figures clearly show the trend in prediction accuracy for each method. Notably, FedKDG quickly surpasses the accuracy of all other methods under every heterogeneity setting on the MNIST dataset. Moreover, all models converge stably to their respective optimal values.
For the EMNIST dataset, the proposed model still converges rapidly to its optimal performance, as illustrated in Figure 8. Moreover, compared to other methods, FedKDG demonstrates a clear advantage in test accuracy under highly heterogeneous conditions.
CIFAR-10 is a more challenging color image dataset compared to the others. Under conditions of high heterogeneity, the proposed model’s performance is initially limited by the quality of the generator and does not rapidly reach the optimal results. However, as the number of training rounds increases, the model effectively mitigates noise through the pseudo-data information filtering mechanism and addresses data heterogeneity via global knowledge transfer. As shown in Figure 9, the model ultimately achieves the best experimental results.
To further analyze the federated learning performance of the proposed model on heterogeneous datasets, we examined the number of classes per client in the EMNIST and CIFAR-10 datasets, as illustrated in Figure 10 and Figure 11. As shown in Figure 10, despite EMNIST containing a total of 26 classes, most clients have training data encompassing fewer than 10 classes. Even under such highly heterogeneous training conditions, the proposed model achieves higher accuracy compared to the popular FedGen model, as demonstrated in Figure 10. This result highlights the superior generalization ability and robustness of our approach. Similarly, although CIFAR-10 contains only 10 classes, its significant heterogeneity results in most clients having local training data comprising only 3 to 4 classes, as depicted in Figure 11. Under these circumstances, our model still attains better federated learning accuracy, achieving up to 74% accuracy even for clients with training data from only 2 classes, as shown in Figure 11.
Therefore, the proposed model demonstrates satisfactory accuracy, generalization, and robustness in heterogeneous data environments compared to existing popular federated learning methods.
Based on the above, in the domain of federated learning with heterogeneous data, the proposed FedKDG method demonstrates significant improvements over other federated learning approaches in terms of both accuracy and convergence performance. This indicates that the involvement of the global model in the training process plays a substantive role, reflecting the strength of the global model and its effective aggregation of beneficial knowledge for client models. The use of dual generators combined with a global model in a data-free knowledge distillation federated learning framework clearly offers greater advantages compared to traditional federated learning methods. This suggests that deep convolutional neural networks can effectively capture hidden, highly nonlinear relationships, and when paired with a knowledge-rich global model, both accuracy and convergence can be further enhanced.

4.4. Analysis of Module Effectiveness

Based on the experimental and comparative results presented in the previous section, the proposed model achieves superior recognition accuracy. This improvement fundamentally stems from the local model training process, where the cross-attention module integrates global information while preserving local personalized features, thereby enhancing local recognition performance. To better illustrate this, we visualize the feature distributions of samples generated by the local and global generators, as shown in Figure 12, Figure 13 and Figure 14. In each figure, panel (a) depicts the feature visualization of samples generated by the global generator, showing diverse distributions that represent multiple classes with relatively balanced class distributions. Panel (b) shows the feature visualization of samples generated by the local generator, where the feature distribution is highly imbalanced. Panel (c) presents the feature distribution of samples after global feature correction. It is evident that these samples not only retain the personalized representations of the local model but also exhibit a more balanced class distribution and reduced heterogeneity. Using such samples for knowledge distillation helps to mitigate local model bias and improve recognition accuracy.
Furthermore, from Figure 12, Figure 13 and Figure 14, it can be observed that under different levels of data heterogeneity, the proposed method effectively alleviates local information heterogeneity bias. Even when local heterogeneity is pronounced, the model optimizes local information through the global model, generating pseudo-samples containing rich class information. These pseudo-samples are then used in subsequent knowledge distillation, enhancing the local model’s accuracy and generalization ability.

4.5. Ablation Study

To further validate the effectiveness of FedKDG in heterogeneous federated learning, this section conducts ablation experiments under the heterogeneity setting of the CIFAR-10 dataset. The goal is to verify the indispensability of each component in the proposed method. The experimental results are summarized in Table 6.
(1)
Train the local model solely based on the prediction loss between the local model’s output and the true labels from a single batch of user data.
(2)
Building on experiment (1), a global generator is trained on the server side. The final total loss for training the local model includes both the cross-entropy loss between the local model’s predictions and all true user labels, and a latent loss measuring the discrepancy between the local model’s predictions on real user data and those on pseudo-samples generated by the global generator. No pseudo-sample filter is applied here; the global generator attempts to produce pseudo-samples that approximate the user’s real data as closely as possible.
(3)
Based on experiment (2), a pseudo-sample filter is added to assist the training of the global generator. This filter removes noisy information from the generated samples, resulting in more realistic pseudo-samples.
(4)
Building on experiment (3), knowledge distillation from the global model to the local model is incorporated during local training. Specifically, the KL divergence between the local model’s and global model’s predictions is added as a distillation loss with a weighting factor of 0.01, enabling knowledge transfer between the server and clients.
(5)
Extending experiment (4), a local generator is randomly assigned to one user for training, analogous to the global generator. A pseudo-sample filter is also applied to the local generator to remove noise. During local model training, an attention mechanism is introduced to focus the generator’s output on key global information. Additionally, a KL divergence loss between the local model’s and global model’s predictions is constructed as a generative distillation loss with a weight of 0.01. This enables the powerful global model to guide the training of the local model, with knowledge distillation from experiment (4) serving as auxiliary supervision.

5. Conclusions

To address the challenges of local model training bias and ineffective knowledge transfer in federated learning under non-independent and identically distributed (Non-IID) data environments, this paper proposes a novel federated learning framework based on data-free knowledge distillation. During the transfer of local models to the global model, a noise sample filtering mechanism is introduced to mitigate the negative impact of low-quality pseudo proxy samples generated by local generators. Meanwhile, during the global to local model transfer phase, a global pseudo proxy supervision module is constructed to alleviate overfitting and underfitting issues caused by data heterogeneity across clients, thereby enhancing model robustness and generalization.
Extensive experiments under varying degrees of data heterogeneity demonstrate that the proposed method outperforms mainstream federated learning baselines in task accuracy. However, the current framework has the following limitations: (1) it is designed specifically for knowledge aggregation and federated learning of local models in image classification tasks and has not been extended to local models in natural language processing or time series prediction; (2) it primarily addresses federated learning with non-balanced local data distributions, without fully considering heterogeneity in local model architectures; (3) its application assumes a secure network environment without attacks and does not adequately account for model performance and robustness under network attacks. Based on these limitations, future research will focus on addressing the above challenges. In particular, for the third point, we plan to leverage low-quality samples in the generated pseudo-proxy dataset, constructing them as attack datasets to train the global model, thereby enhancing its stability and robustness.

Author Contributions

W.S. conducted the model architecture design and implemented the key techniques; X.G. carried out the core implementation and model validation; W.L. performed model validation and drafted the initial version of the manuscript; F.S. revised the manuscript and provided overall supervision and technical support. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by National Natural Science Foundation of China Youth Fund (No. 61902165), Liaoning Provincial Department of Education Fund (No. 203070132279, No. JYTMS2023104).

Data Availability Statement

The source code for the proposed model has been made publicly available at: https://github.com/LNNU-computer-research-526/FedKDG (accessed on 11 December 2025).

Acknowledgments

This work is in part supported by National Natural Science Foundation of China Youth Fund (No. 61902165), Liaoning Provincial Department of Education Fund (No. 203070132279, No. JYTMS20231040).

Conflicts of Interest

Author Wenhao Sun was employed by Dalian Party Institute of CPC. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Shrivastava, A. Privacy-Centric AI: Navigating the Landscape with Federated Learning. Int. J. Res. Appl. Sci. Eng. Technol. (IJRASET) 2024, 12, 357–363. [Google Scholar] [CrossRef]
  2. McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; y Arcas, B.A. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS) 2017, Fort Lauderdale, FL, USA, 20–22 April 2017; PMLR. pp. 1273–1282. [Google Scholar]
  3. Smith, V.; Chiang, C.K.; Sanjabi, M.; Talwalkar, A.S. Federated multi-task learning. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
  4. Lim, W.Y.B.; Luong, N.C.; Hoang, D.T.; Jiao, Y.; Liang, Y.C.; Yang, Q.; Niyato, D.; Miao, C. Federated learning in mobile edge networks: A comprehensive survey. IEEE Commun. Surv. Tutor. 2020, 22, 2031–2063. [Google Scholar] [CrossRef]
  5. Wang, L. Heterogeneous data and big data analytics. Autom. Control Inf. Sci. 2017, 3, 8–15. [Google Scholar] [CrossRef]
  6. Li, H.; Reynolds, J.F. On definition and quantification of heterogeneity. Oikos 1995, 73, 280–284. [Google Scholar] [CrossRef]
  7. Huang, W.; Ye, M.; Du, B. Learn from others and be yourself in heterogeneous federated learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10143–10153. [Google Scholar]
  8. Chen, H.; Wang, Y.; Xu, C.; Yang, Z.; Liu, C.; Shi, B.; Xu, C.; Xu, C.; Tian, Q. Data-free learning of student networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3514–3522. [Google Scholar]
  9. Pan, H.; Wang, C.; Qiu, M.; Zhang, Y.; Li, Y.; Huang, J. Meta-KD: A meta knowledge distillation framework for language model compression across domains. arXiv 2020, arXiv:2012.01266. [Google Scholar]
  10. Zhu, Z.; Hong, J.; Zhou, J. Data-free knowledge distillation for heterogeneous federated learning. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; PMLR. pp. 12878–12889. [Google Scholar]
Figure 1. Proposed model framework.
Figure 2. Process of personalized knowledge transfer from local models to the global model.
Figure 3. Process of knowledge transfer from the global model to local models under the supervision of global pseudo samples.
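Figure 3 depicts distillation from the global model to the local models on generated pseudo samples. The sketch below illustrates one generic way such a step can be written, assuming a PyTorch-style setup; the function distill_on_pseudo_samples, its arguments, and the plain KL objective are illustrative simplifications, not the paper's actual loss.

```python
import torch
import torch.nn.functional as F

def distill_on_pseudo_samples(student, teacher, generator, batch_size=32,
                              noise_dim=100, temperature=2.0, device="cpu"):
    """Illustrative global-to-local distillation step: the local (student)
    model matches the global (teacher) model's soft predictions on pseudo
    samples drawn from a generator. Names and arguments are illustrative,
    not taken from the paper's implementation."""
    z = torch.randn(batch_size, noise_dim, device=device)
    pseudo = generator(z)                      # pseudo proxy samples
    with torch.no_grad():
        teacher_logits = teacher(pseudo)
    student_logits = student(pseudo)
    # KL divergence between softened teacher and student distributions.
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean",
    ) * temperature ** 2
    return loss
```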
Figure 4. Visualization of the Dirichlet-distributed training label allocation across clients on the MNIST dataset.
Figure 5. Visualization of the Dirichlet-distributed training label allocation across clients on the EMNIST dataset.
Figure 6. Visualization of the Dirichlet-distributed training label allocation across clients on the CIFAR-10 dataset.
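The label-skewed partitions visualized in Figures 4–6 are of the kind produced by sampling per-class client proportions from a Dirichlet prior, where smaller α yields stronger heterogeneity. The snippet below is a minimal sketch of such a partition; the function dirichlet_partition and its arguments are illustrative, not taken from the authors' code.

```python
import numpy as np

def dirichlet_partition(labels, num_clients=20, alpha=0.1, seed=0):
    """Minimal sketch: split sample indices across clients with a
    Dirichlet(alpha) label skew. Smaller alpha -> stronger heterogeneity."""
    rng = np.random.default_rng(seed)
    num_classes = int(labels.max()) + 1
    client_indices = [[] for _ in range(num_clients)]
    for c in range(num_classes):
        idx_c = np.where(labels == c)[0]
        rng.shuffle(idx_c)
        # Proportion of class c assigned to each client.
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        # Convert proportions into split points over the class-c samples.
        split_points = (np.cumsum(proportions)[:-1] * len(idx_c)).astype(int)
        for client_id, part in enumerate(np.split(idx_c, split_points)):
            client_indices[client_id].extend(part.tolist())
    return [np.array(ix) for ix in client_indices]

# Example: inspect how skewed the partition is for alpha = 0.05.
labels = np.random.randint(0, 10, size=60000)  # stand-in for MNIST labels
parts = dirichlet_partition(labels, num_clients=20, alpha=0.05)
print([len(p) for p in parts])
```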
Figure 7. Convergence comparison curves for different levels of heterogeneity on the MNIST dataset.
Figure 8. Convergence comparison curves for different levels of heterogeneity on the EMNIST dataset.
Figure 9. Convergence comparison curves for different levels of heterogeneity on the CIFAR-10 dataset.
Figure 10. Comparison of client category distributions and accuracy on the EMNIST dataset.
Figure 11. Comparison of client category distributions and accuracy on the CIFAR-10 dataset.
Figure 12. Feature representations of generated samples on the MNIST dataset with α = 0.05. (a) Feature visualization of samples produced by the global generator. (b) Feature visualization of samples produced by the local generators. (c) Feature distribution of samples after global feature correction.
Figure 13. Feature representations of generated samples on the MNIST dataset with α = 0.1. (a) Feature visualization of samples produced by the global generator. (b) Feature visualization of samples produced by the local generators. (c) Feature distribution of samples after global feature correction.
Figure 14. Feature representations of generated samples on the MNIST dataset with α = 1. (a) Feature visualization of samples produced by the global generator. (b) Feature visualization of samples produced by the local generators. (c) Feature distribution of samples after global feature correction.
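Figures 12–14 show 2-D renderings of the feature space of generated samples. The article does not specify the projection method used; the sketch below uses t-SNE purely as an illustrative stand-in (plot_feature_embedding, the random stand-in features, and the choice of TSNE are assumptions, not the authors' procedure).

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_feature_embedding(features, labels, title):
    """Project high-dimensional features to 2-D with t-SNE and scatter-plot
    them colored by class. `features` is (N, D), `labels` is (N,)."""
    embedded = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
    plt.figure(figsize=(4, 4))
    plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, s=5, cmap="tab10")
    plt.title(title)
    plt.axis("off")
    plt.show()

# Example with random stand-in features; in practice these would be
# generator outputs passed through the classifier's feature extractor.
feats = np.random.randn(500, 64).astype(np.float32)
labs = np.random.randint(0, 10, size=500)
plot_feature_embedding(feats, labs, "Generated-sample features (illustrative)")
```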
Table 1. Configuration of key parameters.

Parameter Name                 Value
Number of Clients              20
Global Training Rounds         200
Local Training Rounds          20
Local Training Batch Size      32
Local Training Learning Rate   0.01
Generator Batch Size           32
Backbone Network               ResNet-18
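For convenience, the settings in Table 1 can be gathered into a single configuration object; the sketch below is one hypothetical way to express them in code (the dataclass FedKDGConfig and its field names are illustrative, not taken from the authors' implementation).

```python
from dataclasses import dataclass

@dataclass
class FedKDGConfig:
    """Hypothetical container for the key hyperparameters listed in Table 1."""
    num_clients: int = 20
    global_rounds: int = 200
    local_epochs: int = 20
    local_batch_size: int = 32
    local_lr: float = 0.01
    generator_batch_size: int = 32
    backbone: str = "ResNet-18"

config = FedKDGConfig()
print(config)
```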
Table 2. Accuracy comparison of mainstream federated learning methods on the MNIST dataset under different data heterogeneity (bold for the best result).

Method         α = 0.05    α = 0.1    α = 1
FedAvg         88.43%      90.31%     94.73%
FedProx        87.94%      90.21%     94.55%
FedDistill     60.29%      60.89%     80.06%
FedEnsemble    89.35%      91.50%     94.70%
FedGen         94.52%      95.52%     97.29%
FedKDG         97.08%      98.83%     99.27%
Table 3. Accuracy comparison of mainstream federated learning methods on the EMNIST dataset under different data heterogeneity (bold for the best result).

Method         α = 0.05    α = 0.1    α = 1
FedAvg         66.89%      70.46%     78.53%
FedProx        65.93%      69.67%     77.71%
FedDistill     40.52%      45.15%     60.26%
FedEnsemble    67.33%      70.69%     78.65%
FedGen         75.06%      78.53%     84.20%
FedKDG         88.47%      93.00%     94.61%
Table 4. Accuracy comparison of mainstream federated learning methods on the CIFAR-10 dataset under different data heterogeneity (bold for the best result).

Method         α = 0.05    α = 0.1    α = 1
FedAvg         33.29%      39.85%     48.46%
FedProx        33.20%      39.38%     48.05%
FedDistill     41.79%      33.19%     28.84%
FedEnsemble    36.81%      42.13%     49.30%
FedGen         40.89%      45.29%     53.42%
FedKDG         45.85%      51.21%     64.48%
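One convenient way to read Table 4 is the margin of FedKDG over the strongest baseline at each heterogeneity level. The short script below computes these gaps (in percentage points) directly from the accuracies reported in the table.

```python
# Margins of FedKDG over the strongest baseline per heterogeneity level,
# computed from the accuracies reported in Table 4 (CIFAR-10).
table4 = {
    "FedAvg":      [33.29, 39.85, 48.46],
    "FedProx":     [33.20, 39.38, 48.05],
    "FedDistill":  [41.79, 33.19, 28.84],
    "FedEnsemble": [36.81, 42.13, 49.30],
    "FedGen":      [40.89, 45.29, 53.42],
}
fedkdg = [45.85, 51.21, 64.48]
alphas = [0.05, 0.1, 1]

for i, alpha in enumerate(alphas):
    best_baseline = max(table4, key=lambda m: table4[m][i])
    gap = fedkdg[i] - table4[best_baseline][i]
    print(f"alpha={alpha}: best baseline {best_baseline} "
          f"({table4[best_baseline][i]:.2f}%), FedKDG gain {gap:.2f} pp")
```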
Table 5. Accuracy comparison of mainstream federated learning methods on the CIFAR-100 dataset under different data heterogeneity (bold for the best result).

Method         α = 0.05    α = 0.1    α = 1
FedAvg         14.35%      17.56%     21.07%
FedProx        14.31%      17.35%     20.89%
FedDistill     18.01%      14.62%     12.54%
FedEnsemble    15.87%      18.56%     21.43%
FedGen         17.63%      19.95%     23.23%
FedKDG         19.76%      22.42%     28.04%
Table 6. Ablation study results on the CIFAR-10 dataset.

Ex     Global Generator    Filter    Global Distillation    Local Distillation    Accuracy
(1)                                                                               33.29%
(2)                                                                               41.64%
(3)                                                                               41.88%
(4)                                                                               41.50%
(5)                                                                               45.85%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
