pFedZKD: A One-Shot Personalized Federated Learning Framework via Evolutionary Architecture Search and Data-Free Distillation

Yan, Jiaqi; Yang, Xuan; Wang, Desheng; Xu, Yonggang; Hua, Gang

doi:10.3390/app16083878


by Jiaqi Yan ¹, Xuan Yang ², Desheng Wang ³, Yonggang Xu ¹ and Gang Hua ¹,*

¹ School of Information and Control Engineering, China University of Mining, No. 1 Daxue Road, Xuzhou 221116, China

² School of Internet of Things Engineering, Wuxi University, No. 333 Xishan Avenue, Wuxi 214105, China

³ School of Electronic Information Engineering, Huaiyin Institute of Technology, Faculty of Electronic and Information Engineering, No. 1 Meicheng East Road, Huai’an 223003, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(8), 3878; https://doi.org/10.3390/app16083878

Submission received: 28 February 2026 / Revised: 7 April 2026 / Accepted: 13 April 2026 / Published: 16 April 2026

(This article belongs to the Section Computing and Artificial Intelligence)


Abstract

Personalized federated learning (PFL) faces significant challenges in resource-constrained edge environments, where strict communication budgets and severe system heterogeneity must be jointly addressed. Although one-shot federated learning reduces communication overhead, existing methods typically impose unified model architectures or rely on coarse manual selection strategies, limiting their adaptability to highly heterogeneous data distributions and restricting personalized representation capability. To overcome these limitations, we propose Personalized Federated Zero-shot Knowledge Distillation (pFedZKD), a data-free one-shot federated learning framework designed for structurally heterogeneous scenarios. The framework follows a decouple-and-reconstruct collaborative paradigm. On the client side (decoupling stage), we introduce Particle Swarm Optimization-based Federated Neural Architecture Search (PSO-FedNAS), a gradient-free neural architecture search method that enables each client to autonomously discover a customized convolutional architecture aligned with its local data distribution, eliminating the need for architectural consistency across clients. On the server side (reconstruction stage), to address parameter-space incompatibility caused by structural heterogeneity, we develop an architecture-agnostic multi-teacher zero-shot knowledge distillation mechanism (Multi-ZSKD). This method synthesizes pseudo-samples in latent space to extract semantic consensus from heterogeneous client models and transfers the aggregated knowledge to a unified global student model without accessing real data. The entire collaborative process is completed within a single communication round, substantially reducing communication cost while enhancing privacy preservation. 
Extensive experiments on MNIST, FashionMNIST, SVHN, and CIFAR-10 under heterogeneous data settings demonstrate that pFedZKD consistently achieves superior personalization accuracy, global generalization performance, and communication efficiency compared with state-of-the-art PFL methods.

Keywords:

personalized federated learning; neural architecture search; swarm intelligence algorithm; data-free knowledge distillation; model heterogeneity

1. Introduction

The rapid advancement of big data and artificial intelligence technologies has significantly accelerated the deployment of intelligent systems in a wide range of application domains, including healthcare, smart cities, and the Internet of Things (IoT). However, the success of these technologies has long relied on the centralized collection of massive datasets and sustained access to high-performance computing resources. With the increasing enforcement of data privacy regulations and the explosive growth of edge devices, the centralized training paradigm is facing fundamental challenges in terms of privacy protection, regulatory compliance, and system scalability [1].

Federated learning (FL) [2] has emerged as a privacy-preserving distributed learning paradigm that enables collaborative model training without sharing raw data, providing a promising solution to the aforementioned challenges. Nevertheless, in practical edge intelligence systems, the deployment of FL is often constrained by two critical bottlenecks: limited communication bandwidth and severe system heterogeneity. On the one hand, communication frequency is tightly restricted due to constraints on energy consumption, latency, and privacy risks. On the other hand, client devices exhibit substantial heterogeneity in terms of computational capability, storage capacity, and network conditions. For instance, in intelligent healthcare applications, different hospitals are often required to deploy models with diverse architectures due to heterogeneous clinical demands and data governance policies [3]. Similarly, in mobile health monitoring scenarios [4,5], devices range from high-performance smartphones to ultra-low-power wearable sensors, with computational capabilities differing by several orders of magnitude. Forcing such heterogeneous devices to participate in conventional synchronous FL training frequently leads to pronounced straggler effects, where the overall system efficiency is dominated by the slowest participants.

Against this backdrop, Personalized Federated Learning (PFL) has been proposed as an emerging paradigm to deliver client-specific models while maintaining collaborative training across clients [6]. A growing body of research has demonstrated that PFL can effectively mitigate the performance degradation caused by non-independent and identically distributed (non-IID) data by tailoring models to local data distributions [7,8,9]. Despite these advances, most existing PFL methods remain constrained by a form of structural rigidity: they assume a shared backbone architecture across all clients and achieve personalization only through fine-tuning a limited set of client-specific layers. Such partially personalized strategies exhibit limited adaptability in the presence of extreme edge heterogeneity. A unified backbone network is often unable to simultaneously satisfy the resource constraints of both high-end servers and low-power IoT devices, leading to underutilization of powerful devices while forcing resource-constrained clients to drop out due to out-of-memory (OOM) errors or computational infeasibility. This imbalance severely restricts the breadth and depth of global collaboration.

To overcome this structural rigidity and enable truly personalized modeling, clients should be granted architecture autonomy, allowing each device to construct and deploy model architectures intrinsically aligned with its local data characteristics and resource constraints. In this way, personalization is no longer limited to lightweight local adaptation on top of a shared backbone, but can be achieved at the architectural level.

In recent years, Neural Architecture Search (NAS) has demonstrated remarkable capability in automating model design within centralized machine learning frameworks. In particular, evolutionary NAS methods have attracted increasing attention due to their natural suitability for non-convex and non-differentiable search spaces [10,11]. However, directly deploying mainstream gradient-based NAS approaches on edge devices, such as DARTS and its variants [12,13,14], typically incurs substantial computational and memory overhead, which renders them impractical for resource-constrained IoT environments. In contrast, gradient-free evolutionary search methods exhibit greater flexibility and lower computational burden when operating over discrete and non-differentiable architectural spaces [15,16,17]. Among these approaches, Particle Swarm Optimization (PSO) [18] stands out as a classical evolutionary algorithm due to its fast convergence properties and its ability to explore heterogeneous architectural topologies via variable-length encoding strategies [19,20,21].

While NAS techniques have matured considerably in centralized settings, their application in federated learning remains largely underexplored. Existing attempts, such as FedNAS and FedorAS [22,23], incorporate architecture search into federated learning; however, they typically focus on discovering a single global architecture or performing localized search within pre-clustered client groups. Such approaches prioritize architectural commonality and therefore fail to fundamentally overcome structural rigidity, limiting their ability to accommodate the extreme resource heterogeneity inherent in edge environments. Motivated by this gap, we adopt Particle Swarm Optimization (PSO) as a client-side architecture search mechanism and reformulate it as a fully local neural architecture search kernel, enabling each client to autonomously evolve a personalized network architecture without relying on cloud-side search resources.

However, granting clients full architecture autonomy inevitably introduces pronounced structural heterogeneity. Since locally searched models may differ substantially in depth, width, and layer composition, conventional parameter-space aggregation becomes mathematically inapplicable, as direct weighted averaging is ill-defined when parameter dimensions are inconsistent. Therefore, once architecture autonomy is introduced, effective collaboration across clients can no longer rely on parameter alignment, but must instead be achieved at a higher semantic level.

To address this challenge, we introduce a structure-agnostic knowledge aggregation mechanism, namely Multi-Teacher Zero-Shot Knowledge Distillation (Multi-ZSKD). By synthesizing pseudo-samples and shifting the aggregation process from parameter space to semantic representation space, the proposed mechanism enables effective knowledge fusion across heterogeneous client models without accessing any real training data. This design makes cross-architecture collaboration feasible while preserving the privacy advantages of federated learning.

In addition, communication bottlenecks constitute another critical factor that constrains the deployment of federated learning in edge environments. Conventional FL frameworks rely on frequent communication rounds to synchronize model parameters with the server, which not only incurs substantial latency and energy consumption, but also exacerbates network congestion and increases potential privacy leakage risks. To alleviate these issues, the concept of one-shot federated learning (One-Shot FL) has been proposed. However, existing one-shot approaches still exhibit notable limitations in heterogeneous settings. For instance, data distillation-based methods, such as DOSFL [24], require clients to generate and upload compact distilled datasets, which inevitably introduces additional communication overhead and expands the attack surface for privacy leakage. In contrast, data-free approaches, including DENSE [25] and FedCVAE [26], avoid explicit data transmission, but their performance largely depends on the quality and diversity of client models. When teacher models are weak or highly homogeneous, the amount of transferable semantic information during distillation becomes severely limited.

Motivated by the above challenges and analysis, this work proposes Personalized Federated Zero-shot Knowledge Distillation (pFedZKD), a data-free one-shot federated learning framework tailored for structurally heterogeneous scenarios. The framework follows a decouple-and-reconstruct paradigm. On the client side (decoupling stage), we design Particle Swarm Optimization-based Federated Neural Architecture Search (PSO-FedNAS), which enables each client to automatically search for and construct a personalized CNN architecture aligned with its local data distribution and resource constraints. On the server side (reconstruction stage), we develop a structure-agnostic Multi-ZSKD module, which transfers heterogeneous knowledge into a unified global model through pseudo-sample generation and distillation without accessing any real data. The entire process requires only a single communication round, thereby reducing communication cost while improving global generalization under strict privacy constraints.

The main contributions of this work can be summarized as follows:

We propose pFedZKD, a data-free one-shot federated learning framework tailored for structurally heterogeneous scenarios. The proposed framework breaks away from the prevailing reliance on predefined homogeneous model architectures by introducing a decouple-and-reconstruct paradigm. Without requiring structural alignment across clients, pFedZKD enables efficient personalized collaboration while preserving data privacy.
We design PSO-FedNAS, an adaptive federated neural architecture search algorithm based on particle swarm optimization. PSO-FedNAS empowers each client to autonomously evolve a customized convolutional neural network architecture according to its local data distribution. This design enables architectural heterogeneity and personalization without relying on cloud-side computational resources.
We develop a structure-agnostic multi-teacher zero-shot knowledge distillation (Multi-ZSKD) mechanism for global knowledge aggregation. To address parameter incompatibility induced by model heterogeneity, the proposed mechanism elevates the aggregation process from the parameter space to the semantic space through generative pseudo-samples. By exploiting the ensemble knowledge of heterogeneous teachers, Multi-ZSKD enables robust data-free knowledge transfer under a single communication round.
We conduct extensive empirical evaluations on multiple non-IID benchmark datasets. Experimental results demonstrate that, under simultaneous data and model heterogeneity, pFedZKD consistently outperforms state-of-the-art methods in terms of personalized accuracy, communication efficiency, and global generalization capability.

The remainder of this paper is organized as follows. Section 2 reviews the related work on personalized federated learning and federated distillation, and introduces the preliminaries relevant to the proposed method. Section 3 presents the proposed pFedZKD framework in detail. Section 4 reports the experimental setup, baseline comparisons, result analysis, and ablation studies. Finally, Section 5 concludes the paper.

2. Related Work and Preliminaries

This section first reviews recent advances in personalized federated learning (PFL) and federated distillation (FD), two closely related research directions that address the limitations of conventional federated learning under heterogeneous settings. PFL mainly focuses on improving client-specific adaptability and personalization under non-IID data distributions, whereas FD leverages knowledge distillation to reduce communication overhead and relax the model homogeneity constraint. To position the proposed method within the existing literature and provide the necessary technical foundation, we first summarize representative studies from these two research streams and then introduce the preliminaries most relevant to the proposed pFedZKD framework.

2.1. Personalized Federated Learning

Personalized Federated Learning (PFL) extends conventional Federated Learning (FL) by aiming to provide client-specific models while preserving data privacy. It is primarily motivated by the presence of heterogeneous data distributions and diverse system capabilities across clients. To address these challenges, a wide range of PFL methods have been proposed.

Early PFL approaches focus on enhancing personalization under non-IID data through optimization-level or training-strategy modifications. Per-FedAvg [9] introduces meta-learning into FL to improve generalization across heterogeneous clients. FedBN [27] mitigates feature distribution shifts by keeping batch normalization statistics local while aggregating only the remaining model parameters at the server. Ditto [28] maintains a personalized model for each client and introduces a regularization term to control its deviation from the global model, thereby balancing personalization and global consistency. Unlike FedProx [29], which primarily targets system heterogeneity by stabilizing local updates, Ditto explicitly emphasizes fairness and personalization under non-IID data distributions. Despite their effectiveness, these methods generally assume a shared global model architecture and rely on local fine-tuning, which limits structural flexibility.

Another line of work achieves personalization by decoupling shared and client-specific parameters within a unified architecture. LG-FedAvg [30] allows clients to preserve local feature representations while only uploading globally shared parameters for aggregation. FedPer [31] further decomposes the model into shared and personalized layers, where only the shared layers participate in global aggregation. Compared with LG-FedAvg, FedPer provides stronger personalization capability through explicit structural modularization, while still assuming a fixed backbone architecture.

Multi-Task Learning (MTL) has also been widely explored to address statistical heterogeneity by modeling each client as an individual task. MOCHA [32] formulates federated learning as a multi-task optimization problem and jointly learns client models along with a task relationship matrix, enabling effective knowledge sharing under non-IID data. Moreover, MOCHA supports asynchronous optimization, making it suitable for settings with both statistical and system heterogeneity.

Client clustering represents another important direction for personalization. Clustered Federated Learning (CFL) [33] groups clients based on gradient similarity and trains a separate model for each cluster. IFCA [34] adopts an alternating clustering-and-optimization scheme that avoids centralized clustering and supports asynchronous client participation. FedAMP [35] further enhances collaboration among similar clients through attention-based message passing, improving both personalization and global performance in heterogeneous environments.

Despite notable progress, most existing PFL methods still rely on a unified model architecture and lack mechanisms to jointly adapt to data heterogeneity and system-level heterogeneity, such as variations in computational resources and model capacity requirements across clients.

2.2. Federated Distillation

Federated Distillation (FD) has emerged as an effective paradigm to reduce communication overhead and relax the model homogeneity assumption in conventional federated learning. Unlike parameter-based aggregation methods such as FedAvg, which require clients to upload full model parameters or gradients, FD aggregates knowledge at the output level (e.g., logits or soft labels). This design substantially reduces communication costs and naturally supports collaboration among heterogeneous client models.

FedMD [36] is one of the earliest representative FD approaches. It enables heterogeneous clients to collaboratively train their models by exchanging logits on a shared public dataset and performing knowledge distillation, thereby addressing both structural heterogeneity and privacy concerns. This framework laid the foundation for subsequent FD research. FedDF [37] further extends this idea by performing server-side ensemble distillation, aggregating client predictions on unlabeled auxiliary data. FedMKD [38] incorporates multi-teacher adaptive distillation and global anchor alignment to alleviate representation bias caused by model heterogeneity and class imbalance in federated self-supervised learning. Despite their effectiveness, these methods still rely on small public datasets, which may introduce additional privacy risks and limit practical applicability.

To eliminate the dependence on public data, several data-free FD methods have been proposed. FedGen [39] introduces a generator network to capture the consensus knowledge of client models, enabling data-free knowledge distillation while reducing communication costs and privacy leakage. FedMMD [40] enhances knowledge transfer under non-IID data and heterogeneous model settings by integrating multi-teacher distillation with intermediate feature alignment. DENSE [25] advances data-free FD by training a generator to synthesize pseudo-samples, which are then distilled using ensemble client outputs, achieving model heterogeneity support with significantly reduced communication rounds. FedMHO [41] further explores one-shot heterogeneous federated learning, where resource-rich clients upload classification models and resource-constrained clients upload lightweight generative models. The server synthesizes pseudo-data from these uploaded decoders and performs centralized distillation based on the generated samples. In recent years, FedOM [42] and FedLPA [43] have further advanced research on one-shot heterogeneous federated learning. FedOM focuses on enabling knowledge collaboration among heterogeneous clients under a single round of communication, whereas FedLPA places greater emphasis on knowledge alignment and aggregation among heterogeneous local models under one-shot communication constraints. These studies further demonstrate the strong potential of one-shot knowledge transfer for reducing communication costs and supporting model heterogeneity.

Despite the substantial progress made by existing FD methods in reducing communication overhead and supporting heterogeneous models—often without relying on public datasets—most of them still adopt manually predefined and fixed model architectures. In practice, client models are typically selected from standard backbones such as ResNet-18, which, although effective, are not tailored to the diverse data distributions and resource constraints of individual clients. This static design overlooks intrinsic data heterogeneity and requires considerable human effort for architecture selection and tuning, thereby limiting scalability and adaptability in personalized federated learning scenarios.

In contrast to existing PFL and FD methods, pFedZKD jointly addresses structural heterogeneity, data heterogeneity, and communication efficiency by integrating personalized architecture search with data-free multi-teacher distillation in a one-shot setting.

2.3. Preliminaries

2.3.1. Conventional Federated Learning

Federated learning was first introduced by McMahan et al. [2], with FedAvg serving as a representative paradigm for iterative model aggregation. FedAvg updates the global model by performing a weighted average of locally trained models uploaded by clients, thereby enabling collaborative learning while preserving data privacy.

During the federated training process, there are $K$ clients, each holding a private and non-IID local training dataset $D_{k}^{tr}$. The objective of federated learning is to minimize the weighted sum of local empirical risks across all clients, where the weight of each client is proportional to its local data size. This global optimization problem can be formulated as Equation (1).

$$\min_{\omega} F(\omega) := \sum_{k=1}^{K} \frac{|D_{k}^{tr}|}{\sum_{j=1}^{K} |D_{j}^{tr}|} L_{k}(\omega), \qquad L_{k}(\omega) = \frac{1}{|D_{k}^{tr}|} \sum_{(x, y) \in D_{k}^{tr}} l(\omega; x, y). \tag{1}$$

Here, $F(\omega)$ denotes the global empirical loss function over all clients, and $L_{k}(\omega)$ represents the empirical risk of the $k$-th client computed on its local training dataset $D_{k}^{tr}$. The variable $\omega$ denotes the parameters of the global model, and $l(\cdot)$ is the sample-wise loss function. Each local dataset $D_{k}^{tr}$ consists of input–label pairs $(x, y)$, where $x$ denotes the input features and $y$ denotes the corresponding ground-truth labels.

To further illustrate the conventional iterative optimization paradigm in federated learning, the standard FedAvg procedure is summarized in Algorithm 1. Conventional federated learning relies on repeated local training and server-side parameter aggregation across multiple communication rounds. Such a paradigm usually assumes architecture compatibility among participating clients, which differs fundamentally from the one-shot, architecture-heterogeneous setting considered in this work.

Algorithm 1 Federated Averaging (FedAvg)

Input: Private training datasets $\{D_{k}^{tr}\}_{k=1}^{K}$; initial global model $\omega_{0}$; number of clients $K$; communication rounds $R$; local epochs $E$; batch size $B$; client fraction $C$; learning rate $\eta$
Output: Final global model $\omega_{R}$

Server-Side Execution:
for $t = 0$ to $R - 1$ do
    Sample a client subset $S_{t}$ with $|S_{t}| = \max(\lfloor C K \rfloor, 1)$
    for all clients $k \in S_{t}$ in parallel do
        $\omega_{t+1}^{k} \leftarrow \mathrm{ClientUpdate}(k, \omega_{t})$
    end for
    $\omega_{t+1} \leftarrow \sum_{k \in S_{t}} \frac{|D_{k}^{tr}|}{\sum_{j \in S_{t}} |D_{j}^{tr}|} \omega_{t+1}^{k}$
end for

Function ClientUpdate$(k, \omega_{t})$:
    $\omega \leftarrow \omega_{t}$
    for $e = 1$ to $E$ do
        for all mini-batches $b \subset D_{k}^{tr}$ with $|b| = B$ do
            $\omega \leftarrow \omega - \eta \nabla L_{k}(\omega; b)$
        end for
    end for
    return $\omega$
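As a rough illustration of Algorithm 1, the following Python sketch runs FedAvg on a toy problem. The scalar model $y = w x$, the squared loss, and all function names are assumptions made here for illustration; they are not taken from the paper. Only client sampling, local mini-batch SGD, and the data-size-weighted average follow the algorithm.

```python
import math
import random


def sample_clients(num_clients, fraction, rng):
    """Pick max(floor(C * K), 1) distinct client indices, as in Algorithm 1."""
    m = max(math.floor(fraction * num_clients), 1)
    return rng.sample(range(num_clients), m)


def client_update(w, data, epochs, batch_size, lr):
    """Local mini-batch SGD on a toy scalar model y = w * x (illustrative assumption)."""
    for _ in range(epochs):
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of the batch-mean squared error (w*x - y)^2 with respect to w.
            grad = sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * grad
    return w


def aggregate(local_weights, sizes):
    """Average the uploaded local models, weighted by local dataset size."""
    total = sum(sizes)
    return sum(n / total * w for w, n in zip(local_weights, sizes))


def fedavg(datasets, rounds, epochs, batch_size, fraction, lr, seed=0):
    """Run the server loop: sample clients, update locally, aggregate."""
    rng = random.Random(seed)
    w_global = 0.0  # initial global model (hypothetical choice)
    for _ in range(rounds):
        selected = sample_clients(len(datasets), fraction, rng)
        local = [client_update(w_global, datasets[k], epochs, batch_size, lr)
                 for k in selected]
        sizes = [len(datasets[k]) for k in selected]
        w_global = aggregate(local, sizes)
    return w_global
```

The aggregation step only works because every client returns a parameter of the same shape. Once clients hold different architectures, that weighted sum is no longer defined, which is the incompatibility the one-shot setting in this work has to avoid.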

2.3.2. Knowledge Distillation

Knowledge distillation (KD) is a widely used paradigm for model compression and knowledge transfer, aiming to convey the predictive behavior of a high-capacity teacher model to a compact student model [44]. Unlike conventional supervised learning that relies solely on hard labels, KD allows the student to learn from the softened output distribution of the teacher, which encodes informative inter-class relationships. To facilitate this process, a temperature parameter $\tau$ is introduced into the softmax function to control the smoothness of the output distribution. A larger value of $\tau$ produces softer probability assignments, enabling the student to better capture fine-grained class similarities and improving the effectiveness of knowledge transfer.

The distillation loss is typically formulated as the Kullback–Leibler (KL) divergence between the softened output distributions of the teacher and the student:

$$L_{KD} = \frac{1}{N} \sum_{i=1}^{N} D_{KL}\left( \mathrm{softmax}\left(\frac{Te(x_{i}; \theta_{T})}{\tau}\right) \,\Big\|\, \mathrm{softmax}\left(\frac{Stu(x_{i}; \theta_{Stu})}{\tau}\right) \right), \tag{2}$$

where $Te(x_{i}; \theta_{T})$ and $Stu(x_{i}; \theta_{Stu})$ denote the logits produced by the teacher and student models for input $x_{i}$, respectively.
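The distillation loss in Equation (2) can be sketched in a few lines of plain Python. The function names and the small `eps` smoothing constant are assumptions of this sketch. Equation (2) as written carries no $\tau^{2}$ rescaling factor, so none is applied here.

```python
import math


def softmax_with_temperature(logits, tau):
    """Softmax of logits / tau, computed stably by subtracting the maximum."""
    scaled = [z / tau for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0.0)


def distillation_loss(teacher_logits, student_logits, tau):
    """Mean over samples of KL(teacher || student) on temperature-softened outputs."""
    total = 0.0
    for t, s in zip(teacher_logits, student_logits):
        p_teacher = softmax_with_temperature(t, tau)
        p_student = softmax_with_temperature(s, tau)
        total += kl_divergence(p_teacher, p_student)
    return total / len(teacher_logits)
```

Raising `tau` flattens both distributions, so the smaller logits contribute more to the loss; this is the softening effect described above.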

In standard supervised KD settings, the distillation loss can also be combined with cross-entropy loss on labeled data. More generally, this distillation mechanism provides the basis for advanced variants such as multi-teacher distillation and zero-shot knowledge transfer, which are closely related to the server-side multi-teacher zero-shot knowledge aggregation strategy adopted in this work.

2.3.3. Particle Swarm Optimization

Particle Swarm Optimization (PSO) is a population-based heuristic optimization method in which a group of particles collaboratively searches for high-quality solutions in the target space [18]. Each particle represents a candidate solution and updates its search trajectory according to both its own historical best position and the globally best position discovered by the swarm.

Specifically, the update mechanism is guided by two key components: the personal best position (pBest) achieved by the particle itself and the global best position (gBest) found by the entire swarm. The standard velocity update rule is given by

$$\begin{aligned} \nu_{i,j}(t+1) = {} & \omega \nu_{i,j}(t) + c_{p} r_{p} \left(pBest_{i,j} - x_{i,j}(t)\right) \\ & + c_{g} r_{g} \left(gBest_{j} - x_{i,j}(t)\right), \end{aligned} \tag{3}$$

where $\nu_{i,j}(t)$ denotes the velocity of particle $i$ in the $j$-th dimension at iteration $t$, and $x_{i,j}(t)$ represents its current position. The parameter $\omega$ is the inertia weight, which controls the influence of the particle’s previous velocity. The coefficients $c_{p}$ and $c_{g}$ are acceleration factors that regulate the attraction toward the personal best and global best positions, respectively. The random variables $r_{p}$ and $r_{g}$ are uniformly sampled from the interval $[0, 1)$, introducing stochasticity to enhance population diversity.

After updating the velocity, the particle position is updated as

$$x_{i,j}(t+1) = x_{i,j}(t) + \nu_{i,j}(t+1). \tag{4}$$

Through iterative velocity and position updates, PSO gradually balances exploration and exploitation in the search space. In the proposed framework, this mechanism serves as the basis for the client-side PSO-FedNAS module, where each particle encodes a candidate CNN architecture and evolves according to its local validation performance.
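The update rules in Equations (3) and (4) can be illustrated with a minimal continuous PSO. This sketch minimizes a user-supplied function over real vectors; it is not the discrete, variable-length architecture search used by PSO-FedNAS. The hyperparameter defaults (`inertia=0.7`, `c_p=c_g=1.5`), the search bounds, and the function name are assumptions chosen for illustration.

```python
import random


def pso_minimize(objective, dim, num_particles=20, iterations=100,
                 inertia=0.7, c_p=1.5, c_g=1.5, bounds=(-5.0, 5.0), seed=0):
    """Minimize `objective` with the standard PSO velocity and position updates."""
    rng = random.Random(seed)
    lo, hi = bounds
    positions = [[rng.uniform(lo, hi) for _ in range(dim)]
                 for _ in range(num_particles)]
    velocities = [[0.0] * dim for _ in range(num_particles)]
    p_best = [pos[:] for pos in positions]
    p_best_val = [objective(pos) for pos in positions]
    g_idx = min(range(num_particles), key=lambda i: p_best_val[i])
    g_best, g_best_val = p_best[g_idx][:], p_best_val[g_idx]

    for _ in range(iterations):
        for i in range(num_particles):
            for j in range(dim):
                r_p, r_g = rng.random(), rng.random()  # uniform in [0, 1)
                # Equation (3): inertia + pull toward pBest + pull toward gBest.
                velocities[i][j] = (inertia * velocities[i][j]
                                    + c_p * r_p * (p_best[i][j] - positions[i][j])
                                    + c_g * r_g * (g_best[j] - positions[i][j]))
                # Equation (4): move the particle by its new velocity.
                positions[i][j] += velocities[i][j]
            val = objective(positions[i])
            if val < p_best_val[i]:
                p_best[i], p_best_val[i] = positions[i][:], val
                if val < g_best_val:
                    g_best, g_best_val = positions[i][:], val
    return g_best, g_best_val
```

For example, `pso_minimize(lambda x: sum(v * v for v in x), dim=2)` searches for the minimum of a two-dimensional sphere function. In an architecture search the objective would instead be a validation metric of the network decoded from each particle.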

3. Proposed Method: pFedZKD Framework

In this section, we propose pFedZKD, a personalized federated learning framework tailored for scenarios with model topological heterogeneity. The core idea of pFedZKD is to break the reliance of conventional federated learning on a unified model architecture and parameter-space alignability, by introducing a decouple-and-reconstruct collaborative paradigm that explicitly separates personalized model construction from global knowledge aggregation. Figure 1 illustrates the overall workflow of the proposed pFedZKD framework. The remainder of this section first introduces the general concept of the decouple-and-reconstruct paradigm, and then elaborates the proposed method from two perspectives: the client-side personalized architecture search mechanism and the server-side structure-agnostic knowledge aggregation process.

3.1. Overview of the Decouple-and-Reconstruct Paradigm

To address the dual challenges of structural rigidity and limited communication bandwidth that commonly arise in edge computing environments, this paper proposes a novel one-shot federated learning framework termed pFedZKD. As illustrated in Figure 1, the proposed framework establishes a new Decouple-and-Reconstruct paradigm, whose core idea is to explicitly decouple the client-side personalized model construction process from the server-side global knowledge aggregation process, and to accomplish cross-model knowledge reconstruction within a structure-agnostic semantic space.

In the decoupling stage, clients are no longer constrained to a unified or predefined model architecture, but are instead granted full architectural autonomy. Specifically, each client is regarded as an independent evolutionary agent that leverages the proposed PSO-FedNAS algorithm to autonomously optimize its model architecture and parameter configuration according to its local data distribution, without considering cross-client architectural alignment. In this manner, the personalization process is fully localized, and the system naturally yields a set of client models that are highly diverse in both structure and representational capacity, thereby effectively avoiding the performance bottlenecks induced by structural rigidity.

In the reconstruction stage, the server no longer attempts parameter-level aggregation, but instead adopts a structure-agnostic knowledge collaboration mechanism, namely Multi-ZSKD. Specifically, the server treats the uploaded heterogeneous client models as an ensemble of teachers and, under a data-free setting, employs model inversion techniques to generate a pseudo-sample dataset capable of activating heterogeneous representations, which serves as a shared carrier of distributed knowledge. Subsequently, through an all-to-all cross-inference process, the semantic consensus among heterogeneous teachers is distilled into a global student model with a unified architecture. This process relies solely on pseudo-samples and teacher output responses, without involving model parameters, thereby endowing server-side knowledge aggregation with inherent structure agnosticism and enabling effective reconstruction and transfer of heterogeneous knowledge.
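To make the structure-agnostic idea concrete, the sketch below treats each teacher as a black box that maps a pseudo-sample to logits. It never touches model parameters, so the teachers may have entirely different internals. Uniform averaging of teacher probabilities, the function names, and the two toy teachers are all assumptions of this sketch; the paper's own consensus rule and the model-inversion step that generates pseudo-samples are not reproduced here.

```python
import math


def softmax(logits, tau=1.0):
    """Temperature-scaled softmax, computed stably."""
    scaled = [z / tau for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_targets(teachers, pseudo_samples, tau=1.0):
    """Average the softened predictions of all teachers on each pseudo-sample.

    Uniform averaging is an assumption of this sketch, not the paper's stated rule.
    """
    targets = []
    for x in pseudo_samples:
        probs = [softmax(teacher(x), tau) for teacher in teachers]
        num_classes = len(probs[0])
        targets.append([sum(p[c] for p in probs) / len(probs)
                        for c in range(num_classes)])
    return targets


def student_loss(student, pseudo_samples, targets, tau=1.0, eps=1e-12):
    """Mean KL(target || student) over the pseudo-samples."""
    total = 0.0
    for x, target in zip(pseudo_samples, targets):
        q = softmax(student(x), tau)
        total += sum(t * math.log((t + eps) / (qc + eps))
                     for t, qc in zip(target, q) if t > 0.0)
    return total / len(pseudo_samples)


# Two hypothetical teachers with different internals but the same 3-class output space.
def teacher_a(x):
    return [x[0], x[1], 0.0]


def teacher_b(x):
    return [2.0 * x[0], 0.0, x[1] - x[0]]
```

Because only output distributions are exchanged, the student's architecture is independent of any teacher's, which is what allows a unified global model to be trained from heterogeneous uploads.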

Furthermore, pFedZKD strictly adheres to a one-shot communication protocol. The entire collaborative process involves only a single uplink transmission, in which clients upload their heterogeneous models, followed by a single server-side computation for global distillation. This design completely eliminates the substantial bandwidth overhead caused by multi-round parameter synchronization in conventional federated learning. Moreover, since only model parameters are transmitted without any raw data, and pseudo-sample generation relies exclusively on internal model statistics, the proposed framework achieves efficient collaboration while providing strong privacy protection guarantees.

3.2. Client-Side Decoupling: Personalized Architecture Search via PSO-FedNAS

3.2.1. Search Space Definition and Particle Encoding

To enable highly flexible architectural autonomy on edge devices, we formulate personalized network construction as a combinatorial optimization problem over a discrete topological space. A modular encoding strategy is adopted to construct a search space

S

that supports elastic depth scaling. Specifically, the building blocks of candidate networks are abstracted as a predefined set of operation primitives

O_{core} = \{C, P\}

, where

C

denotes a convolutional module responsible for feature extraction, and

P

denotes a pooling module that performs dimensionality reduction. In addition, a fully connected module

(F)

is fixedly introduced as the classification head to map high-dimensional feature representations into the label space.

Within the particle swarm optimization framework, each particle

P_{i}

is modeled as an evolutionary agent that conducts a search process in the discrete search space

S

. To mathematically characterize the instantaneous state of such an agent, we define the position vector of particle

P_{i}

in the search space—namely, the specific neural network architecture it currently represents:

X_{i} = [x_{i, 1}, x_{i, 2}, \dots, x_{i, L_{i}}, F],

(5)

where

L_{i}

denotes the dynamic depth of the current architecture, and

x_{i, j} \in O_{core}

indicates the specific module type selected at the j-th layer. This variable-length sequence encoding endows each particle with strong topological expressiveness, enabling it to cover a wide spectrum of network topologies ranging from shallow feature extractors to deep and complex architectures. For example, different particles may be encoded as

[C - P - F]

and

[C - P - C - P - F - F]

, corresponding to network structures with different computational depths and levels of feature abstraction. Under a predefined maximum length constraint

L_{max}

, this encoding scheme implicitly explores network depth and structure through the selection and arrangement of modules, thereby generating structurally diverse candidate models.

To ensure that randomly generated topologies are trainable, we further impose topological validity constraints on the search space. Specifically, the first layer of each network is constrained to be a convolutional module

(x_{i, 1} \equiv C)

to guarantee effective initial feature extraction from raw inputs, while the network must terminate with at least one fully connected module

(F)

as the classification head to map high-dimensional features into the label space. Given the fixed head and tail, intermediate layers

(1 < j \leq L_{i})

are allowed to be arbitrary combinations of convolutional modules

C

and pooling modules

P

. Formally, the valid search space is defined as

S = \{X \mid Satisfies (X, C_{topo})\}

, where

C_{topo}

denotes the set of aforementioned topological constraints. This constraint mechanism effectively avoids invalid architectures while preserving sufficient structural diversity, thereby providing a rich and feasible candidate solution space for subsequent evolutionary search.
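As a minimal sketch of this encoding, the following Python snippet represents an architecture as a list of module symbols and checks the validity constraints described above. The function names, the `min_len`/`max_len` bounds (applied here to the full sequence including the head), and the sampling probability `p_conv` are illustrative assumptions rather than the authors' implementation.

```python
import random

CORE_OPS = ("C", "P")  # convolution and pooling primitives
HEAD = "F"             # fully connected classification head


def is_valid(arch, min_len=3, max_len=20):
    """Check the topological constraints: starts with C, ends with F."""
    if not (min_len <= len(arch) <= max_len):
        return False
    if arch[0] != "C" or arch[-1] != HEAD:
        return False
    body = list(arch)
    while body and body[-1] == HEAD:  # strip the trailing F module(s)
        body.pop()
    # Everything before the head must be a core operation.
    return all(op in CORE_OPS for op in body)


def sample_architecture(rng, min_len=3, max_len=20, p_conv=0.7):
    """Draw a random valid encoding [C, x_2, ..., x_L, F]."""
    depth = rng.randint(min_len - 1, max_len - 1)  # number of C/P layers
    body = ["C"] + ["C" if rng.random() < p_conv else "P"
                    for _ in range(depth - 1)]
    return body + [HEAD]


rng = random.Random(0)
population = [sample_architecture(rng) for _ in range(5)]
```

Sampling only valid sequences, instead of sampling freely and rejecting invalid ones, is one simple way to keep the initial swarm inside the valid search space.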

3.2.2. Fitness Evaluation and Local Optimization Strategy

PSO-FedNAS performs local search for optimal model architectures on each client through iterative particle swarm evolution. Given the strict data privacy constraints inherent in edge computing environments, the entire fitness evaluation process is designed as a fully local and closed-loop procedure executed on the client side. We adopt an evaluation strategy based on limited-round training. Specifically, for each candidate architecture represented by a particle, the client instantiates the corresponding CNN model and conducts a small number of parameter fine-tuning epochs using its local private dataset, in order to obtain an initial assessment of its discriminative capability under the current data distribution.

To enhance the robustness of the evaluation process and mitigate the risk of early-stage overfitting in deep networks, an architecture-level regularization strategy is uniformly applied during architecture instantiation. Concretely, Batch Normalization (BN) layers are inserted after all convolutional layers, and Dropout is introduced in fully connected layers with a dropout rate of

ρ = 0.3

. After the training stage, the cross-entropy loss of the model evaluated on the local validation set

D_{k}^{val}

is computed and defined as the fitness value of the corresponding particle, denoted by

Fitness (X_{i})

:

Fitness (X_{i}) = L_{CE} (X_{i}; D_{k}^{val}) .

(6)

This metric provides an objective measure of the feature extraction efficiency and discriminative capability of the current architecture under a specific data distribution, and guides the evolutionary search toward low-loss regions.

Due to the discrete and non-convex nature of the search space, especially the dimensional mismatch induced by variable-length particle representations, the velocity-position update mechanism used in conventional continuous PSO cannot be directly applied. To address this challenge, we propose a probability-controlled discrete structural update strategy, which aims to balance the global exploration capability of the population with the local exploitation ability of individual particles. The core idea is to probabilistically determine whether a particle follows its own historical experience (individual best

pBest_{i}

) or the collective knowledge of the swarm (global best

gBest

). Specifically, we introduce a control factor

C_{g} \in [0, 1]

. At each iteration, for the j-th structural layer of particle

X_{i}

, a random variable

r \sim U (0, 1)

is sampled to construct a target reference layer

X_{target}^{(j)}

, according to the following decision rule:

X_{target}^{(j)} = \begin{cases} gBest^{(j)}, & \text{if } r \leq C_{g} \\ pBest_{i}^{(j)}, & \text{if } r > C_{g} \end{cases} .

(7)

Here,

gBest^{(j)}

and

pBest_{i}^{(j)}

denote the module types at the corresponding layer index of the global best particle and the personal best particle, respectively.
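The layer-wise decision rule can be sketched in a few lines of Python. Here particles are plain lists of `"C"`/`"P"` symbols (the classification head is left out), and the fallback used when the preferred particle has no layer at index j is an assumption, since the text does not specify how that case is handled.

```python
import random


def build_target(p_best, g_best, c_g, rng):
    """Mix gBest and pBest layer by layer, following Equation (7)."""
    l_ref = max(len(p_best), len(g_best))
    target = []
    for j in range(l_ref):
        if rng.random() <= c_g:
            preferred, fallback = g_best, p_best
        else:
            preferred, fallback = p_best, g_best
        # Assumption: if the preferred particle is too short to have a
        # j-th layer, take the layer from the other particle instead.
        source = preferred if j < len(preferred) else fallback
        target.append(source[j])
    return target


rng = random.Random(0)
target = build_target(["C", "P"], ["C", "C", "P", "C"], c_g=0.5, rng=rng)
```

Setting `c_g` close to 1 pulls every particle toward the swarm-level best, while values close to 0 keep each particle near its own history.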

After determining the target reference structure

X_{target}

, the actual architecture update is performed using a predefined topological transformation operator

T

. This operator first resolves the alignment issue caused by variable-length encodings: if the depths of the current particle and the target particle differ, the current particle sequence is aligned to the target by insertion or deletion operations until their lengths match. After alignment, the operator computes layer-wise type discrepancies between the current structure and the target structure. For each gene position, if the module types differ (e.g., a convolutional module

C

versus a pooling module

P

), a replacement operation is applied; otherwise, the layer is retained. Notably, the topological transformation operator

T : S \times S \to S

is designed to satisfy the space-closure property, ensuring that any updated position vector

X_{i}^{(t + 1)}

strictly complies with the constraints defined in Section 3.2.1, i.e.,

X_{i}^{(t + 1)} \in S

. This property effectively prevents invalid architectures and guarantees the stability of the evolutionary process. Through the above discrete structural update mechanism, PSO-FedNAS enables efficient and stable exploration of complex architectural spaces while preserving topological validity. The complete client-side architecture search procedure of PSO-FedNAS is summarized in Algorithm 2.

Algorithm 2 PSO-FedNAS: Client-Side Architecture Autonomy

1: Input: Local datasets $(D_{k}^{tr}, D_{k}^{val})$; search space $S$ with constraints $C_{topo}$; population size $N$; PSO iterations $T_{PSO}$; control factor $C_{g}$; local training epochs $E$
2: Output: Personalized optimal architecture $M_{k}$
3: Phase 1: Initialization
4: Initialize architecture population $\{X_{i}\}_{i = 1}^{N}$ from $S$ subject to $C_{topo}$
5: for $i = 1$ to $N$ do
6:     Train $X_{i}$ on $D_{k}^{tr}$ for $E$ epochs
7:     Evaluate on $D_{k}^{val}$
8:     $F_{i} \leftarrow L_{CE} (X_{i}; D_{k}^{val})$
9:     $pBest_{i} \leftarrow X_{i}$
10:     $F_{pBest}^{i} \leftarrow F_{i}$
11: end for
12: $gBest \leftarrow \arg\min_{i \in \{1, \dots, N\}} F_{pBest}^{i}$
13: $F_{gBest} \leftarrow \min_{i \in \{1, \dots, N\}} F_{pBest}^{i}$
14: Phase 2: Evolutionary Search Loop
15: for $t = 1$ to $T_{PSO}$ do
16:     for $i = 1$ to $N$ do
17:         $X_{target} \leftarrow \emptyset$
18:         $L_{ref} \leftarrow \max (len (pBest_{i}), len (gBest))$
19:         for $j = 1$ to $L_{ref}$ do
20:             Sample $r \sim U (0, 1)$
21:             if $r \leq C_{g}$ then
22:                 Append layer $gBest^{(j)}$ to $X_{target}$
23:             else
24:                 Append layer $pBest_{i}^{(j)}$ to $X_{target}$
25:             end if
26:         end for
27:         $X_{i} \leftarrow T (X_{i}, X_{target})$
28:         Train $X_{i}$ on $D_{k}^{tr}$ for $E$ epochs
29:         $F_{new} \leftarrow L_{CE} (X_{i}; D_{k}^{val})$
30:         if $F_{new} < F_{pBest}^{i}$ then
31:             $pBest_{i} \leftarrow X_{i}$
32:             $F_{pBest}^{i} \leftarrow F_{new}$
33:             if $F_{new} < F_{gBest}$ then
34:                 $gBest \leftarrow X_{i}$
35:                 $F_{gBest} \leftarrow F_{new}$
36:             end if
37:         end if
38:     end for
39: end for
40: Phase 3: Optimal Architecture Selection
41: $M_{k} \leftarrow gBest$
42: return $M_{k}$
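The topological transformation operator T used in the update step can be illustrated with the sketch below. It follows a literal reading of the description: align lengths, replace mismatched layers, then enforce validity. Editing at the tail of the sequence, forcing the first layer back to `"C"` as the closure repair, and returning an operation log are all assumptions made for illustration. Under this reading the result coincides with the target up to the repair step; the authors' operator may apply its edits more selectively.

```python
def transform(current, target):
    """Align `current` to `target`, then replace mismatched layers.

    Both arguments are lists of "C"/"P" symbols without the head.
    Returns the updated list and a log of the applied operations.
    """
    updated = list(current)
    ops = []
    # Alignment by deletion (assumption: surplus layers are cut at the tail).
    while len(updated) > len(target):
        ops.append(("delete", len(updated) - 1, updated.pop()))
    # Alignment by insertion (assumption: missing layers are appended).
    while len(updated) < len(target):
        layer = target[len(updated)]
        ops.append(("insert", len(updated), layer))
        updated.append(layer)
    # Layer-wise replacement wherever the module types differ.
    for j, layer in enumerate(target):
        if updated[j] != layer:
            ops.append(("replace", j, layer))
            updated[j] = layer
    # Closure repair (assumption): the first layer must be convolutional.
    if updated and updated[0] != "C":
        ops.append(("repair", 0, "C"))
        updated[0] = "C"
    return updated, ops


new_arch, log = transform(["C", "P", "C"], ["C", "C"])
```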

3.2.3. Handling Model Heterogeneity via Architecture Autonomy

At the end of the local evolutionary search phase, each client k selects the particle with the optimal fitness over its entire evolutionary history as the final personalized model architecture. This selection process is formalized as:

M_{k} = \arg\min_{X} Fitness (X) |_{history} .

(8)

After identifying the optimal architecture, each client further performs full local training of the model parameters based on this architecture using its private dataset, thereby obtaining the final personalized model

M_{k}

.

Since the above optimization and training processes are entirely driven by the unique Non-IID data characteristics of each client, and the search procedures across different clients are mutually independent, the resulting model set

\{M_{1}, \dots, M_{K}\}

naturally exhibits pronounced topological heterogeneity. This implies that client models no longer share a unified architectural backbone, but instead differ substantially in network depth, the number of convolutional layers, and hierarchical layer arrangements. Such heterogeneity, induced by architectural autonomy, fundamentally breaks the structural rigidity imposed by conventional federated learning with a unified model architecture, enabling each client to construct a personalized model that is highly aligned with its local data distribution. Moreover, this structural diversity translates into diversified decision boundaries, providing a highly heterogeneous ensemble of teacher models for the subsequent server-side reconstruction stage, and thereby establishing a critical foundation for effective multi-teacher zero-shot knowledge distillation.

3.3. Server-Side Reconstruction: Structure-Agnostic Knowledge Aggregation

3.3.1. Model-Specific Data Inversion Without Real Data

In the previous subsection, each client autonomously evolves a personalized optimal neural network model based on its local data distribution using the PSO-FedNAS algorithm. As a result, the server receives a collection of client models that are highly heterogeneous in terms of network depth, architectural structure, and feature dimensionality, denoted as:

M = \{M_{1}, \dots, M_{K}\} .

(9)

Since these heterogeneous models participate in global collaboration at the server as sources of knowledge, we uniformly regard them as a set of teacher models and denote them as:

M = \{Te_{1}, \dots, Te_{K}\} .

(10)

Due to the pronounced structural discrepancies among these personalized models, conventional federated aggregation strategies that rely on parameter alignment or layer-wise weighted averaging are no longer applicable in this setting. Moreover, for privacy preservation reasons, the central server is prohibited from accessing any private training data of the clients. To address the dual challenges of data unavailability and severe model heterogeneity, we introduce a model-specific pseudo-data inversion mechanism at the server side, which enables structure-agnostic knowledge explicitization and reconstruction in federated learning scenarios.

This mechanism is inspired by the zero-shot knowledge distillation paradigm proposed by Nayak et al. [45]. However, unlike prior approaches that focus on compressing a single teacher model, pFedZKD performs pseudo-sample inversion in parallel for each heterogeneous client model at the server. Specifically, we regard each personalized model as providing a local statistical perspective of the underlying global data distribution. By conducting model-specific data inversion, the server extracts the implicit statistical priors encoded within each model. Performing inversion across multiple heterogeneous models allows the reconstructed knowledge to achieve greater semantic diversity and broader coverage at the global level.

As illustrated in Algorithm 3, for the k-th teacher model

Te_{k}

, the server first extracts the weight matrix of its classification layer:

W = {[ω_{1}, \dots, ω_{L}]}^{⊤} \in R^{L \times d},

(11)

where L denotes the number of classes and d represents the feature dimension. Based on this matrix, we compute the cosine similarity matrix

C \in R^{L \times L}

between class-wise weight vectors, where each element

C_{i, j}

is defined as:

C_{i, j} = \frac{ω_{i} \cdot ω_{j}}{∥ ω_{i} ∥ \cdot ∥ ω_{j} ∥}, \forall i, j \in \{1, \dots, L\} .

(12)

This similarity matrix characterizes the semantic correlations and potential confusions among different classes as perceived by the model.
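For concreteness, Equation (12) can be computed as in the following dependency-free sketch. The toy weight matrix and the small constant that guards against zero-norm rows are assumptions added for illustration.

```python
import math


def class_similarity(W):
    """Cosine similarity matrix between the rows of an L x d weight matrix."""
    norms = [math.sqrt(sum(w * w for w in row)) for row in W]
    num_classes = len(W)
    C = [[0.0] * num_classes for _ in range(num_classes)]
    for i in range(num_classes):
        for j in range(num_classes):
            dot = sum(a * b for a, b in zip(W[i], W[j]))
            # Guard against zero-norm rows (a safeguard not in Eq. (12)).
            C[i][j] = dot / max(norms[i] * norms[j], 1e-12)
    return C


W = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]  # toy weights: 3 classes, d = 2
C = class_similarity(W)
```

In this toy case classes 1 and 3 have orthogonal weight vectors (similarity 0), while class 2 is equally similar to both.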

Algorithm 3 Server-Side Model-Specific Pseudo-Image Inversion

1: Input: Client teacher models $\{Te_{k}\}_{k = 1}^{K}$; number of classes $L$; batch size $B$; optimization steps $ep$; temperature $τ$; total pseudo-samples per teacher $N_{syn}$; diversity coefficient set $β_{set} = \{0.1, 1.0\}$
2: Output: Global pseudo-image set $X$
3: Initialize $X \leftarrow \emptyset$
4: for $k = 1$ to $K$ do
5:     Initialize $X_{k} \leftarrow \emptyset$
6:     Extract classifier weights $W^{(k)} \in R^{L \times d}$ from $Te_{k}$
7:     Compute class similarity matrix $C^{(k)}$ using Equation (12)
8:     for $c = 1$ to $L$ do
9:         for all $β \in β_{set}$ do
10:             $T_{b} \leftarrow \lceil \frac{N_{syn}}{L \cdot | β_{set} | \cdot B} \rceil$
11:             for $t = 1$ to $T_{b}$ do
12:                 Sample soft targets $Y_{batch} \sim Dir (β \cdot C_{c}^{(k)})$
13:                 Initialize noise images $X_{batch}$
14:                 for $s = 1$ to $ep$ do
15:                     $X_{batch} \leftarrow \arg\min L_{CE} (Y_{batch}, σ (Te_{k} (X_{batch}) / τ))$
16:                 end for
17:                 Add optimized $X_{batch}$ to $X_{k}$
18:             end for
19:         end for
20:     end for
21:     $X \leftarrow X \cup X_{k}$
22: end for
23: return $X$

To enhance the diversity of generated samples and encourage coverage of decision boundaries, we further design a Dirichlet-based soft-label sampling strategy driven by the class similarity matrix C. Specifically, for class i, we construct a sampling distribution

Dir (β \cdot C_{i})

, where

C_{i}

denotes the i-th row of C and

β \in \{0.1, 1.0\}

is a diversity control coefficient. A smaller

β

leads to soft labels closer to one-hot vectors, facilitating the generation of more discriminative pseudo-images, while a larger

β

produces smoother label distributions that encourage exploration of inter-class decision boundaries, thereby improving generalization.
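A soft label can be drawn from such a Dirichlet distribution by normalizing independent Gamma variates, as in the sketch below. Because cosine similarities may be negative while Dirichlet concentrations must be positive, the row is min-max scaled to [0, 1] and shifted by a small `eps`. That rescaling is an assumption made here to keep the example well defined, not a step stated in the text.

```python
import random


def sample_soft_label(sim_row, beta, rng, eps=1e-2):
    """Draw one soft label from Dir(beta * c), c being a rescaled row of C."""
    lo, hi = min(sim_row), max(sim_row)
    span = (hi - lo) or 1.0
    # Assumption: min-max scale the similarities and add eps so that
    # every concentration parameter is strictly positive.
    alpha = [beta * (s - lo) / span + eps for s in sim_row]
    # A Dirichlet sample is a vector of Gamma(alpha_i, 1) draws, normalized.
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]


rng = random.Random(0)
row = [1.0, 0.4, -0.2]  # toy similarity row for one class
labels = {beta: sample_soft_label(row, beta, rng) for beta in (0.1, 1.0)}
```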

Given a sampled soft target

Y \sim Dir (β \cdot C_{i})

, the server initializes a random noise image X and optimizes it via gradient descent such that the prediction distribution of the teacher model matches the target soft label:

\bar{X} = \arg\min_{X} L_{CE} (Y, σ (\frac{Te_{k} (X)}{τ})),

(13)

where

τ

is a temperature scaling parameter and

σ (\cdot)

denotes the softmax function.
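The optimization in Equation (13) can be illustrated on a toy problem. In the sketch below a linear map `W x` stands in for the teacher so that the input gradient has a closed form; a real CNN teacher would need automatic differentiation instead. The step size, step count, temperature, and toy weights are illustrative assumptions.

```python
import math
import random


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def invert(W, y, tau=1.0, steps=200, lr=0.5, seed=0):
    """Gradient descent on the input x of a linear stand-in teacher W x."""
    rng = random.Random(seed)
    d = len(W[0])
    x = [rng.gauss(0.0, 1.0) for _ in range(d)]  # standard normal noise
    losses = []
    for _ in range(steps):
        z = [sum(w * v for w, v in zip(row, x)) / tau for row in W]
        p = softmax(z)
        # Cross-entropy between the soft target y and the prediction p.
        losses.append(-sum(t * math.log(max(q, 1e-12)) for t, q in zip(y, p)))
        # For a linear teacher: dL/dx = W^T (p - y) / tau.
        grad = [sum(W[c][k] * (p[c] - y[c]) for c in range(len(W))) / tau
                for k in range(d)]
        x = [v - lr * g for v, g in zip(x, grad)]
    return x, losses


W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # toy teacher: 3 classes, d = 2
y = [0.7, 0.2, 0.1]                         # sampled soft target
x_bar, losses = invert(W, y)
```

The loss is bounded below by the entropy of the soft target, so it approaches that value, not zero, as the prediction matches `y`.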

By repeating the above inversion process for different soft-label targets, the server constructs a pseudo-sample set for each teacher model

Te_{k}

:

X_{k} = \{\bar{X}^{(1)}, \dots, \bar{X}^{(N)}\} .

(14)

Finally, the server aggregates the pseudo-samples generated by all K heterogeneous teacher models to form a global pseudo-dataset:

X = ⋃_{k = 1}^{K} X_{k} .

(15)

Without exposing any real training samples, this global pseudo-dataset serves as a unified data carrier for structure-agnostic knowledge fusion across heterogeneous clients and provides the foundation for subsequent multi-teacher zero-shot knowledge distillation.

3.3.2. Soft Label Construction from Heterogeneous Teachers

Although the images in the global pseudo-sample set

X

are generated via inversion with respect to individual teacher models, relying on the predictions of a single model as supervision during the knowledge aggregation stage may introduce local errors caused by model-specific biases. Owing to the structural heterogeneity induced by PSO-FedNAS, the client models exhibit diverse network topologies and feature extraction pathways. We therefore organize these structurally diverse client models into a naturally high-diversity teacher ensemble. Since models with different architectures tend to capture complementary feature subspaces, their aggregated predictions generally provide more robust decision boundaries than those derived from any single model.

Specifically, for any pseudo-image

\bar{x} \in X

, we adopt an all-teacher cross-inference strategy to construct supervisory signals. The pseudo-image

\bar{x}

is first fed into all K teacher models to obtain the corresponding logits via forward propagation. The prediction distributions of all teachers are then averaged to form a consensus soft label

\bar{y} (\bar{x})

:

\bar{y} (\bar{x}) = \frac{1}{K} \sum_{k = 1}^{K} σ (\frac{Te_{k} (\bar{x})}{τ}) .

(16)

By repeating this procedure for all pseudo-images in

X

, we obtain the corresponding consensus label set

Y = \{\bar{y} (\bar{x}) \mid \bar{x} \in X\}

, thereby constructing the complete synthetic distillation dataset

D_{syn} = (X, Y)

. This ensemble-based label construction strategy effectively mitigates the impact of inversion errors from individual models and, by integrating predictions from multiple perspectives, provides smoother and more consistent semantic supervision.
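Equation (16) amounts to averaging temperature-scaled softmax outputs over the K teachers. A minimal sketch follows, with the teacher logits supplied as plain lists; the example logits are made up for illustration.

```python
import math


def softmax(z, tau=1.0):
    scaled = [v / tau for v in z]
    m = max(scaled)
    e = [math.exp(v - m) for v in scaled]
    s = sum(e)
    return [v / s for v in e]


def consensus_label(teacher_logits, tau=4.0):
    """Average the teachers' softened predictions for one pseudo-image."""
    probs = [softmax(z, tau) for z in teacher_logits]
    num_teachers = len(probs)
    num_classes = len(probs[0])
    return [sum(p[c] for p in probs) / num_teachers
            for c in range(num_classes)]


# Hypothetical logits of K = 2 teachers on the same pseudo-image.
logits = [[2.0, 0.0, -1.0], [0.5, 1.5, 0.0]]
y_bar = consensus_label(logits, tau=4.0)
```

Because each teacher's output is already a probability vector, the average is itself a valid distribution and needs no renormalization.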

3.3.3. Multi-Teacher Zero-Shot Knowledge Distillation

After constructing the semantically consistent synthetic distillation dataset

D_{syn} = (X, Y)

, the server proceeds to the final stage of global model reconstruction, namely multi-teacher zero-shot knowledge distillation.

At this stage, the server initializes a global student model

Stu (\cdot)

with a fixed architecture, which can be predefined according to deployment requirements. It is worth emphasizing that the architecture of the student model is completely independent of the client-side search space, thereby achieving a clear decoupling between the server and clients at the model architecture level. Unlike conventional federated learning paradigms that rely on multiple rounds of parameter exchange and synchronization, the training of the student model in pFedZKD is conducted entirely based on the one-shot constructed pseudo-samples and their corresponding consensus soft labels.

Specifically, the student model is trained in a fully supervised manner on the synthetic distillation dataset

D_{syn}

. Since the server has no access to ground-truth labels, the training process is solely driven by the consensus soft labels

Y

generated in Section 3.3.2. To effectively preserve the class relationships and distributional information embedded in the aggregated predictions of heterogeneous teachers, the Kullback–Leibler (KL) divergence between the student prediction distribution and the teacher consensus distribution is adopted as the distillation loss:

L_{KD} = \frac{1}{| X |} \sum_{\bar{x} \in X} D_{KL} (\bar{y} (\bar{x}) ∥ σ (\frac{Stu (\bar{x})}{τ})) .

(17)

By minimizing

L_{KD}

, the global student model is able to effectively absorb the aggregated discriminative knowledge from the heterogeneous teacher ensemble without requiring real data or iterative communication. This enables efficient knowledge reconstruction across both model architectures and data distributions. Algorithm 4 summarizes the complete workflow of the pFedZKD framework, encompassing client-side personalized modeling, server-side pseudo-sample generation, and multi-teacher zero-shot distillation. By jointly addressing privacy preservation and communication efficiency, the proposed framework facilitates effective collaboration among heterogeneous models and demonstrates greater system flexibility and practical value compared to conventional federated learning paradigms.
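The distillation loss of Equation (17) can be written out explicitly as below. This sketch only evaluates the loss for given student logits; the student update itself would use whatever optimizer the training framework provides. The `eps` smoothing inside the logarithm is an added numerical safeguard.

```python
import math


def softmax(z, tau=1.0):
    scaled = [v / tau for v in z]
    m = max(scaled)
    e = [math.exp(v - m) for v in scaled]
    s = sum(e)
    return [v / s for v in e]


def kd_loss(consensus, student_logits, tau=4.0, eps=1e-12):
    """Mean KL(consensus || student) over a batch of pseudo-images."""
    total = 0.0
    for y, z in zip(consensus, student_logits):
        q = softmax(z, tau)
        # Terms with a zero target probability contribute nothing to KL.
        total += sum(t * math.log((t + eps) / (p + eps))
                     for t, p in zip(y, q) if t > 0.0)
    return total / len(consensus)


student = [[2.0, 0.0, -1.0]]
loss_mismatch = kd_loss([[0.8, 0.1, 0.1]], student, tau=4.0)
loss_match = kd_loss([softmax(student[0], 4.0)], student, tau=4.0)
```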

3.4. Communication and Computational Analysis

From a system cost perspective, pFedZKD transforms the repeatedly incurred communication cost in conventional multi-round federated optimization into a one-time cost of client-side architecture search and model optimization. Specifically, its main computational burden is concentrated in the iterative evaluation of candidate architectures in PSO-FedNAS, while its communication overhead is compressed into a single round of knowledge upload from clients to the server and the subsequent unified distillation process at the server side. Therefore, the key characteristic of pFedZKD is not to reduce all system costs uniformly, but to achieve a trade-off between higher one-time local computation and lower multi-round communication cost.

Compared with conventional multi-round parameter-aggregation-based federated learning, pFedZKD has three system-level advantages. First, the proposed framework requires only one client-server interaction to complete knowledge collaboration, thereby substantially reducing the cumulative number of communication rounds, parameter synchronization delay, and the potential privacy leakage risk caused by frequent interactions. Second, pFedZKD does not rely on parameter-space alignment across clients; instead, it performs semantic-level knowledge aggregation among heterogeneous models through server-side zero-shot knowledge distillation, making it more naturally suited to structurally heterogeneous environments. Third, whereas conventional multi-round federated learning distributes client-side computation across repeated rounds of local update and communication, pFedZKD concentrates the search cost of PSO-FedNAS in a one-time local search stage at the client side. This characteristic gives the framework stronger deployability in edge scenarios that permit a certain amount of offline local computation.

Furthermore, the search cost of PSO-FedNAS is mainly determined by the population size, the number of generations, and the local evaluation cost of candidate models. Therefore, its search budget can be flexibly adjusted according to device capability, for example by reducing the population size, decreasing the number of generations, or shortening the candidate evaluation process to further control the local computational burden. Owing to this property, pFedZKD is more suitable for heterogeneous edge environments where communication is constrained, synchronization is expensive, but a certain amount of offline local optimization is acceptable. For extremely resource-constrained devices, however, further reducing the computational and energy cost of the client-side search stage remains an important direction for future research.

Algorithm 4 pFedZKD: Personalized One-Shot Federated Learning via PSO-FedNAS and Multi-ZSKD

1: Input: Client private datasets $\{D_{k}\}_{k = 1}^{K}$; architecture search space $S$ and PSO configurations; distillation epochs $E_{distill}$; batch size $B$; temperature $τ$; learning rate $η$; pseudo-sample budget $N_{syn}$
2: Output: Global student model $Stu$ with parameters $θ_{Stu}$
3: Phase I: Client-Side Personalized Architecture Search
4: for $k = 1$ to $K$ in parallel do
5:     $Te_{k} \leftarrow$ Algorithm 2 $(D_{k}, S)$
6:     Upload teacher model $Te_{k}$ to server
7: end for
8: Phase II: Server-Side Knowledge Reconstruction
9: Initialize global student model $Stu$
10: Step 1: Pseudo-Image Inversion
11: $X \leftarrow$ Algorithm 3 $(\{Te_{k}\}_{k = 1}^{K}, N_{syn})$
12: Step 2: Consensus Soft Label Construction
13: Initialize $D_{syn} \leftarrow \emptyset$
14: for all $\bar{x} \in X$ do
15:     $\bar{z} \leftarrow 0$
16:     for $k = 1$ to $K$ do
17:         $z_{k} \leftarrow Te_{k} (\bar{x})$
18:         $\bar{z} \leftarrow \bar{z} + σ (z_{k} / τ)$
19:     end for
20:     $\bar{y} \leftarrow \bar{z} / K$
21:     $D_{syn} \leftarrow D_{syn} \cup \{(\bar{x}, \bar{y})\}$
22: end for
23: Step 3: Multi-Teacher Zero-Shot Distillation
24: for $e = 1$ to $E_{distill}$ do
25:     Shuffle $D_{syn}$ and split into mini-batches of size $B$
26:     for all $(X_{b}, Y_{b})$ do
27:         $P_{Stu} \leftarrow σ (Stu (X_{b}) / τ)$
28:         Compute loss $L_{KD}$ via Equation (17)
29:         $θ_{Stu} \leftarrow θ_{Stu} - η \nabla_{θ_{Stu}} L_{KD}$
30:     end for
31: end for
32: return $Stu$

4. Experiments, Results and Analysis

In this section, we conduct a comprehensive experimental evaluation to systematically assess the effectiveness and robustness of the proposed pFedZKD framework under heterogeneous federated learning environments. Our objective is to verify whether pFedZKD can achieve superior performance compared to existing state-of-the-art (SOTA) methods under extremely low communication cost, by jointly leveraging client-side architectural autonomy and server-side knowledge reconstruction. The remainder of this section is organized as follows. Section 4.1 details the experimental protocol, including dataset partitioning, baseline methods, and implementation specifics. Section 4.2 presents the main comparative results, including client-level personalization performance, global model convergence behavior, robustness analysis under extreme data heterogeneity, and qualitative analysis of pseudo-image generation. Section 4.3 further investigates the contribution of the PSO-FedNAS architecture search module and the multi-teacher zero-shot knowledge distillation (ZSKD) mechanism through extensive ablation studies.

All experiments are implemented using the TensorFlow 2.6 deep learning framework and conducted on a system running Ubuntu 18.04. The experimental platform is equipped with four NVIDIA GeForce GTX 1080 Ti GPUs and the corresponding computational resources.

4.1. Experimental Setup

4.1.1. Datasets and Non-IID Partitioning

To systematically evaluate the generalization capability and robustness of the proposed pFedZKD framework under varying levels of visual complexity and domain characteristics, we conduct experiments on four widely used image classification benchmarks: MNIST, Fashion-MNIST, SVHN, and CIFAR-10. Table 1 summarizes the detailed statistics of these datasets, and their characteristics are described as follows:

MNIST and Fashion-MNIST: MNIST is a foundational benchmark in handwritten digit recognition, consisting of standardized $28 \times 28$ grayscale images. To introduce a more challenging task while preserving the same spatial resolution, we additionally employ Fashion-MNIST, which replaces digit classes with ten categories of clothing items (e.g., coats and footwear). These two datasets are primarily used to assess the learning capability and stability of the models under low visual complexity and highly structured pattern recognition scenarios.
SVHN: The Street View House Numbers (SVHN) dataset represents a more realistic visual scenario with increased complexity. Unlike the clean backgrounds in MNIST, SVHN contains over 600,000 $32 \times 32$ color images that are significantly affected by variations in illumination, motion blur, and background clutter. This dataset is employed to evaluate the robustness of the proposed framework when handling complex natural scenes and substantial noise interference.
CIFAR-10: As a mainstream benchmark for generic object recognition, CIFAR-10 consists of ten mutually exclusive object categories (e.g., airplanes, automobiles, and animals). The high intra-class variability and complex background textures of CIFAR-10 pose substantial challenges to the feature representation learning capability of personalized models.

To rigorously simulate the inherent statistical heterogeneity in federated edge learning scenarios, particularly the label distribution skew, we adopt a Dirichlet distribution-based partitioning strategy denoted as

Dir (α)

. Concretely, the local dataset of each client is sampled according to a Dirichlet distribution, where the concentration parameter

α

controls the degree of data heterogeneity. A smaller

α

results in highly concentrated probability mass, leading to severe class imbalance or even the absence of certain classes at individual clients. In contrast, a larger

α

yields a more uniform data allocation that gradually approaches the independent and identically distributed (IID) setting.
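The text does not spell out the exact partitioning procedure. The sketch below shows one commonly used class-wise variant, in which the samples of each class are split across clients according to proportions drawn from a symmetric Dirichlet distribution. The function name and the rounding of cut points are illustrative assumptions.

```python
import random


def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Split sample indices across clients with Dir(alpha) label skew."""
    rng = random.Random(seed)
    clients = [[] for _ in range(num_clients)]
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        # Dirichlet proportions via normalized Gamma(alpha, 1) draws.
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_clients)]
        total = sum(draws)
        cumulative, start = 0.0, 0
        for k in range(num_clients):
            cumulative += draws[k] / total
            # The last client takes the remainder so no sample is dropped.
            end = len(idxs) if k == num_clients - 1 else int(round(cumulative * len(idxs)))
            clients[k].extend(idxs[start:end])
            start = end
    return clients


labels = [i % 10 for i in range(1000)]  # toy labels: 10 balanced classes
parts = dirichlet_partition(labels, num_clients=5, alpha=0.1)
```

With a small `alpha` such as 0.1, most of each class typically lands on one or two clients, which reproduces the missing-class situation described above.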

In this work, all experiments are conducted exclusively under non-IID settings, with

α \in \{0.1, 0.3, 0.5\}

representing three different levels of heterogeneity. In particular,

α = 0.1

corresponds to an extremely heterogeneous scenario and poses substantial challenges to the collaborative learning process.

4.1.2. Hyperparameter Settings and Baselines

To ensure a fair and reproducible evaluation of the proposed pFedZKD framework, the main hyperparameters on both the client and server sides are specified as follows.

On the client side, both candidate architectures during PSO-FedNAS and the final selected personalized model are optimized using the Adam optimizer with an initial learning rate of $1 \times 10^{-3}$ and a batch size of 64. For personalized architecture search based on PSO-FedNAS, the population size is set to 20 particles, and the evolution process is carried out for 10 generations. To reduce the variance caused by random swarm initialization and improve the stability of architecture search, the PSO search procedure is independently repeated 10 times on each client. Each run adopts the same population size and number of generations, and the architecture with the best fitness among all runs is selected as the final personalized model. To balance search efficiency and accuracy, each candidate architecture is trained for only 10 epochs to evaluate its fitness, while the final selected optimal architecture is further fine-tuned for 50 epochs. The search space is explicitly constrained to accommodate the resource limitations of edge devices, where the network depth is restricted to $L \in [3, 20]$, the maximum number of output channels in convolutional layers is limited to 128, and the maximum number of neurons in fully connected layers is set to 300. In addition, the sampling probabilities of convolutional and pooling operations are set to 0.7 and 0.3, respectively, and the maximum convolutional kernel size is limited to $7 \times 7$.
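The search-space constraints above can be illustrated by a sketch that draws one random candidate architecture. The numeric limits come from the text. The layer encoding, the choice of odd kernel sizes, and the convention that a candidate starts with a convolution and ends with a fully connected layer are hypothetical, and the sketch does not reproduce the actual PSO-FedNAS particle encoding.

```python
import random

# Limits quoted in the text.
MIN_DEPTH, MAX_DEPTH = 3, 20
MAX_CONV_CHANNELS = 128
MAX_FC_NEURONS = 300
KERNEL_SIZES = [1, 3, 5, 7]   # assumption: odd sizes up to the 7 x 7 limit
P_CONV = 0.7                  # pooling is drawn with probability 1 - 0.7 = 0.3

def sample_conv(rng):
    """Random convolutional layer inside the channel and kernel limits."""
    return {"type": "conv",
            "out_channels": rng.randint(1, MAX_CONV_CHANNELS),
            "kernel": rng.choice(KERNEL_SIZES)}

def sample_architecture(rng):
    """Draw one random candidate, e.g. an initial particle of the swarm."""
    depth = rng.randint(MIN_DEPTH, MAX_DEPTH)
    # Assumption: first layer is a convolution, last layer is fully connected.
    layers = [sample_conv(rng)]
    for _ in range(depth - 2):
        if rng.random() < P_CONV:
            layers.append(sample_conv(rng))
        else:
            layers.append({"type": "pool", "kind": rng.choice(["max", "avg"])})
    layers.append({"type": "fc", "neurons": rng.randint(1, MAX_FC_NEURONS)})
    return layers

example = sample_architecture(random.Random(0))
```

If every particle is trained once per generation, which is an assumption about the search loop, the stated settings imply at most 10 runs × 20 particles × 10 generations = 2000 candidate evaluations per client, or 20,000 short training epochs at 10 epochs each, before the 50-epoch fine-tuning of the selected architecture.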

On the server side, the global data inversion process is initialized using a standard normal distribution. The synthetic data budget $N_{syn}$ is set to 4000 samples for MNIST and Fashion-MNIST, and 4800 samples for CIFAR-10. The global student model is trained by minimizing the Kullback–Leibler divergence with respect to the aggregated soft labels, where the distillation temperature $\tau$ is set to 4.0.
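The distillation objective can be sketched in NumPy as follows. The uniform averaging of temperature-softened teacher probabilities is an assumption made for illustration, since the exact aggregation rule is defined in the method section rather than here. The helper names are hypothetical, and the optional $\tau^2$ scaling used in some distillation formulations is omitted.

```python
import numpy as np

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax along the last axis."""
    z = np.asarray(logits, dtype=float) / tau
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def aggregate_soft_labels(teacher_logits, tau):
    """Assumption: uniform mean of the softened teacher distributions."""
    return np.mean([softmax(t, tau) for t in teacher_logits], axis=0)

def kd_kl_loss(student_logits, soft_labels, tau, eps=1e-12):
    """Batch mean of KL(soft_labels || softened student distribution)."""
    q = softmax(student_logits, tau)
    kl = np.sum(soft_labels * (np.log(soft_labels + eps) - np.log(q + eps)), axis=-1)
    return float(kl.mean())

TAU = 4.0
rng = np.random.default_rng(0)
# Three hypothetical teachers, a batch of 8 pseudo-samples, 10 classes.
teachers = [rng.normal(size=(8, 10)) for _ in range(3)]
targets = aggregate_soft_labels(teachers, TAU)
loss_random = kd_kl_loss(rng.normal(size=(8, 10)), targets, TAU)
# A student that reproduces a single teacher exactly incurs zero loss on it.
loss_single = kd_kl_loss(teachers[0], softmax(teachers[0], TAU), TAU)
```

A temperature of 4.0 flattens each teacher distribution before averaging, so the student also receives a signal from the non-maximal classes.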

To validate the effectiveness of pFedZKD, seven representative federated learning methods are selected as baseline approaches and categorized according to their support for model heterogeneity and their communication paradigms. Specifically, FedAvg [2] and FedProx [29] are adopted as classical baselines in homogeneous multi-round federated learning settings. FedAvg performs global knowledge aggregation through iterative parameter averaging, whereas FedProx introduces a proximal term into the local objective to alleviate training deviation caused by statistical heterogeneity. To account for model heterogeneity, FedDF [37] is further included as a representative heterogeneous federated distillation method. Unlike the proposed approach, FedDF relies on an auxiliary public dataset accessible at the server to accomplish knowledge fusion, which is fundamentally different from the data-free setting considered in this work.

In addition, four advanced one-shot heterogeneous federated learning frameworks, namely DENSE [25], FedMHO [41], FedOM [42], and FedLPA [43], are also considered. Although DENSE supports model heterogeneity, its client architectures are typically predefined and static, without explicitly incorporating personalized architecture adaptation based on local data distributions. FedMHO transfers knowledge by requiring clients to upload locally generated pseudo-images. Although this design is applicable to one-shot learning scenarios, it may also introduce potential privacy risks. FedOM focuses on one-shot knowledge collaboration across heterogeneous clients under a single communication round; however, it does not explicitly integrate a data-driven architecture search mechanism for client-specific model personalization. FedLPA mainly addresses knowledge alignment and aggregation among heterogeneous local models in one-shot federated learning, but does not explicitly perform structure-level adaptive optimization tailored to diverse local data distributions.

In contrast, pFedZKD performs data-driven personalized architecture search through particle swarm optimization and combines it with zero-shot knowledge distillation to achieve cross-architecture knowledge aggregation without sharing raw data or model parameters. As a result, it simultaneously enhances adaptability to model heterogeneity, communication efficiency, and privacy preservation. In all experiments, Top-1 accuracy on the global test set is adopted as the primary evaluation metric.

4.2. Performance Comparison

4.2.1. Personalized Model Performance on Heterogeneous Clients

To comprehensively evaluate the effectiveness of PSO-FedNAS in client-side architecture search, this subsection conducts comparative experiments on the four benchmark image datasets introduced in Section 4.1.1 under varying degrees of non-IID data partitions. Specifically, the personalized models generated by PSO-FedNAS are compared with two representative baseline approaches: (1) the global model obtained by FedAvg after sufficient convergence through multiple communication rounds, and (2) VGG-11 and ResNet-18 models trained independently using only local private data.

To establish a strong federated learning baseline, a standard multi-round federated training protocol is adopted, with the number of communication rounds set to 150. After training, the resulting global model is evaluated by each client on its local test set. As illustrated in Figure 2, PSO-FedNAS consistently exhibits more stable performance than FedAvg on the client side across all datasets and Dirichlet-based non-IID partition settings.

In highly heterogeneous scenarios with $\alpha = 0.1$, the fixed and unified architecture employed by FedAvg struggles to adapt to severely skewed local data distributions, resulting in notable performance degradation on several clients. In contrast, PSO-FedNAS is able to better match the local feature distributions of individual clients by adaptively adjusting network depth and module composition, thereby maintaining more stable test performance under such conditions.

A more detailed analysis of the results in Figure 2 is presented next. Figure 2a–c,g–i report the comparison results on the MNIST and Fashion-MNIST datasets, respectively. Under the Dirichlet setting with $\alpha = 0.1$, the test accuracy of FedAvg on some clients, such as Client0 and Client3, drops significantly to the range of 0.3 to 0.6, indicating that a unified global model fails to simultaneously satisfy all clients under extreme label distribution skew. In contrast, PSO-FedNAS achieves higher accuracy on these representative challenging clients, demonstrating the effectiveness of its architecture search strategy in alleviating the negative impact of label distribution shift. As $\alpha$ increases to 0.5 and the degree of data heterogeneity is reduced, the overall performance of FedAvg improves, while PSO-FedNAS continues to maintain superior performance on most clients, reflecting its robustness across different levels of heterogeneity.

The advantage of architecture adaptability becomes even more evident on natural image datasets with higher visual complexity. Figure 2d–f present the experimental results on the SVHN dataset. Despite the presence of substantial background noise and illumination variations, PSO-FedNAS consistently outperforms the baseline methods under all non-IID settings. Figure 2j–l show the results on the CIFAR-10 dataset. Under the extreme heterogeneity setting with $\alpha = 0.1$, the test accuracy of FedAvg remains at a relatively low level on most clients, indicating optimization difficulties in complex tasks under severe data heterogeneity. In contrast, PSO-FedNAS is still able to achieve test accuracies above 0.4 on several clients, such as Client5 and Client6. These results indicate that automatically searched personalized architectures possess stronger feature representation capability when handling data with high intra-class variance.

To further investigate the role of architecture search mechanisms in edge scenarios, a quantitative analysis is conducted on the CIFAR-10 dataset, which represents the most challenging task in terms of complexity. The average client test accuracy is evaluated under different non-IID settings. Two representative locally trained baselines are introduced for comparison: Local-VGG11 with moderate model capacity and Local-ResNet18 with higher capacity. Both models are trained exclusively on local private data without any form of federated collaboration.

As shown in Figure 3, pFedZKD achieves the highest average test accuracy across all heterogeneity settings. In the strongly heterogeneous scenario with $\alpha = 0.1$, pFedZKD attains an accuracy of 33.69%, which is substantially higher than that of Local-VGG11 at 28.54% and Local-ResNet18 at 26.22%. As the degree of non-IID data heterogeneity gradually decreases with $\alpha$ increasing to 0.5, both local models exhibit improved performance. However, Local-VGG11 consistently outperforms Local-ResNet18 across multiple settings. This observation suggests that under local-only training with limited data, deeper and more complex network architectures do not necessarily translate into superior performance.

These results highlight the inherent tension between model capacity and data availability in isolated edge training scenarios. Although ResNet-18 possesses a larger parameter capacity, its performance does not surpass that of the simpler VGG-11 under highly non-IID and data-scarce conditions, and may even suffer from optimization difficulties due to an excessively large parameter space. Moreover, as predefined fixed architectures, both VGG and ResNet lack the ability to dynamically adapt to the actual data scale and class distribution of individual clients, which limits their expressive capacity on complex tasks. In contrast, pFedZKD leverages PSO-FedNAS to overcome the constraints of fixed architectures and enables the discovery of model structures that are better matched to local data conditions without predefining network depth or width.

Overall, the results presented in Figure 2 and Figure 3 further highlight the importance of client-side architecture autonomy in heterogeneous federated learning systems. The reason why PSO-FedNAS is able to improve model generalization lies mainly in its ability to adaptively search for more suitable network architectures according to the local data distribution and resource constraints of different clients, thereby reducing the mismatch between a fixed shared architecture and local tasks. Compared with predefined unified architectures, this personalized search mechanism can better balance model capacity and task complexity, alleviating underfitting and overfitting to some extent, and thus maintaining more stable local test performance under highly heterogeneous conditions. Furthermore, within the proposed framework, these searched personalized client models also serve as teacher models for subsequent pseudo-sample construction and zero-shot knowledge distillation at the server side. Therefore, better local architectures not only improve client-side personalization, but also provide higher-quality and more diverse semantic information for global knowledge aggregation, thereby further supporting the distillation effectiveness and generalization performance of the final global student model.

4.2.2. Server-Side Comparison of pFedZKD with Baseline and SOTA Methods

To comprehensively evaluate the overall performance of pFedZKD, we conduct systematic comparisons with five representative baseline methods from two perspectives: final test accuracy (Table 2) and training convergence behavior (Figure 4). The comparisons are performed across four benchmark datasets under three non-IID settings.

As shown in Table 2, pFedZKD achieves the best test performance in the majority of experimental configurations. It should be noted that the original study of FedMHO does not include experiments on the CIFAR-10 dataset, and the corresponding entries are therefore marked as “–”. For CIFAR-10, the best baseline performance is determined among the remaining comparable methods, namely FedAvg, FedProx, FedDF, and DENSE.

Under the strongly heterogeneous setting ($\alpha = 0.1$), pFedZKD attains the highest test accuracy on the MNIST, Fashion-MNIST, and CIFAR-10 datasets. Specifically, on MNIST, pFedZKD slightly outperforms FedMHO with a test accuracy of 87.58%, demonstrating its effectiveness even on relatively simple visual recognition tasks. On the more challenging Fashion-MNIST and CIFAR-10 datasets, pFedZKD surpasses the strongest comparable baseline by 8.38% and 10.34%, respectively, highlighting its pronounced advantage in scenarios with complex feature distributions and high-dimensional natural images. In contrast, FedMHO maintains a certain performance advantage on the SVHN dataset; nevertheless, pFedZKD still achieves competitive results, indicating its robustness under extremely heterogeneous conditions. Overall, these results suggest that pFedZKD exhibits stronger structural adaptability and knowledge transfer capability in complex visual tasks.

Under the moderately heterogeneous setting ($\alpha = 0.3$), the performance advantage of pFedZKD becomes more stable. It consistently outperforms all competing methods on MNIST, Fashion-MNIST, and CIFAR-10, with performance gains ranging from 1.69% to 8.85%. Meanwhile, on the SVHN dataset, the performance gap between pFedZKD and the best-performing method is below 1%, indicating that pFedZKD remains highly competitive. These observations further confirm the robustness of pFedZKD across datasets with varying levels of data complexity.

When the degree of non-IID heterogeneity is further reduced ($\alpha = 0.5$), pFedZKD continues to maintain a leading trend, achieving particularly notable improvements on the Fashion-MNIST dataset, with a gain of up to 8.62%. Although certain methods may slightly outperform pFedZKD on individual datasets under specific configurations, pFedZKD demonstrates superior overall generalization ability and stability across most datasets and non-IID settings.

The above results systematically reveal the inherent limitations of different approaches in heterogeneous federated learning. Methods based on a unified architecture, such as FedAvg and FedProx, struggle to effectively cope with severe statistical heterogeneity. Distillation-based methods that support model heterogeneity, including FedDF, DENSE, and FedMHO, partially alleviate this issue but remain constrained by predefined model architectures and lack structural adaptability to local data distributions. In contrast, pFedZKD dynamically searches for personalized optimal architectures via client-side PSO-FedNAS and integrates semantically consistent pseudo-samples generated by the server-side multi-ZSKD mechanism. This structure-agnostic knowledge aggregation enables pFedZKD to consistently achieve superior overall performance across diverse heterogeneous scenarios.

To further analyze the differences among methods from the perspective of training dynamics, Figure 4 illustrates the convergence curves of global model Top-1 test accuracy for pFedZKD and three representative baseline methods (FedAvg, FedProx, and DENSE) across four image datasets under different non-IID settings. It is worth noting that both DENSE and pFedZKD adopt a one-shot communication paradigm; therefore, the training rounds shown on the horizontal axis correspond to the distillation iterations of the server-side student model. In contrast, the training rounds of FedAvg and FedProx represent the communication rounds between clients and the server. By comparing convergence speed, stability, and final performance, the optimization advantages of pFedZKD in heterogeneous federated learning can be more clearly characterized.

From an overall perspective, pFedZKD consistently exhibits faster convergence and higher final accuracy across all experimental scenarios. On the MNIST dataset, when $\alpha = 0.1$, pFedZKD rapidly increases the test accuracy to above 60% and stabilizes around 80% during later training stages. In comparison, DENSE converges more slowly, requiring more than 100 training rounds to stabilize, with a final accuracy of approximately 75%. As $\alpha$ increases to 0.3 and 0.5, the overall performance of all methods improves; however, pFedZKD consistently maintains the leading position.

On the SVHN dataset, pFedZKD demonstrates the most favorable convergence behavior under all three data distribution settings, characterized by faster performance improvement and higher final accuracy. Under the highly non-IID setting ($\alpha = 0.1$), FedAvg and FedProx exhibit pronounced oscillations and low accuracy, while DENSE shows some improvement but still performs significantly worse than pFedZKD. As data heterogeneity is gradually mitigated, the accuracy of all methods increases, yet pFedZKD retains a stable advantage.

On the Fashion-MNIST dataset, pFedZKD similarly achieves the best performance under all three non-IID settings. In particular, under the strongly heterogeneous scenario ($\alpha = 0.1$), its accuracy exceeds that of other methods by more than 10 percentage points. Although FedAvg and FedProx converge relatively quickly, their final performance remains limited due to the use of a unified architecture. DENSE outperforms these two methods but converges more slowly and still lags behind pFedZKD.

On the most challenging CIFAR-10 dataset, the overall performance of all methods is lower than that on the other datasets, reflecting the intrinsic difficulty of non-IID federated learning on high-dimensional natural images. Nevertheless, pFedZKD exhibits a more stable convergence trend and achieves the highest final test accuracy, significantly outperforming FedAvg, FedProx, and DENSE. This further validates its adaptability and robustness in complex tasks and strongly heterogeneous environments.

Overall, the experimental results demonstrate that pFedZKD can achieve more efficient training convergence and superior global performance when simultaneously confronted with data heterogeneity and model heterogeneity, highlighting its effectiveness and practical value in heterogeneous federated learning scenarios.

4.2.3. Robustness Under Extreme Data Heterogeneity

To further evaluate the robustness of pFedZKD under more severe data heterogeneity, we additionally conduct experiments with the Dirichlet parameter set to $\alpha = 0.01$. Compared with the main experiments where $\alpha \in \{0.1, 0.3, 0.5\}$, this setting corresponds to a much more extreme label distribution skew and can therefore be regarded as a more stringent stress test for data heterogeneity. Considering that CIFAR-10 is a more challenging task and can more clearly reflect the robustness of different methods under extreme label skew, we select CIFAR-10 as the representative dataset for this supplementary experiment. Meanwhile, since some baseline methods lack publicly available implementations or reproducible experimental protocols under this setting, we mainly compare representative methods that can be stably reproduced under a unified experimental setup, so as to examine their performance degradation trends in extremely heterogeneous scenarios.

Table 3 reports the performance changes of different methods on CIFAR-10 when the Dirichlet parameter is further reduced from $\alpha = 0.1$ to $\alpha = 0.01$. Here, “Drop” denotes the decrease in test accuracy from $\alpha = 0.1$ to $\alpha = 0.01$. It can be observed that under more extreme label skew, all methods suffer noticeable performance degradation, indicating that extreme non-IID settings substantially increase the difficulty of knowledge collaboration and model generalization in heterogeneous federated learning. Specifically, both FedAvg and FedProx exhibit significant accuracy drops, suggesting that traditional multi-round parameter-aggregation-based federated learning methods are more vulnerable to severe statistical heterogeneity when client class distributions become extremely sparse and highly imbalanced. One-shot methods such as DENSE and FedLPA also experience varying degrees of performance degradation under this setting, indicating that in the absence of multi-round corrective interaction, the adequacy and stability of local knowledge representations have a more direct impact on the final global performance.

In contrast, pFedZKD still achieves a test accuracy of 37.96% at $\alpha = 0.01$, and its performance drop is 11.16 percentage points, which is clearly smaller than that of FedAvg, FedProx, and DENSE. Considering both the absolute performance and the degradation trend, pFedZKD demonstrates a more balanced robustness advantage under extremely heterogeneous conditions. This can be mainly attributed to two aspects. On the one hand, the client-side PSO-FedNAS is able to search for more suitable personalized architectures according to highly skewed local data distributions, thereby improving the quality of local knowledge extraction. On the other hand, the server-side multi-teacher zero-shot knowledge distillation mechanism can aggregate relatively informative semantic knowledge from heterogeneous client models without accessing real data, thereby preserving transferable cross-client information as much as possible even when local class coverage is severely insufficient. Therefore, even in more challenging heterogeneous scenarios, pFedZKD can still maintain favorable effectiveness and stability.
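As a worked check of the “Drop” metric, the two pFedZKD figures quoted above can be combined as follows. The $\alpha = 0.1$ accuracy is only implied by these two numbers under the stated definition and is not read from Table 3. The relative drop is an additional derived quantity that the text does not report.

```latex
\mathrm{Drop} = \mathrm{Acc}_{\alpha = 0.1} - \mathrm{Acc}_{\alpha = 0.01}
\;\Longrightarrow\;
\mathrm{Acc}_{\alpha = 0.1} = 37.96 + 11.16 = 49.12 \ (\%),
\qquad
\frac{\mathrm{Drop}}{\mathrm{Acc}_{\alpha = 0.1}} = \frac{11.16}{49.12} \approx 22.7\%.
```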

4.2.4. Visualization of Generated Pseudo Images

To visually assess the fidelity and semantic characteristics of the server-side synthesized data, this subsection presents a qualitative comparison between real samples and pseudo-images generated by the proposed pFedZKD framework across four benchmark datasets. As illustrated in Figure 5, the odd rows of each subfigure display real samples from classes 0 to 9, while the even rows correspond to pseudo-images synthesized by the multi-ZSKD mechanism based on heterogeneous client models.

From the visualization results, it can be observed that although the server has no access to any real training data throughout the entire training process, and all pseudo-images are initialized from random noise and generated via reverse optimization under the constraints of teacher models, the synthesized samples still preserve essential class-discriminative cues to varying degrees. On the structurally simple MNIST dataset, the generated pseudo-images exhibit relatively clear stroke contours and coherent spatial layouts. This phenomenon can be attributed to the intrinsically low intra-class variance and highly standardized visual patterns of MNIST, where class semantics can be sufficiently represented by a small number of structural strokes. Consequently, in the absence of real data constraints, pseudo-images obtained by inverting model output distributions tend to capture class prototype characteristics rather than reproducing pixel-level details of specific training samples.

In contrast, for datasets with higher visual complexity, such as SVHN, Fashion-MNIST, and CIFAR-10, the overall perceptual clarity of the generated pseudo-images decreases, accompanied by increased local texture noise and high-frequency perturbations. This behavior is a typical characteristic of zero-shot inversion scenarios, reflecting that, without strong priors from real data distributions, the inversion process focuses more on capturing internal discriminative activation patterns of the models rather than reconstructing visually natural images recognizable to humans. Nevertheless, despite their perceptual differences from real samples, these pseudo-images implicitly encode rich class-related information in the feature space, as has been thoroughly validated by the quantitative results presented earlier.
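The inversion idea described above, which starts from standard normal noise and optimizes the input until the teachers agree on a target class, can be sketched on a toy problem. Random linear classifiers stand in for the heterogeneous client networks, and plain gradient descent on the mean cross-entropy replaces the actual multi-ZSKD objective. Both are simplifying assumptions, and all names are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def invert_class(teachers, target, dim, steps=1000, lr=0.5, seed=0):
    """Optimize a noise vector until the teachers assign it to `target`."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(dim)                 # pseudo-sample starts as N(0, I) noise
    onehot = np.eye(teachers[0].shape[0])[target]
    for _ in range(steps):
        # Gradient of the mean cross-entropy with respect to the input,
        # valid for linear teachers whose logits are W @ x.
        grad = np.mean([W.T @ (softmax(W @ x) - onehot) for W in teachers], axis=0)
        x -= lr * grad
    return x

rng = np.random.default_rng(1)
DIM, NUM_CLASSES = 64, 10
# Hypothetical stand-ins for client models: three random linear classifiers.
teachers = [rng.standard_normal((NUM_CLASSES, DIM)) / np.sqrt(DIM) for _ in range(3)]
pseudo = invert_class(teachers, target=3, dim=DIM)
# Averaged teacher prediction on the synthesized pseudo-sample.
consensus = np.mean([softmax(W @ pseudo) for W in teachers], axis=0)
```

The optimized vector is driven only by the teachers' decision functions, which is consistent with the observation that inverted samples encode discriminative activation patterns rather than natural-looking images.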

From a privacy-preserving perspective, the pronounced visual discrepancy between pseudo-images and real samples further highlights an inherent advantage of the pFedZKD framework. The generated pseudo-images exhibit blurred and abstract visual characteristics, enabling effective knowledge extraction and transfer without exposing original pixel-level information. More importantly, although these pseudo-images may lack intuitive visual interpretability, experimental evidence demonstrates that they can effectively activate the corresponding class responses of the student model in the feature space, thereby sufficiently supporting high-quality knowledge distillation.

In summary, by generating visually abstract yet semantically consistent pseudo-images, pFedZKD effectively mitigates the data silo problem in heterogeneous federated learning. This mechanism not only eliminates the need for direct transmission of raw data at the physical level, but also enables efficient knowledge transfer through high-quality distillation, allowing the server-side student model to achieve performance comparable to, or even surpassing, that of centralized training. As a result, pFedZKD achieves a favorable balance between data privacy protection and model effectiveness.

4.3. Ablation Study

Since the pFedZKD framework achieves performance improvements through the synergistic interaction between architecture search (PSO-FedNAS) and multi-teacher zero-shot knowledge distillation (multi-ZSKD), the coupling effects among its components play a critical role in the overall system. To validate the rationality of the framework design and to quantify the contributions of its core modules, this subsection conducts a systematic ablation study focusing on these two components. Specifically, by removing or replacing the corresponding modules, we analyze the effectiveness of the personalized architecture search mechanism in adapting to heterogeneous data distributions, as well as the role of the multi-teacher distillation mechanism in server-side knowledge aggregation.

4.3.1. Impact of Architecture Search (PSO-FedNAS)

To evaluate the contribution of the PSO-FedNAS module to the performance of the global model, this subsection constructs two representative ablation variants:

Fixed Homogeneous Architecture (pFedZKD-Hom): the client-side architecture search mechanism is removed, and all clients are forced to deploy an identical model architecture, thereby simulating a conventional homogeneous federated learning setting;
Random Heterogeneous Architecture (pFedZKD-RandPool): the search optimization process is removed, and each client is randomly assigned a network architecture from a predefined heterogeneous model pool, simulating a heterogeneous federated learning scenario without optimization guidance.

Except for the strategy used to generate client models, these variants share the same data partitioning scheme, training hyperparameters, and distillation procedure as the complete pFedZKD framework.

In the random heterogeneous setting (pFedZKD-RandPool), to realistically simulate variations in computational capability and storage resources across edge devices, we construct a heterogeneous model pool consisting of five representative convolutional neural networks, covering a spectrum of model complexities from lightweight to relatively high-capacity architectures:

CNN1 & CNN2: lightweight convolutional networks. CNN1 comprises two convolutional layers (32–64 channels) followed by a single fully connected layer, making it suitable for extremely resource-constrained devices; CNN2 further enhances feature representation by increasing the depth of convolutional stacking and the dimensionality of the fully connected layer (256);
MobileNetV2: an efficient architecture based on depthwise separable convolutions, which achieves strong classification performance while maintaining low computational complexity, representing a typical mobile-oriented network design;
VGG11: a moderately to relatively high-capacity model built upon deep convolutional stacking, used to emulate edge nodes with comparatively sufficient computational resources;
ResNet18: a model with higher structural complexity that incorporates residual connections, featuring a larger parameter scale and a more complex optimization landscape, and serving to characterize nodes with stronger computational capabilities.

By randomly assigning these architectures to clients, the resulting configuration effectively captures device heterogeneity in realistic edge environments, providing a reasonable baseline for evaluating the optimization capability of PSO-FedNAS under heterogeneous resource constraints.
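The pFedZKD-RandPool assignment can be illustrated with a short sketch. Uniform sampling with replacement, the client count of 10, and the function name are assumptions, since the text states only that architectures are drawn at random from the pool.

```python
import random

# The five-model pool described in the text.
MODEL_POOL = ["CNN1", "CNN2", "MobileNetV2", "VGG11", "ResNet18"]

def assign_random_architectures(num_clients, seed=0):
    """Assumption: each client draws one pool entry uniformly at random."""
    rng = random.Random(seed)
    return {f"Client{i}": rng.choice(MODEL_POOL) for i in range(num_clients)}

assignment = assign_random_architectures(num_clients=10)
```

Because the draw ignores each client's local data, an assigned model may be too large or too small for that client, which is the mismatch this variant is designed to expose.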

Table 4 reports the Top-1 test accuracy achieved by different architecture strategies across four benchmark datasets. It can be observed that pFedZKD-Hom consistently underperforms the complete pFedZKD framework on all datasets. Taking the SVHN dataset under a highly heterogeneous setting ($\alpha = 0.1$) as an example, pFedZKD-Hom attains a test accuracy of only 37.66%, which represents a substantial drop compared to the 67.51% achieved by pFedZKD (Full). This result indicates that, in highly heterogeneous federated environments, client data distributions exhibit pronounced domain shifts. When the server performs pseudo-image inversion based on homogeneous teacher models, the gradient information provided by different clients lacks consistency, making it difficult for the generated pseudo-images to form clear and stable semantic boundaries. Such low-discriminative teacher aggregation limits the amount of effective knowledge encoded in the pseudo-samples, thereby degrading the learning performance of the global student model. This observation demonstrates that, in data-free federated learning scenarios, a single fixed architecture is insufficient to cope with severe statistical heterogeneity.

The pFedZKD-RandPool variant exhibits the worst performance across all experimental settings. Due to the absence of targeted optimization guidance, randomly assigned architectures are often mismatched with local data characteristics, leading to substantial semantic discrepancies in the responses of different teacher models to the same pseudo-samples. During the distillation process, this inconsistency is amplified into noise, which hinders the stable convergence of the student model in the feature space.

In contrast, the complete pFedZKD framework effectively decouples client model architectures from predefined configurations by introducing PSO-FedNAS, allowing each client to adaptively search for a network structure that better matches the complexity of its local data. Meanwhile, high-quality personalized teacher models significantly reduce gradient conflicts during distillation, enabling the server to reconstruct fragmented local knowledge into a more consistent and robust global representation. Experimental results demonstrate that PSO-FedNAS not only improves model accuracy but also plays a critical stabilizing role in data-free federated learning under extreme heterogeneity.

4.3.2. Impact of Multi-Teacher Zero-Shot Knowledge Distillation (ZSKD)

To isolate and quantify the contribution of the multi-teacher zero-shot knowledge distillation (multi-ZSKD) module to the overall performance, we construct an ablation variant, termed pFedZKD-w/o ZSKD, in which the distillation mechanism is removed. In this setting, the server no longer performs multi-teacher fusion based on logits or soft target guidance. Instead, it directly trains the global model using the pseudo-images generated by clients together with their corresponding hard pseudo-labels under a conventional supervised learning paradigm. This experiment aims to investigate whether effective global model learning can be sustained solely by synthetic data in the absence of soft knowledge signals.

As reported in Table 5, although the model without ZSKD is still able to maintain a certain level of baseline performance across all datasets, its overall test accuracy exhibits a clear degradation. This performance drop is particularly pronounced in scenarios with higher visual complexity and stronger data heterogeneity. For instance, under the Dirichlet setting of $\alpha = 0.1$ on the SVHN and CIFAR-10 datasets, the test accuracy of the variant without ZSKD decreases by 19.70% and 12.10%, respectively, compared to the complete pFedZKD framework.

These results indicate that, under the one-shot federated learning setting where no real data are accessible, relying solely on hard pseudo-labels generated by heterogeneous client models leads to an evident performance ceiling. Due to structural and distributional discrepancies among client models, the generated hard labels inevitably contain noise. In the absence of an effective fusion mechanism, the server struggles to extract consistent and reliable supervision signals from multiple clients, which in turn results in unstable feature representations and limited generalization capability of the global model.

In contrast, the complete pFedZKD framework introduces the multi-ZSKD module to jointly distill knowledge from multiple teacher models at the server side, enabling efficient knowledge aggregation without accessing real data. By integrating the logits distributions of multiple teachers, this mechanism effectively balances the knowledge biases across heterogeneous clients, yielding pseudo-samples with more consistent semantics and more reliable supervisory signals. Moreover, even when certain pseudo-samples are associated with incorrect hard labels, the probabilistic distributions produced by multiple teachers still preserve rich dark knowledge, such as inter-class relationships. Such soft supervision helps the student model learn more robust decision boundaries, which is particularly beneficial for noise-sensitive and visually complex tasks such as SVHN and CIFAR-10.
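The difference between hard pseudo-labels and multi-teacher soft targets can be seen on a single hypothetical pseudo-sample. The logits below are invented for illustration. Majority voting is an assumed way of forming the hard label, and uniform averaging is an assumed way of forming the soft target.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical logits from three teachers for one pseudo-sample (4 classes).
teacher_logits = np.array([
    [2.0, 1.8, -1.0, -2.0],   # teacher A prefers class 0, class 1 close behind
    [1.5, 2.0, -1.5, -1.0],   # teacher B prefers class 1, class 0 close behind
    [2.2, 1.9, -2.0, -1.5],   # teacher C prefers class 0 again
])
probs = softmax(teacher_logits)

# Hard supervision: majority vote over per-teacher argmax, then one-hot.
hard_votes = probs.argmax(axis=1)
hard_label = int(np.bincount(hard_votes, minlength=4).argmax())
hard_target = np.eye(4)[hard_label]

# Soft supervision: mean of the teacher distributions.
soft_target = probs.mean(axis=0)
```

The one-hot target assigns zero mass to class 1 even though every teacher considers it plausible, whereas the averaged distribution keeps that inter-class relationship available to the student.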

In summary, PSO-FedNAS and Multi-ZSKD are complementary at the structural and knowledge levels. The former improves the discriminability of pseudo-sample representations through personalized architecture search, while the latter enhances the robustness and consistency of supervision signals via multi-teacher distillation. Their joint effect underpins the strong performance of pFedZKD in highly heterogeneous federated learning scenarios.

5. Conclusions and Discussion

This paper proposes pFedZKD, a one-shot data-free personalized federated learning framework tailored for heterogeneous edge environments, aiming to overcome the inherent limitations of traditional federated learning paradigms that rely on homogeneous model assumptions and iterative multi-round communication. Unlike existing approaches that depend on unified model architectures or parameter alignment for collaboration, pFedZKD introduces a system-level decouple-and-reconstruct paradigm, which explicitly separates client-side personalized modeling from server-side global knowledge aggregation.

On the client side, the proposed PSO-FedNAS mechanism enables each client to adaptively evolve a network structure suited to its local non-IID data distribution, mitigating performance bottlenecks caused by structural rigidity. On the server side, by incorporating a multi-teacher zero-shot knowledge distillation mechanism, pFedZKD reconstructs highly heterogeneous client model knowledge into a unified and robust global student model without accessing any real data or public datasets. The entire framework completes knowledge collaboration with only a single round of communication, significantly reducing communication overhead while preserving personalization capability and privacy protection. Experimental results show that, under challenging scenarios where extreme non-IID data and model heterogeneity coexist, pFedZKD achieves the highest accuracy among the compared homogeneous and heterogeneous federated learning methods on MNIST, Fashion-MNIST, and CIFAR-10 (Table 2), while FedMHO and FedOM achieve higher accuracy on SVHN. It also retains the highest accuracy on CIFAR-10 under extreme heterogeneity (Table 3). These results support the effectiveness and feasibility of achieving heterogeneous model collaboration through structural autonomy and semantic-level knowledge reconstruction.

Despite its notable advantages in terms of performance and communication efficiency, pFedZKD still faces several challenges that warrant further investigation for practical deployment. First, the evolutionary search process introduced by PSO-FedNAS inevitably increases the computational burden on the client side, which calls for further optimization when deploying on extremely resource-constrained edge devices. Second, during the data-free distillation stage, the current pseudo-sample generation process mainly relies on model output priors, which may limit the upper bound of knowledge transfer in high-dimensional and complex tasks. Moreover, as client heterogeneity continues to intensify, discrepancies among teacher model output distributions may introduce potential conflicts during knowledge fusion, thereby affecting the stability of student model convergence.

Future research will focus on several directions to further extend and enhance the pFedZKD framework. First, to address the computational and storage overhead induced by architecture search, more lightweight and efficient structure search strategies will be explored. Second, to improve the diversity and representativeness of pseudo-images during the distillation stage, more powerful generative modeling mechanisms can be incorporated to strengthen knowledge transfer. Finally, to further mitigate potential privacy leakage risks arising from model structure and parameter interactions, differential privacy techniques, secure aggregation protocols, or model perturbation strategies may be integrated to enhance robustness and trustworthiness in privacy-sensitive application scenarios.

Author Contributions

Conceptualization, X.Y. and G.H.; Methodology, J.Y. and X.Y.; Software, J.Y. and X.Y.; Validation, J.Y. and D.W.; Investigation, D.W. and Y.X.; Visualization, Y.X.; Writing—original draft preparation, J.Y.; Supervision, Y.X. and G.H.; Funding acquisition, Y.X. and G.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 51574232; the Xinjiang Key Research and Development Special Task, grant number 2022B03003-3; and the Wuxi University Research Start-up Fund for High-level Talents, grant number 550215016.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.

Acknowledgments

We would like to thank the anonymous reviewers for their valuable and helpful comments, which substantially improved this paper. Finally, we would also like to thank all of the editors for their professional advice and help.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. Overall Architecture of the pFedZKD Framework. pFedZKD decouples client-side personalized architecture search from server-side structure-agnostic knowledge reconstruction. Clients independently search for and train heterogeneous models via PSO-FedNAS, while the server trains a global model through multi-teacher zero-shot knowledge distillation using pseudo-images synthesized from random noise.

Figure 2. Client-level test accuracy comparison between the proposed pFedZKD framework and FedAvg across different clients under different Dirichlet-based non-IID data partitions on MNIST, SVHN, Fashion-MNIST, and CIFAR-10. Subfigures (a–l) correspond to MNIST, SVHN, Fashion-MNIST, and CIFAR-10, respectively, each under Dirichlet parameters $\alpha \in \{0.1, 0.3, 0.5\}$.

Figure 3. Comparison of pFedZKD with standalone local training baselines in terms of average client test accuracy on CIFAR-10 under different non-IID settings.

Figure 4. Comparison of Top-1 test accuracy (%) convergence curves of pFedZKD, FedAvg, FedProx, and DENSE under different Dirichlet-based non-IID data partitions on MNIST, SVHN, Fashion-MNIST, and CIFAR-10. Subfigures (a–l) correspond to MNIST, SVHN, Fashion-MNIST, and CIFAR-10, respectively, with the three subfigures in each group representing Dirichlet parameters $\alpha = 0.1$, $0.3$, and $0.5$ from left to right.

Figure 5. Visualization of real samples and generated pseudo-images across four benchmark datasets. From top to bottom, each pair of rows corresponds to MNIST, SVHN, Fashion-MNIST, and CIFAR-10, respectively. In each pair, the upper row shows real samples, while the lower row presents pseudo-images generated by the proposed pFedZKD framework. Each row contains one representative image for each class (classes 0–9).

Table 1. Summary of Benchmark Datasets and Statistics. All datasets are partitioned using a Dirichlet distribution to simulate non-IID settings.

Dataset	Input Size	Channels	Classes	Train/Test Samples	Complexity
MNIST	$28 \times 28$	1	10	60,000/10,000	Low
Fashion-MNIST	$28 \times 28$	1	10	60,000/10,000	Medium
SVHN	$32 \times 32$	3	10	73,257/26,032	High
CIFAR-10	$32 \times 32$	3	10	50,000/10,000	High
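Table 1 notes that every dataset is partitioned with a Dirichlet distribution. A common way to realize such a label-skewed split is sketched below in Python: for each class, client proportions are drawn from a Dirichlet distribution with concentration $\alpha$, and that class's samples are divided accordingly. This is a generic recipe under stated assumptions and not necessarily the authors' exact procedure; the function name `dirichlet_partition`, the client count, the seed, and the toy labels are all hypothetical.

```python
import random


def dirichlet_partition(labels, n_clients, alpha, seed=0):
    """Split sample indices across clients using per-class Dirichlet(alpha) shares."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    clients = [[] for _ in range(n_clients)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # A Dirichlet sample is a vector of Gamma(alpha, 1) draws, normalized.
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(n_clients)]
        total = sum(draws)
        if total == 0.0:
            # Guard against floating-point underflow for very small alpha.
            draws = [1.0] * n_clients
            total = float(n_clients)
        # Turn cumulative proportions into cut points over this class.
        start, cumulative = 0, 0.0
        for client, draw in enumerate(draws):
            cumulative += draw / total
            if client == n_clients - 1:
                end = len(indices)  # last client takes the remainder
            else:
                end = round(cumulative * len(indices))
            end = min(max(end, start), len(indices))
            clients[client].extend(indices[start:end])
            start = end
    return clients


# Toy labels: 10 classes with 100 samples each (not a real dataset).
labels = [c for c in range(10) for _ in range(100)]
parts = dirichlet_partition(labels, n_clients=5, alpha=0.1)
sizes = [len(p) for p in parts]
```

Smaller values of $\alpha$ concentrate each class on fewer clients, which is why $\alpha = 0.1$ is treated as a strongly non-IID setting and $\alpha = 0.5$ as a milder one.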

Table 2. Top-1 test accuracy (%) comparison of different federated learning methods across four benchmark datasets under different Dirichlet-based non-IID data partitions.

Dataset	MNIST			SVHN			Fashion-MNIST			CIFAR-10
Data Partition ( $α$ )	0.1	0.3	0.5	0.1	0.3	0.5	0.1	0.3	0.5	0.1	0.3	0.5
FedAvg	51.62	74.92	86.91	37.49	46.24	55.83	48.48	58.24	66.05	24.65	36.88	31.58
FedProx	68.80	77.81	88.97	45.93	49.17	55.89	51.52	61.98	68.19	31.27	35.45	45.78
FedDF	65.34	78.12	89.23	52.85	71.29	72.51	45.78	62.56	67.85	35.54	41.57	50.36
DENSE	75.92	85.54	92.54	56.47	69.89	78.53	58.00	69.13	74.02	38.78	47.64	59.40
FedMHO	87.49	92.85	94.00	75.42	79.21	81.34	62.14	72.36	75.36	–	–	–
FedOM	85.54	90.14	92.85	71.67	79.59	86.33	65.64	71.48	80.70	46.57	56.15	62.68
FedLPA	77.43	85.77	88.73	39.77	52.23	54.27	55.33	68.20	73.33	19.97	26.60	24.20
pFedZKD (Ours)	87.58	94.54	95.82	67.51	78.45	80.81	70.52	77.05	83.98	49.12	56.49	64.42

Table 3. Robustness comparison under extreme data heterogeneity on CIFAR-10.

Method	Strong Non-IID $(α = 0.1)$	Extreme Non-IID $(α = 0.01)$	Drop ↓
FedAvg	24.65	11.35	13.30
FedProx	31.27	13.37	17.90
DENSE	38.78	20.47	18.31
FedLPA	19.97	16.17	3.80
pFedZKD (Ours)	49.12	37.96	11.16
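The Drop column in Table 3 is the absolute difference between the two accuracy columns. Because the methods start from very different accuracy levels at $\alpha = 0.1$, a relative view can complement it. The Python sketch below recomputes the Drop column from the table and adds the share of accuracy retained at $\alpha = 0.01$; this retained share is an illustrative derived quantity, not a metric reported by the authors.

```python
# (alpha = 0.1, alpha = 0.01) Top-1 accuracy on CIFAR-10, copied from Table 3.
table3 = {
    "FedAvg": (24.65, 11.35),
    "FedProx": (31.27, 13.37),
    "DENSE": (38.78, 20.47),
    "FedLPA": (19.97, 16.17),
    "pFedZKD": (49.12, 37.96),
}

summary = {}
for method, (strong, extreme) in table3.items():
    drop = round(strong - extreme, 2)              # absolute drop, as in Table 3
    retained = round(100.0 * extreme / strong, 1)  # % of alpha = 0.1 accuracy kept
    summary[method] = (drop, retained)
    print(f"{method:8s} drop={drop:5.2f}  retained={retained:5.1f}%")
```

Read this way, FedLPA shows the smallest absolute drop but starts from the lowest accuracy, whereas pFedZKD keeps roughly three quarters of its $\alpha = 0.1$ accuracy and remains the most accurate method at $\alpha = 0.01$.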

Table 4. Top-1 test accuracy (%) of pFedZKD and its variants under Dirichlet-based non-IID data partitions across four benchmark datasets. This ablation study evaluates the effectiveness of PSO-FedNAS for personalized architecture search.

Dataset	MNIST			SVHN			Fashion-MNIST			CIFAR-10
Data Partition ( $α$ )	0.1	0.3	0.5	0.1	0.3	0.5	0.1	0.3	0.5	0.1	0.3	0.5
pFedZKD-Hom	75.09	80.29	86.91	37.66	43.53	59.83	41.90	59.86	64.53	33.16	38.14	46.57
pFedZKD-RandPool	58.20	62.30	64.36	20.89	26.64	31.89	25.09	41.15	53.51	20.25	21.24	35.22
pFedZKD (Full)	87.08	94.54	95.82	67.51	78.45	80.81	70.52	77.05	83.98	49.12	56.49	64.42

Table 5. Ablation study on the effectiveness of the multi-teacher zero-shot knowledge distillation (ZSKD) module.

Dataset	MNIST			SVHN			Fashion-MNIST			CIFAR-10
Data Partition ( $α$ )	0.1	0.3	0.5	0.1	0.3	0.5	0.1	0.3	0.5	0.1	0.3	0.5
pFedZKD (w/o ZSKD)	80.27	86.95	91.24	47.81	53.31	67.63	65.73	72.89	80.21	37.02	43.18	56.76
pFedZKD (Full)	87.08	94.54	95.82	67.51	78.45	80.81	70.52	77.05	83.98	49.12	56.49	64.42
