1. Introduction
Federated learning (FL) is a powerful approach that leverages decentralized computing and data resources to overcome data silos [1]. Despite its potential, FL faces significant challenges when applied to modern neural networks, which often comprise hundreds of millions of parameters [2,3]. These challenges are compounded by the traditional FL workflow, which requires numerous communication rounds, exacerbating problems of scalability and communication load. This dual burden of excessive parameters and frequent communication significantly hinders the efficiency of FL methods and can impede their widespread deployment, especially in environments where communication resources are limited [4,5]. Additionally, the uneven distribution of data across clients typically seen in real-world settings makes it difficult to optimize a global model [6,7,8], often resulting in model bias and suboptimal performance. These issues highlight the need for a new FL framework that not only reduces communication overhead but also enhances the performance of the global model on non-independent and identically distributed (non-IID) data [9].
Although FL allows participants to retain their raw data locally, which inherently addresses some aspects of data privacy [10], the privacy concerns in FL are far from fully resolved: both the model parameters exchanged during training and the outputs of the trained models can still be exploited for privacy attacks [10,11,12]. To counter these vulnerabilities, various strategies have been developed to enhance privacy in FL, including trusted aggregators and cryptographic techniques such as homomorphic encryption [13,14]. Despite these efforts, such methods often do not allow personalized privacy settings and may not effectively protect against breaches involving high-dimensional parameter vectors, which are particularly challenging to secure [10].
In response to these privacy limitations, the LDP-fed model has been introduced as an FL framework that supports local differential privacy: each participant can apply its own privacy settings, offering tailored protection based on individual preferences [15,16]. However, this approach struggles with the increasing complexity and dimensionality of models used in modern FL applications, such as deep learning architectures like ResNet and BERT [2,3,17]. Traditional privacy-protection methods scale poorly to models of this size and often perform inadequately under strict privacy requirements, demanding extensive computational resources [8,18]. LDP-fed therefore requires either new protection mechanisms or substantially greater efficiency and adaptability.
In this paper, a novel federated framework is designed to address both privacy and communication volume. Traditional approaches require each client to maintain and train its own model and to send gradients or weights to a central server for aggregation. Instead, this work proposes creating compact yet high-performing surrogate data for each client. These surrogate datasets encapsulate knowledge from the original data while being difficult to interpret visually, thus preserving privacy, and only a one-shot communication round is required to achieve strong performance at the server.
These local surrogate datasets are sent to the server, which constructs a global surrogate function and updates the global model by minimizing it. The surrogate data are generated through a process inspired by data distillation techniques: local surrogate functions are crafted by synthesizing datasets that approximate the local training objectives, which is achieved by aligning the gradients induced by the original and surrogate datasets across different neural network initializations.
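To make the gradient-matching step concrete, the following is a minimal sketch of client-side surrogate synthesis, assuming PyTorch; the function names (`gradient_match_loss`, `synthesize_surrogate`), the cosine-distance objective, and all hyperparameters are illustrative assumptions, not the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

def gradient_match_loss(net, real_x, real_y, syn_x, syn_y):
    """Cosine distance between the gradients induced by a real batch
    and a synthetic batch on the same network."""
    real_loss = F.cross_entropy(net(real_x), real_y)
    g_real = torch.autograd.grad(real_loss, list(net.parameters()))
    g_real = [g.detach() for g in g_real]  # targets only; no grad through real side

    syn_loss = F.cross_entropy(net(syn_x), syn_y)
    # create_graph=True so the distance is differentiable w.r.t. syn_x
    g_syn = torch.autograd.grad(syn_loss, list(net.parameters()), create_graph=True)

    dist = 0.0
    for gr, gs in zip(g_real, g_syn):
        dist = dist + (1.0 - F.cosine_similarity(gr.flatten(), gs.flatten(), dim=0))
    return dist

def synthesize_surrogate(make_net, real_loader, syn_x, syn_y, outer_steps=100, lr=0.1):
    """Optimize the synthetic images syn_x (requires_grad=True) so their
    gradients match real-data gradients across random initializations."""
    opt = torch.optim.SGD([syn_x], lr=lr, momentum=0.5)
    for _ in range(outer_steps):
        net = make_net()                    # a fresh random initialization per step
        real_x, real_y = next(iter(real_loader))
        opt.zero_grad()
        gradient_match_loss(net, real_x, real_y, syn_x, syn_y).backward()
        opt.step()
    return syn_x.detach()
```

Sampling a fresh random initialization at each outer step encourages the surrogate set to match gradients across initializations rather than overfitting to a single network, which is the intuition behind aligning gradients "across different neural network initializations" above.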
Moreover, the paper explores the challenge of maintaining privacy when surrogate datasets are used. Visual inspection of these datasets does not reveal sensitive information; however, the accompanying labels could expose privacy-sensitive category information. To address this, an optimized label perturbation method tailored to FL scenarios is developed, striking a balance between data utility and privacy protection.
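As one concrete baseline for such label perturbation, the sketch below implements k-ary randomized response, a standard label-DP mechanism; the paper's optimized FL-specific variant may differ, so treat this as an assumed starting point rather than the authors' exact method:

```python
import numpy as np

def randomized_response(labels, num_classes, epsilon, rng=None):
    """k-ary randomized response: keep the true label with probability
    e^eps / (e^eps + k - 1); otherwise report a uniformly random other class.
    This satisfies eps-label differential privacy."""
    rng = rng or np.random.default_rng()
    labels = np.asarray(labels)
    p_keep = np.exp(epsilon) / (np.exp(epsilon) + num_classes - 1)
    keep = rng.random(labels.shape) < p_keep
    # For flipped entries, an offset in {1, ..., k-1} guarantees a *different* class.
    offsets = rng.integers(1, num_classes, size=labels.shape)
    flipped = (labels + offsets) % num_classes
    return np.where(keep, labels, flipped)
```

With k classes, any fixed label is reported with probability at most e^ε times the probability under any other true label, which is exactly the ε-label-DP guarantee; smaller budgets ε flip more labels and thus trade utility for privacy.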
The framework, including the processes for creating and utilizing surrogate data and the method for label perturbation, is illustrated in Figure 1. This approach significantly reduces communication costs and enhances privacy without compromising the usability of the data in federated learning environments. The contributions of this work are summarized as follows:
A method called FedGM is introduced, which uses iterative gradient matching to learn a surrogate function. Rather than sending local model updates, clients transmit synthesized data to the server, significantly improving communication efficiency and effectiveness (a sketch of the resulting server-side step follows this list). Additionally, a novel strategy for selecting samples from the original dataset reduces the number of training rounds required while improving the training effectiveness of the distilled dataset.
Label differential privacy is employed to protect the labels of each client's surrogate dataset. This method remains highly effective even with a small privacy budget and outperforms alternative approaches.
Comprehensive experiments on three tasks show that the proposed framework achieves high performance with just one communication round, even in scenarios marked by pathological non-IID conditions.
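To illustrate the one-shot protocol referenced in the first contribution, the following is a plausible server-side step under our reading of the framework: each client uploads its surrogate dataset once, and the server trains the global model on their union. All names and hyperparameters here are assumptions for illustration:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def server_train(global_model, client_payloads, epochs=50, lr=0.01):
    """client_payloads: list of (syn_x, syn_y) tensor pairs, one per client,
    received in a single communication round."""
    xs = torch.cat([x for x, _ in client_payloads])
    ys = torch.cat([y for _, y in client_payloads])
    loader = DataLoader(TensorDataset(xs, ys), batch_size=64, shuffle=True)
    opt = torch.optim.SGD(global_model.parameters(), lr=lr, momentum=0.9)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for xb, yb in loader:
            opt.zero_grad()
            loss_fn(global_model(xb), yb).backward()
            opt.step()
    return global_model
```

Because all subsequent optimization happens on the server against the uploaded surrogate data, no further rounds of client communication are needed, which is the source of the communication savings claimed above.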
The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3.1 contrasts the FedGM method with traditional FL. Section 3.2 provides a detailed introduction to the FedGM method, including its advantages and innovative aspects. Section 3.3 demonstrates experimentally that the surrogate dataset each client sends to the central authority in the FedGM setup is visually private, and that label protection is achieved using an innovative approach to label differential privacy. Section 4 presents extensive experiments demonstrating the superiority of the proposed work in handling non-IID data and reducing communication overhead, including comparisons with traditional LDP-fed approaches.
5. Conclusions and Limitations
To advance the domain of federated learning, this work introduced FedGM, a one-shot federated learning framework engineered to remove the recurrent communication rounds that frequently burden traditional FL systems. This approach not only optimizes communication efficiency but also significantly improves the training effectiveness of the distilled dataset. By leveraging synthetic datasets that mimic the distribution of the original data, FedGM achieves strong training accuracy and privacy protection while maintaining an effective balance between performance and efficiency.
FedGM's use of label differential privacy further contributes to ensuring data privacy while maintaining high utility in model training. This mechanism, designed to protect the labels of the pseudo datasets, addresses challenges inherent to federated learning environments. Empirical results from extensive experiments confirm that FedGM outperforms traditional methods, especially under non-IID data and stringent privacy constraints.
Moreover, the research highlights the significance of visual privacy in federated learning, an area that has received limited attention in prior studies. The findings reveal that, while the distilled images themselves may not contain sensitive information, the labels attached to them are considerably more sensitive, motivating the label-privacy approach adopted by FedGM.
However, the study also acknowledges its limitations, particularly concerning the scalability of the synthetic dataset as the number of clients grows. This constraint points to future research on scaling surrogate-dataset generation to larger federations.