1. Introduction
In recent years, machine learning (ML) has demonstrated impressive performance in applications such as image recognition and has been extensively adopted across a wide range of industrial sectors. This achievement is primarily driven by centralized training on large-scale datasets. However, such a paradigm introduces significant privacy risks, as user data needs to be transferred to a central server for model updating. As a result, concerns about data security and privacy have become more prominent among users. Meanwhile, advances in the storage capacity and computational power of edge devices have driven growing interest in distributed machine learning approaches.
In 2017, to protect data privacy in machine learning, McMahan et al. [
1] first proposed the concept of federated learning (FL). As a distributed machine learning paradigm, FL is particularly attractive in application scenarios such as finance [
2] and healthcare [
3], where stringent privacy requirements and limited data utilization coexist. Under this privacy-preserving framework, multiple clients collaboratively train a global model coordinated by a central server and repeatedly interact with the server over multiple communication rounds to achieve the learning objective. This approach effectively preserves data locality while improving data utilization. However, as illustrated in
Figure 1, both the federated learning server (which may be curious or malicious) and malicious participants can threaten the FL system during training. Attackers can mount reconstruction attacks [
4] to recover the original data or perform membership inference attacks [
5,
6] to determine whether a specific sample is present in a client’s training set. Melis et al. [
7] further demonstrated that the model updates shared in federated learning can inadvertently reveal sensitive information about participants’ training data. Consequently, additional safeguards are necessary to ensure robust privacy protection in federated learning systems.
Differential privacy (DP), introduced by Cynthia Dwork [8], offers a rigorous mathematical framework for formally quantifying privacy guarantees. By using randomized mechanisms (for example, randomized response or calibrated noise addition), it ensures that the influence of any single record on the released output is bounded by a specific threshold. This restricts the ability of external parties to infer, by observing changes in the output, whether a specific record has been altered or removed. Differential privacy [9] has become a leading paradigm for privacy protection and has been widely applied in privacy-preserving machine learning.
In [
10], Wang et al. were among the first to apply differential privacy to deep learning in a federated setting. They injected noise during data release or model training to obfuscate the contribution of any individual data point. Geyer et al. [
11] and others demonstrated that client-level differential privacy is feasible in federated learning. A large body of work [
12,
13,
14,
15,
16,
17,
18] has applied differential privacy to FL by using differentially private stochastic gradient descent (DP-SGD): in each local update, gradients are clipped, and Gaussian noise is added to limit privacy leakage. However, despite its theoretical guarantees, the introduction of DP often negatively affects model performance, for example, by reducing model accuracy and increasing training time. Injecting excessive noise may bias the update direction, slow convergence, and degrade final performance, especially in high-dimensional or long-horizon training regimes.
In existing differentially private federated learning (DP-FL) research, a common practice is to inject Gaussian noise uniformly across all gradient dimensions to ensure privacy for each parameter update. However, such indiscriminate noise injection causes the noise level to scale with the model size, thereby hampering convergence. At the same time, noise is often added to unimportant parameters, wasting the privacy budget and masking critical gradient information. To highlight this issue more intuitively, we analyze the redundancy of gradient dimensions and study gradient distributions in FL tasks. We observe that approximately 95% of gradient values lie within a narrow region around zero. This indicates strong gradient sparsity and suggests that uniformly adding noise across all parameter dimensions is overly redundant. As shown in
Figure 2, this serves as the motivation for the design of our work. Some works [
19,
20,
21,
22] have attempted to mitigate the accuracy degradation caused by uniform noise injection through gradient sparsification (e.g., updating only significant parameters). However, most of these methods employ fixed sparsity rates and constant noise levels throughout the training process, thereby failing to account for the complex training dynamics of neural networks; in particular, an aggressive sparsity rate restricts model capacity by discarding informative gradients. Huang et al. [
23] introduce a dimension-adaptive sparsification mechanism (AdaS-FLDP), which dynamically selects important layers and dimensions based on their contributions to model convergence, thus enabling a more favorable trade-off between model performance and communication efficiency. These observations inspire the development of a mechanism that dynamically identifies key coordinates and allocates the privacy budget preferentially to the most influential parameters, with the objective of improving overall utility. Furthermore, previous works [
10,
14] adopt a constant noise level at all times. This strategy neglects the heterogeneity in parameter magnitudes across different coordinates, as well as their dynamic evolution throughout the training process. For instance,
Figure 3 shows that, in federated learning, the $\ell_2$-norm of gradients can vary substantially across clients and communication rounds, eventually converging to a small value. Consequently, using a fixed noise level can severely degrade training accuracy in the presence of heterogeneous gradients. Although some works [24,25,26] have enhanced model utility by adaptively adjusting gradient clipping norms and noise levels, these methods still perturb the full gradient space. As a result, noise is applied uniformly across all dimensions, which may lead to inefficient use of the privacy budget and degradation of useful gradient signals in high-dimensional models.
In this work, we address the above limitations from a different perspective. Instead of performing perturbation over a fixed parameter space, we consider the coordinated design of the effective update subspace and the noise allocation strategy during training. In particular, the update subspace is dynamically determined via importance-based gradient selection, whereby only coordinates with relatively high gradient magnitudes are retained for model updates. Within this reduced space, the noise level is further adjusted according to the evolving training dynamics. Consequently, the perturbation process is shifted from uniform noise injection over all parameters to adaptive perturbation over a dynamically selected subset of informative coordinates, leading to an improved signal-to-noise ratio and more efficient utilization of the privacy budget.
Our contributions are as follows:
We propose a novel federated learning differential privacy optimization method, AIGU-DPFL, which combines importance-aware gradient selection and noise adaptation techniques. By focusing updates on the most informative coordinates in each iteration, AIGU-DPFL reduces unnecessary perturbations and improves the signal-to-noise ratio.
We design a dynamic sparsity rate selection strategy that allows the model to focus on the most important parameters in the initial stages while gradually restoring the model’s full capacity, thereby reducing early noise caused by gradient instability. This effectively balances sparsity and the expressive power of model updates.
We design an adaptive noise mechanism for sparse gradients, addressing the inherent limitations of deep learning under fixed differential privacy parameters. Compared with existing differential privacy baseline methods, our method converges faster and achieves higher accuracy.
We conduct comparative experiments on real-world datasets. The results demonstrate that AIGU-DPFL outperforms existing differentially private federated learning methods.
The rest of this paper is structured as follows.
Section 2 reviews the related work.
Section 3 provides the necessary preliminaries and elaborates on the theoretical basis and technical components of our work.
Section 4 details the proposed approach.
Section 5 presents the experimental evaluation along with corresponding analysis.
Section 6 discusses the limitations observed in the experiments and highlights directions for future research. Finally,
Section 7 concludes the paper.
4. Method
In this section, we first introduce gradient selection with dynamic sparsity, followed by adaptive noise control for sparse gradients, and then describe the AIGU-DPFL algorithm that combines these two methods. Our framework is shown in
Figure 4.
4.1. Gradient Selection with Dynamic Sparsity
In federated learning, each client applies gradient descent using its local dataset. To ensure privacy, DP-SGD [
10] clips the gradients of each sample, then aggregates these gradients over a mini-batch, adds noise to the clipped gradients, and finally updates the model parameters using the noisy average gradient. According to the federated learning protocol, the updated parameters are then sent to the server for aggregation. However, uniformly injecting noise across all dimensions leads to a large amount of noise being added to unimportant coordinates, thus weakening important gradient signals [
46]. Therefore, we address this problem by focusing updates on highly informative coordinates in each round. Furthermore, in practice, we found that training with a constant high sparsity rate leads to slow convergence and poor performance. To address these issues, we propose a gradient selection technique with dynamic sparsity incorporated into local training to balance sparsity with the expressive power of model updates.
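The DP-SGD procedure described above (per-sample clipping, aggregation, Gaussian noise, noisy-average update) can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation; the function names `clip_by_l2_norm` and `dp_sgd_step`, the use of the $\ell_2$ norm for clipping, and the noise standard deviation `noise_multiplier * clip_norm` are assumptions that follow the standard DP-SGD recipe.

```python
import math
import random


def clip_by_l2_norm(grad, clip_norm):
    """Rescale grad so that its L2 norm does not exceed clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / (norm + 1e-12))
    return [g * scale for g in grad]


def dp_sgd_step(params, per_sample_grads, clip_norm, noise_multiplier, lr, rng):
    """One DP-SGD update on a mini-batch of per-sample gradients."""
    dim = len(params)
    batch_size = len(per_sample_grads)
    # 1) Clip every per-sample gradient to bound its influence (sensitivity).
    clipped = [clip_by_l2_norm(g, clip_norm) for g in per_sample_grads]
    # 2) Aggregate the clipped gradients over the mini-batch.
    summed = [sum(g[j] for g in clipped) for j in range(dim)]
    # 3) Add Gaussian noise (std = noise_multiplier * clip_norm) to EVERY coordinate.
    std = noise_multiplier * clip_norm
    noisy_mean = [(summed[j] + rng.gauss(0.0, std)) / batch_size for j in range(dim)]
    # 4) Take a gradient step with the noisy average gradient.
    return [params[j] - lr * noisy_mean[j] for j in range(dim)]


# Toy usage with two samples and two parameters (values are illustrative only).
rng = random.Random(0)
new_params = dp_sgd_step([0.0, 0.0], [[3.0, 4.0], [0.0, 1.0]],
                         clip_norm=1.0, noise_multiplier=1.0, lr=0.1, rng=rng)
print(new_params)
```

Note that step 3 perturbs all `dim` coordinates regardless of how small the underlying gradient entry is, which is the behavior the gradient selection technique is meant to avoid.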
4.1.1. Gradient Selection
Unlike standard DP-SGD [
10], which injects noise uniformly across all parameter dimensions, our central insight is that not all coordinates contribute equally to model updates. Accordingly, at each iteration, we assign an importance score to every parameter dimension based on the magnitude of its gradient, which directly reflects the instantaneous contribution of each parameter to the optimization objective. Alternative importance measures, such as gradient variance, second-order information (Hessian-based metrics), or historical update statistics, may capture richer information about parameter sensitivity, but they have two drawbacks. First, they typically incur additional computational and communication overhead, which can be challenging in federated learning settings. Second, in the context of differential privacy, more complex importance measures may involve additional data-dependent computations, potentially increasing privacy overhead or complicating privacy accounting. In contrast, gradient-magnitude-based selection can be efficiently integrated into the training process and is commonly used in gradient sparsification methods [19,20,23]. We then construct a sparse gradient mask that preserves the most informative gradients while suppressing the unimportant ones. This sparsification effectively compresses the update space and confines noise injection to the informative coordinates.
More specifically, we begin by vectorizing all trainable parameters of the neural network into a $d$-dimensional representation. Let
$$\theta = [\theta_1, \theta_2, \ldots, \theta_d]^{\top} \in \mathbb{R}^{d}$$
represent the vectorized form of model parameters, where each element corresponds to an individual trainable weight or bias term. Given a loss function $\mathcal{L}(\theta)$, the associated gradient vector can be expressed as
$$g = \nabla_{\theta}\mathcal{L}(\theta) = \left[\frac{\partial \mathcal{L}}{\partial \theta_1}, \ldots, \frac{\partial \mathcal{L}}{\partial \theta_d}\right]^{\top},$$
where each element corresponds to the partial derivative of the loss with respect to a specific parameter, reflecting the sensitivity of the loss function to variations in that parameter.
At iteration $t$, we evaluate an importance metric for each dimension $j$ as
$$s_{t,j} = \left|\tilde{g}_{t,j}\right|,$$
where $\tilde{g}_{t,j}$ is the $j$-th coordinate of the clipped gradient at iteration $t$. These scores are aggregated into an importance vector
$$s_t = [s_{t,1}, \ldots, s_{t,d}]^{\top}.$$
We then sort all dimensions in descending order of importance to obtain a ranked index list
$$\pi_t = (\pi_t(1), \ldots, \pi_t(d)), \qquad s_{t,\pi_t(1)} \ge s_{t,\pi_t(2)} \ge \cdots \ge s_{t,\pi_t(d)}.$$
Let $r \in (0, 1]$ be the parameter selection ratio. The top $r \cdot d$ coordinates of $\pi_t$ form the selected index set $\mathcal{S}_t$, while the remaining coordinates form the unselected set $\bar{\mathcal{S}}_t$. The selected coordinates are retained and updated, while the unselected ones are masked out in subsequent training to reduce noise injection.
Based on this, we construct a binary importance mask vector $m_t \in \{0, 1\}^{d}$, whose $j$-th entry is defined as
$$m_{t,j} = \begin{cases} 1, & j \in \mathcal{S}_t, \\ 0, & j \in \bar{\mathcal{S}}_t. \end{cases}$$
Finally, we apply the dynamically generated mask to the gradient $\tilde{g}_t$ to obtain a sparse gradient vector:
$$\hat{g}_t = m_t \odot \tilde{g}_t,$$
where ⊙ denotes the Hadamard (element-wise) product. The sparse gradient will be used for subsequent adaptive noise addition and local model updates. By preserving only the gradient information of important coordinates and setting the gradients of redundant coordinates to zero, we reduce the dimensionality of subsequent noise injection and concentrate the privacy budget on key parameters.
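The selection procedure of this subsection (score each coordinate by the magnitude of its clipped gradient, rank, keep the top fraction, mask the rest) can be illustrated with a short sketch. The helper names `importance_mask` and `apply_mask` are hypothetical, and rounding the number of kept coordinates up with `math.ceil` (with at least one coordinate kept) is an assumption, since the rounding rule is not stated in the text.

```python
import math


def importance_mask(clipped_grad, ratio):
    """Binary mask keeping the `ratio` fraction of coordinates with the largest |gradient|."""
    d = len(clipped_grad)
    # Assumption: round the number of kept coordinates up, and keep at least one.
    k = max(1, min(d, math.ceil(ratio * d)))
    # Rank coordinate indices by importance score |g_j| in descending order.
    ranked = sorted(range(d), key=lambda j: abs(clipped_grad[j]), reverse=True)
    selected = set(ranked[:k])
    return [1 if j in selected else 0 for j in range(d)]


def apply_mask(grad, mask):
    """Element-wise (Hadamard) product of a gradient with a binary mask."""
    return [g * m for g, m in zip(grad, mask)]


grad = [0.1, -2.0, 0.05, 1.5]
mask = importance_mask(grad, ratio=0.5)
print(mask)                    # [0, 1, 0, 1]
print(apply_mask(grad, mask))  # [0.0, -2.0, 0.0, 1.5]
```

Only the coordinates with mask value 1 would subsequently receive noise and be updated; the others are set to zero.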
4.1.2. Dynamic Sparsity
To avoid potential degradation of model capacity caused by a fixed selection rate
r, we adopt a dynamic selection schedule that gradually and smoothly restores model capacity during training. Specifically, the selection rate at round
t is defined as
where
is the initial selection rate,
controls the total growth of the selection rate over the entire training process, and
T is the total number of communication rounds. As training progresses, more coordinates are gradually reintroduced into the update process according to the increase in
, effectively expanding the set of trainable parameters. This schedule allows the model to focus on the most critical coordinates in the early stage, and gradually balance sparsity and capacity as training proceeds. In the experimental section, we analyze the impact of
and
on performance.
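To make the role of the schedule concrete, the sketch below uses a linear ramp capped at 1. This specific shape is an assumption made only for illustration: the text above fixes the initial rate, the total growth over training, and the total number of rounds, but the exact functional form is not reproduced here. The names `selection_rate`, `r0`, and `delta_r` are hypothetical.

```python
def selection_rate(t, total_rounds, r0, delta_r):
    """Selection rate at round t under an ASSUMED linear ramp, capped at 1.0.

    r0      -- initial selection rate
    delta_r -- total growth of the selection rate over the whole training run
    """
    progress = min(max(t / total_rounds, 0.0), 1.0)
    return min(1.0, r0 + delta_r * progress)


# The rate starts at r0 and grows by delta_r in total over the run.
for t in (0, 250, 500, 750, 1000):
    print(t, round(selection_rate(t, total_rounds=1000, r0=0.4, delta_r=0.5), 3))
```

With such a schedule, a setting in which the initial rate plus the total growth exceeds 1 simply reaches full (dense) updates before the final round.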
4.2. Adaptive Noise Control
Although gradient selection reduces the dimensionality of noise injection, using a fixed noise level
is still not optimal. Andrew et al. [
25] also showed that fixed parameters cannot adapt to the evolving state of the model at different training stages. As shown in
Figure 3, the
norm of the gradient gradually decreases as training progresses. Due to this non-stationarity, isotropic noise with a constant scale can severely impair training accuracy. Furthermore, the gradient norm is closely related to the local smoothness of the loss function [
47], meaning that it simultaneously reflects the dynamic optimization process during training. In the early stages of training, the impact of noise on the model gradient is relatively small, and the learning signal is large enough to tolerate stronger perturbations. As training progresses, the gradient norm decreases, and the relative impact of noise on the model increases. In the later stages of training, excessive noise can severely hinder model convergence.
To adapt to this dynamic behavior and improve the effectiveness of sparse gradient updates, we propose an adaptive noise control mechanism that adjusts the current noise scale based on the historical average gradient norm. Research by Loshchilov and Hutter [
48] shows that dynamically adjusting hyperparameters using smooth nonlinear functions often achieves better convergence and generalization than traditional linear or step-wise scheduling methods. Inspired by this, we use a hyperbolic tangent function to establish a mapping between the noise multiplier and the average gradient norm, as follows:
where
is the initial noise multiplier,
is a scaling factor, and
is the historical average of the gradient norms aggregated from the clients. To ensure privacy, we perturb the gradient norm with Gaussian noise before transmission. Since the gradient norm is a single scalar, this allows us to obtain accurate estimates with a high signal-to-noise ratio while using only a minimal portion of the privacy budget, which has a negligible impact on overall privacy accounting. Furthermore, considering fluctuations in the gradient norm that may result from random sampling and injected noise, we use an exponential moving average (EMA) [
49] to stabilize
, which also better captures long-term training trends. Since this is a post-processing step, it does not cause additional privacy leakage.
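The pipeline described above (privatize the scalar gradient norm, smooth it with an EMA, map it to a noise multiplier through tanh) can be sketched as follows. The exact mapping is not reproduced here, so the form `sigma0 * tanh(beta * avg_norm)` is an assumption chosen to match the described behavior: it saturates near the initial noise multiplier for large norms and decays almost linearly for small ones. All function and parameter names (`privatize_norm`, `ema_update`, `noise_multiplier`, `norm_noise_std`, `decay`) are hypothetical.

```python
import math
import random


def privatize_norm(grad_norm, clip_norm, norm_noise_std, rng):
    """Bound the scalar norm to [0, clip_norm] and add Gaussian noise before transmission."""
    bounded = min(max(grad_norm, 0.0), clip_norm)
    return bounded + rng.gauss(0.0, norm_noise_std)


def ema_update(previous, observation, decay):
    """Exponential moving average; pure post-processing of already-privatized values."""
    return decay * previous + (1.0 - decay) * observation


def noise_multiplier(sigma0, beta, avg_norm):
    """ASSUMED mapping: sigma_t = sigma0 * tanh(beta * avg_norm)."""
    return sigma0 * math.tanh(beta * max(avg_norm, 0.0))


rng = random.Random(0)
avg_norm = 1.0  # hypothetical initial estimate of the average gradient norm
for round_norms in ([0.9, 1.0, 0.8], [0.5, 0.4, 0.6], [0.1, 0.2, 0.1]):
    noisy = [privatize_norm(n, clip_norm=1.0, norm_noise_std=0.05, rng=rng)
             for n in round_norms]
    avg_norm = ema_update(avg_norm, sum(noisy) / len(noisy), decay=0.9)
    print(round(noise_multiplier(sigma0=1.0, beta=10.0, avg_norm=avg_norm), 4))
```

Because the EMA and the tanh mapping only consume noisy norms, they add no privacy cost beyond that of the scalar release itself.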
Analysis of the Tanh-Based Noise Mapping (See Appendix A)
We analyze the properties of the hyperbolic tangent function in Equation (
16) in the context of adaptive noise control. Specifically, the mapping provides a bounded, smooth, and monotonic transformation from the gradient norm to the noise scale. These properties are desirable for noise adaptation. The bounded output of $\tanh(\cdot)$ ensures that the noise multiplier remains within a controlled range, preventing excessive perturbation that may degrade model utility. Its smooth and continuous behavior avoids abrupt changes in the noise level, leading to more stable optimization dynamics. The monotonicity of the function ensures that the noise level varies consistently with the gradient norm, avoiding situations where small gradients are perturbed by disproportionately large noise.
Moreover, the nonlinearity of the tanh function enables robust and adaptive noise control throughout training. In the early stage, the average gradient norm is typically large relative to the scale set by the scaling factor, so the $\tanh$ term is close to 1, and the noise intensity stabilizes near the initial noise multiplier, avoiding oscillations caused by rapidly changing gradient norms. As training progresses and the model starts to converge, the average gradient norm decreases, and the function gradually enters a near-linear region, allowing the noise multiplier to decay smoothly as the gradients shrink. If the noise decays too slowly, privacy protection becomes stronger, but persistently large noise may obscure useful gradient signals and slow convergence. Conversely, if the noise decays too quickly, the model may appear more accurate in the short term, but the privacy perturbation in later stages becomes insufficient, effectively consuming the privacy budget too early. In our experiments, we empirically analyze the impact of different settings of the scaling factor and observe that an appropriate choice provides a good trade-off.
4.3. AIGU-DPFL
We integrate the two aforementioned approaches and incorporate them into a DP-FL framework, resulting in our AIGU-DPFL method, which aims to achieve optimal model utility. Specifically, sparsity is introduced by selecting only a subset of informative coordinates, which reduces the effective dimensionality of updates and avoids injecting noise into redundant parameters. At the same time, the adaptive noise mechanism adjusts the perturbation scale according to the training dynamics, ensuring that noise does not overwhelm useful gradient signals. Importantly, by restricting updates to informative coordinates, gradient selection changes the effective space where noise is applied, allowing the adaptive noise mechanism to focus perturbation on relevant dimensions. AIGU-DPFL achieves sparse and noise-effective gradient updates through the joint effect of gradient selection and adaptive noise control. Consequently, noise is concentrated on the most relevant dimensions, which improves the signal-to-noise ratio of the effective update space and enables more efficient utilization of the privacy budget under differential privacy constraints.
Under the Rényi differential privacy (RDP) framework, the privacy guarantee of AIGU-DPFL is derived from the Gaussian mechanism applied to clipped gradients. Specifically, per-sample gradients are clipped to a fixed norm bound C, which ensures bounded sensitivity, and Gaussian noise is subsequently added to guarantee privacy. The importance-based gradient selection is performed after clipping and thus does not increase sensitivity. In addition, the adaptive noise control relies on privatized gradient norm statistics. Although this introduces an additional privacy cost, it corresponds to a low-dimensional Gaussian mechanism (on a scalar quantity), which is explicitly accounted for within the RDP composition framework and is typically negligible compared to the primary gradient perturbation. Furthermore, the adaptive noise adjustment depends only on these privatized statistics and can be viewed as a deterministic function of differentially private outputs. Therefore, it satisfies the post-processing property of differential privacy and does not introduce additional privacy leakage beyond the accounted mechanisms. Consequently, the total privacy loss is accounted for through standard RDP composition, while the interaction between sparsity and adaptive noise improves the signal-to-noise ratio of gradient updates, leading to more stable convergence in practice, as observed in our experiments.
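The accounting argument above can be illustrated with the standard RDP facts for the Gaussian mechanism: a release with noise multiplier $\sigma$ (noise standard deviation divided by sensitivity) satisfies $(\alpha, \alpha / (2\sigma^2))$-RDP, RDP costs add under composition, and $(\alpha, \varepsilon)$-RDP implies $(\varepsilon + \log(1/\delta)/(\alpha - 1), \delta)$-DP. The sketch below is a simplified accountant under stated assumptions: it ignores subsampling amplification, which a practical accountant would normally include, and the function names and numeric values are hypothetical and purely illustrative.

```python
import math


def gaussian_rdp(alpha, noise_multiplier):
    """RDP cost of one Gaussian release at order alpha (no subsampling)."""
    return alpha / (2.0 * noise_multiplier ** 2)


def compose_rdp(alpha, noise_multipliers):
    """RDP costs add up; per-round multipliers may differ (adaptive noise)."""
    return sum(gaussian_rdp(alpha, s) for s in noise_multipliers)


def rdp_to_dp(alpha, rdp_eps, delta):
    """Convert an (alpha, rdp_eps)-RDP guarantee into an (eps, delta)-DP guarantee."""
    return rdp_eps + math.log(1.0 / delta) / (alpha - 1.0)


def best_epsilon(noise_multipliers, delta, alphas):
    """Tightest epsilon over a grid of RDP orders."""
    return min(rdp_to_dp(a, compose_rdp(a, noise_multipliers), delta) for a in alphas)


alphas = list(range(2, 65))
gradient_releases = [1.0] * 10   # illustrative: 10 gradient releases, multiplier 1.0
norm_releases = [20.0] * 10      # illustrative: 10 scalar-norm releases, multiplier 20.0
eps_gradients = best_epsilon(gradient_releases, 1e-5, alphas)
eps_total = best_epsilon(gradient_releases + norm_releases, 1e-5, alphas)
print(eps_gradients, eps_total)
```

In this toy setting the scalar norm releases add only a small amount to the overall epsilon, which is the sense in which their cost is described as negligible relative to the gradient perturbation.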
Algorithm 1 outlines the pseudocode of the proposed method. To preserve the overall privacy guarantee, local training is halted once the accumulated privacy budget reaches the predefined limit. Subsequently, Gaussian noise is injected into the gradients, and the averaged noisy updates are applied to refine the model. Upon completion of local training, the resulting model is transmitted to the server for aggregation.
Algorithm 1: Pseudocode for AIGU-DPFL (listing supplied as an image; not reproduced)
5. Experiments
In this section, we perform comprehensive experimental evaluations on three benchmark datasets to assess the proposed AIGU-DPFL method with respect to model accuracy, efficiency, and computational overhead.
5.1. Experimental Description
5.1.1. Datasets and Models
MNIST [50]: This dataset contains 70,000 grayscale images of handwritten digits categorized into 10 classes. On this dataset, we train a convolutional neural network (CNN) derived from the LeNet architecture. The network is composed of three convolutional layers, each followed by a Sigmoid activation function, and is optimized using the Adam optimizer with a learning rate of 0.002.
Fashion-MNIST [51]: This dataset contains 70,000 grayscale images of size 28 × 28 representing 10 different clothing categories. We again train a CNN derived from the LeNet architecture, using the Adam optimizer with a learning rate of 0.001.
CIFAR-10 [52]: This dataset comprises 32 × 32 RGB images spanning 10 categories (e.g., ships, airplanes, dogs, and cats), with 50,000 samples used for training and 10,000 reserved for testing. We train a ResNet-18 model on this dataset. The first convolutional layer of the model uses a 3 × 3 convolution kernel, the max pooling layer is omitted, and group normalization is used in the convolutional layers. Training uses the SGD optimizer with momentum 0.9 and an initial learning rate of 0.1.
5.1.2. Baselines
As baselines, we compare AIGU-DPFL with FedAvg [
1] (i.e., FL without privacy) and fixed noise injection (Fixed-DP). To further assess the effectiveness of our method, we also compare against two sparse model perturbation methods and one adaptive DP-FL method:
FedSMP-randk [
19]: Randomly selects k coordinates from all d parameter dimensions as active dimensions, scales them by d/k, and then injects Gaussian noise only on these coordinates.
FedSMP-topk [
19]: Selects the top k coordinates with the largest update magnitudes as active dimensions. The server generates a common top-k mask using a public dataset and broadcasts it to all clients.
ADAP-DPFL [26]: Uses the $\ell_2$-norm of the differentially private averaged gradient from the preceding round as the clipping bound for the current iteration, and adaptively tunes each client’s noise magnitude based on variations in the validation loss observed at the server.
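As an illustration of the random sparsification baseline listed above, the sketch below keeps k random coordinates, rescales them by d/k, and perturbs only those coordinates. It is a schematic reading of the description in this list, not the baseline's reference implementation; the function names and the noise standard deviation argument are hypothetical.

```python
import random


def rand_k_sparsify(update, k, rng):
    """Keep k random coordinates, rescaled by d/k so the result is unbiased in expectation."""
    d = len(update)
    active = sorted(rng.sample(range(d), k))
    scale = d / k
    sparse = [0.0] * d
    for j in active:
        sparse[j] = update[j] * scale
    return sparse, active


def perturb_active(sparse, active, noise_std, rng):
    """Add Gaussian noise only on the active coordinates; the rest stay exactly zero."""
    noisy = list(sparse)
    for j in active:
        noisy[j] += rng.gauss(0.0, noise_std)
    return noisy


rng = random.Random(0)
update = [1.0, 2.0, 3.0, 4.0]
sparse, active = rand_k_sparsify(update, k=2, rng=rng)
print(active, perturb_active(sparse, active, noise_std=0.1, rng=rng))
```

The top-k variant differs only in how the active set is chosen (largest magnitudes, via a mask computed on public data at the server) rather than in how noise is applied.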
All experiments are conducted on a workstation featuring an Intel 12th Gen Core i9-12900K CPU and an NVIDIA A5000 GPU.
5.2. Accuracy Comparison Experiments
In this study, we used the MNIST, Fashion-MNIST, and CIFAR-10 datasets to evaluate the performance of AIGU-DPFL, focusing primarily on test accuracy. We compared a total of six methods and examined how several key factors affect model performance, including the noise multiplier, the Dirichlet heterogeneity parameter, the number of communication rounds T between the client and server, and the number of participating clients.
Noise Multiplier:
Figure 5 shows the robustness of our method under different noise multipliers. Our method maintains high accuracy even as the noise multiplier increases. In the MNIST experiments, AIGU-DPFL consistently maintains an accuracy above 90% for all tested noise multipliers. In contrast, the performance of the other methods decreases sharply as the noise multiplier increases. On the FMNIST dataset, even at large noise multipliers, the accuracy of AIGU-DPFL still exceeds 73%, while the other methods perform significantly worse. On the CIFAR-10 dataset, AIGU-DPFL maintains relatively stable accuracy across all tested noise multipliers, demonstrating stronger noise resistance than the other methods.
Non-independent and identically distributed (non-IID): To better simulate real-world federated learning environments, all experiments are conducted in a non-independent and identically distributed (non-IID) setting. Specifically, we use a Dirichlet distribution to assign labels among clients.
Figure 6 illustrates the impact of the Dirichlet heterogeneity parameter on accuracy. A smaller value of this parameter indicates stronger heterogeneity, corresponding to a more imbalanced label distribution among clients. In this context, AIGU-DPFL remains robust to heterogeneity. On the MNIST dataset, when
, AIGU-DPFL performs comparably to Non-DP, achieving an accuracy of 95.96%, while the other methods perform significantly worse. On the FMNIST dataset, AIGU-DPFL shows better convergence than Fixed-DP; when
, the accuracy of AIGU-DPFL is 6.72% higher than that of Fixed-DP. On the CIFAR-10 dataset, as
the Dirichlet heterogeneity parameter decreases, the adverse effects of data heterogeneity become more pronounced, but AIGU-DPFL still exhibits stronger performance and a more stable convergence trend. These results indicate that AIGU-DPFL is adaptive and robust when handling non-IID data.
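A common way to realize such a Dirichlet-based non-IID split is to draw, for every class, a vector of client proportions from a symmetric Dirichlet distribution and divide that class's samples accordingly. The sketch below follows this common scheme; whether the experiments use exactly this variant is an assumption, and the names `sample_dirichlet`, `dirichlet_partition`, and the parameter name `alpha` are hypothetical.

```python
import random


def sample_dirichlet(alpha, n, rng):
    """Draw from a symmetric Dirichlet(alpha) by normalizing Gamma(alpha, 1) variates."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(n)]
    total = sum(draws)
    if total == 0.0:  # guard against numerical underflow for very small alpha
        return [1.0 / n] * n
    return [x / total for x in draws]


def dirichlet_partition(labels, num_clients, alpha, rng):
    """Split sample indices among clients, class by class, with Dirichlet proportions."""
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    clients = [[] for _ in range(num_clients)]
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        props = sample_dirichlet(alpha, num_clients, rng)
        start, cum = 0, 0.0
        for c in range(num_clients):
            cum += props[c]
            # The last client takes the remainder so every sample is assigned once.
            end = len(idxs) if c == num_clients - 1 else min(len(idxs), int(round(cum * len(idxs))))
            end = max(end, start)
            clients[c].extend(idxs[start:end])
            start = end
    return clients


rng = random.Random(0)
labels = [i % 10 for i in range(1000)]  # toy labels: 10 balanced classes
parts = dirichlet_partition(labels, num_clients=5, alpha=0.1, rng=rng)
print([len(p) for p in parts])
```

Smaller values of `alpha` concentrate each class on fewer clients, which produces the more imbalanced label distributions discussed above.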
Communication Rounds:
Figure 7 compares the impact of the number of communication rounds
T on the test accuracy across the three datasets. We observe that AIGU-DPFL maintains strong model performance throughout the training process. For example, after 1000 communication rounds with the server, AIGU-DPFL achieves an accuracy of over 95% on the MNIST dataset and nearly 80% on the FMNIST dataset. In contrast, the other methods fail to remain competitive during training.
Client Count Stability:
Figure 8 shows the change in model accuracy as the number of clients increases. The experimental results indicate that AIGU-DPFL exhibits strong stability under this condition. For example, on the MNIST dataset, when the number of clients reaches 50, the accuracy of the other four methods decreases, while AIGU-DPFL still achieves an accuracy of 96.80%. This shows that its performance does not change significantly with the number of clients. This property is particularly valuable in practical deployments, as the number of participating clients in real-world environments is often uncertain and may vary over time.
5.3. Hyperparameter Analysis
To evaluate the impact of three hyperparameters (the initial retention rate, the growth span, and the scaling factor) on training accuracy, we conducted tests on three different datasets.
5.3.1. Hyperparameter Sensitivity Analysis (Initial Retention Rate and Growth Span)
As shown in Figure 9, we investigated the effects of the initial retention rate and the growth span on the experimental results. The results show that both hyperparameters have a significant impact on model accuracy, and this impact is closely related to the sparsity introduced by the gradient selection mechanism. For example, on the FMNIST dataset, when both hyperparameters are in the range of 0.4 to 0.5, the test accuracy remains high and stable throughout the training process. Notably, all models employing importance-based selection outperform the dense baseline model (i.e., the model without introduced sparsity), indicating that moderate sparsity can enhance the model’s generalization ability. This effect is particularly significant when
and
, with the model achieving a peak accuracy of 80.83% in the later stages of training, which achieves an optimal balance between accuracy and sparsity. Conversely, excessively small
and
lead to a significant drop in accuracy, indicating that over-pruning severely limits the model’s expressive power. In the MNIST and CIFAR-10 experiments, the best performance was achieved with combinations of (
,
) and (
,
), respectively. This configuration effectively removes redundant or low-information parameters while preserving crucial gradient dimensions, thus promoting convergence.
To further evaluate the model’s boundary sensitivity to hyperparameter preservation, we extended the experiments to the region where , corresponding to the model transitioning to full parameter updates early in training. On the FMNIST dataset, we observed that the model still maintains good performance in this region. For example, when and , the model achieved a test accuracy of 80.71%. This behavior mainly stems from the model entering the full update phase earlier, thus mitigating the convergence slowdown typically introduced by sparse gradient selection.
5.3.2. Hyperparameter Sensitivity Analysis (Scaling Factor)
We evaluated the impact of the scaling factor on training accuracy. This parameter scales the contribution of the average gradient norm when updating the noise multiplier, thereby controlling the sensitivity of noise decay. A smaller scaling factor makes the noise intensity more sensitive to changes in the gradient norm, leading to rapid noise decay in the early stages of training, while a larger scaling factor reduces how much the noise changes with the training dynamics, approximating a more stable noise injection mechanism. In our experiments, we set the scaling factor to 0.1, 0.3, 1, 3, 10, 30, and 50, and evaluated the model’s performance on different datasets. As shown in Figure 10, setting the scaling factor to 10 yields the best performance on the MNIST and Fashion-MNIST datasets. Specifically, when the scaling factor is 10, the noise multiplier can better adapt to the optimization dynamics during training, effectively suppressing training oscillations caused by excessively rapid decay in the early stages, while preventing strong noise from masking the effective signal in the later stages. This indicates that adjusting the scaling factor appropriately can effectively improve the model’s accuracy. Similarly, on the CIFAR-10 dataset, we also found the optimal model utility when
.
However, as the complexity of training datasets and models increases, determining the optimal range of hyperparameters becomes more challenging, and the search space expands accordingly.
5.4. Ablation Experiments
In this section, we carry out comprehensive ablation studies to verify the effectiveness of each component within our training framework, the AIGU-DPFL method, which comprises two main modules: (1) an importance-based gradient selection strategy (GS) and (2) an adaptive noise mechanism (AN). We decompose AIGU-DPFL into four variants:
Default: Neither GS nor AN. This setup is used as the reference model in our comparative evaluation.
Default + GS: Only GS is retained, while AN is disabled. This setting allows us to evaluate the effectiveness of the importance-based gradient selection strategy.
Default + AN: Only AN is retained, while GS is disabled. This setup is used to evaluate the effectiveness of our adaptive noise mechanism.
AIGU-DPFL: The complete framework combining the two modules.
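The four variants form a two-by-two grid over the GS and AN switches, which can be sketched as follows. This is a minimal illustration under stated assumptions, not the AIGU-DPFL implementation: `Variant`, `privatize`, and all parameter values are hypothetical, magnitude-based top-k stands in for the importance score, and `sigma_adaptive` stands in for whatever value the adaptive schedule would produce in a given round. The sketch also ignores the privacy cost of data-dependent coordinate selection, which a real mechanism must account for.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    name: str
    use_gs: bool  # importance-based gradient selection
    use_an: bool  # adaptive noise mechanism


VARIANTS = [
    Variant("Default", use_gs=False, use_an=False),
    Variant("Default + GS", use_gs=True, use_an=False),
    Variant("Default + AN", use_gs=False, use_an=True),
    Variant("AIGU-DPFL", use_gs=True, use_an=True),
]


def privatize(grad, variant, clip=1.0, sigma_fixed=1.0, sigma_adaptive=0.5,
              keep_ratio=0.5, rng=None):
    """Clip, optionally keep the largest-magnitude coordinates, then add Gaussian noise."""
    if rng is None:
        rng = random.Random(0)
    # L2 clipping to norm `clip`.
    norm = math.sqrt(sum(g * g for g in grad))
    factor = min(1.0, clip / norm) if norm > 0 else 1.0
    clipped = [g * factor for g in grad]

    if variant.use_gs:
        k = max(1, math.ceil(keep_ratio * len(clipped)))
        order = sorted(range(len(clipped)), key=lambda i: abs(clipped[i]), reverse=True)
        kept = set(order[:k])
    else:
        kept = set(range(len(clipped)))

    sigma = sigma_adaptive if variant.use_an else sigma_fixed
    # Noise is added only to the kept coordinates; dropped coordinates stay zero.
    return [clipped[i] + rng.gauss(0.0, sigma * clip) if i in kept else 0.0
            for i in range(len(clipped))]


grad = [0.9, -0.05, 0.4, 0.01]
for v in VARIANTS:
    out = privatize(grad, v)
    print(f"{v.name:<13} nonzero={sum(x != 0.0 for x in out)}")
```

Keeping the two switches independent makes it straightforward to run every variant through the same training loop, so that accuracy differences can be attributed to the module being toggled.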
We assess these four variants across three datasets under three privacy budget settings, taking the global model accuracy on the central server as the primary evaluation metric, and perform a detailed analysis to examine the role and influence of each module within the algorithm. The outcomes are presented in
Table 1. When only the gradient selection (GS) is retained, accuracy is improved across all privacy budgets compared to the baseline model. The improvement is most significant on the Fashion-MNIST dataset with
, where GS alone improves accuracy by 2.28%. This indicates that selecting important gradient coordinates reduces noise interference in non-critical dimensions and focuses the privacy budget on task-relevant directions. The implementation of the adaptive noise mechanism (AN) also improves accuracy across various privacy budgets. This demonstrates that adaptively adjusting the noise level better adapts to the dynamics of training and effectively mitigates the drawbacks of excessive perturbation in the early stages with fixed noise and convergence obstacles in the later stages. When both modules are activated simultaneously, AIGU-DPFL consistently attains the best performance across all evaluated settings, typically exceeding the sum of the individual module improvements. For example, on the CIFAR-10 dataset with
, AIGU-DPFL achieves an accuracy of 58.14%, 1.63 percentage points higher than the second-best result (56.51%), indicating a synergistic effect and good compatibility between the two modules. Furthermore, the experimental data show that removing both modules leads to a significant drop in model accuracy, highlighting the role of each component within the overall framework.
In summary, by jointly filtering key gradient dimensions and adaptively adjusting the noise amplitude, GS and AN significantly suppress perturbations to the update direction. They exhibit complementary advantages on datasets of varying complexity and under different privacy budgets: activating either module individually outperforms the default setting, while activating both produces a synergistic effect, so that AIGU-DPFL achieves the best performance in every evaluated setting.
5.5. Efficiency and Overhead Analysis
This group of experiments evaluated the number of communication rounds required for AIGU-DPFL to reach the target accuracy compared with other algorithms. For a clearer comparison, we present the experimental results on the MNIST and Fashion-MNIST datasets.
Table 2 shows that on the Fashion-MNIST dataset, AIGU-DPFL reached 70% accuracy in only 135 rounds; FedSMP-randk, FedSMP-topk, and Adap-DPFL required 3.37, 3.05, and 3.67 times as many rounds, respectively. This advantage stems from our importance-based gradient selection mechanism, which improves the signal-to-noise ratio of global updates and thereby accelerates convergence.
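The rounds-to-target metric used in this comparison can be computed from a per-round accuracy log. The sketch below is illustrative: the helper `rounds_to_target` is a hypothetical name, the accuracy curve is synthetic, and the baseline round counts are only what the reported ratios imply when multiplied by 135, not values read from Table 2.

```python
def rounds_to_target(accuracy_by_round, target):
    """Return the 1-based round at which accuracy first reaches target, or None."""
    for round_idx, acc in enumerate(accuracy_by_round, start=1):
        if acc >= target:
            return round_idx
    return None


# Round counts implied by the reported ratios (135 rounds for AIGU-DPFL).
aigu_rounds = 135
reported_ratio = {"FedSMP-randk": 3.37, "FedSMP-topk": 3.05, "Adap-DPFL": 3.67}
for name, ratio in reported_ratio.items():
    print(f"{name}: about {round(aigu_rounds * ratio)} rounds")

# Synthetic accuracy curve, made up purely to exercise the helper.
toy_curve = [0.42, 0.55, 0.63, 0.69, 0.71, 0.70, 0.74]
print(rounds_to_target(toy_curve, 0.70))
```

Taking the first round at which the target is reached, rather than the last, keeps the metric insensitive to later fluctuations caused by noise injection.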
Table 3 further evaluates the time consumption over 1000 rounds on both datasets. The results show that AIGU-DPFL is more time-efficient than FedSMP-randk and FedSMP-topk and slightly more efficient than Adap-DPFL. In terms of communication overhead,
Table 3 provides a comparison of the number of parameters transmitted per iteration for one client. Fixed-DP and Adap-DPFL require transmitting the full model in each iteration, resulting in comparable communication costs under the same model size and number of iterations. In contrast, FedSMP-randk and FedSMP-topk adopt sparsification strategies, where only a fraction of coordinates are communicated in each iteration, leading to reduced communication costs. For AIGU-DPFL, the communication overhead is determined by the dynamically adjusted retention ratio. By selecting only informative coordinates for transmission, AIGU-DPFL effectively reduces the number of transmitted parameters while maintaining sufficient model expressiveness, enabling a more flexible trade-off between communication efficiency and model performance.
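The per-client communication cost discussed above can be estimated with a simple count of uploaded scalars. The following is a rough sketch under stated assumptions: `params_per_round`, the model size, the retention ratios, and the option of sending one index per kept value are all hypothetical and are not taken from the experimental setup.

```python
import math


def params_per_round(model_size, keep_ratio=1.0, send_indices=False):
    """Number of scalars one client uploads in one round.

    keep_ratio=1.0 models dense methods; keep_ratio<1 models sparsified ones.
    send_indices=True adds one index per kept value (an assumption about the
    wire format; schemes that share a random seed can avoid sending indices).
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    kept = math.ceil(keep_ratio * model_size)
    return kept * 2 if send_indices else kept


d = 1_000_000  # hypothetical model size
print("dense               :", params_per_round(d))
print("sparse 25%          :", params_per_round(d, keep_ratio=0.25))
print("sparse 25% + indices:", params_per_round(d, keep_ratio=0.25, send_indices=True))

# A retention ratio that changes during training (hypothetical schedule).
for round_idx, ratio in [(0, 0.125), (500, 0.25), (1000, 0.5)]:
    print(round_idx, params_per_round(d, keep_ratio=ratio))
```

With a dynamically adjusted retention ratio, the cost is no longer a single number per method but a per-round quantity, so total communication is the sum of these counts over all rounds.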
Table 3 compares the memory footprint of different methods, measured as the peak GPU memory per client during local training. The highest memory usage is observed in FedSMP-randk and FedSMP-topk, due to the additional index tensors and temporary buffers required for sparse updates. In contrast, Fixed-DP and Adap-DPFL exhibit lower memory consumption, as they operate on dense gradients without introducing extra data structures. AIGU-DPFL incurs a slightly higher memory consumption than Fixed-DP and Adap-DPFL, as it performs coordinate selection on gradient tensors and maintains dynamic control of the retention ratio and noise scale during training. These operations are applied directly to existing gradients and do not introduce additional model-sized allocations, resulting in a limited overall memory overhead.
6. Discussion
Despite AIGU-DPFL’s robust performance, certain aspects still warrant further optimization. First, although the dynamic sparsity scheduling mechanism effectively restores model capacity over time, the local training phase currently follows a generic paradigm and is not tailored to the data distribution of each individual client. Prior research on personalized federated learning suggests that customizing local update strategies to individual client characteristics can further accelerate convergence and improve accuracy in heterogeneous environments. Consequently, integrating client-adaptive optimization techniques into AIGU-DPFL is a promising avenue for future research.
Second, while our method maintains stable performance in scenarios involving 50 clients, deploying AIGU-DPFL in large-scale systems with thousands or even millions of clients may introduce additional challenges in terms of memory usage and communication overhead. In such settings, the aggregate communication cost scales approximately linearly with the number of participating clients, because each client transmits its local update to the server in every communication round. Future work will investigate hierarchical aggregation schemes, in which client updates are first aggregated at edge servers and then synchronized at the global level [
53], as well as semi-asynchronous coordination that limits stale updates through bounded-delay scheduling [
54], to improve scalability under large client populations. To further reduce communication overhead without degrading performance, distillation-based methods can be adopted: they exchange compact knowledge representations instead of full gradient updates, which substantially decreases the amount of transmitted information while preserving task-relevant knowledge [
55]. From the perspective of model dimensionality, these challenges can be further amplified in high-dimensional models, where the size of gradient updates and communication cost grow with the number of parameters. AIGU-DPFL partially alleviates this issue by restricting updates to informative coordinates, thereby reducing the effective update dimensionality. Future work will investigate structured sparsification with parameter grouping [
56] to bound per-round communication and computation costs in high-dimensional settings, following established sparsified training approaches.
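The idea behind hierarchical aggregation can be made concrete with a generic two-tier weighted average. This is a minimal sketch and not the specific scheme of the cited work: the function names, the grouping of clients, and the sample counts are hypothetical. It shows that when each edge result is weighted by its total sample count, the two-tier average equals the flat average, while the global server receives one message per edge server instead of one per client.

```python
def weighted_average(updates, weights):
    """Weighted average of equal-length update vectors."""
    total = sum(weights)
    dim = len(updates[0])
    return [sum(w * u[i] for u, w in zip(updates, weights)) / total for i in range(dim)]


def hierarchical_average(groups):
    """groups: list of edge groups, each a list of (update, num_samples) pairs.

    Each edge server averages its own clients; the global server then averages
    the edge results, weighting each by the edge's total sample count.
    """
    edge_updates, edge_weights = [], []
    for group in groups:
        updates = [u for u, _ in group]
        weights = [n for _, n in group]
        edge_updates.append(weighted_average(updates, weights))
        edge_weights.append(sum(weights))
    return weighted_average(edge_updates, edge_weights)


groups = [
    [([1.0, 2.0], 10), ([3.0, 0.0], 30)],  # clients attached to edge server A
    [([0.0, 4.0], 20)],                    # clients attached to edge server B
]
flat = [pair for group in groups for pair in group]
flat_avg = weighted_average([u for u, _ in flat], [n for _, n in flat])
hier_avg = hierarchical_average(groups)
print(flat_avg, hier_avg)
print("messages at global server:", len(groups), "instead of", len(flat))
```

Because the aggregation is linear, the same equivalence holds when the client updates are already clipped and noised, which is why the hierarchy changes the communication pattern without changing the aggregated result.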
Furthermore, despite the reduction in noise applied to non-critical gradients, the approach still requires a careful distribution of the privacy budget. Under stringent privacy constraints, achieving an optimal balance between sparsity induction and noise injection may require further fine-tuning. Future work may investigate adaptive strategies for privacy budget allocation, where the budget is dynamically assigned according to the relative importance of specific parameters or gradients throughout training. Such mechanisms have the potential to improve both privacy efficiency and model performance in sensitive application scenarios.
Despite the above limitations, the proposed AIGU-DPFL method offers significant practical advantages in real-world federated learning deployments where strict data privacy, high model utility, and communication efficiency are simultaneously demanded. For example, in healthcare systems, medical data are distributed across hospitals and cannot be shared due to privacy regulations. Our method enables collaborative model training while protecting sensitive patient information through differential privacy. At the same time, the importance-based gradient selection reduces communication overhead by transmitting only a subset of informative parameters, which is beneficial in bandwidth-limited environments. Similarly, in financial services, where user data are highly sensitive, the adaptive noise mechanism allows flexible control of the privacy–utility trade-off, ensuring strong privacy guarantees while maintaining model performance. In edge-based Internet of Things (IoT) applications, where devices often have limited computational and communication resources, the combination of sparse updates and adaptive noise helps reduce resource consumption and improve training efficiency under heterogeneous conditions.