Article

AIGU-DPFL: Adaptive Differentially Private Federated Learning with Importance-Based Gradient Updates

College of Computer, Zhongyuan University of Technology, Zhengzhou 450007, China
* Author to whom correspondence should be addressed.
Computers 2026, 15(5), 288; https://doi.org/10.3390/computers15050288
Submission received: 6 April 2026 / Revised: 24 April 2026 / Accepted: 27 April 2026 / Published: 1 May 2026

Abstract

Federated learning, a decentralized machine learning framework, allows multiple participants to jointly train models while keeping their raw data local and unshared. Nevertheless, during the exchange of model updates, the communicated information can still introduce privacy vulnerabilities and potentially result in the exposure of user data. Over the past few years, differential privacy methods have been broadly incorporated into federated learning frameworks to strengthen the protection of sensitive data. Nevertheless, the noise required to satisfy differential privacy guarantees often causes significant degradation in model performance. Prior studies have typically employed a fixed noise-injection strategy following gradient clipping. Although such methods provide privacy protection, they overlook the varying importance of different gradient dimensions, resulting in noise being injected into unimportant or redundant parameters, thereby causing unnecessary performance loss. To address these limitations, we propose an adaptive differentially private federated learning scheme with importance-based gradient updates (AIGU-DPFL). Specifically, we focus on coordinates with high information content and introduce an adaptive noise injection mechanism, which perturbs gradient updates to satisfy differential privacy guarantees while dynamically controlling noise intensity, thereby achieving sparse and noise-effective gradient updates. AIGU-DPFL markedly enhances the training effectiveness of federated learning models. Comprehensive evaluations conducted on real-world datasets indicate that the proposed method achieves superior performance compared to existing differentially private federated learning techniques.

1. Introduction

In recent years, machine learning (ML) has demonstrated impressive performance in applications such as image recognition and has been extensively adopted across a wide range of industrial sectors. This achievement is primarily driven by centralized training on large-scale datasets. However, such a paradigm introduces significant privacy risks, as user data needs to be transferred to a central server for model updating. As a result, concerns about data security and privacy have become more prominent among users. Meanwhile, advances in the storage capacity and computational power of edge devices have driven growing interest in distributed machine learning approaches.
In 2017, to protect data privacy in machine learning, McMahan et al. [1] first proposed the concept of federated learning (FL). As a distributed machine learning paradigm, FL is particularly attractive in application scenarios such as finance [2] and healthcare [3], where stringent privacy requirements and limited data utilization coexist. Under this privacy-preserving framework, multiple clients collaboratively train a global model coordinated by a central server and repeatedly interact with the server over multiple communication rounds to achieve the learning objective. This approach effectively preserves data locality while improving data utilization. However, as illustrated in Figure 1, both the federated learning server (which may be curious or malicious) and malicious participants can threaten the FL system during training. Attackers can mount reconstruction attacks [4] to recover the original data or perform membership inference attacks [5,6] to determine whether a specific sample is present in a client’s training set. Melis et al. [7] further demonstrated that the model updates shared in federated learning can inadvertently reveal sensitive information about participants’ training data. Consequently, additional safeguards are necessary to ensure robust privacy protection in federated learning systems.
Differential privacy (DP), introduced by Cynthia Dwork [8], offers a rigorous mathematical framework for formally quantifying privacy guarantees. Its theoretical analysis shows that randomization techniques can ensure that the influence of any single record on the released output is bounded by a specified threshold. This restricts the ability of external parties to infer, by observing changes in the output, whether a specific record has been altered or removed. Differential privacy [9] has become a leading paradigm for privacy protection and has been widely applied in privacy-preserving machine learning.
In [10], Wang et al. were among the first to apply differential privacy to deep learning in a federated setting. They injected noise during data release or model training to obfuscate the contribution of any individual data point. Geyer et al. [11] and others demonstrated that client-level differential privacy is feasible in federated learning. A large body of work [12,13,14,15,16,17,18] has applied differential privacy to FL by using differentially private stochastic gradient descent (DP-SGD): in each local update, gradients are clipped, and Gaussian noise is added to limit privacy leakage. However, despite its theoretical guarantees, the introduction of DP often negatively affects model performance, for example, by reducing model accuracy and increasing training time. Injecting excessive noise may bias the update direction, slow convergence, and degrade final performance, especially in high-dimensional or long-horizon training regimes.
In existing differentially private federated learning (DP-FL) research, a common practice is to inject Gaussian noise uniformly across all gradient dimensions to ensure privacy for each parameter update. However, such indiscriminate noise injection causes the noise level to scale with the model size, thereby hampering convergence. At the same time, noise is often added to unimportant parameters, wasting the privacy budget and masking critical gradient information. To highlight this issue more intuitively, we analyze the redundancy of gradient dimensions and study gradient distributions in FL tasks. As shown in Figure 2, approximately 95% of gradient values lie within a narrow region around zero. This indicates strong gradient sparsity and suggests that uniformly adding noise across all parameter dimensions is largely redundant, which motivates the design of our work. Some works [19,20,21,22] have attempted to mitigate the accuracy degradation caused by uniform noise injection through gradient sparsification (e.g., updating only significant parameters). However, most of these methods employ fixed sparsity rates and constant noise levels throughout the training process, thereby failing to account for the complex training dynamics of neural networks; in particular, an aggressive sparsity rate restricts model capacity by discarding informative gradients. Huang et al. [23] introduce a dimension-adaptive sparsification mechanism (AdaS-FLDP), which dynamically selects important layers and dimensions based on their contributions to model convergence, thus enabling a more favorable trade-off between model performance and communication efficiency. These observations inspire the development of a mechanism that dynamically identifies key coordinates and allocates the privacy budget preferentially to the most influential parameters, with the objective of improving overall utility.
Furthermore, previous works [10,14] adopt a constant noise level throughout training. This strategy neglects the heterogeneity in parameter magnitudes across different coordinates, as well as their dynamic evolution throughout the training process. For instance, Figure 3 shows that, in federated learning, the $\ell_2$-norm of gradients can vary substantially across clients and communication rounds, eventually converging to a small value. Consequently, using a fixed noise level can severely degrade training accuracy in the presence of heterogeneous gradients. Although some works [24,25,26] have enhanced model utility by adaptively adjusting gradient clipping norms and noise levels, these methods still perform perturbation within the full gradient space. As a result, noise is uniformly applied across all dimensions, which may lead to inefficient use of the privacy budget and degradation of useful gradient signals in high-dimensional models.
In this work, we address the above limitations from a different perspective. Instead of performing perturbation over a fixed parameter space, we consider the coordinated design of the effective update subspace and the noise allocation strategy during training. In particular, the update subspace is dynamically determined via importance-based gradient selection, whereby only coordinates with relatively high gradient magnitudes are retained for model updates. Within this reduced space, the noise level is further adjusted according to the evolving training dynamics. Consequently, the perturbation process is shifted from uniform noise injection over all parameters to adaptive perturbation over a dynamically selected subset of informative coordinates, leading to an improved signal-to-noise ratio and more efficient utilization of the privacy budget.
Our contributions are as follows:
  • We propose a novel federated learning differential privacy optimization method, AIGU-DPFL, which combines importance-aware gradient selection and noise adaptation techniques. By focusing updates on the most informative coordinates in each iteration, AIGU-DPFL reduces unnecessary perturbations and improves the signal-to-noise ratio.
  • We design a dynamic sparsity rate selection strategy that allows the model to focus on the most important parameters in the initial stages while gradually restoring the model’s full capacity, thereby reducing early noise caused by gradient instability. This effectively balances sparsity and the expressive power of model updates.
  • We design an adaptive noise mechanism for sparse gradients, addressing the inherent limitations of deep learning under fixed differential privacy parameters. Compared with existing differential privacy baseline methods, our method converges faster and achieves higher accuracy.
  • We conduct comparative experiments on real-world datasets. The results demonstrate that AIGU-DPFL outperforms existing differentially private federated learning methods.
The rest of this paper is structured as follows. Section 2 reviews the related work. Section 3 provides the necessary preliminaries and elaborates on the theoretical basis and technical components of our work. Section 4 details the proposed approach. Section 5 presents the experimental evaluation along with corresponding analysis. Section 6 discusses the limitations observed in the experiments and highlights directions for future research. Finally, Section 7 concludes the paper.

2. Related Work

In this section, we first briefly outline the broader issues and challenges in federated learning, and then review representative works related to adaptive differential privacy and gradient compression. Finally, we outline approaches for quantifying privacy loss.

2.1. Issues and Challenges in Federated Learning

Although federated learning provides a decentralized paradigm that mitigates direct data leakage, its practical deployment is hindered by several intrinsic challenges. Recent comprehensive surveys [27,28] have systematically categorized these detailed issues. Specifically, Ayeelyan et al. [27] summarized the design and functional issues of federated learning from the perspectives of client selection, aggregation and optimization, communication cost, knowledge transfer, data management, and incentives, and emphasized that these factors jointly affect the efficiency and practicality of federated learning systems. Sana et al. [28] further pointed out that large-scale federated learning must address not only privacy preservation, but also scalability, resource constraints, heterogeneity, and the trade-offs among communication efficiency, utility, and robustness. In particular, non-IID data, device heterogeneity, and privacy-preserving protection mechanisms often complicate convergence and real-world deployment.

2.2. Adaptive Differential Privacy Schemes

Differential privacy [9] has emerged as a widely adopted framework for safeguarding data confidentiality by limiting the impact that any individual sample can have on the resulting model. It has been broadly utilized in the context of privacy-preserving empirical risk minimization. In deep learning, however, the non-convexity and complexity of models pose unique challenges. Wang et al. [10] proposed a differentially private stochastic gradient descent framework in federated settings, which clips gradients to bound sensitivity and then adds isotropic Gaussian noise to the clipped gradients, effectively perturbing the update direction to protect privacy. However, clipping and noise injection introduce biases that can affect the convergence of the optimization process.
To compensate for the utility loss caused by noise, many works [29,30,31] have studied adaptive DP noise mechanisms that improve model utility while protecting private data. Among them, Xu et al. [29] combine DP with RMSProp and adaptively inject noise into each gradient coordinate based on the moving average of squared historical gradients. Yu et al. [30] propose a preconditioning-based method that adaptively decays noise across iterations. Lee and Kifer [31] dynamically allocate per-iteration privacy budgets to reduce the impact of noise on the optimization process.
Pichapati et al. [32] proposed AdaClip, which estimates clipping thresholds in a coordinate-wise fashion and adaptively injects noise into different gradient dimensions. In asynchronous FL, Li et al. [24] propose a multi-stage adaptive privacy algorithm that gradually reduces the clipping threshold as training converges, thereby reducing unnecessary noise and improving performance. Xue et al. [33] develop a sensitivity estimation framework that leverages both local and global historical information, effectively mitigating the adverse effects of differential privacy on model performance. Wei and Liu [34] combine threshold decay with noise decay; when these strategies are used together, the method demonstrates effective performance. However, these works generally still inject noise uniformly over the full gradient dimensions, without fully exploiting gradient sparsity and coordinate-wise importance, which leads to a waste of the privacy budget.

2.3. Gradient Compression and Sparsity

In recent years, a growing number of studies have focused on compressing the gradient space to improve the effectiveness of differentially private stochastic gradient descent (DP-SGD) and mitigate performance degradation caused by noise amplification. For example, studies such as [35,36] reduce the dimensionality of the gradient space during DP training by freezing most network parameters, thereby significantly weakening the impact of noise on model performance. Nevertheless, these fixed-structure approaches often depend on auxiliary datasets or pre-training stages, thereby constraining the model’s adaptability when optimizing over private data. To overcome these drawbacks, more recent studies have investigated dynamic strategies—such as subspace compression and gradient sparsification—to progressively enhance the utility of DP-SGD throughout training.
Sha et al. [37] present the PCDP-SGD approach, which performs a projection step prior to gradient clipping to reduce redundant norm components while retaining the most informative directions, thereby facilitating improved convergence under differential privacy constraints. In a related line of work, Liu et al. [38] propose the DPDR framework, where current gradients are decomposed with reference to previously perturbed gradients. This design enables the privacy budget to be concentrated on newly introduced information, while reconstructed gradients are utilized for model updates, ultimately leading to enhanced utility. To further limit the impact of noise, Zhu and Blaschko [39] gradually and randomly freeze parameters during the DP-SGD process, generating sparse gradient updates to mitigate performance degradation caused by noise. Adamczewski and Park [40] either freeze a portion of the parameters before training or dynamically select a portion of the parameters during training to reduce the effective parameter space and improve the scalability of DP-SGD. Both methods rely on public datasets to select parameters to avoid additional privacy costs.
Furthermore, Shin et al. [41] use random projection [42] to reduce dimensionality and mitigate the impact of high-dimensional noise. However, randomly dropping dimensions introduces significant reconstruction errors, which may adversely affect learning performance.

2.4. Privacy Loss

Following local training iterations on each client, Gaussian noise is incorporated into the updates to ensure privacy protection. When each client samples mini-batches at the same sampling rate and uses the same noise multiplier, the privacy loss per round is identical across clients. In [10], Wang et al. quantify the total privacy loss via the Moments Accountant, which tracks the privacy loss of DP-SGD and yields relatively tight privacy bounds for the Gaussian mechanism.
In this study, we quantify privacy loss using Rényi Differential Privacy (RDP) [43], which provides a tighter and more fine-grained characterization of privacy loss than standard differential privacy. Specifically, we first compute the per-round privacy cost of the Gaussian mechanism under the RDP framework. We then compose over multiple communication rounds by leveraging the additive property of RDP. Finally, the cumulative RDP privacy loss is translated into $(\varepsilon, \delta)$-differential privacy.
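The three accounting steps above can be sketched as follows. This is a minimal illustration, not the paper's accountant: it assumes the plain (non-subsampled) Gaussian mechanism, for which a noise multiplier $\sigma$ gives $(\alpha, \alpha / (2\sigma^2))$-RDP, and the standard conversion $\varepsilon = \varepsilon_{\mathrm{RDP}} + \log(1/\delta)/(\alpha - 1)$. The function names and the grid of orders $\alpha$ are hypothetical, and any privacy amplification from mini-batch sampling is ignored.

```python
import math

def gaussian_rdp(alpha, sigma):
    """RDP cost of one Gaussian-mechanism release with noise multiplier sigma."""
    return alpha / (2.0 * sigma ** 2)

def rdp_to_dp(rdp_eps, alpha, delta):
    """Convert an (alpha, rdp_eps)-RDP guarantee into (eps, delta)-DP."""
    return rdp_eps + math.log(1.0 / delta) / (alpha - 1.0)

def total_epsilon(sigmas, delta, alphas=range(2, 129)):
    """Compose per-round costs additively, then pick the best order alpha.

    `sigmas` holds one noise multiplier per communication round, so a
    round-dependent (adaptive) noise schedule can be passed directly.
    """
    best = math.inf
    for alpha in alphas:
        rdp_sum = sum(gaussian_rdp(alpha, s) for s in sigmas)  # additive composition
        best = min(best, rdp_to_dp(rdp_sum, alpha, delta))
    return best

# Hypothetical usage: 10 rounds at sigma = 1.0, target delta = 1e-5.
eps_10 = total_epsilon([1.0] * 10, delta=1e-5)
```

Because composition is a plain sum of per-round costs, rounds with a smaller noise multiplier contribute more to the total, which is why a decaying noise schedule has to be tracked round by round.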

3. Preliminary Knowledge

In this section, we present essential background concepts, including federated learning and differential privacy.

3.1. Federated Learning

Federated learning (FL) is a representative distributed learning framework in which a central server coordinates with N clients to jointly optimize a global model. During this process, clients only upload model updates instead of sharing raw data, thereby mitigating privacy risks to some extent. The FL training process mainly consists of the following two stages:
  • Local training stage
    In communication round $t$, each client initializes its local model with the global model $w^{t-1}$ distributed by the server (with $w^{0}$ denoting the initial model in round 0), and then performs local training on its dataset $D_k$ (for $k = 1, \dots, n$).
    With the sampling rate $q$, the client selects a mini-batch $B_t$ and computes the gradient for each sample $x_i$ in the batch, denoted by
    $$g^{t-1}(x_i) \leftarrow \nabla_{w^{t-1}} L(w^{t-1}, x_i) \tag{1}$$
    Once gradients have been computed for all samples within the mini-batch, the client averages them and performs one gradient descent step to update its local model:
    $$w^{t} = w^{t-1} - \eta \frac{1}{|B|} \sum_{i \in B} g^{t-1}(x_i) \tag{2}$$
  • Server aggregation stage
    After completing local updates, the $n$ clients participating in round $t$ upload their local models $w_k^{t}$ to the server. The server aggregates them by a weighted average according to the local dataset sizes $|D_k|$ to obtain the next global model:
    $$w^{t+1} \leftarrow \sum_{k=1}^{n} \frac{|D_k|}{|D|} w_k^{t} \tag{3}$$
    The server then distributes the updated global model to all clients to start the next communication round.
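The two stages can be illustrated with a small NumPy sketch. It assumes that models and gradients are flat vectors and that the per-sample gradients have already been computed. The helper names `local_step` and `aggregate` are hypothetical and serve only to mirror the local update and the weighted average described above.

```python
import numpy as np

def local_step(w, per_sample_grads, lr):
    """Average the per-sample gradients of a mini-batch and take one descent step."""
    g_mean = np.mean(per_sample_grads, axis=0)
    return w - lr * g_mean

def aggregate(client_models, client_sizes):
    """Weighted average of client models with weights |D_k| / |D|."""
    sizes = np.asarray(client_sizes, dtype=float)
    weights = sizes / sizes.sum()
    return np.tensordot(weights, np.stack(client_models), axes=1)

# Toy example: one local step, then aggregation of two clients.
w = np.zeros(3)
grads = np.array([[1.0, 0.0, 2.0],
                  [3.0, 0.0, 0.0]])
w_local = local_step(w, grads, lr=0.1)
w_global = aggregate([np.ones(3), 3 * np.ones(3)], client_sizes=[100, 300])
```

In the toy example the second client holds three quarters of the data, so its model dominates the aggregate.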

3.2. Differential Privacy

Differential privacy is a rigorous privacy notion that provides a formal framework for quantifying the privacy loss incurred by data analysis procedures and for guaranteeing the confidentiality of individual data. Intuitively, it requires that modifying any single record in the training dataset should not lead to a statistically significant change in the output distribution of the algorithm. We next recall the relevant definitions.
Definition 1 (Approximate Differential Privacy [44]). A randomized algorithm $A$ is said to satisfy $(\varepsilon, \delta)$-differential privacy if, for any pair of neighboring datasets $D$ and $D'$ that differ in exactly one data point, and for any event $\omega \subseteq \Omega$ in the output space $\Omega$,
$$\Pr[A(D) \in \omega] \le e^{\varepsilon} \Pr[A(D') \in \omega] + \delta \tag{4}$$
Here, $\varepsilon$ is the privacy budget, and $\delta$ is the failure probability, meaning that with probability at least $1 - \delta$ the algorithm achieves privacy guarantees comparable to pure $\varepsilon$-DP.
In this paper, we adopt Rényi differential privacy (RDP) [43] as the measure of privacy loss. RDP can be viewed as a relaxation of $\varepsilon$-DP, defined based on Rényi divergence.
Definition 2 (Rényi Divergence [43]). Given two probability distributions $F$ and $G$, the Rényi divergence of order $\alpha > 1$ is defined as
$$D_{\alpha}(F \,\|\, G) = \frac{1}{\alpha - 1} \log \mathbb{E}_{x \sim G}\left[\left(\frac{F(x)}{G(x)}\right)^{\alpha}\right] \tag{5}$$
where $\mathbb{E}_{x \sim G}$ denotes expectation with respect to $G$, and $F(x)$ and $G(x)$ are the density values of $F$ and $G$ at point $x$, respectively.
Definition 3 (Rényi Differential Privacy [43]). A randomized mechanism $f : \mathcal{D} \to \mathcal{R}$ is said to satisfy $(\alpha, \varepsilon)$-RDP if, for all neighboring datasets $D$ and $D'$,
$$D_{\alpha}\big(f(D) \,\|\, f(D')\big) \le \varepsilon \tag{6}$$
Definition 4 (Sensitivity [45]). For a query function $f : \mathcal{D} \to \mathcal{R}$, its sensitivity is defined as
$$\Delta f = \max_{D, D'} \| f(D) - f(D') \| \tag{7}$$
where the maximum is taken over all neighboring datasets $D$ and $D'$, and $\| \cdot \|$ denotes either the $L_1$ or $L_2$ norm.
In this work, we adopt the Gaussian mechanism based on $L_2$-sensitivity. Specifically, we add zero-mean Gaussian noise with variance $\sigma^2 (\Delta f)^2$ to each coordinate of $f(D)$ to ensure differential privacy.
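As a minimal sketch of this mechanism, the snippet below bounds the $L_2$ norm of a vector by clipping (so that the clipping bound can play the role of $\Delta f$) and then adds per-coordinate Gaussian noise with standard deviation $\sigma \Delta f$. The helper names `clip_l2` and `gaussian_mechanism` are hypothetical, and using the clipping bound as the sensitivity is an assumption borrowed from the usual DP-SGD setting.

```python
import numpy as np

def clip_l2(v, c):
    """Scale v so that its L2 norm is at most c (vectors inside the ball are unchanged)."""
    norm = np.linalg.norm(v)
    return v * min(1.0, c / max(norm, 1e-12))

def gaussian_mechanism(value, sensitivity, sigma, rng):
    """Add zero-mean Gaussian noise with std sigma * sensitivity to every coordinate."""
    return value + rng.normal(0.0, sigma * sensitivity, size=value.shape)

rng = np.random.default_rng(0)
g = np.array([3.0, 4.0])             # L2 norm 5
g_clipped = clip_l2(g, c=1.0)        # rescaled to norm 1
g_private = gaussian_mechanism(g_clipped, sensitivity=1.0, sigma=1.0, rng=rng)
```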

4. Method

In this section, we first introduce gradient selection with dynamic sparsity, followed by adaptive noise control for sparse gradients, and then describe the AIGU-DPFL algorithm that combines these two methods. Our framework is shown in Figure 4.

4.1. Gradient Selection with Dynamic Sparsity

In federated learning, each client applies gradient descent using its local dataset. To ensure privacy, DP-SGD [10] clips the gradients of each sample, then aggregates these gradients over a mini-batch, adds noise to the clipped gradients, and finally updates the model parameters using the noisy average gradient. According to the federated learning protocol, the updated parameters are then sent to the server for aggregation. However, uniformly injecting noise across all dimensions leads to a large amount of noise being added to unimportant coordinates, thus weakening important gradient signals [46]. Therefore, we address this problem by focusing updates on highly informative coordinates in each round. Furthermore, in practice, we found that training with a constant high sparsity rate leads to slow convergence and poor performance. To address these issues, we propose a gradient selection technique with dynamic sparsity incorporated into local training to balance sparsity with the expressive power of model updates.

4.1.1. Gradient Selection

Unlike standard DP-SGD [10], which injects noise uniformly across all parameter dimensions, our central insight is that not all coordinates contribute equally to model updates. Accordingly, at each iteration, we assign an importance score to every parameter dimension based on the magnitude of its gradient, as it directly reflects the instantaneous contribution of each parameter to the optimization objective. Alternative importance measures, such as gradient variance, second-order information (Hessian-based metrics), or historical update statistics, may capture richer information about parameter sensitivity. However, we do not adopt them for two reasons. First, these approaches typically incur additional computational and communication overhead, which can be challenging in federated learning settings. Second, in the context of differential privacy, more complex importance measures may involve additional data-dependent computations, potentially increasing privacy overhead or complicating privacy accounting. In contrast, gradient-magnitude-based selection can be efficiently integrated into the training process and is commonly used in gradient sparsification methods [19,20,23]. We then construct a sparse gradient mask that preserves the most informative gradients while suppressing the unimportant ones. This sparsification effectively compresses the update space and enhances the utility of the injected noise.
More specifically, we begin by vectorizing all trainable parameters of the neural network into a $d$-dimensional representation. Let
$$w = [w_1, w_2, \dots, w_d] \tag{8}$$
denote the vectorized model parameters, where each element corresponds to an individual trainable weight or bias term. Given a loss function $L(w)$, the associated gradient vector can be expressed as
$$g = \nabla L(w) = \left[\frac{\partial L}{\partial w_1}, \frac{\partial L}{\partial w_2}, \dots, \frac{\partial L}{\partial w_d}\right] \tag{9}$$
where each element is the partial derivative of the loss with respect to a specific parameter, reflecting the sensitivity of the loss function to variations in that parameter.
At iteration $t$, we evaluate an importance metric for each dimension $j$ as
$$s_j = |g_{t,j}|, \quad j \in \{1, \dots, d\} \tag{10}$$
where $g_{t,j}$ is the $j$-th coordinate of the clipped gradient at iteration $t$. These scores are aggregated into an importance vector
$$s = (s_1, s_2, \dots, s_d) \in \mathbb{R}^d \tag{11}$$
We then sort all dimensions in descending order of importance to obtain a ranked index list
$$I_{\text{sort}}^{t} = (j_1, j_2, \dots, j_d) \quad \text{s.t.} \quad s_{j_1} \ge s_{j_2} \ge \dots \ge s_{j_d} \tag{12}$$
Let $r \in (0, 1]$ be the parameter selection ratio. The top $rd$ coordinates form the selected index set $I_{s}^{t}$, while the remaining coordinates form the unselected set $I_{ns}^{t}$. The selected coordinates are retained and updated, while the unselected ones are masked out in subsequent training to reduce noise injection.
Based on this, we construct a binary importance mask vector $m^{t} \in \{0, 1\}^d$, whose $j$-th entry is defined as
$$m_j^{t} = \begin{cases} 1, & j \in I_{s}^{t}, \\ 0, & j \in I_{ns}^{t} \end{cases} \tag{13}$$
Finally, we apply the dynamically generated mask to the gradient $g^{t}$ to obtain a sparse gradient vector:
$$g_{\text{sparse}}^{t} = g^{t} \odot m^{t} \tag{14}$$
where $\odot$ denotes the Hadamard (element-wise) product. The sparse gradient will be used for subsequent adaptive noise addition and local model updates. By preserving only the gradient information of important coordinates and setting the gradients of redundant coordinates to zero, we reduce the dimensionality of subsequent noise injection and concentrate the privacy budget on key parameters.
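The selection procedure can be sketched in a few lines of NumPy. Two details are assumptions, because the text does not fix them: the number of kept coordinates is rounded up to an integer with a minimum of one, and ties are broken by index order. The function name `importance_mask` is hypothetical.

```python
import numpy as np

def importance_mask(g, r):
    """Binary mask keeping the top-r fraction of coordinates by |g_j|."""
    d = g.size
    k = max(1, int(np.ceil(r * d)))            # assumption: round up, keep at least one
    scores = np.abs(g)                          # importance score s_j = |g_j|
    top = np.argsort(-scores, kind="stable")[:k]  # indices sorted by descending score
    mask = np.zeros(d, dtype=g.dtype)
    mask[top] = 1
    return mask

g = np.array([0.05, -0.9, 0.0, 0.4])   # stands in for a clipped gradient
m = importance_mask(g, r=0.5)          # keeps the two largest magnitudes
g_sparse = g * m                       # Hadamard product with the mask
```

Only the coordinates at indices 1 and 3 survive here, so any noise added afterwards needs to cover two dimensions instead of four.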

4.1.2. Dynamic Sparsity

To avoid potential degradation of model capacity caused by a fixed selection rate $r$, we adopt a dynamic selection schedule that gradually and smoothly restores model capacity during training. Specifically, the selection rate at round $t$ is defined as
$$r_t = r_0 + \Delta r \cdot \frac{t}{T} \tag{15}$$
where $r_0 \in (0, 1)$ is the initial selection rate, $\Delta r \in (0, 1 - r_0]$ controls the total growth of the selection rate over the entire training process, and $T$ is the total number of communication rounds. As training progresses, more coordinates are gradually reintroduced into the update process according to the increase in $r_t$, effectively expanding the set of trainable parameters. This schedule allows the model to focus on the most critical coordinates in the early stage, and gradually balance sparsity and capacity as training proceeds. In the experimental section, we analyze the impact of $r_0$ and $\Delta r$ on performance.
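The linear schedule is simple enough to state directly in code. The sketch below also checks the stated parameter ranges. The function name `selection_rate` and the example values are hypothetical and not taken from the experiments.

```python
def selection_rate(t, total_rounds, r0, delta_r):
    """Selection rate at round t: r0 at t = 0, rising linearly to r0 + delta_r at t = T."""
    if not (0.0 < r0 < 1.0):
        raise ValueError("r0 must lie in (0, 1)")
    if not (0.0 < delta_r <= 1.0 - r0):
        raise ValueError("delta_r must lie in (0, 1 - r0]")
    return r0 + delta_r * t / total_rounds

# Hypothetical setting: start with half of the coordinates, end with three quarters.
rates = [selection_rate(t, 100, r0=0.5, delta_r=0.25) for t in (0, 50, 100)]
```

Choosing `delta_r = 1 - r0` makes the final round dense, while smaller values keep some sparsity until the end of training.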

4.2. Adaptive Noise Control

Although gradient selection reduces the dimensionality of noise injection, using a fixed noise level $\sigma$ is still not optimal. Andrew et al. [25] also showed that fixed parameters cannot adapt to the evolving state of the model at different training stages. As shown in Figure 3, the $\ell_2$ norm of the gradient gradually decreases as training progresses. Due to this non-stationarity, isotropic noise with a constant scale can severely impair training accuracy. Furthermore, the gradient norm is closely related to the local smoothness of the loss function [47], meaning that it simultaneously reflects the dynamic optimization process during training. In the early stages of training, the impact of noise on the model gradient is relatively small, and the learning signal is large enough to tolerate stronger perturbations. As training progresses, the gradient norm decreases, and the relative impact of noise on the model increases. In the later stages of training, excessive noise can severely hinder model convergence.
To adapt to this dynamic behavior and improve the effectiveness of sparse gradient updates, we propose an adaptive noise control mechanism that adjusts the current noise scale based on the historical average gradient norm. Research by Loshchilov and Hutter [48] shows that dynamically adjusting hyperparameters using smooth nonlinear functions often achieves better convergence and generalization than traditional linear or step-wise scheduling methods. Inspired by this, we use a hyperbolic tangent function to establish a mapping between the noise multiplier and the average gradient norm, as follows:
$$\sigma_t = \sigma_0 \cdot \tanh\left(\frac{\bar{n}_t}{\lambda}\right) \tag{16}$$
where $\sigma_0$ is the initial noise multiplier, $\lambda > 0$ is a scaling factor, and $\bar{n}_t$ is the historical average of the gradient norms aggregated from the clients. To ensure privacy, we perturb the gradient norm with Gaussian noise before transmission. Since the gradient norm is a single scalar, this allows us to obtain accurate estimates with a high signal-to-noise ratio while using only a minimal portion of the privacy budget, which has a negligible impact on overall privacy accounting. Furthermore, considering fluctuations in the gradient norm that may result from random sampling and injected noise, we use an exponential moving average (EMA) [49] to stabilize $\bar{n}_t$, which also better captures long-term training trends. Since this is a post-processing step, it does not cause additional privacy leakage.
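A hedged sketch of this control loop follows: each reported norm is perturbed with scalar Gaussian noise, smoothed with an EMA, and mapped to a noise multiplier through the tanh function. The helper names, the EMA coefficient `beta`, the norm-noise scale `norm_noise_std`, and the clamping of noisy norms at zero are all assumptions made for illustration.

```python
import math
import numpy as np

def privatize_norm(norm, norm_noise_std, rng):
    """Perturb a scalar gradient norm with Gaussian noise; clamp at zero (post-processing)."""
    return max(0.0, norm + rng.normal(0.0, norm_noise_std))

def ema_update(prev, new, beta=0.9):
    """Exponential moving average; the first observation initializes the average."""
    return new if prev is None else beta * prev + (1.0 - beta) * new

def noise_multiplier(sigma0, avg_norm, lam):
    """Map the smoothed gradient norm to a noise multiplier via tanh."""
    return sigma0 * math.tanh(avg_norm / lam)

rng = np.random.default_rng(0)
avg = None
for raw_norm in [8.0, 4.0, 1.0, 0.2]:          # toy sequence of shrinking norms
    noisy = privatize_norm(raw_norm, norm_noise_std=0.05, rng=rng)
    avg = ema_update(avg, noisy)
    sigma_t = noise_multiplier(sigma0=1.0, avg_norm=avg, lam=1.0)
```

The EMA and the tanh mapping only touch the already-noised norms, which is what allows them to be treated as post-processing.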

Analysis of the Tanh-Based Noise Mapping (See Appendix A)

We analyze the properties of the hyperbolic tangent function in Equation (16) in the context of adaptive noise control. Specifically, the mapping provides a bounded, smooth, and monotonic transformation from the gradient norm to the noise scale. These properties are desirable for noise adaptation. The bounded output of $\tanh(\cdot)$ ensures that the noise multiplier remains within a controlled range, preventing excessive perturbation that may degrade model utility. Its smooth and continuous behavior avoids abrupt changes in the noise level, leading to more stable optimization dynamics. The monotonicity of the function ensures that the noise level varies consistently with the gradient norm, avoiding situations where small gradients are perturbed by disproportionately large noise.
Moreover, the nonlinearity of the tanh function enables robust and adaptive noise control throughout training. In the early stage, the average gradient norm is typically much larger than $\lambda$, so $\tanh(\bar{n}_t / \lambda)$ is close to 1, and the noise intensity stabilizes near $\sigma_0$, avoiding oscillations caused by rapidly changing gradient norms. As training progresses and the model starts to converge, the average gradient norm decreases, and the function gradually enters a near-linear region, allowing the noise multiplier to decay smoothly as the gradients shrink. If the rate at which the noise decreases is too slow, privacy protection becomes stronger, but persistently large noise may obscure useful gradient signals and slow convergence. Conversely, if noise decays too quickly, the model may appear more accurate in the short term, but privacy perturbation in later stages becomes insufficient, effectively consuming the privacy budget too early. In our experiments, we empirically analyze the impact of different $\lambda$ settings and observe that an appropriate choice provides a good trade-off.
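The two regimes described above follow from standard properties of the hyperbolic tangent. As a short worked illustration (writing $x = \bar{n}_t / \lambda$, a shorthand introduced here only for convenience):

```latex
% Saturated regime (early training, \bar{n}_t \gg \lambda):
\lim_{x \to \infty} \tanh(x) = 1
\quad\Longrightarrow\quad
\sigma_t \approx \sigma_0,
\qquad \text{e.g. } \tanh(3) \approx 0.995 .

% Near-linear regime (late training, \bar{n}_t \ll \lambda):
\tanh(x) = x - \frac{x^{3}}{3} + O(x^{5})
\quad\Longrightarrow\quad
\sigma_t \approx \sigma_0 \, \frac{\bar{n}_t}{\lambda} .
```

In this reading, $\lambda$ sets the gradient-norm scale at which the noise multiplier switches from being roughly constant to shrinking in proportion to the gradient norm.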

4.3. AIGU-DPFL

We integrate the two aforementioned approaches and incorporate them into a DP-FL framework, resulting in our AIGU-DPFL method, which aims to achieve optimal model utility. Specifically, sparsity is introduced by selecting only a subset of informative coordinates, which reduces the effective dimensionality of updates and avoids injecting noise into redundant parameters. At the same time, the adaptive noise mechanism adjusts the perturbation scale according to the training dynamics, ensuring that noise does not overwhelm useful gradient signals. Importantly, by restricting updates to informative coordinates, gradient selection changes the effective space where noise is applied, allowing the adaptive noise mechanism to focus perturbation on relevant dimensions. AIGU-DPFL achieves sparse and noise-effective gradient updates through the joint effect of gradient selection and adaptive noise control. Consequently, noise is concentrated on the most relevant dimensions, which improves the signal-to-noise ratio of the effective update space and enables more efficient utilization of the privacy budget under differential privacy constraints.
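To make the data flow concrete, the following sketch clips per-sample gradients, keeps a fraction of coordinates, and adds Gaussian noise only to the kept coordinates. It is an illustration under stated assumptions: the importance score used here (magnitude of the summed clipped gradient) is a hypothetical stand-in for the importance criterion of the method, and the sketch makes no privacy claim about the selection step itself.

```python
import math
import random


def clip_l2(vector, bound):
    """Scale a vector so that its L2 norm is at most `bound`."""
    norm = math.sqrt(sum(v * v for v in vector))
    scale = min(1.0, bound / norm) if norm > 0 else 1.0
    return [v * scale for v in vector]


def sparse_noisy_update(per_sample_grads, clip_bound, sigma, retention, rng):
    """Clip, select a fraction of coordinates, and perturb only the selected ones."""
    batch = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    clipped = [clip_l2(g, clip_bound) for g in per_sample_grads]
    summed = [sum(g[j] for g in clipped) for j in range(dim)]

    # Hypothetical importance score: magnitude of the summed clipped gradient.
    k = max(1, math.ceil(retention * dim))
    keep = sorted(range(dim), key=lambda j: abs(summed[j]), reverse=True)[:k]

    update = [0.0] * dim                       # dropped coordinates receive no update
    for j in keep:
        noise = rng.gauss(0.0, sigma * clip_bound)
        update[j] = (summed[j] + noise) / batch
    return update, sorted(keep)


rng = random.Random(0)
grads = [[3.0, 4.0, 0.0, 0.0], [0.0, 0.5, 0.0, 0.1]]
update, kept = sparse_noisy_update(grads, clip_bound=1.0, sigma=1.0, retention=0.5, rng=rng)
```

With `retention=0.5`, only half of the coordinates carry noise, so the total injected noise energy is halved relative to a dense update with the same `sigma`.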
Under the Rényi differential privacy (RDP) framework, the privacy guarantee of AIGU-DPFL is derived from the Gaussian mechanism applied to clipped gradients. Specifically, per-sample gradients are clipped to a fixed norm bound C, which ensures bounded sensitivity, and Gaussian noise is subsequently added to guarantee privacy. The importance-based gradient selection is performed after clipping and thus does not increase sensitivity. In addition, the adaptive noise control relies on privatized gradient norm statistics. Although this introduces an additional privacy cost, it corresponds to a low-dimensional Gaussian mechanism (on a scalar quantity), which is explicitly accounted for within the RDP composition framework and is typically negligible compared to the primary gradient perturbation. Furthermore, the adaptive noise adjustment depends only on these privatized statistics and can be viewed as a deterministic function of differentially private outputs. Therefore, it satisfies the post-processing property of differential privacy and does not introduce additional privacy leakage beyond the accounted mechanisms. Consequently, the total privacy loss is accounted for through standard RDP composition. Meanwhile, the interaction between sparsity and adaptive noise improves the signal-to-noise ratio of gradient updates, which leads to more stable convergence in our experiments.
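The claim that the scalar norm query adds little to the total privacy loss can be illustrated with basic RDP arithmetic. The sketch below uses the standard RDP bound α/(2σ²) for a Gaussian mechanism with unit sensitivity, additive composition across rounds, and the standard conversion ε = ε_RDP(α) + log(1/δ)/(α − 1). It deliberately ignores subsampling amplification, and the noise multipliers, round count, and δ are illustrative assumptions rather than settings from the experiments.

```python
import math


def gaussian_rdp(alpha, sigma):
    """RDP of order alpha for a Gaussian mechanism with unit sensitivity and noise std sigma."""
    return alpha / (2.0 * sigma ** 2)


def compose(orders, mechanisms):
    """RDP adds up at each order; `mechanisms` is a list of (sigma, number_of_uses)."""
    return {a: sum(uses * gaussian_rdp(a, sigma) for sigma, uses in mechanisms)
            for a in orders}


def rdp_to_epsilon(rdp_by_order, delta):
    """Convert RDP to (epsilon, delta)-DP by minimizing over the available orders."""
    return min(eps + math.log(1.0 / delta) / (alpha - 1.0)
               for alpha, eps in rdp_by_order.items())


orders = [1.5] + list(range(2, 65))
rounds, delta = 100, 1e-5
grad_only = compose(orders, [(8.0, rounds)])                  # gradient perturbation only
with_norm = compose(orders, [(8.0, rounds), (80.0, rounds)])  # plus a scalar norm query

eps_grad = rdp_to_epsilon(grad_only, delta)
eps_total = rdp_to_epsilon(with_norm, delta)
print(round(eps_grad, 3), round(eps_total, 3))
```

In this toy setting the norm query uses a noise multiplier ten times larger than the gradient mechanism (relative to its own sensitivity), so its RDP contribution is one hundredth of the gradient term at every order.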
Algorithm 1 outlines the pseudocode of the proposed method. To preserve the overall privacy guarantee, local training is halted once the accumulated privacy budget reaches the predefined limit. Subsequently, Gaussian noise is injected into the gradients, and the averaged noisy updates are applied to refine the model. Upon completion of local training, the resulting model is transmitted to the server for aggregation.
Algorithm 1: Pseudocode for AIGU-DPFL

5. Experiments

In this section, we perform comprehensive experimental evaluations on three benchmark datasets to assess the proposed AIGU-DPFL method with respect to model accuracy, efficiency, and computational overhead.

5.1. Experimental Description

5.1.1. Datasets and Models

  • MNIST [50]: This dataset contains 70,000 grayscale images of handwritten digits categorized into 10 classes. On this dataset, we train a convolutional neural network (CNN) derived from the LeNet architecture. The network is composed of three convolutional layers, each followed by a Sigmoid activation function, and is optimized using the Adam optimizer with a learning rate of 0.002.
  • Fashion-MNIST [51]: This dataset contains 70,000 28 × 28 grayscale images representing 10 different clothing categories. We also train a CNN derived from the LeNet architecture on this dataset, using the Adam optimizer with a learning rate of 0.001.
  • CIFAR-10 [52]: This dataset comprises 32 × 32 RGB images spanning 10 categories (e.g., ships, airplanes, dogs, and cats), with 50,000 samples used for training and 10,000 reserved for testing. We train a ResNet-18 model on this dataset. The first convolutional layer of the model uses a 3 × 3 convolution kernel and does not include a max pooling layer. Group normalization is used in the convolutional layers of the network. Training uses the SGD optimizer with momentum (Momentum = 0.9) and an initial learning rate of 0.1.

5.1.2. Baselines

As baselines, we compare AIGU-DPFL with FedAvg [1] (i.e., FL without privacy) and fixed noise injection (Fixed-DP). To further assess the effectiveness of our method, we also compare against two sparse model perturbation methods and one adaptive DP-FL method:
  • FedSMP-randk [19]: Randomly selects k coordinates from all d parameter dimensions as active dimensions, scales them by d/k, and then injects Gaussian noise only on these coordinates.
  • FedSMP-topk [19]: Selects the top k coordinates with the largest update magnitudes as active dimensions. The server generates a common top-k mask using a public dataset and broadcasts it to all clients.
  • ADAP-DPFL [26]: It adopts the l 2 -norm of the differentially private averaged gradient from the preceding round as the clipping bound for the current iteration, and adaptively tunes each client’s noise magnitude based on variations in the validation loss observed at the server.
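As an illustration of the first sparse baseline, the sketch below follows the description of FedSMP-randk given above: choose k random coordinates, scale them by d/k, and add Gaussian noise only on those coordinates. It is not the reference implementation of [19]; clipping is omitted, and the function name and noise parameterization are assumptions.

```python
import random


def randk_perturb(update, k, noise_std, rng):
    """Keep k random coordinates, scale them by d/k, and add Gaussian noise only there."""
    d = len(update)
    active = sorted(rng.sample(range(d), k))   # random active dimensions
    out = [0.0] * d                            # inactive coordinates are zeroed
    for j in active:
        out[j] = update[j] * d / k + rng.gauss(0.0, noise_std)
    return out, active


rng = random.Random(0)
sparse_update, active = randk_perturb([0.2, -0.1, 0.4, 0.05, -0.3, 0.0], k=2, noise_std=0.01, rng=rng)
```

The d/k scaling keeps the sparsified vector an unbiased estimate of the dense update when the active set is drawn uniformly at random.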
All experiments are conducted on a workstation featuring an Intel 12th Gen Core i9-12900K CPU and an NVIDIA A5000 GPU.

5.2. Accuracy Comparison Experiments

In this study, we used the MNIST, Fashion-MNIST, and CIFAR-10 datasets to evaluate the performance of AIGU-DPFL, focusing primarily on test accuracy. We compared a total of six methods and examined how several key factors affect model performance, including the noise multiplier σ , the Dirichlet heterogeneity parameter α , the number of communication rounds T between the client and server, and the number of participating clients.
Noise Multiplier: Figure 5 shows the robustness of our method under different noise multipliers. Our method maintains high accuracy even as the noise multiplier increases. In the MNIST experiments, AIGU-DPFL consistently maintains an accuracy above 90% for all tested σ values. In contrast, the performance of the other methods decreases sharply as the noise multiplier increases. On the FMNIST dataset, even at large noise multipliers, the accuracy of AIGU-DPFL still exceeds 73%, while the other methods perform significantly worse. On the CIFAR-10 dataset, AIGU-DPFL maintains relatively stable accuracy across all σ values, demonstrating strong noise resistance compared with the other methods.
Non-IID Data: To better simulate real-world federated learning environments, all experiments are conducted in a non-independent and identically distributed (non-IID) setting. Specifically, we use a Dirichlet distribution to assign labels among clients. Figure 6 illustrates the impact of variations in the Dirichlet heterogeneity parameter α on accuracy. A smaller α value indicates stronger heterogeneity, corresponding to a more imbalanced label distribution among clients. In this context, AIGU-DPFL demonstrates excellent generalization ability. On the MNIST dataset, when α = 0.8 , AIGU-DPFL performs comparably to Non-DP, achieving an accuracy of 95.96%, while the other methods perform significantly worse. On the FMNIST dataset, AIGU-DPFL shows better convergence than Fixed-DP; when α = 0.5 , the accuracy of AIGU-DPFL is 6.72% higher than that of Fixed-DP. On the CIFAR-10 dataset, as α decreases, the adverse effects of data heterogeneity become more pronounced, but AIGU-DPFL still exhibits stronger performance and a more stable convergence trend. These results confirm that AIGU-DPFL is adaptive and robust when handling non-IID data.
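A common way to realize such a Dirichlet label split is sketched below. This is an assumption about the partitioning procedure, which the text does not spell out: for each class, client shares are drawn from a symmetric Dirichlet(α) distribution and the class's samples are divided accordingly. Function names and the toy label list are illustrative.

```python
import random


def dirichlet_sample(alpha, size, rng):
    """Draw from a symmetric Dirichlet(alpha) by normalizing Gamma(alpha, 1) variates."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [x / total for x in draws]


def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Split sample indices among clients, class by class, using Dirichlet shares."""
    rng = random.Random(seed)
    clients = [[] for _ in range(num_clients)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        shares = dirichlet_sample(alpha, num_clients, rng)
        start, cumulative = 0, 0.0
        for c, share in enumerate(shares):
            cumulative += share
            # The last client takes the remainder so every sample is assigned exactly once.
            end = len(idx) if c == num_clients - 1 else round(cumulative * len(idx))
            clients[c].extend(idx[start:end])
            start = end
    return clients


labels = [i % 10 for i in range(1000)]          # toy labels: 100 samples per class
parts = dirichlet_partition(labels, num_clients=5, alpha=0.5)
print([len(p) for p in parts])
```

Smaller values of `alpha` concentrate each class on fewer clients, which corresponds to the stronger heterogeneity described above.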
Communication Rounds: Figure 7 compares the impact of the number of communication rounds T on the test accuracy across the three datasets. We observe that AIGU-DPFL maintains strong model performance throughout the training process. For example, after 1000 communication rounds with the server, AIGU-DPFL achieves an accuracy of over 95% on the MNIST dataset and nearly 80% on the FMNIST dataset. In contrast, the other methods fail to remain competitive during training.
Client Count Stability: Figure 8 shows the change in model accuracy as the number of clients increases. The experimental results indicate that AIGU-DPFL exhibits strong stability under this condition. For example, on the MNIST dataset, when the number of clients reaches 50, the accuracy of the other four methods decreases, while AIGU-DPFL still achieves an accuracy of 96.80%. This shows that its performance does not change significantly with the number of clients. This property is particularly valuable in practical deployments, as the number of participating clients in real-world environments is often uncertain and may vary over time.

5.3. Hyperparameter Analysis

To evaluate the impact of three hyperparameters (initial retention rate r 0 , growth span Δ r , and scaling factor λ ) on training accuracy, we conducted tests on three different datasets.

5.3.1. Hyperparameter Sensitivity Analysis ( r 0 , Δ r )

We investigated the effects of the initial retention rate r 0 and growth span Δ r on model accuracy; the results are shown in Figure 9. Both hyperparameters have a significant impact on accuracy, and this impact is closely related to the sparsity introduced by the gradient selection mechanism. For example, on the FMNIST dataset, when both r 0 and Δ r are in the range of 0.4 to 0.5, the test accuracy remains high and stable throughout the training process. Notably, all models employing importance-based selection outperform the dense baseline model (i.e., the model without introduced sparsity), indicating that moderate sparsity can enhance the model’s generalization ability. This effect is particularly significant when r 0 = 0.4 and Δ r = 0.5 , where the model reaches a peak accuracy of 80.83% in the later stages of training and achieves the best balance between accuracy and sparsity. Conversely, excessively small r 0 and Δ r lead to a significant drop in accuracy, indicating that over-pruning severely limits the model’s expressive power. In the MNIST and CIFAR-10 experiments, the best performance was achieved with ( r 0 = 0.1 , Δ r = 0.9 ) and ( r 0 = 0.4 , Δ r = 0.4 ), respectively. These configurations effectively remove redundant or low-information parameters while preserving crucial gradient dimensions, thus promoting convergence.
To further examine sensitivity at the boundary of the hyperparameter range, we extended the experiments to the region where r 0 + Δ r > 1 , which corresponds to the model transitioning to full parameter updates early in training. On the FMNIST dataset, we observed that the model still maintains good performance in this region. For example, when r 0 = 0.5 and Δ r = 0.8 , the model achieved a test accuracy of 80.71%. This behavior mainly stems from the model entering the full update phase earlier, thus mitigating the convergence slowdown typically introduced by sparse gradient selection.
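The role of r 0 and Δ r can be illustrated with a simple schedule. The exact schedule of the method is defined earlier in the paper and is not reproduced here; the sketch below assumes, purely for illustration, a linear growth from r 0 by Δ r over training, capped at 1. The function name and round counts are hypothetical.

```python
def retention_rate(r0, delta_r, t, total_rounds):
    """Assumed linear schedule: r_t = min(1, r0 + delta_r * t / total_rounds)."""
    progress = min(max(t / total_rounds, 0.0), 1.0)
    return min(1.0, r0 + delta_r * progress)


total_rounds = 1000
for t in [0, 250, 500, 750, 1000]:
    inside = retention_rate(0.4, 0.5, t, total_rounds)    # r0 + delta_r < 1: stays sparse
    boundary = retention_rate(0.5, 0.8, t, total_rounds)  # r0 + delta_r > 1: reaches 1 early
    print(t, round(inside, 3), round(boundary, 3))
```

Under this assumed schedule, a setting with r 0 + Δ r > 1 hits the cap before training ends, after which every coordinate is updated, which is the early transition to full updates discussed above.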

5.3.2. Hyperparameter Sensitivity Analysis ( λ )

We evaluated the impact of the scaling factor ( λ ) on training accuracy. This parameter scales the average gradient norm before it is used to update the noise multiplier, thereby controlling the sensitivity of noise decay. A smaller λ makes the noise intensity more sensitive to changes in the gradient norm, leading to rapid noise decay in the early stages of training, while a larger λ reduces the magnitude of noise changes with training dynamics, approximating a more stable noise injection mechanism. In our experiments, we set λ to 0.1, 0.3, 1, 3, 10, 30, and 50, and evaluated the model’s performance on different datasets. As shown in Figure 10, setting λ to 10 yields the best performance on the MNIST and Fashion-MNIST datasets. Specifically, when λ is 10, the noise multiplier can better adapt to the optimization dynamics during training, effectively suppressing training oscillations caused by excessively rapid decay in the early stages, while preventing strong noise from masking the effective signal in the later stages. This indicates that adjusting the scaling factor appropriately can effectively improve the model’s accuracy. Similarly, on the CIFAR-10 dataset, the best model utility is obtained when λ = 30 .
However, as the complexity of training datasets and models increases, determining the optimal range of hyperparameters becomes more challenging, and the search space expands accordingly.

5.4. Ablation Experiments

In this section, we carry out comprehensive ablation studies to verify the effectiveness of each component within our training framework, the AIGU-DPFL method, which comprises two main modules: (1) an importance-based gradient selection strategy (GS) and (2) an adaptive noise mechanism (AN). We decompose AIGU-DPFL into four variants:
  • Default: Neither GS nor AN. This setup is used as the reference model in our comparative evaluation.
  • Default + GS: Only GS is retained, while AN is disabled. This setting allows us to evaluate the effectiveness of the importance-based gradient selection strategy.
  • Default + AN: Only AN is retained, while GS is disabled. This setup is used to evaluate the effectiveness of our adaptive noise mechanism.
  • AIGU-DPFL: The complete framework combining the two modules.
We assess these four variants across three datasets under three privacy budget settings, taking the global model accuracy on the central server as the primary evaluation metric, and perform a detailed analysis to examine the role and influence of each module within the algorithm. The outcomes are presented in Table 1. When only the gradient selection (GS) is retained, accuracy is improved across all privacy budgets compared to the baseline model. The improvement is most significant on the Fashion-MNIST dataset with ε = 4 , where GS alone improves accuracy by 2.28%. This indicates that selecting important gradient coordinates reduces noise interference in non-critical dimensions and focuses the privacy budget on task-relevant directions. The implementation of the adaptive noise mechanism (AN) also improves accuracy across various privacy budgets. This demonstrates that adaptively adjusting the noise level better adapts to the dynamics of training and effectively mitigates the drawbacks of excessive perturbation in the early stages with fixed noise and convergence obstacles in the later stages. When both modules are activated simultaneously, AIGU-DPFL consistently attains the best performance across all evaluated settings, typically exceeding the sum of the individual module improvements. For example, on the CIFAR-10 dataset with ε = 4 , AIGU-DPFL achieves an accuracy of 58.14%, 1.63% higher than the second-best result (56.51%), indicating a synergistic amplification effect and good compatibility between the two methods. Furthermore, experimental data clearly demonstrate that removing these two modules leads to a significant drop in model accuracy, highlighting the critical role of each component within the overall framework.
In summary, GS and AN, through their integrated mechanism of jointly filtering key gradient dimensions and adaptively adjusting noise amplitude, significantly suppress perturbations to the update direction. They exhibit complementary advantages on datasets of varying complexity and under different privacy budgets. Activating either module individually outperforms the default settings, while joint activation produces a synergistic effect, ensuring that AIGU-DPFL consistently maintains optimal performance and validating the effectiveness of GS and AN.

5.5. Efficiency and Overhead Analysis

This group of experiments evaluates the number of communication rounds that AIGU-DPFL requires to reach a target accuracy, compared with other algorithms. For a clearer comparison, we present the experimental results on the MNIST and Fashion-MNIST datasets. Table 2 shows that on the Fashion-MNIST dataset, AIGU-DPFL reaches 70% accuracy in only 135 rounds, which is 3.37 times and 3.05 times fewer than FedSMP-randk and FedSMP-topk, respectively, and 3.67 times fewer than ADAP-DPFL. This advantage stems from our proposed importance-based gradient selection mechanism, which improves the signal-to-noise ratio of global updates and thereby accelerates convergence.
Table 3 further evaluates the time consumption under the condition of 1000 rounds on both datasets. The results show that AIGU-DPFL is more time-efficient than the FedSMP-randk and FedSMP-topk algorithms, and slightly better than ADAP-DPFL. This demonstrates the time efficiency advantage of our method.
In terms of communication overhead, Table 3 provides a comparison of the number of parameters transmitted per iteration for one client. Fixed-DP and ADAP-DPFL require transmitting the full model in each iteration, resulting in comparable communication costs under the same model size and number of iterations. In contrast, FedSMP-randk and FedSMP-topk adopt sparsification strategies, where only a fraction of coordinates are communicated in each iteration, leading to reduced communication costs. For AIGU-DPFL, the communication overhead is determined by the dynamically adjusted retention ratio. By selecting only informative coordinates for transmission, AIGU-DPFL effectively reduces the number of transmitted parameters while maintaining sufficient model expressiveness, enabling a more flexible trade-off between communication efficiency and model performance.
Table 3 compares the memory footprint of different methods, measured as the peak GPU memory per client during local training. The highest memory usage is observed in FedSMP-randk and FedSMP-topk, due to the additional index tensors and temporary buffers required for sparse updates. In contrast, Fixed-DP and ADAP-DPFL exhibit lower memory consumption, as they operate on dense gradients without introducing extra data structures. AIGU-DPFL incurs a slightly higher memory consumption than Fixed-DP and ADAP-DPFL, as it performs coordinate selection on gradient tensors and maintains dynamic control of the retention ratio and noise scale during training. These operations are applied directly to existing gradients and do not introduce additional model-sized allocations, resulting in a limited overall memory overhead.

6. Discussion

Despite AIGU-DPFL’s robust performance, certain aspects still warrant further optimization. First, although the dynamic sparsity scheduling mechanism effectively restores model capacity over time, the local training phase for each client currently follows a generic paradigm and has not yet been tailored to the data distribution of each specific client. Prior research in personalized federated learning suggests that customizing local update strategies based on individual client characteristics can further accelerate convergence and enhance accuracy in heterogeneous environments. Consequently, integrating client-adaptive optimization techniques into AIGU-DPFL is a promising avenue for future research.
Second, while our method maintains stable performance in scenarios involving 50 clients, deploying AIGU-DPFL in large-scale systems—comprising thousands or even millions of clients—may introduce additional challenges in terms of memory usage and communication overhead. In such settings, the communication overhead in federated learning generally scales approximately linearly with the number of participating clients, as each client transmits its local updates to the server in each communication round. As the client population grows, the aggregated communication cost increases proportionally. Future work will investigate hierarchical aggregation schemes, in which client updates are first aggregated at edge servers and then synchronized at the global level [53], as well as semi-asynchronous coordination that limits stale updates through bounded-delay scheduling [54], to improve scalability under large client populations. To further reduce communication overhead without performance degradation, model distillation-based methods can be adopted, which reduce the communication burden by exchanging compact knowledge representations instead of full gradient updates, thereby significantly decreasing the amount of transmitted information while preserving task-relevant knowledge [55]. From the perspective of model dimensionality, these challenges can be further amplified in high-dimensional models, where the size of gradient updates and communication cost grow with the number of parameters. AIGU-DPFL partially alleviates this issue by restricting updates to informative coordinates, thereby reducing the effective update dimensionality. Future work will investigate structured sparsification with parameter grouping [56] to bound per-round communication and computation costs in high-dimensional settings, following established sparsified training approaches.
Furthermore, despite the reduction in noise applied to non-critical gradients, the approach still requires a careful distribution of the privacy budget. Under stringent privacy constraints, achieving an optimal balance between sparsity induction and noise injection may require further fine-tuning. Future work may investigate adaptive strategies for privacy budget allocation, where the budget is dynamically assigned according to the relative importance of specific parameters or gradients throughout training. Such mechanisms have the potential to improve both privacy efficiency and model performance in sensitive application scenarios.
Despite the above limitations, the proposed AIGU-DPFL method offers significant practical advantages in real-world federated learning deployments where strict data privacy, high model utility, and communication efficiency are simultaneously demanded. For example, in healthcare systems, medical data are distributed across hospitals and cannot be shared due to privacy regulations. Our method enables collaborative model training while protecting sensitive patient information through differential privacy. At the same time, the importance-based gradient selection reduces communication overhead by transmitting only a subset of informative parameters, which is beneficial in bandwidth-limited environments. Similarly, in financial services, where user data are highly sensitive, the adaptive noise mechanism allows flexible control of the privacy–utility trade-off, ensuring strong privacy guarantees while maintaining model performance. In edge-based Internet of Things (IoT) applications, where devices often have limited computational and communication resources, the combination of sparse updates and adaptive noise helps reduce resource consumption and improve training efficiency under heterogeneous conditions.

7. Conclusions

This paper proposes an adaptive differentially private federated learning method, AIGU-DPFL, which effectively mitigates the performance degradation typically induced by differential privacy mechanisms in federated learning. Specifically, AIGU-DPFL enhances model utility by dynamically selecting the most informative gradient coordinates and adaptively adjusting the noise scale based on historical gradient norms. Extensive empirical results show that AIGU-DPFL surpasses existing approaches in both model accuracy and operational efficiency.

Author Contributions

The authors confirm contribution to the paper as follows: F.S. conceived the study, developed the software, performed investigation and formal analysis, and drafted the original manuscript; Z.C. co-conceived the study, acquired funding, provided resources, supervised the project, and reviewed and edited the manuscript; Y.M. curated the data and contributed to drafting the original manuscript; Y.L. (Yuhang Liu) reviewed and edited the manuscript; L.F. produced visualizations and participated in investigation; Y.L. (Yanlong Lu) provided resources and supervised the project. All authors have read and agreed to the published version of the manuscript.

Funding

This paper is jointly supported by the National Natural Science Foundation of China (Grant No. 62302540, author F.F.S.; https://www.nsfc.gov.cn (accessed on 24 April 2026)) and the Key Research and Development Program of Henan Province (Grant No. 251111212000, author F.F.S.; http://xt.hnkjt.gov.cn/data/ (accessed on 24 April 2026)).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

We only use publicly available datasets. The MNIST dataset can be found at http://yann.lecun.com/exdb/mnist (accessed on 24 April 2026). The Fashion-MNIST dataset can be found at https://github.com/zalandoresearch/fashion-mnist (accessed on 24 April 2026). The CIFAR-10 dataset can be found at http://www.cs.toronto.edu/~kriz/cifar.html (accessed on 24 April 2026).

Acknowledgments

The authors would like to thank the anonymous referees for their valuable comments and helpful suggestions.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Appendix A.1. Noise Mapping Functions

As training progresses, gradient magnitudes typically decrease, while a fixed noise level may dominate the optimization process, especially in later stages. This mismatch can significantly degrade model utility by overwhelming useful gradient signals.
To address this issue, the noise magnitude should adapt to the evolving training dynamics. Zhang et al. [57] introduced a linearly decaying noise multiplier to alleviate the adverse effects of fixed noise injection throughout training. Building upon this idea, Chilukotia et al. [58] introduced an inverse-time mapping strategy, in which the noise multiplier follows a rational decay form as training progresses. Exponential mapping provides another representative strategy, in which the noise multiplier follows an exponential function of the training round. Each mapping function yields a distinct noise adjustment behavior throughout training. The noise multiplier mapping functions discussed above are summarized in Table A1.
Table A1. Types of noise multiplier mapping functions. σ 0 denotes the initial noise multiplier, t represents the training or communication round, and α , β , and γ are hyperparameters controlling the decay rate of different mapping functions.
Mapping Function    Mathematical Expression
Linear              $\sigma_t = \sigma_0 - \alpha t$
Inverse-time        $\sigma_t = \frac{\sigma_0}{1 + \beta t}$
Exponential         $\sigma_t = \sigma_0 e^{-\gamma t}$
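For reference, the three round-dependent schedules in Table A1 can be written as short functions. The sketch assumes the decaying forms σ_t = σ_0 − αt, σ_t = σ_0 / (1 + βt), and σ_t = σ_0 e^{−γt}, consistent with the caption's description of α, β, and γ as decay-rate hyperparameters; the function names and the sample hyperparameter values are illustrative.

```python
import math


def linear_decay(sigma0, t, alpha):
    """sigma_t = sigma0 - alpha * t (can become non-positive for large t)."""
    return sigma0 - alpha * t


def inverse_time_decay(sigma0, t, beta):
    """sigma_t = sigma0 / (1 + beta * t)."""
    return sigma0 / (1.0 + beta * t)


def exponential_decay(sigma0, t, gamma):
    """sigma_t = sigma0 * exp(-gamma * t)."""
    return sigma0 * math.exp(-gamma * t)


sigma0 = 1.0
for t in [0, 10, 50, 100]:
    print(t,
          round(linear_decay(sigma0, t, alpha=0.005), 4),
          round(inverse_time_decay(sigma0, t, beta=0.01), 4),
          round(exponential_decay(sigma0, t, gamma=0.01), 4))
```

All three depend only on the round index t, whereas the Tanh-based mapping compared in Appendix A.2 depends on the observed gradient norm.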

Appendix A.2. Comparison of Noise Mapping Functions

To evaluate the impact of different noise mapping functions, we compare the Tanh-based mapping in Equation (16) with the linear, inverse-time, and exponential mappings listed in Table A1. The experimental results on the Fashion-MNIST dataset are reported in Table A2, where evaluations are conducted under privacy budgets ϵ = 2 , ϵ = 4 , and ϵ = 8 .
In Table A2, the labels Linear, Inverse-time, and Exponential correspond to the respective noise mapping functions defined in Table A1, while Tanh denotes the gradient-dependent mapping specified in Equation (16). It can be observed that the Tanh-based mapping consistently achieves the highest accuracy across all privacy levels. Specifically, under ϵ = 2 , the Tanh mapping yields an accuracy of 47.48%, which is higher than Linear (46.96%), Inverse-time (47.01%), and Exponential (46.64%). Similar trends are observed for ϵ = 4 and ϵ = 8 , where the proposed mapping maintains a consistent advantage. This improvement arises because the Tanh-based mapping adapts the noise level based on gradient information, thereby preventing the noise from overwhelming the gradients in later training stages where gradient magnitudes become small.
Table A2. Accuracy of different noise mappings on the Fashion-MNIST dataset under different privacy budgets.
                  Accuracy
Privacy Budget    Linear    Inverse-Time    Exponential    Tanh
ϵ = 2             46.96%    47.01%          46.64%         47.48%
ϵ = 4             62.92%    62.95%          62.60%         63.33%
ϵ = 8             74.61%    74.59%          74.47%         74.82%

References

  1. McMahan, H.B.; Moore, E.; Ramage, D.; Hampson, S.; Agüera y Arcas, B. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 20–22 April 2017; PMLR: Red Hook, NY, USA, 2017; Volume 54, pp. 1273–1282. [Google Scholar]
  2. Long, G.; Tan, Y.; Jiang, J. Federated learning for open banking. In Federated Learning: Privacy and Incentive; Springer: Cham, Switzerland, 2020; pp. 240–254. [Google Scholar]
  3. Taimoor, N.; Rehman, S. Reliable and resilient AI and IoT-based personalized healthcare services: A survey. IEEE Access 2021, 10, 535–563. [Google Scholar] [CrossRef]
  4. Al-Rubaie, M.; Chang, J.M. Reconstruction attacks against mobile-based continuous authentication systems in the cloud. IEEE Trans. Inf. Forensics Secur. 2016, 11, 2648–2663. [Google Scholar] [CrossRef]
  5. Shokri, R.; Stronati, M.; Song, C.; Shmatikov, V. Membership inference attacks against machine learning models. In Proceedings of the 2017 IEEE Symposium on Security and Privacy (SP), San Jose, CA, USA, 22–26 May 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 3–18. [Google Scholar]
  6. Salem, A.; Zhang, Y.; Humbert, M.; Berrang, P.; Fritz, M. ML-Leaks: Model and data independent membership inference attacks and defenses. arXiv 2018, arXiv:1806.01246. [Google Scholar] [CrossRef]
  7. Melis, L.; Song, C.; De Cristofaro, E.; Shmatikov, V. Exploiting unintended feature leakage in collaborative learning. In Proceedings of the 2019 IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA, 19–23 May 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 691–706. [Google Scholar]
  8. Dwork, C. Differential privacy. In International Colloquium on Automata, Languages, and Programming; Springer: Berlin/Heidelberg, Germany, 2006; pp. 1–12. [Google Scholar]
  9. Dwork, C.; Roth, A. The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci. 2014, 9, 211–407. [Google Scholar] [CrossRef]
  10. Wang, Z.; Yu, X.; Huang, Q.; Gong, Y. An adaptive differential privacy method based on federated learning. arXiv 2024, arXiv:2408.08909. [Google Scholar]
  11. Geyer, R.C.; Klein, T.; Nabi, M. Differentially private federated learning: A client level perspective. arXiv 2017, arXiv:1712.07557. [Google Scholar]
  12. Choudhury, O.; Gkoulalas-Divanis, A.; Salonidis, T.; Sylla, S.; Das, G. Differential privacy-enabled federated learning for sensitive health data. arXiv 2019, arXiv:1910.02578. [Google Scholar]
  13. Zheng, Q.; Chen, S.; Long, Q.; Su, W. Federated f-differential privacy. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Virtual, 13–15 April 2021; PMLR: Red Hook, NY, USA, 2021; pp. 2251–2259. [Google Scholar]
  14. McMahan, H.B.; Ramage, D.; Talwar, K.; Zhang, L. Learning differentially private recurrent language models. arXiv 2017, arXiv:1710.06963. [Google Scholar]
  15. Wei, K.; Li, J.; Ding, M.; Ma, C.; Yang, H.H.; Farokhi, F.; Jin, S.; Quek, T.Q.S.; Poor, H.V. Federated learning with differential privacy: Algorithms and performance analysis. IEEE Trans. Inf. Forensics Secur. 2020, 15, 3454–3469. [Google Scholar] [CrossRef]
  16. Shokri, R.; Shmatikov, V. Privacy-preserving deep learning. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, Denver, CO, USA, 12–16 October 2015; pp. 1310–1321. [Google Scholar]
  17. Jayaraman, B.; Evans, D. Evaluating differentially private machine learning in practice. In Proceedings of the 28th USENIX Security Symposium, Santa Clara, CA, USA, 14–16 August 2019; pp. 1895–1912. [Google Scholar]
  18. Xiang, L.; Yang, J.; Li, B. Differentially-private deep learning from an optimization perspective. In Proceedings of the IEEE INFOCOM, Paris, France, 29 April–2 May 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 559–567. [Google Scholar]
  19. Hu, R.; Guo, Y.; Gong, Y. Federated learning with sparsified model perturbation: Improving accuracy under client-level differential privacy. IEEE Trans. Mob. Comput. 2023, 23, 8242–8255. [Google Scholar] [CrossRef]
  20. Wang, G.; Qi, Q.; Han, R.; Bai, L.; Choi, J. P2CEFL: Privacy-preserving and communication efficient federated learning with sparse gradient and dithering quantization. IEEE Trans. Mob. Comput. 2024, 23, 14722–14736. [Google Scholar] [CrossRef]
  21. Wei, K.; Li, J.; Ma, C.; Ding, M.; Shu, F.; Zhao, H.; Zhu, H. Gradient sparsification for efficient wireless federated learning with differential privacy. Sci. China Inf. Sci. 2024, 67, 142303. [Google Scholar] [CrossRef]
  22. Medjadji, C.; Alawadi, S.; Awaysheh, F.M.; Leduc, G.; Kubler, S.; Le Traon, Y. FedSparQ: Adaptive sparse quantization with error feedback for robust and efficient federated learning. In Proceedings of the 3rd International Conference on Federated Learning Technologies and Applications (FLTA), Valencia, Spain, 10–12 March 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 364–372. [Google Scholar]
  23. Huang, C.; Li, Y.; Zhao, Y.; Du, X.; Huang, J. AdaS-FLDP: Local differentially private federated learning with adaptive sparsification. J. Supercomput. 2025, 81, 1109. [Google Scholar] [CrossRef]
  24. Li, Y.; Yang, S.; Ren, X.; Shi, L.; Zhao, C. Multi-stage asynchronous federated learning with adaptive differential privacy. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 46, 1243–1256. [Google Scholar] [CrossRef]
  25. Andrew, G.; Thakkar, O.; McMahan, B.; Ramaswamy, S. Differentially private learning with adaptive clipping. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–14 December 2021; Volume 34, pp. 17455–17466. [Google Scholar]
  26. Fu, J.; Chen, Z.; Han, X. Adap DP-FL: Differentially private federated learning with adaptive noise. In Proceedings of the 2022 IEEE International Conference on Trust, Security and Privacy in Computing and Communications, Wuhan, China, 17–19 December 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 656–663. [Google Scholar]
  27. Ayeelyan, J.; Utomo, S.; Rouniyar, A.; Hsu, H.C.; Hsiung, P.A. Federated learning design and functional models: Survey. Artif. Intell. Rev. 2025, 67, 142303. [Google Scholar] [CrossRef]
  28. Sana, T.Z.; Abdulla, S.; Nag, A.; Das, A.; Hassan, M.M.; Fiza, Z.Z.; Karim, A.; Kabir, S.R.R. Advancing federated learning: A systematic literature review of methods, challenges, and applications. IEEE Access 2025, 13, 153817–153844. [Google Scholar] [CrossRef]
  29. Xu, Z.; Shi, S.; Liu, A.X.; Zhao, J.; Chen, L. An adaptive and fast convergent approach to differentially private deep learning. In Proceedings of the IEEE INFOCOM 2020, Virtual, 6–9 July 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1867–1876. [Google Scholar]
  30. Yu, L.; Liu, L.; Pu, C.; Gursoy, M.E.; Truex, S. Differentially private model publishing for deep learning. In Proceedings of the 2019 IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA, 19–23 May 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 332–349. [Google Scholar]
  31. Lee, J.; Kifer, D. Concentrated differentially private gradient descent with adaptive per-iteration privacy budget. In Proceedings of the 24th ACM SIGKDD Conference, London, UK, 19–23 August 2018; pp. 1656–1665. [Google Scholar]
  32. Pichapati, V.; Suresh, A.T.; Yu, F.X.; Reddi, S.J.; Kumar, S. AdaClip: Adaptive clipping for private SGD. arXiv 2019, arXiv:1908.07643. [Google Scholar]
  33. Xue, R.; Xue, K.; Zhu, B.; Luo, X.; Zhang, T.; Sun, Q.; Lu, J. Differentially private federated learning with an adaptive noise mechanism. IEEE Trans. Inf. Forensics Secur. 2023, 19, 74–87. [Google Scholar] [CrossRef]
  34. Wei, W.; Liu, L. Gradient leakage attack resilient deep learning. IEEE Trans. Inf. Forensics Secur. 2021, 17, 303–316. [Google Scholar] [CrossRef]
  35. Tramer, F.; Boneh, D. Differentially private learning needs better features (or much more data). arXiv 2020, arXiv:2011.11660. [Google Scholar]
  36. Oyallon, E.; Zagoruyko, S.; Huang, G.; Komodakis, N.; Lacoste-Julien, S.; Blaschko, M.; Belilovsky, E. Scattering networks for hybrid representation learning. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 2208–2221. [Google Scholar] [CrossRef]
  37. Sha, H.; Liu, R.; Liu, Y.; Chen, H. PCDP-SGD: Improving the convergence of differentially private SGD via projection in advance. arXiv 2023, arXiv:2312.03792. [Google Scholar] [CrossRef]
  38. Liu, Y.; Xiong, L.; Liu, Y.; Gu, Y.; Liu, R.; Chen, H. DPDR: Gradient decomposition and reconstruction for differentially private deep learning. arXiv 2024, arXiv:2406.02744. [Google Scholar] [CrossRef]
  39. Zhu, J.; Blaschko, M.B. Improving differentially private SGD via randomly sparsified gradients. arXiv 2021, arXiv:2112.00845. [Google Scholar]
  40. Adamczewski, K.; Park, M. Differential privacy meets neural network pruning. arXiv 2023, arXiv:2303.04612. [Google Scholar] [CrossRef]
  41. Shin, H.; Kim, S.; Shin, J.; Xiao, X. Privacy enhanced matrix factorization for recommendation with local differential privacy. IEEE Trans. Knowl. Data Eng. 2018, 30, 1770–1782. [Google Scholar] [CrossRef]
  42. Kirszbraun, M.D. Extensions of Lipschitz mappings into a Hilbert space. Ann. Math. 1934, 45–76. [Google Scholar]
  43. Mironov, I. Rényi differential privacy. In Proceedings of the 2017 IEEE 30th Computer Security Foundations Symposium (CSF), Santa Barbara, CA, USA, 21–25 August 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 263–275. [Google Scholar]
  44. Dwork, C.; McSherry, F.; Nissim, K.; Smith, A. Our data, ourselves: Privacy via distributed noise generation. In Advances in Cryptology—EUROCRYPT; Springer: Berlin/Heidelberg, Germany, 2006; pp. 486–503. [Google Scholar]
  45. Dwork, C.; McSherry, F.; Nissim, K.; Smith, A. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference; Springer: Berlin/Heidelberg, Germany, 2006; pp. 265–284. [Google Scholar]
  46. Chen, L.; Ding, X.; Li, M.; Jin, H. Differentially private federated learning with importance client sampling. IEEE Trans. Consum. Electron. 2023, 70, 3635–3649. [Google Scholar] [CrossRef]
  47. Zhang, J.; He, T.; Sra, S.; Jadbabaie, A. Why gradient clipping accelerates training: A theoretical justification for adaptivity. arXiv 2019, arXiv:1905.11881. [Google Scholar]
  48. Loshchilov, I.; Hutter, F. SGDR: Stochastic gradient descent with warm restarts. arXiv 2016, arXiv:1608.03983. [Google Scholar]
  49. Morales-Brotons, D.; Vogels, T.; Hendrikx, H. Exponential moving average of weights in deep learning: Dynamics and benefits. arXiv 2024, arXiv:2411.18704. [Google Scholar] [CrossRef]
  50. Deng, L. The MNIST database of handwritten digit images for machine learning research. IEEE Signal Process. Mag. 2012, 29, 141–142. [Google Scholar] [CrossRef]
  51. Xiao, H.; Rasul, K.; Vollgraf, R. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv 2017, arXiv:1708.07747. [Google Scholar] [CrossRef]
  52. Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images; Technical Report; University of Toronto: Toronto, ON, Canada, 2009. [Google Scholar]
  53. Qiu, C.; Wu, Z.; Wang, H.; Yang, Q.; Wang, Y.; Su, C. Hierarchical aggregation for federated learning in heterogeneous IoT scenarios: Enhancing privacy and communication efficiency. Future Internet 2025, 17, 18. [Google Scholar] [CrossRef]
  54. Liu, P.; Jia, L.; Xiao, Y. Participant selection for efficient and trusted federated learning in blockchain-assisted hierarchical federated learning architectures. Future Internet 2025, 17, 75. [Google Scholar] [CrossRef]
  55. He, Z.; Zhu, G.; Zhang, S.; Luo, E.; Zhao, Y. FedDT: A communication-efficient federated learning via knowledge distillation and ternary compression. Electronics 2025, 14, 2183. [Google Scholar] [CrossRef]
  56. Lin, Y.; Han, S.; Mao, H.; Wang, Y.; Dally, W.J. Deep gradient compression: Reducing the communication bandwidth for distributed training. arXiv 2017, arXiv:1712.01887. [Google Scholar]
  57. Zhang, X.; Ding, J.; Wu, M.; Wong, S.T.; Van Nguyen, H.; Pan, M. Adaptive privacy preserving deep learning algorithms for medical data. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2021; pp. 1169–1178. [Google Scholar]
  58. Chilukoti, S.V.; Hossen, M.I.; Shan, L.; Tida, V.S.; Bappy, M.M.; Tian, W.; Hei, X. DP-SGD-global-adapt-V2-S: Triad improvements of privacy, accuracy and fairness via step decay noise multiplier and step decay upper clipping threshold. Electron. Commer. Res. Appl. 2025, 70, 101476. [Google Scholar] [CrossRef]
Figure 1. A typical federated learning training process with potentially malicious server and participants.
Figure 2. Gradient distribution at Epoch 20.
Figure 3. Gradient variation across clients and over rounds.
Figure 4. Overall framework of AIGU-DPFL.
Figure 5. Test accuracy of six federated learning algorithms on MNIST, Fashion-MNIST, and CIFAR-10 under varying noise levels σ, where larger σ indicates stronger noise perturbation.
Figure 6. Test accuracy of six federated learning algorithms on MNIST, Fashion-MNIST, and CIFAR-10 under varying Dirichlet distribution parameter α, where smaller α indicates higher data heterogeneity.
Figure 7. Test accuracy of six federated learning algorithms on MNIST, Fashion-MNIST, and CIFAR-10 over communication rounds T, where larger T corresponds to more training iterations.
Figure 8. Test accuracy of six federated learning algorithms on MNIST, Fashion-MNIST, and CIFAR-10 with varying numbers of clients, where more clients typically increase system heterogeneity.
Figure 9. Parameter analysis of $r_0$ and $\Delta r$ on the MNIST, Fashion-MNIST, and CIFAR-10 datasets. Subfigures (a–c) correspond to MNIST, Fashion-MNIST, and CIFAR-10, respectively.
Figure 10. Impact of the scale factor on AIGU-DPFL on the MNIST, Fashion-MNIST, and CIFAR-10 datasets.
Table 1. Ablation study results. (In the table, ✔ denotes that the corresponding module is enabled, while × indicates that it is not included. Bold values represent the best accuracy for each dataset under different privacy budgets (ε), while underlined values represent the second-best accuracy.)

| Dataset | Modules (GS, AN) | ε = 2 | ε = 4 | ε = 8 |
|---|---|---|---|---|
| MNIST | ×, × | 53.79% | 81.05% | 89.70% |
| MNIST | one ✔, one × | 54.18% | 82.41% | 90.58% |
| MNIST | one ✔, one × | 54.31% | 81.84% | 90.18% |
| MNIST | ✔, ✔ | 54.86% | 83.48% | 91.31% |
| Fashion-MNIST | ×, × | 44.37% | 59.19% | 72.45% |
| Fashion-MNIST | one ✔, one × | 46.14% | 61.47% | 73.71% |
| Fashion-MNIST | one ✔, one × | 45.48% | 60.48% | 73.07% |
| Fashion-MNIST | ✔, ✔ | 47.48% | 63.33% | 74.82% |
| CIFAR-10 | ×, × | 37.86% | 54.82% | 71.12% |
| CIFAR-10 | one ✔, one × | 39.45% | 56.51% | 72.56% |
| CIFAR-10 | one ✔, one × | 38.56% | 55.62% | 71.94% |
| CIFAR-10 | ✔, ✔ | 40.60% | 58.14% | 73.98% |
Table 2. Rounds required to achieve target accuracy on MNIST and Fashion-MNIST.

| Dataset | Condition | FedSMP-Randk | FedSMP-Topk | ADAP-DPFL | AIGU-DPFL | Non-DP |
|---|---|---|---|---|---|---|
| MNIST | 85% accuracy | 740 | 762 | 674 | 144 | 78 |
| Fashion-MNIST | 70% accuracy | 455 | 412 | 496 | 135 | 62 |
Table 3. Computation overhead, communication overhead, and memory footprint (MB) on MNIST and Fashion-MNIST. $\Sigma$ denotes the total number of trainable parameters in the model. $p = k / \Sigma$ is the compression ratio in FedSMP, where only $k$ coordinates are transmitted in the sparse update. $\bar{r} = \frac{1}{T} \sum_{t=0}^{T-1} r_t$ is the average retention ratio in AIGU-DPFL, where $r_t = r_0 + \Delta r \cdot \frac{t}{T}$.

| Method | Dataset | Total Time (T = 1000) | Communication | Memory Footprint (MB) |
|---|---|---|---|---|
| FedSMP-randk | MNIST | 1138.2106 s | $(1 + p) \times \Sigma$ | 240 ± 3 |
| FedSMP-randk | Fashion-MNIST | 1161.9058 s | $(1 + p) \times \Sigma$ | 247 ± 4 |
| FedSMP-topk | MNIST | 1289.4473 s | $(1 + p) \times \Sigma$ | 245 ± 2 |
| FedSMP-topk | Fashion-MNIST | 1337.2689 s | $(1 + p) \times \Sigma$ | 258 ± 4 |
| Adap-DPFL | MNIST | 1071.1427 s | $2 \times \Sigma$ | 228 ± 2 |
| Adap-DPFL | Fashion-MNIST | 1084.0948 s | $2 \times \Sigma$ | 234 ± 2 |
| Fixed-DP | MNIST | 1064.4618 s | $2 \times \Sigma$ | 223 ± 2 |
| Fixed-DP | Fashion-MNIST | 1079.1273 s | $2 \times \Sigma$ | 231 ± 3 |
| AIGU-DPFL | MNIST | 1069.2195 s | $(1 + \bar{r}) \times \Sigma$ | 237 ± 2 |
| AIGU-DPFL | Fashion-MNIST | 1080.8642 s | $(1 + \bar{r}) \times \Sigma$ | 249 ± 3 |
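
The retention schedule and communication term in the Table 3 caption can be checked numerically. The following is a minimal sketch, not the authors' implementation: it evaluates the linear schedule r_t = r_0 + Δr · t/T, averages it over rounds t = 0, ..., T − 1, and compares the result with the closed form r_0 + Δr · (T − 1)/(2T) that follows from summing the arithmetic series. The function names and the values r_0 = 0.5, Δr = 0.3, and 100,000 parameters are hypothetical and chosen only for illustration; T = 1000 matches the table header.

```python
def retention_ratio(t: int, T: int, r0: float, delta_r: float) -> float:
    """Retention ratio r_t = r0 + delta_r * t / T for round t."""
    return r0 + delta_r * t / T


def average_retention_ratio(T: int, r0: float, delta_r: float) -> float:
    """Average of r_t over rounds t = 0, ..., T - 1."""
    return sum(retention_ratio(t, T, r0, delta_r) for t in range(T)) / T


def closed_form_average(T: int, r0: float, delta_r: float) -> float:
    """Same average, using sum_{t=0}^{T-1} t = T * (T - 1) / 2."""
    return r0 + delta_r * (T - 1) / (2 * T)


def communication_cost(num_params: int, avg_ratio: float) -> float:
    """Communication term (1 + r_bar) * Sigma from Table 3, in parameters."""
    return (1 + avg_ratio) * num_params


if __name__ == "__main__":
    T = 1000                 # number of rounds, as in Table 3
    r0, delta_r = 0.5, 0.3   # hypothetical schedule values
    num_params = 100_000     # hypothetical model size (Sigma)

    r_bar = average_retention_ratio(T, r0, delta_r)
    print("average retention ratio:", r_bar)
    print("closed form:", closed_form_average(T, r0, delta_r))
    print("AIGU-DPFL communication:", communication_cost(num_params, r_bar))
    print("dense baseline (2 * Sigma):", 2 * num_params)
```

Under these assumed values the average retention ratio stays below 1, so the (1 + r̄) × Σ term is smaller than the 2 × Σ listed for the dense baselines in the table.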
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Shan, F.; Chen, Z.; Mao, Y.; Liu, Y.; Fan, L.; Lu, Y. AIGU-DPFL: Adaptive Differentially Private Federated Learning with Importance-Based Gradient Updates. Computers 2026, 15, 288. https://doi.org/10.3390/computers15050288
