Convergence Analysis for Differentially Private Federated Averaging in Heterogeneous Settings

Li, Yiwei; Wang, Shuai; Wu, Qilong

doi:10.3390/math13030497

Open AccessArticle

Convergence Analysis for Differentially Private Federated Averaging in Heterogeneous Settings^†

by

Yiwei Li

¹

,

Shuai Wang

²

and

Qilong Wu

^1,*

¹

Fujian Key Laboratory of Communication Network and Information Processing, Xiamen University of Technology, Xiamen 361024, China

²

National Key Laboratory of Wireless Communications, University of Electronic Science and Technology of China, Chengdu 611731, China

^*

Author to whom correspondence should be addressed.

^†

This article is an expanded version of a paper entitled “Secure federated averaging algorithm with differential privacy”, which was presented at IEEE International Workshop on Machine Learning for Signal Processing (MLSP), Espoo, Finland, 17–20 September 2020.

Mathematics 2025, 13(3), 497; https://doi.org/10.3390/math13030497

Submission received: 15 January 2025 / Revised: 26 January 2025 / Accepted: 27 January 2025 / Published: 2 February 2025

(This article belongs to the Section E1: Mathematics and Computer Science)

Download

Browse Figures

Versions Notes

Abstract

Federated learning (FL) has emerged as a prominent approach for distributed machine learning, enabling collaborative model training while preserving data privacy. However, the presence of non-i.i.d. data and the need for robust privacy protection introduce significant challenges in theoretically analyzing the performance of FL algorithms. In this paper, we present novel theoretical analysis on typical differentially private federated averaging (DP-FedAvg) by judiciously considering the impact of non-i.i.d. data on convergence and privacy guarantees. Our contributions are threefold: (i) We introduce a theoretical framework for analyzing the convergence of DP-FedAvg algorithm by considering different client sampling and data sampling strategies, privacy amplification and non-i.i.d. data. (ii) We explore the privacy–utility tradeoff and demonstrate how client strategies interact with differential privacy to affect learning performance. (iii) We provide extensive experimental validation using real-world datasets to verify our theoretical findings.

Keywords:

federated learning; convergence analysis; privacy analysis; data heterogeneity

MSC:

68W15

1. Introduction

With the rapid growth of the Internet of Things (IoT) in recent years, the number of intelligent devices has surged globally. Many of these devices are equipped with various sensors and increasingly powerful hardware, enabling them to collect and process data at unprecedented scales [1,2,3]. However, this growth has been accompanied by rising concerns about data privacy. Federated learning (FL) has emerged as a key solution in this context, which enables collaborative machine learning (ML) by allowing a parameter server (PS) to coordinate the training of a global model across distributed clients, thus eliminating the need to centralize sensitive data [4,5]. However, FL systems may involve millions of clients, and clients collect their own data based on their local environments and usage patterns; the size and distribution of local datasets vary significantly across clients [6]. This variation leads to data heterogeneity, as data generated by different clients are typically non-independent and identically distributed (non-i.i.d.) [7]. Data heterogeneity is also reflected in the diverse dynamics exhibited by different devices and their communication network.

Federated Averaging (FedAvg), proposed by [5], is one of the most popular FL algorithms. In FedAvg, a small subset of clients is randomly selected to perform local model updates at each communication round, which are then aggregated by the PS to update the global model. Due to its simplicity and effectiveness, FedAvg serves as the foundation for most subsequent FL algorithms. Despite its widespread adoption, the convergence behavior of FedAvg under data heterogeneity remains poorly understood. Theoretical results, such as those in [8,9], suggest that data heterogeneity significantly impacts FedAvg, requiring many more communication rounds to converge as client gradients diverge. These findings are consistent with observations on synthetic or artificially partitioned datasets [10,11]. However, in many real-world FL tasks, FedAvg often performs remarkably well [5], contrary to these theoretical predictions. Furthermore, advanced methods designed to address data heterogeneity, such as SCAFFOLD [12], demonstrate significant advantages on synthetic heterogeneous datasets but exhibit similar performance to FedAvg on realistic datasets. This inconsistency between theory and real-world performance raises questions about the practical relevance of data heterogeneity and the adequacy of existing theoretical models for real-world FL systems.

On the other hand, FedAvg still faces significant challenges, particularly in ensuring robust privacy protections. Data privacy in FL is inherently at risk during the exchange of model parameters between clients and the PS, as adversaries may attempt to infer sensitive information from these exchanges [13]. Differential privacy (DP) has emerged as a key tool to address this risk by introducing calibrated noise into the learning process [14], offering mathematically rigorous privacy guarantees. However, the added DP noise can negatively affect model accuracy and algorithm convergence, requiring a careful balance between privacy and utility [15,16].

Currently, there are numerous works focused on analyzing the impact of data heterogeneity in FedAvg [17,18]. However, to the best of our knowledge, none of the existing works have provided a complete theoretical analysis on FedAvg for handling data heterogeneity together with the consideration of client and data sampling strategies as well as privacy amplification analysis, hence motivating us to develop an advanced theoretical framework for a DP-based FedAvg algorithm (DP-FedAvg) over non-i.i.d. data.

1.1. Related Work

This section provides a focused review of the literature to contextualize our work within the broader scope of FL research:

FL in the presence of data heterogeneity: A substantial body of research has tackled the challenges posed by non-i.i.d. data in FL. Studies such as [6,19] indicate that the convergence rate of FL models in non-i.i.d. settings is significantly lower than in i.i.d. scenarios. These works also provide theoretical analyses suggesting that effective client selection schemes can improve convergence rates in non-i.i.d. scenarios. Many existing FL frameworks [5,11,20] rely on random client selection strategies [19]. Additionally, some works [21,22] consider device heterogeneity by taking into account factors like wireless channel capacity and computational power. However, a unified theoretical analysis that examines both non-i.i.d. data effects and client sampling strategies in a single framework remains unexplored.
FL with privacy considerations: Privacy protection has become a critical focus in the FL system. The works [23,24] have explored DP as a robust mechanism to protect data privacy in FL, demonstrating its effectiveness in achieving privacy guarantees without incurring excessive computational overhead. Numerous privacy protection techniques based on DP algorithm in FL have been investigated, including DP-SGD [25], DP-FedAvg [1,26] and the DP-based primal-dual method (DP-PDM) [27], where additive noise is applied to local gradients/models. Notably, the impact of this additive noise on learning performance can be quantified through convergence analysis, and total privacy loss can be readily tracked during the training process. However, few of these studies consider privacy amplification techniques [28,29] to reduce the adverse effect of DP noise, which could potentially improve the privacy–utility tradeoff.

To the best of our knowledge, none of the existing works have fully explored the interplay between DP noise and non-i.i.d. data on the convergence of the DP-FedAvg algorithm. Although various attempts have been made to integrate DP into FL algorithms, such as DP-SGD [1,26], DP-Prox [30] and DP-SCAFFOLD [31], the privacy–utility tradeoffs achieved by these methods remain suboptimal in non-i.i.d. settings, especially, they did not include the privacy amplification analyses. This poor privacy–utility tradeoff can be attributed to both data heterogeneity and the added DP noise, as existing FL works do not comprehensively integrate DP noise, privacy amplification, data heterogeneity, client selection and data sampling strategies within one work. This work addresses these gaps, providing robust privacy protections through DP while offering strong theoretical guarantees.

1.2. Contributions

The main contributions are summarized as follows:

Theoretical framework: We propose a novel theoretical framework to analyze the convergence of DP-FedAvg under various client sampling strategies, with a particular focus on the impact of privacy protection and non-i.i.d. data. This framework enables us to quantify how client sampling methods and data heterogeneity influence the algorithm’s convergence rate and privacy guarantees.
Privacy–utility tradeoff: We explore the tradeoff between privacy protection and learning performance, taking into account the effects of privacy amplification. Our theoretical results demonstrate how various system parameters influence this tradeoff and highlights that the privacy protection level and gradient clipping play a crucial role in determining the privacy–utility tradeoff in practical FL systems.
Empirical validation: We empirically validate our theoretical findings through extensive experiments on real-world datasets, demonstrating the behavior of DP-FedAvg under non-i.i.d. data conditions.

Synopsis: For ease of the ensuing presentation, all the mathematical notations used are listed in Table 1. Section 2 introduces the preliminaries of FL, DP and the existing DP-FedAvg algorithm. Section 3 provides the convergence analysis and privacy analysis for the DP-FedAvg algorithm. Experimental results are presented in Section 4, and Section 5 concludes the paper.

2. Preliminaries

2.1. Federated Learning

We consider a vanilla FL system that consists of one parameter server (PS) and N clients in which each client has its local dataset

D_{k} : = \{(x_{k, j}, y_{k, j}), \forall k \in [N], j \in [n_{k}]\}

, where

n_{k}

and

n = \sum_{k = 1}^{N} n_{k}

denote the number of k-th client’s training samples and its number of training samples, respectively.

x_{k, j}

denotes the training data sample and

y_{k, j}

is the corresponding label, while

[N]

denotes the set

{1, 2, \dots, N}

. The FL algorithm is to solve the following distributed optimization problem:

\begin{matrix} min_{w} \{F (w) ≜ \sum_{k = 1}^{N} p_{k} F_{k} (w)\} \end{matrix}

(1)

where

w

is the global model and

p_{k} = n_{k} / n

is the weight of the k-th client and

\sum_{k = 1}^{N} p_{k} = 1

; the

F_{k}

is the local objective function for client k, which is defined by

\begin{matrix} F_{k} (w) = \frac{1}{n_{k}} \sum_{j = 1}^{n_{k}} ℓ (w; x_{k, j}, y_{k, j}), \end{matrix}

(2)

where

ℓ (\cdot)

is a user-specified loss function. Problem (1) can be solved using the well-known FedAvg algorithm [5]. Specifically, FedAvg involves T communication rounds, and each round contains the following steps:

1.: First, at the beginning of round $t \in {0, \dots, T - 1}$ , the PS selects a subset of clients $S_{t} \subseteq [N]$ to participate and sends them the latest global model $w_{t}$ .
2.: Each client $i \in S_{t}$ initializes its local model $w_{t + 1}^{k}$ to be the global model $w_{t}$ and then performs Q iterations of stochastic gradient descent (SGD) on its local dataset.

$\begin{matrix} w_{t + 1}^{k} = & w_{t}, \end{matrix}$

(3)

$\begin{matrix} w_{t + j}^{k} = & w_{t + j - 1}^{k} - η_{t + j - 1} (\nabla F_{k} (w_{t + j - 1}^{k}, B_{t + j - 1}^{k})), j \in [Q], k \in S_{t}, \end{matrix}$

(4)

where $η_{t + j - 1}$ represents the learning rate, and $B_{t + j - 1}^{k} \subseteq D^{k}$ is the mini-batch data with $| B_{t + j - 1}^{k} | = b$ ; $\nabla F_{k} (w_{t + j - 1}^{k}, B_{t + j - 1}^{k})$ denotes the stochastic gradient.
3.: Each client $i \in S_{t}$ uploads the final local model update $w_{t + Q}^{k}$ to the PS.
4.: The PS aggregates the local model updates from all participating clients and improves the global model as

\begin{matrix} w_{t + Q} & = \sum_{k \in S_{t}} p_{k} w_{t + Q}^{k} . \end{matrix}

(5)

The same procedure repeats for the next round until the algorithm converges.

2.2. Differential Privacy

Although FedAvg avoids the direct information leakage by keeping the data locally, the intermediate updates exchanged during the collaboration process such as

w_{t + Q}^{k}

and

w_{t + Q}

could still leak private information about the local data as demonstrated in recent advanced attacks such as model inversion attacks and membership attacks [32]. Figure 1 depicts the FL system model under attacks from an adversary. The adversary in this context can be either an “honest-but-curious” aggregation PS or clients within the system. The aggregation PS is assumed to honestly follow the designed training protocol but may be curious about clients’ private data and attempt to infer it from the shared messages. To provide strong privacy protection for local data, we adopt

(ϵ, δ)

-DP, which is defined as follows.

Definition 1

(

(ϵ, δ)

-DP [14]). A randomized mechanism

M

satisfies

(ϵ, δ)

-DP if for any two adjacent datasets

D

and

D^{'}

differing in only one record, and for any subsets of outputs

S \subseteq range (M)

,

\begin{matrix} P [M (D) \in S] \leq exp (ϵ) \cdot P [M (D^{'}) \in S] + δ, \end{matrix}

(6)

where

ϵ > 0

accounts for privacy protection level, and

δ \geq 0

is the probability threshold to break

(ϵ, 0)

-DP.

For a query function

g : X \to R

, the

(ϵ, δ)

-DP mechanism can be implemented by adding artificial Gaussian noise to function g as follows [14]:

\begin{matrix} M (D) ≜ g (D) + N (0, Δ^{2} \cdot σ^{2}), \end{matrix}

(7)

where

Δ

is the sensitivity of function g, defined by

\begin{matrix} Δ = max_{D, D^{'}} ∥ g (D) - g (D^{'}) ∥, \end{matrix}

(8)

in which

D, D^{'} \subseteq X

are neighboring datasets.

σ^{2}

denotes the minimal required “noise scale” for achieving

(ϵ, δ)

-DP given by the following lemma in [14].

Lemma 1.

Suppose that the query function g accesses the dataset

D

via randomized mechanism

M

. Then,

M

satisfies

(ϵ, δ)

-DP if

\begin{matrix} σ^{2} \geq \frac{2 Δ^{2} ln (1.25 / δ)}{ϵ^{2}} . \end{matrix}

(9)

Definition 2

(Privacy loss [14]). Suppose that a randomized mechanism

M

satisfies

(ϵ, δ)

-DP. Let

D

and

D^{'}

be two neighboring datasets and

o

be a possible output of

M (D)

and

M (D^{'})

. The privacy loss is defined by

\begin{matrix} c (o) = ln (\frac{P [M (D) = o]}{P [M (D^{'}) = o]}) . \end{matrix}

(10)

It has been known that, running on a randomly generated subset of a dataset, the DP mechanism can yield stronger privacy protection than running on the entire dataset. This fact implies that the DP noise required for achieving a predefined DP level can be reduced when partial data are randomly selected at each local update. For the ease of later analysis, we restate the privacy amplification theorem as follows [7].

Theorem 1.

Suppose that a mechanism

M

is

(ϵ, δ)

-DP over a given dataset

D

with size

| D | = n

. Consider the subsampling mechanism that outputs a random sample uniformly over all subsets

D_{s} \subseteq D

with size

| D_{s} | = b

. Then, when

ϵ \leq 1

, executing

M

mechanism on the subset

D_{s}

guarantees

(ϵ^{'}, δ^{'})

-DP, where

ϵ^{'}

and

δ^{'}

are given by

\begin{matrix} ϵ^{'} = & min (2 q ϵ, ϵ), \end{matrix}

(11)

\begin{matrix} δ^{'} = & q δ, \end{matrix}

(12)

where

q = b / n

is the data sampling ratio when data are sampled without replacement.

According to Theorem 1, the privacy would be amplified when

q \leq 1 / 2

. Note that the privacy amplification is pervasively adopted in the existing literature on FL [1,7,23] since only a small portion of data are currently being used in local SGD.

2.3. The Typical DP-FedAvg Algorithm

In this subsection, we present the typical DP-FedAvg, which is summarized in Algorithm 1. The main steps follow the FedAvg framework outlined in Section 2.1. The key distinctions between DP-FedAvg and FedAvg are twofold:

(i)

Gaussian noise is added to local gradients to ensure robust privacy protection and

(i i)

gradient clipping is applied to the locally updated model. Specifically, to protect the privacy of the local model

w_{t}^{k}

, we apply DP by adding Gaussian noise

ξ_{t + 1}^{k} \sim N (0, {(σ_{t + 1}^{k})}^{2} I_{d})

to the local gradient as follows:

\begin{matrix} w_{t + 1}^{k} = w_{t}^{k} - η_{t} ({\tilde{g}}_{t}^{k} + ξ_{t + 1}^{k}), \end{matrix}

(13)

where

{\tilde{g}}_{t}^{k}

is the clipped local gradient, defined by

\begin{matrix} {\tilde{g}}_{t}^{k} = g_{t}^{k} / max (1, \frac{∥ g_{t}^{k} ∥}{G}), \end{matrix}

(14)

in which

g_{t}^{k}

is the local stochastic gradient, given as follows:

\begin{matrix} g_{t}^{k} = \nabla F_{k} (w_{t}^{k}, B_{t}^{k}) . \end{matrix}

(15)

It is important to note that determining the

ℓ_{2}

-sensitivity requires bounded local gradients. Therefore, gradient clipping is applied to the locally updated model. Specifically, if

∥ g_{t + 1}^{k} ∥ \geq G

, the gradient is scaled down to have a norm of

G

; otherwise, it remains unchanged. Gradient clipping is a standard operation in SGD with DP, particularly for determining sensitivity in FL systems.

Algorithm 1 Proposed DP-FedAvg algorithm

1:: $Input :$ Initial values of $w_{0}^{1}, \dots, w_{0}^{N}$ at the clients, and initial value of $w_{0}$ at the server.
2:: for $t = 0, 1, \dots, T - 1$ do
3:: $Client side :$
4:: for $k \in S_{t}$ in parallel do
5:: Calculate local gradients: $g_{t}^{k}$ via (15).
6:: Perform gradient clipping: ${\tilde{g}}_{t}^{k}$ via (14).
7:: if $t + 1 \neq z Q, z = 1, 2, \dots$ then
8:: Update $w_{t + 1}^{k} = w_{t}^{k} - η_{t} g_{t}^{k}$
9:: else if $t + 1 = z Q, z = 1, 2, \dots$ then
10:: Update $w_{t + 1}^{k}$ via (13).
11:: end if
12:: $Server side :$
13:: Update $w_{t + 1} = \sum_{k = 1}^{K} p_{k} w_{t + 1}^{k}$ .
14:: Update the subset of clients $S_{t + 1} \subseteq [N]$ through uniformly sampling, and broadcast $x_{0}^{t}$ to all clients.
15:: end for
16:: end for

3. Theoretical Analysis for DP-FedAvg

In this section, we give the formal privacy guarantee and rigorous convergence analysis of DP-FedAvg. Before stating our results, we make the following assumptions.

3.1. Assumptions

Assumption 1.

F_{1}

,

F_{2}

,…,

F_{N}

are all L-smooth: for all

w

and

v

,

F_{k} (v) \leq F_{k} (w) + {(v - w)}^{⊤} \nabla F_{k} (w) + \frac{L}{2} {∥ v - w ∥}^{2}, \forall k \in [N]

.

Assumption 2.

F_{1}

,

F_{2}

,…,

F_{N}

are all μ-strong convex: for all

w

and

v

,

F_{k} (v) \geq F_{k} (w) + {(v - w)}^{⊤} \nabla F_{k} (w) + \frac{μ}{2} {∥ v - w ∥}^{2}, \forall k \in [N]

.

Assumption 3.

Denote

B_{t}^{k}

as the mini-batch data from the dataset in client k with size

| B_{t}^{k} | = b

; then, the variance of stochastic gradients in each client is bounded, that is,

E [∥ \nabla F_{k} (w_{t}^{k}, B_{t}^{k}) - \nabla F_{k} (w_{t}^{k}) ∥^{2}] \leq ϕ_{k}^{2} / b, \forall k \in [N]

.

Assumption 4.

The local gradient is bounded, i.e.,

∥ \nabla F_{k} (w_{t}^{k}, B_{t}^{k}) ∥ \leq G

.

Quantifying the degree of non - i . i . d . data

. Inspired by [6], we quantify data heterogeneity through the degree of non-i.i.d. data, defined as

\begin{matrix} Γ = F^{🟉} - \sum_{k = 1}^{K} p_{k} F_{k}^{🟉}, \end{matrix}

(16)

where

F^{🟉}

and

F_{k}^{🟉}

represent the minimum values of F and

F_{k}

, respectively. A larger value of

Γ

indicates a higher degree of data heterogeneity. Specifically, when the data are i.i.d.,

Γ

approaches zero.

3.2. Privacy Analysis

We firstly estimate the sensitivity on client perspective. The

ℓ_{2}

-norm sensitivity for each local uploaded model is given in following Lemma.

Lemma 2.

Supposed that the Assumptions 1–4 hold and

mod (t + 1, Q) = 0

. The

ℓ_{2}

-norm sensitivity is given by

\begin{matrix} ▵_{t + 1} = max_{D_{k}, D_{k}^{'}} ∥ w_{t + 1, D_{k}}^{k} - w_{t + 1, D_{k}^{'}}^{k} ∥ = 4 Q G η_{t + 1}, \end{matrix}

(17)

where

w_{t + 1, D_{k}}^{k}

and

w_{t + 1, D_{k}^{'}}^{k}

denote the local model parameters updated from two neighboring datasets

D_{k}

and

D_{k}^{'}

, respectively.

Proof.

See Appendix A. □

Lemma 3.

According to the privacy amplification theorem in Theorem 1, the required privacy protection level to guarantee

(ϵ, δ)

-DP is given by

\begin{matrix} ϵ^{'} = & \frac{ϵ}{2 q}, \end{matrix}

(18)

\begin{matrix} δ^{'} = & \frac{δ}{q}, \end{matrix}

(19)

where q is the data sampling ratio.

By Lemma 2 and Lemma 3, the total DP noise scale for t-th iteration is

\begin{matrix} σ_{t}^{2} = \sum_{k = 1}^{K} p_{k} {(σ_{t}^{k})}^{2} = \frac{128 η_{t}^{2} Q^{4} b^{2} G^{2}}{n ϵ^{2}} \sum_{k = 1}^{K} \frac{log (1.25 Q b / n_{k} δ)}{n_{k}} . \end{matrix}

(20)

One can observe from (20) that the increasing Q promptly increases the noise scale, which is consistent the result in Lemma 2. According to (20), the DP noise scale is primarily determined by the privacy protection level

ϵ

, the gradient clipping threshold

G

and other algorithm-specific parameters, such as Q, b and

η_{t}

. Thus, in practical FL systems, achieving a balance between privacy protection and learning performance relies on effectively controlling the key parameters

ϵ

and

G

.

Then, the accumulated noise over the entire training process with total

T / Q

communication rounds can be obtained by

\begin{matrix} σ_{t o t a l}^{2} = \sum_{t = 1}^{T / Q} \sum_{k = 1}^{K} p_{k} {(σ_{t}^{k})}^{2} = \frac{128 T Q^{3} b^{2} G^{2}}{n ϵ^{2}} \sum_{t = 1}^{T / Q} η_{t}^{2} \sum_{k = 1}^{K} \frac{log (1.25 Q b / n_{k} δ)}{n_{k}} . \end{matrix}

(21)

By considering privacy amplification, the total privacy loss after T communication rounds has been explored in [27]. For the sake of clarity and ease of later analysis, we restate it here as follows.

Theorem 2.

Suppose that any client

i, \forall i \in [N]

in Algorithm 1 is uniformly sampled with probability

p_{i}

and each communication round guarantees

(ϵ, δ)

-DP. The total privacy loss for client i after T communication rounds satisfies

\begin{matrix} {\bar{ϵ}}_{i} = c_{0} q_{i} ϵ \sqrt{\frac{p_{i} T}{1 - q_{i}}}, \forall i \in [N], \end{matrix}

(22)

where

c_{0} > 0

is a constant and

q_{i}

is given by

\begin{matrix} q_{i} = \{\begin{matrix} \frac{Q b}{| D_{i} |}, & d a t a s a m p l i n g W O R, \\ 1 - {(1 - \frac{1}{| D_{i} |})}^{Q b}, & d a t a s a m p l i n g W R, \end{matrix} \end{matrix}

(23)

where WOR and WR denote data sampling without and with replacement, respectively.

Remark 1.

By Theorem 2, the total privacy loss is significantly impacted by data sampling strategies. Specifically, the total privacy loss for data sampling with replacement is smaller than that for data sampling without replacement due to

1 - (1 - 1 / | D_{i} {|)}^{Q b} \leq Q b / | D_{i} |

.

3.3. Convergence Analysis for DP-FedAvg

In this section, we analyze the convergence performance of the DP-FedAvg algorithm. Let T denote the total number of iterations and

w^{🟉}

the optimal solution of

F (w)

. The convergence performance is measured by

E [F (w_{T}) - F (w^{🟉})]

, where the expectation is taken over the algorithm’s randomness. This metric represents the gap between the achieved objective value and the optimal objective value after T iterations. Notably, when

E [F (w_{T}) - F (w^{🟉})] = 0

, the algorithm converges to the optimal solution.

Theorem 3.

For the full-participation case, assume that Assumption 1, Assumption 2 and Assumption 3 hold. Let

κ = L / μ

,

α = max {8 κ, Q}

, and the learning rate

η_{t} = 2 / (μ (α + t))

. Then, the following inequality holds:

\begin{matrix} E [F ({\bar{w}}_{t}) - F^{🟉}] \leq \frac{2 κ}{t + α} (\frac{A + B}{μ} + 2 L {∥ w_{0} - w^{🟉} ∥}^{2}), \end{matrix}

(24)

where

\begin{matrix} A = \sum_{k = 1}^{K} p_{k}^{2} ϕ_{k}^{2} / b + 8 {(Q - 1)}^{2} G^{2} + 6 L Γ, \end{matrix}

(25)

\begin{matrix} B = \frac{8 d Q^{4} b^{2} G^{2}}{n L^{2} ϵ^{2}} log (n) log (1 . 25^{2} Q^{2} b^{2} / n δ^{2}) (2 {(Q - 1)}^{2} + 1) . \end{matrix}

(26)

Proof.

See Appendix B. □

Remark 2.

In Theorem 3, since

∥ w_{0} - w^{🟉} ∥^{2} \leq \frac{1}{μ^{2}} {(\nabla F (w) - \nabla F (w^{🟉}))}^{2} \leq \frac{4 G^{2}}{μ^{2}}

, the dominating term in (24) is

O (\frac{A + B}{μ T})

. Therefore, the DP-FedAvg algorithm achieves a convergence rate in the order of

\begin{matrix} O (\frac{\sum_{k = 1}^{K} p_{k}^{2} ϕ_{k}^{2} / b + Q^{2} G^{2} + L Γ}{μ T} + \frac{d Q^{4} b^{2} G^{2} log (n) log (Q^{2} b^{2} / n δ^{2})}{μ T n L^{2} ϵ^{2}}) . \end{matrix}

(27)

Therefore, an increase in Q can significantly increase the noise scale, thereby deteriorating the convergence performance.

Next, we present the convergence analysis by considering client sampling. Inspired by [6], we make the following assumption.

Assumption 5.

Assume the total number of clients is N and the size of the selected client set is

| S_{t} | = K

. If we uniformly sample the participating clients with replacement with sampling probabilities

p_{1}, \dots, p_{N}

, then the aggregation step on the server side performs

w_{t} = \frac{1}{K} \sum_{k \in S_{t}} w_{t}^{k}

. Otherwise, if we uniformly sample the participated clients without replacement from

[N]

, then the aggregation step on the PS performs

w_{t} = \frac{N}{K} \sum_{k \in S_{t}} p_{k} w_{t}^{k}

.

Theorem 4.

Suppose that Assumptions 1, 2, 3 and 5 hold. Let κ, α,

η_{t}

, A and B be defined in Theorem 3. Then, the following inequality holds:

\begin{matrix} E [F ({\bar{w}}_{t}) - F^{🟉}] \leq \frac{2 κ}{t + α} (\frac{A + B + H}{μ} + 2 L {∥ w_{0} - w^{🟉} ∥}^{2}), \end{matrix}

(28)

if sampling the clients with replacement:

\begin{matrix} H = \frac{4}{K} Q^{2} G^{2}, \end{matrix}

(29)

and if sampling the clients without replacement:

\begin{matrix} H = \frac{N - K}{N - 1} \frac{4}{K} Q^{2} G^{2} . \end{matrix}

(30)

Proof.

See Appendix D. □

Theorem 4 provides an upper bound of the average total local SGDs over t iterations. A smaller value indicates a higher convergence rate and a lower cost function value in (28). Based on Theorems 3 and 4, we have following remarks.

Remark 3.

(The impact of Q on communication efficiency) By denoting

T_{ε}

as the number of required iterations for DP-FedAvg to achieve an ε accuracy. Based on the result in (27), the number of required communication rounds

T_{ε} / Q

is

\begin{matrix} \frac{T_{ε}}{Q} \propto O (\frac{d b^{2} G^{2} log (n) log (Q^{2} b^{2} / n δ^{2})}{μ T n L^{2} ϵ^{2}} Q^{3} + G^{2} Q + \frac{\sum_{k = 1}^{K} p_{k}^{2} ϕ_{k}^{2} / b + L Γ}{Q}) . \end{matrix}

(31)

From (31), we observe that the communication round

T_{ε} / Q

is the function of Q, and

T_{ε} / Q

first decreases and then increases, i.e., there exists an optimal Q such that the communication cost is optimal. Furthermore, if Q is too large, the

w_{k}

will promptly achieve local optimal, making FedAvg behave like one-shot average [33]. However, the one-shot averaging does not perform well in the non-i.i.d. data case as demonstrated in [6].

Remark 4.

(The impact of Q on learning performance) The DP noise scale increases significantly with larger values of Q, which inevitably degrades learning performance. However, a larger Q can also reduce the number of communication rounds, as stated in Remark 3. Therefore, there exists an optimal value of Q that balances learning performance and communication efficiency.

Remark 5.

(The impact of gradient clipping bound level

G

) The convergence performance is influenced by gradient clipping level

G

. However, limiting the gradient norm has the following shortcomings:

(i)

Gradient clipping destroys the unbiasedness of the gradient estimate.

(i i)

If

G

is too small, the average of the clipped gradients may point in a direction significantly different from the true gradient, potentially leading to poor convergence.

(i i i)

If

G

is too large, the gradients will carry more DP noise, as

G

determines the sensitivity.

4. Numerical Experiments

4.1. Experimental Setting

In this section, we evaluate the performance of DP-FedAvg under a non-i.i.d. data setting. Denote the

ℓ (w; x_{i})

as the prediction model with parameter

w = (W, b)

, then the form for logistic regression and softmax regression are

ℓ (w; x_{i}, y_{i}) = logisitc (W x + b)

and

ℓ (w; x_{i}, y_{i}) = softmax (W x + b)

, respectively. The loss function for both problems is given by

\begin{matrix} F_{k} (w) = \frac{1}{n_{k}} \sum_{j = 1}^{n_{k}} CrossEntropy (ℓ (w; x_{j}), y_{j}) . \end{matrix}

(32)

Note that this is a convex and smooth problem. The testing accuracy (learning accuracy over the testing data) is evaluated as the learning performance in the experiments.

Datasets: Two benchmark datasets, Adult [34] and MNIST [35], were considered for performance evaluation. The MNIST dataset consists of 60,000 training samples and 10,000 testing samples, while the Adult dataset consists of 32,561 training samples and 16,281 testing samples. To simulate the FL system, we distributed all training samples across

N = 100

clients in a non-i.i.d. manner, ensuring that each client possessed partial labels for the data:

For the MNIST dataset, following the heterogeneous data partition method in [6], each client was allocated data samples of only four different labels with $| D_{i} | = 600$ . This led to a high degree of non-i.i.d. datasets among clients.
For the Adult dataset, following [1], all the training samples were uniformly distributed among a total $N = 100$ clients such that each client only contained data from one class with $| D_{i} | = 325$ .

Parameter setting :

For all experiments, we maintained the following parameter settings:

K = 10

,

b = 10

and

δ = 10^{- 5}

. The choice of the learning rate

η^{t}

depended on the dataset under consideration.

For the MNIST dataset, the local learning rate was set to

η^{t} = 0.04 / \sqrt{1 + t}

, while for the Adult dataset,

η^{t} = 0.01 / \sqrt{1 + t}

. Following the rule of gradient clipping in [25], we selected the value of

G

by taking the median of the norms of the unclipped gradients over the training process.

4.2. Impact of the Privacy Protection Level $ϵ$

To validate the learning performance at various privacy protection levels, the parameters were set as

K = 10

,

Q = 5

and

δ = 10^{- 5}

, and the testing accuracy was computed with respect to communication rounds (

T / Q

). We compared the accuracy with various values of

ϵ

under both i.i.d. and non-i.i.d. scenarios. As illustrated in Figure 2, both the MNIST and Adult datasets provided strong privacy protection without a significant decrease in testing accuracy. Moreover, we find that MNIST has a higher tolerance to DP noise than the Adult dataset. Specifically, the difference in testing accuracy between the non-private and private models for both i.i.d. and non-i.i.d. datasets is much smaller than the corresponding drop in accuracy for the Adult dataset. The reason for the MNIST dataset’s robustness to DP noise compared with the Adult dataset may have been due to the following reasons:

Impact of non-i.i.d. data: The presence of non-i.i.d. data across clients in FL amplifies the impact of DP noise. The Adult dataset, which contains categorical and numerical features, is more sensitive to such heterogeneity. Conversely, the MNIST dataset, consisting of grayscale images of handwritten digits with simpler patterns, is inherently more resilient to non-i.i.d. data settings.
Inherent data structure: The MNIST dataset benefits from the spatial coherence of pixel intensities, where patterns such as the shapes of digits remain discernible when DP noise is introduced. This property of image data enables models to identify meaningful features despite perturbations. On the other hand, the Adult dataset, being tabular with categorical and numerical features, lacks such spatial structure. DP noise in tabular data can obscure critical relationships between features and labels, making it harder for the FL model to extract meaningful patterns under noisy conditions.

Figure 2. The testing accuracy versus communication rounds of the DP-FedAvg algorithm on various privacy protection levels under Adult and MNIST datasets.

4.3. Impact of Local Epoch Length Q

In this experiment, we compared the learning performance under the parameter settings

ϵ = 0.5

,

K = 10

and

δ = 10^{- 5}

for various values of Q. As shown in Figure 3, we observed that both excessively large and excessively small values of Q reduced the convergence rate on the Adult dataset. However, for the MNIST dataset, a large Q had a negligible effect on convergence. This may have been due to two reasons:

(i)

The optimal value of Q for MNIST is significantly larger than for the Adult dataset, and

(i i)

MNIST exhibits a higher tolerance to noise compared to the Adult dataset. Although a large Q alleviates the communication burden, it substantially increases the noise scale, thereby degrading learning performance as discussed in Remark 4. Additionally, our simulation results indicate that the optimal value of Q is sensitive to both the parameter settings and the training dataset. Furthermore, for both datasets, the convergence performance on i.i.d. data is better than on non-i.i.d. data under the same parameter settings.

5. Conclusions and Future Works

In this paper, we conducted a thorough theoretical analysis of DP-FedAvg under different client and data sampling strategies, with a particular focus on the impact of non-i.i.d. data on the convergence and privacy protection of FL algorithms. Our work fills a significant gap in the literature by providing a unified framework that integrates data heterogeneity with DP. Through our analysis, we demonstrated that the DP-FedAvg algorithm maintains robustness toward data heterogeneity and DP noise, achieving a sublinear convergence rate for convex FL problems. Additionally, our convergence and privacy analyses offer valuable insight and perspectives into how various system parameters in the FL framework influence learning performance, which can be used as guidelines for the practical FL algorithm design. Our theoretical findings are supported by extensive numerical experiments on real-world datasets, which are consistent with our theoretical findings.

While our theoretical analysis provides useful insights for DP-FedAvg under the data heterogeneity scenario, several promising directions remain for future research. First, our work is based on the typical star network framework, consisting of one server and multiple clients. A promising direction would be to explore hierarchical FL systems, where multiple servers collaborate to train a global model. This framework could involve more clients and larger datasets, potentially improving learning performance. Second, another promising avenue for future work involves examining the tradeoffs between privacy, utility and computational overhead when implementing advanced methods in resource-constrained FL systems, such as those in mobile edge computing environments. Finally, while the DP mechanism offers strong privacy guarantees, more sophisticated approaches could be employed to further enhance privacy protection. Techniques such as homomorphic encryption could be integrated to strengthen the security of DP-FedAvg. Addressing these challenges and exploring innovative theoretical analyses will be an exciting and valuable direction for future research.

Author Contributions

Methodology, Y.L. and S.W.; Formal analysis, Y.L., S.W. and Q.W.; Investigation, Y.L. and Q.W.; Writing—original draft preparation, Y.L.; Writing—review and editing, Y.L., S.W. and Q.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Proof of the Lemma 2

Proof.

Since there is only one different record between

D_{k}, D_{k}^{'}

, and each client performs Q steps of SGD per communication round, the local update scheme is

w_{t + 1, D_{k}}^{k} = w_{t + 1 - Q, D_{k}}^{k} - \sum_{τ = 0}^{Q - 1} η_{t + 1 - Q + τ} g_{t + 1 - Q + τ, D_{k}}^{k}

. Thus, we have

\begin{matrix} {∥w_{t + 1, D_{k}}^{k} - w_{t + 1, D_{k}^{'}}^{k}∥}^{2} & = ∥ \sum_{τ = 0}^{Q - 1} η_{t + 1 - Q + τ} g_{t + 1 - Q + τ, D_{k}}^{k} - \sum_{τ = 0}^{Q - 1} η_{t + 1 - Q + τ} g_{t + 1 - Q + τ, D_{k}^{'}}^{k} ∥^{2} \\ \overset{(a)}{\leq} 4 η_{t + 1}^{2} ∥ \sum_{τ = 0}^{Q - 1} (g_{t + 1 - Q + τ, D_{k}}^{k} - g_{t + 1 - Q + τ, D_{k}^{'}}^{k}) ∥^{2} \\ \leq 4 Q^{2} η_{t + 1}^{2} \sum_{τ = 0}^{Q - 1} {∥g_{t + 1 - Q + τ, D_{k}}^{k} - g_{t + 1 - Q + τ, D_{k}^{'}}^{k}∥}^{2} \\ \overset{(b)}{\leq} 16 Q^{2} G^{2} η_{t + 1}^{2}, \end{matrix}

(A1)

where

(a)

holds due to

η_{t} \leq 2 η_{t + 1}

;

(b)

follows from Assumption 4. □

Appendix B. Proof of Theorem 3

Proof.

Inspired by the perturbed iterate framework in [6], we define

I_{Q} = \{z Q | z = 1, 2, \dots\}

as the set of global synchronization steps. When

t + 1 \notin I_{Q}

, it is the time that clients run the local model update; otherwise, it is the time step to communication. Then, the update rule of DP-FedAvg can be described as

\begin{matrix} v_{t + 1}^{k} = \{\begin{matrix} w_{t}^{k} - η_{t} \nabla F_{k} (w_{t}^{k}, B_{t}^{k}) & if t + 1 \notin I_{Q} \\ w_{t}^{k} - η_{t} (\nabla F_{k} (w_{t}^{k}, B_{t}^{k}) + ξ_{t}^{k}) & if t + 1 \in I_{Q} \end{matrix} \end{matrix}

(A2)

\begin{matrix} w_{t + 1}^{k} = \{\begin{matrix} v_{t + 1}^{k} & if t + 1 \notin I_{Q} \\ \sum_{k = 1}^{K} p_{k} v_{t + 1}^{k} & if t + 1 \in I_{Q} \end{matrix} \end{matrix}

(A3)

where

ξ_{t}^{k}

is the Gaussian noise added to gradients,

v_{t + 1}^{k}

denotes one-step SGD update from

w_{t}^{k}

and

w_{t + 1}^{k}

can be interpreted as the result after communication steps.

Motivated by [6], we define two virtual sequences,

{\bar{w}}_{t} = \sum_{k = 1}^{K} p_{k} w_{t}^{k}

and

{\bar{v}}_{t} = \sum_{k = 1}^{K} p_{k} v_{t}^{k}

. For simplicity of analysis, when

t + 1 \in I_{Q}

, define

\begin{matrix} {\bar{g}}_{t} = & \sum_{k = 1}^{K} p_{k} (\nabla F_{k} (w_{t}^{k}) + ξ_{t}^{k}), \end{matrix}

(A4)

\begin{matrix} g_{t} = & \sum_{k = 1}^{K} p_{k} (\nabla F_{k} (w_{t}^{k}, B_{t}^{k}) + ξ_{t}^{k}) . \end{matrix}

(A5)

When

t + 1 \notin I_{Q}

, define

\begin{matrix} g_{t} = & \sum_{k = 1}^{K} p_{k} (\nabla F_{k} (w_{t}^{k}, B_{t}^{k})), \end{matrix}

(A6)

\begin{matrix} {\bar{g}}_{t} = & \sum_{k = 1}^{K} p_{k} (\nabla F_{k} (w_{t}^{k})) . \end{matrix}

(A7)

Note that, for both cases, we always have

{\bar{v}}_{t + 1} = {\bar{w}}_{t} - η_{t} g_{t}

and

E [g_{t}] = {\bar{g}}_{t}

. □

Appendix B.1. Key Lemmas

To clearly convey the proof of Theorem 3, we need the following key lemmas.

Lemma A1.

(one-step SGD) If Assumption 1 and Assumption 2 hold and

η_{t} \leq \frac{1}{4 L}

. Then,

w h e n t + 1 \notin I_{Q}

,

\begin{matrix} E [{∥{\bar{v}}_{t + 1} - w^{🟉}∥}^{2}] \leq & (1 - μ η_{t}) E [{∥{\bar{w}}_{t} - w^{🟉}∥}^{2}] + η_{t}^{2} E [{∥g_{t} - {\bar{g}}_{t}∥}^{2}] \\ + 2 E [\sum_{k = 1}^{K} p_{k} {∥ {\bar{w}}_{t} - w_{t}^{k} ∥}^{2}] + 6 L η_{t}^{2} Γ, \end{matrix}

(A8)

when

t + 1 \in I_{Q}

,

\begin{matrix} E [{∥{\bar{v}}_{t + 1} - w^{🟉}∥}^{2}] \leq & (1 - μ η_{t}) E [{∥{\bar{w}}_{t} - w^{🟉}∥}^{2}] + η_{t}^{2} E [{∥g_{t} - {\bar{g}}_{t}∥}^{2}] \\ + 2 E [\sum_{k = 1}^{K} p_{k} {∥ {\bar{w}}_{t} - w_{t}^{k} ∥}^{2}] + 6 L η_{t}^{2} Γ + 2 d η_{t}^{2} E [\sum_{k = 1}^{K} p_{k} {∥ ξ_{t}^{k} ∥}^{2}] . \end{matrix}

(A9)

Lemma A2.

Suppose that Assumption 3 holds. Then,

\begin{matrix} E [∥ g_{t} - {\bar{g}}_{t} ∥^{2}] \leq \sum_{k = 1}^{K} p_{k}^{2} ϕ_{k}^{2} / b . \end{matrix}

(A10)

Lemma A3.

Assume

η_{t}

is non-increasing and

η_{t} \leq 2 η_{t + Q}

, then,

when

t + 1 \in I_{Q}

,

\begin{matrix} E [\sum_{k = 1}^{K} p_{k} ∥ {\bar{w}}_{t} - w_{t}^{k} ∥^{2}] \leq 4 {(Q - 1)}^{2} η_{t}^{2} (G^{2} + d \sum_{k = 1}^{K} p_{k} σ_{t, k}^{2}) . \end{matrix}

(A11)

When

t + 1 \notin I_{Q}

,

\begin{matrix} E [\sum_{k = 1}^{K} p_{k} ∥ {\bar{w}}_{t} - w_{t}^{k} ∥^{2}] \leq 4 {(Q - 1)}^{2} G^{2} η_{t}^{2} . \end{matrix}

(A12)

Lemma A4.

Assume

η_{t}

is non-increasing and

η_{t} \leq \frac{1}{4 L}

, then the total noise in each communication round is given by

\begin{matrix} E [\sum_{k = 1}^{K} p_{k} ∥ ξ_{t}^{k} ∥^{2}] \leq \frac{4 Q^{4} b^{2} G^{2}}{n L^{2} ϵ^{2}} log (n) log (1 . 25^{2} Q^{2} b^{2} / n δ^{2}) . \end{matrix}

(A13)

Appendix B.2. Completing the Proof of Theorem 3

Proof.

Let

△_{t + 1} = E [{∥{\bar{w}}_{t + 1} - w^{🟉}∥}^{2}]

. Based on the results of Lemma A1–A4, when

t + 1 \notin I_{Q}

, it follows that

\begin{matrix} E [{∥{\bar{v}}_{t + 1} - w^{🟉}∥}^{2}] & \leq (1 - μ η_{t}) E [{∥{\bar{w}}_{t} - w^{🟉}∥}^{2}] + A η_{t}^{2}, \end{matrix}

(A14)

where

\begin{matrix} A = \sum_{k = 1}^{K} p_{k}^{2} ϕ_{k}^{2} / b + 8 {(Q - 1)}^{2} G^{2} + 6 L Γ . \end{matrix}

(A15)

When

t + 1 \in I_{Q}

,

\begin{matrix} E [∥ {\bar{v}}_{t + 1} - w^{🟉} ∥^{2}] & \leq (1 - μ η_{t}) E [∥ {\bar{w}}_{t} - w^{🟉} ∥^{2}] + (A + B) η_{t}^{2}, \end{matrix}

(A16)

where

\begin{matrix} B = \frac{8 d Q^{4} b^{2} G^{2} log (n) log (1 . 25^{2} Q^{2} b^{2} / n δ^{2})}{n L^{2} ϵ^{2}} (2 {(Q - 1)}^{2} + 1) . \end{matrix}

(A17)

Then, (A16) becomes

\begin{matrix} △_{t + 1} \leq (1 - μ η_{t}) △_{t} + η_{t}^{2} (A + B) . \end{matrix}

(A18)

For a diminishing stepsize,

η_{t} = \frac{β}{t + α}

for some

β > \frac{1}{μ}

and

α \geq 0

such that

η_{0} \leq min {\frac{1}{μ}, \frac{1}{4 L}} = \frac{1}{4 L}

and

η_{t} \leq 2 η_{t + Q}

, this also implies that

α \geq Q

:

\begin{matrix} △_{t + 1} & \leq (1 - μ η_{t}) △_{t} + η_{t}^{2} (A + B) \\ = (1 - \frac{μ β}{t + α}) \frac{ν}{t + α} + \frac{β^{2}}{{(t + α)}^{2}} (A + B) \\ \leq (1 - \frac{μ β}{t + α}) \frac{ν}{t + α} + \frac{β^{2} (A + B)}{{(t + α)}^{2}} \\ \leq \frac{t + α - 1}{{(t + α)}^{2}} ν + (\frac{β^{2} (A + B)}{{(t + α)}^{2}} - \frac{(μ β - 1) ν}{{(t + α)}^{2}}) \\ \leq \frac{t + α - 1}{{(t + α)}^{2} - 1} ν \\ = \frac{ν}{t + α + 1}, \end{matrix}

(A19)

where

ν \geq β^{2} (A + B) / (μ β - 1)

. By the L-smoothness of

F (\cdot)

, we have

\begin{matrix} E [F ({\bar{w}}_{t}) - F^{🟉}] \leq \frac{L}{2} E {∥ {\bar{w}}_{t} - w^{🟉} ∥}^{2} = \frac{L}{2} △_{t} \leq \frac{L}{2} \frac{ν}{t + α} . \end{matrix}

(A20)

Note that, from (A20), we have

ν \geq α △_{0}

. Thus, we obtain

ν = max \{\frac{β^{2} (A + B)}{μ β - 1}, α △_{0}\}

. If we choose

β = 2 / μ

and then

η_{t} = \frac{2}{μ} \frac{1}{α + t}

, where

α \geq max {8 κ, Q}

,

κ = \frac{L}{μ}

. The range of

α

due to

η_{0} = \frac{2}{μ} \frac{1}{α} \leq \frac{1}{4 L}

implies that

α \geq 8 κ

. Then, we have

\begin{matrix} E [F ({\bar{w}}_{t}) - F^{🟉}] \leq \frac{2 κ}{t + α} (\frac{A + B}{μ} + 2 L △_{0}), \end{matrix}

(A21)

where if

\frac{β^{2} (A + B)}{μ β - 1} \geq α △_{0}

, then

\begin{matrix} E [F ({\bar{w}}_{t}) - F^{🟉}] \leq \frac{L}{2} \frac{1}{t + α} \frac{β^{2} A}{μ β - 1} = \frac{2 κ}{t + α} \frac{A + B}{μ}; \end{matrix}

(A22)

otherwise, if

\frac{β^{2} (A + B)}{μ β - 1} < α △_{0}

, and in this case,

α = 8 κ

, then

\begin{matrix} E [F ({\bar{w}}_{t}) - F^{🟉}] \leq \frac{L}{2} \frac{1}{t + α} α △_{0} = \frac{4 κ L △_{0}}{t + α} . \end{matrix}

(A23)

By combining (A21), (A22) and (A23), we obtain the result. □

Appendix C. Proofs of Key Lemmas in Theorem 3

Appendix C.1. Proof of Lemma A1

Proof.

Since we have

{\bar{v}}_{t + 1} = {\bar{w}}_{t} - η_{t} g_{t}

, then

\begin{matrix} {∥{\bar{v}}_{t + 1} - w^{🟉}∥}^{2} = & {∥{\bar{w}}_{t} - η_{t} g_{t} - w^{🟉}∥}^{2} = {∥{\bar{w}}_{t} - η_{t} g_{t} - w^{🟉} - η_{t} {\bar{g}}_{t} + η_{t} {\bar{g}}_{t}∥}^{2} \\ = & \underset{≜ A_{1}}{\underset{︸}{{∥{\bar{w}}_{t} - w^{🟉} - η_{t} {\bar{g}}_{t}∥}^{2}}} + 2 η_{t} 〈{\bar{w}}_{t} - w^{🟉} - η_{t} {\bar{g}}_{t}, {\bar{g}}_{t} - g_{t}〉 \\ + η_{t}^{2} \underset{≜ A_{2}}{\underset{︸}{{∥g_{t} - {\bar{g}}_{t}∥}^{2}}} . \end{matrix}

(A24)

Then,

E [〈{\bar{w}}_{t} - w^{🟉} - η_{t} {\bar{g}}_{t}, {\bar{g}}_{t} - g_{t}〉] = 0

due to

E [g_{t}] = {\bar{g}}_{t}

. Then, for the

A_{1}

term, we have

\begin{matrix} {∥{\bar{w}}_{t} - w^{🟉} - η_{t} {\bar{g}}_{t}∥}^{2} = & {∥{\bar{w}}_{t} - w^{🟉}∥}^{2} + η_{t}^{2} {∥{\bar{g}}_{t}∥}^{2} - 2 η_{t} 〈{\bar{w}}_{t} - w^{🟉}, {\bar{g}}_{t}〉 \\ \overset{(a)}{\leq} & {∥{\bar{w}}_{t} - w^{🟉}∥}^{2} + η_{t}^{2} \sum_{k = 1}^{K} p_{k} {∥ \nabla F_{k} (w_{t}^{k}) + ξ_{t}^{k} ∥}^{2} \\ - 2 η_{t} \sum_{k = 1}^{K} p_{k} 〈{\bar{w}}_{t} - w_{t}^{k} + w_{t}^{k} - w^{🟉}, \nabla F_{k} (w_{t}^{k}) + ξ_{t}^{k}〉 \\ = & {∥{\bar{w}}_{t} - w^{🟉}∥}^{2} - 2 η_{t} \sum_{k = 1}^{K} p_{k} 〈w_{t}^{k} - w^{🟉}, \nabla F_{k} (w_{t}^{k}) + ξ_{t}^{k}〉 \\ - 2 η_{t} \sum_{k = 1}^{K} p_{k} 〈{\bar{w}}_{t} - w_{t}^{k}, \nabla F_{k} (w_{t}^{k}) + ξ_{t}^{k}〉 \\ + η_{t}^{2} \sum_{k = 1}^{K} p_{k} {∥ \nabla F_{k} (w_{t}^{k}) + ξ_{t}^{k} ∥}^{2}, \end{matrix}

(A25)

where

(a)

follows from Jensen’s inequality of function

{∥ \cdot ∥}^{2}

. Then, by the L-smoothness of

F_{k} (\cdot)

, we have

{∥\nabla F_{k} (w_{t}^{k})∥}^{2} \leq 2 L (F_{k} (w_{t}^{k}) - F_{k}^{🟉})

. Then,

\begin{matrix} η_{t}^{2} \sum_{k = 1}^{K} p_{k} ∥ Π_{G} (\nabla F_{k} (w_{t}^{k})) + ξ_{t}^{k} ∥^{2} \leq & η_{t}^{2} \sum_{k = 1}^{K} p_{k} ∥ \nabla F_{k} (w_{t}^{k}) + ξ_{t}^{k} ∥^{2} \\ = & η_{t}^{2} \sum_{k = 1}^{K} p_{k} (∥ \nabla F_{k} (w_{t}^{k}) ∥^{2} + 2 \nabla F_{k} {(w_{t}^{k})}^{⊤} ξ_{t}^{k} + ∥ ξ_{t}^{k} ∥^{2}) \\ \leq & 2 L η_{t}^{2} \sum_{k = 1}^{K} p_{k} (F_{k} (w_{t}^{k}) - F_{k}^{🟉}) + η_{t}^{2} \sum_{k = 1}^{K} p_{k} ∥ ξ_{t}^{k} ∥^{2} \\ + 2 η_{t}^{2} \sum_{k = 1}^{K} p_{k} (\nabla F_{k} {(w_{t}^{k})}^{⊤} ξ_{t}^{k}), \end{matrix}

(A26)

where

Π_{G} (\cdot)

denotes the gradient clipping operation. By

μ

-strong convex of

F_{k} (\cdot)

, we have

\nabla F_{k} {(w_{t}^{k})}^{⊤} (w_{t}^{k} - w^{🟉}) \geq (F_{k} (w_{t}^{k}) - F_{k} (w^{🟉})) + \frac{μ}{2} {∥w_{t}^{k} - w^{🟉}∥}^{2}

, then

\begin{matrix} - 〈w_{t}^{k} - w^{🟉}, \nabla F_{k} (w_{t}^{k}) + G \cdot ξ_{t}^{k}〉 = & - 〈w_{t}^{k} - w^{🟉}, \nabla F_{k} (w_{t}^{k})〉 - G 〈w_{t}^{k} - w^{🟉}, ξ_{t}^{k}〉 \\ \leq & - (F_{k} (w_{t}^{k}) - F_{k} (w^{🟉})) - \frac{μ}{2} {∥w_{t}^{k} - w^{🟉}∥}^{2} \\ - G 〈w_{t}^{k} - w^{🟉}, ξ_{t}^{k}〉 . \end{matrix}

(A27)

By the AM-GM inequality, i.e.,

2 〈 x, y 〉 \leq {ζ ∥ x ∥}^{2} + ζ^{- 1} {∥ y ∥}^{2}

, for any

ζ > 0

, we have,

\begin{matrix} - 2 〈{\bar{w}}_{t} - w_{t}^{k}, \nabla F_{k} (w_{t}^{k}) + ξ_{t}^{k}〉 \leq & \frac{1}{η_{t}} ∥ {\bar{w}}_{t} - w_{t}^{k} ∥^{2} + η_{t} {∥ \nabla F_{k} (w_{t}^{k}) + ξ_{t}^{k} ∥}^{2} \\ \leq & \frac{1}{η_{t}} ∥ {\bar{w}}_{t} - w_{t}^{k} ∥^{2} + η_{t} {∥ \nabla F_{k} (w_{t}^{k}) ∥}^{2} \\ + 2 η_{t} ∥ \nabla F_{k} {(w_{t}^{k})}^{⊤} ξ_{t}^{k} ∥ + η_{t} {∥ ξ_{t}^{k} ∥}^{2} \\ \leq & \frac{1}{η_{t}} {∥ {\bar{w}}_{t} - w_{t}^{k} ∥}^{2} + 2 L η_{t} (F_{k} (w_{t}^{k}) - F_{k}^{🟉}) \\ + 2 η_{t} ∥ \nabla F_{k} {(w_{t}^{k})}^{⊤} ξ_{t}^{k} ∥ + η_{t} {∥ ξ_{t}^{k} ∥}^{2} . \end{matrix}

(A28)

By applying these estimates, we have

\begin{matrix} {∥{\bar{w}}_{t} - η_{t} {\bar{g}}_{t} - w^{🟉}∥}^{2} \leq & {∥{\bar{w}}_{t} - w^{🟉}∥}^{2} + 4 L η_{t}^{2} \sum_{k = 1}^{K} p_{k} (F_{k} (w_{t}^{k}) - F_{k}^{🟉}) \\ - 2 η_{t} \sum_{k = 1}^{K} p_{k} (F_{k} (w_{t}^{k}) - F_{k} (w^{🟉})) + 4 η_{t}^{2} \sum_{k = 1}^{K} p_{k} (\nabla F_{k} {(w_{t}^{k})}^{⊤} ξ_{t}^{k}) \\ + 2 η_{t}^{2} \sum_{k = 1}^{K} p_{k} ∥ ξ_{t}^{k} ∥^{2} + \sum_{k = 1}^{K} p_{k} {∥ {\bar{w}}_{t} - w_{t}^{k} ∥}^{2} - μ η_{t} \sum_{k = 1}^{K} p_{k} {∥w_{t}^{k} - w^{🟉}∥}^{2} \\ - 2 η_{t} \sum_{k = 1}^{K} p_{k} {(w_{t}^{k} - w^{🟉})}^{⊤} ξ_{t}^{k} \\ \overset{(a)}{\leq} & (1 - μ η_{t}) {∥{\bar{w}}_{t} - w^{🟉}∥}^{2} + 4 L η_{t}^{2} \sum_{k = 1}^{K} p_{k} (F_{k} (w_{t}^{k}) - F_{k}^{🟉}) \\ - 2 η_{t} \sum_{k = 1}^{K} p_{k} (F_{k} (w_{t}^{k}) - F_{k} (w^{🟉})) + 2 η_{t}^{2} \sum_{k = 1}^{K} p_{k} {∥ ξ_{t}^{k} ∥}^{2} \\ + \sum_{k = 1}^{K} p_{k} {∥ {\bar{w}}_{t} - w_{t}^{k} ∥}^{2}, \end{matrix}

(A29)

where in

(a)

, we apply the fact that

w_{t}^{k}

is independent of the

ξ_{t}^{k}

and thus,

E [ξ_{t}^{k}] = 0

. Furthermore, we also use the inequality

- \sum_{k = 1}^{K} p_{k} {∥w_{t}^{k} - w^{🟉}∥}^{2} \leq - {∥{\bar{w}}_{t} - w^{🟉}∥}^{2}

, which can be proved by

\begin{matrix} \sum_{k = 1}^{K} p_{k} {∥w_{t}^{k} - w^{🟉}∥}^{2} & \geq \sum_{k = 1}^{K} p_{k} (∥ w_{t}^{k} ∥^{2}) - 2 ∥ {\bar{w}}_{t} ∥ ∥ w^{🟉} ∥ + ∥ w^{🟉} ∥^{2} \\ \geq \sum_{k = 1}^{K} (∥ p_{k} w_{t}^{k} ∥^{2}) - 2 ∥ {\bar{w}}_{t} ∥ ∥ w^{🟉} ∥ + ∥ w^{🟉} ∥^{2} \\ = ∥ {\bar{w}}_{t} - w^{🟉} ∥^{2} . \end{matrix}

(A30)

Then, by letting

Ψ (t) ≜ 4 L η_{t}^{2} \sum_{k = 1}^{K} p_{k} (F_{k} (w_{t}^{k}) - F_{k}^{🟉}) - 2 η_{t} \sum_{k = 1}^{K} p_{k} (F_{k} (w_{t}^{k}) - F_{k} (w^{🟉}))

, we have

\begin{matrix} Ψ (t) & = 4 L η_{t}^{2} \sum_{k = 1}^{K} p_{k} (F_{k} (w_{t}^{k}) - F_{k}^{🟉}) - 2 η_{t} \sum_{k = 1}^{K} p_{k} (F_{k} (w_{t}^{k}) - F_{k} (w^{🟉})) \\ = (4 L η_{t}^{2} - 2 η_{t}) \sum_{k = 1}^{K} p_{k} (F_{k} (w_{t}^{k}) - F_{k} (w^{🟉})) + 4 L η_{t}^{2} \sum_{k = 1}^{K} p_{k} (F_{k} (w^{🟉}) - F_{k}^{🟉}) \\ = (4 L η_{t}^{2} - 2 η_{t}) \sum_{k = 1}^{K} p_{k} (F_{k} (w_{t}^{k}) - F^{🟉}) + 4 L η_{t}^{2} (F^{🟉} - \sum_{k = 1}^{K} p_{k} F_{k}^{🟉}) \\ = (4 L η_{t}^{2} - 2 η_{t}) \sum_{k = 1}^{K} p_{k} (F_{k} (w_{t}^{k}) - F^{🟉}) + 4 L η_{t}^{2} Γ . \end{matrix}

(A31)

Since

η_{t} \leq \frac{1}{4 L}

, which implies that

4 L η_{t}^{2} - 2 η_{t} \leq 0

, then,

\sum_{k = 1}^{K} p_{k} (F_{k} (w_{t}^{k}) - F^{🟉})

can be bounded by

\begin{matrix} \sum_{k = 1}^{K} p_{k} (F_{k} (w_{t}^{k}) - F^{🟉}) = & \sum_{k = 1}^{K} p_{k} (F_{k} (w_{t}^{k}) - F_{k} ({\bar{w}}_{t})) + \sum_{k = 1}^{K} p_{k} (F_{k} ({\bar{w}}_{t}) - F^{🟉}) \\ \overset{(a)}{\geq} & \sum_{k = 1}^{K} p_{k} 〈\nabla F_{k} ({\bar{w}}_{t}), {\bar{w}}_{t}^{k} - {\bar{w}}_{t}〉 + (F (\bar{w}) - F^{🟉}) \\ \overset{(b)}{\geq} & - \frac{1}{2} \sum_{k = 1}^{K} p_{k} (η_{t} {∥\nabla F_{k} ({\bar{w}}_{t})∥}^{2} + \frac{1}{η_{t}} {∥w_{t}^{k} - {\bar{w}}_{t}∥}^{2}) + (F (\bar{w}) - F^{🟉}) \\ \overset{(c)}{\geq} & - L η_{t} \sum_{k = 1}^{K} p_{k} (F_{k} ({\bar{w}}_{t}) - F_{k}^{🟉}) - \frac{1}{2 η_{t}} \sum_{k = 1}^{K} p_{k} {∥w_{t}^{k} - {\bar{w}}_{t}∥}^{2} \\ + (F (\bar{w}) - F^{🟉}), \end{matrix}

(A32)

where

(a)

follows from the convexity of

F_{k} (\cdot)

;

(b)

holds due to the AM-GM inequality; in

(c)

, we invoke the L-smoothness of

F_{k} (\cdot)

. Thus, by plugging (A32) into (A31), we have

\begin{matrix} Ψ (t) \leq & L η_{t} (2 η_{t} - 4 L η_{t}^{2}) \sum_{k = 1}^{K} p_{k} (F_{k} ({\bar{w}}_{t}) - F_{k}^{🟉}) + (1 - 2 L η_{t}) \sum_{k = 1}^{K} p_{k} {∥w_{t}^{k} - {\bar{w}}_{t}∥}^{2} \\ - (2 η_{t} - 4 L η_{t}^{2}) (F (\bar{w}) - F^{🟉}) + 4 L η_{t}^{2} Γ \\ \leq & (L η_{t} - 1) (2 η_{t} - 4 L η_{t}^{2}) (F (\bar{w}) - F^{🟉}) + (1 - 2 L η_{t}) \sum_{k = 1}^{K} p_{k} {∥w_{t}^{k} - {\bar{w}}_{t}∥}^{2} \\ + (4 L η_{t}^{2} + L η_{t} (2 η_{t} - 4 L η_{t}^{2})) Γ \\ \overset{(a)}{\leq} & \sum_{k = 1}^{K} p_{k} {∥w_{t}^{k} - {\bar{w}}_{t}∥}^{2} + 6 L η_{t}^{2} Γ, \end{matrix}

(A33)

where

(a)

follows from

L η_{t} - 1 < 0

and

2 η_{t} - 4 L η_{t}^{2} \leq 2 η_{t}

. Then, combining the results of (A29), (A31)–(A33), we have

\begin{matrix} E [{∥{\bar{v}}_{t + 1} - w^{🟉}∥}^{2}] \leq & (1 - μ η_{t}) E [{∥{\bar{w}}_{t} - w^{🟉}∥}^{2}] + 2 E [\sum_{k = 1}^{K} p_{k} {∥ {\bar{w}}_{t} - w_{t}^{k} ∥}^{2}] \\ + 2 p η_{t}^{2} E [\sum_{k = 1}^{K} p_{k} {∥ ξ_{t}^{k} ∥}^{2}] + 6 L Γ η_{t}^{2} . \end{matrix}

(A34)

When

t + 1 \notin I_{Q}

, we have

\begin{matrix} g_{t} = & \sum_{k = 1}^{K} p_{k} \nabla F_{k} (w_{t}^{k}, B_{t}^{k}), \end{matrix}

(A35a)

\begin{matrix} {\bar{g}}_{t} = & \sum_{k = 1}^{K} p_{k} \nabla F_{k} (w_{t}^{k}) . \end{matrix}

(A35b)

By following the same procedure, we obtain the similar result as follows:

\begin{matrix} E [{∥{\bar{v}}_{t + 1} - w^{🟉}∥}^{2}] & \leq (1 - μ η_{t}) E [{∥{\bar{w}}_{t} - w^{🟉}∥}^{2}] + 2 E [\sum_{k = 1}^{K} p_{k} {∥ {\bar{w}}_{t} - w_{t}^{k} ∥}^{2}] + 6 L Γ η_{t}^{2} . \end{matrix}

(A36)

Thus, we complete the proof. □

Appendix C.2. Proof of Lemma A2

Proof.

According to Assumption 3, the variance of stochastic gradients in each client is bounded by

ϕ_{k}^{2} / b

. Then, the

A_{2}

term in (A24) can be bounded by

\begin{matrix} E [{∥g_{t} - {\bar{g}}_{t}∥}^{2}] & = E [∥ \sum_{k = 1}^{K} p_{k} (Π_{G} \nabla F_{k} (w_{t}^{k}, B_{t}^{k}) + ξ_{t}^{k}) - \sum_{k = 1}^{K} p_{k} (Π_{G} \nabla F_{k} (w_{t}^{k}) + ξ_{t}^{k}) ∥^{2}] \\ = \sum_{k = 1}^{K} p_{k}^{2} E [∥ \nabla F_{k} (w_{t}^{k}, B_{t}^{k}) - \nabla F_{k} (w_{t}^{k}) ∥^{2}] \\ \leq \sum_{k = 1}^{K} p_{k}^{2} ϕ_{k}^{2} / b . \end{matrix}

(A37)

Thus, the proof is completed. □

Appendix C.3. Proof of Lemma A3

Proof.

For any

t \in [0, Q - 1]

, there exists

t_{0} \leq t

such that

t - t_{0} \leq Q - 1

and

{\bar{w}}_{t_{0}} = w_{t_{0}}^{k}

for

k \in [0, N]

. Assume

η_{t_{0}} \leq 2 η_{t}

and

η_{t}

is non-increasing, i.e.,

η_{t} \leq η_{t_{0}}

. When

t + 1 \in I_{Q}

, we have

\begin{matrix} E [\sum_{k = 1}^{K} p_{k} {∥ {\bar{w}}_{t} - w_{t}^{k} ∥}^{2}] & = E [\sum_{k = 1}^{K} p_{k} {∥ (w_{t}^{k} - {\bar{w}}_{t_{0}}) - ({\bar{w}}_{t} - {\bar{w}}_{t_{0}}) ∥}^{2}] \\ = E [\sum_{k = 1}^{K} p_{k} ∥ w_{t}^{k} - {\bar{w}}_{t_{0}} ∥^{2}] - E [\sum_{k = 1}^{K} p_{k} ∥ w_{t}^{k} - {\bar{w}}_{t_{0}} ∥^{2}] \\ \overset{(a)}{\leq} E [\sum_{k = 1}^{K} p_{k} ∥ w_{t}^{k} - {\bar{w}}_{t_{0}} ∥^{2}] \\ \overset{(b)}{\leq} \sum_{k = 1}^{K} p_{k} E [\sum_{τ = t_{0}}^{t} (Q - 1) η_{τ}^{2} {∥ \nabla F_{k} (w_{τ}^{k}, B_{τ}^{k}) + z_{τ}^{k} ∥}^{2}] \\ \leq \sum_{k = 1}^{K} p_{k} \sum_{τ = t_{0}}^{t} (Q - 1) η_{t_{0}}^{2} (G^{2} + d σ_{t, k}^{2}) \\ \leq 4 {(Q - 1)}^{2} η_{t}^{2} (G^{2} + d \sum_{k = 1}^{K} p_{k} σ_{t, k}^{2}), \end{matrix}

(A38)

where

(a)

follows from the convexity of

{∥ \cdot ∥}^{2}

, and

(b)

holds since

{E ∥ X - E X ∥}^{2} = {E ∥ X ∥}^{2} - {∥ E [X] ∥}^{2}

. Similarly, when

t + 1 \notin I_{Q}

, we can derive a similar result:

\begin{matrix} E [\sum_{k = 1}^{K} p_{k} ∥ {\bar{w}}_{t} - w_{t}^{k} ∥^{2}] \leq 4 {(Q - 1)}^{2} G^{2} η_{t}^{2} . \end{matrix}

(A39)

Thus, the proof is completed. □

Appendix C.4. Proof of Lemma A4

Proof.

When

t + 1 \in I_{Q}

, the DP noise in each round is bounded by

\begin{matrix} E [\sum_{k = 1}^{K} p_{k} {∥ ξ_{t}^{k} ∥}^{2}] & = \sum_{k = 1}^{K} p_{k} {(σ_{t}^{k})}^{2} \\ = \frac{128 η_{t}^{2} Q^{4} b^{2} G^{2}}{n ϵ^{2}} \sum_{k = 1}^{K} \frac{log (1.25 Q b / n_{k} δ)}{n_{k}} \\ \overset{(a)}{\leq} \frac{128 η_{t}^{2} Q^{4} b^{2} G^{2}}{n ϵ^{2}} \int_{k = 1}^{n} \frac{log (1.25 Q b / k δ)}{k} d k \\ = - \frac{128 η_{t}^{2} Q^{4} b^{2} G^{2}}{n ϵ^{2}} \int_{k = 1}^{n} log (k δ / 1.25 Q b) d log (k δ / 1.25 Q b) \\ = - \frac{64 η_{t}^{2} Q^{4} b^{2} G^{2}}{n ϵ^{2}} ({log}^{2} (n δ / 1.25 Q b) - {log}^{2} (δ / 1.25 Q b)) \\ = \frac{64 η_{t}^{2} Q^{4} b^{2} G^{2}}{n ϵ^{2}} log (n) log (1 . 25^{2} Q^{2} b^{2} / n δ^{2}) \\ \overset{(b)}{\leq} \frac{4 Q^{4} b^{2} G^{2}}{n L^{2} ϵ^{2}} log (n) log (1 . 25^{2} Q^{2} b^{2} / n δ^{2}), \end{matrix}

(A40)

where

(a)

follows

\sum_{k = 1}^{K} log (1.25 Q b / n_{k} δ) / n_{k} \leq \int_{k = 1}^{n} (log (1.25 Q b / k δ) / k) d k

, and

(b)

holds due to

η_{t} \leq 1 / 4 L

. □

Appendix D. Proof of Theorem 4

As defined in Appendix A, let

I_{Q} = \{z Q | z = 1, 2, \dots\}

be the set of global synchronization steps. Note that, for each communication round, the PS randomly activates a subset of clients

S_{t}

according to some sampling schemes, and then FedAvg only performs updates on selected clients. Thus, the

S_{t}

varies in each communication round. Additionally, we define two virtual sequences

{\bar{w}}_{t} = \sum_{k = 1}^{K} p_{k} w_{t}^{k}

and

{\bar{v}}_{t} = \sum_{k = 1}^{K} p_{k} v_{t}^{k}

; when

t + 1 \in I_{Q}

, we define

\begin{matrix} {\bar{g}}_{t} = & \sum_{k = 1}^{K} p_{k} (\nabla F_{k} (w_{t}^{k}) + ξ_{t}^{k}), \end{matrix}

(A41)

\begin{matrix} g_{t} = & \sum_{k = 1}^{K} p_{k} (\nabla F_{k} (w_{t}^{k}, B_{t}^{k}) + ξ_{t}^{k}) . \end{matrix}

(A42)

When

t + 1 \notin I_{Q}

, we define

\begin{matrix} g_{t} = & \sum_{k = 1}^{K} p_{k} \nabla F_{k} (w_{t}^{k}, B_{t}^{k}), \end{matrix}

(A43)

\begin{matrix} {\bar{g}}_{t} = & \sum_{k = 1}^{K} p_{k} \nabla F_{k} (w_{t}^{k}) . \end{matrix}

(A44)

Thus, we always have

{\bar{v}}_{t + 1} = {\bar{w}}_{t} - η_{t} g_{t}

and

E [g_{t}] = {\bar{g}}_{t}

. Then, the update scheme of DP-FedAvg with PCP being taken into consideration is

\begin{matrix} v_{t + 1}^{k} = \{\begin{matrix} w_{t}^{k} - η_{t} \nabla F_{k} (w_{t}^{k}, B_{t}^{k}), & if t + 1 \notin I_{Q}, \\ w_{t}^{k} - η_{t} (\nabla F_{k} (w_{t}^{k}, B_{t}^{k}) + ξ_{t}^{k}), & if t + 1 \in I_{Q}, \end{matrix} \end{matrix}

(A45)

\begin{matrix} w_{t + 1}^{k} = \{\begin{matrix} v_{t + 1}^{k} & if t + 1 \notin I_{Q}, \\ samples S_{t + 1} and average {\{v_{t + 1}^{k}\}}_{k \in S_{t + 1}} & if t + 1 \in I_{Q}, \end{matrix} \end{matrix}

(A46)

where

v_{t + 1}^{k}

denotes a one-step SGD update from

w_{t}^{k}

and

w_{t + 1}^{k}

can be interpreted as the result after the communication steps.

Appendix D.1. Key Lemmas

Motivated by [6], when

t + 1 \notin I_{Q}

, we still have

{\bar{w}}_{t + 1} = {\bar{v}}_{t + 1}

. However, when

t + 1 \in I_{Q}

, we aim to establish this relation in the sense of expectation, i.e.,

E_{S_{t + 1}} [{\bar{w}}_{t + 1}] = {\bar{v}}_{t + 1}

. To achieve this, the sampling strategy on the PS must be unbiased. We identify two sampling and averaging schemes that satisfy this requirement and provide convergence guarantees.

(I) If the PS performs the sampling with replacement to establish

S_{t + 1}

, then each element in

S_{t + 1}

may occur multiple times, with each client being sampled with equal probability. In this case, the PS averages the parameters by

w_{t + 1}^{k} = \frac{1}{K} \sum_{k \in S_{t + 1}} v_{t + 1}^{k}

.

(II) If the PS performs the sampling without replacement to establish

S_{t + 1}

, then each element in

S_{t + 1}

can appear only once. In this case, the PS averages the parameters by

w_{t + 1}^{k} = \frac{N}{K} \sum_{k \in S_{t + 1}} p_{k} v_{t + 1}^{k}

.

Due to the fact that

S_{t + 1}

varies in each communication round, we assume that the DP-FedAvg algorithm activates all devices at the beginning of each communication round. The updated parameters depend only on the sampled clients, and the next-round parameters are broadcasted to all clients. This approach simplifies our theoretical analysis. As expected,

w_{t + 1}^{k}

is equal to the weighted average of the parameters across all devices after the SGD updates.

Lemma A5.

If

t + 1 \in I_{Q}

, for both client sampling with/without replacement schemes, we have

\begin{matrix} E_{S_{t + 1}} [{\bar{w}}_{t + 1}] = {\bar{v}}_{t + 1} . \end{matrix}

(A47)

Proof.

Please refer to the proof of Lemma 4 in [6]. □

Lemma A6.

For

t + 1 \in I_{Q}

, assume

η_{t}

is non-increasing and

η_{t} \leq 2 η_{t + Q}

; then, for client sampling with the replacement scheme,

\begin{matrix} E_{S_{t + 1}} [∥ {\bar{w}}_{t + 1} - {\bar{w}}_{t + 1} ∥^{2}] = \frac{4}{K} η_{t}^{2} Q^{2} G^{2}, \end{matrix}

(A48)

and for client sampling without the replacement scheme,

\begin{matrix} E_{S_{t + 1}} [∥ {\bar{w}}_{t + 1} - {\bar{w}}_{t + 1} ∥^{2}] = \frac{N - K}{N - 1} \frac{4}{K} η_{t}^{2} Q^{2} G^{2} . \end{matrix}

(A49)

Proof.

Please refer to the proof of Lemma 5 in [6]. □

Appendix D.2. Completing the Proof of Theorem 4

Proof.

\begin{matrix} {∥{\bar{w}}_{t + 1} - w^{🟉}∥}^{2} & = {∥{\bar{w}}_{t + 1} - {\bar{v}}_{t + 1} + {\bar{v}}_{t + 1} - w^{🟉}∥}^{2} \\ = {∥{\bar{w}}_{t + 1} - {\bar{v}}_{t + 1}∥}^{2} + {∥{\bar{v}}_{t + 1} - w^{🟉}∥}^{2} + 2 \underset{≜ A_{3}}{\underset{︸}{〈{\bar{w}}_{t + 1} - {\bar{v}}_{t + 1}, {\bar{v}}_{t + 1} - w^{🟉}〉}}, \end{matrix}

(A50)

where the

A_{3}

term will be 0 due to Lemma A5. When

t + 1 \notin I_{Q}

, the first term will vanish due to

{\bar{w}}_{t + 1} = {\bar{v}}_{t + 1}

. Then, we have

\begin{matrix} E [∥ {\bar{w}}_{t + 1} - w^{🟉} ∥^{2}] \leq (1 - μ η_{t}) E [∥ {\bar{w}}_{t} - w^{🟉} ∥^{2}] + A η_{t}^{2} . \end{matrix}

(A51)

When

t + 1 \in I_{Q}

, we use Lemma A6 to bound the second term in the second equality:

\begin{matrix} E [∥ {\bar{w}}_{t + 1} - w^{🟉} ∥^{2}] & = E [∥ {\bar{w}}_{t + 1} - {\bar{v}}_{t + 1} ∥^{2}] + E [∥ {\bar{v}}_{t + 1} - w^{🟉} ∥^{2}] \\ \leq (1 - μ η_{t}) E [∥ {\bar{w}}_{t} - w^{🟉} ∥^{2}] + (A + B + H) η_{t}^{2}, \end{matrix}

(A52)

where H is defined in Theorem 4. The main difference between (A16) and (A52) is the additional term H. Thus, we can follow the same progress and argument to prove the theorem. To be specific, we choose

ν = max \{\frac{β^{2} (A + B + H)}{μ β - 1}, α △_{0}\}

,

β = \frac{2}{μ}

and

η_{t} = \frac{2}{μ} \frac{1}{α + t}

, where

α \geq max {8 κ, Q}

,

κ = \frac{L}{μ}

. By the L-smoothness of

F (\cdot)

, we have

\begin{matrix} E [F ({\bar{w}}_{t}) - F^{🟉}] \leq \frac{L}{2} E [∥ {\bar{w}}_{t} - w^{🟉} ∥^{2}] \leq \frac{L}{2} \frac{ν}{t + α}, \end{matrix}

(A53)

then,

\begin{matrix} E [F ({\bar{w}}_{t}) - F^{🟉}] \leq \frac{2 κ}{t + α} (\frac{A + B + H}{μ} + 2 L ∥ w_{1} - w^{🟉} ∥^{2}) . \end{matrix}

(A54)

Thus, we complete the proof. □

References

Li, Y.; Chang, T.H.; Chi, C.Y. Secure Federated Averaging Algorithm with Differential Privacy. In Proceedings of the IEEE International Workshop on Machine Learning for Signal Processing (MLSP), Espoo, Finland, 17–20 September 2020; pp. 1–6. [Google Scholar]
Sattler, F.; Wiedemann, S.; Müller, K.R.; Samek, W. Robust and communication-efficient federated learning from non-iid data. IEEE Trans. Neural Netw. Learn. Syst. 2019, 31, 3400–3413. [Google Scholar] [CrossRef] [PubMed]
Wang, S.; Chang, T.-H. Federated Matrix Factorization: Algorithm Design and Application to Data Clustering. IEEE Trans. Signal Process. 2022, 70, 1625–1640. [Google Scholar] [CrossRef]
Wang, S.; Xu, Y.; Yuan, Y.; Quek, T.Q.S. Toward Fast Personalized Semi-Supervised Federated Learning in Edge Networks: Algorithm Design and Theoretical Guarantee. IEEE Trans. Wireless Commun. 2024, 23, 1170–1183. [Google Scholar] [CrossRef]
McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; Y Arcas, B.A. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 20–22 April 2017; pp. 1273–1282. [Google Scholar]
Li, X.; Huang, K.; Yang, W.; Wang, S.; Zhang, Z. On the Convergence of FedAvg on non-IID Data. In Proceedings of the International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, 26–30 April 2020; pp. 1–26. [Google Scholar]
Li, Y.; Wang, S.; Chi, C.Y.; Quek, T.Q.S. Differentially Private Federated Clustering Over Non-IID Data. IEEE Internet Things J. 2024, 11, 6705–6721. [Google Scholar] [CrossRef]
Glasgow, M.R.; Yuan, H.; Ma, T. Sharp bounds for federated averaging (local SGD) and continuous perspective. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Virtual, 28–30 March 2022; pp. 9050–9090. [Google Scholar]
Woodworth, B.E.; Patel, K.K.; Srebro, N. Minibatch vs local SGD for heterogeneous distributed learning. In Proceedings of the Advances in Neural Information Processing Systems 33 (NeurIPS 2020), Virtual, 6–12 December 2020; pp. 6281–6292. [Google Scholar]
Hsu, T.M.H.; Qi, H.; Brown, M. Measuring the effects of non-identical data distribution for federated visual classification. arXiv 2019, arXiv:1909.06335. [Google Scholar]
Li, T.; Sahu, A.K.; Zaheer, M.; Sanjabi, M.; Talwalkar, A.; Smith, V. Federated optimization in heterogeneous networks. Proc. Mach. Learn. Syst. 2020, 2, 429–450. [Google Scholar]
Karimireddy, S.P.; Kale, S.; Mohri, M.; Reddi, S.J.; Stich, S.U.; Suresh, A.T. Scaffold: Stochastic controlled averaging for on-device federated learning. In Proceedings of the International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, 26–30 April 2020; pp. 5132–5143. [Google Scholar]
Geiping, J.; Bauermeister, H.; Dröge, H.; Moeller, M. Inverting gradients-how easy is it to break privacy in federated learning? In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Virtual, 6–12 December 2020; pp. 16937–16947. [Google Scholar]
Dwork, C.; Roth, A. The algorithmic foundations of differential privacy. Found. Trends® Theor. Comput. Sci. 2014, 9, 211–407. [Google Scholar] [CrossRef]
Li, Y.; Huang, C.W.; Wang, S.; Chi, C.Y.; Quek, T.Q. Privacy-Preserving Federated Primal-Dual Learning for Non-Convex and Non-Smooth Problems With Model Sparsification. IEEE Internet Things J. 2024, 11, 25853–25866. [Google Scholar] [CrossRef]
Wang, X.; Wang, S.; Li, Y.; Fan, F.; Li, S.; Lin, X. Differentially Private and Heterogeneity-Robust Federated Learning with Theoretical Guarantee. IEEE Trans. Artif. Intell. 2024, 5, 6369–6384. [Google Scholar] [CrossRef]
Li, Z.; He, Y.; Yu, H.; Kang, J.; Li, X.; Xu, Z.; Niyato, D. Data heterogeneity-robust federated learning via group client selection in industrial IoT. IEEE Internet Things J. 2022, 9, 17844–17857. [Google Scholar] [CrossRef]
Wang, S.; Xu, Y.; Wang, Z.; Chang, T.-H.; Quek, T.Q.S.; Sun, D. Beyond ADMM: A Unified Client-Variance-Reduced Adaptive Federated Learning Framework. In Proceedings of the AAAI, Washington, DC, USA, 7–14 February 2023; pp. 10175–10183. [Google Scholar]
Wu, H.; Wang, P. Fast-convergent federated learning with adaptive weighting. IEEE Trans. Cogn. Commun. Netw. 2021, 7, 1078–1088. [Google Scholar] [CrossRef]
Wu, H.; Tang, X.; Zhang, Y.J.A.; Gao, L. Incentive Mechanism for Federated Learning With Random Client Selection. IEEE Trans. Netw. Sci. Eng. 2024, 11, 1922–1933. [Google Scholar] [CrossRef]
Saha, R.; Misra, S.; Chakraborty, A.; Chatterjee, C.; Deb, P.K. Data-centric client selection for federated learning over distributed edge networks. IEEE Trans. Parallel Distrib. Syst. 2022, 34, 675–686. [Google Scholar] [CrossRef]
Pang, J.; Yu, J.; Zhou, R.; Lui, J.C. An incentive auction for heterogeneous client selection in federated learning. IEEE Trans. Mob. Comput. 2022, 22, 5733–5750. [Google Scholar] [CrossRef]
Shen, X.; Liu, Y.; Zhang, Z. Performance-enhanced Federated Learning with Differential Privacy for Internet of Things. IEEE Internet Things J. 2022, 9, 24079–24094. [Google Scholar] [CrossRef]
Li, Y.; Wang, S.; Chi, C.Y.; Quek, T.Q. Differentially Private Federated Learning in Edge Networks: The Perspective of Noise Reduction. IEEE Netw. 2022, 36, 167–172. [Google Scholar] [CrossRef]
Abadi, M.; Chu, A.; Goodfellow, I.; McMahan, H.B.; Mironov, I.; Talwar, K.; Zhang, L. Deep learning with differential privacy. In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security, Vienna, Austria, 24–28 October 2016; pp. 308–318. [Google Scholar]
McMahan, H.B.; Ramage, D.; Talwar, K.; Zhang, L. Learning Differentially Private Recurrent Language Models. In Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017. [Google Scholar]
Huang, Z.; Hu, R.; Guo, Y.; Chan-Tin, E.; Gong, Y. DP-ADMM: ADMM-based distributed learning with differential privacy. IEEE Trans. Inf. Forensics Secur. 2019, 15, 1002–1012. [Google Scholar] [CrossRef]
Erlingsson, Ú.; Feldman, V.; Mironov, I.; Raghunathan, A.; Talwar, K.; Thakurta, A. Amplification by shuffling: From local to central differential privacy via anonymity. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, San Diego, CA, USA, 6–9 January 2019; pp. 2468–2479. [Google Scholar]
Balle, B.; Barthe, G.; Gaboardi, M. Privacy amplification by subsampling: Tight analyses via couplings and divergences. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Montreal, QC, Canada, 3–8 December 2018; pp. 6277–6287. [Google Scholar]
Wei, K.; Li, J.; Ding, M.; Ma, C.; Yang, H.H.; Farokhi, F.; Jin, S.; Quek, T.Q.S.; Poor, H.V. Federated Learning With Differential Privacy: Algorithms and Performance Analysis. IEEE Trans. Inf. Forensics Secur. 2020, 15, 3454–3469. [Google Scholar] [CrossRef]
Noble, M.; Bellet, A.; Dieuleveut, A. Differentially Private Federated Learning on Heterogeneous Data. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Virtual, 28–30 March 2022; pp. 10110–10145. [Google Scholar]
Shokri, R.; Stronati, M.; Song, C.; Shmatikov, V. Membership inference attacks against machine learning models. In Proceedings of the IEEE symposium on security and privacy (SP), San Jose, CA, USA, 22–26 May 2017; pp. 3–18. [Google Scholar]
Zhang, Y.; Duchi, J.C.; Wainwright, M.J. Communication-efficient algorithms for statistical optimization. J. Mach. Learn. Res. 2013, 14, 3321–3363. [Google Scholar]
Blake, C.L.; Merz, C.J. UCI Repository of Machine Learning Databases. 1996. Available online: https://archive.ics.uci.edu/dataset/2/adult (accessed on 30 April 1996).
LeCun, Y.; Cortes, C.; Burges, C. The MNIST Database. 1998. Available online: http://yann.lecun.com/exdb/mnist (accessed on 1 November 1998).

Figure 1. A typical FL system.

Figure 3. The testing accuracy versus communication rounds of the DP-FedAvg algorithm with various Q under Adult and MNIST datasets.

Table 1. Notations.

Notation	Definition
$∥ \cdot ∥$	Euclidean norm.
$〈 x, y 〉 = x^{⊤} y$	The inner product operator.
$ϵ$ , $δ$	The DP parameter.
$▵$	The $ℓ_{2}$ -norm sensitivity.
N	The total number of clients.
T	The total number of iterations.
Q	The number of local SGD updates.
$\| D \|$	The size of set $D$ .
$w^{⊤}$	Transpose of vector $w$ .
$w_{t}$	The global model at t-th iteration.
$w_{t}^{k}$	The local model for k-th client at t-th iteration.
$F_{k} (w)$	The objective function for the client k.
$\nabla F_{k} (w_{t}^{k}, B_{t}^{k})$	The mini-batch gradient using the data $B_{t}^{k}$ at client k.
$Γ$	The degree of non-i.i.d. data.
$G$	The gradient clipping level.
n, $n_{k}$	The number of data samples for all client and the client k.
b	The size of mini-batch data.
d	The dimension of model parameters.
$ϕ_{k}$	The variance of stochastic gradients for the k-th client.
L	L-smooth parameter.
$μ$	$μ$ -strong convex parameter.
$E [\cdot]$ , $P [\cdot]$	The statistical expectation and the probability function, respectively.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Li, Y.; Wang, S.; Wu, Q. Convergence Analysis for Differentially Private Federated Averaging in Heterogeneous Settings. Mathematics 2025, 13, 497. https://doi.org/10.3390/math13030497

AMA Style

Li Y, Wang S, Wu Q. Convergence Analysis for Differentially Private Federated Averaging in Heterogeneous Settings. Mathematics. 2025; 13(3):497. https://doi.org/10.3390/math13030497

Chicago/Turabian Style

Li, Yiwei, Shuai Wang, and Qilong Wu. 2025. "Convergence Analysis for Differentially Private Federated Averaging in Heterogeneous Settings" Mathematics 13, no. 3: 497. https://doi.org/10.3390/math13030497

APA Style

Li, Y., Wang, S., & Wu, Q. (2025). Convergence Analysis for Differentially Private Federated Averaging in Heterogeneous Settings. Mathematics, 13(3), 497. https://doi.org/10.3390/math13030497

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Convergence Analysis for Differentially Private Federated Averaging in Heterogeneous Settings †

Abstract

1. Introduction

1.1. Related Work

1.2. Contributions

2. Preliminaries

2.1. Federated Learning

2.2. Differential Privacy

2.3. The Typical DP-FedAvg Algorithm

3. Theoretical Analysis for DP-FedAvg

3.1. Assumptions

3.2. Privacy Analysis

3.3. Convergence Analysis for DP-FedAvg

4. Numerical Experiments

4.1. Experimental Setting

4.2. Impact of the Privacy Protection Level ϵ

4.3. Impact of Local Epoch Length Q

5. Conclusions and Future Works

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

Appendix A. Proof of the Lemma 2

Appendix B. Proof of Theorem 3

Appendix B.1. Key Lemmas

Appendix B.2. Completing the Proof of Theorem 3

Appendix C. Proofs of Key Lemmas in Theorem 3

Appendix C.1. Proof of Lemma A1

Appendix C.2. Proof of Lemma A2

Appendix C.3. Proof of Lemma A3

Appendix C.4. Proof of Lemma A4

Appendix D. Proof of Theorem 4

Appendix D.1. Key Lemmas

Appendix D.2. Completing the Proof of Theorem 4

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

Convergence Analysis for Differentially Private Federated Averaging in Heterogeneous Settings^†

4.2. Impact of the Privacy Protection Level $ϵ$