Privacy-Preserving Distributed Learning via Newton Algorithm

Abstract: Federated learning (FL) is a prominent distributed learning framework. The main barriers to FL include communication cost and privacy breaches. In this work, we propose a novel privacy-preserving second-order FL method, called GDP-LocalNewton. To improve communication efficiency, we use Newton's method to iterate and allow local computations before aggregation. To ensure a strong privacy guarantee, we make use of the notion of differential privacy (DP) and add Gaussian noise in each iteration. Using advanced tools of Gaussian differential privacy (GDP), we prove that the proposed algorithm satisfies the strong notion of GDP. We also establish the convergence of our algorithm. It turns out that the convergence error comes from the local computation and from the Gaussian noise added for DP. We conduct experiments to show the merits of the proposed algorithm.


Introduction
Federated learning (FL) is a popular distributed learning paradigm that enables a large amount of clients to collaboratively train a global model without sharing their individual data [1]. The most popular algorithm is called federated averaging (FedAvg). In each round, the server broadcasts the current model to all the clients. The clients then run multiple steps of stochastic gradient descent (SGD) in a distributed fashion. After that, the server updates the global model by aggregating results from local clients. The convergence rate of FedAvg has been studied extensively in recent years [2][3][4].
One desideratum of a real FL paradigm is communication efficiency. The communication cost of FL mainly comes from the latency cost, that is, the fixed cost of sending messages, which is proportional to the number of communication rounds regardless of message size. In pursuit of this, many first-order gradient-based methods have been developed, such as local stochastic gradient descent (Local SGD) [1,[5][6][7] and mini-batch SGD [8][9][10]. These algorithms reduce the communication cost by performing local computations at the client devices before aggregation. This works well when the clients are mobile devices that have reasonable computational power but may suffer from communication latency [11]. However, for some serverless systems, say, cloud-based systems, the latency cost is severe and communication failures are more frequent. Hence, the number of communication rounds should be reduced further. Recent years have seen a few works on using second-order methods [11][12][13][14][15] to improve the convergence of first-order methods. Wang et al. [12] proposed the GIANT algorithm, a new distributed approximation of Newton's method, and showed an improved convergence rate over the distributed first-order competitors. Dünner et al. [13] developed a method that approximates the global Hessian matrix using the block Hessian matrices from users, thereby reducing the computational burden of forming the global Hessian matrix. Gupta et al. [11] proposed and analyzed a second-order method with local computations, called LocalNewton. Due to its second-order nature and local computations, LocalNewton is superior in terms of communication cost.
The other desideratum of a real FL paradigm is privacy preservation. The data from local clients often contain sensitive information about individuals. To some extent, FL preserves individuals' privacy because the original data never leave the clients. However, there exist adversarial attacks that can cause privacy leakage by exploiting the information exchanged during communication; see [16] for example. Hence, the vanilla FL paradigm does not have a rigorous privacy guarantee. Differential privacy (DP; Dwork [17]) is a standard and well-adopted framework that provides a strong guarantee of an individual's privacy by ensuring that no individual has a significant influence on the algorithm's output. The worst-case influence is termed the privacy budget. A differentially private (DP) algorithm is often achieved by randomizing the algorithm's output. Besides the vanilla (ε, δ)-DP, several DP notions have been developed, such as Rényi differential privacy (RDP) [18] and Gaussian differential privacy (GDP) [19]. RDP and GDP enjoy an elegant composition property, that is, the theoretical bound on the privacy leakage of repeated queries can be tight and lossless, which brings benefits when accounting for the privacy leakage of iterative algorithms. Using the notion of DP, there have been several studies on developing privacy-preserving FL algorithms; see McMahan et al. [2], Geyer et al. [3], Triastcyn and Faltings [4], Noble et al. [20], Wei et al. [21], Girgis et al. [22], Cheu et al. [23], Rastogi and Nath [24], Huang et al. [25], among others. Most of these algorithms are first-order based.
With the two desiderata in mind, we develop a novel privacy-preserving second-order-based method called GDP-LocalNewton within the FL framework. Building upon Newton's algorithm, each device performs Newton iterations locally. To avoid privacy leakage, we take advantage of the notion of GDP [19] and add Gaussian noise to the updates in each iteration. In particular, we propose a novel line-search method to determine the step size in the Newton update. After several local steps, the devices send parameter updates to the central server. The server then aggregates these updates and broadcasts the result to synchronize all local parameters. This process is iterated until convergence or until specific conditions are attained. It is worth mentioning that we assume a specific adversary termed the curious onlooker, who can eavesdrop on the communication between the server and the local machines. The server itself is a typical example of a curious onlooker.
We rigorously show that under proper parameters set-up, the proposed algorithm satisfies GDP. We also analyze the convergence bound of GDP-LocalNewton. It turns out that the convergence error comprises two components. One comes from the DP noises, and the other comes from the local computations. We also conduct experiments to show the merits of GDP-LocalNewton.

GDP-LocalNewton
In this section, we first provide the problem formulation and the framework of GDP. Then, we develop the GDP-LocalNewton algorithm. After that, we provide rigorous privacy analysis and convergence analysis of the proposed algorithm.

Some Notations and Symbols
We represent vectors (e.g., g) and matrices (e.g., H) as bold lowercase and uppercase letters, respectively. ‖g‖ denotes the ℓ2 norm of a vector g, and the spectral norm of a matrix H is denoted by ‖H‖. I denotes an identity matrix, and the set {1, 2, . . . , n} is denoted by [n]. To distinguish the indices of workers and iterations, we denote the worker index as a superscript (e.g., g^k) and the iteration counter as a subscript (e.g., g_t).
Our paper contains a large number of parameters. For ease of reading and distinction, we list them in Table 1:
w_t: Global model parameter at the t-th iteration.
w^k: The k-th worker's model parameter.
w^k_t: The k-th worker's model parameter at the t-th iteration.
Hatted and barred symbols (e.g., ŵ_t, ŵ^k, ŵ^k_t): The noisy and averaged counterparts, defined like w_t, w^k, and w^k_t.

Fundamental Problem
We consider empirical risk minimization problems of the following form: min_{w ∈ R^d} f(w) := (1/n) ∑_{j=1}^n f_j(w), where f_j(·) : R^d → R, for all j ∈ [n] = {1, 2, . . . , n}, represents the loss of the j-th observation given an underlying parameter w ∈ R^d. We call f(·) the global loss function. In machine learning, such problems are ubiquitous, e.g., logistic and linear regression, support vector machines, neural networks, and graphical models. Taking logistic regression as an example, we have f_j(w) = ℓ_j(w; x_j, y_j) + (γ/2)‖w‖², with ℓ_j(w; x_j, y_j) = log(1 + exp(−y_j w^T x_j)), where ℓ_j(·) is the loss function for sample j ∈ [n] and γ is an appropriately chosen regularization parameter. X = [x_1, x_2, . . . , x_n] ∈ R^{d×n} is the sample matrix containing n data points, and y = [y_1, y_2, . . . , y_n] is the corresponding label vector. Hence, (X, y) defines the training dataset, composed of the pairs (x_j, y_j), j = 1, 2, . . . , n.
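As a concrete illustration of the logistic example above, the following sketch (our own code, not the paper's implementation) computes the regularized logistic loss together with its gradient and Hessian, i.e., the quantities that play the roles of f(w), g(w_t), and H(w_t):

```python
import numpy as np

def logistic_loss(w, X, y, gamma):
    """f(w) = (1/n) sum_j log(1 + exp(-y_j x_j^T w)) + (gamma/2) ||w||^2,
    with X of shape (d, n) as in the paper and labels y in {-1, +1}."""
    z = y * (X.T @ w)                       # per-sample margins, shape (n,)
    return np.mean(np.log1p(np.exp(-z))) + 0.5 * gamma * (w @ w)

def logistic_grad(w, X, y, gamma):
    """Gradient of the regularized logistic loss."""
    z = y * (X.T @ w)
    s = -y / (1.0 + np.exp(z))              # derivative of each loss w.r.t. margin
    return X @ s / X.shape[1] + gamma * w

def logistic_hessian(w, X, y, gamma):
    """Hessian of the regularized logistic loss: (1/n) X D X^T + gamma I."""
    d, n = X.shape
    z = y * (X.T @ w)
    p = 1.0 / (1.0 + np.exp(-z))            # sigmoid of margins
    D = p * (1.0 - p)                       # per-sample curvature weights
    return (X * D) @ X.T / n + gamma * np.eye(d)
```

Note that with γ > 0 the Hessian is bounded below by γI, which is consistent with the strong-convexity assumption made below.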
At the t-th iteration, the gradient and the Hessian are denoted by g(w_t) = ∇f(w_t) and H(w_t) = ∇²f(w_t), respectively, where w_t is the updated parameter at the t-th iteration. In this paper, we make some assumptions about the loss functions: the global loss function f(·) is smooth and strongly convex, ∇²f_i(·) ⪯ BI, and ‖∇f_i(·)‖₂ ≤ Γ, where B ∈ R and Γ ∈ R are some fixed constants.

Data Distribution
Assume that the FL system has K workers in total. For each k ∈ [K] = {1, 2, . . . , K}, S^k represents a subset chosen uniformly at random from [n] = {1, 2, . . . , n} without replacement, and each worker is uniquely identified by its subset. Each worker holds the same number of samples, s = |S^k| for all k ∈ [K]. Because the sampling is without replacement, we have S^1 ∪ S^2 ∪ . . . ∪ S^K = [n] and S^i ∩ S^j = ∅ for all i ≠ j ∈ [K]. Hence, K = n/s is the number of workers.
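The disjoint partition S^1, . . . , S^K described above can be produced by permuting [n] and slicing, as in this small sketch (the function name is ours):

```python
import numpy as np

def partition_workers(n, K, seed=0):
    """Randomly split {0, ..., n-1} into K disjoint index sets of equal size
    s = n / K, mimicking sampling without replacement across workers."""
    assert n % K == 0, "n must be divisible by K so that s = n/K is an integer"
    s = n // K
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)               # a uniform random permutation of [n]
    return [set(perm[k * s:(k + 1) * s]) for k in range(K)]
```

By construction the subsets are disjoint and their union is the whole index set, matching S^i ∩ S^j = ∅ and S^1 ∪ . . . ∪ S^K = [n].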

Gaussian Differential Privacy
In this subsection, we review the definition of Gaussian differential privacy (GDP) [26]. Then, we give some theorems and tools about GDP [26], which form the basis of the proof that our algorithm satisfies µ-GDP. Lemmas 1 and 2 show that every user's output satisfies µ_k-GDP. Lemma 3 guarantees DP when the entire dataset is divided into disjoint parts.
Definition 1 (f-DP). Let h : R^{n×m} → R^p be a randomized function, and D = {x_1, . . . , x_i, . . . , x_n} be a dataset. For a neighboring dataset D′ differing from D in one datum, consider testing H_0: the input is D versus H_1: the input is D′ based on the output of h. The mechanism h is f-DP if, for every α-level test, the power is at most 1 − f(α), where f is a trade-off function. Definition 2 (Gaussian differential privacy). Define Φ(·) as the standard Gaussian cumulative distribution function. A mechanism h is µ-GDP if it is f-DP with f(α) = G_µ(α) := Φ(Φ^{−1}(1 − α) − µ). Definition 3 (Global sensitivity). Let g : R^{n×m} → R^p be a deterministic function. The global sensitivity of g is the (possibly infinite) number GS_g = sup ‖g(x_{1:n}) − g(x′_{1:n})‖, where x_{1:n} means a dataset {x_1, x_2, . . . , x_n}, and x′_{1:n} differs from x_{1:n} in one datum.
Lemma 1 (Gaussian mechanism of GDP [26]). Let g : R^{n×m} → R^p be a function with finite global sensitivity GS_g. Let Z be a standard normal p-dimensional random vector. For all µ > 0 and x ∈ R^{n×m}, the random function ĥ(x) = g(x) + (GS_g/µ)Z satisfies µ-GDP. Lemma 2 (Matrix Gaussian mechanism [26]). Consider a data matrix A ∈ R^{n×m} such that each row vector a_i satisfies ‖a_i‖ ≤ 1. Further define the function h(A) = (1/n)A^T A. Let W be a symmetric random matrix whose upper-triangular elements and diagonal are i.i.d. (1/µ)N(0, 1). Then the random function ĥ(A) = h(A) + W satisfies µ-GDP. Lemma 3 (Parallel composition [27]). Suppose we have a set of privacy mechanisms M = {M_1, · · · , M_m}. If each M_i provides a µ_i-GDP guarantee on a disjoint subset of the entire dataset, then M provides (max{µ_1, · · · , µ_m})-GDP. Lemma 4 (Composition of GDP [19]). The n-fold composition of µ_i-GDP mechanisms is √(µ_1² + · · · + µ_n²)-GDP.
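Lemmas 1 and 4 translate directly into code. A minimal sketch (helper names are ours), assuming a vector-valued query with known global sensitivity:

```python
import numpy as np

def gaussian_mechanism(value, sensitivity, mu, rng):
    """Lemma 1: releasing g(x) + (GS_g / mu) * Z with Z ~ N(0, I) is mu-GDP."""
    return value + (sensitivity / mu) * rng.standard_normal(value.shape)

def compose_gdp(mus):
    """Lemma 4: composing mu_i-GDP mechanisms yields sqrt(sum mu_i^2)-GDP."""
    return float(np.sqrt(np.sum(np.square(mus))))
```

For instance, running a µ₀-GDP step for T iterations costs µ₀√T in total, which is why a per-iteration budget must shrink like 1/√T to meet an overall target µ.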

GDP-LocalNewton Algorithm
In this subsection, we propose the GDP-LocalNewton algorithm for privacy-preserving distributed learning; see Algorithm 1. In each local iteration, every worker finds the step size α^k_t via the line search (Equation (10)) and updates its local model accordingly.
At the k-th worker in the t-th iteration, we define the local loss function (at the local iterate w^k_t) as f^k(w) = (1/s) ∑_{j∈S^k} f_j(w). The k-th worker's target is to minimize the local loss in Equation (3). The corresponding local gradient g^k_t and local Hessian H^k_t at the k-th worker in the t-th iteration are g^k_t = ∇f^k(w^k_t) and H^k_t = ∇²f^k(w^k_t). The updates of the local workers' parameters are given by w^k_{t+1} = w^k_t − α^k_t p̂^k_t, where p̂^k_t = (Ĥ^k_t)^{−1} ĝ^k_t is the noisy Newton direction, and ĝ^k_t and Ĥ^k_t are the local gradient and Hessian perturbed by Gaussian noise whose entries (for the Hessian, whose upper-triangular elements and diagonal) are i.i.d. standard normal random variables, scaled as in Lemmas 1 and 2. Here, µ is the privacy parameter, B is the norm bound of the Hessian matrix, and Γ is the norm bound of the gradient. I_T is the set of communication rounds, at which every worker communicates its updated parameters to the server; T represents the set of iterations, and L is the number of local computations.

Remark 1.
In practice, Ĥ^k(w_t) may not be positive definite, so we truncate its eigenvalues as max{λ_j, ε}, where ε > 0 equals the regularization parameter. The truncated matrix remains differentially private by the post-processing property of GDP ([19], Proposition 2.8).
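The truncation in Remark 1 can be sketched as follows (a hypothetical helper of ours, with `eps` standing for the regularization-level floor ε):

```python
import numpy as np

def truncated_newton_direction(H_noisy, g_noisy, eps):
    """Clip the eigenvalues of a noisy (possibly indefinite) Hessian at eps > 0,
    then solve for the Newton direction p = H_trunc^{-1} g (Remark 1)."""
    # Symmetrize first: added DP noise may break exact symmetry numerically.
    lam, V = np.linalg.eigh((H_noisy + H_noisy.T) / 2)
    lam = np.maximum(lam, eps)              # eigenvalue floor: max{lambda_j, eps}
    return V @ ((V.T @ g_noisy) / lam)      # apply H_trunc^{-1} via the eigenbasis
```

Since the truncation only post-processes the privately released Hessian, it costs no additional privacy budget.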
Here, we propose a novel step-size selection rule that copes with the negative influence of the noise; it plays an important role in the convergence of GDP-LocalNewton.
Step-size selection: Let each worker locally choose a step size according to the following rule: choose the largest α^k_t ≤ α satisfying the sufficient-decrease condition in Equation (10), for some β ∈ (0, 1/2] and γ = 1 − β, where N^k_t denotes the noise term defined in Lemma 7. Remark 2. In the subsequent theoretical section, we can observe the advantages and significance of the line search in the algorithm's convergence analysis. In the experiments, we compare our strategy with the fixed step size and the decaying step size; the results show that our strategy performs best. The decaying step size shrinks the step size of the previous iteration at each iteration, that is, α_{t+1} = σ^{t+1} α_t, where σ is the decaying rate.

Assumptions on the Loss Functions
We need the following assumptions on the loss functions.
Each f_i(·) is smooth and strongly convex; moreover, ∇²f_i(·) ⪯ BI and ‖∇f_i(·)‖₂ ≤ Γ, where B ∈ R and Γ ∈ R are some fixed constants.
where we use Assumption 4, and j and j′ index the two neighboring datasets. Then, using Lemma 4, we know that the output of the k-th worker in (4) satisfies µ√T-GDP. Finally, by Lemmas 3 and 4, the whole algorithm satisfies µ-GDP. The proof is completed.

Convergence Analysis
The following three lemmas show that we can separate the Gaussian noises added in the algorithm and bound them so as to guarantee the convergence of GDP-LocalNewton.

Lemma 5 ([26]). Let X ∈ R^d be a sub-Gaussian random vector with variance proxy σ². For any α > 0, with probability at least 1 − α, ‖X‖ ≤ 4σ√d + 2σ√(2 log(1/α)). Using Lemma 1 and Lemma 2, we can rewrite the noisy term in (4) as N^k_t; here, with probability at least 1 − ξ_0, we have ‖N^k_t‖ ≤ C M_privacy, where C is a constant.
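Lemma 5's bound can be sanity-checked numerically for a standard Gaussian vector (a sub-Gaussian vector with variance proxy σ² = 1); the constant 4√d + 2√(2 log(1/α)) is taken from the lemma as reconstructed here, so treat it as an assumption of this sketch:

```python
import numpy as np

# Monte Carlo check: with probability >= 1 - alpha, a standard Gaussian vector
# X in R^d satisfies ||X|| <= 4*sqrt(d) + 2*sqrt(2*log(1/alpha)).
rng = np.random.default_rng(0)
d, alpha, trials = 20, 0.05, 5000
bound = 4 * np.sqrt(d) + 2 * np.sqrt(2 * np.log(1 / alpha))
samples = rng.standard_normal((trials, d))
violations = np.mean(np.linalg.norm(samples, axis=1) > bound)
# The empirical violation rate should not exceed alpha (it is typically far
# below it, since the constant 4*sqrt(d) is quite loose).
```
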
Setting β = 1/2 and using Lemma 7, we obtain the bound stated in Lemma 8, where g^k denotes the worker's local gradient, and N^k_t, M_privacy, and C are from Lemma 7.

Remark 3.
In Lemma 8, the latter two terms come from the DP noise. The second term can be eliminated by setting β = 1/2; this is the merit of our new line search, which reduces the negative effect of the randomness. As a result, Lemma 8 shows that the novel line search improves the convergence of the algorithm. A small ‖Ñ^k_t‖² leads to a small third term. Theorem 2 (L = 1 case). Suppose that Assumptions 1–4 hold and that the step size α^k_t satisfies the line-search condition (10). Also, let T, µ, 0 < δ, ξ_0 < 1, and 0 < ε, ε₁ < 1/2 be fixed constants, and let β = 1/2 and Γ = max_{1≤i≤n} ‖∇f_i(·)‖. Moreover, assume that the sample size for each worker satisfies s ≥ 4Bκ² log(2d/δ), where the samples are chosen without replacement. For GDP-LocalNewton in the L = 1 case, we obtain the following.

If s satisfies the condition above, we obtain, with probability at least 1 − K(6δ + ξ_0), the stated convergence bound. In this theorem, we choose β = 1/2 so that the second random term in Lemma 8 is eliminated; fewer random terms generally make the algorithm more stable. Theorem 3 (L > 1 case). Suppose Assumptions 1–4 hold and the step size α^k_t satisfies the line-search condition (10). Also, let T, µ, 0 < δ, ξ_0 < 1, and 0 < ε < 1/2 be fixed constants, and let Γ = max_{1≤i≤n} ‖∇f_i(·)‖. Moreover, assume that the sample size for each worker satisfies the stated lower bound. Then, the GDP-LocalNewton updates w̄_{t,0}, in the case of L > 1, satisfy the stated bound with probability at least 1 − KL(6δ + ξ_0). Here, ε₂ is a constant and C M_privacy is from Lemma 7.

Empirical Evaluation
In this section, we evaluate the numerical performance of GDP-LocalNewton under L = 1 and L > 1, respectively. In addition, we design experiments to explore the performance of different step size strategies, such as the linear search step size, fixed step size, and decaying step size. In particular, we conduct both the simulation and the real data experiments.
The simulated datasets are generated according to the logistic model. Setting d = 10 and n = 50,000, each entry of the model parameter vector w = (w_1, . . . , w_10) is generated i.i.d. from U(−0.5, 0.5). For each sample, each entry of the predictor vector is generated i.i.d. from the standard Gaussian. The labels are then generated from the logistic regression model. We also generate 10,000 samples for testing.
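The generation procedure described above can be sketched as follows (the helper name is ours):

```python
import numpy as np

def simulate_logistic(n, d=10, seed=0):
    """Generate data as described: w_i ~ U(-0.5, 0.5), predictor entries ~ N(0, 1),
    labels drawn from the logistic model P(y = 1 | x) = 1 / (1 + exp(-w^T x))."""
    rng = np.random.default_rng(seed)
    w = rng.uniform(-0.5, 0.5, size=d)      # true model parameter
    X = rng.standard_normal((d, n))         # predictors, one column per sample
    p = 1.0 / (1.0 + np.exp(-(w @ X)))      # class-1 probabilities
    y = np.where(rng.uniform(size=n) < p, 1.0, -1.0)
    return w, X, y
```

A held-out test set (10,000 samples in the paper) is obtained with the same function and a different seed.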
The real datasets we use are summarized in Table 2 and are publicly available in LIBSVM [28]. We fix the total number of clients K to be 50; hence, the number of samples per client is s = n/50. We fix the total number of communication rounds T as 10. The regularization parameter is 0.001. We set the privacy parameter µ = 1, 5, and 10, respectively. We compare our method with the GDP-GD algorithm with µ = 1 and with LocalNewton (i.e., µ = ∞). The details of GDP-GD can be found in Appendix D. Note that we set the maximum step size α of GDP-LocalNewton to be the fixed step size of GDP-GD. This is because, due to noise accumulation, the step sizes of GDP-perturbed algorithms should be much smaller than those without DP perturbation in order to ensure convergence. Figures 1 and 2 display the training loss and test accuracy of each method on one simulated dataset and three real datasets, where the training loss is the empirical average of the loss function evaluated at the training data points, and the test accuracy is the proportion of test samples that are correctly classified. As Figures 1 and 2 show, GDP-LocalNewton with different µ converges much faster than GDP-GD with µ = 1 on all datasets. This means that our privacy-preserving second-order algorithm substantially outperforms the privacy-preserving first-order algorithm GDP-GD. We can also see that as the privacy parameter µ increases, the performance of GDP-LocalNewton improves, especially when the number of communication rounds is large.

With Local Computation (L > 1)
We use the a9a and the ijcnn1 datasets to show how different values of L influence the performance of GDP-LocalNewton. We set the GDP parameter µ = 2 and µ = 4 for the a9a dataset and µ = 1 and µ = 2 for the ijcnn1 dataset, the local computation parameter L = 1 and L = 3, and the regularization γ = 0.5. The number of communication rounds is 8. The other parameter set-ups are the same as those of the previous experiments.
As Figure 3 shows, GDP-LocalNewton converges on both datasets. For equal values of µ, L = 3 speeds up the convergence of the algorithm, which means that the noise does not slow the convergence too much. At the 7-th communication round, the error from the privacy protection is larger than the error from the local computation, which means that the privacy protection affects the usefulness of GDP-LocalNewton more than the local computation does.

Different Step Size Strategies
In this section, we explore how different step size strategies, including the line-search step size (maximum step size α = 0.03), the fixed step size (α^k_t = 0.03), and the decaying step size (initial step size α = 0.03), impact GDP-LocalNewton under different levels of privacy protection (different µ values). In addition, we set different decaying rates to show their impact. Under L = 1, we use the simulated and the a9a data. Due to the varying characteristics of the datasets, we use different parameter settings for some of them. For the generated data, we set µ = 1 and 5 and a suitable decaying rate σ of 0.5. For the a9a data, we set µ = 5 and 10 and a suitable decaying rate σ of 0.9.
As Figures 4 and 5 show, the line-search step size performs well and stably in every case. This validates the effectiveness of our strategy both theoretically and experimentally. The reason for the similarity between the fixed step size and the line-search step size is that the noise causes some of the searched steps to equal the maximum step size, making them equivalent to the fixed step size. For the decaying step size, the experiments show that the decaying rate is very important for convergence. An appropriate decaying rate helps the algorithm converge: in the later stages of the algorithm, decaying to a smaller step size helps resist the negative impact of the noise and further promotes convergence, as in Figures 4a and 5a. An excessively fast decay leads to non-convergence, like the 0.1 decaying rate in every figure. However, in practice, tuning the decaying rate increases the operational cost and raises the risk of privacy leakage. Therefore, the line-search strategy is more effective, efficient, and stable.
Figure 4. The training loss of GDP-LocalNewton on the generated data with respect to different step-size strategies. Panels (a,b) correspond to µ = 1 and µ = 5, respectively. An appropriate decaying rate is the half decay (σ = 0.5).

Conclusions
In this paper, we proposed a novel algorithm called GDP-LocalNewton for privacy-preserving and communication-efficient distributed learning. To improve the communication efficiency, we developed the algorithm based on Newton's method and traded more local computations between communications. To handle possible privacy leakage via the curious onlooker, we adopted the notion of GDP [19] by adding Gaussian noise to the updates of each local machine. In particular, we developed a step-searching strategy to determine the step size in the noisy Newton update. We validated the effectiveness, efficiency, and stability of our strategy through experiments. We theoretically studied the convergence of GDP-LocalNewton, which turns out to have two error terms, one corresponding to the privacy protection and the other corresponding to the local computation. The experiments corroborated the theoretical findings.
There are some interesting problems that deserve further study. First, the DP framework leads to strong privacy protection; however, a DP algorithm often suffers substantial accuracy loss compared with its non-DP counterpart. Therefore, it is desirable to consider other, weaker notions of privacy [29]. Second, how to handle the heterogeneity among local datasets, both algorithmically and theoretically, within the second-order framework deserves further study. Finally, it is worth tackling other issues and challenges that FL faces within the second-order framework.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Some Auxiliary Lemmas
Here, we prove the auxiliary lemmas that are used in the main proofs of the paper. Lemma A1 ([11]). Let f(·) satisfy Assumptions 1–4, and let 0 < ε < 1/2 and 0 < δ < 1 be fixed constants. Then, if s ≥ 4Bκ² log(2d/δ), the local Hessian matrix at the k-th worker satisfies the stated concentration bound for all w ∈ R^d and k ∈ [K] with probability at least 1 − δ.
Lemma A2 (McDiarmid's inequality). Let X = (X_1, X_2, . . . , X_m) be m independent random variables taking values in a set A, and assume that f : A^m → R satisfies the bounded-difference condition |f(x_1, . . . , x_i, . . . , x_m) − f(x_1, . . . , x′_i, . . . , x_m)| ≤ c_i for all i ∈ {1, . . . , m}. Then, for any ε > 0, we have P(f(X) − E[f(X)] ≥ ε) ≤ exp(−2ε²/∑_{i=1}^m c_i²). Lemma A3 ([11]). Let S ∈ R^{n×s} be any uniform sampling matrix. Then, for any matrix B = [b_1, . . . , b_n] ∈ R^{d×n}, with probability 1 − δ for any δ > 0, the stated bound holds, where the vector B1 is the sum of the columns of B and BSS^T 1 is the sum of the uniformly sampled and scaled columns of B, the scaling factor being 1/√(sp) with p = 1/n. If (i_1, . . . , i_s) is the set of sampled indices, then BSS^T 1 = (1/(sp)) ∑_{k∈(i_1,...,i_s)} b_k. From Lemma A1, we can easily obtain a key corollary, which is used to bound |(p^k_t)^T g(w_t) − (p^k_t)^T g^k(w_t)|. This bound is crucial for our main theorems.
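The scaling in Lemma A3 makes the sampled column sum an unbiased estimate of B1. A quick Monte Carlo check of this (with-replacement sampling is assumed here for simplicity, which differs slightly from the paper's setting):

```python
import numpy as np

# Estimate the column sum B @ 1 by sampling s of n columns uniformly and
# scaling by 1/(s * p) with p = 1/n, then average over many repetitions.
rng = np.random.default_rng(0)
d, n, s, trials = 4, 50, 25, 20000
B = rng.standard_normal((d, n))
col_sum = B.sum(axis=1)                          # the target B @ 1
idx = rng.choice(n, size=(trials, s))            # uniform sampling, many trials
est = B[:, idx].sum(axis=2) / (s * (1.0 / n))    # scaled sampled sums, (d, trials)
avg = est.mean(axis=1)                           # Monte Carlo mean per coordinate
```

The Monte Carlo mean `avg` should be close to `col_sum`, reflecting the unbiasedness of the estimator; Lemma A3 then quantifies how tightly a single draw concentrates.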
Corollary A1. Let g^k(w̄_t) be the gradient of the loss function at the k-th worker, and let g(w̄_t) be the gradient of the global loss function, where w̄_t is the parameter at the t-th communication (L ≥ 1). Provided that ‖g_i(w̄_t)‖ ≤ Γ, where Γ = max_{1≤i≤n} ‖∇f_i(·)‖, and ‖g^k(w̄_t)‖ ≤ G, then using the vector Bernstein inequality with t = ε₁‖g^k‖, we obtain the stated bound. So, as long as s is sufficiently large, the bound holds with probability at least 1 − δ.

Appendix B. The Proofs of Some Lemmas in The Context
Appendix B.1. The Proof of Lemma 7. Write the scaled noise as (Γ/(µs))Z^k_t, and note that the Neumann series formula leads to the stated identity. It remains to bound ‖N^k_t‖₂ with high probability in order to complete the proof. Note that, with probability at least 1 − ξ_0, we have ‖Z^k_t‖₂ ≤ 4√d + 2√(2 log(2/ξ_0)) and ‖U^k_t‖₂ ≤ √(2d log(4d/ξ_0)), where C_0 and C are constants and M̄_privacy = ΓB√(Td) log(d/ξ_0). Note that, in order for the second and third inequalities to hold, we need the bound on U^k_t to hold with probability at least 1 − ξ_0/2.
with probability at least 1 − 4δ, where the second inequality uses the bound on ‖p̂^k_t‖, which holds with probability at least 1 − ξ_0. Hence, the stated bound holds with probability at least 1 − K(6δ + ξ_0).
Appendix C.3. The Proof of Theorem 3 Proof. We define the following symbols: p̄_τ = (1/K) ∑_{k=1}^K α^k_τ p̂^k_τ is the average descent direction, and p̂^k_τ = (Ĥ^k_τ)^{−1} ĝ^k_τ is the local perturbed descent direction at the k-th worker at iteration τ. L is the number of local computations.
Invoking the M-smoothness of the function f, we obtain the stated chain of inequalities, where we make use of inequality (A16) in the last step.