Communication-Efﬁcient Distributed Learning for High-Dimensional Support Vector Machines

Abstract: Distributed learning has received increasing attention in recent years and is especially needed in the era of big data. For the support vector machine (SVM), a powerful binary classification tool, we proposed a novel, efficient distributed sparse learning algorithm, the communication-efficient surrogate likelihood support vector machine (CSLSVM), for high dimensions with convex or nonconvex penalties, based on a communication-efficient surrogate likelihood (CSL) framework. We extended the CSL framework to distributed SVMs without the need to smooth the hinge loss or the gradient of the loss. For the CSLSVM with lasso penalty, we proved that its estimator could achieve a near-oracle property for ℓ1 penalized SVM estimators on whole datasets. For the CSLSVM with smoothly clipped absolute deviation (SCAD) penalty, we showed that its estimator enjoyed the oracle property, and we used the local linear approximation (LLA) algorithm to solve the optimization problem. Furthermore, we showed that the LLA was guaranteed to converge to the oracle estimator, even in our distributed framework and in the ultrahigh-dimensional setting, if an appropriate initial estimator was available. The proposed approach is highly competitive with the centralized method within a few rounds of communication. Numerical experiments provided supportive evidence.


Introduction
The support vector machine (SVM), originally introduced by [1], has been a great success when applied to many classification problems. Owing to its high accuracy, its flexibility, and its solid mathematical foundations in machine learning, it is one of the most popular binary classification tools. The motivation of an SVM is to find a maximum-margin hyperplane by solving a regularized functional optimization problem. In statistical machine learning, the penalized functional is the sum of the hinge loss and an ℓ2-norm regularizer. The statistical properties of SVMs have been studied in many works. In this work, we focused on a distributed penalized linear SVM for datasets with large sample sizes and high dimensions.
With the development of modern technology, the size of data has become incredibly large and, in some cases, the data cannot even be stored on a single machine. In real-world applications, many datasets are stored locally on individual servers and individuals' devices, such as mobile phones and computers. It is difficult to collect these local data onto a single machine due to communication costs and privacy preservation. Thus, new methods and theories in distributed learning are called for. Distributed learning has attracted increasing attention in recent years; for example, see Refs. [2,3] for M-estimation, Refs. [4-6] for quantile regression, Refs. [7,8] for nonparametric regression, Refs. [3,9] for confidence intervals, and so on. These works focus on the simple setting of data parallelism, under which the dataset is partitioned and distributed over m worker machines that can analyze the data independently. Most methods suggest that, in each round of communication, each worker machine estimates the parameters of the model locally and then communicates these local estimators to a master machine, which averages them to form a global estimator. Although this divide-and-conquer approach is communication-efficient, it has some restrictions: to achieve the minimax rate of convergence, the number of worker machines cannot be too large and the sample on each worker machine must be large enough; these requirements are highly restrictive. In addition, averaging can perform poorly if the estimator is nonlinear.
Not only can the size of data be exceedingly large, but many successful models are also heavily over-parameterized. These problems have been discussed widely. Zou [10] proposed an improved ℓ1 penalized SVM for simultaneous feature selection and classification, and showed that the hybrid SVM not only often improved the classification accuracy but also enjoyed better feature-selection performance. Meinshausen and Buhlmann [11] studied the problem of variable selection in high-dimensional graphs, and explained that neighborhood selection with the lasso is a computationally attractive alternative to standard covariance selection for sparse high-dimensional graphs. Zhao and Yu [12] studied almost necessary and sufficient conditions for the lasso to select the true model when p ≪ n or p ≫ n, where p is the dimension of the model parameters and n is the sample size. Meinshausen and Yu [13] introduced sparse representations for high-dimensional data and proved that the estimator was still consistent even though the lasso could not recover the correct sparsity pattern. The adaptive lasso, proposed by Zou [14], is a new version of the lasso that enjoys oracle properties. In high-dimensional problems, the dimension p of the covariates is larger than the sample size, but only a few covariates are relevant to the response. As a concrete example, in a microarray dataset containing more than 10,000 genes, only a few genes will make a difference to the result. Recently, statistical inference for high-dimensional data has been investigated; readers can refer to [15,16] for details. In high-dimensional settings, a standard SVM can easily be affected by many redundant variables, so variable selection is important for high-dimensional SVMs. Fan and Fan [17] showed that classification using all features could be as poor as "tossing a coin" due to the accumulation of noise in high-dimensional analysis.
Many works have been proposed to handle this problem. Bradley and Mangasarian [18], Peng et al. [19], Zhu et al. [20], and Wegkamp and Yuan [21] studied the ℓ1 penalized SVM. Fan and Li [22] proposed an approach for variable selection and model estimation simultaneously by using the smoothly clipped absolute deviation (SCAD) penalty or the minimax concave penalty (MCP), which are non-convex. Becker et al. [23], Park et al. [24], and Zhang et al. [25] considered the SCAD penalized SVM. Lian and Fan [26] gave a divide-and-conquer debiased estimator for an SVM; however, such simple averaging might result in high computational costs for the high-dimensional problem, although it could become a debiased estimator via the lasso penalty. Jordan et al. [3] proposed the communication-efficient surrogate likelihood (CSL) framework for solving distributed statistical inference problems, which can also work for high-dimensional penalized regression. As [27] stated, the CSL approach differs from distributed first-order optimization methods in that it leverages both global first-order information and local higher-order information; yet, to the best of our knowledge, the distributed inference of variable selection for high-dimensional SVMs has not been studied in the CSL framework.
In this paper, we propose communication-efficient distributed learning for support vector machines in high dimensions. Instead of using all the data to estimate the parameters, our method only needs to solve a regularized optimization problem on the first machine, based on the gradients obtained from all worker machines. For the penalty function, we considered the convex ℓ1 penalty and the non-convex SCAD penalty in a high-dimensional SVM, which achieve variable selection and estimation simultaneously. In the distributed learning of a high-dimensional SVM, we did not need smoothness assumptions on the loss or the gradient of the loss. We give some theoretical results in the paper.
The remainder of this paper is organized as follows. In Section 2, we give the problem formulation. Communication-efficient distributed estimation for an SVM is presented in Section 3. In Section 4, we provide simulation studies and real-data examples and demonstrate encouraging performance.

Problem Formulation
In this section, we set up our learning problem formally. We adopted the strategy of empirical risk minimization to obtain the optimal model. We considered a distributed learning framework with m worker machines, in which the 1st machine was regarded as the central machine; the 1st worker could aggregate information from the other m − 1 worker machines. In addition, every machine had n samples, so the total sample size was N = nm. For a standard binary classification problem, we denoted X as the input space and Y = {−1, +1} as the output space. Random vectors (X, Y) ∈ X × Y were drawn from an unknown joint distribution D on X × Y. Let β = (β_0, β_1, . . . , β_p)^T be the parameters, let ℓ(X, Y; β) be a loss function, and let β_0 = (β_00, β_01, . . . , β_0p)^T be the true parameter. Let the i.i.d. (independent and identically distributed) samples drawn from D be stored on the m machines, and use I_k to denote the indices of the samples on the kth machine, with |I_k| = n = N/m for all k ∈ [m] and I_j ∩ I_k = ∅ for j ≠ k, j, k ∈ [m]. The empirical risk of the kth machine is defined by L_k(β) = (1/n) Σ_{i∈I_k} ℓ(X_i, Y_i; β), and the empirical risk based on all N samples is L(β) = (1/m) Σ_{k=1}^m L_k(β). We used a structural risk minimization strategy to learn β_0, defined by min_β { L(β) + g(β) }, where g(β) is the penalty term, such as ℓ1, ℓ2, SCAD, or MCP. In distributed statistical learning, Jordan et al. [3] proposed a distributed estimator with statistical guarantees and communication efficiency. Given an appropriate initial estimator β̃, the surrogate loss is L̃(β) = L_1(β) − ⟨∇L_1(β̃) − ∇L(β̃), β⟩. By the above formula, we could introduce a communication-efficient distributed learning algorithm: the first worker machine broadcasts the initial β̃ = β̃_1 to the remaining m − 1 machines, and each machine computes the local gradient ∇L_k(β̃). Then, these local gradients are sent back to the first machine, where the first worker carries out an SVM fit by aggregating them. The communication-efficient estimator is given by β̂ = argmin_β { L̃(β) + g(β) }.
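As an illustration of the surrogate construction, the sketch below (NumPy; the helper names, and the use of the hinge loss of the later sections as the running example, are our own choices rather than the paper's notation) computes the subgradient of an averaged hinge loss and the gradient of the surrogate L̃. A useful sanity check is that, at the initial point β̃, the surrogate gradient equals the global gradient ∇L(β̃):

```python
import numpy as np

def hinge_subgrad(X, y, beta):
    """Subgradient of the average hinge loss (1/n) * sum_i (1 - y_i x_i^T beta)_+."""
    margin = 1.0 - y * (X @ beta)
    active = (margin > 0).astype(float)              # -y_i x_i contributes where the loss is active
    return -(X * (y * active)[:, None]).mean(axis=0)

def surrogate_grad(X1, y1, beta, global_grad_at_init, local_grad_at_init):
    """Gradient of L~(beta) = L_1(beta) - <grad L_1(beta~) - grad L(beta~), beta>,
    where the two gradients-at-init were computed once at the initial estimator."""
    return hinge_subgrad(X1, y1, beta) - (local_grad_at_init - global_grad_at_init)
```

By construction, ∇L̃(β̃) = ∇L_1(β̃) − (∇L_1(β̃) − ∇L(β̃)) = ∇L(β̃), so the surrogate is a first-order-correct proxy for the full-data risk that only touches machine 1's data.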

Distributed Learning for an SVM
A standard non-separable SVM has the following form: min_{β*} (1/n) Σ_{i=1}^n (1 − y_i x_i^T β*)_+ + λ g(β*), where (x)_+ = max(x, 0) gives the hinge loss, which is piecewise linear and differentiable except at the point 0; β* = (β_1, . . . , β_p)^T is the unknown p-dimensional parameter; and λ is the regularization parameter, which determines the importance of the penalty term. The true parameter β_0 is the minimizer of the population loss function L(β) = E[(1 − Y X^T β)_+]. We define S(β) = −E[ Y X I{1 − Y X^T β ≥ 0} ] and H(β) = E[ δ(1 − Y X^T β) X X^T ], where I{·} is the indicator function and δ{·} is the Dirac delta function. S(β) and H(β) can be viewed as the gradient vector and Hessian matrix of L(β), respectively.
The empirical loss function of the kth worker machine is L_k(β) = (1/n) Σ_{i∈I_k} (1 − y_i x_i^T β)_+, and we then obtain the distributed estimator β̂ = argmin_β { L_1(β) − ⟨∇L_1(β̃) − ∇L(β̃), β⟩ + g(β) }. (1) In this paper, we considered the ℓ1 penalty and the SCAD penalty, respectively. The ℓ1 penalty is g(β) = λ‖β‖_1, and the SCAD penalty is p_λ(t) = λ|t| for |t| ≤ λ; p_λ(t) = (2aλ|t| − t² − λ²)/(2(a − 1)) for λ < |t| ≤ aλ; and p_λ(t) = (a + 1)λ²/2 for |t| > aλ, for some a > 2. Note that the SCAD penalty coincides with the ℓ1 penalty near the origin (p'_λ(t) = λ for 0 < t ≤ λ) and is flat for large coefficients (p'_λ(t) = 0 for t ≥ aλ), so large coefficients are not over-penalized. In practice, we adopted the following CSL distributed learning for an SVM, which is summarized in Algorithm 1. However, our theories were based on the distributed estimator β̂ in (1).
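As a concrete reference, the SCAD penalty and its derivative can be coded as follows (a NumPy sketch; the default a = 3.7 is the value commonly suggested in the SCAD literature, since the text only requires a > 2):

```python
import numpy as np

def scad_penalty(t, lam, a=3.7):
    """SCAD penalty p_lambda(|t|), piecewise: l1 near 0, quadratic blend, then flat."""
    t = np.abs(t)
    return np.where(
        t <= lam,
        lam * t,
        np.where(
            t <= a * lam,
            (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1)),
            (a + 1) * lam**2 / 2,
        ),
    )

def scad_derivative(t, lam, a=3.7):
    """p'_lambda(t) = lam * { 1(t <= lam) + (a*lam - t)_+ / ((a-1)*lam) * 1(t > lam) }."""
    t = np.abs(t)
    return lam * ((t <= lam) + np.maximum(a * lam - t, 0) / ((a - 1) * lam) * (t > lam))
```

One can verify that the three pieces meet continuously at |t| = λ (value λ²) and |t| = aλ (value (a + 1)λ²/2), and that the derivative is λ on the ℓ1 region and 0 beyond aλ.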
Algorithm 1: CSL distributed learning for an SVM.
Input: initial estimator β^(0); regularization parameter λ; number of rounds T.
for t = 0, 1, . . . , T − 1 do
    The 1st machine: broadcast the current iterate β^(t) to the other worker machines.
    for all k ∈ {1, 2, · · · , m} in parallel do
        Worker machine k: compute the local gradient ∇L_k(β^(t)) and send it to the 1st machine.
    end
    The 1st machine: aggregate the gradients as ∇L(β^(t)) = (1/m) Σ_{k=1}^m ∇L_k(β^(t)); for the SVM with ℓ1 penalty (call it the L1SVM algorithm), compute
        β^(t+1) = argmin_β { L_1(β) − ⟨∇L_1(β^(t)) − ∇L(β^(t)), β⟩ + λ‖β‖_1 };
    for the SVM with SCAD penalty (call it the SCADSVM algorithm), compute
        β^(t+1) = argmin_β { L_1(β) − ⟨∇L_1(β^(t)) − ∇L(β^(t)), β⟩ + Σ_{j=1}^p p_λ(|β_j|) }.
end
Output: β^(T).
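The loop of Algorithm 1 can be sketched end to end for the ℓ1 case. This is a minimal illustration, not the paper's implementation: the inner surrogate problem is solved only approximately by proximal subgradient steps (the paper uses an exact solver such as CPLEX), and the data layout, step size, and iteration counts below are illustrative assumptions:

```python
import numpy as np

def hinge_subgrad(X, y, beta):
    """Subgradient of the average hinge loss on one machine's data."""
    margin = 1.0 - y * (X @ beta)
    return -(X * (y * (margin > 0))[:, None]).mean(axis=0)

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def csl_svm_l1(parts, beta, lam, rounds=3, inner=200, step=0.05):
    """parts[0] is the master's data; each round communicates only gradients.
    The master (approximately) solves
      min_b L_1(b) - <shift, b> + lam * ||b||_1,  shift = grad L_1 - grad L,
    by proximal subgradient descent."""
    X1, y1 = parts[0]
    for _ in range(rounds):
        grads = [hinge_subgrad(Xk, yk, beta) for Xk, yk in parts]  # worker step
        shift = grads[0] - np.mean(grads, axis=0)                  # grad L_1 - grad L
        b = beta.copy()
        for _ in range(inner):                                     # master solves surrogate
            g = hinge_subgrad(X1, y1, b) - shift
            b = soft_threshold(b - step * g, step * lam)
        beta = b
    return beta
```

With m machines, each round communicates only m gradient vectors of length p rather than any raw data, which is the communication saving the algorithm is built around.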

A Communication-Efficient Distributed SVM with Lasso Penalty
In this section, we establish the theoretical properties of the proposed estimator. Despite the generality and elegance of the method of [3], it requires at least second derivatives and smooth loss functions, such as the cross-entropy loss, and therefore cannot be applied directly to the hinge loss of an SVM. Recall the surrogate loss L̃(β). We only used first-order information, so as long as the gradient existed almost everywhere, the aforementioned method was usable. For an SVM, the hinge loss is differentiable except at the point 0; we could use a subgradient in place of the gradient, and thus the surrogate loss was directly usable. Our main results were established under the following assumptions.
(A1) The conditional densities of X^T β_0 given Y = 1 and Y = −1 are denoted as f and g, respectively. It is assumed that f is uniformly bounded away from 0 and ∞ in a neighborhood of 1, and g is uniformly bounded away from 0 and ∞ in a neighborhood of −1.
(A2) β_0 is a sparse and nonzero vector, and S denotes the support of β_0.
(A3) x is a sub-Gaussian random vector; that is, for any η ∈ R^p, η^T x is sub-Gaussian. In addition, each component x_ij of the feature x_i is assumed to be a random variable with mean zero and variance 1.
Denote X = (X_1, X_2, . . . , X_n)^T as the feature design matrix, and define the restricted eigenvalues λ_min = inf_{δ∈∆} δ^T X^T X δ / (n‖δ‖²_2) and λ_max = sup_{δ∈∆} δ^T X^T X δ / (n‖δ‖²_2), where ∆ = {δ ∈ R^{p+1} : ‖δ_{S^c}‖_1 ≤ 3‖δ_S‖_1} is a restricted cone in R^{p+1}, S ⊂ {0, 1, . . . , p}, and |S| ≤ q.
(A4) λ max and λ min are bounded away from zero.

Remark 1. Under Assumption (A1), the Hessian matrix H(β) was well-defined and continuous in β.
(A1) ensured that we could obtain sufficient information around the non-differentiable point of the hinge loss; see more details in [24,28]. (A2) is a common assumption in high-dimensional problems, under which |S| ≤ s for some s ≤ min{p, n}; in this paper, we used q to denote the number of nonzero entries of β_0. Using the sub-Gaussianity assumption (A3), we easily obtained max_i ‖x_i‖_∞ ≤ C√(log p) with probability at least 1 − p^{−C}. Assumption (A4) is similar to Lemma 2 in [19]; we used these conditions to control the bounds of the empirical loss function of the SVM and its expectation. (A5) is an assumption on the initial estimator for the iterative algorithm; Ref. [19] proved that the ℓ1-norm SVM coefficients satisfied such an assumption.
Our results were as follows.

Remark 2.
Although N did not appear explicitly in the formula, the condition λ ≥ 2‖∇L̃(β_0)‖_∞ in fact implied that the convergence rate depended on N. If m was not too big, that is, n was not too small, we chose λ ≍ √(log p / N). If q ≲ n^{1/4}/(log p)^{1/2} and q log p / n^{3/4} ≲ √(log p / N), the convergence rate would be dominated by the first term λ√q; that is, ‖β̂ − β_0‖ = O_P(√(q log p / N)). This was a near-oracle property for the ℓ1 penalized SVM estimator based on the entirety of the dataset [27].
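Displayed in symbols, the rate discussed in this remark reads as follows (a restatement under the stated choice of λ, with C a generic constant):

```latex
\lambda \asymp \sqrt{\frac{\log p}{N}}
\quad\Longrightarrow\quad
\bigl\|\hat{\beta} - \beta_0\bigr\| \le C\,\lambda\sqrt{q}
\asymp \sqrt{\frac{q\log p}{N}} ,
```

which is, up to constants, the rate attained by the ℓ1-penalized SVM computed on all N samples at once.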
Proof of Theorem 1. We prove the result in the following three steps, in line with the proof of [6].
Step 1. A basic inequality holds for all β. From L̃(β̂) + λ‖β̂‖_1 ≤ L̃(β_0) + λ‖β_0‖_1 and Hölder's inequality, we obtain a bound on L̃(β̂) − L̃(β_0); after rearranging, we have the inequality of Step 1.
Step 2. With arguments essentially the same as the proof of Proposition 3, we can prove the corresponding concentration bound for any δ. By Proposition 1, we obtain a further bound, and based on (2)-(5), the conclusion of Step 2 holds with the stated probability.
Step 3. Assume that ‖β̂ − β_0‖ > t for some t > 0. By Step 1 and the triangle inequality, and using the result of Step 2 together with a lower bound for E[L_1(β_0 + δ)] − E[L_1(β_0)] similar to Lemma 4 of [29], we obtain an upper bound on t. Some algebra shows that t ≤ C( λ√q + q^{3/2}(log p)^{5/2}/n + q log p/n + q^{3/2} log p/n^{3/4} ).

Proposition 1.
Under the same assumptions as Theorem 1, with probability at least the stated bound, the conclusion of Proposition 1 holds.
Proof of Proposition 1. By the definition of L̃, we can decompose the quantity of interest. The last term of the decomposition is the same as the one dealt with in Lemma 1 of [19], which shows that it is controlled with probability at least 1 − p^{−C}. In Proposition 2, we show that the remaining term is controlled with the stated probability. Combining the two bounds yields the conclusion.

Proposition 2.
Under the same assumptions as Theorem 1, with probability at least 1 − p^{−C}, the conclusion of Proposition 2 holds.
Proof of Proposition 2. We take Ω = {β ∈ R^p : ‖β‖_0 ≤ q, ‖β − β_0‖_1 ≤ Cq√(log p/N)} and define the class of functions indexed by β ∈ Ω. Since there are at most (p choose q) support patterns, we may split the class into pieces G_j. Let σ² = sup_{f∈∪_j G_j} P f². Then, by Theorem 3.12 of [31] and the symmetrization inequality with i.i.d. Rademacher random variables, the supremum of the empirical process over Ω is controlled; that is, the stated bound holds with probability at least 1 − p^{−C}. Finally, we need to determine the size of σ². For β ∈ Ω, a direct calculation bounds σ², which completes the proof.

Proposition 3.
Under the same assumptions as Theorem 1, with probability at least 1 − p^{−C}, the conclusion of Proposition 3 holds.
Proof of Proposition 3. The proof is similar to that of Proposition 2. We take Ω = {β ∈ R^p : ‖β‖_0 ≤ q, ‖β − β_0‖_1 ≤ Cq√(log p/N)} and define the corresponding class of functions, where σ² = sup_{f∈G_j} P f². Next, we need to determine the order of σ². For β ∈ Ω, using the basic inequality 2ab ≤ a² + b², we can bound σ². Thus, we complete the proof of Proposition 3.

A Communication-Efficient SVM with SCAD Penalty
In this section, we further discuss the advantages of a distributed non-convex penalized SVM in ultrahigh dimensions. Similarly, the oracle property of the distributed non-convex penalized SVM coefficients is investigated.
(C7) f is uniformly bounded away from 0 and ∞ in a neighborhood of 1, and g is uniformly bounded away from 0 and ∞ in a neighborhood of −1, where f and g are the conditional densities of X^T β_0 given Y = 1 and Y = −1, respectively.
Remark 3. Assumptions (C1)-(C2) and (C4)-(C5) are similar to the assumptions in Section 3.1 and have been used by [25]. Assumption (C3) controlled the divergence rate of the number of nonzero coefficients, which could not grow faster than √N; see also Remark 2. Assumption (C6) simply required that the signals not decay too quickly; that is, the relevant signals were not too small, so that they could be identified, which is common in the literature on high-dimensional problems. Assumption (C7) holds trivially when the conditional distribution of X_A given Y has unbounded support; see Remark 1 in [25].
With a non-convex penalty, there might be multiple local minima. We used B_N(λ) to denote the set of local minima. The non-convex problem could be written as the difference of two convex functions, which then yields a sufficient local optimality condition. Let f(β) = L̃(β) + Σ_{j=1}^p p_λ(|β_j|). Although f(β) is non-convex, we can write it as f(β) = h(β) − g(β), where h(β) = L̃(β) + λ‖β‖_1 and g(β) = λ‖β‖_1 − Σ_{j=1}^p p_λ(|β_j|). Obviously, h(β) and g(β) are convex. To present our main results, we need a sufficient local optimality condition based on subgradients, as described below.
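The convexity of the subtracted piece can be checked numerically: a convex function sampled on a uniform grid has nonnegative second differences. A small sketch of this check (our own verification; λ = 1 and a = 3.7 are illustrative choices):

```python
import numpy as np

def scad_penalty(t, lam=1.0, a=3.7):
    """SCAD penalty p_lambda(|t|): l1 near 0, quadratic blend, then flat."""
    t = np.abs(t)
    return np.where(t <= lam, lam * t,
           np.where(t <= a * lam,
                    (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1)),
                    (a + 1) * lam**2 / 2))

# g(t) = lam*|t| - p_lam(t): the claimed convex piece of the DC decomposition
ts = np.linspace(-8, 8, 2001)
g = np.abs(ts) - scad_penalty(ts, lam=1.0)       # lam = 1 here
second_diff = g[2:] - 2 * g[1:-1] + g[:-2]       # >= 0 iff g is convex on the grid
assert (second_diff >= -1e-9).all()
```

Indeed, on |t| ≤ λ the difference is identically 0, on λ < |t| ≤ aλ it equals (|t| − λ)²/(2(a − 1)), and beyond aλ it is linear with matching slope λ, so the whole piece is convex.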

Lemma 1. (Sufficient local optimal condition) if there is a neighbourhood U around the point x
Lemma 1 has been stated as Corollary 1 in [32]. The main results are summarized in the following theorem.

Remark 4.
From Theorem 2, we could see that the oracle property held when taking λ = N^{−1/2+δ} for some c_1 < δ < c_2/2, even for p = o(exp(N^{(δ−c_1)/2})). So, the local oracle property held for a non-convex distributed penalized SVM even when the number of features, p, grew exponentially with the sample size, N, of the whole dataset.
Proof of Theorem 2. We sketch our proof as follows.
Step 1. From Theorem 1 in [25], we obtain some properties of s_j(β̂) and β̂_j that hold with probability approaching 1.
Step 2. By Proposition 1, with probability at least 1 − p^{−c}, the corresponding bounds hold for j = q + 1, . . . , p.
In this paper, we did not need to assume that the solution of the minimization problem was unique. By numerical algorithms that solve the non-convex penalized SCADSVM, we could identify the oracle estimator. Ref. [33] introduced the LLA algorithm to obtain sparse estimators in non-convex penalized likelihood models. We applied the LLA algorithm to our SCADSVM approach. Now, we flesh out the problem. Given the current iterate β^(t), we update it by solving the weighted ℓ1 problem β^(t+1) = argmin_β { L̃(β) + Σ_{j=1}^p p'_λ(|β_j^(t)|)|β_j| }. Consider the events F_n1, . . . , F_n5; the first four events are similar to those in [25]. Denote P_ni = P(F_ni); then we have the following Theorem 3.
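As an illustration of the LLA update, the sketch below runs a few outer iterations of the weighted-ℓ1 scheme on a single machine's loss (standing in for the surrogate L̃; the inner proximal subgradient solver, the step sizes, and a = 3.7 are our own illustrative choices, not the paper's implementation):

```python
import numpy as np

def scad_deriv(t, lam, a=3.7):
    """p'_lambda(t): lam on [0, lam], linearly decaying, then 0 beyond a*lam."""
    t = np.abs(t)
    return lam * ((t <= lam) + np.maximum(a * lam - t, 0) / ((a - 1) * lam) * (t > lam))

def hinge_subgrad(X, y, beta):
    margin = 1.0 - y * (X @ beta)
    return -(X * (y * (margin > 0))[:, None]).mean(axis=0)

def lla_scad_svm(X, y, beta0, lam, outer=2, inner=300, step=0.05):
    """Each outer step solves min_b L(b) + sum_j w_j |b_j| with LLA weights
    w_j = p'_lam(|b_j^(t)|): large coefficients get weight 0, so they are
    no longer shrunk; from beta0 = 0 the first pass reduces to the lasso."""
    beta = beta0.copy()
    for _ in range(outer):
        w = scad_deriv(beta, lam)                    # tangent weights at current iterate
        b = beta.copy()
        for _ in range(inner):                       # weighted proximal subgradient
            z = b - step * hinge_subgrad(X, y, b)
            b = np.sign(z) * np.maximum(np.abs(z) - step * w, 0.0)
        beta = b
    return beta
```

This mirrors the two-iteration behavior discussed above: iteration 1 plays the role of an ℓ1 fit, and iteration 2 reweights so that strong signals are left unpenalized.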

Remark 5.
Theorem 3 gave a non-asymptotic lower bound on the probability with which the oracle estimator could be obtained by the LLA algorithm. That is, the LLA algorithm could identify the oracle estimator within two iterations.

Proof of Theorem 3.
Assume that none of the events F_ni, i = 1, . . . , 5, holds. The probability that none of these events holds is at least 1 − P_n1 − P_n2 − P_n3 − P_n4 − P_n5. Then, by the properties of the class of non-convex penalties, we have p'_λ(|β_j^(0)|) = 0 for 1 ≤ j ≤ q. Therefore, the solution of the next iteration, β^(1), is the solution to a convex optimization problem. Since F_n3 does not hold, there are subgradients of the oracle estimator, s(β̂), such that s_j(β̂) = 0 for 0 ≤ j ≤ q and |s_j(β̂)| < (1 − 1/a)λ for q + 1 ≤ j ≤ p. By the definition of the subgradient, and comparing objective values for any β, we obtain β^(1) = β̂. This proves that the LLA algorithm finds the oracle estimator after one iteration. Since F_n2 does not hold, we obtain |β̂_j| > aλ for all 1 ≤ j ≤ q. So we have p'_λ(|β̂_j|) = 0 for all 1 ≤ j ≤ q and p'_λ(|β̂_j|) = p'_λ(0) = λ for all q + 1 ≤ j ≤ p by Property 2 of the class of penalties. At iteration 1, when the LLA algorithm has found β̂, the solution to the next LLA iteration, β^(2), is the minimizer of a convex optimization problem; comparing objective values for any β, iteration 2 again finds the oracle estimator, β^(2) = β̂, and the algorithm stops.

Simulation Experiments
We considered four models to evaluate the finite-sample performance of the distributed SVM. The first and second models were similar to models 1 and 2 of [27], respectively. The first model was essentially a standard linear discriminant analysis, which was used in [19,24,25]. The other three models were probit regression under different settings. In the numerical simulation experiments, we generated the data in R and used the CPLEX solver in AMPL to solve the optimization problem for model 1. In addition, we used Python for the other models.
Model 1: X ∼ MN(±μ, Σ) according to the class label, where the mean vector μ = (μ_1, . . . , μ_q, 0, . . . , 0)^T ∈ R^p is nonzero only in its first q coordinates, and Σ = (σ_ij) with σ_ii = 1 for i = 1, 2, . . . , p and σ_ij = ρ = −0.2 for 1 ≤ i ≠ j ≤ q. The Bayes rule is sgn(2.67X_1 + 2.83X_2 + 3X_3 + 3.17X_4 + 3.33X_5), with Bayes error 6.3%.
Model 2: X* ∼ MN(0, Σ), Σ = (σ_ij) with σ_ij = 0.4^{|i−j|} for 1 ≤ i ≠ j ≤ p and σ_ii = 1, and P(Y = 1 | X) = Φ(X^T β_0), where Φ is the cumulative distribution function of the standard normal distribution. The Bayes rule is sgn(X_1 + X_2 + X_3 + X_4), with Bayes error 10.4%.
Model 3: X_ij ∼ MN(0, Σ), Σ = (σ_ij) with σ_ij = 0.5^{|i−j|} for 1 ≤ i ≤ n, 1 ≤ j ≤ m, and P(Y = 1 | X) = Φ(X^T β_0), where Φ is the cumulative distribution function of the standard normal distribution. The true parameter β_0 is set to be sparse, and its first q entries are i.i.d. random variables uniformly distributed on [0, 1].
Model 4: X_ij ∼ MN(0, Σ), Σ = (σ_ij) with σ_ij = 0.5^{|i−j|/5} for 1 ≤ i ≤ n, 1 ≤ j ≤ m, and P(Y = 1 | X) = Φ(X^T β_0), where Φ is the cumulative distribution function of the standard normal distribution. The true parameter β_0 is set to be sparse, and its first q entries are i.i.d. random variables uniformly distributed on [0, 1]. This is an ill-conditioned case of model 3.
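For reproducibility, Model-1-style data can be generated as follows. Since the exact mean vector of Model 1 is not recoverable from the text above, mu_val below is a hypothetical placeholder, not the paper's value:

```python
import numpy as np

def gen_lda_data(n, p, q=5, rho=-0.2, mu_val=0.5, seed=0):
    """Model-1-style LDA data: X | Y=+1 ~ N(mu, Sigma), X | Y=-1 ~ N(-mu, Sigma).
    Sigma has unit diagonal and correlation rho among the first q coordinates;
    mu has mu_val (a placeholder value) on its first q entries and 0 elsewhere."""
    rng = np.random.default_rng(seed)
    mu = np.zeros(p)
    mu[:q] = mu_val
    Sigma = np.eye(p)
    Sigma[:q, :q] += rho * (1 - np.eye(q))   # off-diagonal rho in the first q-block
    y = rng.choice([-1.0, 1.0], size=n)
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n) + y[:, None] * mu
    return X, y
```

Note that with q = 5 and ρ = −0.2, the smallest eigenvalue of the first block is 1 + (q − 1)ρ = 0.2 > 0, so Σ is a valid covariance matrix.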
We compared the finite-sample performances of the following four estimators:
• L1SVM algorithm: the proposed communication-efficient estimator β̂_L1;
• SCADSVM algorithm: the proposed communication-efficient estimator β̂_SCAD;
• Cen algorithm: the centralized estimator β̂_Cen, which computes the ℓ1-regularized estimator using the entire dataset;
• Sub algorithm: the sub-data estimator β̂_Sub, which computes the ℓ1-regularized estimator using only the data on the first machine.
We used the first model to compare the variable-selection performance of the above four algorithms, as listed in Table 1, and used the other models to evaluate the estimation errors of the parameters via the MSE, as presented in Figures 1-3.
The numbers in Table 1 are the numbers of zero coefficients incorrectly estimated as nonzero. The number of nonzero coefficients incorrectly estimated as zero was zero, which meant that all four algorithms could find the relevant variables; hence, these numbers are not listed. From Table 1, we observed the following: (i) The centralized algorithm was the best among these algorithms because it used the information of the whole dataset. (ii) The sub algorithm performed poorly because it only used the information of the data on the first machine. (iii) Our proposed L1SVM and SCADSVM could both select the relevant variables, and the SCADSVM performed better than the L1SVM. This implied that the non-convex SCADSVM algorithm was more robust than the convex L1SVM, especially for complex models and massive datasets. (iv) When N = mn was large, our two proposed distributed SVM algorithms were as good as the centralized algorithm.
We give the prediction error analysis for models 2-4 in Figures 1-3, from which we have the following observations: (i) The central algorithm was still the best classifier, but it had the highest communication cost and risk of privacy leakage, and there was a big gap between the sub estimator and the centralized estimator. (ii) Our two proposed communication-efficient estimators could match the central estimator within a few rounds of communication. The prediction errors of the SCADSVM were lower than those of the L1SVM, and it was more robust than the L1SVM.

Real Data
In this subsection, we verify the performance of the CSLSVM algorithms (L1SVM and SCADSVM) using three real datasets. We use the 'a9a', 'w8a', and 'phishing' datasets from the LIBSVM website (https://www.csie.ntu.edu.tw/∼cjlin/libsvm/, accessed on 12 February 2022). These real datasets are listed in Table 2. The 'a9a' dataset is an adult dataset from the 1994 Census database; the prediction task is to determine whether a person makes over 50K a year. The 'w8a' dataset is also based on the Census database, but it has more features than 'a9a'. The 'phishing' dataset aims to predict phishing websites. Phishing is the process by which a fraudster impersonates a legitimate person by simulating the same or similar web pages or websites in order to steal personal or private information for illegal political or economic gain. As phishing becomes more and more serious, phishing web detection is gaining attention as an anti-phishing measure and technique.
Approximately 80% of the data was used to train the model, and the remainder was used to test it. In distributed learning, we used m = 5, 10, and 20 worker machines, respectively. The classification errors for the three datasets are provided in Figures 4-6, from which we found the following: (i) Since these datasets had no well-specified model, the curves behaved quite differently across the datasets; however, overall, there was a large gap between the sub algorithm and the centralized solution. (ii) In most cases, the distributed L1SVM algorithm still converged quite slowly. (iii) The proposed distributed SCADSVM could obtain a solution that was highly competitive with the centralized model within a few rounds of communication, and it was more robust than the distributed L1SVM.
The experimental results on simulated and real datasets verified that the proposed distributed SCADSVM/L1SVM algorithms are two effective procedures for distributed sparse learning for classification via the SVM technique, and that they maintain efficiency in both communication and computation. Regarding the computational effort of the four methods, we also observed the following in the numerical experiments. For the central algorithm, the whole dataset was used to train the model, so it gave the most accurate estimation, but it had the highest computational cost and a risk of privacy leakage. The sub-data estimation algorithm had the least computational cost because of the small amount of data, and no communication was required; however, it had the largest estimation error. Our proposed L1SVM and SCADSVM algorithms were communication- and computation-efficient under the CSL framework and could match the central estimator.

Conclusions
In this paper, we proposed a novel distributed CSLSVM learning algorithm with convex (ℓ1) or nonconvex (SCAD) penalties based on the communication-efficient surrogate likelihood (CSL) framework, which was efficient in both communication and computation. For the CSLSVM with ℓ1 penalty, we proved that the L1SVM estimator could achieve a near-oracle property for an ℓ1 penalized SVM estimator based on the whole dataset. For the CSLSVM with SCAD penalty, we showed that the SCADSVM estimator enjoyed the oracle property; i.e., one of the local minima of the distributed non-convex penalized SVM behaved like the oracle estimator based on the whole dataset, as if the true sparsity were known in advance and only the relevant features were used to form the decision boundary. We also showed that, as long as the initial estimator was appropriate, the oracle estimator could be identified with probability tending to 1. Extensive experiments on both simulated and real data illustrated that the proposed CSLSVM algorithm improved on the performance obtained using the first worker machine alone and matched the centralized method. In addition, the proposed distributed SCADSVM could obtain a solution that was highly competitive with the centralized model within a few rounds of communication, and it was more robust than the distributed L1SVM.