Distributed Sparse Precision Matrix Estimation via Alternating Block-Based Gradient Descent

Abstract: Precision matrices can efficiently characterize the correlation structure between variables, and they have received much attention in recent years. When one encounters large datasets stored in different locations, and when data sharing is not allowed, implementing high-dimensional precision matrix estimation can be numerically challenging or even infeasible. In this work, we study distributed sparse precision matrix estimation via an alternating block-based gradient descent method. We obtain a global model by aggregating each machine's information via a communication-efficient surrogate penalized likelihood. The procedure chooses the block coordinates using the local gradient to guide the global gradient updates, which can efficiently accelerate precision estimation and lessen the communication load on each machine. The proposed method can efficiently achieve the correct selection of the non-zero elements of a sparse precision matrix. Under mild conditions, we show that the proposed estimator achieves a near-oracle convergence rate, as if the estimation had been conducted with a consolidated dataset on a single computer. The promising performance of the method is supported by both simulated and real data examples.


Introduction
Estimating an inverse covariance (or precision) matrix in high dimensions naturally arises in a wide variety of application domains, such as clustering analysis [1,2], discriminant analysis [3], and so on.When the dimension p is much larger than the sample size N, the precision matrix cannot be estimated using the inverse of the sample covariance matrix, due to the singularity of the sample covariance matrix, and estimating the precision matrix is ill-posed and time-consuming, as the number of parameters to be estimated is of the order O(p 2 ).As an illustration, in the prostate cancer RNA-Seq dataset we analyze in this paper, genetic activity measurements have been documented for 102 subjects, with 50 normal control subjects and 52 prostate cancer patients.Given that there are over D = 6033 2 parameters to estimate, the analytical challenges associated with simultaneous discriminant analysis and estimation are significantly amplified.Accurate and fast precision estimation is becoming increasingly important in statistical learning.Among the many high-dimensional inference problems, a variety of precision estimating methods have been proposed to enrich the theory of this field.Friedman et al. [4] developed an l 1 penalized likelihood approach to directly estimate the precision matrix, namely graphical Lasso (GLasso); Cai et al. 
[5] proposed a constrained l 1 -minimization procedure to seek a sparse precision matrix under a matrix inversion constraint; Liu and Luo [6] developed a penalized column-wise procedure for estimating a precision matrix; Zhang and Zou [7] advocated a new empirical loss termed the D-trace loss, to avoid computing the log determinant term.For more details refer to [8,9].However, the rapid emergence of massive datasets poses a serious challenge for high-dimensional precision estimation, where the dimensionality p and the sample size N are both huge.In addition, the computing power, memory constraints, and privacy considerations often make it difficult to pool the separate collections of massive data into a single dataset.Communication is prohibitively expensive due to the limited bandwidth, and direct data sharing raises concerns about privacy and loss of ownership.For example, hospitals may collect the information of tens of thousands of patients, and directly transferring the raw data can be inefficient due to storage bottlenecks.Moreover, in practice, the hospitals are unwilling to share their raw data directly when scientists need to locate relevant genes corresponding to a certain disease from massive data, owing to privacy considerations.The accelerated growth of data sizes and joint analysis of data collected by different parties make statistical inferences on a single computer no longer sufficient, which in addition makes high-dimensional precision estimation a challenging task.
To resolve the above difficulties, one natural strategy is a "divide-and-conquer" approach. In such a strategy, a large problem is first divided into smaller, manageable subproblems, and the final output is obtained by combining the corresponding sub-outputs. Following this idea, statisticians can improve computing efficiency and reduce privacy risks, while obtaining a global method by aggregating the statistics of each machine. Many distributed statistical methods have been developed for processing massive datasets. Lee et al. [10] proposed a debiasing approach to allow aggregation of local estimates in a distributed setting; Jordan et al. [11] developed an approximate likelihood approach for solving distributed statistical inference problems; and Fan et al. [12] extended their idea and presented two communication-efficient accurate statistical estimators (CEASE). For more details, refer to [13,14].
Due to the importance of estimating a precision matrix, some studies have begun to focus on distributed estimation for the precision matrix, where the datasets are distributed over multiple machines due to size limitations or privacy considerations. Arroyo and Hou [15] estimated the precision matrix for Gaussian graphical models via a simple averaging method; Wang and Cui [16] developed a distributed estimator of the sparse precision matrix by debiasing a D-trace Lasso-type estimator and aggregating the local estimators by simple averaging. Under distributed data storage, one needs to carefully address two crucial questions for estimating the precision matrix: (a) Estimation-effectiveness: a naive distributed estimate can suffer non-negligible information loss relative to the whole data, so one should design a distributed procedure that conducts an effective global high-dimensional precision matrix estimation, as if the data were used as a consolidated dataset on a single computer; (b) Communication-efficiency: estimating a precision matrix suffers from high communication costs under a distributed setup, since the communication cost grows as O(p²) with the dimensionality p for each machine, so one should design an efficient method to reduce the costs incurred by transferring matrices of size O(p²).
To ease the implementation difficulties and communication costs of estimating a precision matrix, we propose an alternating block-based gradient descent (Bgd) method for distributed precision matrix estimation. In detail, we optimize a surrogate loss function, with all machines participating by evaluating gradients for their corresponding gradient-enhanced loss functions. In each iteration, we update only a block of coordinates of the precision matrix, where the block is chosen from the κ largest entries (in absolute value) of the local gradient on a random machine m. By setting κ ≪ p², we can develop an accurate statistical estimate of the precision matrix under a distributed setup, which lessens the communication costs and computation budget by using a random machine to choose the block. Under mild conditions, we show that Bgd leads to a consistent estimator; it can even achieve performance similar to the debiased Lasso-penalized D-trace estimator [7]. The promising performance of the method is supported by both simulated and real data examples.
The rest of this paper is organized as follows: In Section 2, we formulate the research problem and introduce the Bgd framework. In Section 3, we investigate the theoretical properties of Bgd. In Section 4, we demonstrate the promising performance of Bgd through Monte Carlo simulations and a real data example. Concluding remarks are given in Section 5. Technical details are presented in Appendix A.

Distributed Sparse Precision Matrix Estimation

Model Setups
Assume that we have N independent and identically distributed p-dimensional random variables with covariance matrix Σ and corresponding precision matrix Ω := Σ⁻¹. Each nonzero entry of Ω corresponds to an edge in a Gaussian graphical model describing the conditional dependence structure of the observed variables. In particular, if a p-dimensional random vector X ∼ N_p(μ, Σ), the conditional independence between X_j and X_l given the other features is equivalent to Ω_jl = 0, j, l ∈ [p]. A sparse structure of the precision matrix Ω provides a concise relationship between features and also gives a meaningful interpretation of the conditional independence among the features; thus, one needs to achieve a sparse and stable estimate of the precision matrix Ω.
Throughout this paper, we assume that the number of features p can be much larger than the total sample size N, but that the true precision matrix is sparse, with only a few nonzero entries in the high-dimensional setting. We assume the N observations are partitioned at random and stored on M local clients; without loss of generality, we assume that X is sub-Gaussian and that the data are equally partitioned across the M machines. In the high-dimensional setting, a common approach to obtaining a sparse estimator of Ω is to minimize the following l1-regularized negative log-likelihood, known as the graphical lasso,

Ω̂ = argmin_{Ω ≻ 0} { tr(SΩ) − log det(Ω) + λ∥Ω∥_{1,off} },

where S is the sample covariance matrix and ∥Ω∥_{1,off} is the l1 norm of the off-diagonal entries. Many algorithms have been developed to solve this problem; however, an eigendecomposition or the calculation of the determinant of a p × p matrix is inevitable in these algorithms. Motivated by [6,7], we instead adopt the D-trace loss, which avoids the log-determinant term; under the distributed scenario, the global and local loss functions can be written as

ℓ_N(Ω) = (1/2) tr(ΩSΩ) − tr(Ω) = (1/M) ∑_{m=1}^{M} ℓ^{(m)}(Ω),  ℓ^{(m)}(Ω) = (1/2) tr(ΩS_mΩ) − tr(Ω),

where S_m is the local sample covariance matrix on client m. For a single machine, many algorithms have been developed to solve the above problem, and some authors have shown that their estimators are asymptotically consistent. The goal of this study is to estimate the high-dimensional precision matrix Ω in a distributed system, where the communication cost and the accuracy of estimation are the major considerations.
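As a concrete sketch of the loss above, the local D-trace loss and its gradient on one machine can be computed as follows. This assumes the D-trace form of Zhang and Zou [7] recovered from the surrounding text; `dtrace_loss` and `dtrace_grad` are illustrative names, not functions from the paper.

```python
import numpy as np

def dtrace_loss(Omega, S):
    """Local D-trace loss 0.5*tr(Omega S Omega) - tr(Omega),
    where S is one machine's sample covariance matrix."""
    return 0.5 * np.trace(Omega @ S @ Omega) - np.trace(Omega)

def dtrace_grad(Omega, S):
    """Gradient 0.5*(S Omega + Omega S) - I of the D-trace loss."""
    p = S.shape[0]
    return 0.5 * (S @ Omega + Omega @ S) - np.eye(p)
```

Because the loss is linear in S, averaging the local losses over machines with S̄ = (1/M)∑S_m recovers the global loss exactly.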

Block-Gradient Descent Algorithm
To develop a communication-efficient method for learning a high-dimensional precision matrix, we first review the proposal of Jordan et al. [11]: starting from an initial estimator, gradients are communicated, and parameter updates are obtained from a communication-efficient surrogate likelihood. Note that, in [11], only the first machine solves the optimization problem, and the global Hessian matrix is replaced by the first local Hessian matrix. To fully utilize the information on each machine, we instead choose a random machine m to solve the optimization problem in every iteration. In this strategy, we define the surrogate loss function for a random machine m as

L̃(Ω) = ℓ^{(m)}(Ω) − ⟨∇ℓ^{(m)}(Ω̃) − ∇ℓ_N(Ω̃), Ω⟩ + ∑_{j≠l} P_λ(Ω_jl),   (3)

where Ω̃ is an initial estimator of Ω and P_λ(·) is a (possibly concave) penalty function with tuning parameter λ > 0. In high-dimensional regimes, it is impossible to derive a closed-form solution for Ω. A natural remedy is to add a strictly convex quadratic regularization term (ν/2)∥Ω − Ω̃∥²_F and to use g(Ω | Ω̃) to approximate the surrogate loss L̃(Ω), where g(Ω | Ω̃) can be defined as

g(Ω | Ω̃) = ℓ̃(Ω̃) + ⟨∇ℓ̃(Ω̃), Ω − Ω̃⟩ + (ν/2)∥Ω − Ω̃∥²_F + ∑_{j≠l} P_λ(Ω_jl),   (4)

with ℓ̃ denoting the smooth part of L̃. Using (4), if we set Ω̃ to the current t-th iterate Ω^(t), an approximate solution to (3) is obtained by the iterative procedure

Ω^(t+1) = argmin_Ω g(Ω | Ω^(t)).

At each iteration, the regularization term prevents the minimizer from moving too far away from Ω^(t); this yields a non-greedy update when searching for the estimate of a high-dimensional precision matrix [17]. We can use gradient descent to optimize g(Ω | Ω^(t)), and g(Ω | Ω^(t)) approximates L̃(Ω) well for Ω^(t) close to Ω when the stepsize ν is chosen appropriately from the local sample covariance matrix S_m on machine m. However, plain gradient descent needs to transmit O(p²) entries from each machine, which results in high communication costs and a heavy computation burden per round. Intuitively, we should choose the block that updates the global gradient most rapidly under the communication constraints. In this paper, we use the local gradient to guide the choice of the block, where the block in the t-th iteration is chosen as

C_t = { (j, l) : |∇ℓ^{(k)}(Ω^(t))_jl| ≥ |vec(∇ℓ^{(k)}(Ω^(t)))|_(κ) },

where vec(B)_(κ) denotes the κ-th largest component (in absolute value) of the vector vec(B), ∇ℓ^{(k)}(Ω^(t)) is the local gradient on machine k, and k is the random machine chosen to optimize the surrogate loss. In every iteration, every machine transfers O(κ) entries rather than an O(p²) gradient matrix, and we update the gradient and precision matrix only on the block C_t. Details can be found in Algorithm 1.
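The block rule above can be sketched in a few lines: select the index set of the κ entries of a local gradient matrix with largest absolute value. `choose_block` is an illustrative name for this selection step.

```python
import numpy as np

def choose_block(grad, kappa):
    """Return the index set C_t of the kappa entries of the local
    gradient matrix with the largest absolute values."""
    flat = np.abs(grad).ravel()
    # argpartition finds the top-kappa positions without a full sort
    idx = np.argpartition(flat, -kappa)[-kappa:]
    return np.unravel_index(idx, grad.shape)
```

Only these κ index pairs (and later the κ gradient values at those positions) need to be communicated, rather than the full p × p matrix.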
With the aid of the surrogate negative likelihood, we can efficiently train a global model that aggregates information from all machines. Thus, the procedure has good potential to provide more reliable estimation results while reducing the communication loads and costs.
Regarding the update of Ω, we need to design a distributed algorithm under communication constraints. To update Ω, we solve the following minimization problem restricted to the block C_t:

Ω₁^(t+1) = argmin_Ω g(Ω | Ω^(t)).   (6)

The penalty function in (6) can be chosen as the Lasso, MCP, or SCAD off-diagonal penalty [18,19], and closed-form solutions exist for all three. Let S(a, λ) = sign(a)(|a| − λ)₊ be the soft-thresholding rule, where (b)₊ = b if b > 0 and (b)₊ = 0 otherwise, and denote by Z = Ω^(t) − ν⁻¹∇ℓ̃(Ω^(t)) the matrix obtained from one gradient step. The closed-form solution for the Lasso penalty is

Ω₁,jl^(t+1) = S(Z_jl, λ/ν).

For the MCP penalty with ς > 1/ν,

Ω₁,jl^(t+1) = S(Z_jl, λ/ν) / (1 − 1/(ςν)) if |Z_jl| ≤ ςλ, and Ω₁,jl^(t+1) = Z_jl otherwise.

For the SCAD penalty with ς > 1/ν + 1, the closed-form solution can be written as

Ω₁,jl^(t+1) = S(Z_jl, λ/ν) if |Z_jl| ≤ (1 + 1/ν)λ;  S(Z_jl, ςλ/((ς−1)ν)) / (1 − 1/((ς−1)ν)) if (1 + 1/ν)λ < |Z_jl| ≤ ςλ;  Z_jl if |Z_jl| > ςλ,

where ς is a parameter that controls the concavity of the MCP and SCAD penalties. In particular, SCAD converges to the Lasso penalty as ς → ∞. Following Fan and Li [20], we treat ς as a fixed constant, such as ς = 3.7. SCAD not only yields sparsity, as the L1 penalty does, but also enjoys unbiasedness, in that it does not shrink large estimated parameters, so that they remain nearly unbiased throughout the iterations.
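The three closed-form updates can be implemented as elementwise proximal rules. The sketch below assumes the standard proximal operators for a quadratic term with curvature ν (the function names `soft`, `prox_mcp`, and `prox_scad` are ours), with concavity parameter ς written as `sigma`.

```python
import numpy as np

def soft(a, lam):
    """Soft-thresholding S(a, lam) = sign(a) * (|a| - lam)_+."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def prox_mcp(z, lam, nu, sigma=3.7):
    """Closed-form MCP update for stepsize 1/nu (requires sigma > 1/nu)."""
    z = np.asarray(z, dtype=float)
    return np.where(np.abs(z) <= sigma * lam,
                    soft(z, lam / nu) / (1.0 - 1.0 / (sigma * nu)),
                    z)

def prox_scad(z, lam, nu, sigma=3.7):
    """Closed-form SCAD update for stepsize 1/nu (requires sigma > 1/nu + 1)."""
    z = np.asarray(z, dtype=float)
    return np.where(np.abs(z) <= (1.0 + 1.0 / nu) * lam,
                    soft(z, lam / nu),
                    np.where(np.abs(z) <= sigma * lam,
                             soft(z, sigma * lam / ((sigma - 1.0) * nu))
                             / (1.0 - 1.0 / ((sigma - 1.0) * nu)),
                             z))
```

Note that both MCP and SCAD leave large entries untouched (the final `z` branch), which is the unbiasedness property mentioned above.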
Note that the solution Ω₁^(t+1) to (6) is not symmetric in general. To make it symmetric, following the symmetrization strategy of Cai et al. [5] and Cai et al. [21], the final estimator Ω^(t+1) is constructed by comparing each pair of entries (Ω₁,jl^(t+1), Ω₁,lj^(t+1)) and assigning the one with the smaller magnitude to both positions:

Ω_jl^(t+1) = Ω_lj^(t+1) = Ω₁,jl^(t+1) 1{|Ω₁,jl^(t+1)| ≤ |Ω₁,lj^(t+1)|} + Ω₁,lj^(t+1) 1{|Ω₁,lj^(t+1)| < |Ω₁,jl^(t+1)|}.

This symmetrizing procedure is not ad hoc: it ensures that the final estimator Ω^(t+1) achieves the same entry-wise L∞ estimation error as Ω₁^(t+1). For more details, refer to Section 3 of Cai et al. [21].
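The minimum-magnitude symmetrization rule has a direct vectorized form; `symmetrize` is an illustrative name for this step.

```python
import numpy as np

def symmetrize(Omega1):
    """For each pair (j, l), keep whichever of Omega1[j, l] and
    Omega1[l, j] has the smaller magnitude (ties keep Omega1[j, l])."""
    T = Omega1.T
    return np.where(np.abs(Omega1) <= np.abs(T), Omega1, T)
```

Since each off-diagonal pair is replaced by its smaller-magnitude member, the entry-wise error of the symmetrized matrix never exceeds that of the raw solution.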
We now discuss how to select the block size κ and the stepsize ν. Regarding κ, a larger value transmits more information from each local client in each iteration and leads to more accurate and faster convergence of Bgd; nevertheless, a larger κ also means higher communication loads and costs, so κ should be neither too small nor too large. Fortunately, we found the performance of Bgd to be robust over a wide range of choices of κ within a certain interval, which facilitates the use of Bgd by avoiding an elaborate specification of κ. In addition, many empirical studies have shown that a smaller value of ν often leads to faster convergence of the above algorithm. Theorem 1 indicates that the objective function is guaranteed to decrease in every iteration only if ν is larger than Λmax(S_m). In practice, one can first use a tentatively small ν on local client m and then check the following condition based on the data on client m:

ℓ^(m)(Ω^(t+1)) ≤ ℓ^(m)(Ω^(t)) + ⟨∇ℓ^(m)(Ω^(t)), ∆Ω^(t)⟩ + (ν/2)∥∆Ω^(t)∥²_F,   (11)

where ∆Ω^(t) = Ω^(t+1) − Ω^(t). If (11) is not satisfied, we take ν as twice its current value.
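The doubling rule can be sketched as a small backtracking loop. The majorization check below is our reading of condition (11); `ensure_stepsize`, `loss_m`, and `grad_m` are illustrative names for the routine and the local loss and gradient on client m.

```python
import numpy as np

def ensure_stepsize(loss_m, grad_m, Omega_new, Omega_old, nu):
    """Double nu until the local majorization check holds:
    loss_m(new) <= loss_m(old) + <grad_m(old), d> + (nu/2)*||d||_F^2,
    with d = Omega_new - Omega_old."""
    d = Omega_new - Omega_old
    while loss_m(Omega_new) > (loss_m(Omega_old)
                               + np.sum(grad_m(Omega_old) * d)
                               + 0.5 * nu * np.sum(d * d)):
        nu *= 2.0
    return nu
```

For a loss whose gradient is Lipschitz with constant L, the loop stops once ν ≥ L, which is consistent with the Λmax(S_m) requirement stated above.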
The proposed Bgd algorithm is summarized in Algorithm 1.
Algorithm 1: Alternating block-based gradient descent (Bgd). Initialize Ω^(0). For t = 0, 1, ..., T − 1:
• Choose a random machine m, which updates the block C_t from its local gradient;
• Machine m sends C_t to all machines through the central processor;
• Each machine k evaluates ∇ℓ^(k)(Ω^(t)) and sends the block entries ∇_{C_t}ℓ^(k)(Ω^(t)) to the central processor;
• The central processor transmits the gradients ∇_{C_t}ℓ^(k)(Ω^(t)), k = 1, ..., M, to machine m; machine m computes the block update Ω^(t+1) and broadcasts it to all machines through the central processor, where the regularization parameter ν is chosen via the local covariance matrix S_m.
Output Ω^(T).
Remark 1. In the Bgd algorithm, only the block gradient ∇_{C_t}ℓ^(k)(Ω^(t)) and the block Ω_{C_t} are transmitted between the central processor and client m. By setting ∥C_t∥₀ ≪ p², we can efficiently reduce the communication loads and costs, avoiding the transfer of O(p²) entries of the gradient matrix and the estimated precision matrix.
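Putting the pieces together, one round of the algorithm can be sketched as below. The D-trace gradient and a Lasso block update are assumptions carried over from the text; `bgd_iteration` is an illustrative name, and for brevity the sketch evaluates all local gradients in one process rather than over a network.

```python
import numpy as np

def bgd_iteration(Omega, S_list, m, kappa, lam, nu):
    """One Bgd round (a sketch): machine m picks the block C_t from its
    own gradient, every machine contributes only its gradient entries on
    C_t, and machine m takes a soft-thresholded block gradient step."""
    p = Omega.shape[0]
    # local D-trace gradients (assumed loss form)
    grads = [0.5 * (S @ Omega + Omega @ S) - np.eye(p) for S in S_list]
    flat = np.abs(grads[m]).ravel()
    C = np.unravel_index(np.argpartition(flat, -kappa)[-kappa:], (p, p))
    g = np.mean([G[C] for G in grads], axis=0)   # aggregated gradient on C_t only
    z = Omega[C] - g / nu                        # gradient step on the block
    Omega_new = Omega.copy()
    Omega_new[C] = np.sign(z) * np.maximum(np.abs(z) - lam / nu, 0.0)
    return Omega_new
```

Per round, each machine communicates only the κ selected gradient entries, matching the O(κ) cost stated in Remark 1.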

Theoretical Properties
We conducted a theoretical analysis to justify the proposed Bgd procedure; in particular, we studied the efficiency of the Bgd estimator in a distributed setup. To investigate the properties of the proposed Bgd algorithm, we required the following conditions:

(A1) (Sparsity) Let δ = {(j, l) : Ω*_jl ≠ 0} be the set of all non-zero entries of Ω*, with cardinality s* = |δ|, and let δᶜ be the complement of δ.

(A2) (Mutual incoherence) For some 0 < α < 1, the covariance matrix Σ* satisfies max_{e ∈ δᶜ} ∥Σ*_{e,δ}(Σ*_{δ,δ})⁻¹∥₁ ≤ 1 − α.

(A3) (Bounded eigenvalues) There exists a constant c₀ ≥ 1 such that c₀⁻¹ ≤ Λmin(Ω*) ≤ Λmax(Ω*) ≤ c₀, where Λmin(A) and Λmax(A) denote the smallest and largest eigenvalues of a matrix A, respectively.

(A4) (Restricted strong convexity) There exists a positive constant c₁ such that, for all ∆ with ∥∆_δᶜ∥₁ ≤ 3∥∆_δ∥₁, ⟨∇ℓ_N(Ω* + ∆) − ∇ℓ_N(Ω*), ∆⟩ ≥ c₁∥∆∥²_F.

Condition (A1) indicates that the precision matrix has a sparse structure; it has been widely used in the literature on Gaussian graphical model estimation [5,6]. Condition (A2) is in the same spirit as the mutual incoherence or irrepresentable condition of Liu and Luo [6]. Condition (A3) requires that the smallest eigenvalue of the precision matrix Ω* be bounded away from zero and that its largest eigenvalue be finite; it also implies that c₀⁻¹ ≤ Λmin(Σ*) ≤ Λmax(Σ*) ≤ c₀. This assumption is commonly imposed in the literature on the analysis of Gaussian graphical models [22]. Condition (A4) states that the loss ℓ_N(Ω) satisfies restricted strong convexity over the cone ∥∆_δᶜ∥₁ ≤ 3∥∆_δ∥₁.
Theorem 1, provided in Appendix A, indicates that, with an appropriately scaled ν, we can ensure L̃(Ω^(t+1)) ≤ L̃(Ω^(t)) with limited communication costs in every iteration. Theorem 1 thus provides insight into choosing the stepsize ν from the local data under the distributed setting in a practical implementation.

Theorem 2. Suppose that the data are sub-Gaussian and that assumptions (A1)-(A4) hold. If the tuning parameter satisfies λ ≍ √(log p / N), then ∥Ω̂ − Ω*∥_F = O_P(√(s* log p / N)), where s* is the number of non-zero entries of Ω*.

Theorem 2, also provided in Appendix A, establishes the convergence rate of the Bgd estimator under the Frobenius norm; the rate matches the near-oracle rate that would be obtained if the estimation were conducted with the consolidated dataset on a single computer.

Numerical Studies
In this section, we present several simulation studies and a real data example to investigate the finite-sample performance of the proposed Bgd procedure in terms of its estimation accuracy. We compare the proposed method with several other distributed high-dimensional precision matrix estimation methods: naive estimation based on averaging the local estimates obtained from the R package "glasso" (Naive) using R version 4.1.0; debiased distributed D-trace loss penalized estimation (Dtrace, [16]); and debiased distributed graphical lasso estimation (Dglasso, [15]). As a benchmark, we took the debiased D-trace loss penalized estimator proposed by Zhang and Zou [7], computed with all data in a non-distributed setting, as the global method. Each estimator was tuned by cross-validation, and all numerical experiments were conducted in R on a Microsoft Windows computer with a sixteen-core 4.50 GHz CPU and 32 GB RAM. In addition, for the penalty function in the objective function (3), we chose the Lasso and SCAD.
In our numerical studies, the Bgd model was implemented based on Algorithm 1 with Ω^(0) = I. We terminated the iterations when ∥Ω^(t+1) − Ω^(t)∥_F < 10⁻³ or t ≥ T_max, and we set T_max = 300 in Bgd. We chose κ = ⌊√(p log p)⌋; obviously, κ ≪ p², which efficiently reduces the communication loads and costs by avoiding the transfer of O(p²) entries of the gradient matrix and the estimated precision matrix. Here, ⌊a⌋ denotes the integer part of a. We evaluated the estimation accuracy of each method using the Frobenius loss and the spectral loss, L_F = ∥Ω̂ − Ω*∥_F and L₂ = ∥Ω̂ − Ω*∥₂. Generally, the smaller L_F and L₂ are, the higher the estimation accuracy. Moreover, to assess how accurately the sparsity pattern of the true precision matrix is recovered, we also evaluated the false negative (FN) rate, the proportion of true non-zero entries estimated as zero, and the false positive (FP) rate, the proportion of true zero entries estimated as non-zero. For the real data analysis, we considered the prostate cancer dataset with 102 subjects (50 normal control subjects and 52 prostate cancer patients). This dataset has been analyzed in several articles on high-dimensional analysis [5,24].
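The four evaluation criteria can be computed directly; the sketch below uses the FN/FP definitions stated above (the function names and the numerical tolerance `tol` are ours).

```python
import numpy as np

def frob_loss(Om_hat, Om_true):
    """Frobenius loss L_F = ||Om_hat - Om_true||_F."""
    return np.linalg.norm(Om_hat - Om_true, 'fro')

def spec_loss(Om_hat, Om_true):
    """Spectral loss L_2 = largest singular value of the difference."""
    return np.linalg.norm(Om_hat - Om_true, 2)

def fn_fp(Om_hat, Om_true, tol=1e-8):
    """FN: fraction of true non-zeros estimated as zero;
    FP: fraction of true zeros estimated as non-zero."""
    nz = np.abs(Om_true) > tol
    est = np.abs(Om_hat) > tol
    fn = float(np.mean(~est[nz])) if nz.any() else 0.0
    fp = float(np.mean(est[~nz])) if (~nz).any() else 0.0
    return fn, fp
```

Smaller values of all four quantities indicate a more accurate estimate.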
To evaluate the performance of the proposed Bgd method for distributed precision estimation, we randomly partitioned 100γ% of the samples as training data and the remaining 100(1 − γ)% as testing data, and the training data were equally partitioned into M ∈ {5, 10, 20} data segments. Having more than 50% training data is often preferred [25], and we set γ = 0.6 or 0.8, which left every client with only 3 or 4 observations for training when the number of machines was M = 20. For simplicity of calculation, we selected p = 300 genes from all 16,386 genes using the R package "SIS" with the logistic model, which resulted in over 90,000 parameters to be estimated.
Our goal was to estimate the precision (inverse covariance) matrix in a distributed setting, and we could not use L₂ or L_F to measure the estimation accuracy of each method, as the true Ω* is unknown. Following the same analysis as [5], the normalized gene expression data were assumed to be normally distributed as N(μ_k, Σ), where the two groups share the same covariance matrix Σ but have different means μ_k, k = 1, 2. The estimated precision matrix Ω̂ produced by each method was used in the linear discriminant scores

δ_k(x) = xᵀΩ̂μ̂_k − (1/2)μ̂_kᵀΩ̂μ̂_k + log π̂_k,

and the classification rule was taken to be k̂(x) = arg max_{k=1,2} δ_k(x). For simplicity, μ̂_k and π̂_k in the linear discriminant scores were estimated from the training data in a non-distributed setting, whereas Ω̂ was estimated from the training data under the distributed setup; the classification performance is therefore directly associated with the estimation accuracy of Ω̂. The training dataset was used for parameter estimation, while the testing dataset was used to compute the classification error, which we used to assess the estimation performance and compare it with the existing results of other methods. A good estimation method for a precision matrix is expected to have a low misclassification rate (prediction error, Prr) and high sensitivity (Sen) and specificity (Spe) across all partitions.
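The discriminant rule above can be sketched directly from its formula; `lda_scores` is an illustrative name, and the class means, priors, and precision estimate are assumed to have been computed beforehand.

```python
import numpy as np

def lda_scores(x, Omega, mus, pis):
    """Linear discriminant rule: compute
    delta_k(x) = x^T Omega mu_k - 0.5 mu_k^T Omega mu_k + log pi_k
    for each class and return the index of the largest score."""
    scores = [x @ Omega @ mu - 0.5 * mu @ Omega @ mu + np.log(pi)
              for mu, pi in zip(mus, pis)]
    return int(np.argmax(scores))
```

Since only Ω̂ changes across the compared methods, differences in test-set misclassification reflect differences in the quality of the precision estimate.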
We summarized the assessment based on T = 100 replications in terms of sensitivity (Sen), specificity (Spe), and the overall prediction error. The proposed Bgd method outperformed all the other methods and even achieved performance similar to the global method. The models chosen by Bgd had higher sensitivity and specificity and a lower misclassification error (Prr). From Table A2, the promising performance of Bgd is again observed.
Moreover, to demonstrate that Bgd is robust over a wide range of choices of κ within a certain interval, we repeated the above procedure with γ = 0.8 and calculated the corresponding Prr values. Figure A2 plots the Prr values with τ on the x-axis, where κ = ⌊τ√(p log p)⌋. Inspection of Figure A2 indicates that the proposed Bgd method was robust over a wide range of κ and still performed better than the other methods, in that the Prr of Bgd was lower than that of the other methods for the various κ.

Discussion
This paper proposes a novel method for high-dimensional precision estimation when the dataset is distributed across different machines. We studied distributed sparse precision matrix estimation via an alternating block-based gradient descent method, where the block is chosen by the local gradient. This procedure can reduce the communication loads and costs while delivering a reliable estimate. The proposed method showed good potential to improve estimation accuracy compared with the other distributed methods.
The current work focused on homogeneous data. An interesting topic for future research would be to extend this work to the joint estimation of multiple precision matrices.

Appendix A
Lemma A1. Let X₁, X₂, ..., X_N be i.i.d. sub-Gaussian random vectors with Var(X_i) = Σ* and ∥X_i∥_ψ₂ ≤ H. Let S_m be the local sample covariance matrix of machine m and define S̄ = (1/M) ∑_{m=1}^{M} S_m. Then there exists a constant C such that, with high probability,

∥S̄ − Σ*∥_max ≤ C H² √(log p / N).

The lemma can be found in [26].
Proof of Theorem 2. We prove Theorem 2 with the Lasso penalty; the proofs for the other penalties follow with minor modifications. Starting from the basic inequality for the penalized estimator, and using the fact that tr[(Ω̂ − Ω*)ᵀ S (Ω̂ − Ω*)] ≥ 0, condition (A4) is satisfied, and the stated bound on ∥Ω̂ − Ω*∥_F follows. Now, turning to the surrogate loss function L̃(Ω): given the initial Ω^(t) in the t-th iteration, g(Ω | Ω^(t)) has the same gradient and solution path as L̃(Ω) without the block constraints when using gradient descent. Combining this observation with Theorem 4 of [5] and the above statements yields the same conclusion as (A3).

Table A1. Performance of Bgd and competing methods in the simulation study.