Abstract
Precision matrices efficiently capture the conditional dependence structure among variables and have received much attention in recent years. When one encounters large datasets stored in different locations and data sharing is not allowed, high-dimensional precision matrix estimation can be numerically challenging or even infeasible. In this work, we study distributed sparse precision matrix estimation via an alternating block-based gradient descent method. We obtain a global model by aggregating each machine's information via a communication-efficient surrogate penalized likelihood. The procedure chooses the block coordinates using the local gradient to guide the global gradient updates, which can efficiently accelerate precision estimation and lessen the communication load on each machine. The proposed method can efficiently achieve the correct selection of the non-zero elements of a sparse precision matrix. Under mild conditions, we show that the proposed estimator achieves a near-oracle convergence rate, as if the estimation had been conducted with a consolidated dataset on a single computer. The promising performance of the method is supported by both simulated and real data examples.
Keywords:
block-based gradient descent; distributed estimation; high-dimensional; near-oracle; precision matrix
MSC:
62-08
1. Introduction
Estimating an inverse covariance (or precision) matrix in high dimensions naturally arises in a wide variety of application domains, such as clustering analysis [1,2], discriminant analysis [3], and so on. When the dimension p is much larger than the sample size N, the precision matrix cannot be estimated by inverting the sample covariance matrix, because the sample covariance matrix is singular; estimating the precision matrix is then ill-posed and time-consuming, as the number of parameters to be estimated is of the order $p^2$. As an illustration, in the prostate cancer RNA-Seq dataset we analyze in this paper, genetic activity measurements have been documented for 102 subjects, with 50 normal control subjects and 52 prostate cancer patients. Given the enormous number of parameters to estimate, the analytical challenges associated with simultaneous discriminant analysis and estimation are significantly amplified. Accurate and fast precision estimation is therefore becoming increasingly important in statistical learning. Among the many high-dimensional inference problems, a variety of precision estimation methods have been proposed to enrich the theory of this field. Friedman et al. [4] developed an $\ell_1$-penalized likelihood approach to directly estimate the precision matrix, namely the graphical Lasso (GLasso); Cai et al. [5] proposed a constrained $\ell_1$-minimization procedure to seek a sparse precision matrix under a matrix inversion constraint; Liu and Luo [6] developed a penalized column-wise procedure for estimating a precision matrix; and Zhang and Zou [7] advocated a new empirical loss termed the D-trace loss, to avoid computing the log-determinant term. For more details, refer to [8,9].
However, the rapid emergence of massive datasets poses a serious challenge for high-dimensional precision estimation, where the dimensionality p and the sample size N are both huge. In addition, computing power, memory constraints, and privacy considerations often make it difficult to pool separate collections of massive data into a single dataset. Communication is prohibitively expensive due to limited bandwidth, and direct data sharing raises concerns about privacy and loss of ownership. For example, hospitals may collect the information of tens of thousands of patients, and directly transferring the raw data can be inefficient due to storage bottlenecks. Moreover, when scientists need to locate relevant genes corresponding to a certain disease from massive data, hospitals are, in practice, unwilling to share their raw data directly, owing to privacy considerations. The accelerated growth of data sizes and the joint analysis of data collected by different parties make statistical inference on a single computer no longer sufficient, which further makes high-dimensional precision estimation a challenging task.
To resolve the above difficulties, one natural approach is the “divide-and-conquer” strategy. In such a strategy, a large problem is first divided into smaller, manageable subproblems, and the final output is obtained by combining the corresponding sub-outputs. Following this idea, statisticians can improve computing efficiency and reduce privacy risks, while obtaining a global method by aggregating the statistics from each machine. Many distributed statistical methods have been developed for processing massive datasets. Lee et al. [10] proposed a debiasing approach that allows aggregation of local estimates in a distributed setting; Jordan et al. [11] developed an approximate likelihood approach for solving distributed statistical inference problems; and Fan et al. [12] extended their idea and presented two communication-efficient accurate statistical estimators (CEASE). For more details, refer to [13,14].
Due to the importance of estimating a precision matrix, some studies have begun to focus on distributed estimation of the precision matrix, where the datasets are distributed over multiple machines due to size limitations or privacy considerations. Arroyo and Hou [15] estimated the precision matrix for Gaussian graphical models via a simple averaging method; Wang and Cui [16] developed a distributed estimator of the sparse precision matrix by debiasing a D-trace Lasso-type estimator and aggregating the debiased local estimators by simple averaging. Under distributed data storage, one needs to carefully address two crucial questions for estimating the precision matrix: (a) Estimation effectiveness: an estimator built from a single machine suffers a non-negligible loss of information relative to the whole data, so one should design a distributed procedure that conducts an effective global high-dimensional precision matrix estimation, as if the estimation were performed on a consolidated dataset on a single computer; (b) Communication efficiency: estimating a precision matrix incurs high communication costs under a distributed setup, and these costs increase with the dimensionality p on each machine, so one should design an efficient method that reduces the communication costs incurred by transferring $p \times p$ matrices.
To ease the implementation difficulties and communication costs of estimating a precision matrix, we propose an alternating block-based gradient descent (Bgd) method for distributed precision matrix estimation. In detail, we optimize a surrogate loss function, with all machines participating by optimizing their corresponding gradient-enhanced loss functions and evaluating gradients. In each iteration, we only update a block of coordinates of the precision matrix, where the block is chosen according to the largest entries (in magnitude) of the local gradient on a randomly selected machine m. By choosing a block size much smaller than the total number of entries, we can develop an accurate statistical estimation of the precision matrix under a distributed setup, which lessens the communication costs and computation budget by using a random machine to guide the choice of block. Under mild conditions, we show that Bgd leads to a consistent estimator and can even achieve a performance similar to the debiased lasso-penalized D-trace estimation [7]. The promising performance of the method is supported by both simulated and real data examples.
The rest of this paper is organized as follows: In Section 2, we formulate the research problem and introduce the Bgd framework. In Section 3, we investigate the theoretical properties of Bgd. In Section 4, we demonstrate the promising performance of Bgd through Monte Carlo simulations and a real data example. Concluding remarks are given in Section 5. Technical details are presented in Appendix A.
Throughout this paper, we use c and C to represent certain positive constants, which may differ from line to line. Let $[p]$ denote the set $\{1, \ldots, p\}$. For a vector, $\|\cdot\|_1$, $\|\cdot\|_2$, and $\|\cdot\|_\infty$ denote its $\ell_1$, $\ell_2$, and $\ell_\infty$ norms, respectively. For a matrix, $\|\cdot\|_{\max}$, $\|\cdot\|_2$, $\|\cdot\|_\infty$, and $\|\cdot\|_F$ denote its max, spectral, infinity, and Frobenius norms, respectively.
2. Distributed Sparse Precision Matrix Estimation
2.1. Model Setups
Assume that we have N independent and identically distributed p-dimensional random variables with covariance matrix $\Sigma$ and corresponding precision matrix $\Omega = \Sigma^{-1}$. Each nonzero entry of $\Omega$ corresponds to an edge in a Gaussian graphical model describing the conditional dependence structure of the observed variables. In particular, if a p-dimensional random vector follows a multivariate normal distribution, the conditional independence between two features given all the others is equivalent to the corresponding off-diagonal entry of $\Omega$ being zero. A sparse structure of the precision matrix provides a concise relationship between features and also gives a meaningful interpretation of the conditional independence among the features; thus, one needs to achieve a sparse and stable estimation of the precision matrix $\Omega$.
Throughout this paper, we assume that the number of features p can be much larger than the total sample size N, but that the true precision matrix is sparse, i.e., it has few non-zero entries in the high-dimensional setting. We use $S$ to denote the index set of the true nonzero components of the precision matrix. Given N independent observations, we suppose the data are partitioned into M subsets completely at random and stored on M local clients. Without loss of generality, assume that the data are sub-Gaussian and equally partitioned across the M machines. In the high-dimensional setting, a common approach to obtaining a sparse estimator of $\Omega$ is to minimize the following $\ell_1$-regularized negative log-likelihood, known as the graphical lasso:
$$\hat{\Omega} = \arg\min_{\Omega \succ 0} \Big\{ \operatorname{tr}(\hat{\Sigma}_N \Omega) - \log\det(\Omega) + \lambda \sum_{i \neq j} |\omega_{ij}| \Big\},$$
where $\hat{\Sigma}_N$ is the sample covariance matrix. Many algorithms have been developed to solve the above problem. However, an eigendecomposition or the calculation of the determinant of a matrix is inevitable in these algorithms. Motivated by [6,7], under the distributed scenario, the global and the local loss functions can be written as
where $\hat{\Sigma}_m$ is the local sample covariance matrix on client m. For a single machine, many algorithms have been developed to solve the above problem, and some authors have shown that their estimators are asymptotically consistent. The goal of this study is to estimate the high-dimensional precision matrix in a distributed system, where the communication cost and the accuracy of estimation are the major considerations.
2.2. Block-Gradient Descent Algorithm
To develop a communication-efficient method for learning a high-dimensional precision matrix, we first review the proposal of Jordan et al. [11]. Starting from an initial estimator, the gradients are communicated and the parameters are obtained within a communication-efficient surrogate likelihood framework. Note that, in [11], only the first machine solves optimization problems, and the global Hessian matrix is replaced by the first local Hessian matrix. To fully utilize the information on each machine, we choose a random machine m to solve the optimization problem in every iteration. In this strategy, we define the loss function for a random machine m as
where $\Omega^{(0)}$ is an initial estimator of $\Omega$ and $P_\lambda(\cdot)$ is a concave penalty function with a tuning parameter $\lambda$. In high-dimensional regimes, it is impossible to derive a closed-form solution of the resulting optimization problem. A simple remedy is to add a strictly convex quadratic regularization term and use the resulting function to approximate the surrogate loss function. The approximation can then be defined as
Using (4), if we set the expansion point as the current t-th iterate, an approximate solution to (3) is obtained by the following iterative procedure:
At each iteration, the regularization term prevents the minimizer from moving too far away from the current iterate. This feature yields a non-greedy update in searching for the estimate of a high-dimensional precision matrix [17]. We can use gradient descent to optimize the surrogate, which approximates the global loss well near the current iterate when the stepsize is chosen appropriately based on the local sample covariance matrix on machine m. However, full gradient descent needs to transmit $O(p^2)$ bits from each machine, which results in high communication costs and a heavy computation burden per round. Intuitively, we should choose the block that updates the global gradient most rapidly under the communication constraints. In this paper, we use the local gradient to guide the choice of the block, where the block in the current iteration is chosen using
where the block collects the indices of the largest components (in magnitude) of the local gradient on machine k, and k is a random machine chosen to optimize the surrogate negative likelihood. In every iteration, every machine transfers only the gradient entries on the selected block rather than a full $p \times p$ gradient matrix, and we only update the gradient and the precision matrix on that block. Details can be found in Algorithm 1. With the aid of the surrogate negative likelihood, we can efficiently train a global model that aggregates information from the other machines. Thus, the procedure has good potential to provide more reliable estimation results while reducing the communication loads and costs.
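To make the block-selection rule concrete, the following R sketch selects the indices of the largest local-gradient entries on a chosen machine. The function names are ours, and the D-trace-type gradient is only an illustrative assumption standing in for the gradient of the smooth loss actually used; substitute the appropriate gradient if the loss differs.

```r
# Assumed D-trace-type local gradient at Omega on one machine (illustrative).
local_grad <- function(S, Omega) {
  (S %*% Omega + Omega %*% S) / 2 - diag(ncol(S))
}

# Choose the block as the tau index pairs with the largest local-gradient
# magnitudes on machine k; only these pairs need to be communicated.
select_block <- function(Sigma_k, Omega, tau) {
  g <- local_grad(Sigma_k, Omega)
  p <- ncol(Omega)
  ord <- order(abs(g), decreasing = TRUE)[seq_len(tau)]      # top-tau entries
  cbind(row = (ord - 1) %% p + 1, col = (ord - 1) %/% p + 1) # matrix indices
}
```

Only these index pairs and the corresponding gradient values then travel between the machines and the central processor, instead of a full p × p gradient matrix.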
Regarding the update of the precision matrix, we need to design a distributed algorithm with communication constraints. To perform the update, we solve the following minimization problem:
where
The penalty function in (6) can be chosen from the Lasso, MCP, or SCAD off-diagonal penalties [18,19], and closed-form solutions of (6) exist for all three. For example, let $S(z, \lambda) = \operatorname{sign}(z)(|z| - \lambda)_{+}$ denote the soft-thresholding rule, applied to the off-diagonal entries only (the diagonal entries are left unpenalized). The closed-form solution for the Lasso penalty is
For the MCP penalty with concavity parameter $a$,
For the SCAD penalty with concavity parameter $a$, the closed-form solution can be written as
where $a$ is a parameter that controls the concavity of the MCP and SCAD functions. In particular, the SCAD converges to the Lasso penalty as $a \to \infty$. Following Fan and Li [20], we treat $a$ as a fixed constant, such as $a = 3.7$. The SCAD not only produces sparsity, as the $\ell_1$ penalty does, but also has the property of unbiasedness, in that it does not shrink large estimated parameters, so that they remain unbiased throughout the iterations.
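For concreteness, the R functions below implement the textbook soft-thresholding (Lasso), MCP, and SCAD thresholding operators for an entry z with threshold lambda and concavity parameter a; these are standard forms and may differ in scaling details from the exact closed-form expressions displayed above.

```r
# Soft-thresholding rule S(z, lambda) = sign(z) * (|z| - lambda)_+.
soft <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

# MCP thresholding (minimizer of (1/2)(z - b)^2 + MCP(b; lambda, a), a > 1).
mcp_threshold <- function(z, lambda, a = 3) {
  ifelse(abs(z) <= a * lambda, soft(z, lambda) / (1 - 1 / a), z)
}

# SCAD thresholding (minimizer of (1/2)(z - b)^2 + SCAD(b; lambda, a), a > 2).
scad_threshold <- function(z, lambda, a = 3.7) {
  ifelse(abs(z) <= 2 * lambda,
         soft(z, lambda),
         ifelse(abs(z) <= a * lambda,
                ((a - 1) * z - sign(z) * a * lambda) / (a - 2),
                z))
}
```

In the Bgd update, such an operator would be applied entry-wise to the gradient step on the selected block, with the threshold scaled by the stepsize and the diagonal entries left unpenalized.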
Note that the solution to (6) is not symmetric in general. To make it symmetric, following the symmetrization strategy of Cai et al. [5] and Cai et al. [21], the final estimator is constructed by comparing the two entries at positions (i, j) and (j, i) and assigning the one with the smaller magnitude to both, which is
This symmetrizing procedure is not ad hoc: it ensures that the final estimator achieves the same entry-wise estimation error as the unsymmetrized solution. For more details, refer to Section 3 of Cai et al. [21].
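A minimal R sketch of this symmetrization step, under the stated rule of keeping at both positions (i, j) and (j, i) the entry with the smaller magnitude:

```r
# Symmetrize an estimate by keeping, for each pair (i, j), the entry whose
# magnitude is smaller; ties keep the (i, j) entry.
symmetrize_min <- function(Omega) {
  keep <- abs(Omega) <= abs(t(Omega))  # TRUE where |omega_ij| <= |omega_ji|
  ifelse(keep, Omega, t(Omega))        # ifelse preserves the matrix shape
}
```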
We now discuss how to select the block-size constraint and the stepsize parameter. Regarding the block size, a larger value transmits more information from each local client in each iteration and leads to a more accurate and faster convergence of Bgd. Nevertheless, a larger block size also means higher communication loads and costs. The choice of the block size is a great challenge: its value should be neither too small nor too large. Fortunately, we found that the performance of Bgd was robust over a wide range of block sizes within a certain interval, which facilitates the use of Bgd by avoiding an elaborate specification of the block size. In addition, many empirical studies have shown that a smaller value of the stepsize parameter often leads to faster convergence of the above algorithm. Theorem 1 indicates that the objective function is guaranteed to decrease at every iteration only if the stepsize parameter is larger than a certain threshold. In practice, one can first use a tentatively small value on local client m and then check the following condition based on the data on client m:
If (11) is not satisfied, we double the current value of the stepsize parameter. The proposed Bgd algorithm is summarized in Algorithm 1.
Algorithm 1. Distributed sparse precision matrix estimation via Bgd.
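To illustrate how the pieces fit together, the following R sketch runs the Bgd iterations under simplifying assumptions: the smooth loss is taken to be of D-trace type (so local_grad() from above applies), the Lasso thresholding rule soft() is used on the selected block, the block is guided by select_block(), and symmetrization is done by symmetrize_min(); all names and defaults are illustrative rather than the authors' exact implementation.

```r
# Illustrative Bgd loop (assumed D-trace-type gradient and Lasso threshold).
bgd_sketch <- function(Sigma_list, lambda, tau, eta = 0.1,
                       max_iter = 100, tol = 1e-4) {
  M <- length(Sigma_list)
  p <- ncol(Sigma_list[[1]])
  Omega <- diag(p)                                   # simple initial estimator
  # In practice, eta would be checked against the local-data condition (11)
  # on a client and doubled until the condition holds.
  grad <- Reduce(`+`, lapply(Sigma_list, local_grad, Omega = Omega)) / M
  for (t in seq_len(max_iter)) {
    k <- sample.int(M, 1)                            # random guiding machine
    B <- select_block(Sigma_list[[k]], Omega, tau)   # tau index pairs
    # Each machine sends only its gradient entries on block B; average them.
    grad[B] <- rowMeans(sapply(Sigma_list, function(S) local_grad(S, Omega)[B]))
    # Proximal (soft-thresholded) gradient step restricted to block B.
    Omega_new <- Omega
    Omega_new[B] <- soft(Omega[B] - eta * grad[B], eta * lambda)
    Omega_new <- symmetrize_min(Omega_new)
    if (max(abs(Omega_new - Omega)) < tol) { Omega <- Omega_new; break }
    Omega <- Omega_new
  }
  Omega
}
```

In this sketch, each round communicates only the selected index pairs, the corresponding gradient entries from every machine, and the updated entries of the precision matrix.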
Remark 1.
In the Bgd algorithm, we only transmit the gradient entries and the precision-matrix updates on the selected block to the central processor and to client m. By choosing a block size much smaller than $p^2$, we can efficiently reduce the communication loads and costs, since we avoid transferring the $O(p^2)$ entries of the full gradient matrix and of the estimated precision matrix.
3. Theoretical Properties
We conducted a theoretical analysis to justify the proposed Bgd procedure. In particular, we studied the efficiency of the Bgd estimator in a distributed setup. To investigate the properties of the proposed Bgd algorithm, we required the following conditions:
- (A1) (Sparse matrix class) Suppose that with
- (A2) (Irrepresentability condition) Let $S$ be the set of all non-zero entries of the true precision matrix and $S^c$ its complement; for some constant, the covariance matrix satisfies
- (A3) (Bounded condition) There exists a constant such that , where and denote the smallest and largest eigenvalues of a matrix, respectively.
- (A4) (Restricted strong convexity of the negative log-likelihood) There exist a positive constant and such that the stated inequality holds for any satisfying .
Condition (A1) indicates that the precision matrix has a sparse structure; it has been widely used in the literature on Gaussian graphical model estimation [5,6]. Condition (A2) is in the same spirit as the mutual incoherence or irrepresentability condition of Liu and Luo [6]. Condition (A3) requires that the smallest eigenvalue of the precision matrix is bounded away from zero and that its largest eigenvalue is finite. Condition (A3) also implies that the eigenvalues of the covariance matrix are bounded accordingly. This assumption is commonly imposed in the literature for the analysis of Gaussian graphical models [22]. Condition (A4) states that the negative log-likelihood satisfies restricted strong convexity at the truth.
Theorem 1.
Let the iterates be the sequence defined in Algorithm 1 above. If we use local client m to compute (6) and set the stepsize parameters accordingly, then
Theorem 1, proved in Appendix A, indicates that, with an appropriately scaled stepsize parameter, we can ensure that the objective decreases with limited communication costs in every iteration. Theorem 1 also provides insight into how to choose the stepsize from the local data under the distributed setting in a practical implementation.
Theorem 2.
Under the sub-Gaussian condition, suppose that assumptions (A1)–(A4) hold, if with for some , and setting for some ,
Theorem 2, proved in Appendix A, gives the convergence rate of the Bgd estimator under the Frobenius norm.
4. Numerical Studies
In this section, we present several simulation studies and a real data example to investigate the finite-sample performance of the proposed Bgd procedure in terms of its estimation accuracy. We compare the proposed method with several other distributed high-dimensional precision matrix estimation methods: naive estimation based on averaging the local estimates obtained from the R package “glasso” (Naive) using R version 4.1.0, debiased distributed D-trace loss penalized estimation (Dtrace, [16]), and debiased distributed graphical lasso estimation (Dglasso, [15]). As a benchmark, we use the debiased D-trace loss penalized estimation proposed by Zhang and Zou [7], computed with all the data in a non-distributed setting, as the global method. Each estimator was tuned by cross-validation, and all numerical experiments were conducted in R on a Microsoft Windows computer with a sixteen-core 4.50 GHz CPU and 32 GB RAM. In addition, for the penalty function in the objective function (3), we chose the Lasso and SCAD.
In our numerical studies, the Bgd method was implemented based on Algorithm 1. We terminated the iterations when either of the two stopping criteria was met. The block size was chosen to be much smaller than the total number of entries, which efficiently reduces the communication loads and costs by avoiding the transfer of the full gradient matrix and the full estimated precision matrix. Here, ⌊a⌋ denotes the integer part of a. We evaluated the estimation accuracy of each method using the Frobenius loss and the spectral loss, as follows:
Generally, the smaller the Frobenius and spectral losses are, the higher the estimation accuracy. Moreover, to assess how accurately the sparsity pattern of the true precision matrix is recovered, we also evaluated the false negative (FN) and false positive (FP) rates, described below:
The false negative rate gives the percentage of nonzero elements that are wrongly estimated to be zero. In contrast, the false positive rate gives the percentage of zero elements that are wrongly estimated as nonzero. Both values should be as small as possible. We further specify the parameter settings in the following sections, where the corresponding simulation results are also given.
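For clarity, the four evaluation criteria can be computed as in the following R snippet, where Omega_hat is an estimate, Omega0 is the true precision matrix, and a small tolerance decides which estimated entries are treated as zero (the FN/FP definitions here are the usual ones and are assumed to match those above):

```r
# Estimation-accuracy and support-recovery metrics for a precision estimate.
eval_precision <- function(Omega_hat, Omega0, eps = 1e-8) {
  frob <- norm(Omega_hat - Omega0, type = "F")   # Frobenius loss
  spec <- norm(Omega_hat - Omega0, type = "2")   # spectral loss
  true_nz <- abs(Omega0)    > eps
  est_nz  <- abs(Omega_hat) > eps
  fn <- sum(true_nz & !est_nz) / sum(true_nz)    # nonzeros estimated as zero
  fp <- sum(!true_nz & est_nz) / sum(!true_nz)   # zeros estimated as nonzero
  c(Frobenius = frob, Spectral = spec, FN = fn, FP = fp)
}
```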
(S1) We first assessed the performance of Bgd and its competitors in terms of estimation accuracy across two different values of p, which determined the corresponding numbers of parameters to be estimated. Here, we set the true precision matrix to be a band matrix, with nonzero entries confined to a band around the diagonal and all other elements equal to zero (see the data-generation sketch following (S3)). The total sample size was held fixed.
(S2) We evaluated the performance of Bgd and its competitors across two different values of the total sample size N. To this end, we considered a fixed dimension p with two sample sizes. Using a setup similar to that of Wang and Jiang [23], we specified the precision matrix accordingly; in this setting, it had a sparse structure.
(S3) We considered a case with varying sparsity levels of the precision matrix. To this end, the off-diagonal entries were generated from Bernoulli random variables with two different success probabilities, and a multiple of the identity matrix was added to ensure that the precision matrix was positive definite.
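As an illustration of the band design in (S1), the R sketch below generates a first-order band precision matrix and draws Gaussian data from the corresponding covariance; the specific constants (unit diagonal, 0.4 on the first off-diagonals, p = 200, N = 1000) are assumptions for illustration and are not the paper's exact settings.

```r
library(MASS)  # for mvrnorm

# Band precision matrix: unit diagonal, b on the first off-diagonals
# (positive definite for |b| < 0.5), zeros elsewhere.
make_band_precision <- function(p, b = 0.4) {
  Omega0 <- diag(p)
  for (i in seq_len(p - 1)) Omega0[i, i + 1] <- Omega0[i + 1, i] <- b
  Omega0
}

p <- 200
Omega0 <- make_band_precision(p)
X <- mvrnorm(n = 1000, mu = rep(0, p), Sigma = solve(Omega0))  # N x p data
# X would then be split at random into M equal segments, one per machine.
```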
For each of the above cases, we generated 100 datasets. For each of the 100 datasets, the aforementioned methods were used to perform high-dimensional distributed precision estimation. The average Frobenius loss, spectral loss, FN, and FP for each method are reported in Table A1. We investigated the effect of the number of machines M and the local sample size n on the estimation error. As shown in Table A1, (i) the naive method performed poorly in all cases; (ii) Dtrace and Dglasso improved over the naive estimate by debiasing the local lasso estimators and averaging the debiased local estimators; (iii) as the number of machines increased, the Dtrace and Dglasso methods deteriorated drastically; and (iv) for more complex precision structures and varied numbers of machines, the proposed Bgd method with the Lasso and SCAD penalties still achieved smaller errors than the other methods, and the SCAD version generally had smaller values than the Lasso version, since SCAD yields more accurate selection and less biased estimates. In summary, the proposed Bgd method outperformed the other methods regardless of the number of machines M and the structure of the precision matrix, in that its Frobenius loss, spectral loss, FN, and FP were smaller than those of the competitors, with the exception of the Global method, which served as the benchmark.
To investigate the effect of the number of machines M, we replicated the aforementioned simulation study for cases S1–S3, varied the number of machines M from 5 to 30, and plot the estimation errors in Figure A1. From Figure A1, the performance of Naive, Dtrace, and Dglasso deteriorated drastically as the number of machines grew. In contrast, the error curve of Bgd versus M was almost flat and remained very close to the global method, even when M was large. The proposed Bgd thus again achieved accuracies surpassing those of its competitors.
Real Data Analysis
In this subsection, we applied the proposed Bgd method to a real data example. The prostate cancer data are available at http://bioinformatics.mdanderson.org/ (accessed on 1 March 2023). The data consist of genetic expression levels for 16,386 genes from 102 individuals (50 normal control subjects and 52 prostate cancer patients). This dataset has been analyzed in several articles on high-dimensional analysis [5,24].
To evaluate the performance of the proposed Bgd method for distributed precision estimation, we randomly partitioned the samples into training data and testing data, and the training data were equally partitioned into data segments. Having more than 50% training data is often preferred [25], and the chosen training fractions led to every client having only 3 or 4 observations for training under the largest number of machines. For simplicity of calculation, we selected 300 genes from all 16,386 genes using the package “SIS” with the logistic model, which resulted in over 90,000 parameters to be estimated.
Our goal was to estimate the precision (inverse covariance) matrix in a distributed setting, and we could not use the Frobenius or spectral losses to measure the estimation accuracy of each method, as the true precision matrix is unknown. Following the same analysis as [5], the normalized gene expression data were assumed to be normally distributed within each group, where the two groups were assumed to have the same covariance matrix but different means. The estimated inverse covariance matrices produced by the different methods were used in the linear discriminant scores:
The classification rule assigned an observation to the class with the larger discriminant score. For simplicity, the class means and proportions in the linear discriminant scores were estimated from the training data in a non-distributed fashion, whereas the precision matrix was estimated from the training data under the distributed setup. The classification performance is clearly associated with the estimation accuracy of the precision matrix. The training dataset was used for parameter estimation, while the testing dataset was used to compute the classification error. We used the classification error on the testing dataset to assess the estimation performance and to compare it with the results of the other methods. A good estimation method for a precision matrix is expected to have a low misclassification rate (prediction error, Prr) and high sensitivity (Sen) and specificity (Spe) across all partitions.
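The linear discriminant scores described above can be evaluated as in the R sketch below, with the class means and proportions estimated from the pooled training data and the precision matrix supplied by a distributed estimator; the exact form of the score (including the log-prior term) follows the standard LDA rule and is assumed to match the display above.

```r
# Linear discriminant score for class k:
#   delta_k(x) = x' Omega mu_k - (1/2) mu_k' Omega mu_k + log(pi_k),
# with Omega the estimated precision matrix.
lda_scores <- function(x, Omega_hat, mu_list, pi_vec) {
  sapply(seq_along(mu_list), function(k) {
    mu_k <- mu_list[[k]]
    drop(t(x) %*% Omega_hat %*% mu_k) -
      0.5 * drop(t(mu_k) %*% Omega_hat %*% mu_k) + log(pi_vec[k])
  })
}

# Classification rule: assign x to the class with the largest score.
classify <- function(x, Omega_hat, mu_list, pi_vec) {
  which.max(lda_scores(x, Omega_hat, mu_list, pi_vec))
}
```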
We summarized the assessment over the replications in terms of sensitivity (Sen), specificity (Spe), and the overall prediction error. The proposed Bgd method outperformed all the other methods and even performed similarly to the global method. The models chosen by Bgd had higher sensitivity and specificity and a lower misclassification error (Prr). From Table A2, the promising performance of Bgd is again observed.
Moreover, to demonstrate that Bgd is robust to a wide range of block sizes within a certain interval, we repeated the above procedure with varying block sizes and calculated the corresponding Prr values. Figure A2 plots the Prr values against the block size. Inspection of Figure A2 indicates that the proposed Bgd method was robust against a wide range of block sizes and still performed better than the other methods, in that the Prr of Bgd was lower than that of the other methods across the various block sizes.
5. Discussion
This paper proposes a novel method for high-dimensional precision estimation when the dataset is distributed across different machines. We studied distributed sparse precision matrix estimation via an alternating block-based gradient descent method, where the block is chosen by the local gradient. This procedure reduces the communication loads and costs while providing a reliable estimation. The proposed method showed good potential to improve the estimation accuracy compared with the other distributed methods.
The current work focused on cases with homogeneous data analysis. It would be an interesting topic for future research to further extend the existing work to joint estimation of multiple precision matrices.
Author Contributions
Validation, H.L.; Writing—original draft, W.D. The authors carried out this work collaboratively. All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported by the National Social Science Foundation of China (23BTJ061).
Data Availability Statement
The real data in this paper consist of genetic expression levels for 16,386 genes from 102 individuals. For simplicity of calculation and comparison, we selected 300 genes from the total 16,386 genes using the package “SIS” with the logistic model, which resulted in over 90,000 parameters to be estimated. The variables can be obtained through the R code “SIS(prostate$x, prostate$y, family = "binomial", nsis = 300, iter = F)$ix0”.
Acknowledgments
The authors are grateful to the two reviewers for the constructive comments and suggestions that led to significant improvements to the original manuscript.
Conflicts of Interest
The authors declare no conflicts of interest.
Appendix A
Appendix A.1. Tables and Figures
Table A1.
Performance of the Bgd and competing methods in the simulation study. Frob and Spec denote the Frobenius and spectral losses; the three column groups correspond to the three distributed settings considered.
| Setup | Meth | Frob | Spec | FN | FP | Time | Frob | Spec | FN | FP | Time | Frob | Spec | FN | FP | Time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Naive | 5.36 | 0.77 | 0 | 0.01 | 7.18 | 7.03 | 0.81 | 0.01 | 0.03 | 5.42 | 8.86 | 1.34 | 0.08 | 0.01 | 3.10 | |
| Dtrace | 3.22 | 0.70 | 0 | 0 | 322 | 5.14 | 0.80 | 0.01 | 0 | 207 | 5.43 | 1.35 | 0.03 | 0 | 214 | |
| S1 (type1) | Dglasso | 3.76 | 0.72 | 0.01 | 0 | 9.44 | 4.97 | 0.97 | 0.05 | 0 | 7.74 | 5.96 | 1.26 | 0.26 | 0 | 3.66 |
| Bgd-lasso | 3.22 | 0.70 | 0.01 | 0 | 32.1 | 4.25 | 0.74 | 0.01 | 0 | 15.9 | 4.26 | 0.75 | 0.02 | 0 | 13.8 | |
| Bgd-scad | 3.20 | 0.68 | 0.01 | 0 | 45.1 | 3.82 | 0.73 | 0.01 | 0 | 33.8 | 3.90 | 0.74 | 0.02 | 0 | 29.2 | |
| Global | 2.80 | 0.68 | 0 | 0 | 62.6 | |||||||||||
| Naive | 9.30 | 0.77 | 0 | 0.03 | 81.8 | 12.1 | 0.96 | 0.01 | 0.01 | 75.2 | 14.5 | 1.30 | 0.90 | 0 | 22.8 | |
| Dtrace | 7.56 | 0.90 | 0.04 | 0 | 923 | 8.15 | 0.84 | 0.02 | 0 | 872 | 9.03 | 1.68 | 0.03 | 0 | 864 | |
| S1 (type2) | Dglasso | 6.04 | 0.85 | 0.05 | 0 | 108 | 7.75 | 0.90 | 0.18 | 0 | 115 | 10.2 | 1.39 | 0.38 | 0 | 122 |
| Bgd-lasso | 6.05 | 0.74 | 0.02 | 0 | 447 | 6.50 | 0.75 | 0.02 | 0 | 301 | 6.40 | 0.74 | 0.02 | 0 | 144 | |
| Bgd-scad | 6.00 | 0.82 | 0.01 | 0 | 448 | 5.95 | 0.77 | 0.01 | 0 | 308 | 6.03 | 0.82 | 0.01 | 0 | 225 | |
| Global | 5.89 | 0.82 | 0 | 0 | 984 | |||||||||||
| Naive | 12.1 | 1.21 | 0.01 | 0.06 | 12.8 | 13.5 | 1.29 | 1.00 | 0 | 15.2 | 13.5 | 1.84 | 1 | 0 | 22.2 | |
| Dtrace | 9.96 | 1.15 | 0.01 | 0.01 | 457 | 10.4 | 1.08 | 0.04 | 0 | 446 | 10.9 | 1.92 | 0.27 | 0 | 438 | |
| S2 (type1) | Dglasso | 8.37 | 1.09 | 0.08 | 0 | 18.8 | 8.41 | 1.06 | 0.19 | 0 | 20.1 | 10.7 | 2.02 | 0.35 | 0 | 24.7 |
| Bgd-lasso | 8.07 | 0.96 | 0.01 | 0.01 | 73.8 | 8.15 | 0.97 | 0.01 | 0.01 | 37.6 | 8.20 | 0.98 | 0.01 | 0.01 | 25.5 | |
| Bgd-scad | 7.78 | 0.97 | 0.01 | 0 | 90.3 | 7.87 | 0.99 | 0.01 | 0.01 | 39.3 | 8.00 | 1.02 | 0.01 | 0.01 | 62.8 | |
| Global | 7.89 | 1.01 | 0.04 | 0 | 210 | |||||||||||
| Naive | 11.9 | 1.12 | 0.19 | 0.04 | 9.79 | 12.5 | 1.24 | 0.17 | 0.04 | 10.4 | 16.2 | 4.28 | 1 | 0 | 12.6 | |
| Dtrace | 11.2 | 1.35 | 0.02 | 0.02 | 592 | 10.8 | 1.16 | 0.24 | 0 | 624 | 14.8 | 3.71 | 0.99 | 0 | 657 | |
| S2 (type2) | Dglasso | 9.02 | 1.24 | 0.10 | 0 | 20.2 | 9.09 | 1.25 | 0.30 | 0 | 26.7 | 14.5 | 4.32 | 0.58 | 0 | 32.8 |
| Bgd-lasso | 8.86 | 1.15 | 0.02 | 0.02 | 80.6 | 8.95 | 1.15 | 0.03 | 0.02 | 40.9 | 12.2 | 1.28 | 0.04 | 0.06 | 25.4 | |
| Bgd-scad | 8.74 | 1.13 | 0.02 | 0.02 | 92.9 | 8.89 | 1.15 | 0.03 | 0.02 | 98.5 | 9.29 | 1.44 | 0.04 | 0.03 | 81.7 | |
| Global | 8.68 | 1.12 | 0.03 | 0 | 208 | |||||||||||
| Naive | 15.2 | 2.02 | 0.40 | 0 | 18.2 | 16.5 | 2.42 | 0.79 | 0.01 | 19.1 | 17.0 | 2.64 | 0.92 | 0 | 25.9 | |
| Dtrace | 9.45 | 1.36 | 0.20 | 0 | 805 | 12.1 | 1.80 | 0.25 | 0.01 | 834 | 16.7 | 2.49 | 0.95 | 0 | 845 | |
| S3 (type1) | Dglasso | 10.6 | 2.35 | 0.26 | 0 | 20.8 | 14.5 | 2.79 | 0.06 | 0.01 | 28.1 | 15.4 | 4.16 | 0.83 | 0 | 30.4 |
| Bgd-lasso | 8.15 | 1.18 | 0.02 | 0.02 | 416 | 10.2 | 1.59 | 0.01 | 0 | 328 | 10.8 | 1.64 | 0 | 0.02 | 232 | |
| Bgd-scad | 6.32 | 0.85 | 0.02 | 0 | 428 | 6.39 | 0.86 | 0.04 | 0 | 388 | 6.74 | 0.91 | 0.06 | 0 | 280 | |
| Global | 5.54 | 0.95 | 0.04 | 0 | 239 | |||||||||||
| Naive | 24.1 | 3.32 | 0.63 | 0 | 11.3 | 25.0 | 3.58 | 0.83 | 0 | 15.5 | 25.1 | 3.80 | 0.96 | 0 | 22.7 | |
| Dtrace | 16.1 | 2.15 | 0.30 | 0 | 874 | 16.2 | 2.26 | 0.36 | 0 | 892 | 21.9 | 2.79 | 0.80 | 0 | 924 | |
| S3 (type2) | Dglasso | 16.4 | 2.40 | 0.01 | 0.06 | 11.7 | 17.8 | 3.60 | 0.18 | 0.09 | 25.8 | 22.4 | 4.01 | 0.88 | 0 | 32.1 |
| Bgd-lasso | 10.6 | 1.42 | 0 | 0.05 | 394 | 10.6 | 1.39 | 0 | 0.05 | 313 | 11.9 | 1.58 | 0.01 | 0.05 | 223 | |
| Bgd-scad | 9.40 | 1.18 | 0 | 0.04 | 408 | 9.43 | 1.19 | 0 | 0.05 | 352 | 11.1 | 1.28 | 0.03 | 0.03 | 261 | |
| Global | 10.1 | 1.49 | 0.09 | 0 | 524 |
Table A2.
Performance of the Bgd and competing methods in the real-data analysis with different partitions.
| Meth. | Prr | Sen | Spe | Prr | Sen | Spe | Prr | Sen | Spe | |
|---|---|---|---|---|---|---|---|---|---|---|
| Naive | 0.13 | 0.89 | 0.86 | 0.14 | 0.89 | 0.84 | 0.17 | 0.87 | 0.78 | |
| Dtrace | 0.11 | 0.88 | 0.89 | 0.12 | 0.89 | 0.87 | 0.15 | 0.87 | 0.84 | |
| Dglasso | 0.15 | 0.89 | 0.81 | 0.15 | 0.89 | 0.81 | 0.18 | 0.86 | 0.77 | |
| Bgd-lasso | 0.08 | 0.91 | 0.91 | 0.11 | 0.89 | 0.89 | 0.13 | 0.88 | 0.86 | |
| Bgd-scad | 0.10 | 0.89 | 0.90 | 0.11 | 0.88 | 0.89 | 0.14 | 0.86 | 0.86 | |
| Global | 0.06 | 0.92 | 0.95 | |||||||
| Naive | 0.14 | 0.87 | 0.83 | 0.15 | 0.88 | 0.84 | 0.17 | 0.87 | 0.78 | |
| Dtrace | 0.11 | 0.88 | 0.89 | 0.12 | 0.89 | 0.87 | 0.15 | 0.87 | 0.84 | |
| Dglasso | 0.14 | 0.88 | 0.82 | 0.15 | 0.89 | 0.80 | 0.17 | 0.87 | 0.78 | |
| Bgd-lasso | 0.07 | 0.90 | 0.94 | 0.09 | 0.90 | 0.91 | 0.12 | 0.88 | 0.89 | |
| Bgd-scad | 0.07 | 0.90 | 0.94 | 0.08 | 0.89 | 0.95 | 0.13 | 0.87 | 0.87 | |
| Global | 0.04 | 0.96 | 0.97 | |||||||
Figure A1.
The estimation errors for type 1 of cases S1–S3 with varying numbers of machines.
Figure A2.
The Prr values in the real data analysis for various block sizes.
Appendix A.2. Proof of the Main Results
Proof of Theorem 1.
For and a random machine m, if we set the initial value as , then we take the Taylor expansion of at and define , ; there exists a between and such that
where
By the fact that
If we choose satisfying
we have
Note that subject to , where , we have
which means if and we use local client m to compute (6), one can obtain
We have completed the proof of Theorem 1. □
Lemma A1.
Given i.i.d. sub-Gaussian random variables with , and supposing , let be the local covariance matrix of machine m and define . Then, for some constant C, we have
This lemma can be found in [26].
Proof of Theorem 2.
We prove Theorem 2 with the penalty set to the Lasso; the proofs for the other penalties follow with simple modifications. Consider
Define as the sequence generated by the following optimization problem without the block constraint :
Suppose the final solution is ; then, by Theorem 3 of Beck and Teboulle [27], we have
By Taylor’s expansion, we have
Note that using Lemma A1,
When the iteration , we have
Combining the above inequalities, and if we set , we have
which further implies that
By the fact that , we have
Then, condition (A4) is satisfied, and we have
Then, we have
Now, we turn to the surrogate loss function: given the initial value in the t-th iteration, it has the same gradient and solution path as the problem without the block constraints when the gradient descent method is used. Note that, for , we have
Using theorem 4 of [5] and the above statements, we have the same conclusion as (A3), that is
□
References
- Hao, B.; Sun, W.W.; Liu, Y.; Cheng, G. Simultaneous clustering and estimation of heterogeneous graphical models. J. Mach. Learn. Res. 2017, 18, 7981–8038. [Google Scholar]
- Ren, M.; Zhang, S.; Zhang, Q.; Ma, S. Gaussian graphical model based heterogeneity analysis via penalized fusion. Biometrics 2022, 78, 524–535. [Google Scholar] [CrossRef]
- Jiang, B.; Wang, X.; Leng, C. Quda: A direct approach for sparse quadratic discriminant analysis. J. Mach. Learn. Res. 2018, 19, 1098–1134. [Google Scholar]
- Friedman, J.; Hastie, T.; Tibshirani, R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics 2008, 9, 432–441. [Google Scholar] [CrossRef]
- Cai, T.T.; Liu, W.D.; Luo, X. A constrained l1 minimization approach to sparse precision matrix estimation. J. Am. Stat. Assoc. 2011, 106, 594–607. [Google Scholar] [CrossRef]
- Liu, W.; Luo, X. Fast and adaptive sparse precision matrix estimation in high dimensions. J. Multivar. Anal. 2015, 135, 153–162. [Google Scholar] [CrossRef]
- Zhang, T.; Zou, H. Sparse precision matrix estimation via lasso penalized D-trace loss. Biometrika 2014, 101, 103–120. [Google Scholar] [CrossRef]
- Cai, T.T.; Liu, W.D.; Zhou, H.H. Estimating sparse precision matrix: Optimal rates of convergence and adaptive estimation. Ann. Stat. 2016, 44, 455–488. [Google Scholar] [CrossRef]
- Fan, J.Q.; Yuan, L.; Han, L. An overview of the estimation of large covariance and precision matrices. Econom. J. 2016, 19, C1–C32. [Google Scholar] [CrossRef]
- Lee, J.D.; Liu, Q.; Sun, Y.; Taylor, J.E. Communication-efficient sparse regression. J. Mach. Learn. Res. 2017, 18, 115–144. [Google Scholar]
- Jordan, M.; Lee, J.; Yang, Y. Communication-efficient distributed statistical inference. J. Am. Stat. Assoc. 2018. [Google Scholar] [CrossRef]
- Fan, J.Q.; Guo, Y.Y.; Wang, K.Z. Communication-efficient accurate statistical estimation. J. Am. Stat. Assoc. 2023, 118, 1000–1010. [Google Scholar] [CrossRef] [PubMed]
- Gao, Y.; Liu, W.D.; Wang, H.S.; Wang, X.Z.; Yan, Y.B.; Zhang, R.Q. A review of distributed statistical inference. Stat. Theory Relat. Fields 2022, 6, 89–99. [Google Scholar] [CrossRef]
- Li, X.X.; Xu, C. Feature screening with conditional rank utility for big-data classification. J. Am. Stat. Assoc. 2023, 1–22. [Google Scholar] [CrossRef]
- Arroyo, J.; Hou, E. Efficient distributed estimation of inverse covariance matrices. In Proceedings of the 2016 IEEE Statistical Signal Processing Workshop (SSP), Mallorca, Spain, 26–29 June 2016; pp. 1–5. [Google Scholar] [CrossRef]
- Wang, G.P.; Cui, H.J. Efficient distributed estimation of high-dimensional sparse precision matrix for transelliptical graphical models. Acta Math. Sin. Engl. Ser. 2021, 37, 689–706. [Google Scholar] [CrossRef]
- Dong, W.; Li, X.X.; Xu, C.; Tang, N.S. Hybrid hard-soft screening for high-dimensional latent class analysis. Stat. Sin. 2023, 33, 1319–1341. [Google Scholar] [CrossRef]
- Zhang, C. Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 2010, 38, 894–942. [Google Scholar] [CrossRef]
- Ma, S.; Huang, J. A concave pairwise fusion approach to subgroup analysis. J. Am. Stat. Assoc. 2017, 112, 410–423. [Google Scholar] [CrossRef]
- Fan, J.Q.; Li, R.Z. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 2001, 96, 1348–1360. [Google Scholar] [CrossRef]
- Cai, T.T.; Li, H.Z.; Liu, W.D.; Xie, J. Joint estimation of multiple high-dimensional precision matrices. Stat. Sin. 2016, 26, 445–464. [Google Scholar] [CrossRef]
- Ravikumar, P.; Wainwright, M.J.; Raskutti, G.; Yu, B. High-dimensional covariance estimation by minimizing l1-penalized log-determinant divergence. Electron. J. Stat. 2011, 5, 935–980. [Google Scholar] [CrossRef]
- Wang, C.; Jiang, B. An efficient ADMM algorithm for high dimensional precision matrix estimation via penalized quadratic loss. Comput. Stat. Data Anal. 2020, 142, 106812. [Google Scholar] [CrossRef]
- Xie, J.H.; Lin, Y.Y.; Yan, X.D.; Tang, N.S. Category-adaptive variable screening for ultra-high dimensional heterogeneous categorical data. J. Am. Stat. Assoc. 2019, 747–760. [Google Scholar] [CrossRef]
- Uçar, M.K.; Nour, M.; Sindi, H.; Polat, K. The effect of training and testing process on machine learning in biomedical datasets. Math. Probl. Eng. 2020, 2020, 2836236. [Google Scholar] [CrossRef]
- Xu, P.; Tian, L.; Gu, Q.Q. Communication-efficient distributed estimation and inference for transelliptical graphical models. arXiv 2016, arXiv:1612.09297. [Google Scholar]
- Beck, A.; Teboulle, M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2009, 2, 183–202. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).