Abstract
Precision matrices efficiently capture the conditional dependence structure among variables and have received much attention in recent years. When one encounters large datasets stored in different locations and data sharing is not allowed, high-dimensional precision matrix estimation can be numerically challenging or even infeasible. In this work, we study distributed sparse precision matrix estimation via an alternating block-based gradient descent method. We obtain a global model by aggregating each machine's information via a communication-efficient surrogate penalized likelihood. The procedure chooses the block coordinates using the local gradient to guide the global gradient updates, which can efficiently accelerate precision estimation and lessen the communication load on each machine. The proposed method can efficiently achieve the correct selection of the non-zero elements of a sparse precision matrix. Under mild conditions, we show that the proposed estimator achieves a near-oracle convergence rate, as if the estimation had been conducted with a consolidated dataset on a single computer. The promising performance of the method is supported by both simulated and real data examples.
Keywords:
block-based gradient descent; distributed estimation; high-dimensional; near-oracle; precision matrix
MSC:
62-08
1. Introduction
Estimating an inverse covariance (or precision) matrix in high dimensions naturally arises in a wide variety of application domains, such as clustering analysis [1,2], discriminant analysis [3], and so on. When the dimension p is much larger than the sample size N, the precision matrix cannot be estimated by inverting the sample covariance matrix, because the sample covariance matrix is singular; estimating the precision matrix is then ill-posed and time-consuming, as the number of parameters to be estimated is of the order $p^2$. As an illustration, in the prostate cancer RNA-Seq dataset we analyze in this paper, genetic activity measurements have been documented for 102 subjects, with 50 normal control subjects and 52 prostate cancer patients. Given the enormous number of parameters to estimate, the analytical challenges associated with simultaneous discriminant analysis and estimation are significantly amplified. Accurate and fast precision estimation is therefore becoming increasingly important in statistical learning. Among the many high-dimensional inference problems, a variety of precision estimation methods have been proposed to enrich the theory of this field. Friedman et al. [4] developed an $\ell_1$-penalized likelihood approach to directly estimate the precision matrix, namely the graphical Lasso (GLasso); Cai et al. [5] proposed a constrained $\ell_1$-minimization procedure to seek a sparse precision matrix under a matrix inversion constraint; Liu and Luo [6] developed a penalized column-wise procedure for estimating a precision matrix; and Zhang and Zou [7] advocated a new empirical loss termed the D-trace loss, to avoid computing the log-determinant term. For more details, refer to [8,9].
However, the rapid emergence of massive datasets poses a serious challenge for high-dimensional precision estimation, where the dimensionality p and the sample size N are both huge. In addition, computing power, memory constraints, and privacy considerations often make it difficult to pool separate collections of massive data into a single dataset. Communication is prohibitively expensive due to limited bandwidth, and direct data sharing raises concerns about privacy and loss of ownership. For example, hospitals may collect the information of tens of thousands of patients, and directly transferring the raw data can be inefficient due to storage bottlenecks. Moreover, when scientists need to locate relevant genes corresponding to a certain disease from massive data, hospitals are, in practice, unwilling to share their raw data directly, owing to privacy considerations. The accelerated growth of data sizes and the joint analysis of data collected by different parties make statistical inference on a single computer no longer sufficient, which further makes high-dimensional precision estimation a challenging task.
To resolve the above difficulties, one natural approach is the “divide-and-conquer” strategy. In such a strategy, a large problem is first divided into smaller, manageable subproblems, and the final output is obtained by combining the corresponding sub-outputs. Following this idea, statisticians can improve computing efficiency and reduce privacy risks, while obtaining a global method by aggregating the statistics from each machine. Many distributed statistical methods have been developed for processing massive datasets. Lee et al. [10] proposed a debiasing approach that allows aggregation of local estimates in a distributed setting; Jordan et al. [11] developed an approximate likelihood approach for solving distributed statistical inference problems; and Fan et al. [12] extended their idea and presented two communication-efficient accurate statistical estimators (CEASE). For more details, refer to [13,14].
Due to the importance of estimating a precision matrix, some studies have begun to focus on distributed estimation of the precision matrix, where the datasets are distributed over multiple machines due to size limitations or privacy considerations. Arroyo and Hou [15] estimated the precision matrix for Gaussian graphical models via a simple averaging method; Wang and Cui [16] developed a distributed estimator of the sparse precision matrix by debiasing a D-trace Lasso-type estimator and aggregating the debiased local estimators by simple averaging. Under distributed data storage, one needs to carefully address two crucial questions for estimating the precision matrix: (a) Estimation effectiveness: an estimator built from a single machine suffers a non-negligible loss of information relative to the whole data, so one should design a distributed procedure that conducts an effective global high-dimensional precision matrix estimation, as if the estimation were performed on a consolidated dataset on a single computer; (b) Communication efficiency: estimating a precision matrix incurs high communication costs under a distributed setup, and these costs increase with the dimensionality p on each machine, so one should design an efficient method that reduces the communication costs incurred by transferring $p \times p$ matrices.
To ease the implementation difficulties and communication costs of estimating a precision matrix, we propose an alternating block-based gradient descent (Bgd) method for distributed precision matrix estimation. In detail, we optimize a surrogate loss function, with all machines participating by optimizing their corresponding gradient-enhanced loss functions and evaluating gradients. In each iteration, we only update a block of coordinates of the precision matrix, where the block is chosen according to the largest entries (in magnitude) of the local gradient on a randomly selected machine m. By choosing a block size much smaller than the total number of entries, we can develop an accurate statistical estimation of the precision matrix under a distributed setup, which lessens the communication costs and computation budget by using a random machine to guide the choice of block. Under mild conditions, we show that Bgd leads to a consistent estimator and can even achieve a performance similar to the debiased lasso-penalized D-trace estimation [7]. The promising performance of the method is supported by both simulated and real data examples.
The rest of this paper is organized as follows: In Section 2, we formulate the research problem and introduce the Bgd framework. In Section 3, we investigate the theoretical properties of Bgd. In Section 4, we demonstrate the promising performance of Bgd through Monte Carlo simulations and a real data example. Concluding remarks are given in Section 5. Technical details are presented in Appendix A.
Throughout this paper, we use c and C to represent certain positive constants, which may differ from line to line. Let $[p]$ denote the set $\{1, \ldots, p\}$. For a vector, $\|\cdot\|_1$, $\|\cdot\|_2$, and $\|\cdot\|_\infty$ denote its $\ell_1$, $\ell_2$, and $\ell_\infty$ norms, respectively. For a matrix, $\|\cdot\|_{\max}$, $\|\cdot\|_2$, $\|\cdot\|_\infty$, and $\|\cdot\|_F$ denote its max, spectral, infinity, and Frobenius norms, respectively.
2. Distributed Sparse Precision Matrix Estimation
2.1. Model Setups
Assume that we have N independent and identically distributed p-dimensional random variables with covariance matrix $\Sigma$ and corresponding precision matrix $\Omega = \Sigma^{-1}$. Each nonzero entry of $\Omega$ corresponds to an edge in a Gaussian graphical model describing the conditional dependence structure of the observed variables. In particular, if a p-dimensional random vector follows a multivariate normal distribution, the conditional independence between two features given all the others is equivalent to the corresponding off-diagonal entry of $\Omega$ being zero. A sparse structure of the precision matrix provides a concise relationship between features and also gives a meaningful interpretation of the conditional independence among the features; thus, one needs to achieve a sparse and stable estimation of the precision matrix $\Omega$.
Throughout this paper, we assume that the number of features p can be much larger than the total sample size N, but that the true precision matrix is sparse, i.e., it has few non-zero entries in the high-dimensional setting. We use $S$ to denote the index set of the true nonzero components of the precision matrix. Given N independent observations, we suppose the data are partitioned into M subsets completely at random and stored on M local clients. Without loss of generality, assume that the data are sub-Gaussian and equally partitioned across the M machines. In the high-dimensional setting, a common approach to obtaining a sparse estimator of $\Omega$ is to minimize the following $\ell_1$-regularized negative log-likelihood, known as the graphical lasso:
$$\hat{\Omega} = \arg\min_{\Omega \succ 0} \Big\{ \operatorname{tr}(\hat{\Sigma}_N \Omega) - \log\det(\Omega) + \lambda \sum_{i \neq j} |\omega_{ij}| \Big\},$$
where $\hat{\Sigma}_N$ is the sample covariance matrix. Many algorithms have been developed to solve the above problem. However, an eigendecomposition or the calculation of the determinant of a matrix is inevitable in these algorithms. Motivated by [6,7], under the distributed scenario, the global and the local loss functions can be written as
where $\hat{\Sigma}_m$ is the local sample covariance matrix on client m. For a single machine, many algorithms have been developed to solve the above problem, and some authors have shown that their estimators are asymptotically consistent. The goal of this study is to estimate the high-dimensional precision matrix in a distributed system, where the communication cost and the accuracy of estimation are the major considerations.
2.2. Block-Gradient Descent Algorithm
To develop a communication-efficient method for learning a high-dimensional precision matrix, we first review the proposal of Jordan et al. [11]. Starting from an initial estimator, the gradients are communicated and the parameters are obtained within a communication-efficient surrogate likelihood framework. Note that, in [11], only the first machine solves optimization problems, and the global Hessian matrix is replaced by the first local Hessian matrix. To fully utilize the information on each machine, we choose a random machine m to solve the optimization problem in every iteration. In this strategy, we define the loss function for a random machine m as
where $\Omega^{(0)}$ is an initial estimator of $\Omega$ and $P_\lambda(\cdot)$ is a concave penalty function with a tuning parameter $\lambda$. In high-dimensional regimes, it is impossible to derive a closed-form solution of the resulting optimization problem. A simple remedy is to add a strictly convex quadratic regularization term and use the resulting function to approximate the surrogate loss function. The approximation can then be defined as
Using (4), if we set the expansion point as the current t-th iterate, an approximate solution to (3) is obtained by the following iterative procedure:
At each iteration, the regularization term prevents the minimizer from moving too far away from the current iterate. This feature yields a non-greedy update in searching for the estimate of a high-dimensional precision matrix [17]. We can use gradient descent to optimize the surrogate, which approximates the global loss well near the current iterate when the stepsize is chosen appropriately based on the local sample covariance matrix on machine m. However, full gradient descent needs to transmit $O(p^2)$ bits from each machine, which results in high communication costs and a heavy computation burden per round. Intuitively, we should choose the block that updates the global gradient most rapidly under the communication constraints. In this paper, we use the local gradient to guide the choice of the block, where the block in the current iteration is chosen using
where the block collects the indices of the largest components (in magnitude) of the local gradient on machine k, and k is a random machine chosen to optimize the surrogate negative likelihood. In every iteration, every machine transfers only the gradient entries on the selected block rather than a full $p \times p$ gradient matrix, and we only update the gradient and the precision matrix on that block. Details can be found in Algorithm 1. With the aid of the surrogate negative likelihood, we can efficiently train a global model that aggregates information from the other machines. Thus, the procedure has good potential to provide more reliable estimation results while reducing the communication loads and costs.
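To make the block-selection rule concrete, the following R sketch selects the indices of the largest local-gradient entries on a chosen machine. The function names are ours, and the D-trace-type gradient is only an illustrative assumption standing in for the gradient of the smooth loss actually used; substitute the appropriate gradient if the loss differs.

```r
# Assumed D-trace-type local gradient at Omega on one machine (illustrative).
local_grad <- function(S, Omega) {
  (S %*% Omega + Omega %*% S) / 2 - diag(ncol(S))
}

# Choose the block as the tau index pairs with the largest local-gradient
# magnitudes on machine k; only these pairs need to be communicated.
select_block <- function(Sigma_k, Omega, tau) {
  g <- local_grad(Sigma_k, Omega)
  p <- ncol(Omega)
  ord <- order(abs(g), decreasing = TRUE)[seq_len(tau)]      # top-tau entries
  cbind(row = (ord - 1) %% p + 1, col = (ord - 1) %/% p + 1) # matrix indices
}
```

Only these index pairs and the corresponding gradient values then travel between the machines and the central processor, instead of a full p × p gradient matrix.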
Regarding the update of the precision matrix, we need to design a distributed algorithm with communication constraints. To perform the update, we solve the following minimization problem:
where
The penalty function in (6) can be chosen from the Lasso, MCP, or SCAD off-diagonal penalties [18,19], and closed-form solutions of (6) exist for all three. For example, let $S(z, \lambda) = \operatorname{sign}(z)(|z| - \lambda)_{+}$ denote the soft-thresholding rule, applied to the off-diagonal entries only (the diagonal entries are left unpenalized). The closed-form solution for the Lasso penalty is
For the MCP penalty with concavity parameter $a$,
For the SCAD penalty with concavity parameter $a$, the closed-form solution can be written as
where $a$ is a parameter that controls the concavity of the MCP and SCAD functions. In particular, the SCAD converges to the Lasso penalty as $a \to \infty$. Following Fan and Li [20], we treat $a$ as a fixed constant, such as $a = 3.7$. The SCAD not only produces sparsity, as the $\ell_1$ penalty does, but also has the property of unbiasedness, in that it does not shrink large estimated parameters, so that they remain unbiased throughout the iterations.
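For concreteness, the R functions below implement the textbook soft-thresholding (Lasso), MCP, and SCAD thresholding operators for an entry z with threshold lambda and concavity parameter a; these are standard forms and may differ in scaling details from the exact closed-form expressions displayed above.

```r
# Soft-thresholding rule S(z, lambda) = sign(z) * (|z| - lambda)_+.
soft <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

# MCP thresholding (minimizer of (1/2)(z - b)^2 + MCP(b; lambda, a), a > 1).
mcp_threshold <- function(z, lambda, a = 3) {
  ifelse(abs(z) <= a * lambda, soft(z, lambda) / (1 - 1 / a), z)
}

# SCAD thresholding (minimizer of (1/2)(z - b)^2 + SCAD(b; lambda, a), a > 2).
scad_threshold <- function(z, lambda, a = 3.7) {
  ifelse(abs(z) <= 2 * lambda,
         soft(z, lambda),
         ifelse(abs(z) <= a * lambda,
                ((a - 1) * z - sign(z) * a * lambda) / (a - 2),
                z))
}
```

In the Bgd update, such an operator would be applied entry-wise to the gradient step on the selected block, with the threshold scaled by the stepsize and the diagonal entries left unpenalized.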
Note that the solution to (6) is not symmetric in general. To make it symmetric, following the symmetrization strategy of Cai et al. [5] and Cai et al. [21], the final estimator is constructed by comparing the two entries at positions (i, j) and (j, i) and assigning the one with the smaller magnitude to both, which is
This symmetrizing procedure is not ad hoc: it ensures that the final estimator achieves the same entry-wise estimation error as the unsymmetrized solution. For more details, refer to Section 3 of Cai et al. [21].
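A minimal R sketch of this symmetrization step, under the stated rule of keeping at both positions (i, j) and (j, i) the entry with the smaller magnitude:

```r
# Symmetrize an estimate by keeping, for each pair (i, j), the entry whose
# magnitude is smaller; ties keep the (i, j) entry.
symmetrize_min <- function(Omega) {
  keep <- abs(Omega) <= abs(t(Omega))  # TRUE where |omega_ij| <= |omega_ji|
  ifelse(keep, Omega, t(Omega))        # ifelse preserves the matrix shape
}
```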
We now discuss how to select the block-size constraint and the stepsize parameter. Regarding the block size, a larger value transmits more information from each local client in each iteration and leads to a more accurate and faster convergence of Bgd. Nevertheless, a larger block size also means higher communication loads and costs. The choice of the block size is a great challenge: its value should be neither too small nor too large. Fortunately, we found that the performance of Bgd was robust over a wide range of block sizes within a certain interval, which facilitates the use of Bgd by avoiding an elaborate specification of the block size. In addition, many empirical studies have shown that a smaller value of the stepsize parameter often leads to faster convergence of the above algorithm. Theorem 1 indicates that the objective function is guaranteed to decrease at every iteration only if the stepsize parameter is larger than a certain threshold. In practice, one can first use a tentatively small value on local client m and then check the following condition based on the data on client m:
If (11) is not satisfied, we double the current value of the stepsize parameter. The proposed Bgd algorithm is summarized in Algorithm 1.
Algorithm 1. Distributed sparse precision matrix estimation via Bgd.
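To illustrate how the pieces fit together, the following R sketch runs the Bgd iterations under simplifying assumptions: the smooth loss is taken to be of D-trace type (so local_grad() from above applies), the Lasso thresholding rule soft() is used on the selected block, the block is guided by select_block(), and symmetrization is done by symmetrize_min(); all names and defaults are illustrative rather than the authors' exact implementation.

```r
# Illustrative Bgd loop (assumed D-trace-type gradient and Lasso threshold).
bgd_sketch <- function(Sigma_list, lambda, tau, eta = 0.1,
                       max_iter = 100, tol = 1e-4) {
  M <- length(Sigma_list)
  p <- ncol(Sigma_list[[1]])
  Omega <- diag(p)                                   # simple initial estimator
  # In practice, eta would be checked against the local-data condition (11)
  # on a client and doubled until the condition holds.
  grad <- Reduce(`+`, lapply(Sigma_list, local_grad, Omega = Omega)) / M
  for (t in seq_len(max_iter)) {
    k <- sample.int(M, 1)                            # random guiding machine
    B <- select_block(Sigma_list[[k]], Omega, tau)   # tau index pairs
    # Each machine sends only its gradient entries on block B; average them.
    grad[B] <- rowMeans(sapply(Sigma_list, function(S) local_grad(S, Omega)[B]))
    # Proximal (soft-thresholded) gradient step restricted to block B.
    Omega_new <- Omega
    Omega_new[B] <- soft(Omega[B] - eta * grad[B], eta * lambda)
    Omega_new <- symmetrize_min(Omega_new)
    if (max(abs(Omega_new - Omega)) < tol) { Omega <- Omega_new; break }
    Omega <- Omega_new
  }
  Omega
}
```

In this sketch, each round communicates only the selected index pairs, the corresponding gradient entries from every machine, and the updated entries of the precision matrix.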
Remark 1.
In the Bgd algorithm, we only transmit the gradient entries and the precision-matrix updates on the selected block to the central processor and to client m. By choosing a block size much smaller than $p^2$, we can efficiently reduce the communication loads and costs, since we avoid transferring the $O(p^2)$ entries of the full gradient matrix and of the estimated precision matrix.
3. Theoretical Properties
We conducted a theoretical analysis to justify the proposed Bgd procedure. In particular, we studied the efficiency of the Bgd estimator in a distributed setup. To investigate the properties of the proposed Bgd algorithm, we required the following conditions:
- (A1) (Sparse matrix class) Suppose that with
- (A2) (Irrepresentability condition) Let $S$ be the set of all non-zero entries of the true precision matrix and $S^c$ its complement; for some constant, the covariance matrix satisfies
- (A3) (Bounded condition) There exists a constant such that , where and denote the smallest and largest eigenvalues of a matrix, respectively.
- (A4) (Restricted strong convexity of the negative log-likelihood) There exist a positive constant and such that the stated inequality holds for any satisfying .
Condition (A1) indicates that the precision matrix has a sparse structure; it has been widely used in the literature on Gaussian graphical model estimation [5,6]. Condition (A2) is in the same spirit as the mutual incoherence or irrepresentability condition of Liu and Luo [6]. Condition (A3) requires that the smallest eigenvalue of the precision matrix is bounded away from zero and that its largest eigenvalue is finite. Condition (A3) also implies that the eigenvalues of the covariance matrix are bounded accordingly. This assumption is commonly imposed in the literature for the analysis of Gaussian graphical models [22]. Condition (A4) states that the negative log-likelihood satisfies restricted strong convexity at the truth.
Theorem 1.
Let the iterates be the sequence defined in Algorithm 1 above. If we use local client m to compute (6) and set the stepsize parameters accordingly, then
Theorem 1, proved in Appendix A, indicates that, with an appropriately scaled stepsize parameter, we can ensure that the objective decreases with limited communication costs in every iteration. Theorem 1 also provides insight into how to choose the stepsize from the local data under the distributed setting in a practical implementation.
Theorem 2.
Under the sub-Gaussian condition, suppose that assumptions (A1)–(A4) hold, if with for some , and setting for some ,
Theorem 2, proved in Appendix A, gives the convergence rate of the Bgd estimator under the Frobenius norm.
4. Numerical Studies
In this section, we present several simulation studies and a real data example to investigate the finite-sample performance of the proposed Bgd procedure in terms of its estimation accuracy. We compare the proposed method with several other distributed high-dimensional precision matrix estimation methods: naive estimation based on averaging the local estimates obtained from the R package “glasso” (Naive) using R version 4.1.0, debiased distributed D-trace loss penalized estimation (Dtrace, [16]), and debiased distributed graphical lasso estimation (Dglasso, [15]). As a benchmark, we use the debiased D-trace loss penalized estimation proposed by Zhang and Zou [7], computed with all the data in a non-distributed setting, as the global method. Each estimator was tuned by cross-validation, and all numerical experiments were conducted in R on a Microsoft Windows computer with a sixteen-core 4.50 GHz CPU and 32 GB RAM. In addition, for the penalty function in the objective function (3), we chose the Lasso and SCAD.
In our numerical studies, the Bgd method was implemented based on Algorithm 1. We terminated the iterations when either of the two stopping criteria was met. The block size was chosen to be much smaller than the total number of entries, which efficiently reduces the communication loads and costs by avoiding the transfer of the full gradient matrix and the full estimated precision matrix. Here, ⌊a⌋ denotes the integer part of a. We evaluated the estimation accuracy of each method using the Frobenius loss and the spectral loss, as follows:
Generally, the smaller the Frobenius and spectral losses are, the higher the estimation accuracy. Moreover, to assess how accurately the sparsity pattern of the true precision matrix is recovered, we also evaluated the false negative (FN) and false positive (FP) rates, described below:
The false negative rate gives the percentage of nonzero elements that are wrongly estimated to be zero. In contrast, the false positive rate gives the percentage of zero elements that are wrongly estimated as nonzero. Both values should be as small as possible. We further specify the parameter settings in the following sections, where the corresponding simulation results are also given.
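For clarity, the four evaluation criteria can be computed as in the following R snippet, where Omega_hat is an estimate, Omega0 is the true precision matrix, and a small tolerance decides which estimated entries are treated as zero (the FN/FP definitions here are the usual ones and are assumed to match those above):

```r
# Estimation-accuracy and support-recovery metrics for a precision estimate.
eval_precision <- function(Omega_hat, Omega0, eps = 1e-8) {
  frob <- norm(Omega_hat - Omega0, type = "F")   # Frobenius loss
  spec <- norm(Omega_hat - Omega0, type = "2")   # spectral loss
  true_nz <- abs(Omega0)    > eps
  est_nz  <- abs(Omega_hat) > eps
  fn <- sum(true_nz & !est_nz) / sum(true_nz)    # nonzeros estimated as zero
  fp <- sum(!true_nz & est_nz) / sum(!true_nz)   # zeros estimated as nonzero
  c(Frobenius = frob, Spectral = spec, FN = fn, FP = fp)
}
```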
(S1) We first assessed the performance of Bgd and its competitors in terms of estimation accuracy across two different values of p, which determined the corresponding numbers of parameters to be estimated. Here, we set the true precision matrix to be a band matrix, with nonzero entries confined to a band around the diagonal and all other elements equal to zero (see the data-generation sketch following (S3)). The total sample size was held fixed.
(S2) We evaluated the performance of Bgd and its competitors across two different values of the total sample size N. To this end, we considered a fixed dimension p with two sample sizes. Using a setup similar to that of Wang and Jiang [23], we specified the precision matrix accordingly; in this setting, it had a sparse structure.
(S3) We considered a case with varying sparsity levels of the precision matrix. To this end, the off-diagonal entries were generated from Bernoulli random variables with two different success probabilities, and a multiple of the identity matrix was added to ensure that the precision matrix was positive definite.
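As an illustration of the band design in (S1), the R sketch below generates a first-order band precision matrix and draws Gaussian data from the corresponding covariance; the specific constants (unit diagonal, 0.4 on the first off-diagonals, p = 200, N = 1000) are assumptions for illustration and are not the paper's exact settings.

```r
library(MASS)  # for mvrnorm

# Band precision matrix: unit diagonal, b on the first off-diagonals
# (positive definite for |b| < 0.5), zeros elsewhere.
make_band_precision <- function(p, b = 0.4) {
  Omega0 <- diag(p)
  for (i in seq_len(p - 1)) Omega0[i, i + 1] <- Omega0[i + 1, i] <- b
  Omega0
}

p <- 200
Omega0 <- make_band_precision(p)
X <- mvrnorm(n = 1000, mu = rep(0, p), Sigma = solve(Omega0))  # N x p data
# X would then be split at random into M equal segments, one per machine.
```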
For each of the above cases, we generated 100 datasets. For each of the 100 datasets, the aforementioned methods were used to perform high-dimensional distributed precision estimation. The average Frobenius loss, spectral loss, FN, and FP for each method are reported in Table A1. We investigated the effect of the number of machines M and the local sample size n on the estimation error. As shown in Table A1, (i) the naive method performed poorly in all cases; (ii) Dtrace and Dglasso improved over the naive estimate by debiasing the local lasso estimators and averaging the debiased local estimators; (iii) as the number of machines increased, the Dtrace and Dglasso methods deteriorated drastically; and (iv) for more complex precision structures and varied numbers of machines, the proposed Bgd method with the Lasso and SCAD penalties still achieved smaller errors than the other methods, and the SCAD version generally had smaller values than the Lasso version, since SCAD yields more accurate selection and less biased estimates. In summary, the proposed Bgd method outperformed the other methods regardless of the number of machines M and the structure of the precision matrix, in that its Frobenius loss, spectral loss, FN, and FP were smaller than those of the competitors, with the exception of the Global method, which served as the benchmark.
To investigate the effect of the number of machines M, we replicated the aforementioned simulation study for cases S1–S3, varied the number of machines M from 5 to 30, and plot the estimation errors in Figure A1. From Figure A1, the performance of Naive, Dtrace, and Dglasso deteriorated drastically as the number of machines grew. In contrast, the error curve of Bgd versus M was almost flat and remained very close to the global method, even when M was large. The proposed Bgd thus again achieved accuracies surpassing those of its competitors.
Real Data Analysis
In this subsection, we applied the proposed Bgd method to a real data example. The prostate cancer data are available at http://bioinformatics.mdanderson.org/ (accessed on 1 March 2023). The data consist of genetic expression levels for 16,386 genes from 102 individuals (50 normal control subjects and 52 prostate cancer patients). This dataset has been analyzed in several articles on high-dimensional analysis [5,24].
To evaluate the performance of the proposed Bgd method for distributed precision estimation, we randomly partitioned the samples into training data and testing data, and the training data were equally partitioned into data segments. Having more than 50% training data is often preferred [25], and the chosen training fractions led to every client having only 3 or 4 observations for training under the largest number of machines. For simplicity of calculation, we selected 300 genes from all 16,386 genes using the package “SIS” with the logistic model, which resulted in over 90,000 parameters to be estimated.
Our goal was to estimate the precision (inverse covariance) matrix in a distributed setting, and we could not use the Frobenius or spectral losses to measure the estimation accuracy of each method, as the true precision matrix is unknown. Following the same analysis as [5], the normalized gene expression data were assumed to be normally distributed within each group, where the two groups were assumed to have the same covariance matrix but different means. The estimated inverse covariance matrices produced by the different methods were used in the linear discriminant scores:
The classification rule assigned an observation to the class with the larger discriminant score. For simplicity, the class means and proportions in the linear discriminant scores were estimated from the training data in a non-distributed fashion, whereas the precision matrix was estimated from the training data under the distributed setup. The classification performance is clearly associated with the estimation accuracy of the precision matrix. The training dataset was used for parameter estimation, while the testing dataset was used to compute the classification error. We used the classification error on the testing dataset to assess the estimation performance and to compare it with the results of the other methods. A good estimation method for a precision matrix is expected to have a low misclassification rate (prediction error, Prr) and high sensitivity (Sen) and specificity (Spe) across all partitions.
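The linear discriminant scores described above can be evaluated as in the R sketch below, with the class means and proportions estimated from the pooled training data and the precision matrix supplied by a distributed estimator; the exact form of the score (including the log-prior term) follows the standard LDA rule and is assumed to match the display above.

```r
# Linear discriminant score for class k:
#   delta_k(x) = x' Omega mu_k - (1/2) mu_k' Omega mu_k + log(pi_k),
# with Omega the estimated precision matrix.
lda_scores <- function(x, Omega_hat, mu_list, pi_vec) {
  sapply(seq_along(mu_list), function(k) {
    mu_k <- mu_list[[k]]
    drop(t(x) %*% Omega_hat %*% mu_k) -
      0.5 * drop(t(mu_k) %*% Omega_hat %*% mu_k) + log(pi_vec[k])
  })
}

# Classification rule: assign x to the class with the largest score.
classify <- function(x, Omega_hat, mu_list, pi_vec) {
  which.max(lda_scores(x, Omega_hat, mu_list, pi_vec))
}
```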
We summarized the assessment over the replications in terms of sensitivity (Sen), specificity (Spe), and the overall prediction error. The proposed Bgd method outperformed all the other methods and even performed similarly to the global method. The models chosen by Bgd had higher sensitivity and specificity and a lower misclassification error (Prr). From Table A2, the promising performance of Bgd is again observed.
Moreover, to demonstrate that Bgd is robust to a wide range of block sizes within a certain interval, we repeated the above procedure with varying block sizes and calculated the corresponding Prr values. Figure A2 plots the Prr values against the block size. Inspection of Figure A2 indicates that the proposed Bgd method was robust against a wide range of block sizes and still performed better than the other methods, in that the Prr of Bgd was lower than that of the other methods across the various block sizes.
5. Discussion
This paper proposes a novel method for high-dimensional precision estimation when the dataset is distributed across different machines. We studied distributed sparse precision matrix estimation via an alternating block-based gradient descent method, where the block is chosen by the local gradient. This procedure reduces the communication loads and costs while providing a reliable estimation. The proposed method showed good potential to improve the estimation accuracy compared with the other distributed methods.
The current work focused on cases with homogeneous data analysis. It would be an interesting topic for future research to further extend the existing work to joint estimation of multiple precision matrices.
Author Contributions
Validation, H.L.; Writing—original draft, W.D. The authors carried out this work collaboratively. All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported by the National Social Science Foundation of China (23BTJ061).
Data Availability Statement
The real data in this paper consist of genetic expression levels for 16,386 genes from 102 individuals. For simplicity of calculation and comparison, we selected 300 genes from the total 16,386 genes using the package “SIS” with the logistic model, which resulted in over 90,000 parameters to be estimated. The variables can be obtained through the R code “SIS(prostate$x, prostate$y, family = "binomial", nsis = 300, iter = F)$ix0”.
Acknowledgments
The authors are grateful to the two reviewers for the constructive comments and suggestions that led to significant improvements to the original manuscript.
Conflicts of Interest
The authors declare no conflicts of interest.
Appendix A
Appendix A.1. Tables and Figures
Table A1.
Performance of the Bgd and competing methods in the simulation study. Frob and Spec denote the Frobenius and spectral losses; the three column groups correspond to the three distributed settings considered.
| Setup | Meth | Frob | Spec | FN | FP | Time | Frob | Spec | FN | FP | Time | Frob | Spec | FN | FP | Time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Naive | 5.36 | 0.77 | 0 | 0.01 | 7.18 | 7.03 | 0.81 | 0.01 | 0.03 | 5.42 | 8.86 | 1.34 | 0.08 | 0.01 | 3.10 | |
| Dtrace | 3.22 | 0.70 | 0 | 0 | 322 | 5.14 | 0.80 | 0.01 | 0 | 207 | 5.43 | 1.35 | 0.03 | 0 | 214 | |
| S1 (type1) | Dglasso | 3.76 | 0.72 | 0.01 | 0 | 9.44 | 4.97 | 0.97 | 0.05 | 0 | 7.74 | 5.96 | 1.26 | 0.26 | 0 | 3.66 |
| Bgd-lasso | 3.22 | 0.70 | 0.01 | 0 | 32.1 | 4.25 | 0.74 | 0.01 | 0 | 15.9 | 4.26 | 0.75 | 0.02 | 0 | 13.8 | |
| Bgd-scad | 3.20 | 0.68 | 0.01 | 0 | 45.1 | 3.82 | 0.73 | 0.01 | 0 | 33.8 | 3.90 | 0.74 | 0.02 | 0 | 29.2 | |
| Global | 2.80 | 0.68 | 0 | 0 | 62.6 | |||||||||||
| Naive | 9.30 | 0.77 | 0 | 0.03 | 81.8 | 12.1 | 0.96 | 0.01 | 0.01 | 75.2 | 14.5 | 1.30 | 0.90 | 0 | 22.8 | |
| Dtrace | 7.56 | 0.90 | 0.04 | 0 | 923 | 8.15 | 0.84 | 0.02 | 0 | 872 | 9.03 | 1.68 | 0.03 | 0 | 864 | |
| S1 (type2) | Dglasso | 6.04 | 0.85 | 0.05 | 0 | 108 | 7.75 | 0.90 | 0.18 | 0 | 115 | 10.2 | 1.39 | 0.38 | 0 | 122 |
| Bgd-lasso | 6.05 | 0.74 | 0.02 | 0 | 447 | 6.50 | 0.75 | 0.02 | 0 | 301 | 6.40 | 0.74 | 0.02 | 0 | 144 | |
| Bgd-scad | 6.00 | 0.82 | 0.01 | 0 | 448 | 5.95 | 0.77 | 0.01 | 0 | 308 | 6.03 | 0.82 | 0.01 | 0 | 225 | |
| Global | 5.89 | 0.82 | 0 | 0 | 984 | |||||||||||
| Naive | 12.1 | 1.21 | 0.01 | 0.06 | 12.8 | 13.5 | 1.29 | 1.00 | 0 | 15.2 | 13.5 | 1.84 | 1 | 0 | 22.2 | |
| Dtrace | 9.96 | 1.15 | 0.01 | 0.01 | 457 | 10.4 | 1.08 | 0.04 | 0 | 446 | 10.9 | 1.92 | 0.27 | 0 | 438 | |
| S2 (type1) | Dglasso | 8.37 | 1.09 | 0.08 | 0 | 18.8 | 8.41 | 1.06 | 0.19 | 0 | 20.1 | 10.7 | 2.02 | 0.35 | 0 | 24.7 |
| Bgd-lasso | 8.07 | 0.96 | 0.01 | 0.01 | 73.8 | 8.15 | 0.97 | 0.01 | 0.01 | 37.6 | 8.20 | 0.98 | 0.01 | 0.01 | 25.5 | |
| Bgd-scad | 7.78 | 0.97 | 0.01 | 0 | 90.3 | 7.87 | 0.99 | 0.01 | 0.01 | 39.3 | 8.00 | 1.02 | 0.01 | 0.01 | 62.8 | |
| Global | 7.89 | 1.01 | 0.04 | 0 | 210 | |||||||||||
| Naive | 11.9 | 1.12 | 0.19 | 0.04 | 9.79 | 12.5 | 1.24 | 0.17 | 0.04 | 10.4 | 16.2 | 4.28 | 1 | 0 | 12.6 | |
| Dtrace | 11.2 | 1.35 | 0.02 | 0.02 | 592 | 10.8 | 1.16 | 0.24 | 0 | 624 | 14.8 | 3.71 | 0.99 | 0 | 657 | |
| S2 (type2) | Dglasso | 9.02 | 1.24 | 0.10 | 0 | 20.2 | 9.09 | 1.25 | 0.30 | 0 | 26.7 | 14.5 | 4.32 | 0.58 | 0 | 32.8 |
| Bgd-lasso | 8.86 | 1.15 | 0.02 | 0.02 | 80.6 | 8.95 | 1.15 | 0.03 | 0.02 | 40.9 | 12.2 | 1.28 | 0.04 | 0.06 | 25.4 | |
| Bgd-scad | 8.74 | 1.13 | 0.02 | 0.02 | 92.9 | 8.89 | 1.15 | 0.03 | 0.02 | 98.5 | 9.29 | 1.44 | 0.04 | 0.03 | 81.7 | |
| Global | 8.68 | 1.12 | 0.03 | 0 | 208 | |||||||||||
| Naive | 15.2 | 2.02 | 0.40 | 0 | 18.2 | 16.5 | 2.42 | 0.79 | 0.01 | 19.1 | 17.0 | 2.64 | 0.92 | 0 | 25.9 | |
| Dtrace | 9.45 | 1.36 | 0.20 | 0 | 805 | 12.1 | 1.80 | 0.25 | 0.01 | 834 | 16.7 | 2.49 | 0.95 | 0 | 845 | |
| S3 (type1) | Dglasso | 10.6 | 2.35 | 0.26 | 0 | 20.8 | 14.5 | 2.79 | 0.06 | 0.01 | 28.1 | 15.4 | 4.16 | 0.83 | 0 | 30.4 |
| Bgd-lasso | 8.15 | 1.18 | 0.02 | 0.02 | 416 | 10.2 | 1.59 | 0.01 | 0 | 328 | 10.8 | 1.64 | 0 | 0.02 | 232 | |
| Bgd-scad | 6.32 | 0.85 | 0.02 | 0 | 428 | 6.39 | 0.86 | 0.04 | 0 | 388 | 6.74 | 0.91 | 0.06 | 0 | 280 | |
| Global | 5.54 | 0.95 | 0.04 | 0 | 239 | |||||||||||
| Naive | 24.1 | 3.32 | 0.63 | 0 | 11.3 | 25.0 | 3.58 | 0.83 | 0 | 15.5 | 25.1 | 3.80 | 0.96 | 0 | 22.7 | |
| Dtrace | 16.1 | 2.15 | 0.30 | 0 | 874 | 16.2 | 2.26 | 0.36 | 0 | 892 | 21.9 | 2.79 | 0.80 | 0 | 924 | |
| S3 (type2) | Dglasso | 16.4 | 2.40 | 0.01 | 0.06 | 11.7 | 17.8 | 3.60 | 0.18 | 0.09 | 25.8 | 22.4 | 4.01 | 0.88 | 0 | 32.1 |
| Bgd-lasso | 10.6 | 1.42 | 0 | 0.05 | 394 | 10.6 | 1.39 | 0 | 0.05 | 313 | 11.9 | 1.58 | 0.01 | 0.05 | 223 | |
| Bgd-scad | 9.40 | 1.18 | 0 | 0.04 | 408 | 9.43 | 1.19 | 0 | 0.05 | 352 | 11.1 | 1.28 | 0.03 | 0.03 | 261 | |
| Global | 10.1 | 1.49 | 0.09 | 0 | 524 |
Table A2.
Performance of the Bgd and competing methods in the real-data analysis with different partitions.
| Meth. | Prr | Sen | Spe | Prr | Sen | Spe | Prr | Sen | Spe | |
|---|---|---|---|---|---|---|---|---|---|---|
| Naive | 0.13 | 0.89 | 0.86 | 0.14 | 0.89 | 0.84 | 0.17 | 0.87 | 0.78 | |
| Dtrace | 0.11 | 0.88 | 0.89 | 0.12 | 0.89 | 0.87 | 0.15 | 0.87 | 0.84 | |
| Dglasso | 0.15 | 0.89 | 0.81 | 0.15 | 0.89 | 0.81 | 0.18 | 0.86 | 0.77 | |
| Bgd-lasso | 0.08 | 0.91 | 0.91 | 0.11 | 0.89 | 0.89 | 0.13 | 0.88 | 0.86 | |
| Bgd-scad | 0.10 | 0.89 | 0.90 | 0.11 | 0.88 | 0.89 | 0.14 | 0.86 | 0.86 | |
| Global | 0.06 | 0.92 | 0.95 | |||||||
| Naive | 0.14 | 0.87 | 0.83 | 0.15 | 0.88 | 0.84 | 0.17 | 0.87 | 0.78 | |
| Dtrace | 0.11 | 0.88 | 0.89 | 0.12 | 0.89 | 0.87 | 0.15 | 0.87 | 0.84 | |
| Dglasso | 0.14 | 0.88 | 0.82 | 0.15 | 0.89 | 0.80 | 0.17 | 0.87 | 0.78 | |
| Bgd-lasso | 0.07 | 0.90 | 0.94 | 0.09 | 0.90 | 0.91 | 0.12 | 0.88 | 0.89 | |
| Bgd-scad | 0.07 | 0.90 | 0.94 | 0.08 | 0.89 | 0.95 | 0.13 | 0.87 | 0.87 | |
| Global | 0.04 | 0.96 | 0.97 | |||||||
Figure A1.
The estimation errors for type 1 of cases S1–S3 with varying numbers of machines.
Figure A2.
The Prr values in the real data analysis for various block sizes.
Appendix A.2. Proof of the Main Results
Proof of Theorem 1.
For and a random machine m, if we set the initial value as , then we take the Taylor expansion of at and define , ; there exists a between and such that
where
By the fact that
If we choose satisfying
we have
Note that subject to , where , we have
which means if and we use local client m to compute (6), one can obtain
We have completed the proof of Theorem 1. □
Lemma A1.
Given i.i.d. sub-Gaussian random variables with , and supposing , let be the local covariance matrix of machine m and define . Then, for some constant C, we have
This lemma can be found in [26].
Proof of Theorem 2.
We prove Theorem 2 with the penalty set to the Lasso; the proofs for the other penalties follow with simple modifications. Consider
Define as the sequence generated by the following optimization problem without the block constraint :
Suppose the final solution is ; then, by Theorem 3 of Beck and Teboulle [27], we have
By Taylor’s expansion, we have
Note that using Lemma A1,
When the iteration , we have
Combining the above inequalities, and if we set , we have
which further implies that
By the fact that , we have
Then, condition (A4) is satisfied, and we have
Then, we have
Now, we turn to the surrogate loss function: given the initial value in the t-th iteration, it has the same gradient and solution path as the problem without the block constraints when the gradient descent method is used. Note that, for , we have
Using theorem 4 of [5] and the above statements, we have the same conclusion as (A3), that is
□
References
- Hao, B.; Sun, W.W.; Liu, Y.; Cheng, G. Simultaneous clustering and estimation of heterogeneous graphical models. J. Mach. Learn. Res. 2017, 18, 7981–8038. [Google Scholar]
- Ren, M.; Zhang, S.; Zhang, Q.; Ma, S. Gaussian graphical model based heterogeneity analysis via penalized fusion. Biometrics 2022, 78, 524–535. [Google Scholar] [CrossRef]
- Jiang, B.; Wang, X.; Leng, C. Quda: A direct approach for sparse quadratic discriminant analysis. J. Mach. Learn. Res. 2018, 19, 1098–1134. [Google Scholar]
- Friedman, J.; Hastie, T.; Tibshirani, R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics 2008, 9, 432–441. [Google Scholar] [CrossRef]
- Cai, T.T.; Liu, W.D.; Luo, X. A constrained l1 minimization approach to sparse precision matrix estimation. J. Am. Stat. Assoc. 2011, 106, 594–607. [Google Scholar] [CrossRef]
- Liu, W.; Luo, X. Fast and adaptive sparse precision matrix estimation in high dimensions. J. Multivar. Anal. 2015, 135, 153–162. [Google Scholar] [CrossRef]
- Zhang, T.; Zou, H. Sparse precision matrix estimation via lasso penalized D-trace loss. Biometrika 2014, 101, 103–120. [Google Scholar] [CrossRef]
- Cai, T.T.; Liu, W.D.; Zhou, H.H. Estimating sparse precision matrix: Optimal rates of convergence and adaptive estimation. Ann. Stat. 2016, 44, 455–488. [Google Scholar] [CrossRef]
- Fan, J.Q.; Yuan, L.; Han, L. An overview of the estimation of large covariance and precision matrices. Econom. J. 2016, 19, C1–C32. [Google Scholar] [CrossRef]
- Lee, J.D.; Liu, Q.; Sun, Y.; Taylor, J.E. Communication-efficient sparse regression. J. Mach. Learn. Res. 2017, 18, 115–144. [Google Scholar]
- Jordan, M.; Lee, J.; Yang, Y. Communication-efficient distributed statistical inference. J. Am. Stat. Assoc. 2018. [Google Scholar] [CrossRef]
- Fan, J.Q.; Guo, Y.Y.; Wang, K.Z. Communication-efficient accurate statistical estimation. J. Am. Stat. Assoc. 2023, 118, 1000–1010. [Google Scholar] [CrossRef] [PubMed]
- Gao, Y.; Liu, W.D.; Wang, H.S.; Wang, X.Z.; Yan, Y.B.; Zhang, R.Q. A review of distributed statistical inference. Stat. Theory Relat. Fields 2022, 6, 89–99. [Google Scholar] [CrossRef]
- Li, X.X.; Xu, C. Feature screening with conditional rank utility for big-data classification. J. Am. Stat. Assoc. 2023, 1–22. [Google Scholar] [CrossRef]
- Arroyo, J.; Hou, E. Efficient distributed estimation of inverse covariance matrices. In Proceedings of the 2016 IEEE Statistical Signal Processing Workshop (SSP), Mallorca, Spain, 26–29 June 2016; pp. 1–5. [Google Scholar] [CrossRef]
- Wang, G.P.; Cui, H.J. Efficient distributed estimation of high-dimensional sparse precision matrix for transelliptical graphical models. Acta Math. Sin. Engl. Ser. 2021, 37, 689–706. [Google Scholar] [CrossRef]
- Dong, W.; Li, X.X.; Xu, C.; Tang, N.S. Hybrid hard-soft screening for high-dimensional latent class analysis. Stat. Sin. 2023, 33, 1319–1341. [Google Scholar] [CrossRef]
- Zhang, C. Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 2010, 38, 894–942. [Google Scholar] [CrossRef]
- Ma, S.; Huang, J. A concave pairwise fusion approach to subgroup analysis. J. Am. Stat. Assoc. 2017, 112, 410–423. [Google Scholar] [CrossRef]
- Fan, J.Q.; Li, R.Z. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 2001, 96, 1348–1360. [Google Scholar] [CrossRef]
- Cai, T.T.; Li, H.Z.; Liu, W.D.; Xie, J. Joint estimation of multiple high-dimensional precision matrices. Stat. Sin. 2016, 26, 445–464. [Google Scholar] [CrossRef]
- Ravikumar, P.; Wainwright, M.J.; Raskutti, G.; Yu, B. High-dimensional covariance estimation by minimizing l1-penalized log-determinant divergence. Electron. J. Stat. 2011, 5, 935–980. [Google Scholar] [CrossRef]
- Wang, C.; Jiang, B. An efficient ADMM algorithm for high dimensional precision matrix estimation via penalized quadratic loss. Comput. Stat. Data Anal. 2020, 142, 106812. [Google Scholar] [CrossRef]
- Xie, J.H.; Lin, Y.Y.; Yan, X.D.; Tang, N.S. Category-adaptive variable screening for ultra-high dimensional heterogeneous categorical data. J. Am. Stat. Assoc. 2019, 747–760. [Google Scholar] [CrossRef]
- Uçar, M.K.; Nour, M.; Sindi, H.; Polat, K. The effect of training and testing process on machine learning in biomedical datasets. Math. Probl. Eng. 2020, 2020, 2836236. [Google Scholar] [CrossRef]
- Xu, P.; Tian, L.; Gu, Q.Q. Communication-efficient distributed estimation and inference for transelliptical graphical models. arXiv 2016, arXiv:1612.09297. [Google Scholar]
- Beck, A.; Teboulle, M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2009, 2, 183–202. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).