Article

Efficient Proximal Gradient Algorithms for Joint Graphical Lasso

Graduate School of Engineering Science, Osaka University, Osaka 560-0043, Japan
* Author to whom correspondence should be addressed.
Entropy 2021, 23(12), 1623; https://doi.org/10.3390/e23121623
Submission received: 4 November 2021 / Revised: 27 November 2021 / Accepted: 27 November 2021 / Published: 2 December 2021

Abstract

We consider learning an undirected graphical model from sparse data. While several efficient algorithms have been proposed for graphical lasso (GL), the alternating direction method of multipliers (ADMM) has been the main approach taken for the joint graphical lasso (JGL). We propose proximal gradient procedures with and without a backtracking option for the JGL. These procedures are first-order methods and relatively simple, and the subproblems are solved efficiently in closed form. We further show the boundedness of the solution of the JGL problem and of the iterates in the algorithms. The numerical results indicate that the proposed algorithms can achieve high accuracy and precision, and their efficiency is competitive with state-of-the-art algorithms.

1. Introduction

Graphical models are widely used to describe the relationships among interacting objects [1]. Such models have been extensively used in various domains, such as bioinformatics, text mining, and social networks. The graph provides a visual way to understand the joint distribution of an entire set of variables.
In this paper, we consider learning Gaussian graphical models expressed by undirected graphs, which represent the relationships among continuous variables that follow a joint Gaussian distribution. In an undirected graph $G = (V, E)$, the edge set $E$ represents the conditional dependencies among the variables in the vertex set $V$.
Let $X_1, \dots, X_p$ ($p \ge 1$) be Gaussian variables with covariance matrix $\Sigma \in \mathbb{R}^{p \times p}$, and let $\Theta := \Sigma^{-1}$ be the precision matrix, if it exists. We remove the edges so that the variables $X_i, X_j$ are conditionally independent given the other variables if and only if the $(i,j)$-th element $\theta_{i,j}$ in $\Theta$ is 0:
$\{i,j\} \notin E \iff \theta_{i,j} = 0 \iff X_i \perp X_j \mid X_{V \setminus \{i,j\}},$
where each edge is expressed as a set of two elements of $\{1, \dots, p\}$. In this sense, constructing a Gaussian graphical model is equivalent to estimating a precision matrix.
Suppose that we estimate the undirected graph from data consisting of $n$ tuples of $p$ variables and that the dimension $p$ is much higher than the sample size $n$. For example, if we have expression data of $p = 20{,}000$ genes for $n = 100$ case/control patients, how can we construct a gene regulatory network structure from the data? It is almost impossible to estimate the locations of the nonzero elements in $\Theta$ by inverting the sample covariance matrix $S \in \mathbb{R}^{p \times p}$, which is the unbiased estimator of $\Sigma$. In fact, if $p > n$, then no inverse $S^{-1}$ exists because the rank of $S \in \mathbb{R}^{p \times p}$ is, at most, $n$.
In order to address this situation, two directions are suggested:
  • Sequentially find the variables on which each variable depends via regression so that the quasilikelihood is maximized [2].
  • Find the locations in $\Theta$ whose values are zero, so that the $\ell_1$-regularized log-likelihood is maximized [3,4,5,6].
We follow the second approach, also known as graphical lasso (GL), because we assume Gaussian variables. The $\ell_1$-regularized log-likelihood is defined by:
$\underset{\Theta \succ 0}{\mathrm{maximize}}\ \log\det\Theta - \mathrm{trace}(S\Theta) - \lambda\|\Theta\|_1,$  (1)
where the tuning parameter $\lambda$ controls the amount of sparsity, and $\|\Theta\|_1$ denotes the sum of the absolute values of the off-diagonal elements of $\Theta$. Several optimization techniques [4,7,8,9,10,11,12] have been studied for the optimization problem (1).
In particular, we consider a generalized version of the abovementioned GL. For example, suppose that the gene regulatory networks of thirty case and seventy control patients are different. One might construct a gene regulatory network separately for each of the two categories. However, estimating each on its own does not take advantage of any shared common structure. Instead, we use the 100 samples to construct the two networks simultaneously. Intuitively speaking, using both types of data improves the reliability of the estimation by increasing the sample size for the genes that show similar values between case and control patients, while using only one type of data leads to a more accurate estimate for genes that show significantly different values. Ref. [13] proposed the joint graphical lasso (JGL) model by adding a convex penalty (a fused or group lasso penalty) to the graphical lasso objective function over K classes. For example, K is equal to two for the case/control patients in the example. The JGL includes the fused graphical lasso, whose fused lasso penalty encourages sparsity and similarity of the edge values across the K classes, and the group graphical lasso, whose group lasso penalty promotes a similar sparsity structure across the K graphs. Although there are several approaches to handling multiple graphical models, such as those of [14,15,16,17], the JGL is considered the most promising.
The main topic of this paper is efficiency improvement in terms of solving the JGL problem. For the GL, relatively efficient solving procedures exist. If we differentiate the $\ell_1$-regularized log-likelihood (1) by $\Theta$, then we have an equation to solve [4]. Moreover, several improvements have been considered for the GL, such as proximal Newton [12] and proximal gradient [10] procedures. However, for the JGL, even if we derive such an equation, we have no efficient way of handling it.
Instead, the alternating direction method of multipliers (ADMM) [18], which is a procedure for solving convex optimization problems for general purposes, has been the main approach taken [13,19,20,21]. However, ADMM does not scale well concerning the feature dimension p and number of classes K. It usually takes time to converge to a high-accuracy solution [22].
As an efficient procedure for solving the JGL problem, ref. [23] proposed a method based on the proximal Newton method, but only for the fused lasso penalty. That method requires expensive computations of the Hessian matrix and Newton directions, which makes it costly for high-dimensional problems.
In this paper, we propose efficient proximal-gradient-based algorithms to solve the JGL problem by extending the procedure in [10] and employing the step-size selection strategy proposed in [24]. Moreover, we provide the theoretical analysis of both methods for the JGL problem.
In our proximal gradient methods for the JGL problem, the proximal operator in each iteration is quite simple, which eases the implementation process and requires very little computation and memory at each step. Simulation experiments are used to justify our proposed methods over the existing ones.
Our main contributions are as follows:
  • We propose efficient algorithms based on the proximal gradient method to solve the JGL problem. The algorithms are first-order methods and quite simple, and the subproblems can be solved efficiently with a closed-form solution. The numerical results indicate that the methods can achieve high accuracy and precision, and the computational time is competitive with state-of-the-art algorithms.
  • We establish the boundedness of the solution to the JGL problem and of the iterates in the algorithms, which is related to the convergence rate of the algorithms. With this boundedness, we can guarantee that our proposed method converges linearly.
Table 1 summarizes the relationship between the proposed and existing methods.
The remaining parts of this paper are as follows. In Section 2, we first provide the background of our proposed methods and introduce the joint graphical lasso problem. In Section 3, we illustrate the detailed content of the proposed algorithms and provide a theoretical analysis. In Section 4, we report some numerical results of the proposed approaches, including comparisons with efficient methods and performance evaluations. Finally, we draw some conclusions in Section 5.
Notation: In this paper, $\|x\|_p$ denotes the $\ell_p$ norm of a vector $x \in \mathbb{R}^d$, $\|x\|_p := (\sum_{i=1}^d |x_i|^p)^{1/p}$ for $p \in [1, \infty)$, and $\|x\|_\infty := \max_i |x_i|$. For a matrix $X \in \mathbb{R}^{p \times q}$, $\|X\|_F$ denotes the Frobenius norm, $\|X\|_2$ denotes the spectral norm, $\|X\|_\infty := \max_{i,j}|x_{i,j}|$, and $\|X\|_1 := \sum_{i=1}^p\sum_{j=1}^q |x_{i,j}|$ if not specified. The inner product is defined by $\langle X, X' \rangle := \mathrm{trace}(X^T X')$.

2. Preliminaries

This section first reviews the graphical lasso (GL) problem and introduces the graphical iterative shrinkage-thresholding algorithm (G-ISTA) [10] to solve it. Then, we introduce the step-size selection strategy that we apply to the joint graphical lasso (JGL) in Section 3.2.

2.1. Graphical Lasso

Let $x_1, \dots, x_n \in \mathbb{R}^p$ be $n \ge 1$ observations of dimension $p \ge 1$ that follow a Gaussian distribution with mean $\mu \in \mathbb{R}^p$ and covariance matrix $\Sigma \in \mathbb{R}^{p \times p}$, where without loss of generality we assume $\mu = 0$. Let $\Theta := \Sigma^{-1}$, and let the empirical covariance matrix be $S := \frac{1}{n}\sum_{i=1}^n x_i^T x_i$. Given a penalty parameter $\lambda > 0$, the graphical lasso (GL) is the procedure to find the positive definite $\Theta \in \mathbb{R}^{p \times p}$ such that:
$\underset{\Theta \succ 0}{\mathrm{minimize}}\ -\log\det\Theta + \mathrm{trace}(S\Theta) + \lambda\|\Theta\|_1,$  (2)
where $\|\Theta\|_1 = \sum_{j \neq k}|\theta_{j,k}|$. If we regard $V := \{1,\dots,p\}$ as a vertex set, then we can construct an undirected graph with edge set $\{\{j,k\} \mid \theta_{j,k} \neq 0\}$, where the set $\{j,k\}$ denotes an undirected edge connecting the nodes $j, k \in V$.
If we take the subgradient of (2), then we find that the optimal solution $\Theta^*$ satisfies the condition:
$-(\Theta^*)^{-1} + S + \lambda\Phi = 0,$  (3)
where $\Phi = (\Phi_{j,k})$ satisfies
$\Phi_{j,k} \in \begin{cases} \{1\}, & \theta^*_{j,k} > 0, \\ [-1,1], & \theta^*_{j,k} = 0, \\ \{-1\}, & \theta^*_{j,k} < 0. \end{cases}$

2.2. ISTA for Graphical Lasso

In this subsection, we introduce the method for solving the GL problem (2) by the iterative shrinkage-thresholding algorithm (ISTA) proposed by [10], which is a proximal gradient method usually employed in dealing with nondifferentiable composite optimization problems.
Specifically, the general ISTA solves the following composite optimization problem:
$\underset{x}{\mathrm{minimize}}\ F(x) := f(x) + g(x),$
where f and g are convex, with f differentiable and g possibly being nondifferentiable.
For the GL problem (2), we denote $f, g: \mathbb{R}^{p\times p} \to \mathbb{R}$ as
$f(\Theta) := -\log\det\Theta + \mathrm{trace}(S\Theta)$
and
$g(\Theta) := \lambda\|\Theta\|_1.$
If we define the quadratic approximation $Q_\eta: \mathbb{R}^{p\times p}\times\mathbb{R}^{p\times p} \to \mathbb{R}$ w.r.t. $f(\Theta)$ and $\eta > 0$ by
$Q_\eta(\Theta, \Theta') := f(\Theta') + \langle\Theta - \Theta', \nabla f(\Theta')\rangle + \frac{1}{2\eta}\|\Theta - \Theta'\|_F^2,$
then we can describe the ISTA as a procedure that iterates
$\Theta_{t+1} = \underset{\Theta}{\arg\min}\,\{Q_{\eta_t}(\Theta, \Theta_t) + g(\Theta)\} = \mathrm{prox}_{\eta_t g}(\Theta_t - \eta_t\nabla f(\Theta_t)),$  (7)
given an initial value $\Theta_0$, where the value of the step size $\eta_t > 0$ may change at each iteration $t = 1, 2, \dots$ for efficient convergence, and we use the proximal operator:
$\mathrm{prox}_g(z) := \underset{\theta}{\arg\min}\Big\{\frac{1}{2}\|z - \theta\|_2^2 + g(\theta)\Big\}.$  (8)
Note that the proximal operator of the function $g = \lambda\|\Theta\|_1$ is the soft-thresholding operator: each off-diagonal element $\theta_{i,j}$ with $i \neq j$ becomes either $\theta_{i,j} - \mathrm{sgn}(\theta_{i,j})\lambda$ or zero (if $|\theta_{i,j}| < \lambda$). We use the following function for this operator in Section 3:
$[S_\lambda(\Theta)]_{i,j} = \mathrm{sgn}(\theta_{i,j})(|\theta_{i,j}| - \lambda)_+,$
where $(x)_+ := \max(x, 0)$.
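As a concrete illustration, the following minimal Python sketch applies the soft-thresholding operator $S_\lambda$ to the off-diagonal elements of a matrix while leaving the diagonal untouched; the function name and the use of NumPy are our own choices for illustration.

```python
import numpy as np

def soft_threshold_offdiag(Theta, lam):
    """Soft-thresholding S_lambda applied to the off-diagonal entries of Theta."""
    out = np.sign(Theta) * np.maximum(np.abs(Theta) - lam, 0.0)
    np.fill_diagonal(out, np.diag(Theta))   # the diagonal is not penalized
    return out
```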
Definition 1.
A differentiable function $f: \mathbb{R}^{n\times p} \to \mathbb{R}$ is said to have a Lipschitz-continuous gradient if there exists $L > 0$ (a Lipschitz constant) such that
$\|\nabla f(X) - \nabla f(Y)\|_F \le L\|X - Y\|_F, \quad \forall X, Y \in \mathbb{R}^{n\times p}.$  (10)
It is known that if we choose $\eta_t = 1/L$ for each step in the ISTA that minimizes $F(\cdot)$, then the convergence rate is, at most, as follows [25]:
$F(\Theta_t) - F(\Theta^*) = O\Big(\frac{1}{t}\Big).$  (11)
However, for the GL problem (2), we know neither the exact value of the Lipschitz constant $L$ nor any nontrivial upper bound. Ref. [10] implements a backtracking line search option in Step 1 of Algorithm 1 below to handle this issue.
The backtracking line search computes the value of $\eta_t$ for each $t = 1, 2, \dots$ by repeatedly multiplying $\eta_t$ by a constant $c \in (0,1)$ until $\Theta_{t+1} \succ 0$ ($\Theta_{t+1}$ is positive definite) and
$f(\Theta_{t+1}) \le Q_{\eta_t}(\Theta_{t+1}, \Theta_t),$  (12)
for the $\Theta_{t+1}$ in (7). Additionally, (12) is a sufficient condition for (11), which was derived in [25] (see the relationship between Lemma 2.3 and Theorem 3.1 in [25]).
The whole procedure is given in Algorithm 1.
Algorithm 1 G-ISTA for problem (2).
Input: $S$, tolerance $\epsilon > 0$, backtracking constant $0 < c < 1$, initial values $\eta_0$, $\Theta_0$, $t = 0$.
While $t < t_{\max}$ (until convergence) do
    1: Backtracking line search: Continue to multiply $\eta_t$ by $c$ until
       $\Theta_{t+1} \succ 0$ and $f(\Theta_{t+1}) \le Q_{\eta_t}(\Theta_{t+1}, \Theta_t)$
       for $\Theta_{t+1} := \mathrm{prox}_{\eta_t g}(\Theta_t - \eta_t\nabla f(\Theta_t))$.
    2: Update iterate: $\Theta_{t+1} \leftarrow \mathrm{prox}_{\eta_t g}(\Theta_t - \eta_t\nabla f(\Theta_t))$.
    3: Set the next initial step size $\eta_{t+1}$ by the Barzilai–Borwein method.
    4: $t \leftarrow t + 1$
end
Output: $\epsilon$-optimal solution to problem (2), $\Theta^* = \Theta_{t+1}$.
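The sketch below is a minimal Python rendering of the iteration in Algorithm 1. The fixed initial step size, the simple relative-change stopping rule, and the inlined proximal operator are our own simplifications (in particular, the Barzilai–Borwein initialization of Step 3 is omitted), so this is an illustration of the scheme rather than a faithful reimplementation of [10].

```python
import numpy as np

def g_ista(S, lam, eta=1.0, c=0.5, eps=1e-6, t_max=500):
    """Proximal gradient sketch for the graphical lasso (2):
    f(Theta) = -logdet(Theta) + trace(S Theta), g(Theta) = lam * sum_{i != j} |theta_ij|."""
    def prox(A, thr):                          # soft-threshold the off-diagonal entries
        out = np.sign(A) * np.maximum(np.abs(A) - thr, 0.0)
        np.fill_diagonal(out, np.diag(A))
        return out

    def f(T):
        return -np.linalg.slogdet(T)[1] + np.trace(S @ T)

    Theta = np.diag(1.0 / np.diag(S))          # a simple positive definite initializer
    for _ in range(t_max):
        grad = S - np.linalg.inv(Theta)        # gradient of f
        while True:                            # backtracking line search (Step 1)
            cand = prox(Theta - eta * grad, eta * lam)
            diff = cand - Theta
            Q = f(Theta) + np.sum(diff * grad) + np.sum(diff ** 2) / (2 * eta)
            if np.all(np.linalg.eigvalsh(cand) > 0) and f(cand) <= Q:
                break
            eta *= c                           # shrink the step size
        done = np.linalg.norm(cand - Theta, "fro") <= eps * max(np.linalg.norm(Theta, "fro"), 1.0)
        Theta = cand                           # update iterate (Step 2)
        if done:
            break
    return Theta
```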

2.3. Composite Self-Concordant Minimization

The notion of the self-concordant function was proposed in [26,27,28]. In the following, we say a convex function $f$ is self-concordant with parameter $M \ge 0$ if
$|f'''(x)| \le M f''(x)^{3/2} \quad \text{for all } x \in \mathrm{dom}\, f,$
where $\mathrm{dom}\, f$ is the domain of $f$.
Reference [24] considered a composite version of self-concordant function minimization and provided a way to efficiently calculate the step size for the proximal gradient method for the GL problem without relying on the Lipschitz gradient assumption (10). They proved that
$f(\Theta) := -\log\det\Theta + \mathrm{trace}(S\Theta)$
in (2) is self-concordant and considered the following minimization:
$F^* := \underset{x}{\mathrm{minimize}}\ \{F(x) := f(x) + g(x)\},$
where $f$ is convex, differentiable, and self-concordant, and $g$ is convex and possibly nondifferentiable. As in Algorithm 1, but without using the backtracking line search, we can compute the direction $d_t$ with initial step size $\eta_t$ as follows:
$d_t := \mathrm{prox}_{\eta_t g}(\Theta_t - \eta_t\nabla f(\Theta_t)) - \Theta_t,$  (13)
where the operator $\mathrm{prox}$ is defined by (8). Then, we use a modified step size $\alpha_t$ to update $\Theta_{t+1} := \Theta_t + \alpha_t d_t$, which is determined by the direction $d_t$. After defining two quantities related to the direction, $\beta_t := \eta_t^{-1}\|d_t\|_F^2$ and $\lambda_t := \langle\nabla^2 f(\Theta_t)d_t, d_t\rangle^{1/2}$, the modified step size is obtained by
$\alpha_t := \frac{\beta_t}{\lambda_t(\lambda_t + \beta_t)}.$  (14)
By Lemma 12 in [24], if the modified step size satisfies $\alpha_t \in (0,1]$, then it ensures a decrease in the objective function and guarantees convergence of the proximal gradient scheme. From (14), if $\lambda_t \ge 1$, then the condition $\alpha_t \in (0,1]$ is satisfied. Therefore, we only need to check the case $\lambda_t < 1$. If the condition $\alpha_t \in (0,1]$ does not hold, we can change the value of the initial $\eta_t$ (e.g., by bisection), which changes the value of $d_t$ in (13), until the condition is satisfied.
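For the GL problem, one iteration of this scheme can be sketched as follows. The Hessian quadratic form $\langle\nabla^2 f(\Theta)d, d\rangle = \mathrm{trace}(\Theta^{-1}d\,\Theta^{-1}d)$ follows from $f(\Theta) = -\log\det\Theta + \mathrm{trace}(S\Theta)$, while the halving of $\eta_t$ to enforce $\alpha_t \le 1$ and the function name are our own illustration choices.

```python
import numpy as np

def self_concordant_step(Theta, S, lam, eta):
    """One proximal step for the GL problem with the step size alpha_t of (14)."""
    Theta_inv = np.linalg.inv(Theta)
    grad = S - Theta_inv                                         # gradient of f
    while True:
        A = Theta - eta * grad
        prox = np.sign(A) * np.maximum(np.abs(A) - eta * lam, 0.0)
        np.fill_diagonal(prox, np.diag(A))                       # off-diagonal soft-thresholding
        d = prox - Theta                                         # direction d_t, Equation (13)
        if not d.any():                                          # already stationary
            return Theta, 0.0
        beta = np.sum(d ** 2) / eta                              # beta_t = ||d_t||_F^2 / eta_t
        lam_t = np.sqrt(np.sum((Theta_inv @ d @ Theta_inv) * d)) # <Hessian d, d>^(1/2)
        alpha = beta / (lam_t * (lam_t + beta))                  # step size, Equation (14)
        if alpha <= 1.0:
            return Theta + alpha * d, alpha
        eta /= 2.0                                               # adjust eta and recompute d
```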

2.4. Joint Graphical Lasso

Let $N \ge 1$, $p \ge 1$, $K \ge 2$, and $(x_1, y_1), \dots, (x_N, y_N) \in \mathbb{R}^p \times \{1,\dots,K\}$, where each $x_i$ is a row vector. Let $n_k$ be the number of occurrences in $y_1,\dots,y_N$ such that $y_i = k$, so that $\sum_{k=1}^K n_k = N$.
For each $k = 1,\dots,K$, we define the empirical covariance matrix $S^{(k)} \in \mathbb{R}^{p\times p}$ of the data $x_i$ as follows:
$S^{(k)} := \frac{1}{n_k}\sum_{i:\, y_i = k} x_i^T x_i.$
Given the penalty parameters $\lambda_1 > 0$ and $\lambda_2 > 0$, the joint graphical lasso (JGL) is the procedure to find positive definite matrices $\Theta^{(k)} \in \mathbb{R}^{p\times p}$ for $k = 1,\dots,K$ such that:
$\underset{\Theta}{\mathrm{minimize}}\ -\sum_{k=1}^K n_k\Big\{\log\det\Theta^{(k)} - \mathrm{trace}(S^{(k)}\Theta^{(k)})\Big\} + \lambda_1\sum_{k=1}^K\sum_{i\neq j}|\theta_{k,i,j}| + P(\Theta),$  (15)
where $P(\Theta)$ penalizes $\Theta := [\Theta^{(1)},\dots,\Theta^{(K)}]^T$. For example, ref. [13] suggested the following fused and group lasso penalties:
$P_F(\Theta) := \lambda_2\sum_{k<l}\sum_{i,j}|\theta_{k,i,j} - \theta_{l,i,j}|$
and
$P_G(\Theta) := \lambda_2\sum_{i\neq j}\Big(\sum_{k=1}^K\theta_{k,i,j}^2\Big)^{1/2},$
where $\theta_{k,i,j}$ is the $(i,j)$-th element of $\Theta^{(k)} \in \mathbb{R}^{p\times p}$ for $k = 1,\dots,K$.
Unfortunately, there is no equation like (3) for the JGL to find the optimum $\Theta^*$. Ref. [13] considered the ADMM to solve the JGL problem. However, ADMM is quite time consuming for large-scale problems.
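To make the quantities in (15) concrete, the short sketch below assembles the class-wise covariance matrices $S^{(k)}$ from labeled samples and evaluates the JGL objective with the fused penalty $P_F$; the function names are ours, and the snippet is only meant as an illustration of the definitions above.

```python
import numpy as np
from itertools import combinations

def class_covariances(X, y, K):
    """S^(k) = (1/n_k) * sum_{i: y_i = k} x_i^T x_i, with the rows of X as samples."""
    S, n = [], []
    for k in range(1, K + 1):
        Xk = X[y == k]
        n.append(Xk.shape[0])
        S.append(Xk.T @ Xk / Xk.shape[0])
    return S, np.array(n)

def jgl_objective_fused(Thetas, S, n, lam1, lam2):
    """Objective (15): negative log-likelihood + sparsity penalty + fused penalty P_F."""
    val = 0.0
    for Theta_k, S_k, n_k in zip(Thetas, S, n):
        val += n_k * (-np.linalg.slogdet(Theta_k)[1] + np.trace(S_k @ Theta_k))
        val += lam1 * (np.sum(np.abs(Theta_k)) - np.sum(np.abs(np.diag(Theta_k))))
    for k, l in combinations(range(len(Thetas)), 2):   # sum over class pairs k < l
        val += lam2 * np.sum(np.abs(Thetas[k] - Thetas[l]))
    return val
```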

3. The Proposed Methods

In this section, we propose two efficient algorithms for solving the JGL problem. One is an extended ISTA based on the G-ISTA in Section 2.2, and the other is based on the step-size selection strategy introduced in Section 2.3.

3.1. ISTA for the JGL Problem

To describe the JGL problem, we define $f, g: \mathbb{R}^{K\times p\times p} \to \mathbb{R}$ by
$f(\Theta) := \sum_{k=1}^K n_k\Big\{-\log\det\Theta^{(k)} + \mathrm{trace}(S^{(k)}\Theta^{(k)})\Big\},$  (16)
$g(\Theta) := \lambda_1\sum_{k=1}^K\sum_{i\neq j}|\theta_{k,i,j}| + P(\Theta).$
Then, the problem (15) reduces to:
$\underset{\Theta}{\mathrm{minimize}}\ F(\Theta) := f(\Theta) + g(\Theta),$
where the function f is convex and differentiable, and g is convex and nondifferentiable. Therefore, the ISTA is available for solving the JGL problem (15).
The main difference between the G-ISTA and the proposed method is that the latter needs to simultaneously consider K categories of graphical models in the JGL problem (15). What is more, there are two combined penalties in g ( Θ ) , which complicate the proximal operator in the ISTA procedure. Consequently, the operator for the proposed method is not a simple soft thresholding operator, as it is for the G-ISTA method.
If we define the quadratic approximation $Q_{\eta_t}: \mathbb{R}^{K\times p\times p} \to \mathbb{R}$ w.r.t. $f(\Theta_t)$ by
$Q_{\eta_t}(\Theta, \Theta_t) := f(\Theta_t) + \sum_{k=1}^K\langle\Theta^{(k)} - \Theta_t^{(k)}, \nabla f(\Theta_t^{(k)})\rangle + \frac{1}{2\eta_t}\sum_{k=1}^K\|\Theta^{(k)} - \Theta_t^{(k)}\|_F^2,$
then the update iteration simplifies to:
$\Theta_{t+1} = \underset{\Theta}{\mathrm{argmin}}\,\big\{Q_{\eta_t}(\Theta, \Theta_t) + g(\Theta)\big\} = \mathrm{prox}_{\eta_t g}(\Theta_t - \eta_t\nabla f(\Theta_t)).$
Nevertheless, the Lipschitz gradient constant of $f(\Theta)$ is unknown over the whole domain in the JGL problem. Therefore, our approach needs a backtracking line search to calculate the step size $\eta_t$. We show the details in Algorithm 2.
Algorithm 2 ISTA for problem (15).
Input: $S$, tolerance $\epsilon > 0$, backtracking constant $0 < c < 1$, initial step size $\eta_0$, initial iterate $\Theta_0$.
For $t = 0, 1, \dots$ (until convergence) do
    1: Backtracking line search: Continue to multiply $\eta_t$ by $c$ until
       $f(\Theta_{t+1}) \le Q_{\eta_t}(\Theta_{t+1}, \Theta_t)$ and $\Theta_{t+1}^{(k)} \succ 0$ for $k = 1,\dots,K$,
       for $\Theta_{t+1} := \mathrm{prox}_{\eta_t g}(\Theta_t - \eta_t\nabla f(\Theta_t))$.
    2: Update iterate: $\Theta_{t+1} \leftarrow \mathrm{prox}_{\eta_t g}(\Theta_t - \eta_t\nabla f(\Theta_t))$.
    3: Set the next initial step size $\eta_{t+1}$. See details in Section 3.3.
end
Output: optimal solution to problem (15), $\Theta^* = \Theta_{t+1}$.
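A minimal Python sketch of this iteration is given below. The penalty-specific proximal operator is passed in as a function `prox(A, eta)`, assumed to return the exact minimizer of $\frac{1}{2}\|\Theta - A\|_F^2 + \eta\, g(\Theta)$ (elementwise sketches for the fused and group penalties follow in the next two subsections); the simple halving backtracking rule and the relative-error stopping test are our own simplifications of Algorithm 2.

```python
import numpy as np

def ista_jgl(S, n, prox, eta=1.0, c=0.5, eps=1e-5, t_max=1000):
    """ISTA sketch for the JGL problem (15).
    S: list of K class covariance matrices, n: class sample sizes,
    prox(A, eta): proximal operator of eta*g applied to the stacked (K, p, p) array A."""
    S = np.stack(S)
    n = np.asarray(n, dtype=float)
    K = S.shape[0]
    Theta = np.stack([np.diag(1.0 / np.diag(S[k])) for k in range(K)])

    def f(T):
        return sum(n[k] * (-np.linalg.slogdet(T[k])[1] + np.trace(S[k] @ T[k]))
                   for k in range(K))

    for _ in range(t_max):
        grad = np.stack([n[k] * (S[k] - np.linalg.inv(Theta[k])) for k in range(K)])
        while True:                                     # backtracking line search (Step 1)
            cand = prox(Theta - eta * grad, eta)
            diff = cand - Theta
            Q = f(Theta) + np.sum(diff * grad) + np.sum(diff ** 2) / (2 * eta)
            pd = all(np.all(np.linalg.eigvalsh(cand[k]) > 0) for k in range(K))
            if pd and f(cand) <= Q:
                break
            eta *= c
        rel = (sum(np.linalg.norm(cand[k] - Theta[k], "fro") for k in range(K))
               / max(sum(np.linalg.norm(Theta[k], "fro") for k in range(K)), 1.0))
        Theta = cand                                    # update iterate (Step 2)
        if rel <= eps:
            break
    return Theta
```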
In the update of $\Theta_{t+1}$, we need to compute the proximal operators for the fused and group lasso penalties. In the following, for each of these penalties, the problem decomposes into fused lasso problems [29] or group lasso problems [30,31] for $\theta_{i,j} \in \mathbb{R}^K$, $i, j = 1,\dots,p$. We apply the solutions given by (20) and (21) below.

3.1.1. Fused Lasso Penalty $P_F$

By the definition of the proximal operator in the update step, we have:
$\Theta_{t+1} = \underset{\Theta}{\arg\min}\ \frac{1}{2}\sum_{k=1}^K\|\Theta^{(k)} - \Theta_t^{(k)} + \eta_t\nabla f(\Theta_t^{(k)})\|_F^2 + \eta_t\lambda_1\sum_{k=1}^K\sum_{i\neq j}|\theta_{k,i,j}| + \eta_t\lambda_2\sum_{k<l}\sum_{i,j}|\theta_{k,i,j} - \theta_{l,i,j}|.$  (18)
Problem (18) is separable with respect to the elements $\theta_{k,i,j}$ of $\Theta^{(k)} \in \mathbb{R}^{p\times p}$; hence, the proximal operator can be computed in a componentwise manner. Let $A = \Theta_t - \eta_t\nabla f(\Theta_t)$; then, problem (18) reduces to the following for $i = 1,\dots,p$, $j = 1,\dots,p$:
$\underset{\theta_{1,i,j},\dots,\theta_{K,i,j}}{\mathrm{argmin}}\ \frac{1}{2}\sum_{k=1}^K(\theta_{k,i,j} - a_{k,i,j})^2 + \eta_t\lambda_1 1_{i\neq j}\sum_{k=1}^K|\theta_{k,i,j}| + \eta_t\lambda_2\sum_{k<l}|\theta_{k,i,j} - \theta_{l,i,j}|,$  (19)
where $1_{i\neq j}$ is an indicator function whose value is 1 only when $i \neq j$.
The problem (19) is known as the fused lasso problem [29,32], given $a_{k,i,j}$ for $k = 1,\dots,K$. In particular, let $\alpha := \eta_t\lambda_1 1_{i\neq j}$ and $\beta := \eta_t\lambda_2$. When $i \neq j$, so that $\alpha > 0$ and $\beta > 0$, the solution to (19) can be obtained through the soft-thresholding operator applied to the solution for $\alpha = 0$, by the following lemma.
Lemma 1.
([33]) Denote the solution for parameters $\alpha$ and $\beta$ by $\theta_i(\alpha, \beta)$. Then the solution $\theta_i(\alpha, \beta)$ of the fused lasso problem
$\frac{1}{2}\sum_{i=1}^n(y_i - \theta_i)^2 + \alpha\sum_{i=1}^n|\theta_i| + \beta\sum_{i=1}^{n-1}|\theta_i - \theta_{i+1}|$  (20)
is given by $[S_\alpha(\theta(0,\beta))]_i$ when $y_1,\dots,y_n \in \mathbb{R}$ are given for $n \ge 1$.
Additionally, rather efficient algorithms are available for solving the fused lasso problem (20) when $\alpha = 0$ (i.e., for $\theta(0,\beta)$), such as [32,34,35].
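For the special case $K = 2$, the fused-only problem ($\alpha = 0$) has a simple closed form: the two values move toward each other by at most $\beta$ and meet at their average when $|a_{1,i,j} - a_{2,i,j}| \le 2\beta$; soft-thresholding the result by $\alpha$ then gives the solution of (19), as Lemma 1 indicates. The sketch below implements this two-class case elementwise and is our own illustration; for general $K$, a fused lasso solver such as those of [32,34,35] would be called instead.

```python
import numpy as np

def prox_fused_K2(A, eta, lam1, lam2):
    """Elementwise prox for (19) with K = 2; A has shape (2, p, p)."""
    a1, a2 = A[0], A[1]
    beta = eta * lam2
    # fused-only solution (alpha = 0): move both values toward each other by at most beta
    shift = np.clip((a1 - a2) / 2.0, -beta, beta)
    t1, t2 = a1 - shift, a2 + shift
    # soft-threshold by eta*lam1, but only the off-diagonal entries (the indicator 1_{i != j})
    alpha = eta * lam1 * (1.0 - np.eye(a1.shape[0]))
    soft = lambda x: np.sign(x) * np.maximum(np.abs(x) - alpha, 0.0)
    return np.stack([soft(t1), soft(t2)])
```

This function can be plugged into the ISTA sketch of Section 3.1, e.g., `ista_jgl(S, n, prox=lambda A, eta: prox_fused_K2(A, eta, lam1, lam2))`.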

3.1.2. Group Lasso Penalty $P_G$

By definition, the update of $\Theta_{t+1}$ for the group lasso penalty $P_G(\Theta)$ is as follows:
$\Theta_{t+1} = \underset{\Theta}{\arg\min}\ \frac{1}{2}\sum_{k=1}^K\|\Theta^{(k)} - \Theta_t^{(k)} + \eta_t\nabla f(\Theta_t^{(k)})\|_F^2 + \eta_t\lambda_1\sum_{k=1}^K\sum_{i\neq j}|\theta_{k,i,j}| + \eta_t\lambda_2\sum_{i\neq j}\Big(\sum_{k=1}^K\theta_{k,i,j}^2\Big)^{1/2}.$
Similarly, let $A = \Theta_t - \eta_t\nabla f(\Theta_t)$; then, the problem becomes the following for $i = 1,\dots,p$, $j = 1,\dots,p$:
$\underset{\theta_{1,i,j},\dots,\theta_{K,i,j}}{\mathrm{argmin}}\ \frac{1}{2}\sum_{k=1}^K(\theta_{k,i,j} - a_{k,i,j})^2 + \eta_t\lambda_1 1_{i\neq j}\sum_{k=1}^K|\theta_{k,i,j}| + \eta_t\lambda_2 1_{i\neq j}\Big(\sum_{k=1}^K\theta_{k,i,j}^2\Big)^{1/2}.$
We have $\theta_{k,i,j} = a_{k,i,j}$ for $i = j$. In addition, for $i \neq j$, the solution [31,36,37] is given by
$\theta_{k,i,j} = S_{\eta_t\lambda_1}(a_{k,i,j})\left(1 - \frac{\eta_t\lambda_2}{\sqrt{\sum_{k'=1}^K S_{\eta_t\lambda_1}(a_{k',i,j})^2}}\right)_+.$  (21)
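Equation (21) translates directly into code: the sketch below soft-thresholds each class by $\eta_t\lambda_1$ and then applies the groupwise shrinkage across classes to the off-diagonal entries, while the diagonal entries are simply set to $a_{k,i,i}$. It is our own transcription of (21), not the authors' implementation.

```python
import numpy as np

def prox_group(A, eta, lam1, lam2):
    """Elementwise prox for the sparse group penalty, Equation (21); A has shape (K, p, p)."""
    K, p, _ = A.shape
    st = np.sign(A) * np.maximum(np.abs(A) - eta * lam1, 0.0)   # S_{eta*lam1}(a_{k,i,j})
    norms = np.sqrt(np.sum(st ** 2, axis=0))                    # groupwise norm over the K classes
    scale = np.zeros_like(norms)
    nz = norms > 0.0
    scale[nz] = np.maximum(1.0 - eta * lam2 / norms[nz], 0.0)   # (1 - eta*lam2 / ||.||)_+
    Theta = st * scale                                          # broadcast over classes
    idx = np.arange(p)
    Theta[:, idx, idx] = A[:, idx, idx]                         # theta_{k,i,i} = a_{k,i,i}
    return Theta
```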

3.2. Modified ISTA for JGL

Thus far, we have seen that $f(\Theta)$ in the JGL problem (15) is not globally Lipschitz gradient continuous. The ISTA may not be efficient enough for the JGL case because its backtracking line search must evaluate the objective function and the positive definiteness of $\Theta_{t+1}$ in Step 1 of Algorithm 2, which is inefficient when these evaluations are expensive.
In this section, we modify Algorithm 2 into Algorithm 3 based on the step-size selection strategy in Section 2.3, which takes advantage of the properties of the self-concordant function. This strategy does not rely on the Lipschitz gradient assumption on $f(\Theta)$ [24], and it eliminates the need for the backtracking line search.
Lemma 2.
([38]) Self-concordance is preserved by scaling and addition: if $f$ is a self-concordant function and $a \ge 1$ is a constant, then $af$ is self-concordant. If $f_1, f_2$ are self-concordant, then $f_1 + f_2$ is self-concordant.
By Lemma 2, the function $f(\Theta)$ in (16) is self-concordant. In Algorithm 3, for the initial step size $\eta_t$ in each iteration, we use the Barzilai–Borwein method [39]. We apply the step-size mechanism of Section 2.3 in Steps 3–5 of Algorithm 3.
Algorithm 3 Modified ISTA (M-ISTA).
Input: $S$, tolerance $\epsilon > 0$, initial step size $\eta_0$, initial iterate $\Theta_0$.
For $t = 0, 1, \dots$ (until convergence) do
    1: Initialize $\eta_t$.
    2: Compute $d_t := \mathrm{prox}_{\eta_t g}(\Theta_t - \eta_t\nabla f(\Theta_t)) - \Theta_t$.
    3: Compute $\beta_t := \eta_t^{-1}\|d_t\|_F^2$ and $\lambda_t := \Big(\sum_{k=1}^K n_k\|(\Theta_t^{(k)})^{-1}d_t^{(k)}\|_F^2\Big)^{1/2}$.
    4: Determine the step size $\alpha_t := \frac{\beta_t}{\lambda_t(\lambda_t + \beta_t)}$.
    5: If $\alpha_t > 1$, then set $\eta_t := \eta_t/2$ and go back to Step 2.
    6: Update $\Theta_{t+1} := \Theta_t + \alpha_t d_t$.
end
Output: optimal solution to problem (15), $\Theta^* = \Theta_{t+1}$.
There is no backtracking procedure in this algorithm that guarantees the positive definiteness of $\Theta_{t+1}$, as in Step 1 of Algorithm 2. We next show how to ensure the positive definiteness of $\Theta_{t+1}$ in the iterations of Algorithm 3.
Lemma 3.
([40], Theorem 2.1.1) Let $f$ be a self-concordant function and $x \in \mathrm{dom}\, f$. If
$W(x) = \{y \mid \langle\nabla^2 f(x)(y - x), y - x\rangle^{1/2} \le 1\},$
then $W(x) \subset \mathrm{dom}\, f$.
In Algorithm 3, Steps 3–5 ensure that $\alpha_t := \frac{\beta_t}{\lambda_t(\lambda_t + \beta_t)} \le 1$ with $\beta_t > 0$ and $\lambda_t > 0$. Thus, we have $\alpha_t\lambda_t < 1$:
$\alpha_t\lambda_t = \alpha_t\langle\nabla^2 f(\Theta_t)d_t, d_t\rangle^{1/2} < 1,$
which implies
$\langle\nabla^2 f(\Theta_t)(\Theta_{t+1} - \Theta_t), \Theta_{t+1} - \Theta_t\rangle^{1/2} < 1.$
Hence, from Lemma 3, we see that $\Theta_{t+1}$ stays in the domain and maintains positive definiteness.

3.3. Theoretical Analysis

For multiple Gaussian graphical models, Honorio and Samaras [14] and Hara and Washio [17] provided lower and upper bounds for the optimal solution $\Theta^*$. However, the models they considered are different from the JGL. To the best of our knowledge, no related research has provided bounds on the optimal solution $\Theta^*$ of the JGL problem (15).
In the following, we show bounds on the optimal solution $\Theta^*$ of the JGL and on the iterates $\Theta_t$ generated by Algorithms 2 and 3, which apply to both the fused and group lasso penalties.
Proposition 1.
The optimal solution $\Theta^*$ of the problem (15) satisfies
$\max_{1\le k\le K}\dfrac{n_k}{p\lambda_c + n_k\|S^{(k)}\|_2} \le \|\Theta^{*(k)}\|_2 \le \dfrac{Np}{\lambda_1} + \sum_{k=1}^K\sum_{i=1}^p(s_{k,i,i})^{-1},$
where $\lambda_c := \sqrt{K\lambda_1^2 + 2K\lambda_1\lambda_2 + \lambda_2^2}$, and $s_{k,i,i}$ is the $i$-th diagonal element of $S^{(k)}$.
For the proof, see Appendix A.1.
Note that the objective function value $F(\Theta)$ always decreases as the iterations proceed in both algorithms, due to [25] (Remark 3.1) and Lemma 12 in [24]. Therefore, the following inequality holds for Algorithms 2 and 3:
$F(\Theta_{t+1}) \le F(\Theta_t) \quad \text{for } t = 0, 1, \dots.$  (22)
Then, based on the condition (22), we provide explicit bounds on the iterates $\{\Theta_t\}_{t=0,1,\dots}$ of Algorithms 2 and 3 for the JGL problem (15).
Proposition 2.
The sequence $\{\Theta_t\}_{t=0,1,\dots}$ generated by Algorithm 2 or 3 is bounded:
$m \le \|\Theta_t\|_2 \le M,$
where $M := \|\Theta_0\|_F + \frac{2Np}{\lambda_1} + 2\sum_{k=1}^K\sum_{i=1}^p s_{k,i,i}^{-1}$, $m := e^{-C_1/n_m}M^{(1-Kp)}$, $n_m = \max_k n_k$, and the constant $C_1 := F(\Theta_0)$.
For the proof, see Appendix A.2.
With the help of Propositions 1 and 2, and the following Lemma, we can obtain the range of the step size that ensures the linear convergence rate of Algorithm 2.
Lemma 4.
Let $\Theta_t$ be the $t$-th iterate of Algorithm 2. Denote by $\lambda_{\min}$ and $\lambda_{\max}$ the minimum and maximum eigenvalues of the corresponding matrix, respectively. Define
$a_k := \min\{\lambda_{\min}(\Theta_t^{(k)}), \lambda_{\min}(\Theta^{*(k)})\}, \qquad b_k := \max\{\lambda_{\max}(\Theta_t^{(k)}), \lambda_{\max}(\Theta^{*(k)})\},$
and $n_l = \min_{k=1,\dots,K}n_k$, $n_m = \max_{k=1,\dots,K}n_k$, $a_l = \min_{k=1,\dots,K}a_k$, and $b_m = \max_{k=1,\dots,K}b_k$. The sequence $\{\Theta_t\}_{t=0,1,\dots}$ generated by Algorithm 2 satisfies
$\|\Theta_{t+1} - \Theta^*\|_F \le \gamma_t\|\Theta_t - \Theta^*\|_F$
with the convergence rate $\gamma_t := \max\Big\{\frac{\eta_t n_m}{a_l^2} - 1,\ 1 - \frac{\eta_t n_l}{b_m^2}\Big\}$.
Proof. 
The result follows by extending Lemma 3 in [10]. □
Lemma 4 implies that to obtain a convergence rate $\gamma_t < 1$, we require:
$0 < \eta_t < \frac{2a_l^2}{n_m}.$  (23)
Using Propositions 1 and 2, we can obtain bounds on $a_l$. Further, we can obtain a step size $\eta_t$ that satisfies (23) and guarantees the linear convergence rate ($\gamma_t < 1$). However, this step size is quite conservative in practice. Hence, we use the Barzilai–Borwein method in the implementation and regard a step size $\eta_t$ satisfying (23) as a safe choice. When the number of backtracking iterations in Step 1 of Algorithm 2 exceeds the given maximum without fulfilling the backtracking line search condition, we use the safe step size $\eta_t$ for the subsequent calculations. In Section 4.2.3, we confirm the linear convergence rate of the proposed ISTA experimentally.

4. Experiments

In this section, we evaluate the performance of the proposed methods on both synthetic and real datasets, and we compare the following algorithms:
  • ADMM: the general ADMM method proposed by [13].
  • FMGL: the proximal Newton-type method proposed by [23].
  • ISTA: the proposed method in Algorithm 2.
  • M-ISTA: the proposed method in Algorithm 3.
We perform all the tests in R Studio on a Macbook Air with 1.6 GHz Intel Core i5 and 8 GB memory. The wall times are recorded as the run times for the four algorithms.

4.1. Stopping Criteria and Model Selection

In the experiments, we consider two stopping criteria for the algorithms.
1. Relative error stopping criterion:
$\dfrac{\sum_{k=1}^K\|\Theta_{t+1}^{(k)} - \Theta_t^{(k)}\|_F}{\max\big\{\sum_{k=1}^K\|\Theta_t^{(k)}\|_F,\ 1\big\}} \le \epsilon.$
2. Objective error stopping criterion:
$F(\Theta_t) - F(\Theta^*) \le \epsilon.$
Here, $\epsilon$ is a given accuracy tolerance; we terminate an algorithm when the above error becomes smaller than $\epsilon$ or the number of iterations exceeds 1000. We use the objective error for the convergence rate analysis and the relative error for the time comparison.
The JGL model is affected by the regularization parameters $\lambda_1$ and $\lambda_2$. To select them, we use $V$-fold crossvalidation: the dataset is randomly split into $V$ segments of equal size; the model estimated from $V-1$ subsets (training data) is evaluated on the remaining subset (test data); and the test subset is rotated so that each of the $V$ subsets is used once.
Let $S_v^{(k)}$ be the sample covariance matrix of the $v$-th ($v = 1,\dots,V$) segment for class $k = 1,\dots,K$. We estimate the inverse covariance matrix $\hat\Theta_{\lambda,v}^{(k)}$ from the remaining $V-1$ subsets and choose the $\lambda_1$ and $\lambda_2$ that minimize the average predictive negative log-likelihood:
$CV(\lambda_1, \lambda_2) = \sum_{v=1}^V\sum_{k=1}^K n_k\Big\{\mathrm{trace}\big(S_v^{(k)}\hat\Theta_{\lambda,v}^{(k)}\big) - \log\det\hat\Theta_{\lambda,v}^{(k)}\Big\}.$
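As an illustration of this selection step, the function below evaluates $CV(\lambda_1, \lambda_2)$ for one candidate pair from precision matrices already fitted on each training fold (by any JGL solver, such as the ISTA sketch above); the fold splitting and the grid search over $(\lambda_1, \lambda_2)$ are assumed to be handled by the caller.

```python
import numpy as np

def cv_score(S_test_folds, Theta_folds, n):
    """Average predictive negative log-likelihood CV(lambda_1, lambda_2).
    S_test_folds[v][k]: test covariance of fold v for class k;
    Theta_folds[v][k]:  precision matrix fitted on the other V-1 folds;
    n[k]: class sample sizes, as in the displayed formula."""
    score = 0.0
    for S_v, Theta_v in zip(S_test_folds, Theta_folds):
        for n_k, S_vk, T_vk in zip(n, S_v, Theta_v):
            score += n_k * (np.trace(S_vk @ T_vk) - np.linalg.slogdet(T_vk)[1])
    return score
```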

4.2. Synthetic Data

The performance of the proposed methods was assessed on synthetic data in terms of the number of iterations, the execution time, the squared error, and the receiver operating characteristic (ROC) curve. We follow the data generation mechanism described in [41] with some modifications for the JGL model. We put the details in Appendix B.

4.2.1. Time Comparison Experiments

We vary $p$, $N$, $K$, and $\lambda_1$ to compare the execution time of our proposed methods with that of the existing methods. We consider only the fused penalty for a fair comparison because the FMGL algorithm applies only to the fused penalty. First, we compare the performance of the different algorithms for various dimensions $p$; the results are shown in Figure 1.
Figure 1 shows that the execution time of the FMGL and ADMM increases rapidly as $p$ increases. In particular, we observe that the M-ISTA significantly outperforms the other methods when $p$ exceeds 200. The ISTA performs better than the other three methods when $p$ is less than 200, but it requires more time than the M-ISTA as $p$ grows. It is reasonable to attribute this to evaluating the objective function in the backtracking line search at every iteration, whose cost increases with $p$; the M-ISTA is therefore a good choice in these cases, while the ISTA remains a good candidate when the evaluation is inexpensive.
Table 2 summarizes the performance of the four algorithms under different parameter settings to achieve a given precision $\epsilon$ of the relative error. The results in Table 2 reveal that when we increase the number of classes $K$, all the algorithms require more time, and the execution time of ADMM becomes the largest among them. When we vary $\lambda_1$, the algorithms become more efficient as its value increases. For most instances, the M-ISTA and ISTA outperform the existing methods, ADMM and FMGL. For the exceptional case ($p = 20$, $K = 2$, $N = 60$, $\lambda_1 = 0.1$, and $\lambda_2 = 0.05$), the M-ISTA and ISTA are still comparable with the FMGL and faster than ADMM.

4.2.2. Algorithm Assessment

We generate the simulation data as described in Appendix B and regard the synthetic inverse covariance matrices $\Theta^{(k)}$ as the true values for our assessment experiments.
First, we assessed our proposed method by drawing an ROC curve, which displays the number of true positive (TP) edges selected versus the number of false positive (FP) edges selected. We say that an edge $(i,j)$ in the $k$-th class is selected in the estimate $\hat\Theta^{(k)}$ if the element $\hat\theta_{k,i,j} \neq 0$; a selected edge is a true positive if the true precision matrix element $\theta_{k,i,j} \neq 0$ and a false positive if $\theta_{k,i,j} = 0$. The two quantities are defined by
$TP = \sum_{k=1}^K\sum_{i,j}1(\theta_{k,i,j} \neq 0)\cdot 1(\hat\theta_{k,i,j} \neq 0)$
and
$FP = \sum_{k=1}^K\sum_{i,j}1(\theta_{k,i,j} = 0)\cdot 1(\hat\theta_{k,i,j} \neq 0),$
where $1(\cdot)$ is the indicator function.
To confirm the validity of the proposed methods, we compare the ROC curves of the fused penalty and the group penalty. We fix the parameter $\lambda_2$ for each curve and vary the $\lambda_1$ value to obtain various numbers of selected edges, because the sparsity parameter $\lambda_1$ controls the total number of selected edges.
We show the ROC curves for the fused and group lasso penalties in Figure 2a,b, respectively. From the figures, we observe that both penalties yield highly accurate edge selection. The result with $\lambda_2 = 0.0166$ in the fused penalty case is better than that with $\lambda_2 = 0.05$, and the result with $\lambda_2 = 0.0966$ in the group penalty case is better than that with $\lambda_2 = 0.09$. This means that if we select the tuning parameters properly, we can obtain precise results while meeting our different modeling demands.
Then, Figure 3a,b display the mean squared error (MSE) between the estimated values and true values.
$MSE = \dfrac{2}{Kp(p-1)}\sum_{k=1}^K\sum_{i<j}\big(\hat\theta_{k,i,j} - \theta_{k,i,j}\big)^2,$
where $\hat\theta_{k,i,j}$ is the value estimated by the proposed method, and $\theta_{k,i,j}$ is the true precision matrix value used in the data generation.
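The counts TP and FP and the MSE above can be computed directly from the stacked true and estimated precision matrices, as in the short sketch below; restricting the counts to off-diagonal entries (i.e., to edges) and using a small tolerance to declare an estimated entry nonzero are our own practical choices.

```python
import numpy as np

def edge_metrics(Theta_true, Theta_hat, tol=1e-8):
    """TP and FP edge counts and the off-diagonal MSE for (K, p, p) precision arrays."""
    K, p, _ = Theta_true.shape
    offdiag = ~np.eye(p, dtype=bool)
    true_nz = (np.abs(Theta_true) > 0) & offdiag
    hat_nz = (np.abs(Theta_hat) > tol) & offdiag
    tp = int(np.sum(true_nz & hat_nz))
    fp = int(np.sum(~true_nz & offdiag & hat_nz))
    iu, ju = np.triu_indices(p, k=1)
    mse = 2.0 / (K * p * (p - 1)) * np.sum((Theta_hat[:, iu, ju] - Theta_true[:, iu, ju]) ** 2)
    return tp, fp, mse
```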
The figures illustrate that when the total number of edges selected increases, the errors decrease and finally achieve relatively low values.
Overall, the proposed method shows competitive efficiency not only in computational time but also in accuracy.

4.2.3. Convergence Rate

This section shows the convergence rate of the ISTA for solving the JGL problem (15) in practice, with $\lambda_1 = 0.1$, 0.09, and 0.08. We recorded the number of iterations needed to achieve different tolerances of $F(\Theta_t) - F(\Theta^*)$ on a synthetic dataset with $p = 200$, $K = 2$, $\lambda_2 = 0.05$, and $N = 400$; the results are shown in Figure 4. The figure reveals that as $\lambda_1$ decreases, more iterations are needed to converge to the specified tolerance. Moreover, the figure shows the linear convergence rate of the proposed ISTA method, which corroborates the theoretical analysis in Section 3.3.

4.3. Real Data

In this section, we use two different real datasets to demonstrate the performance of our proposed method and visualize the result.
Firstly, we used the presidential speeches dataset in [42] to jointly estimate common links across graphs and show the common structure. The dataset contains the 75 most-used words (features) from several major speeches of the 44 US presidents (samples). In addition, we used the clustering result in [42], where the authors split the 44 samples into two groups with similar features, giving two classes of samples ($K = 2$).
We used Cytoscape [43] to visualize the results when $\lambda_1 = 1.9$ and $\lambda_2 = 0.16$. We chose these relatively large tuning parameters for better interpretation of the network figure. Figure 5 shows the relationship network of the high-frequency words identified by the JGL model with the proposed method. As shown in the figure, each node represents a word, and the edges represent the relationships between words.
We use different colors to show the various structures. The black edges form the common structure between the two classes, the red edges are the structures specific to the first class ($k = 1$), and the green edges are those specific to the second class ($k = 2$). Figure 5 shows a subnetwork on the top with red edges, meaning there are relationships among those words that exist only in the first group.
We compared the time cost of the four algorithms and show the results in Table 3. We used the crossvalidation method ($V = 6$) described in Section 4.1 to select the optimal tuning parameters ($\lambda_1 = 0.1$, $\lambda_2 = 0.05$). In addition, we manually chose two other pairs of parameters for further comparison.
Table 3 shows that ISTA outperforms the other three algorithms, and our proposed methods offer stable performance when varying the parameters, while ADMM is the slowest in most cases.
Secondly, we use a breast cancer dataset [44] for time comparison. There are 250 samples and 1000 genes in the dataset, with 192 control samples and 58 case samples ($K = 2$). Furthermore, we extract the 200 genes with the highest variances among the original genes. The tuning parameter pair ($\lambda_1 = 0.01$, $\lambda_2 = 0.0166$) was chosen by the crossvalidation method. Table 3 shows that our proposed methods (ISTA and M-ISTA) outperform ADMM and FMGL, and that M-ISTA performs best on the breast cancer dataset.

5. Discussion

We propose two efficient proximal gradient procedures, with and without the backtracking line search option, for the joint graphical lasso. The first (Algorithm 2) does not require extra variables, unlike ADMM, which needs manual tuning of the Lagrangian penalty parameter $\rho$ in [13] as well as storing and updating dual variables. Moreover, we reduce the update step to subproblems that can be solved efficiently and precisely as lasso-type problems. Based on Algorithm 2, we modified the step-size selection by extending the strategy in [24], obtaining the second procedure (Algorithm 3), which does not rely on the Lipschitz assumption. Additionally, the second procedure does not require a backtracking line search, significantly reducing the computation time needed to evaluate objective functions.
From the theoretical perspective, we establish a linear convergence rate for the ISTA. Furthermore, we derive lower and upper bounds on the solution to the JGL problem and on the iterates of the algorithms, which guarantee that the ISTA converges linearly. Numerically, the methods are demonstrated on simulated and real datasets to illustrate their robust and efficient performance relative to state-of-the-art algorithms.
For further computational improvement, note that the most expensive step in the algorithms is computing the matrix inverses required by the gradient of $f(\Theta)$; both algorithms have a complexity of $O(Kp^3)$ per iteration. More efficient matrix inversion algorithms with lower complexity could therefore be substituted. In addition, the faster computation procedure in [13], which decomposes the optimization problem, can be used as preprocessing for the proposed methods. Overall, the proposed methods are highly efficient for the joint graphical lasso problem.

Author Contributions

Conceptualization, J.C., R.S. and J.S.; methodology, J.C., R.S. and J.S.; software, J.C. and R.S.; validation, J.C., R.S. and J.S.; formal analysis, J.C., R.S. and J.S.; writing—original draft preparation, J.C. and J.S.; writing—review and editing, J.C., R.S. and J.S.; visualization, J.C.; supervision, J.S.; project administration, J.S.; funding acquisition, J.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Grant-in-Aid for Scientific Research (KAKENHI) C, Grant number: 18K11192.

Data Availability Statement

Publicly available datasets were analyzed in this paper. Presidential speeches dataset: https://www.presidency.ucsb.edu, accessed on 5 November 2021; Breast cancer dataset: https://www.rdocumentation.org/packages/doBy/versions/4.5-15/topics/breastcancer, accessed on 5 November 2021.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ADMM	alternating direction method of multipliers
FMGL	fused multiple graphical lasso algorithm
FP	false positive
G-ISTA	graphical iterative shrinkage-thresholding algorithm
GL	graphical lasso
ISTA	iterative shrinkage-thresholding algorithm
JGL	joint graphical lasso
M-ISTA	modified iterative shrinkage-thresholding algorithm
MSE	mean squared error
ROC	receiver operating characteristic
TP	true positive

Appendix A. Proofs of Propositions

Appendix A.1. Proof of Proposition 1

We first introduce the Lagrange dual problem of (15). By introducing the auxiliary variables $Z = \{Z^{(1)},\dots,Z^{(K)}\}$, we can rewrite the problem as follows:
$\min_{\Theta, Z}\ f(\Theta) + g(Z) \quad \text{subject to } Z = \Theta.$
Then, the Lagrangian of the above problem is given by:
$L(\Theta, Z, \Lambda) = f(\Theta) + g(Z) + \sum_{k=1}^K\langle\Lambda^{(k)}, \Theta^{(k)} - Z^{(k)}\rangle,$
where $\Lambda = \{\Lambda^{(1)},\dots,\Lambda^{(K)}\}$, $\Lambda^{(k)} \in \mathbb{R}^{p\times p}$, are the dual variables. To obtain the dual problem, we minimize over the primal variables as follows:
$\min_{\Theta, Z}L(\Theta, Z, \Lambda) = \min_\Theta\Big\{f(\Theta) + \sum_{k=1}^K\langle\Lambda^{(k)}, \Theta^{(k)}\rangle\Big\} - \max_Z\Big\{\sum_{k=1}^K\langle\Lambda^{(k)}, Z^{(k)}\rangle - g(Z)\Big\} = \min_\Theta\Big\{f(\Theta) + \sum_{k=1}^K\langle\Lambda^{(k)}, \Theta^{(k)}\rangle\Big\} - g^*(\Lambda) = \min_\Theta\Big\{\sum_{k=1}^K\langle\Lambda^{(k)} + n_kS^{(k)}, \Theta^{(k)}\rangle - \sum_{k=1}^K n_k\log\det\Theta^{(k)}\Big\} - g^*(\Lambda).$
Taking the derivative of
$L_1 := \sum_{k=1}^K\langle\Lambda^{(k)} + n_kS^{(k)}, \Theta^{(k)}\rangle - \sum_{k=1}^K n_k\log\det\Theta^{(k)}$
with respect to $\Theta^{(k)}$ and setting $\nabla_{\Theta^{(k)}}L_1 = 0$, we obtain
$n_kS^{(k)} + \Lambda^{(k)} = n_k(\Theta^{(k)})^{-1}$  (A1)
for $k = 1,\dots,K$. Substituting Equation (A1) into the dual problem $\min_{\Theta,Z}L(\Theta, Z, \Lambda)$, it becomes:
$\min_{\Theta,Z}L(\Theta, Z, \Lambda) = \sum_{k=1}^K n_kp + \sum_{k=1}^K n_k\log\det\Big(S^{(k)} + \frac{1}{n_k}\Lambda^{(k)}\Big) - g^*(\Lambda).$
Hence, we can obtain the duality gap [38] (the primal objective minus the dual objective) as follows:
$f(\Theta) + g(Z) - \sum_{k=1}^K n_kp + \sum_{k=1}^K n_k\log\det\Theta^{(k)} + g^*(\Lambda) = \sum_{k=1}^K n_k\,\mathrm{trace}(S^{(k)}\Theta^{(k)}) + g(Z) - \sum_{k=1}^K n_kp + g^*(\Lambda).$
When the gap is 0, the optimal solution is found. Because the conjugate function $g^*(\Lambda)$ is an indicator function, its value is 0 at the optimal solution.
Firstly, for the group penalty $P_G(\Theta)$, the duality gap at the optimum is
$\sum_{k=1}^K\Big[n_k\,\mathrm{trace}(S^{(k)}\Theta^{*(k)}) - n_kp\Big] + \lambda_1\sum_{k=1}^K\sum_{i\neq j}|\theta^*_{k,i,j}| + \lambda_2\sum_{i\neq j}\Big(\sum_{k=1}^K\theta^{*2}_{k,i,j}\Big)^{1/2} = 0.$  (A2)
From Equation (A2), we obtain
$\lambda_1\|\Theta^*\|_1 = -\sum_{k=1}^K n_k\,\mathrm{trace}(S^{(k)}\Theta^{*(k)}) - \lambda_2\sum_{i\neq j}\Big(\sum_{k=1}^K\theta^{*2}_{k,i,j}\Big)^{1/2} + \sum_{k=1}^K n_kp + \lambda_1\sum_{k=1}^K\sum_{i=1}^p|\theta^*_{k,i,i}| \le \sum_{k=1}^K n_kp + \lambda_1\sum_{k=1}^K\sum_{i=1}^p|\theta^*_{k,i,i}|.$
From Equation (A1), the diagonal elements satisfy
$\theta^*_{k,i,i} = \Big[\mathrm{diag}\Big(S^{(k)} + \frac{1}{n_k}\Lambda^{*(k)}\Big)\Big]_i^{-1},$
and the dual variables satisfy $\Lambda^*_{k,i,i} > 0$ for $k = 1,\dots,K$. Hence,
$\|\Theta^*\|_1 \le \frac{1}{\lambda_1}\sum_{k=1}^K n_kp + \sum_{k=1}^K\sum_{i=1}^p\Big[\mathrm{diag}\Big(S^{(k)} + \frac{1}{n_k}\Lambda^{*(k)}\Big)\Big]_i^{-1} \le \frac{1}{\lambda_1}\sum_{k=1}^K n_kp + \sum_{k=1}^K\sum_{i=1}^p\big[\mathrm{diag}(S^{(k)})\big]_i^{-1}.$
By $\|\Theta^*\|_2 \le \|\Theta^*\|_F \le \|\Theta^*\|_1$, we obtain the upper bound:
$\|\Theta^*\|_2 \le \|\Theta^*\|_F \le \frac{Np}{\lambda_1} + \sum_{k=1}^K\sum_{i=1}^p s_{k,i,i}^{-1}.$  (A3)
The proof is similar for the fused penalty $P_F(\Theta)$; therefore, we omit it here. Next, we prove the lower bound of $\Theta^*$.
Firstly, consider the group penalty $P_G(\Theta)$. Let $E^{(k)}$ be a non-negative $p\times p$ matrix satisfying $-E_{k,i,j} \le \theta_{k,i,j} \le E_{k,i,j}$, and introduce the Lagrange multipliers $\Gamma^{(k)}$ and $\Gamma_0^{(k)}$ for $k = 1,\dots,K$. This procedure is similar to that in [17].
Then, the new Lagrangian problem becomes
$\max_{\Theta, E}\min_{\Gamma, \Gamma_0}\ f(\Theta) - \lambda_1\sum_{k=1}^K\sum_{i\neq j}E_{k,i,j} - \lambda_2\sum_{i\neq j}\Big(\sum_{k=1}^KE_{k,i,j}^2\Big)^{1/2} - \sum_{k=1}^K\Big\{\mathrm{tr}(\Gamma^{(k)}\Theta^{(k)}) - \mathrm{tr}\big(\mathrm{abs}(\Gamma^{(k)})E^{(k)}\big) - \mathrm{tr}(\Gamma_0^{(k)}E^{(k)})\Big\}.$
Taking derivatives with respect to $\Theta^{(k)}$ and $E_{k,i,j}$, we obtain the following equations:
$n_k(\Theta^{(k)})^{-1} - n_kS^{(k)} - \Gamma^{(k)} = 0,$  (A4)
$-\lambda_1 - \lambda_2\dfrac{E_{k,i,j}}{\sqrt{\sum_{k=1}^KE_{k,i,j}^2}} + |\Gamma_{k,i,j}| + \Gamma_0^{(k)} = 0, \quad \text{for } i \neq j,$  (A5)
$|\Gamma_{k,i,j}| + \Gamma_0^{(k)} = 0, \quad \text{for } i = j.$  (A6)
When $i \neq j$, from Equation (A5),
$|\Gamma_{k,i,j}| \le \lambda_1 + \lambda_2\dfrac{E_{k,i,j}}{\sqrt{\sum_{k=1}^KE_{k,i,j}^2}} \;\Longrightarrow\; |\Gamma_{k,i,j}|^2 \le \Big(\lambda_1 + \lambda_2\dfrac{E_{k,i,j}}{\sqrt{\sum_{k=1}^KE_{k,i,j}^2}}\Big)^2 = \lambda_1^2 + 2\lambda_1\lambda_2\dfrac{E_{k,i,j}}{\sqrt{\sum_{k=1}^KE_{k,i,j}^2}} + \lambda_2^2\dfrac{E_{k,i,j}^2}{\sum_{k=1}^KE_{k,i,j}^2} \le \lambda_1^2 + 2\lambda_1\lambda_2 + \lambda_2^2\dfrac{E_{k,i,j}^2}{\sum_{k=1}^KE_{k,i,j}^2}.$
Summing over $k$,
$\sum_{k=1}^K|\Gamma_{k,i,j}|^2 \le K\lambda_1^2 + 2K\lambda_1\lambda_2 + \lambda_2^2.$  (A7)
Then,
$\sqrt{\sum_{k=1}^K|\Gamma_{k,i,j}|^2} \le \sqrt{K\lambda_1^2 + 2K\lambda_1\lambda_2 + \lambda_2^2}.$
From (A4) and (A7), we have the following relationship:
$\|(\Theta^{(k)})^{-1}\|_2 = \Big\|\frac{1}{n_k}\Gamma^{(k)} + S^{(k)}\Big\|_2 \le \frac{1}{n_k}\|\Gamma^{(k)}\|_2 + \|S^{(k)}\|_2 \le \frac{p}{n_k}\max_{i,j}|\Gamma_{k,i,j}| + \|S^{(k)}\|_2 \le \frac{p}{n_k}\max_k\max_{i,j}|\Gamma_{k,i,j}| + \|S^{(k)}\|_2 \le \frac{p\sqrt{K\lambda_1^2 + 2K\lambda_1\lambda_2 + \lambda_2^2}}{n_k} + \|S^{(k)}\|_2.$
The last inequality holds because
$\max_k\max_{i,j}|\Gamma_{k,i,j}| \le \sqrt{\sum_{k=1}^K|\Gamma_{k,i,j}|^2}.$
We only consider the case $i \neq j$ for $\max_{i,j}|\Gamma_{k,i,j}|$ because, from Equations (A5) and (A6), we know $|\Gamma_{k,i,j}| > |\Gamma_{k,i,i}|$. Overall, the lower bound is
$\|\Theta^{*(k)}\|_2 \ge \dfrac{n_k}{p\sqrt{K\lambda_1^2 + 2K\lambda_1\lambda_2 + \lambda_2^2} + n_k\|S^{(k)}\|_2}.$
The lower bound for the fused penalty can be derived in a similar way.

Appendix A.2. Proof of Proposition 2

By Equation (22) and the convexity of $F(\Theta)$, it is easy to obtain
$\|\Theta_t - \Theta^*\|_F \le \|\Theta_0 - \Theta^*\|_F.$
Since $\|\cdot\|_2 \le \|\cdot\|_F$, we have
$\|\Theta_t\|_2 - \|\Theta^*\|_2 \le \|\Theta_t - \Theta^*\|_2 \le \|\Theta_t - \Theta^*\|_F \le \|\Theta_0 - \Theta^*\|_F.$
Hence,
$\|\Theta_t\|_2 \le \|\Theta_0 - \Theta^*\|_F + \|\Theta^*\|_2 \le \|\Theta_0\|_F + 2\|\Theta^*\|_F.$
Then, by Equation (A3), we can complete the proof of the upper bound.
To prove the lower bound, denote
$a_t^{(k)} = \lambda_{\min}(\Theta_t^{(k)}), \qquad (a_t)_l = \min_{k=1,\dots,K}a_t^{(k)}.$
By the definition of the matrix norm, we have
$\|\Theta_t^{(k)}\|_2 \ge a_t^{(k)} \ge (a_t)_l.$
Denote the upper bound of $\|\Theta_t\|_2$ by $M$, and that of $\|\Theta_t^{(k)}\|_2$ by $M^{(k)}$, for $k = 1,\dots,K$. By the definition of the tensor norm, we have $M \ge \|\Theta_t\|_2 \ge \|\Theta_t^{(k)}\|_2 \ge (a_t)_l$.
Let the constant $C_1 := f(\Theta_0) + g(\Theta_0)$. By Equation (22), we have
$C_1 \ge f(\Theta_t) + g(\Theta_t).$
Note that $S \succeq 0$ and $\Theta_t \succ 0$ imply $\mathrm{tr}(S\Theta_t) \ge 0$, and because $g(\Theta_t) \ge 0$,
$C_1 \ge -\sum_{k=1}^K n_k\log\det\Theta_t^{(k)} = -\sum_{k=1}^K n_k\log\Big(\prod_{i=1}^p\lambda_i\Big).$
Let the eigenvalues of $\Theta_t^{(k)}$ be $\lambda_1 \le \lambda_2 \le \dots \le \lambda_p$. Then $a_t^{(k)} = \lambda_1$ and $\lambda_p \le M^{(k)}$; hence,
$\prod_{i=1}^p\lambda_i = a_t^{(k)}\cdot\lambda_2\cdots\lambda_p \le a_t^{(k)}\cdot\big(M^{(k)}\big)^{p-1}.$
Then,
$-\sum_{k=1}^K n_k\log\Big(\prod_{i=1}^p\lambda_i\Big) \ge -\sum_{k=1}^K n_k\Big\{\log a_t^{(k)} + (p-1)\log M^{(k)}\Big\}.$
Let $n_x$ denote the coefficient $n_k$ of the term containing $(a_t)_l$ in $\sum_{k=1}^K n_k\log a_t^{(k)}$; then
$\sum_{k=1}^K n_k\log a_t^{(k)} = n_x\log(a_t)_l + \sum_{k\neq x}n_k\log a_t^{(k)}.$
Because $M^{(k)} \le M$ and denoting $n_m = \max_{k=1,\dots,K}n_k$, we have
$\sum_{k=1}^K n_k\log a_t^{(k)} \le n_m\log(a_t)_l + n_m(K-1)\log M.$
Hence,
$C_1 \ge -\sum_{k=1}^K n_k\log\Big(\prod_{i=1}^p\lambda_i\Big) \ge -\sum_{k=1}^K n_k\Big\{\log a_t^{(k)} + (p-1)\log M^{(k)}\Big\} \ge -n_m\log(a_t)_l - n_m(K-1)\log M - Kn_m(p-1)\log M.$
Then, we obtain
$\log(a_t)_l \ge -K(p-1)\log M - (K-1)\log M - \frac{C_1}{n_m} \;\Longrightarrow\; (a_t)_l \ge e^{(1-Kp)\log M - \frac{C_1}{n_m}}.$
Hence, the lower bound is proved:
$\|\Theta_t\|_2 \ge \|\Theta_t^{(k)}\|_2 \ge (a_t)_l \ge e^{-\frac{C_1}{n_m}}M^{(1-Kp)}.$

Appendix B. Data Generation

We generate $n_k$ independent and identically distributed observations from a multivariate normal distribution $N\{0, (\Theta^{(k)})^{-1}\}$, where $\Theta^{(k)}$ is the inverse covariance matrix of the $k$-th category. Specifically, we generate $p$ points randomly in a unit space and calculate their pairwise distances. Then, we find the $m$ nearest neighbors of each point with respect to this distance and connect any two points that are $m$-nearest neighbors of each other. The integer $m$ determines the degree of sparsity of the data; $m$ ranges from 4 to 9 in our experiments.
Additionally, we add heterogeneity to the common structure by building extra individual connections in the following way: we randomly choose a pair of symmetric zero elements, $\theta_{k,i,j} = \theta_{k,j,i} = 0$, and replace them with a value uniformly generated from the interval $[-1, -0.5]\cup[0.5, 1]$. This operation is repeated $M/2$ times, where $M$ is the number of off-diagonal nonzero elements in $\Theta^{(k)}$.
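A rough Python sketch of this generation scheme is given below. The weights of the common $m$-nearest-neighbor structure and the way positive definiteness is enforced are not fully specified above, so the uniform weights on the common edges and the diagonal-loading step are our own assumptions rather than the exact recipe of [41].

```python
import numpy as np

def generate_jgl_data(p, n_per_class, K, m=5, seed=0):
    """Sketch: common mutual m-nearest-neighbor structure plus class-specific extra edges."""
    rng = np.random.default_rng(seed)

    def weights(size):                                   # values in [-1, -0.5] U [0.5, 1]
        return rng.uniform(0.5, 1.0, size) * rng.choice([-1.0, 1.0], size)

    pts = rng.uniform(size=(p, 2))                       # p random points in the unit square
    D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    nn = np.argsort(D, axis=1)[:, 1:m + 1]               # m nearest neighbors of each point
    common = np.zeros((p, p))
    for i in range(p):
        for j in nn[i]:
            if i in nn[j] and common[i, j] == 0:          # connect mutual m-nearest neighbors
                common[i, j] = common[j, i] = weights(1)[0]
    M = np.count_nonzero(common)                          # off-diagonal nonzeros in the base graph

    Thetas, X, y = [], [], []
    for k in range(1, K + 1):
        Theta = common.copy()
        zero_pairs = np.argwhere((Theta == 0) & ~np.eye(p, dtype=bool))
        zero_pairs = zero_pairs[zero_pairs[:, 0] < zero_pairs[:, 1]]
        pick = zero_pairs[rng.choice(len(zero_pairs), size=M // 2, replace=False)]
        w = weights(len(pick))
        Theta[pick[:, 0], pick[:, 1]] = w                 # class-specific extra edges
        Theta[pick[:, 1], pick[:, 0]] = w
        Theta += (np.abs(np.linalg.eigvalsh(Theta)).max() + 0.1) * np.eye(p)  # force PD (assumption)
        Thetas.append(Theta)
        Xk = rng.multivariate_normal(np.zeros(p), np.linalg.inv(Theta), size=n_per_class)
        X.append(Xk)
        y.append(np.full(n_per_class, k))
    return Thetas, np.vstack(X), np.concatenate(y)
```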

References

  1. Lauritzen, S.L. Graphical Models; Clarendon Press: Oxford, UK, 1996; Volume 17. [Google Scholar]
  2. Meinshausen, N.; Bühlmann, P. High-dimensional graphs and variable selection with the lasso. Ann. Stat. 2006, 34, 1436–1462. [Google Scholar] [CrossRef] [Green Version]
  3. Yuan, M.; Lin, Y. Model selection and estimation in the Gaussian graphical model. Biometrika 2007, 94, 19–35. [Google Scholar] [CrossRef] [Green Version]
  4. Friedman, J.; Hastie, T.; Tibshirani, R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics 2008, 9, 432–441. [Google Scholar] [CrossRef] [Green Version]
  5. Banerjee, O.; El Ghaoui, L.; d’Aspremont, A. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. J. Mach. Learn. Res. 2008, 9, 485–516. [Google Scholar]
  6. Rothman, A.J.; Bickel, P.J.; Levina, E.; Zhu, J. Sparse permutation invariant covariance estimation. Electron. J. Stat. 2008, 2, 494–515. [Google Scholar] [CrossRef]
  7. Banerjee, O.; Ghaoui, L.E.; d’Aspremont, A.; Natsoulis, G. Convex optimization techniques for fitting sparse Gaussian graphical models. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA, 25–29 June 2006; pp. 89–96. [Google Scholar]
  8. Xue, L.; Ma, S.; Zou, H. Positive-definite l1-penalized estimation of large covariance matrices. J. Am. Stat. Assoc. 2012, 107, 1480–1491. [Google Scholar] [CrossRef] [Green Version]
  9. Mazumder, R.; Hastie, T. The graphical lasso: New insights and alternatives. Electron. J. Stat. 2012, 6, 2125. [Google Scholar] [CrossRef]
  10. Guillot, D.; Rajaratnam, B.; Rolfs, B.T.; Maleki, A.; Wong, I. Iterative thresholding algorithm for sparse inverse covariance estimation. arXiv 2012, arXiv:1211.2532. [Google Scholar]
  11. d’Aspremont, A.; Banerjee, O.; El Ghaoui, L. First-order methods for sparse covariance selection. SIAM J. Matrix Anal. Appl. 2008, 30, 56–66. [Google Scholar] [CrossRef] [Green Version]
  12. Hsieh, C.J.; Sustik, M.A.; Dhillon, I.S.; Ravikumar, P. QUIC: Quadratic approximation for sparse inverse covariance estimation. J. Mach. Learn. Res. 2014, 15, 2911–2947. [Google Scholar]
  13. Danaher, P.; Wang, P.; Witten, D.M. The joint graphical lasso for inverse covariance estimation across multiple classes. J. R. Stat. Soc. Ser. B Stat. Methodol. 2014, 76, 373. [Google Scholar] [CrossRef]
  14. Honorio, J.; Samaras, D. Multi-Task Learning of Gaussian Graphical Models; ICML: Baltimore, MA, USA, 2010. [Google Scholar]
  15. Guo, J.; Levina, E.; Michailidis, G.; Zhu, J. Joint estimation of multiple graphical models. Biometrika 2011, 98, 1–15. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  16. Zhang, B.; Wang, Y. Learning structural changes of Gaussian graphical models in controlled experiments. arXiv 2012, arXiv:1203.3532. [Google Scholar]
  17. Hara, S.; Washio, T. Learning a common substructure of multiple graphical Gaussian models. Neural Netw. 2013, 38, 23–38. [Google Scholar] [CrossRef] [Green Version]
  18. Glowinski, R.; Marroco, A. Sur l’approximation, par éléments finis d’ordre un, et la résolution, par pénalisation-dualité d’une classe de problèmes de Dirichlet non linéaires. ESAIM Math. Model. Numer. Anal.-Modél. Math. Et Anal. Numér. 1975, 9, 41–76. [Google Scholar] [CrossRef]
  19. Tang, Q.; Yang, C.; Peng, J.; Xu, J. Exact hybrid covariance thresholding for joint graphical lasso. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases; Springer: Berlin/Heidelberg, Germany, 2015; pp. 593–607. [Google Scholar]
  20. Hallac, D.; Park, Y.; Boyd, S.; Leskovec, J. Network inference via the time-varying graphical lasso. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, 13–17 August 2017; pp. 205–213. [Google Scholar]
  21. Gibberd, A.J.; Nelson, J.D. Regularized estimation of piecewise constant gaussian graphical models: The group-fused graphical lasso. J. Comput. Graph. Stat. 2017, 26, 623–634. [Google Scholar] [CrossRef] [Green Version]
  22. Boyd, S.; Parikh, N.; Chu, E. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers; Now Publishers Inc.: Norwell, MA, USA, 2011. [Google Scholar]
  23. Yang, S.; Lu, Z.; Shen, X.; Wonka, P.; Ye, J. Fused multiple graphical lasso. SIAM J. Optim. 2015, 25, 916–943. [Google Scholar] [CrossRef] [Green Version]
  24. Tran-Dinh, Q.; Kyrillidis, A.; Cevher, V. Composite self-concordant minimization. J. Mach. Learn. Res. 2015, 16, 371–416. [Google Scholar]
  25. Beck, A.; Teboulle, M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2009, 2, 183–202. [Google Scholar] [CrossRef] [Green Version]
  26. Nesterov, Y.; Nemirovskii, A. Interior-Point Polynomial Algorithms in Convex Programming; SIAM: Philadelphia, PA, USA, 1994. [Google Scholar]
  27. Renegar, J. A Mathematical View of Interior-Point Methods in Convex Optimization; SIAM: Philadelphia, PA, USA, 2001. [Google Scholar]
  28. Nesterov, Y. Introductory Lectures on Convex Optimization: A Basic Course; Springer Science & Business Media: New York, NY, USA, 2003; Volume 87. [Google Scholar]
  29. Tibshirani, R.; Saunders, M.; Rosset, S.; Zhu, J.; Knight, K. Sparsity and smoothness via the fused lasso. J. R. Stat. Soc. Ser. B 2005, 67, 91–108. [Google Scholar] [CrossRef] [Green Version]
  30. Simon, N.; Friedman, J.; Hastie, T.; Tibshirani, R. A sparse-group lasso. J. Comput. Graph. Stat. 2013, 22, 231–245. [Google Scholar] [CrossRef]
  31. Friedman, J.; Hastie, T.; Tibshirani, R. A note on the group lasso and a sparse group lasso. arXiv 2010, arXiv:1001.0736. [Google Scholar]
  32. Hoefling, H. A path algorithm for the fused lasso signal approximator. J. Comput. Graph. Stat. 2010, 19, 984–1006. [Google Scholar] [CrossRef] [Green Version]
  33. Friedman, J.; Hastie, T.; Höfling, H.; Tibshirani, R. Pathwise coordinate optimization. Ann. Appl. Stat. 2007, 1, 302–332. [Google Scholar] [CrossRef] [Green Version]
  34. Tibshirani, R.J.; Taylor, J. The solution path of the generalized lasso. Ann. Stat. 2011, 39, 1335–1371. [Google Scholar] [CrossRef] [Green Version]
  35. Johnson, N.A. A dynamic programming algorithm for the fused lasso and l 0-segmentation. J. Comput. Graph. Stat. 2013, 22, 246–260. [Google Scholar] [CrossRef]
  36. Yuan, M.; Lin, Y. Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B 2006, 68, 49–67. [Google Scholar] [CrossRef]
  37. Suzuki, J. Sparse Estimation with Math and R: 100 Exercises for Building Logic; Springer Nature: Berlin/Heidelberg, Germany, 2021. [Google Scholar]
  38. Boyd, S.; Boyd, S.P.; Vandenberghe, L. Convex Optimization; Cambridge University Press: Cambridge, UK, 2004. [Google Scholar]
  39. Barzilai, J.; Borwein, J.M. Two-point step size gradient methods. IMA J. Numer. Anal. 1988, 8, 141–148. [Google Scholar] [CrossRef]
  40. Nemirovski, A. Interior point polynomial time methods in convex programming. Lect. Notes 2004, 42, 3215–3224. [Google Scholar]
  41. Li, H.; Gui, J. Gradient directed regularization for sparse Gaussian concentration graphs, with applications to inference of genetic networks. Biostatistics 2006, 7, 302–317. [Google Scholar] [CrossRef]
  42. Weylandt, M.; Nagorski, J.; Allen, G.I. Dynamic visualization and fast computation for convex clustering via algorithmic regularization. J. Comput. Graph. Stat. 2020, 29, 87–96. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  43. Shannon, P.; Markiel, A.; Ozier, O.; Baliga, N.S.; Wang, J.T.; Ramage, D.; Amin, N.; Schwikowski, B.; Ideker, T. Cytoscape: A software environment for integrated models of biomolecular interaction networks. Genome Res. 2003, 13, 2498–2504. [Google Scholar] [CrossRef] [PubMed]
  44. Miller, L.D.; Smeds, J.; George, J.; Vega, V.B.; Vergara, L.; Ploner, A.; Pawitan, Y.; Hall, P.; Klaar, S.; Liu, E.T.; et al. An expression signature for p53 status in human breast cancer predicts mutation status, transcriptional effects, and patient survival. Proc. Natl. Acad. Sci. USA 2005, 102, 13550–13555. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Figure 1. Plot of time comparison under different $p$. Setting $\lambda_1 = 0.1$, $\lambda_2 = 0.05$, $K = 2$, and $N = 200$.
Figure 2. Plot of true positive edges vs. false positive edges selected. Setting $p = 50$, $K = 2$. (a) The fused penalty; (b) the group penalty.
Figure 3. Plot of the mean squared errors vs. total edges selected. Setting $p = 50$, $K = 2$. (a) The fused penalty; (b) the group penalty.
Figure 4. Plot of $\log(F(\Theta_t) - F(\Theta^*))$ vs. the number of iterations with different $\lambda_1$ values. Setting $p = 200$, $N = 400$, $K = 2$, and $\lambda_2 = 0.05$.
Figure 5. Network figure of the words in the presidential speeches dataset.
Table 1. Efficient JGL procedures.

Model    | ADMM | Proximal Newton          | Proximal Gradient
GL [4]   | [8]  | [12]                     | [10]
JGL [13] | [13] | [23] (for fused penalty) | Current Paper (for fused and group penalties)
Table 2. Computational time under different settings.

p   | K | N   | λ1   | λ2    | precision ϵ | ADMM      | FMGL      | ISTA      | M-ISTA
20  | 2 | 60  | 0.1  | 0.05  | 0.00001     | 10.506 s  | 1.158 s   | 2.174 s   | 1.742 s
    | 3 |     |      |       |             | 1.879 min | 4.267 s   | 3.357 s   | 3.668 s
    | 5 |     | 1    | 0.5   |             | 1.123 min | 10.556 s  | 4.216 s   | 2.874 s
30  | 2 | 120 | 0.1  | 0.05  | 0.0001      | 10.095 s  | 5.259 s   | 2.690 s   | 4.857 s
    | 3 |     |      |       |             | 2.014 min | 38.562 s  | 14.722 s  | 31.870 s
    | 5 |     | 1    | 0.5   |             | 2.447 min | 15.819 s  | 22.431 s  | 12.113 s
50  | 2 | 600 | 0.02 | 0.005 | 0.0001      | 6.427 s   | 10.228 s  | 7.213 s   | 4.625 s
    |   |     | 0.03 |       |             | 6.240 s   | 8.925 s   | 6.645 s   | 4.023 s
    |   |     | 0.04 |       |             | 7.025 s   | 9.381 s   | 6.144 s   | 3.993 s
200 | 2 | 400 | 0.09 | 0.05  | 0.0001      | 4.050 min | 1.874 min | 2.289 min | 35.038 s
    |   |     | 0.1  |       |             | 4.569 min | 1.137 min | 1.340 min | 24.852 s
    |   |     | 0.12 |       |             | 3.848 min | 1.881 min | 1.443 min | 18.367 s
Table 3. Time comparison result of two real datasets.

Dataset       | λ1  | λ2     | Precision ϵ | ADMM      | FMGL      | ISTA      | M-ISTA
Speeches      | 0.1 | 0.05   | 0.0001      | 19.969 s  | 4.977 min | 11.829 s  | 12.867 s
              | 0.2 | 0.1    |             | 4.661 min | 3.209 min | 11.560 s  | 12.682 s
              | 0.5 | 0.25   |             | 5.669 min | 1.490 min | 11.043 s  | 12.788 s
Breast cancer | 0.1 | 0.0166 | 0.0001      | 3.809 min | 7.937 min | 1.305 min | 1.158 min
              | 0.2 | 0.02   |             | 6.031 min | 5.198 min | 1.503 min | 1.230 min
              | 0.3 | 0.03   |             | 5.499 min | 2.265 min | 1.188 min | 1.061 min
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
