Efficient Proximal Gradient Algorithms for Joint Graphical Lasso

We consider learning an undirected graphical model from sparse data. While several efficient algorithms have been proposed for the graphical lasso (GL), the alternating direction method of multipliers (ADMM) is the main approach taken for the joint graphical lasso (JGL). We propose proximal gradient procedures with and without a backtracking option for the JGL. These procedures are first-order methods and relatively simple, and the subproblems are solved efficiently in closed form. We further show the boundedness of the solution of the JGL problem and of the iterates in the algorithms. The numerical results indicate that the proposed algorithms can achieve high accuracy and precision, and their efficiency is competitive with state-of-the-art algorithms.


Introduction
Graphical models are widely used to describe the relationships among interacting objects [1]. Such models have been extensively used in various domains, such as bioinformatics, text mining, and social networks. The graph provides a visual way to understand the joint distribution of an entire set of variables.
In this paper, we consider learning Gaussian graphical models expressed by undirected graphs, which represent the relationships among continuous variables that follow a joint Gaussian distribution. In an undirected graph G = (V, E), the edge set E represents the conditional dependencies among the variables in the vertex set V.
Let X_1, . . . , X_p (p ≥ 1) be Gaussian variables with covariance matrix Σ ∈ R^{p×p}, and let Θ := Σ^{−1} be the precision matrix, if it exists. We remove the edges so that the variables X_i, X_j are conditionally independent given the other variables if and only if the (i, j)-th element θ_{i,j} of Θ is 0:

E := \{\{i, j\} \subseteq \{1, \ldots, p\} : i \neq j,\ \theta_{i,j} \neq 0\},

where each edge is expressed as a set of two elements in {1, . . . , p}. In this sense, constructing a Gaussian graphical model is equivalent to estimating a precision matrix.
Suppose that we estimate the undirected graph from data consisting of n tuples of p variables and that dimension p is much higher than sample size n. For example, if we have expression data of p = 20, 000 genes for n = 100 case/control patients, how can we construct a gene regulatory network structure from the data? It is almost impossible to estimate the locations of the nonzero elements in Θ by obtaining the inverse of sample covariance matrix S ∈ R p×p , which is the unbiased estimator of Σ. In fact, if p > n, then no inverse S −1 exists because the rank of S ∈ R p×p is, at most, n.
In order to address this situation, two directions have been suggested: 1. Sequentially find the variables on which each variable depends via regression so that the quasi-likelihood is maximized [2]. 2. Estimate the precision matrix directly by maximizing an ℓ1-regularized log-likelihood.
We follow the second approach, known as the graphical lasso (GL), because we assume Gaussian variables. The ℓ1-regularized log-likelihood problem is

\min_{\Theta \succ 0} \; -\log\det\Theta + \mathrm{trace}(S\Theta) + \lambda\|\Theta\|_1, \qquad (1)

where the tuning parameter λ controls the amount of sparsity, and ‖Θ‖_1 denotes the sum of the absolute values of the off-diagonal elements of Θ. Several optimization techniques [4,7-12] have been studied for the optimization problem (1). In particular, we consider a generalized version of the abovementioned GL. For example, suppose that the gene regulatory networks of thirty case and seventy control patients are different. One might construct a gene regulatory network separately for each of the two categories. However, estimating each on its own does not provide an advantage if a common structure is shared. Instead, we use all 100 samples to construct the two networks simultaneously. Intuitively speaking, using both types of data improves the reliability of the estimation by increasing the sample size for the genes that show similar values between case and control patients, while using only one type of data leads to a more accurate estimate for genes that show significantly different values. Ref. [13] proposed the joint graphical lasso (JGL) model by adding a further convex penalty (a fused or group lasso penalty) to the graphical lasso objective function over K classes. For example, K equals two for the case/control patients in the example above. The JGL includes the fused graphical lasso, whose fused lasso penalty encourages sparsity and similarity of the edge values across the K classes, and the group graphical lasso, whose group lasso penalty promotes a similar sparsity structure across the K graphs. Although there are several approaches to handling multiple graphical models, such as those of [14-17], the JGL is considered the most promising.
The main topic of this paper is improving the efficiency of solving the JGL problem. For the GL, relatively efficient solving procedures exist. If we differentiate the ℓ1-regularized log-likelihood (1) with respect to Θ, then we obtain an equation that can be solved [4]. Moreover, several improvements have been considered for the GL, such as proximal Newton [12] and proximal gradient [10] procedures. However, for the JGL, even if we derive such an equation, we have no efficient way of handling it.
Instead, the alternating direction method of multipliers (ADMM) [18], which is a procedure for solving convex optimization problems for general purposes, has been the main approach taken [13,19-21]. However, ADMM does not scale well with respect to the feature dimension p and the number of classes K, and it usually takes a long time to converge to a high-accuracy solution [22].
Regarding efficient procedures for solving the JGL problem, ref. [23] proposed a method based on the proximal Newton method, but only for the case in which the penalty term is a fused lasso. That method requires expensive computations of the Hessian matrix and the Newton directions, which makes it costly for high-dimensional problems.
In this paper, we propose efficient proximal-gradient-based algorithms to solve the JGL problem by extending the procedure in [10] and employing the step-size selection strategy proposed in [24]. Moreover, we provide the theoretical analysis of both methods for the JGL problem.
In our proximal gradient methods for the JGL problem, the proximal operator in each iteration is quite simple, which eases the implementation process and requires very little computation and memory at each step. Simulation experiments are used to justify our proposed methods over the existing ones.
Our main contributions are as follows:
• We propose efficient algorithms based on the proximal gradient method to solve the JGL problem. The algorithms are first-order methods and quite simple, and the subproblems can be solved efficiently with a closed-form solution. The numerical results indicate that the methods can achieve high accuracy and precision, and the computational time is competitive with state-of-the-art algorithms.
• We provide the boundedness of the solution to the JGL problem and of the iterates in the algorithms, which is related to the convergence rate of the algorithms. With the boundedness, we can guarantee that our proposed method converges linearly.
Table 1 summarizes the relationship between the proposed and existing methods. The remaining parts of this paper are organized as follows. In Section 2, we first provide the background of our proposed methods and introduce the joint graphical lasso problem. In Section 3, we describe the proposed algorithms in detail and provide a theoretical analysis. In Section 4, we report numerical results for the proposed approaches, including comparisons with efficient existing methods and performance evaluations. Finally, we draw some conclusions in Section 5.

Preliminaries
This section first reviews the graphical lasso (GL) problem and introduces the graphical iterative shrinkage-thresholding algorithm (G-ISTA) [10] to solve it. Then, we introduce the step-size selection strategy that we apply to the joint graphical lasso (JGL) in Section 3.2.

Graphical Lasso
Let x_1, . . . , x_n ∈ R^p be n ≥ 1 observations of dimension p ≥ 1 that follow the Gaussian distribution with mean µ ∈ R^p and covariance matrix Σ ∈ R^{p×p}, where, without loss of generality, we assume µ = 0. Let Θ := Σ^{−1}, and let the empirical covariance matrix be S := \frac{1}{n}\sum_{i=1}^{n} x_i^T x_i (treating each x_i as a row vector). Given a penalty parameter λ > 0, the graphical lasso (GL) is the procedure that finds the positive definite Θ ∈ R^{p×p} solving

\min_{\Theta \succ 0} \; -\log\det\Theta + \mathrm{trace}(S\Theta) + \lambda\|\Theta\|_1, \qquad (2)

where \|\Theta\|_1 = \sum_{j \neq k} |\theta_{j,k}|. If we regard V := {1, . . . , p} as a vertex set, then we can construct an undirected graph with edge set {{j, k} | θ_{j,k} ≠ 0}, where the set {j, k} denotes an undirected edge that connects the nodes j, k ∈ V.
If we take the subgradient of (2), then we find that the optimal solution Θ* satisfies the condition

-(\Theta^*)^{-1} + S + \lambda \Phi = 0,

where Φ = (Φ_{j,k}) is a subgradient of \|\Theta\|_1 at Θ*, i.e., Φ_{j,k} = \mathrm{sgn}(\theta^*_{j,k}) if θ^*_{j,k} ≠ 0, Φ_{j,k} ∈ [−1, 1] if θ^*_{j,k} = 0 (j ≠ k), and Φ_{j,j} = 0.

ISTA for Graphical Lasso
In this subsection, we introduce the method for solving the GL problem (2) by the iterative shrinkage-thresholding algorithm (ISTA) proposed by [10], which is a proximal gradient method usually employed in dealing with nondifferentiable composite optimization problems.
Specifically, the general ISTA solves the composite optimization problem

\min_{x} \; F(x) := f(x) + g(x),

where f and g are convex, with f differentiable and g possibly nondifferentiable. For the GL problem (2), we denote f, g : R^{p×p} → R as

f(\Theta) := -\log\det\Theta + \mathrm{trace}(S\Theta) \quad \text{and} \quad g(\Theta) := \lambda\|\Theta\|_1.
If we define the quadratic approximation Q_η by

Q_\eta(\Theta, \Theta_t) := f(\Theta_t) + \langle \nabla f(\Theta_t), \Theta - \Theta_t \rangle + \frac{1}{2\eta}\|\Theta - \Theta_t\|_F^2 + g(\Theta),

then we can describe the ISTA as a procedure that, given an initial value Θ_0, iterates

\Theta_{t+1} := \mathrm{prox}_{\eta_t g}\bigl(\Theta_t - \eta_t \nabla f(\Theta_t)\bigr), \qquad (7)

where the value of the step size η_t > 0 may change at each iteration t = 1, 2, . . . for efficient convergence, and we use the proximal operator

\mathrm{prox}_{\eta g}(X) := \arg\min_{\Theta} \Bigl\{ \frac{1}{2\eta}\|\Theta - X\|_F^2 + g(\Theta) \Bigr\}. \qquad (8)

Note that the proximal operator of the function g = λ\|\Theta\|_1 is the soft-thresholding operator: each off-diagonal element θ_{i,j} with i ≠ j becomes either θ_{i,j} − sgn(θ_{i,j})λ or zero (if |θ_{i,j}| < λ). We use the following function for the operator in Section 3:

S_\lambda(x) := \mathrm{sgn}(x)\,(|x| - \lambda)_+,

where (x)_+ := max(x, 0).
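As an illustration, the following is a minimal NumPy sketch of this soft-thresholding proximal step for the GL problem (our experiments use an R implementation; the function names and demonstration values here are illustrative only):

```python
import numpy as np

def soft_threshold(x, lam):
    # Elementwise soft-thresholding S_lambda(x) = sgn(x) * (|x| - lambda)_+.
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def prox_l1_offdiag(A, lam):
    # Proximal operator of lambda * ||Theta||_1 when the norm sums only the
    # off-diagonal entries: shrink the off-diagonal elements, keep the diagonal.
    P = soft_threshold(A, lam)
    np.fill_diagonal(P, np.diag(A))
    return P

# One proximal gradient step (7) for the GL problem, using
# grad f(Theta) = S - inv(Theta):
rng = np.random.default_rng(0)
S = np.cov(rng.standard_normal((50, 5)), rowvar=False)
Theta = np.eye(5)
eta, lam = 0.1, 0.2
grad = S - np.linalg.inv(Theta)
Theta_next = prox_l1_offdiag(Theta - eta * grad, eta * lam)
```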
It is known that if we choose η_t = 1/L at each step of the ISTA, where L is a Lipschitz constant of ∇f, i.e.,

\|\nabla f(X) - \nabla f(Y)\|_F \le L \|X - Y\|_F, \qquad (10)

then the convergence rate for minimizing F(·) is at most [25]

F(\Theta_t) - F(\Theta^*) \le \frac{L\|\Theta_0 - \Theta^*\|_F^2}{2t}.

However, for the GL problem (2), we know neither the exact value of the Lipschitz constant L nor any nontrivial upper bound. Ref. [10] implements a backtracking line search option in Step 1 of Algorithm 1 below to handle this issue.
The backtracking line search enables us to compute the η_t value for each time t = 1, 2, . . . by repeatedly multiplying η_t by a constant c ∈ (0, 1) until Θ_{t+1} ≻ 0 (Θ_{t+1} is positive definite) and condition (12) holds for the Θ_{t+1} in (7). Additionally, (12) is a sufficient condition for (11), which was derived in [25] (see the relationship between Lemma 2.3 and Theorem 3.1 in [25]). The whole procedure is given in Algorithm 1.
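The backtracking step can be sketched as follows (a simplified illustration of one iteration; the exact conditions (11) and (12) and constants used in Algorithm 1 follow [10,25], while the choices below, e.g., c = 0.5 and the Cholesky-based positive-definiteness check, are assumptions for the sketch):

```python
import numpy as np

def prox_l1_offdiag(A, lam):
    # Soft-threshold the off-diagonal entries; leave the diagonal untouched.
    P = np.sign(A) * np.maximum(np.abs(A) - lam, 0.0)
    np.fill_diagonal(P, np.diag(A))
    return P

def is_pos_def(X):
    # Check positive definiteness via a Cholesky factorization attempt.
    try:
        np.linalg.cholesky(X)
        return True
    except np.linalg.LinAlgError:
        return False

def gl_smooth(Theta, S):
    # Smooth part f(Theta) = -log det Theta + trace(S Theta).
    _, logdet = np.linalg.slogdet(Theta)
    return -logdet + np.trace(S @ Theta)

def gista_backtracking_step(Theta, S, lam, eta, c=0.5):
    # Shrink eta until the candidate is positive definite and satisfies a
    # quadratic sufficient-decrease bound on the smooth part f.
    grad = S - np.linalg.inv(Theta)
    f_old = gl_smooth(Theta, S)
    while True:
        cand = prox_l1_offdiag(Theta - eta * grad, eta * lam)
        diff = cand - Theta
        bound = f_old + np.sum(grad * diff) + np.sum(diff ** 2) / (2.0 * eta)
        if is_pos_def(cand) and gl_smooth(cand, S) <= bound:
            return cand, eta
        eta *= c
```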

Composite Self-Concordant Minimization
The notion of a self-concordant function was proposed in [26-28]. In the following, we say that a three times differentiable convex function f is self-concordant if |φ'''(t)| ≤ 2 φ''(t)^{3/2} for φ(t) := f(x + tv), every x in the domain of f, and every direction v. Reference [24] considered a composite version of self-concordant minimization and provided a way to efficiently compute the step size of the proximal gradient method for the GL problem without relying on the Lipschitz gradient assumption in (10). They proved that the smooth part of (2) is self-concordant and considered the minimization of F(x) := f(x) + g(x), where f is convex, differentiable, and self-concordant, and g is convex and possibly nondifferentiable. As in Algorithm 1, but without using the backtracking line search, we can compute the direction d_t with initial step size η_t as

d_t := \mathrm{prox}_{\eta_t g}\bigl(\Theta_t - \eta_t \nabla f(\Theta_t)\bigr) - \Theta_t, \qquad (13)

where the operator prox is defined by (8). Then, we use a modified step size α_t, determined by the direction d_t, to update Θ_{t+1} := Θ_t + α_t d_t. After defining two parameters related to the direction, one of which is denoted λ_t, the modified step size α_t can be obtained from them by (14). By Lemma 12 in [24], if the modified step size satisfies α_t ∈ (0, 1], then it ensures a decrease in the objective function and guarantees convergence of the proximal gradient scheme. From (14), if λ_t ≥ 1, then the condition α_t ∈ (0, 1] is satisfied. Therefore, we only need to check the case λ_t < 1. If the condition α_t ∈ (0, 1] does not hold, we can change the value of the initial η_t (e.g., by bisection) to influence the value of d_t in (13) until the condition is satisfied.

Joint Graphical Lasso
Suppose that we observe N samples (x_1, y_1), . . . , (x_N, y_N), where x_i ∈ R^p and y_i ∈ {1, . . . , K} indicates the class to which the i-th sample belongs, and where each x_i is a row vector. Let n_k be the number of occurrences in y_1, . . . , y_N such that y_i = k, so that \sum_{k=1}^{K} n_k = N, and let S^{(k)} ∈ R^{p×p} be the empirical covariance matrix computed from the samples of class k. The joint graphical lasso (JGL) [13] estimates the K precision matrices Θ := {Θ^{(1)}, . . . , Θ^{(K)}} by solving

\min_{\Theta^{(k)} \succ 0,\; k=1,\ldots,K} \; \sum_{k=1}^{K} n_k \bigl( \mathrm{trace}(S^{(k)}\Theta^{(k)}) - \log\det\Theta^{(k)} \bigr) + P(\Theta), \qquad (15)

where the penalty P(Θ) is either the fused lasso penalty

P_F(\Theta) := \lambda_1 \sum_{k=1}^{K} \sum_{i \neq j} |\theta_{k,i,j}| + \lambda_2 \sum_{k < k'} \sum_{i,j} |\theta_{k,i,j} - \theta_{k',i,j}|

or the group lasso penalty

P_G(\Theta) := \lambda_1 \sum_{k=1}^{K} \sum_{i \neq j} |\theta_{k,i,j}| + \lambda_2 \sum_{i \neq j} \Bigl( \sum_{k=1}^{K} \theta_{k,i,j}^2 \Bigr)^{1/2},

with tuning parameters λ_1, λ_2 ≥ 0 and θ_{k,i,j} denoting the (i, j)-th element of Θ^{(k)}.
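For concreteness, the class-wise quantities n_k and S^{(k)} can be formed from the labeled data as in the following sketch (variable names are illustrative; we center each class, which is immaterial under the zero-mean assumption):

```python
import numpy as np

def class_covariances(X, y, K):
    # X: N x p data matrix (rows are observations); y: length-N array of class
    # labels in {1, ..., K}. Returns the per-class sample sizes n_k and the
    # empirical covariance matrices S^(k).
    S, n = [], []
    for k in range(1, K + 1):
        Xk = X[y == k]
        Xk = Xk - Xk.mean(axis=0)     # centering (mu = 0 is assumed in the model)
        n_k = Xk.shape[0]
        S.append(Xk.T @ Xk / n_k)     # p x p empirical covariance of class k
        n.append(n_k)
    return S, n
```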

The Proposed Methods
In this section, we propose two efficient algorithms for solving the JGL problem. One is an extended ISTA based on the G-ISTA in Section 2.2, and the other is based on the step-size selection strategy introduced in Section 2.3.

ISTA for the JGL Problem
To describe the JGL problem, we define f and g as

f(\Theta) := \sum_{k=1}^{K} n_k \bigl( \mathrm{trace}(S^{(k)}\Theta^{(k)}) - \log\det\Theta^{(k)} \bigr) \quad \text{and} \quad g(\Theta) := P(\Theta). \qquad (16)

Then, problem (15) reduces to

\min_{\Theta^{(k)} \succ 0,\; k=1,\ldots,K} \; F(\Theta) := f(\Theta) + g(\Theta),

where the function f is convex and differentiable, and g is convex and nondifferentiable. Therefore, the ISTA is applicable to the JGL problem (15). The main difference between the G-ISTA and the proposed method is that the latter needs to consider the K categories of graphical models in the JGL problem (15) simultaneously. Moreover, there are two combined penalties in g(Θ), which complicate the proximal operator in the ISTA procedure. Consequently, the operator for the proposed method is not the simple soft-thresholding operator that it is for the G-ISTA method.
If we define the quadratic approximation

Q_{\eta_t}(\Theta, \Theta_t) := f(\Theta_t) + \langle \nabla f(\Theta_t), \Theta - \Theta_t \rangle + \frac{1}{2\eta_t}\|\Theta - \Theta_t\|_F^2 + g(\Theta),

then the update iteration is simplified as

\Theta_{t+1} := \mathrm{prox}_{\eta_t g}\bigl(\Theta_t - \eta_t \nabla f(\Theta_t)\bigr),

where the gradient is given blockwise by \nabla_{\Theta^{(k)}} f(\Theta) = n_k\bigl(S^{(k)} - (\Theta^{(k)})^{-1}\bigr). Nevertheless, the Lipschitz gradient constant of f(Θ) is unknown over the whole domain of the JGL problem. Therefore, our approach needs a backtracking line search to calculate the step size η_t. We show the details in Algorithm 2.
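In code, the gradient and one ISTA update take the following form (a sketch only; prox_penalty stands for either of the proximal operators derived in the next two subsections, e.g., supplied via functools.partial with fixed λ1 and λ2):

```python
import numpy as np

def jgl_gradient(Thetas, Ss, ns):
    # Blockwise gradient of f: for each class k, n_k (S^(k) - inv(Theta^(k))).
    return [n * (S - np.linalg.inv(T)) for T, S, n in zip(Thetas, Ss, ns)]

def ista_step(Thetas, Ss, ns, eta, prox_penalty):
    # One ISTA update for the JGL problem: a gradient step on every class,
    # followed by the joint proximal map of the fused or group penalty.
    grads = jgl_gradient(Thetas, Ss, ns)
    A = [T - eta * G for T, G in zip(Thetas, grads)]
    return prox_penalty(A, eta)
```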
In the update of Θ_{t+1}, we need to compute the proximal operators for the fused and group lasso penalties. In the following, for each of them, the problem is divided into fused lasso problems [29] and group lasso problems [30,31], respectively. We apply the solutions given by (20) and (21) below.

Fused Lasso Penalty P F
By the definition of the proximal operator in the update step, we have

\Theta_{t+1} = \arg\min_{\Theta} \Bigl\{ \frac{1}{2\eta_t} \sum_{k=1}^{K} \bigl\|\Theta^{(k)} - \bigl(\Theta_t^{(k)} - \eta_t \nabla_{\Theta^{(k)}} f(\Theta_t)\bigr)\bigr\|_F^2 + P_F(\Theta) \Bigr\}. \qquad (18)

Problem (18) is separable with respect to the index pairs (i, j) of the elements θ_{k,i,j} of Θ^{(k)} ∈ R^{p×p}; hence, the proximal operator can be computed in a componentwise manner. Let A = Θ_t − η_t ∇f(Θ_t); then, problem (18) reduces to the following for i = 1, . . . , p and j = 1, . . . , p:

\min_{\theta_{1,i,j},\ldots,\theta_{K,i,j}} \; \frac{1}{2\eta_t} \sum_{k=1}^{K} (\theta_{k,i,j} - a_{k,i,j})^2 + \lambda_1 \mathbf{1}_{i \neq j} \sum_{k=1}^{K} |\theta_{k,i,j}| + \lambda_2 \sum_{k < k'} |\theta_{k,i,j} - \theta_{k',i,j}|,

where 1_{i≠j} is an indicator function, the value of which is 1 only when i ≠ j.
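For the special case K = 2, each componentwise problem has a simple closed form: a fusion step with parameter λ2η_t followed by soft-thresholding of the off-diagonal entries with λ1η_t (this composition is the standard sparse fused lasso result; the sketch below assumes K = 2 and that the fusion term also covers the diagonal, as in [13]):

```python
import numpy as np

def prox_fused_pair(a1, a2, rho):
    # Minimizer of (1/2)[(t1-a1)^2 + (t2-a2)^2] + rho*|t1-t2|: average the pair
    # if the values are within 2*rho, otherwise move each toward the other by rho.
    if abs(a1 - a2) <= 2.0 * rho:
        m = 0.5 * (a1 + a2)
        return m, m
    s = 1.0 if a1 > a2 else -1.0
    return a1 - rho * s, a2 + rho * s

def prox_fused_penalty(A, eta, lam1, lam2):
    # Componentwise prox for the K = 2 fused penalty: fusion with lam2*eta,
    # then soft-thresholding of the off-diagonal entries with lam1*eta
    # (the diagonal is not l1-penalized).
    A1, A2 = A
    p = A1.shape[0]
    T1, T2 = np.empty_like(A1), np.empty_like(A2)
    for i in range(p):
        for j in range(p):
            t1, t2 = prox_fused_pair(A1[i, j], A2[i, j], lam2 * eta)
            if i != j:
                t1 = np.sign(t1) * max(abs(t1) - lam1 * eta, 0.0)
                t2 = np.sign(t2) * max(abs(t2) - lam1 * eta, 0.0)
            T1[i, j], T2[i, j] = t1, t2
    return [T1, T2]
```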

Group Lasso Penalty P G
By definition, the update of Θ_{t+1} for the group lasso penalty P_G(Θ) is

\Theta_{t+1} = \arg\min_{\Theta} \Bigl\{ \frac{1}{2\eta_t} \sum_{k=1}^{K} \bigl\|\Theta^{(k)} - \bigl(\Theta_t^{(k)} - \eta_t \nabla_{\Theta^{(k)}} f(\Theta_t)\bigr)\bigr\|_F^2 + P_G(\Theta) \Bigr\}.

Similarly, let A = Θ_t − η_t ∇f(Θ_t); then, the problem becomes the following for i = 1, . . . , p and j = 1, . . . , p:

\min_{\theta_{1,i,j},\ldots,\theta_{K,i,j}} \; \frac{1}{2\eta_t} \sum_{k=1}^{K} (\theta_{k,i,j} - a_{k,i,j})^2 + \mathbf{1}_{i \neq j} \Bigl( \lambda_1 \sum_{k=1}^{K} |\theta_{k,i,j}| + \lambda_2 \bigl( \sum_{k=1}^{K} \theta_{k,i,j}^2 \bigr)^{1/2} \Bigr).

We have θ_{k,i,j} = a_{k,i,j} for i = j. In addition, for i ≠ j, the solution [31,36,37] is given by

\theta_{k,i,j} = \Bigl( 1 - \frac{\lambda_2 \eta_t}{\bigl\| S_{\lambda_1 \eta_t}(a_{\cdot,i,j}) \bigr\|_2} \Bigr)_+ S_{\lambda_1 \eta_t}(a_{k,i,j}), \qquad (21)

where S_λ(·) is the soft-thresholding function of Section 2.2 applied elementwise and a_{·,i,j} := (a_{1,i,j}, . . . , a_{K,i,j}).
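The corresponding operation can be implemented vectorially, as in the following sketch (soft-thresholding with λ1η_t followed by groupwise shrinkage with λ2η_t over the K classes; the diagonal is left unchanged):

```python
import numpy as np

def prox_group_penalty(A, eta, lam1, lam2):
    # Componentwise prox for the group penalty: for every off-diagonal (i, j),
    # soft-threshold the K values by lam1*eta and then shrink the group by
    # lam2*eta; diagonal entries are returned unchanged.
    A = np.stack(A)                                  # K x p x p
    K, p, _ = A.shape
    B = np.sign(A) * np.maximum(np.abs(A) - lam1 * eta, 0.0)
    norms = np.sqrt((B ** 2).sum(axis=0))            # p x p group norms over classes
    with np.errstate(divide="ignore", invalid="ignore"):
        scale = np.maximum(1.0 - lam2 * eta / norms, 0.0)
    scale[norms == 0.0] = 0.0
    T = B * scale                                    # broadcast over the class axis
    idx = np.arange(p)
    T[:, idx, idx] = A[:, idx, idx]                  # diagonal is not penalized
    return [T[k] for k in range(K)]
```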

Modified ISTA for JGL
Thus far, we have seen that f(Θ) in the JGL problem (15) is not globally Lipschitz gradient continuous. The ISTA may therefore not be efficient enough for the JGL, because the backtracking line search needs to evaluate the objective function and the positive definiteness of Θ_{t+1} in Step 1 of Algorithm 2, which is inefficient when this evaluation is expensive.
In this section, we modify Algorithm 2 into Algorithm 3 based on the step-size selection strategy in Section 2.3, which takes advantage of the properties of the self-concordant function. This strategy does not rely on a Lipschitz gradient assumption on the function f(Θ) [24], and we can eliminate the need for the backtracking line search.

Lemma 2. ([38])
Self-concordance is preserved by scaling and addition: if f is a self-concordant function and a ≥ 1 is a constant, then a f is self-concordant. If f_1, f_2 are self-concordant, then f_1 + f_2 is self-concordant.
By Lemma 2, the function f(Θ) in (16) is self-concordant, since each n_k ≥ 1. In Algorithm 3, for the initial step size η_t in each iteration, we use the Barzilai-Borwein method [39]. We then apply the step-size selection mechanism of Section 2.3, which is employed in Steps 3-5 of Algorithm 3.
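One common Barzilai-Borwein choice for the initial step size is sketched below (the exact variant and the safeguards used in Algorithm 3 are implementation details; the <s, s>/<s, y> formula and the fallback value below are assumptions of the sketch):

```python
import numpy as np

def bb_step_size(Thetas, Thetas_prev, grads, grads_prev, fallback=1.0):
    # Barzilai-Borwein step size eta = <s, s> / <s, y>, where s and y collect
    # the changes of the iterates and of the gradients over all K classes.
    ss = sy = 0.0
    for T, Tp, G, Gp in zip(Thetas, Thetas_prev, grads, grads_prev):
        s, y = T - Tp, G - Gp
        ss += np.sum(s * s)
        sy += np.sum(s * y)
    return ss / sy if sy > 0 else fallback
```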

Theoretical Analysis
For multiple Gaussian graphical models, Honorio and Samaras [14] and Hara and Washio [17] provided lower and upper bounds for the optimal solution Θ*. However, the models they considered are different from the JGL. To the best of our knowledge, no previous work has provided bounds on the optimal solution Θ* for the JGL problem (15).
In the following, we show bounds for the optimal solution Θ* of the JGL and for the iterates Θ_t generated by Algorithms 2 and 3; the bounds apply to both the fused and group lasso-type penalties.

Proposition 1. The optimal solution Θ* of problem (15) satisfies explicit lower and upper bounds expressed in terms of λ_c := Kλ_1^2 + 2Kλ_1λ_2 + λ_2^2 and the diagonal elements s_{k,i,i} of S^{(k)}.
For the proof, see Appendix A.1. Note that the objective function value F(Θ) always decreases as the iterations proceed in both algorithms, due to Remark 3.1 in [25] and Lemma 12 in [24]. Therefore, the following inequality holds for Algorithms 2 and 3:

F(\Theta_t) \le F(\Theta_0), \qquad t = 1, 2, \ldots. \qquad (22)

Then, based on condition (22), we provide explicit bounds for the iterates {Θ_t}_{t=0,1,...} of Algorithms 2 and 3 for the JGL problem (15).

Proposition 2. The iterates {Θ_t}_{t=0,1,...} generated by Algorithms 2 and 3 admit explicit lower and upper bounds.
Proof. It can be obtained by extending Lemma 3 in [10].
Lemma 4 implies that to obtain a convergence rate γ_t < 1, we require condition (23) on the step size. Using Propositions 1 and 2, we can bound the quantities appearing in (23) and thus obtain a step size η_t that satisfies (23) and guarantees the linear convergence rate (γ_t < 1). However, this step size is quite conservative in practice. Hence, we use the Barzilai-Borwein method in the implementation and regard the step size η_t satisfying (23) as a safe choice. When the number of backtracking iterations in Step 1 of Algorithm 2 exceeds the given maximum without fulfilling the backtracking line search condition, we use the safe step size η_t for the subsequent calculations. In Section 4.2.3, we confirm the linear convergence rate of the proposed ISTA by experiment.

Experiments
In this section, we evaluate the performance of the proposed methods on both synthetic and real datasets, and we compare the following algorithms:
• ADMM: the general ADMM method proposed by [13].
• FMGL: the proximal Newton-type method proposed by [23].
• ISTA: the proposed proximal gradient method with backtracking (Algorithm 2).
• M-ISTA: the proposed modified ISTA with the step-size selection strategy (Algorithm 3).
We perform all the tests in R Studio on a MacBook Air with a 1.6 GHz Intel Core i5 and 8 GB of memory. The wall times are recorded as the run times for the four algorithms.

Stopping Criteria and Model Selection
In the experiments, we consider two stopping criteria for the algorithms.
1. Relative error stopping criterion: the relative error of successive iterates is at most ε.
2. Objective error stopping criterion: F(Θ_t) − F(Θ*) ≤ ε.
Here, ε is a given accuracy tolerance; we terminate the algorithm if the corresponding error is smaller than ε or the number of iterations exceeds 1000. We use the objective error for the convergence rate analysis and the relative error for the time comparison.
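As an example, a relative-error check of this kind could be implemented as follows (the precise definition of the relative error used in our experiments is not reproduced above; the Frobenius-norm relative change below is an assumed stand-in):

```python
import numpy as np

def relative_change(Thetas, Thetas_prev):
    # Relative change between successive iterates, aggregated over the K classes.
    num = sum(np.linalg.norm(T - Tp, "fro") for T, Tp in zip(Thetas, Thetas_prev))
    den = max(sum(np.linalg.norm(Tp, "fro") for Tp in Thetas_prev), 1e-12)
    return num / den

def should_stop(Thetas, Thetas_prev, iteration, eps=1e-5, max_iter=1000):
    # Stop when the relative error drops below eps or the iteration cap is hit.
    return relative_change(Thetas, Thetas_prev) <= eps or iteration >= max_iter
```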
The JGL model is affected by the regularization parameters λ_1 and λ_2. To select them, we use V-fold cross-validation. First, the dataset is randomly split into V segments of equal size; each segment in turn serves as test data and is evaluated with the model estimated from the other V − 1 segments (training data), and this is repeated V times so that each segment is used once as test data.
Let S^{(k)}_v be the sample covariance matrix of the v-th (v = 1, . . . , V) segment for class k = 1, . . . , K, and let \hat\Theta^{(k)}_{\lambda,-v} be the inverse covariance matrix estimated from the remaining V − 1 segments. We choose the λ_1 and λ_2 that minimize the average predictive negative log-likelihood

\mathrm{CV}(\lambda_1, \lambda_2) := \frac{1}{V} \sum_{v=1}^{V} \sum_{k=1}^{K} \Bigl( \mathrm{trace}\bigl(S^{(k)}_v \hat\Theta^{(k)}_{\lambda,-v}\bigr) - \log\det \hat\Theta^{(k)}_{\lambda,-v} \Bigr).
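A sketch of the per-fold score is given below (weighting of the folds or classes, e.g., by n_k, is omitted; the grid search over (λ1, λ2) simply averages this score over the V folds and picks the minimizer):

```python
import numpy as np

def fold_score(S_test, Theta_hat):
    # Predictive negative log-likelihood of the held-out covariances S_test
    # (one per class) under the precision matrices Theta_hat estimated on the
    # training folds: sum_k [ trace(S_test^(k) Theta_hat^(k)) - log det Theta_hat^(k) ].
    score = 0.0
    for S, T in zip(S_test, Theta_hat):
        _, logdet = np.linalg.slogdet(T)
        score += np.trace(S @ T) - logdet
    return score
```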

Synthetic Data
The performance of the proposed methods was assessed on synthetic data in terms of the number of iterations, the execution time, the squared error, and the receiver operating characteristic (ROC) curve. We follow the data generation mechanism described in [41] with some modifications for the JGL model. We put the details in Appendix B.

Time Comparison Experiments
We vary p, N, K, and λ_1 to compare the execution time of our proposed methods with that of the existing methods. We consider only the fused penalty in our proposed methods for a fair comparison in these experiments because the FMGL algorithm applies only to the fused penalty. First, we compare the performance of the different algorithms under various dimensions p, which is shown in Figure 1. Figure 1 shows that the execution times of the FMGL and ADMM increase rapidly as p increases. In particular, we observe that the M-ISTA significantly outperforms the others when p exceeds 200. The ISTA shows better performance than the other three methods when p is less than 200, but it requires more time than the M-ISTA as p grows. It is reasonable to attribute this to evaluating the objective function in the backtracking line search at every iteration, which increases the computational burden as p increases; this means that the M-ISTA is a good choice for these cases, while the ISTA can be a good candidate when the evaluation is inexpensive. Table 2 summarizes the performance of the four algorithms under different parameter settings to achieve a given precision ε of the relative error. The results presented in Table 2 reveal that when we increase the number of classes K, all the algorithms require more time than usual, and the execution time of ADMM becomes the largest among them. When we vary λ_1, the algorithms become more efficient as its value increases. In most instances, the M-ISTA and ISTA outperform the existing methods ADMM and FMGL. In the exceptional case (p = 20, K = 2, N = 60, λ_1 = 0.1, and λ_2 = 0.05), the M-ISTA and ISTA are still comparable with the FMGL and faster than ADMM.

Algorithm Assessment
We generate the simulation data as described in Appendix B and regard the synthetic inverse covariance matrices Θ (k) as the true values for our assessment experiments.
First, we assessed our proposed method by drawing an ROC curve, which plots the number of true positive (TP) edges selected against the number of false positive (FP) edges selected. We say that an edge (i, j) in the k-th class is selected in the estimate \hat\Theta^{(k)} if the element \hat\theta_{k,i,j} ≠ 0; a selected edge is a true positive if the true precision matrix element θ_{k,i,j} ≠ 0 and a false positive if θ_{k,i,j} = 0. The two quantities are defined by

\mathrm{TP} := \sum_{k=1}^{K} \sum_{i \neq j} \mathbf{1}(\hat\theta_{k,i,j} \neq 0,\; \theta_{k,i,j} \neq 0), \qquad \mathrm{FP} := \sum_{k=1}^{K} \sum_{i \neq j} \mathbf{1}(\hat\theta_{k,i,j} \neq 0,\; \theta_{k,i,j} = 0),

where 1(·) is the indicator function.
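The two counts can be computed as in the following sketch (the numerical tolerance and the use of the strict upper triangle, so that each undirected edge is counted once, are implementation conventions):

```python
import numpy as np

def count_tp_fp(Theta_hat, Theta_true, tol=1e-8):
    # Count true-positive and false-positive selected edges over all K classes.
    tp = fp = 0
    for Th, Tt in zip(Theta_hat, Theta_true):
        iu = np.triu_indices(Th.shape[0], k=1)       # strict upper triangle
        selected = np.abs(Th[iu]) > tol
        truly_nonzero = np.abs(Tt[iu]) > tol
        tp += int(np.sum(selected & truly_nonzero))
        fp += int(np.sum(selected & ~truly_nonzero))
    return tp, fp
```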
To confirm the validity of the proposed methods, we compare the ROC figures for the fused penalty and the group penalty. We fix the parameter λ_2 for each curve and vary λ_1 to obtain various numbers of selected edges, because the sparsity penalty parameter λ_1 controls the total number of selected edges.
We show the ROC curves for the fused and group lasso penalties in Figure 2a,b, respectively. From the figures, we observe that both penalties yield highly accurate edge selection. The result with λ_2 = 0.0166 in the fused penalty case is better than that with λ_2 = 0.05, and the result with λ_2 = 0.0966 in the group penalty case is better than that with λ_2 = 0.09, which means that if we select the tuning parameters properly, then we can obtain precise results while simultaneously meeting different modeling demands. Figure 3a,b then display the mean squared error (MSE) between the estimated values \hat\theta_{k,i,j} and the true values θ_{k,i,j}, where \hat\theta_{k,i,j} is the value estimated by the proposed method and θ_{k,i,j} is the true precision matrix value used in the data generation. The figures illustrate that as the total number of selected edges increases, the errors decrease and finally reach relatively low values.
Overall, the proposed method shows competitive efficiency not only in computational time but also in accuracy.

Convergence Rate
This section shows the convergence rate of the ISTA for solving the JGL problem (15) in practice, with λ_1 = 0.1, 0.09, and 0.08. We recorded the number of iterations needed to achieve different tolerances of F(Θ_t) − F(Θ*) in Figure 4, running on a synthetic dataset with p = 200, K = 2, λ_2 = 0.05, and N = 400. The figure reveals that as λ_1 decreases, more iterations are needed to converge to the specified tolerance. Moreover, the figure shows the linear convergence rate of the proposed ISTA method, which corroborates the theoretical analysis in Section 3.3.

Real Data
In this section, we use two different real datasets to demonstrate the performance of our proposed method and visualize the result.
Firstly, we used the presidential speeches dataset of [42] to jointly estimate common links across graphs and show the common structure. The dataset contains the 75 most frequently used words (features) from several major speeches of the 44 US presidents (samples). In addition, we used the clustering result in [42], where the authors split the 44 samples into two groups with similar features, so that we obtained two classes of samples (K = 2).
We used Cytoscape [43] to visualize the results when λ_1 = 1.9 and λ_2 = 0.16. We chose these relatively large tuning parameters for better interpretability of the network figure. Figure 5 shows the relationship network of the high-frequency words identified by the JGL model with the proposed method. As shown in the figure, each node represents a word, and the edges represent the relationships between words. We use different colors to show the various structures: the black edges form the common structure between the two classes, the red edges are the structures specific to the first class (k = 1), and the green edges are specific to the second class (k = 2). Figure 5 also shows a subnetwork at the top with red edges, meaning that there are relationships among those words and that these connections exist only in the first group.
We compared the time cost of the four algorithms and show the results in Table 3. We used the cross-validation method (V = 6) described in Section 4.1 to select the optimal tuning parameters (λ_1 = 0.1, λ_2 = 0.05). In addition, we manually chose the other two pairs of parameters for further comparison. Table 3 shows that the ISTA outperforms the other three algorithms, and our proposed methods offer stable performance when the parameters are varied, while ADMM is the slowest in most cases.
Secondly, we use a breast cancer dataset [44] for the time comparison. There are 250 samples and 1000 genes in the dataset, with 192 control samples and 58 case samples (K = 2). We extract the 200 genes with the highest variances among the original genes. The tuning parameter pair (λ_1 = 0.01, λ_2 = 0.0166) was chosen by the cross-validation method. Table 3 shows that our proposed methods (ISTA and M-ISTA) outperform ADMM and FMGL, and the M-ISTA shows the best performance on the breast cancer dataset.

Discussion
We propose two efficient proximal gradient procedures, with and without the backtracking line search option, for the joint graphical lasso. The first (Algorithm 2) does not require extra variables, unlike ADMM, which needs manual tuning of the Lagrangian penalty parameter ρ in [13] as well as storing and updating dual variables. Moreover, we reduce the update step to subproblems that can be solved efficiently and precisely as lasso-type problems. Based on Algorithm 2, we modified the step-size selection by extending the strategy in [24], obtaining the second procedure (Algorithm 3), which does not rely on the Lipschitz assumption. Additionally, the second procedure does not require a backtracking line search, which significantly reduces the computation time needed to evaluate objective functions.
From the theoretical perspective, we reach the linear convergence rate for the ISTA. Furthermore, we derive the lower and upper bounds of the solution to the JGL problem and the iterates in the algorithms, guaranteeing that the ISTA converges linearly. Numerically, the methods are demonstrated on simulated and real datasets to illustrate their robust and efficient performance over state-of-the-art algorithms.
For further computational improvement, note that the most expensive step in the algorithms is computing the matrix inversions required by the gradient of f(Θ); both algorithms have a complexity of O(Kp^3) per iteration. The matrix inversions could be handled by more efficient algorithms with lower complexity. In addition, we can use the faster computation procedure in [13] to decompose the optimization problem for the proposed methods as a preprocessing step. Overall, the proposed methods are highly efficient for the joint graphical lasso problem.

Data Availability Statement: Publicly available datasets were analyzed in this paper. Presidential speeches dataset: https://www.presidency.ucsb.edu, accessed on 5 November 2021; Breast cancer dataset: https://www.rdocumentation.org/packages/doBy/versions/4.5-15/topics/breastcancer, accessed on 5 November 2021.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: GL, graphical lasso; JGL, joint graphical lasso; ADMM, alternating direction method of multipliers; ISTA, iterative shrinkage-thresholding algorithm; G-ISTA, graphical ISTA; M-ISTA, modified ISTA; ROC, receiver operating characteristic; MSE, mean squared error.

Appendix A.1. Proof of Proposition 1

We first introduce the Lagrange dual problem of (15). By introducing the auxiliary variables Z = {Z^{(1)}, . . . , Z^{(K)}}, we can rewrite the problem as

\min_{\Theta, Z} \; \sum_{k=1}^{K} n_k \bigl( \mathrm{trace}(S^{(k)}\Theta^{(k)}) - \log\det\Theta^{(k)} \bigr) + g(Z) \quad \text{subject to} \quad Z = \Theta.

Then, the Lagrange function of the above is given by

L(\Theta, Z, \Lambda) := \sum_{k=1}^{K} n_k \bigl( \mathrm{trace}(S^{(k)}\Theta^{(k)}) - \log\det\Theta^{(k)} \bigr) + g(Z) + \sum_{k=1}^{K} \mathrm{trace}\bigl(\Lambda^{(k)}(\Theta^{(k)} - Z^{(k)})\bigr),

where Λ = {Λ^{(1)}, . . . , Λ^{(K)}}, Λ^{(k)} ∈ R^{p×p}, are dual variables. To obtain the dual problem, we minimize over the primal variables. Taking the derivative with respect to Θ^{(k)} and setting it to zero, we obtain

n_k S^{(k)} + \Lambda^{(k)} = n_k (\Theta^{(k)})^{-1} \qquad (A1)

for k = 1, . . . , K. Substituting Equation (A1) into \min_{\Theta, Z} L(\Theta, Z, \Lambda) yields the dual function. Hence, writing g^*(Λ) for the conjugate of g, we can express the duality gap [38] (the primal objective minus the dual objective) as

\sum_{k=1}^{K} n_k\, \mathrm{trace}(S^{(k)}\Theta^{(k)}) + g(Z) - \sum_{k=1}^{K} n_k p + g^*(\Lambda).

When the gap value is 0, the optimal solution is found. Because the conjugate function g^*(Λ) is an indicator function, its value is 0 at the optimal solution.
Firstly, consider the group penalty P_G(Θ). Let E^{(k)} be a non-negative p × p matrix satisfying −E_{k,i,j} ≤ θ_{k,i,j} ≤ E_{k,i,j}, and introduce the Lagrange multipliers Γ^{(k)} and Γ_0^{(k)} for k = 1, . . . , K. This procedure is similar to that in [17].
The lower bound for the fused penalty can be derived in a similar way.