Jewel: A Novel Method for Joint Estimation of Gaussian Graphical Models

Abstract: In this paper, we consider the problem of estimating multiple Gaussian Graphical Models from high-dimensional datasets. We assume that these datasets are sampled from different distributions with the same conditional independence structure, but not the same precision matrix. We propose jewel, a joint estimation method that uses a node-wise penalized regression approach. In particular, jewel uses a group Lasso penalty to simultaneously guarantee the symmetry of the resulting adjacency matrix and the joint learning of the graphs. We solve the minimization problem using the group descent algorithm and propose two procedures for estimating the regularization parameter. Furthermore, we establish the estimator's consistency property. Finally, we illustrate our estimator's performance through simulated and real data examples on gene regulatory networks.


Introduction
Network analysis is becoming a powerful tool for describing the complex systems that arise in the physical, biomedical, epidemiological, and social sciences; see [1,2]. In particular, estimating network structure and its complexity from high-dimensional data has been one of the most relevant statistical challenges of the last decade [3]. The mathematical framework to use depends on the type of relationship among the variables that the network should incorporate. For example, in the context of gene regulatory networks (GRN), traditional co-expression methods are useful for capturing marginal correlation among genes without distinguishing between direct and mediated gene interactions. Instead, graphical models (GM) constitute a well-known framework for describing conditional dependency relationships between random variables in a complex system. Therefore, they are better suited to describe direct relations among genes, not mediated by the remaining genes.
In the GMs' framework, an undirected graph G = (V, E) describes the joint distribution of the random vector (X_1, ..., X_p), the individual variables being the graph's nodes and the edges reflecting the presence/absence of a conditional dependency relation between them. When the system of random variables has a multivariate Gaussian distribution (X_1, ..., X_p) ~ N(0, Σ), we refer to Gaussian Graphical Models (GGMs). In such a case, estimating the graph is equivalent to estimating the support of the precision matrix, i.e., the inverse of the covariance matrix associated with the multivariate Gaussian distribution.
There is extensive literature on learning a GGM, and we refer to the works in [4][5][6][7] for an overview. In brief, under a high-dimensional setting and sparsity assumptions on the precision matrix, we can broadly classify the available methods into two categories: methods that estimate the entire precision matrix Ω and those that estimate only its support (i.e., the edge set E). Methods in the first category usually impose a Lasso-type penalty on the inverse covariance matrix entries when maximizing the log-likelihood, as in the GLasso approach proposed in [8,9]. Methods in the second category date back to the seminal paper of Meinshausen and Bühlmann [10] and use the variable selection properties of Lasso regression.
In recent years, the focus shifted from the inference of a single graph given a dataset to the inference of multiple graphs given different datasets, assuming that the graphs share some common structure. This setting encompasses cases where datasets measure similar entities (variables) under different conditions or setups. It is useful for describing various situations encountered in real applications, such as meta-analyses and heterogeneous data integration. For example, in GRN, a single dataset might represent the gene expression levels of n patients affected by a particular type of cancer, and the inference aims to understand the specific regulatory mechanisms. International projects, such as The Cancer Genome Atlas (TCGA), or large repositories, such as Gene Expression Omnibus (GEO), make available tens of thousands of gene expression samples. Therefore, it is easy to find several datasets (i.e., different studies collected using different high-throughput assays or performed in different laboratories) on the experimental condition of interest. Assuming that, despite the differences in technologies and preprocessing, the hidden regulatory mechanisms investigated in the different studies are similar if not the same, it is worth integrating them to improve the inference. A similar type of challenge is also present in integrative multi-omics analysis. For instance, one can measure gene expression, single nucleotide polymorphisms (SNPs), methylation levels, etc., on the same set of individuals and want to combine the information across the different omics. In both cases, it is possible to develop multitasking or joint learning approaches to gain power in the inference (see [11,12] to cite a few examples).
More specifically, in multiple GGMs inference, we have K ≥ 2 different datasets X (1) , . . . , X (K) , each drawn from a Gaussian distribution N(0, Σ (k) ). Each dataset measures (almost) the same sets of variables in a specific class (or condition) k. The aim is to estimate the underlying GGMs under the assumption that they share some common structure across the classes.
Several methods are already available in the literature to deal with such a problem. They differ in their assumptions on how the conditional dependency structure is shared across the various distributions. The majority of these methods extend maximum likelihood (MLE) approaches to the multiple data framework. Among the most important contributions, we cite the work in [13], which penalizes the MLE by a hierarchical penalty enforcing similar sparsity patterns across classes while allowing some class-specific differences. However, as the penalty is not convex, convergence is not guaranteed. JGL [14] is another important method that proposes two alternatives. The first penalizes the MLE by a fused Lasso penalty, and the second penalizes the MLE by combining two group Lasso penalties. In this last alternative, the two penalties enforce a similar sparsity structure and some differences, respectively. Two other approaches are [15], which penalizes the MLE by an adaptive Lasso penalty whose weights are updated iteratively, and the work in [16], which penalizes the MLE by a group Lasso penalty. On the other hand, the work in [17,18] extends the regression-based approach in [10] to the multiple data setting. In particular, in [17], the authors constructed a two-step procedure. In the first step, they infer the graphical structure using a regression-based approach with a group Lasso penalty for exploiting the common structure across the classes. In the second step, they penalize the MLE subject to the edge set of the first step. The authors of [18] proposed a regression-based minimization problem with a cooperative group Lasso penalty. The first term is the standard group Lasso penalty, which enforces the same structure across classes by penalizing elements of the precision matrices. The second one penalizes negative elements of those matrices to promote differences between classes. By the authors' hypothesis, a swap in sign can occur only for class-specific connections, justifying such a penalty.
In this paper, we propose jewel, a new technique for jointly estimating GGMs in the multiple dataset framework. We assume that the underlying graph structure is the same across the K classes. However, each class preserves its specific precision matrix. We extend the regression-based method proposed in [10] to the case of multiple datasets, in the same spirit as in [18] and the first step of the procedure in [17], from which our approach mainly differs in the definition of the variables' groups. More specifically, in [17,18] the groups are determined by the ij positions across the K precision matrices, whereas in our approach, the groups are determined by both the ij and ji positions across the K precision matrices. Consequently, neither [17] nor [18] provides symmetric estimates of the precision matrices' supports, and both require postprocessing. Instead, jewel's grouping strategy exploits both the common structure across the datasets and the symmetry of the precision matrices' support. We consider this a valuable improvement over the competitors, as it avoids the need for postprocessing.
The rest of the paper is organized as follows. In Section 2, we derive the jewel method and present the numerical algorithm for its solution; we establish the theoretical properties of the estimator in Section 2.4 and discuss the choice of the tuning parameter in Section 2.5. Code availability is described in Section 2.6. Finally, in Section 3, we illustrate our method's performance in a simulation study, comparing it with some other available alternatives, and describe a real data application to gene expression datasets.

Jewel: Joint Node-Wise Estimation of Multiple Gaussian Graphical Models
This section describes the mathematical setup, the proposed method for the joint estimation of GGMs (jewel), the numerical algorithm adopted, the theoretical properties of the proposed estimator, and the approaches for estimating the regularization parameter.

Problem Setup
Let X^(1), X^(2), ..., X^(K) be K ≥ 2 datasets of dimension n_k × p_k, respectively, that represent similar entities measured under K different conditions or collected in distinct classes. Each dataset X^(k) contains n_k observations x_1^(k), ..., x_{n_k}^(k), where each x_i^(k) is a p_k-dimensional row vector. Without loss of generality, assume that each data matrix is standardized to have columns with zero mean and variance equal to one.
The proposed model assumes that x_1^(k), ..., x_{n_k}^(k) represent independent samples from a N(0, Σ^(k)), with Σ^(k) ≻ 0. Moreover, we also assume that most of the variables are in common across all the K datasets. For example, this assumption includes datasets that measure the same set of p variables under K different conditions, where a few variables might not be observed in some datasets due to some technical failure. Under this setup, the model assumes that the variables share the same conditional independence structure across the K datasets.
Precisely, let G_k = (V_k, E_k) be the undirected graph that describes the conditional independence structure of the k-th distribution N(0, Σ^(k)), i.e., V_k = {X_1, ..., X_{p_k}} and E_k = {(i, j) : X_i and X_j are conditionally dependent given all the other variables}. Note that we use the notation (i, j) to denote the undirected edge incident to vertices i and j, or equivalently j and i (if such an edge exists). We use the notation {i, j} to denote the pair of vertices or variables {X_i, X_j}. We assume that there exists a common undirected graph G = (V, E), with V = ∪_k V_k, such that each E_k coincides with the restriction of E to the vertices of V_k. In other words, two variables of V are conditionally independent in all the distributions of which they are part, or they are never conditionally independent in any. However, when they are not conditionally independent, the conditional correlation might differ across datasets, modeling relations that can be tuned differently depending on the specific experimental condition or setup.
Let us denote by Ω^(k) = (Σ^(k))^−1 the true precision matrix for the k-th distribution. As inferring a GGM is equivalent to estimating the support of the precision matrix, estimating a GGM from multiple datasets translates into simultaneously inferring the support of all precision matrices Ω^(k), with the constraint that the supports coincide on the common variables. We emphasize here that the aim is to estimate only the structure of the common network, not the entries of the precision matrices.
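The correspondence between a precision matrix's support and the edge set can be sketched in a few lines. The following Python/numpy fragment is a minimal illustration (not part of the jewel R package): it reads the edges off a toy 3-variable chain, where X_1 and X_3 are conditionally independent given X_2.

```python
import numpy as np

def edge_set(omega, tol=1e-10):
    """Read the edge set off the support of a precision matrix:
    (i, j) is an edge iff Omega_ij != 0, for i < j."""
    p = omega.shape[0]
    return {(i, j) for i in range(p) for j in range(i + 1, p)
            if abs(omega[i, j]) > tol}

# Toy 3-variable chain X1 -- X2 -- X3: the (0, 2) entry is zero,
# so X1 and X3 are conditionally independent given X2.
omega = np.array([[1.0, 0.4, 0.0],
                  [0.4, 1.0, 0.3],
                  [0.0, 0.3, 1.0]])
print(edge_set(omega))  # two edges: (0, 1) and (1, 2)
```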

Jewel
jewel is inspired by the node-wise regression-based procedure proposed in [10] for inferring a GGM from a single dataset. Let us first revise this method. Fix a single dataset, say k, and define Θ^(k) as the p_k × p_k zero-diagonal matrix with extra-diagonal entries Θ^(k)_ij = −Ω^(k)_ij / Ω^(k)_jj. Since Ω^(k)_jj > 0 by the positive definiteness of Σ^(k), the support of Ω^(k) coincides with the support of Θ^(k). In [10], the authors proposed to learn matrix Θ^(k) instead of Ω^(k). To this aim, they solved the following multivariate regression problem with the Lasso penalty:

Θ̂^(k) = arg min_{Θ: diag(Θ)=0} (1/2) ||X^(k) − X^(k)Θ||²_F + λ Σ_{i≠j} |Θ_ij|,   (1)

where λ is a tuning regularization parameter and ||·||_F is the Frobenius norm. Note that the problem in Equation (1) is separable into p_k independent univariate regression problems, where each column of matrix Θ^(k) is obtained independently from the others. Although computationally efficient, such an approach neither exploits nor guarantees the symmetry of the solution. Therefore, the authors proposed to post-process the solution to restore the symmetry, for example by the "AND" or "OR" rules. More recently, the authors of [19] proposed a modified approach where Θ^(k) is obtained by solving the following minimization problem:

Θ̂^(k) = arg min_{Θ: diag(Θ)=0} (1/2) ||X^(k) − X^(k)Θ||²_F + λ Σ_{i<j} √(Θ²_ij + Θ²_ji),   (2)

which corresponds to a multivariate regression problem with a group Lasso penalty. In Equation (2), the unknown parameters, i.e., the extra-diagonal terms of Θ^(k), are p_k(p_k − 1) in number and are arranged into (p_k choose 2) groups of dimension 2, each group including Θ^(k)_ij and Θ^(k)_ji. As a consequence, the minimization problem in Equation (2) results in a group sparse estimate Θ̂^(k), which exploits and hence guarantees the symmetry of the estimated support.
In this work, we further extend the approach in [19] to simultaneously learn the K zero-diagonal matrices Θ^(k), each with p_k(p_k − 1) unknown parameters. Let p = |V| be the cardinality of V and divide the unknowns into (p choose 2) groups. Each group collects the pair of entries {Θ^(k)_ij, Θ^(k)_ji} across all the datasets that contain both variables. More precisely, for all i < j, define the group Θ_[ij] = {Θ^(k)_ij, Θ^(k)_ji : {X_i, X_j} ⊆ V_k} and denote by g_ij the group's cardinality. Based on our hypothesis, the vector of each group coincides with the zero vector when the variables X_i and X_j are conditionally independent in all the datasets that contain them (i.e., when (i, j) ∉ E), or it is different from zero when the variables X_i and X_j are conditionally dependent in all the datasets that contain them (i.e., when (i, j) ∈ E). In the latter case, each dataset can modulate the strength of the dependency in a specific way because, when not equal to zero, the group elements are not forced to be equal. This is another advantage of our proposal of treating groups as symmetric pairs across all the datasets. Therefore, in this paper we propose jewel, which simultaneously estimates Θ̂^(1), ..., Θ̂^(K) by solving the following minimization problem:

(Θ̂^(1), ..., Θ̂^(K)) = arg min (1/2) Σ_{k=1}^{K} (1/n_k) ||X^(k) − X^(k)Θ^(k)||²_F + λ Σ_{i<j} ||Θ_[ij]||.   (3)

The estimated edge set Ê is obtained as supp(Θ̂^(1)) = · · · = supp(Θ̂^(K)), which are equal by construction. Indeed, the problem in Equation (3) corresponds to a multivariate regression problem with a group Lasso penalty that enforces the same support on the estimates Θ̂^(k), naturally exploiting and guaranteeing the symmetry of the solution.
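As an illustration of the objective in Equation (3), the following Python/numpy fragment evaluates the fit term and the group Lasso penalty for given matrices Θ^(k). It is a sketch, not the authors' R implementation; in particular, the 1/n_k weighting of the fit term is an assumption about the exact normalization.

```python
import numpy as np

def jewel_objective(Xs, Thetas, lam):
    """Value of the jewel-type objective: node-wise least-squares fit
    plus a group Lasso penalty over the groups
    {Theta^(k)_ij, Theta^(k)_ji, k = 1..K}.  The 1/n_k weighting of
    the fit term is an assumption about the normalization."""
    fit = sum(np.linalg.norm(X - X @ T, 'fro') ** 2 / (2 * X.shape[0])
              for X, T in zip(Xs, Thetas))
    p = Thetas[0].shape[0]
    penalty = 0.0
    for i in range(p):
        for j in range(i + 1, p):
            group = [T[i, j] for T in Thetas] + [T[j, i] for T in Thetas]
            penalty += np.linalg.norm(group)
    return fit + lam * penalty
```

With all Θ^(k) = 0, the value reduces to Σ_k ||X^(k)||²_F / (2 n_k), which gives a quick sanity check.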
Note that, although the minimization problem in Equation (3) can be written and solved for a number of variables p_k that differs across datasets, the notation and the formulas simplify considerably when we assume that all variables coincide in the K datasets.
Under this assumption, for all k we have p_k = p and V_k = V; then each group Θ_[ij] = {Θ^(k)_ij, Θ^(k)_ji, k = 1, ..., K} has cardinality g_ij = 2K, and G = (V, E) is the GGM associated with each and all the datasets. Consequently, while the general formulation of Equation (3) can be useful for applications where the variables of the different datasets may partly not coincide due to missing values or other reasons, for the sake of clarity, from now on we will use the simplified formulation, adding remarks about the general formulation where needed.
We use the following notation throughout the rest of the paper:
- with bold capital letters we represent matrices, and with bold lower case letters, vectors, which are intended as columns if not stated otherwise;
- θ is the vector of the unknown coefficients, of dimension Kp(p − 1) × 1; θ_[ij] denotes the restriction of vector θ to the variables belonging to the group i < j;
- X^(k)_.i corresponds to the i-th column of matrix X^(k), and X^(k)_.−i corresponds to the submatrix of X^(k) without the i-th column;
- y, with dim(y) = Np × 1 and N = Σ_{k=1}^{K} n_k, denotes the vector concatenating the columns of all data matrices;
- X denotes the block-diagonal matrix made up of blocks built from the submatrices X^(k)_.−i.

With g_ij = 2K for all i, j and λ reparametrized as λ√(2K), the penalty in Equation (3) can be easily rewritten using the vectors θ_[ij] as

λ Σ_{i<j} ||θ_[ij]||.   (4)

Moreover, the goodness-of-fit term becomes

(1/2) ||D(y − Xθ)||²,   (5)

where D is the diagonal weight matrix accounting for the sample sizes n_k. Combining Equations (4) and (5), we rewrite the minimization problem in Equation (3) as follows:

θ̂ = arg min_θ F(θ) = arg min_θ (1/2) ||D(y − Xθ)||² + λ Σ_{i<j} ||θ_[ij]||.   (6)

This alternative formulation will be useful to present the algorithm and to study the theoretical properties of jewel.

Numerical Algorithm
Function F(θ) in Equation (6) is convex and separable in terms of the groups. Moreover, we note that matrix X, used in the formulation of Equation (6), satisfies the orthogonal group hypothesis, which requires the restriction of X to the columns of each group to be orthogonal; indeed, X_.[ij] is orthogonal by construction. Therefore, given λ, we can solve the minimization problem by applying the iterative group descent algorithm proposed in [20], which consists of updating one group of variables i < j at a time, freezing the other groups at their current values, and cycling until convergence.
More precisely, given a starting value for vector θ, the jewel algorithm updates the group of variables i < j by minimizing function F(θ) for that group, considering the rest of the variables fixed at their current values. Consider the group i < j and compute the subgradient of F(θ) with respect to the variables θ_[ij]. The subgradient is a vector of g_ij = 2K components given by

θ_[ij] − z + λu,   (7)

where u = dir(θ_[ij]) if θ_[ij] ≠ 0, and otherwise u is any vector in R^2K with ||u|| ≤ 1, and

z = X^T_.[ij] D²(y − Xθ̃),   (8)

with θ̃ equal to the current θ except that θ̃_[ij] = 0. Vector z therefore depends on the observed data X^(k), k = 1 ... K, and on the current values of the entries Θ^(k)_ml not involving the pair of variables {i, j} we are seeking.

From the work in [20], we have that the minimizer of function F(θ) with respect to the variables θ_[ij] is the following multivariate soft-thresholding operator:

θ̂_[ij] = (1 − λ/||z||)_+ z.   (9)

The soft-thresholding operator (·)_+ acts on vector z by shrinking it towards 0 by an amount λ if its norm is greater than or equal to λ, and by setting it equal to zero if its norm is less than λ.
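The soft-thresholding update of Equation (9) is straightforward to implement. A minimal Python/numpy sketch (illustrative, not the authors' R implementation):

```python
import numpy as np

def group_soft_threshold(z, lam):
    """Multivariate soft-thresholding, Equation (9): shrink vector z
    towards 0 by lam if ||z|| >= lam, otherwise return the zero vector."""
    norm = np.linalg.norm(z)
    if norm < lam:
        return np.zeros_like(z)
    return (1.0 - lam / norm) * z

z = np.array([3.0, 4.0])             # ||z|| = 5
print(group_soft_threshold(z, 1.0))  # scaled by (1 - 1/5): approx. [2.4, 3.2]
print(group_soft_threshold(z, 6.0))  # lam > ||z||: all zeros
```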
For a fixed value of λ, Equations (8) and (9) represent the updating step for each group i < j. The update is cyclically repeated for all groups, and the entire cycle constitutes the generic step of the iterative procedure, which is repeated until convergence. Precisely, we stop the procedure when the relative difference between two successive approximations of vector θ is less than a given tolerance tol, or when the algorithm reaches a maximum number of iterations.

Remark 1.
The numerical algorithm can be easily extended to minimize the general model of Equation (3). When the data matrices include a different number of variables p_k, vector z has dimension g_ij, which can be different for each pair {i, j}. Indeed, z and θ̂_[ij] incorporate only those k datasets for which the pair of variables {i, j} was observed. Equation (9) is still valid with λ · √(g_ij) in place of λ, as each update step is done independently for each pair and √(g_ij) is a constant that does not influence the convergence of the algorithm.
Although jewel's formal description involves large matrices and several matrix-vector products at each step, its implementation remains computationally feasible even for large datasets. Indeed, vectors y, θ and matrix X used in Equation (6) do not need to be explicitly built and the scalar products in Equation (8) can be obtained by modifying previously computed values.
Moreover, in our implementation of the group descent algorithm, we adopted the active shooting approach proposed in [9,21]. This strategy exploits the problem's sparse nature, providing a substantial increase in computational speed. The idea is to divide all pairs of variables into "active" ones, i.e., those that are not equal to zero at the current step, and "non-active" ones, i.e., those that are equal to zero at the current step, and to update only the former. From a computational point of view, we define the upper triangular matrix Active of dimension p × p, which keeps track of the current support of Θ^(k), k = 1 ... K. Active is initialized by setting all upper extra-diagonal elements equal to 1, meaning that at the beginning of the procedure all the groups i < j are "active". Then, if the soft-thresholding operator zeroes group {2, 3} during the iterations, the corresponding matrix element is set to zero, Active_23 = 0, indicating that the group {2, 3} is no longer active and its status is no longer updated. At the end of the algorithm, matrix Active contains the estimate of the edge set E, as (i, j) ∈ Ê ⇐⇒ Active_ij = 1.
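The active-shooting bookkeeping can be sketched as follows. Here `update_cycle` and `zeroed` are hypothetical stand-ins for one cycle of the algorithm and for the groups its soft-thresholding step sets to zero; the point is only the Active-matrix logic.

```python
import numpy as np

# Minimal sketch of the active-shooting bookkeeping: only groups marked
# active are visited, and a group zeroed by the soft-thresholding step
# is permanently deactivated.
p = 4
active = np.triu(np.ones((p, p), dtype=bool), k=1)  # all groups start active

def update_cycle(active, zeroed):
    """One cycle over the currently active groups; `zeroed` is a
    stand-in for the groups the soft-thresholding step sets to zero."""
    for i, j in zip(*np.nonzero(active)):
        if (i, j) in zeroed:
            active[i, j] = False
    return active

active = update_cycle(active, zeroed={(1, 2)})
print(active.sum())  # 5 groups remain active out of the initial 6
```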
We provide the pseudocode for the algorithm we implemented for a fixed parameter λ. We also show how vector z can be efficiently updated using the residuals R (k) of the linear regressions.
Note that at the end of each iteration, we evaluate the difference between two successive approximations of vector θ. In the simulation study, we found that tol has a small influence on the final estimate of Active; thus, we use tol = 0.01 to speed up the calculations. However, as tol might influence the evaluation of the residual error used to apply the BIC criterion, in Section 2.5 we set tol = 10^−4.

Remark 2.
Due to the separability of the function F(θ) in Equation (6), if the graph structure is block-diagonal (i.e., the adjacency matrix encoding the graph is block-diagonal), then the minimization problem in Equation (6) can be solved independently for each block. In the case of ultrahigh-dimensional data, this strategy turns out to be computationally very efficient, since each block could, in principle, be solved in parallel. The work in [22] provides the conditions for splitting the original problem into independent subproblems.

Theoretical Property
In this subsection, we establish the consistency property of the jewel estimator. Our findings are largely based on [23], where a GGM is inferred from temporal panel data. We start from the formulation of the minimization problem given in Equation (6). Before presenting the main result in Theorem 1, let us introduce some auxiliary notation and Lemma 1, which will be useful in the proof of the theorem.
Let us denote by θ_0 the true parameter vector of dimension Kp(p − 1) × 1. θ_0 is our unknown, and its non-zero components describe the true edge set E of the graph. θ_0 is naturally divided into (p choose 2) groups, each consisting of the true parameters θ_0[ij] whose row/column indexes refer to the same pair {i, j}. Let s denote the true number of edges in E and define the sets of "active" and "non-active" groups as

S = {(i, j), i < j : θ_0[ij] ≠ 0},  S^c = {(i, j), i < j : θ_0[ij] = 0}.

Therefore, S contains all pairs of nodes for which there is an edge in E, and S^c contains all pairs of nodes for which there is no edge. Now, referring to the linear regression formulation of jewel given in Equation (6), we define the additive Gaussian noise vector ε = y − Xθ_0. Given these definitions, the data model can be rewritten as y = Xθ_0 + ε, and the following lemma holds:

Lemma 1 (Group Lasso estimate characterization, cf. Lemma A.1 in [23]). A vector θ̂ is a solution to the convex optimization problem in Equation (6) if and only if there exists τ ∈ R^{p(p−1)K} such that [X^T D²(y − Xθ̂)] = λτ, with τ_[ij] = dir(θ̂_[ij]) if θ̂_[ij] ≠ 0 and ||τ_[ij]|| ≤ 1 otherwise, where dir(u) = u/||u|| is the directional vector of any non-zero vector u.
We can now state the main result, which was inspired by Theorem 4.1 of [23]. In the following, we define the analog of the empirical covariance matrix, C = X^T D² X, and the auxiliary stochastic vector ζ = X^T D² ε, which will appear in the main theorem; ζ_A and C_AA denote the restriction of vector ζ and of matrix C to the rows and columns indexed by the set A.
Proof. To prove the set equality Ê = E, we verify separately the two inclusions, Ê ⊆ E and E ⊆ Ê.

Let θ̂_S be the solution of the restricted problem in which only the active groups in S are free. Define θ̂ ∈ R^{p(p−1)K} such that its restriction to the set of active groups coincides with θ̂_S, while its restriction to the set of non-active groups is zero. To get the first inclusion, we need to prove that θ̂ is a solution of the full problem in Equation (6). By Lemma 1, it is sufficient to prove that there exists τ ∈ R^{p(p−1)K} such that

−X^T D²(y − Xθ̂) + λτ = 0,   (10)

with the conditions on τ given in Lemma 1. When θ̂_[ij] ≢ 0, the conditions in Equation (10) are satisfied by construction of θ̂. When θ̂_[ij] ≡ 0, the conditions in Equation (10) need to be verified. To this aim, substitute y = Xθ_0 + ε. After properly permuting the indexes of C, ζ, and τ, i.e., placing all the variables belonging to the active groups at the beginning and the non-active ones at the end, Equation (10) becomes a block system (11). Solving the first equation for θ̂ − θ_0 and substituting into the second, we obtain Equation (12), a stronger requirement called direction consistency in the original paper [23]. Starting from Equation (12) and applying hypothesis 3, the second inclusion follows. Remark 3. We stress that the hypotheses of Theorem 1 are weaker than the hypotheses of Theorem 4.1 in [23]. In fact, in our setting, the stochastic matrix C and vector ζ do not inherit the Gaussian distribution from the data. Therefore, our results are based on a probabilistic assumption on these stochastic objects and not on the underlying families of Gaussian distributions, i.e., on their covariance matrices Σ^(k), k = 1 ... K. If, on one hand, this could be a limitation, on the other hand, our result gives explicit conditions on the data that, in principle, could be verified given an estimate of vector θ. The same would not be possible if the assumptions involved the population matrices Σ^(k) instead of the population vector θ_0, because we do not estimate covariance matrices.

Remark 4.
In machine learning language, hypothesis 2 implies that λ must be chosen small enough to avoid killing real edges, i.e., to control the errors associated with the second inclusion (missed true edges). Hypothesis 3 implies that λ must be chosen large enough to avoid including false edges in the model, i.e., to control the errors associated with the first inclusion. Unfortunately, as often happens in the literature, the theoretical results provide no explicit expression for λ; thus, we will select it through data-driven criteria, as described in the next section.

Selection of Regularization Parameter
Like any other penalty-based method, jewel requires selecting the regularization parameter λ, which controls the resulting estimator's sparsity. A high value of λ results in a sparser and more interpretable estimator, but it may produce many false-negative edges. By contrast, a small value of λ results in a less sparse estimator with many false-positive edges.
Some authors have proposed using λ = log p/n or suggested empirical, application-driven choices, so that the resulting model is sufficiently complex to provide novel information and, at the same time, sufficiently sparse to be interpretable. However, the best practice remains to select λ by the Bayesian Information Criterion (BIC), Cross-Validation (CV), or other data-driven criteria (e.g., the quantile universal threshold (QUT) [24]). In this work, we propose the use of the BIC and CV approaches.
Bayesian Information Criterion (BIC): Following the idea in [21], we define the BIC for the K classes as the weighted sum of the BICs of the individual classes. For each class, the BIC comprises two terms: the logarithm of the residual sum of squares (RSS) and the degrees of freedom. For any value of λ, Algorithm 1 provides not only the solution θ̂, but also the RSS stored in the matrices R^(k) and the degrees of freedom as the number of non-zero pairs in the Active matrix; the BIC is then computed by combining these two quantities across the K classes.
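A sketch of how such a BIC could be assembled from the residual matrices R^(k) and the degrees of freedom. The exact per-class weighting used here (n_k times the log of the mean RSS, plus a log(n_k) complexity term) is an assumption for illustration, not the paper's exact formula.

```python
import numpy as np

def jewel_bic(residuals, df):
    """Sketch of a joint BIC over K classes: for each class, n_k times
    the log of the mean residual sum of squares plus log(n_k) times the
    degrees of freedom.  The weighting is an assumption."""
    bic = 0.0
    for R in residuals:               # R^(k): n_k x p residual matrix
        n_k = R.shape[0]
        rss = np.sum(R ** 2)
        bic += n_k * np.log(rss / n_k) + np.log(n_k) * df
    return bic
```

The candidate λ minimizing this criterion over the grid would then be selected.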

Cross-Validation (CV):
The idea of cross-validation (CV) is to split the data into F folds and, in turn, use one fold as the testing set and all the others as the training set. In our jewel procedure, we divide each dataset X^(k) into F folds of dimension n_f^k × p, and the f-th fold is the union of the f-th folds of each class. As in the standard CV procedure, for each parameter λ_l of the grid λ_1 < · · · < λ_L and for each fold (X^(k)_f)_{k=1,...,K}, we estimate Θ̂^(k)_−f(λ_l) on the remaining folds and then evaluate its prediction error err(f, l) on the held-out fold. Errors are then averaged over folds, CV(λ_l) = (1/F) Σ_{f=1}^{F} err(f, l), and the optimal parameter is chosen as λ_CV = arg min_{l=1,...,L} CV(λ_l).
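The fold splitting and the per-fold error can be sketched as follows; the normalization of the error by the fold size is an assumption for illustration, as is the use of the node-wise regression residual as the prediction error.

```python
import numpy as np

def cv_folds(n, F, rng):
    """Split the n samples of one class into F roughly equal folds."""
    idx = rng.permutation(n)
    return np.array_split(idx, F)

def cv_error(X, Theta, test_idx):
    """Prediction error of a fitted Theta on a held-out fold: the
    squared Frobenius residual of the node-wise regressions,
    normalized by the fold size (the normalization is an assumption)."""
    Xt = X[test_idx]
    return np.linalg.norm(Xt - Xt @ Theta, 'fro') ** 2 / len(test_idx)
```

CV(λ_l) is then obtained by summing `cv_error` over the K classes and averaging over the F folds.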
As part of our numerical procedure, for both criteria we start from the same grid of values λ_1 < · · · < λ_L, and we adopt the warm start initialization procedure combined with the active shooting approach. Warm start initialization means that we first apply Algorithm 1 with the smallest value λ_1, obtaining the solution θ̂_{λ_1} and Active_{λ_1}. Then, when moving to the next value λ_{l+1}, we initialize Algorithm 1 with Active_{λ_l} and θ̂_{λ_l}. Starting with a sparse Active matrix instead of a full one allows the algorithm to cycle over a smaller number of groups, accelerating the iteration step. Note that with warm start initialization we not only reduce the computational cost but also obtain nested solutions.
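The warm-start sweep can be sketched as follows; `fit` is a hypothetical stand-in for one call of Algorithm 1, taking a λ value, the current Active matrix, and the previous estimate, and returning the updated pair.

```python
import numpy as np

def warm_start_path(fit, lambdas, p):
    """Sketch of the warm-start sweep: solve for the smallest lambda
    first, then reuse the previous Active set and estimate as the
    initialization for the next, larger lambda."""
    active = np.triu(np.ones((p, p), dtype=bool), k=1)  # full at start
    theta = None
    path = []
    for lam in sorted(lambdas):
        theta, active = fit(lam, active, theta)
        path.append((lam, active.copy()))
    return path
```

Because groups are only ever deactivated along the sweep, the supports in `path` are nested, mirroring the nested solutions noted above.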

Code Availability
jewel is implemented as the R package jewel, which is freely available at https://github.com/annaplaksienko/jewel (accessed on 11 July 2021).

Simulation Studies
This section presents simulation results to demonstrate the empirical performance of jewel from different perspectives. Specifically, we conducted three types of experiments.
In the first, we show the performance of jewel as a function of the number of classes K, assessing the advantages of using more than one dataset. In the second, given the same K datasets, we show that the joint approach is better than performing K independent analyses with a voting strategy or fitting a single concatenated dataset. Finally, in the third experiment, we compare the performance of jewel with two existing methods, the joint graphical lasso (JGL) [14] and the proposal of Guo et al. [13]. We used the JGL package for the former and the code, kindly provided by the authors of [13], for the latter.
Before presenting the results, we briefly describe the data generation and the metrics we use to measure performance.
Data generation: Given K, p, and n_k, we generated a "true" scale-free network G = (V, E) with p nodes according to the Barabási–Albert algorithm, with the help of the igraph package [25]. If not stated otherwise, the number of edges added at each step of the algorithm, m, and the power of the preferential attachment, power, were both set to 1. The resulting graph G was sparse and had mp − (2m − 1) edges. Then, we generated K precision matrices Ω^(k). To this purpose, for each k, we created a p × p matrix with 0s on the elements not corresponding to the network edges and symmetric values sampled from the uniform distribution with support on [−0.8, −0.2] ∪ [0.2, 0.8] for the elements corresponding to the existing edges. To ensure the positive definiteness of Ω^(k), we set its diagonal elements equal to |µ_min(Ω^(k))| + 0.1, with µ_min(A) being the minimum eigenvalue of matrix A. We then inverted Ω^(k) and set Σ^(k) to the corresponding correlation matrix. Finally, for each k, we sampled n_k independent, identically distributed observations from N(0, Σ^(k)).
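The precision-matrix construction described above can be sketched as follows (a Python/numpy illustration, not the authors' R code; the Barabási–Albert graph generation and the rescaling of Σ to a correlation matrix are omitted, and a fixed chain graph stands in for the scale-free network).

```python
import numpy as np

def make_precision(p, edges, rng):
    """Build a precision matrix Omega supported on `edges`, following
    the data generation scheme: symmetric off-diagonal entries drawn
    uniformly from [-0.8, -0.2] U [0.2, 0.8], and diagonal entries set
    to |mu_min(Omega)| + 0.1, which shifts the smallest eigenvalue
    above 0.1 > 0 and hence ensures positive definiteness."""
    omega = np.zeros((p, p))
    for i, j in edges:
        v = rng.uniform(0.2, 0.8) * rng.choice([-1.0, 1.0])
        omega[i, j] = omega[j, i] = v
    mu_min = np.linalg.eigvalsh(omega).min()
    np.fill_diagonal(omega, abs(mu_min) + 0.1)
    return omega

# Chain graph on 5 nodes standing in for a scale-free network;
# samples for one class are then drawn from N(0, inv(omega)).
rng = np.random.default_rng(0)
omega = make_precision(5, [(0, 1), (1, 2), (2, 3), (3, 4)], rng)
samples = rng.multivariate_normal(np.zeros(5), np.linalg.inv(omega), size=50)
```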

Performance measures:
We evaluated the estimate Ê of the graph structure using the true positive rate (TPR) and the false positive rate (FPR), defined, respectively, as TPR = |Ê ∩ E| / |E| and FPR = |Ê ∩ E^c| / |E^c|, where A^c is the complement of set A. TPR is the proportion of true edges correctly identified, and FPR is the proportion of absent edges incorrectly identified as edges. As usually done in the literature, to judge the method's performance without being influenced by λ, we used the ROC curve (receiver operating characteristic), i.e., TPR against FPR for different values of λ. Our experiments used a grid of λs equispaced on the log scale, consisting of 50 values ranging from 0.01 to 1. We averaged both performance metrics and running time over 20 independent realizations of the above data generation procedure. Running time was measured on a computer with a 4-core 3.6 GHz processor and 16 GB of RAM.
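The two metrics can be computed directly from the estimated and true edge sets; a minimal sketch on a toy example:

```python
def tpr_fpr(est_edges, true_edges, p):
    """TPR and FPR of an estimated edge set against the true one,
    over all p*(p-1)/2 candidate pairs (i < j)."""
    all_pairs = {(i, j) for i in range(p) for j in range(i + 1, p)}
    tp = len(est_edges & true_edges)          # correctly identified edges
    fp = len(est_edges - true_edges)          # incorrectly added edges
    return tp / len(true_edges), fp / len(all_pairs - true_edges)

# p = 4: 6 candidate pairs, 2 true edges, 4 true non-edges.
print(tpr_fpr({(0, 1), (0, 3)}, {(0, 1), (1, 2)}, 4))  # (0.5, 0.25)
```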

More Datasets Provide Better Performance
This first experiment aims to quantify the gain in estimating E when the number of datasets K increases. The simulation settings are as follows. We simulated 10 datasets as described above for two dimensionality cases: p = 100 with n_k = 50 ∀k (n_k/p = 1/2) and p = 500 with n_k = 100 ∀k (n_k/p = 1/5). We repeated the dataset generation 20 times. For each case, we first applied jewel to the K = 10 datasets in each of the 20 runs and computed the average TPR and FPR. Then, we sampled K = 5 datasets (in each run) and repeated the procedure. We then subsampled K = 3 datasets out of the previous 5, K = 2 out of 3, and K = 1 out of 2. In other words, for each value of K = 10, 5, 3, 2, 1, we applied jewel to 20 realizations and evaluated the average TPR and FPR.
The average ROC curves in Figure 1 illustrate the performance as a function of K. Results in Figure 1 show a steady improvement as K grows (which we expect, given the increasing amount of available data) and demonstrate that even a limited increase in the number of datasets can provide a significant gain in performance. Indeed, we observed a remarkable improvement going from K = 1 to K = 2 or K = 3. Of course, this improvement comes at the price of increased computational time. However, this price is not excessive: the running time increases from ≈40 min for K = 1 to ≈1.5 h for K = 3 when considering the whole grid of 50 λ parameters. Since the grid of λ is uniform in log scale and starts from 0.01, half of the values lie between 0.01 and 0.1; starting from a larger λ₁ or using fewer values would reduce the running time to minutes. Note also that these running times refer to the case in which jewel is run over the entire grid of λ without the warm start procedure.

The Joint Approach Is Better Than Voting and Concatenation
In the same spirit as [26], this second experiment shows the importance of adopting a joint approach when analyzing multiple datasets instead of other naive alternatives.
More precisely, given K datasets sharing the same network structure E, we want to show that the joint analysis performed by jewel has important advantages with respect to the two following alternatives. The first is the concatenation approach, where all datasets are combined into one extended matrix of size (∑_k n_k) × p and jewel is applied with K = 1. The second is the voting approach, where each dataset is processed independently by jewel with K = 1, obtaining K estimates of the adjacency matrices. Then, we build a consensus matrix by setting an edge if it is present in at least K/2 of the estimated matrices. Figure 2 illustrates a schematic representation of these approaches.
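The two baselines can be sketched as follows; this is an illustrative Python sketch of the data handling only (the estimator itself is the jewel R package and is not reproduced here):

```python
import numpy as np

def concatenate(X_list):
    # Concatenation baseline: stack the K datasets into a single
    # (sum_k n_k) x p matrix and treat them as one sample,
    # ignoring distributional differences.
    return np.vstack(X_list)

def majority_vote(A_list):
    # Voting baseline: given K estimated adjacency matrices (0/1),
    # keep an edge iff it appears in at least K/2 of them.
    K = len(A_list)
    votes = np.sum(np.asarray(A_list), axis=0)
    return (votes >= K / 2).astype(int)
```

The joint approach, in contrast, passes all K matrices to the estimator at once, so the shared support is enforced during the fit rather than before (concatenation) or after (voting) it.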
The simulation setting for this second experiment is the following. We generated 20 independent runs, each with K = 3 datasets. We considered two dimensionality scenarios: p = 100 with n_k = 50 and p = 500 with n_k = 100. For the concatenation approach, as a first step, we constructed the long Kn_k × p matrix and then applied jewel. For the voting approach, we applied the method separately to each data matrix X(k), k = 1 . . . 3, and then put an edge in the resulting adjacency matrix only if it was present in at least 2 out of the 3 estimated adjacency matrices. The ROC curves in the left and right panels of Figure 3 represent each approach's performance in the two scenarios, respectively.
First, we note that the performance in the first scenario (left panel) is superior to that in the second scenario (right panel), due to the more severe high-dimensional regime of the latter. Second, we observe a significant advantage in processing the datasets jointly with respect to the other two approaches. Indeed, with the same amount of data, jewel correctly exploits the commonalities and the differences among the K distributions and provides a more accurate estimate of E. Instead, the concatenation approach ignores the distributional differences by creating a single data matrix (from non-identically distributed datasets), and the voting approach exploits the common structure of the network only during the post-processing (voting) of the estimates. As a consequence, both the concatenation and voting approaches result in a loss of performance.

Comparison of Jewel with Other Joint Estimation Methods
In this third experiment, we compare the performance and running time of jewel with two other methods for joint estimation: the joint graphical lasso (JGL) with group penalty [14] and the proposal of Guo et al. [13]. JGL requires two tuning parameters, λ1 and λ2, where the first is responsible for differences in the supports of Θ(k), k = 1 . . . K. Thus, in accordance with our hypothesis that the patterns of non-zero elements are identical across classes, we set λ1 = 0 and varied only λ2. For the proposal of Guo et al., we considered the union of the supports of Θ(k) as the final adjacency matrix estimate (OR rule).
For the sake of brevity, in this section we discuss only the setting K = 3, p = 500, n_k = 100 ∀k (n_k/p = 1/5). The other settings show analogous results and do not add value to our exposition. Instead, as an added value to this study, we explored the influence of the type of "true" graph on the methods' performance. Specifically, we compared results obtained for different scale-free graphs generated by changing the parameters m and power. The first, m, controls the graph's sparsity, since the number of edges in the resulting graph is equal to mp − (2m − 1). The power parameter controls the graph's hub structure: the larger the power, the bigger the hubs. We considered six different m–power scenarios, with m = 1, 2 (resulting in 499 edges with 0.4% sparsity and 997 edges with 0.8% sparsity, respectively) and power = 0.5, 1, 1.5. In each scenario, we generated the "true" underlying graph for 20 independent realizations; see Figure 4 for a random realization in each scenario. We then proceeded with the same scheme described before, generating the data, applying jewel, JGL, and the Guo et al. method, and finally evaluating the average performance and running time. In Figure 5, we show the results of this third experiment and observe that, on average, jewel and JGL are comparable in performance, both being superior to the proposal of Guo et al. This observation remains true even in the worst-case scenario, i.e., power = 1.5. Overall, Figure 5 illustrates the good performance of this class of methods in the sparse regime, although the performance decreases as m increases (in the worst case, it becomes similar to a random guess for all methods). More specifically, we note that increasing power, i.e., the hub size, leads to a significant loss in performance for all the methods.
This observation agrees with the recent paper [27], which, for the case of one dataset, explores classical methods like the one treated in this paper for inferring a network with big hubs and reaches the same conclusion. This observation is quite important since, in many real-world networks, the power is often estimated between 2 and 3. We could introduce degree-induced weights into the penalty to overcome this limitation, but this possibility is not explored in this paper.
In Table 1, we report the running time for some specific values of λ and for the entire grid of λ in the case m = 1, power = 1 (we omit the other cases as these parameters do not influence the running time). Since the tolerance influences the running time, we note that for each method we used its default stopping criterion threshold, i.e., tol = 0.01 for both jewel and the Guo et al. method and tol = 10⁻⁴ for JGL. As we can see, without using the warm start, jewel is approximately two times faster than JGL and several times faster than Guo et al. for small values of λ. This does not hold for larger values of λ, where jewel has to pay the price of the full initialization of the Active matrix. However, in real data applications, it is unlikely that such large values of λ would be used, since they imply setting most of the connections to zero and, hence, many false negatives. In practical applications, interesting values of λ lie in the range for which jewel is faster than both its competitors.

Tuning Parameter Estimation
Here, we show the results obtained using the BIC and CV criteria described in Section 2.5. The jewel package has both criteria built in. By default, we fixed 5 folds for the CV and implemented parallelization on a 4-core machine. The warm start procedure was implemented for both criteria.
The simulation setting is the following: for p = 500, n_k = 100, K = 3, we generated 20 independent runs as described before with the default values m = 1 and power = 1. We used a grid of 50 λs uniformly spaced in log scale from 0.1 to 1. We set the stopping criterion threshold to tol = 10⁻⁴ instead of the default value tol = 10⁻² to achieve higher accuracy in the estimation of the regression coefficients Θ̂(k) and residuals R(k), k = 1 . . . K, which are required by both the BIC and CV criteria.
In Figure 6, we plot the values of BIC and CV error for each λ_l for one realization of the data, randomly chosen out of the twenty independent runs. In Table 2, instead, we report results averaged over all 20 runs. For each run, we first estimated λ_BIC and λ_CV with the two criteria, then ran jewel with these values and evaluated the performance in terms of accuracy, precision, recall (accuracy = (TP + TN)/(TP + TN + FP + FN), precision = TP/(TP + FP), recall = TP/(TP + FN)) and running time. As we can see from Figure 6, both curves show a clear and well-defined minimum and provide an estimate of λ_OPT. From Table 2, we also observe that the running time is around ≈3.5 min, while the estimation without warm start would take around 3.5 h. Thus, we suggest using the warm start initialization for both criteria.
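The CV scheme above can be sketched generically. In the following Python sketch, `fit_predict` is a hypothetical stand-in for the jewel estimator (it is assumed to fit the node-wise regressions on the training folds and return the held-out prediction error); the fold assignment by index modulo n_folds is one simple choice:

```python
import numpy as np

def cv_select_lambda(X_list, lambdas, fit_predict, n_folds=5):
    # Generic n-fold cross-validation over a lambda grid.
    errors = np.zeros(len(lambdas))
    for li, lam in enumerate(lambdas):
        for fold in range(n_folds):
            train, test = [], []
            for X in X_list:
                # split each of the K datasets into train/held-out rows
                mask = np.arange(len(X)) % n_folds == fold
                train.append(X[~mask])
                test.append(X[mask])
            errors[li] += fit_predict(train, test, lam)
        errors[li] /= n_folds
    # Pick the lambda with minimal average held-out error.
    return lambdas[int(np.argmin(errors))], errors
```

A warm start fits naturally into this loop: moving through the sorted λ grid, each fit can be initialized from the solution at the previous λ, which is what makes the ≈3.5 min runtime reported above possible.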
On average, both estimates λ_BIC and λ_CV are quite close to λ = √(log p/n) = √(log 500/100) ≈ 0.249, which some authors adopt as the regularization parameter (without applying data-driven selection criteria). However, this choice seems too large because it eliminates too many correct edges, reaching high accuracy but relatively low precision and recall. This fact indicates that the choice of the regularization parameter is still an open and crucial problem, and thus we conclude that, although BIC and CV are fundamental for choosing λ in the absence of any other information, their performance is not yet optimal.

Real Data Analysis
This section shows the application of jewel to gene expression datasets of patients with glioblastoma, which is the most common type of malignant brain tumor. We used three microarray datasets from Gene Expression Omnibus (GEO) database [28]: GSE22866 [29] (Agilent Microarray), GSE4290 [30], and GSE7696 [31,32] (both Affymetrix Array). We annotated the probes using the biomaRt R package. In case of multiple matches between probes and Ensembl gene ids, we gave preference to genes in common among all datasets or, in case of further uncertainty, to those present in selected pathways (see below). Then, we converted Ensembl gene ids to gene symbols, and we averaged gene expression over the probes, obtaining K = 3 matrices with dimensions 40 × 20,861, 77 × 16,801, 80 × 16,804, respectively. For the sake of simplicity, we considered only the p = 13,323 genes in common to all three datasets.
For this illustrative analysis, we limited our attention to the genes belonging to seven pathways from the Kyoto Encyclopedia of Genes and Genomes database [33] that are associated with cancer: p53 signaling pathway (hsa04115), glutamatergic synapse (hsa04724), chemokine signaling pathway (hsa04062), PI3K-Akt signaling pathway (hsa04151), glioma pathway (hsa05214), mTOR signaling pathway (hsa04150), and cytokine-cytokine receptor interaction (hsa04060). These pathways involve 920 genes in total; of them, p = 483 were present in our datasets. Therefore, we applied jewel to this subset of genes. As described in the previous section, we selected the regularization parameter λ with both the BIC and CV procedures. Finally, we compared the two estimated networks with a network obtained from the STRING database. The details are given below.
First, when we used BIC (with the warm start) to estimate the optimal value of λ, we obtained λ BIC = 0.2223 (see Figure 7). Therefore, the estimated graph G BIC is the solution of jewel corresponding to this parameter. It has 3113 edges (about 2.7% of all possible edges), and all 483 vertices have a degree of at least 1.
When we used CV (with the warm start) to estimate the optimal value of λ, we obtained λ_CV = 0.1151 (see Figure 7). We ran jewel with this value of the regularization parameter. The resulting graph G_CV has 7272 edges (about 6.2% of all possible edges), and all 483 vertices have a degree of at least 1. Since λ_CV < λ_BIC in this example, G_CV has more connections than G_BIC.
Then, to better understand the identified connections, we analyzed the p = 483 genes in the STRING database [34]. STRING is a database of known and predicted protein-protein interactions, both physical and functional, derived from lab experiments, known co-expression, genomic context predictions, knowledge in curated databases, and text mining. We limited the query to connections from "experiments" and "databases" as active interaction sources, setting the minimum required interaction score to the highest value of 0.9. The resulting STRING network had 415 out of 483 vertices connected to at least one other node and 4134 edges. We measured the number of connections common to our estimated networks and the network from the STRING database. For each case, Figure 8 shows the connections identified by jewel that were also present in the STRING database. For G_BIC, we observed 170 edges in common, while for G_CV, we had 297 common edges (see all the results summarized in Table 3). Although the number of edges in the intersection can seem low at first sight, it is significant according to the hypergeometric test. Nevertheless, we should note two things. First, jewel seeks to identify conditional correlations among variables or, equivalently, linear relationships between two genes not mediated by other factors (i.e., other genes), whereas connections in the STRING database are not necessarily of this nature. Second, STRING contains general protein-protein interactions, i.e., interactions that are not necessarily present in the tissue/condition studied in the datasets used. Therefore, we do not expect to identify mechanisms that might occur in other biological conditions (our gene expression data come from glioblastoma).
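The enrichment of the overlap can be checked with the hypergeometric upper tail. The following pure-stdlib Python sketch treats the p(p − 1)/2 unordered gene pairs as the population, the STRING edges as the successes, and the jewel edges as the draws; this choice of population is our assumption, since the paper does not state the exact test setup:

```python
from math import comb

def hypergeom_pval(x, N, K_succ, n_draw):
    # P(X >= x) for a hypergeometric variable: the probability of observing
    # at least x common edges when n_draw edges are chosen at random among
    # N node pairs, of which K_succ are reference (e.g., STRING) edges.
    total = comb(N, n_draw)
    upper = min(K_succ, n_draw)
    return sum(comb(K_succ, i) * comb(N - K_succ, n_draw - i)
               for i in range(x, upper + 1)) / total
```

For the G_BIC case one would plug in N = 483·482/2, K_succ = 4134, n_draw = 3113, x = 170 (exact big-integer arithmetic makes this slow but feasible; in practice a library routine such as a survival function of a hypergeometric distribution is preferable).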
However, we noticed many groups of genes identified consistently, such as collagen alpha chains, ionotropic glutamate receptors, frizzled class receptors, interleukin 1 receptors, fibroblast growth factors, and others. The biggest hubs in G_BIC include PPP3CC (frequently underexpressed in gliomas), RCHY1 (conversely, typically highly expressed in this condition), and IL4R (associated with better survival rates). In G_CV, the biggest hubs are TNN (considered a therapeutic target, since an increase in its expression can suppress brain tumor growth), CALML6, and BCL-2 (which can block apoptosis, i.e., cell death, and therefore may influence tumor prognosis).
To conclude, jewel demonstrated its ability to identify connections from gene expression microarray data. However, the choice of the regularization parameter likely still deserves improvement to achieve better results.

Discussion
The proposed method jewel is a methodological contribution to the GGM inference in the context of multiple datasets. It provides a valid alternative if the user is interested only in the structure of the GGM and not in covariance estimation. The proposed method is easy to use with the R package jewel.
There are still some aspects that can be improved and that constitute directions for future work. As the performance of any regularization method depends on the choice of the tuning parameter, jewel could be improved with a better choice of λ. For example, in [24] the authors suggest the quantile universal threshold, λ_QUT, which is the upper α-quantile of the introduced zero-thresholding function under the null model. When not only the response vector but also the design matrix is random (as in jewel), bootstrapping within a Monte Carlo simulation can be used to evaluate λ_QUT.
Moving to more methodological improvements, we could incorporate degree-induced weights into the minimization problem to account for the underlying graph's hub structure. In this way, we could overcome the decrease in performance exhibited by all the analyzed methods. Furthermore, we could consider other grouping approaches, such as the neighbor-dependent synergy described in [35]. Another possible improvement, for the case in which the underlying graphs are not the same across all K datasets, is to decompose Θ(k) as a sum of two factors, one describing the common part and the other the differences between the K graphs, and then add a second group Lasso penalty term to capture the differences between the networks. Other intriguing improvements regard the incorporation of specific prior knowledge, which would lead to a different initialization of the Active matrix. For example, using variable screening procedures, i.e., a preliminary analysis of the input data that identifies connections that are not "important" with high probability, we could reduce the problem's dimensionality. Another aspect concerns implementing a block-diagonalization approach, i.e., identifying blocks in the underlying graph and performing an independent execution of jewel on each block. Such a choice does not influence performance but can significantly decrease the running time, especially if we parallelize the execution of jewel over the different blocks.
Finally, another point that might greatly impact the application of jewel to the analysis of gene expression has to do with the Gaussian assumption. Nowadays, RNA-seq data have become popular and are replacing the old microarray technology. However, RNA-seq data are counts; therefore, the Gaussian assumption does not hold. Methods such as voom [36] can be used to transform RNA-seq data and stabilize the variance. voom estimates the mean-variance relationship of the log-counts, generates a precision weight for each observation, and enters these into the limma [37] empirical Bayes analysis pipeline. With this transformation, RNA-seq data can be analyzed using tools similar to those for microarrays, jewel included. As a more appealing alternative, one could develop a joint graphical approach in the context of count data, such as in [38][39][40].