Topology Adaptive Graph Estimation in High Dimensions

We introduce Graphical TREX (GTREX), a novel method for graph estimation in high-dimensional Gaussian graphical models. By conducting neighborhood selection with TREX, GTREX avoids tuning parameters and is adaptive to the graph topology. We compare GTREX with standard methods on a new simulation setup that is designed to accurately assess the strengths and shortcomings of different methods. These simulations show that a neighborhood selection scheme based on Lasso and an optimal (in practice unknown) tuning parameter outperforms other standard methods over a large spectrum of scenarios. Moreover, we show that GTREX can rival this scheme and, therefore, can provide competitive graph estimation without the need for tuning parameter calibration.


Introduction
Graphical models have become an important tool to find and describe patterns in high-dimensional data ([1], Chapter 3). In biology, for example, graphical models have been successfully applied to estimate interactions between genes from high-throughput expression profiles [2,3], to predict contacts between protein residues from multiple sequence alignments [4], and to uncover the interactions of microbes from gene sequencing data [5]. Graphical models represent the conditional dependence structure of the underlying random variables as a graph. Learning a graphical model from data requires a simultaneous estimation of the graph and of the probability distribution that factorizes according to this graph. In the Gaussian case, it is well known that the underlying graph is determined by the non-zero entries of the precision matrix (the inverse of the population covariance matrix). Gaussian graphical models have become particularly popular after the advent of computationally efficient approaches, such as neighborhood selection [6] and sparse covariance estimation [7,8], which can learn even high-dimensional graphical models. Neighborhood selection, on the one hand, reconstructs the graph by estimating the local neighborhood of each node via Lasso [9]. This approach is usually seen as a proxy to the covariance selection problem [10]. On the other hand, References [7,8] showed that the graph and the precision matrix can be simultaneously estimated by solving a global optimization problem. State-of-the-art solvers are graphical Lasso [10] and the Quadratic Approximation of Inverse Covariance (QUIC) method [11]. Both approaches can be extended beyond the framework of Gaussian graphical models. To mention two of the many examples, Reference [12] studied neighborhood selection for Ising models, and [13] introduced a semi-parametric penalized likelihood estimator that allows for non-Gaussian distributions of the data.
Although the field has advanced tremendously in the past decade, there are still a number of challenges, both from a practical and a theoretical point of view. First, the conditions that are currently imposed [6,12,14,15] to show consistency in graph and/or graphical model estimation are difficult to meet or verify in practice. Moreover, the performance of any of the standard methods heavily depends on the simulation setup or the data at hand [5,16,17]. Furthermore, standard neighborhood selection and covariance estimation methods require a careful calibration of a tuning parameter, especially because the model complexity is known a priori only in very few examples [4]. In practice, the tuning parameters are calibrated via cross-validation, classical information criteria such as the AIC and BIC [8], or stability criteria [18]. However, different calibration schemes can result in largely disparate estimates [18].
To approach some of the practical challenges, we introduce Graphical TREX (GTREX), a novel method for graph estimation based on neighborhood selection with TREX [19]. The main feature of GTREX is that it can make tuning parameters superfluous, which renders this method particularly useful in practice. We also introduce a novel simulation setup that may serve as a benchmark to assess the strengths and shortcomings of different methods.
Our simulations showed that, if the tuning parameter is optimally chosen, standard neighborhood selection with the "or-rule" outmatches other standard methods across a wide range of scenarios. Our simulations also showed that GTREX can rival this method in many scenarios. Since optimal tuning parameters depend on unknown quantities and, therefore, are not accessible in practice, this demonstrates that GTREX is a promising alternative for graph estimation in high-dimensional graphical models.
The remainder of the paper is structured as follows. After specifying the framework and notation, we introduce GTREX in Section 2. We then describe the experimental scenarios in Section 3 and present the numerical results in Section 4. We finally conclude in Section 5.

Framework and Notation
We considered n samples from a p-dimensional Gaussian distribution N_p(0, Σ) with positive-definite, symmetric covariance matrix Σ ∈ R^{p×p}. The samples are summarized in the matrix X ∈ R^{n×p} such that X_{ij} corresponds to the jth component of the ith sample. We call Σ^{-1} the precision matrix and note that the precision matrix is symmetric.
The Gaussian distribution N_p(0, Σ) can be associated with an undirected graph G = (V, E), where V = {1, . . . , p} is the set of nodes and E ⊆ V × V the set of (undirected) edges that consists of all pairs (i, j), (j, i) ∈ V × V that fulfill i ≠ j and (Σ^{-1})_{ij} ≠ 0. We denote by e_{ij} (and equivalently by e_{ji}) the edge that corresponds to the pair (i, j), (j, i) and by s := |E| the total number of edges in the graph G.
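As a concrete illustration, the edge set determined by the non-zero off-diagonal pattern of a precision matrix can be read off as follows (a minimal Python sketch; the function name and the numerical tolerance are our own choices, not part of the method):

```python
def edges_from_precision(theta, tol=1e-12):
    """Edge set of the Gaussian graphical model: e_ij is an edge
    iff i != j and the precision-matrix entry (i, j) is non-zero."""
    p = len(theta)
    return {(i, j) for i in range(p) for j in range(i + 1, p)
            if abs(theta[i][j]) > tol}

# A 3-node chain: only the pairs (0,1) and (1,2) are conditionally dependent.
theta = [[2.0, 0.6, 0.0],
         [0.6, 2.0, 0.6],
         [0.0, 0.6, 2.0]]
print(edges_from_precision(theta) == {(0, 1), (1, 2)})  # True
```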
We denote by supp(β) the support of a vector β, by a ∨ b and a ∧ b the maximum and minimum, respectively, of two constants a, b ∈ R, and by | · | the cardinality of a set.
In this paper, we focused on estimating from the data X which entries of the precision matrix Σ^{-1} are non-zero. This is equivalent to estimating the set of edges E from X. We assessed the quality of an estimate Ẽ via the Hamming distance to the true set of edges E, given by

d_H(Ẽ, E) := |{e_{ij} : e_{ij} ∈ Ẽ, e_{ij} ∉ E} ∪ {e_{ij} : e_{ij} ∉ Ẽ, e_{ij} ∈ E}|.
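In words, d_H counts the edges that appear in exactly one of the two sets, i.e., the size of their symmetric difference. A minimal sketch (function name ours):

```python
def hamming_distance(E_hat, E_true):
    """d_H: number of edges present in exactly one of the two edge sets
    (the symmetric difference of E_hat and E_true)."""
    return len(set(E_hat) ^ set(E_true))

E_true = {(1, 2), (1, 3)}
E_hat = {(1, 2), (2, 3)}                # one false positive, one false negative
print(hamming_distance(E_hat, E_true))  # 2
```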

Methodology
Before introducing our new estimator GTREX, we first recall the definitions of graphical Lasso and of neighborhood selection with Lasso. For a fixed tuning parameter λ > 0, Graphical Lasso (GLasso) estimates the precision matrix Σ^{-1} from X according to [10]:

Θ̂^λ_GLasso ∈ argmin { −log det(Θ) + trace(Σ̂Θ) + λ‖Θ‖₁ },    (1)

where the minimum is taken over all positive-definite matrices Θ ∈ R^{p×p}, Σ̂ := XᵀX/n is the sample covariance matrix, and ‖Θ‖₁ := ∑_{i,j=1}^{p} |Θ_{ij}| is the sum of the absolute values of the entries of Θ. The corresponding estimator for the set of edges E is then:

Ê^λ_GLasso := {e_{ij} : i ≠ j, (Θ̂^λ_GLasso)_{ij} ≠ 0}.

This defines a family of graph estimators indexed by the tuning parameter λ. To assess the potential of GLasso, we define Ê*_GLasso := Ê^{λ*}_GLasso, where λ* is the tuning parameter that minimizes the Hamming distance to the true edge set E. We stress, however, that the optimal tuning parameter λ* is not accessible in practice and that there are no guarantees that standard calibration schemes provide a tuning parameter close to λ*. Therefore, the performance of Ê*_GLasso is to be understood as an upper bound for the performance of GLasso.

Besides GLasso, we also considered neighborhood selection with Lasso. To this end, we define for any matrix X̃ ∈ R^{ñ×p}, ñ ≤ n, and for any node k ∈ V, the vector X̃_k ∈ R^{ñ} as the kth column of X̃ and the matrix X̃_{−k} ∈ R^{ñ×(p−1)} as X̃ without the kth column. For a fixed tuning parameter λ > 0, the Lasso estimates for node k are defined according to [9]:

β̂^λ_LASSO(k; X̃) ∈ argmin_{β ∈ R^{p−1}} { ‖X̃_k − X̃_{−k}β‖₂²/ñ + λ‖β‖₁ }.    (2)

The corresponding sets of edges Ê^λ_and (with the "and-rule") and Ê^λ_or (with the "or-rule") are then defined via Algorithm 1 following Meinshausen and Bühlmann [6]. (An interesting alternative would be the symmetric approach in [20].) Similarly as above, we define the optimal representatives (in terms of the Hamming distance) of these families of estimators as Ê*_and, called MB(and), and Ê*_or, called MB(or).
Again, in practice, it is not known which tuning parameters are optimal; however, MB(and) and MB(or) can highlight the potential of neighborhood selection with Lasso.

Algorithm 1: Neighborhood selection with Lasso.
Data: X ∈ R^{n×p}, λ > 0;
Result: Ê^λ_and, Ê^λ_or;
Initialize a matrix C := 0_{p×p};
for k = 1 to p do
    Compute β̂^λ_LASSO(k; X) according to (2);
    Update the kth column C_k of the matrix C according to C_k := β̂^λ_LASSO(k; X);
end
Set the estimated sets to
Ê^λ_and := {e_{ij} : |C_{ij}| ∧ |C_{ji}| > 0} and Ê^λ_or := {e_{ij} : |C_{ij}| ∨ |C_{ji}| > 0};

We finally introduce Graphical TREX (GTREX). To this end, we considered TREX for node k on a subsample X̃ according to [19]:

β̂(k; X̃) ∈ argmin_{β ∈ R^{p−1}} { ‖X̃_k − X̃_{−k}β‖₂² / ( (1/2) ‖X̃ᵀ_{−k}(X̃_k − X̃_{−k}β)‖_∞ ) + ‖β‖₁ }.    (3)

For a fixed number of bootstraps b ∈ {1, 2, . . . } and threshold t ≥ 0, we then define GTREX as the set of edges Ê provided by Algorithm 2. The bootstrap part of the algorithm might remind us of the stability selection method [21], but has a different focus: it concerns the edge selection directly rather than the calibration of a tuning parameter.
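The final combination step of Algorithm 1 can be sketched as follows, assuming the nodewise Lasso coefficients have already been collected in a matrix C (function name ours; the and-rule requires both directions to be selected, the or-rule either one):

```python
def combine_rules(C):
    """Turn the matrix C of nodewise Lasso coefficients into edge sets:
    the and-rule keeps e_ij if both C_ij and C_ji are non-zero,
    the or-rule keeps e_ij if at least one of them is."""
    p = len(C)
    e_and = {(i, j) for i in range(p) for j in range(i + 1, p)
             if C[i][j] != 0 and C[j][i] != 0}
    e_or = {(i, j) for i in range(p) for j in range(i + 1, p)
            if C[i][j] != 0 or C[j][i] != 0}
    return e_and, e_or

# Node 0 selects node 1 and vice versa; node 1 selects node 2, but not conversely.
C = [[0.0, 0.5, 0.0],
     [0.4, 0.0, 0.3],
     [0.0, 0.0, 0.0]]
e_and, e_or = combine_rules(C)
print(e_and == {(0, 1)})            # True
print(e_or == {(0, 1), (1, 2)})     # True
```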
For the actual implementation, we followed [19] and invoked that ‖a‖_∞ ≈ ‖a‖_q for q sufficiently large. We then used a projected sub-gradient method to minimize the objective

‖X̃_k − X̃_{−k}β‖₂² / ( (1/2) ‖X̃ᵀ_{−k}(X̃_k − X̃_{−k}β)‖_q ) + ‖β‖₁,

which corresponds to (3).
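The quality of the ℓ_q surrogate for the ℓ_∞ norm is easy to check numerically. The sketch below computes ‖a‖_q in a numerically stable way by factoring out the maximum, which avoids overflow for large q:

```python
def norm_q(a, q=40):
    """l_q norm as a smooth surrogate for the l_inf norm:
    ||a||_inf <= ||a||_q, with near-equality for large q."""
    m = max(abs(x) for x in a)  # factor out the max for numerical stability
    if m == 0.0:
        return 0.0
    return m * sum((abs(x) / m) ** q for x in a) ** (1.0 / q)

a = [0.3, -2.0, 1.5]
# ||a||_inf = 2.0; for q = 40 the surrogate overestimates it only marginally
print(abs(norm_q(a) - 2.0) < 0.05)  # True
```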
Algorithm 2: GTREX.
Data: X ∈ R^{n×p}, b ∈ {1, 2, . . . }, t ≥ 0;
Result: Ê;
Initialize all frequencies F := 0_{p×p};
for k = 1 to p do
    for l = 1 to b do
        Generate sequential bootstrap sample X̃ of X;
        Compute β̂(k; X̃) according to (3);
        Update the frequencies for the edges adjacent to node k:
        for m = 1 to p do
            if m ∈ supp(β̂(k; X̃)) then F_{mk} := F_{mk} + 1/b;
        end
    end
end
Set the estimated set to Ê := {e_{ij} : F_{ij} ∨ F_{ji} ≥ t};
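In outline, the GTREX loop can be sketched as below. Here `fit_support` is a hypothetical stand-in for solving the nodewise TREX problem (3) and returning the support of the solution; the bootstrap, the frequency table, and the thresholding follow the structure of Algorithm 2:

```python
import random

def gtrex(X, fit_support, b=31, t=0.75, seed=0):
    """Sketch of the GTREX loop: refit each node's neighborhood on b
    bootstrap samples, record selection frequencies, and keep edges
    whose frequency clears the threshold t."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    F = [[0.0] * p for _ in range(p)]
    for k in range(p):
        for _ in range(b):
            Xb = [X[rng.randrange(n)] for _ in range(n)]  # bootstrap sample
            for m in fit_support(k, Xb):
                F[m][k] += 1.0 / b   # selection frequency of neighbor m for node k
    return {(i, j) for i in range(p) for j in range(i + 1, p)
            if max(F[i][j], F[j][i]) >= t}

# Toy check: a deterministic "fit" that always selects node k+1 (mod 4) as the
# only neighbor, so every ring edge reaches frequency 1 and survives t = 0.75.
X = [[0.0] * 4 for _ in range(5)]
ring = lambda k, Xb: {(k + 1) % 4}
print(gtrex(X, ring) == {(0, 1), (1, 2), (2, 3), (0, 3)})  # True
```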

Experimental Scenarios
Besides the number of parameters p, the sample size n, and the level of sparsity of the graph, the graph topology can have a considerable impact on the performance of the different methods [15]. For example, standard estimators require many samples for graphs with many hub nodes (nodes that are connected to many other nodes). Reference [15] presented a number of toy examples that confirmed these theoretical predictions. The following experimental setup was motivated by these insights. We considered six different graph topologies with varying hub structures, ranging from a single-hub case to Erdős–Rényi graphs:
1. Single-hub graph;
2. Double-hub graph;
3. Four-hub graph;
4. Four-niches graph;
5. Erdős–Rényi graph: until the number of edges s is exhausted, edges are added to E uniformly at random;
6. Scale-free graph: the set of edges is constructed with the preferential attachment algorithm [22]. The set of edges is first set to E = {e_{12}}. For each node i ∈ V \ {1, 2}, an edge e_{ij} is then iteratively added to E until the number of edges s is exhausted. The probability of selecting the edge e_{ij} is set proportional to the degree of node j ∈ V (that is, the number of edges at node j) in the current set of edges. One could also change this probability; for example, one could weight the degree more heavily, that is, put more emphasis on nodes that already have edges connected to them. In general, the more weight on the degree, the more the graphs resemble hub graphs and, therefore, the better the performance of our method relative to GLasso and MB.
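The preferential attachment step can be sketched as follows for the tree case s = p − 1 used in our simulations (a minimal version with the linear degree weighting described above; function name and seed handling are ours):

```python
import random

def preferential_attachment(p, s, seed=0):
    """Scale-free edge set: start from E = {e_12} (nodes 0 and 1 here),
    then attach each new node i to an existing node j chosen with
    probability proportional to the current degree of j."""
    rng = random.Random(seed)
    edges = [(0, 1)]
    degree = [1, 1] + [0] * (p - 2)
    for i in range(2, p):
        if len(edges) >= s:
            break
        # pick an attachment target with probability proportional to its degree
        j = rng.choices(range(i), weights=degree[:i], k=1)[0]
        edges.append((j, i))
        degree[i] += 1
        degree[j] += 1
    return edges

edges = preferential_attachment(10, 9)
print(len(edges) == 9)  # True: a tree on 10 nodes
```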
Given a graph G that consists of a set of nodes V and a set of edges E as described above, a precision matrix Σ^{-1} is generated as follows: The set of edges E determines which off-diagonal entries of the precision matrix Σ^{-1} are non-zero. The values of these entries are independently sampled uniformly at random from [−a_max, −a_min] ∪ [a_min, a_max] for some a_max > a_min > 0. The diagonal entries of Σ^{-1} are then set to a common value, which is chosen to ensure a given condition number cond := cond(Σ^{-1}) (the ratio of the largest eigenvalue to the smallest eigenvalue of Σ^{-1}).
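This construction can be sketched concretely. Since adding a common diagonal value d shifts every eigenvalue of the off-diagonal part by d, the target condition number fixes d in closed form: solving (λ_max + d)/(λ_min + d) = cond gives d = (λ_max − cond·λ_min)/(cond − 1). A minimal sketch (function name ours):

```python
import numpy as np

def make_precision(p, edges, a_min=0.2, a_max=1.0, cond=100.0, seed=0):
    """Precision matrix with support on `edges`: off-diagonal entries drawn
    uniformly from [-a_max, -a_min] U [a_min, a_max], common diagonal d
    chosen so that cond(Theta) equals `cond`."""
    rng = np.random.default_rng(seed)
    theta = np.zeros((p, p))
    for i, j in edges:
        v = rng.uniform(a_min, a_max) * rng.choice([-1.0, 1.0])
        theta[i, j] = theta[j, i] = v
    lam = np.linalg.eigvalsh(theta)       # eigenvalues of the off-diagonal part
    # shift by d so that (lam_max + d) / (lam_min + d) = cond
    d = (lam.max() - cond * lam.min()) / (cond - 1.0)
    return theta + d * np.eye(p)

theta = make_precision(5, [(0, 1), (1, 2), (2, 3), (3, 4)])
print(abs(np.linalg.cond(theta) - 100.0) < 1e-6)  # True
```

Note that d is automatically large enough to make the matrix positive-definite, since the smallest shifted eigenvalue equals (λ_max − λ_min)/(cond − 1) > 0.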

Numerical Results
We performed all numerical computations in MATLAB 2014a on a standard MacBook Pro with a 2.8 GHz dual-core Intel i7 and 16 GB 1600 MHz DDR3 memory. To compute the GLasso paths, we used the C implementation of the QUIC algorithm and the corresponding MATLAB wrapper [11]. We set the maximum number of iterations to 200, which ensured the global convergence of the algorithm in our settings. To compute the Lasso paths for the neighborhood selection schemes, we used the MATLAB internal procedure lasso.m, which follows the popular glmnet R code. We implemented a neighborhood selection wrapper mblasso.m that returns the graph traces over the entire path for the "and-rule" and the "or-rule." Both for GLasso and neighborhood selection, we used a fine grid of step size 0.01 on the unit interval for the tuning parameter λ, resulting in a path over 100 values of λ. To compute TREX, we optimized the approximate TREX objective function with q = 40 using Schmidt's PSS algorithm implemented in L1General2_PSSgb.m. We used the PSS algorithm with the standard parameter settings and set the initial solution to the parsimonious all-zeros vector β_init = (0, . . . , 0) ∈ R^p. We used the following PSS stopping criteria: minimum relative progress tolerance optTol = 1 × 10^{-7}, minimum gradient tolerance progTol = 1 × 10^{-9}, and maximum number of iterations maxIter = 0.2p. We implemented a wrapper gtrex.m that integrates the nodewise TREX solutions and returns the frequency table for each edge and the resulting graph estimate. We used b = 31 bootstrap samples in B-TREX; increasing the number of bootstraps did not result in significant changes of the GTREX solutions.
We generated the graphical models as outlined in Section 3. We set the number of nodes to p ∈ {100, 200}, the number of edges to s = p − 1, the bounds for the absolute values of the off-diagonal entries of the precision matrix to a min = 0.2 and a max = 1, and the condition number to cond = 100. We then drew n ∈ {p, 2p, 4p, 10p} samples from the resulting normal distribution and normalized each sample to have the Euclidean norm equal to √ n. We measured the performance of the estimators in terms of the Hamming distance to the true graph and in terms of the Precision/Recall. We stress that for GLasso, MB(or), and MB(and), we selected the (in practice unknown) tuning parameter λ that minimizes the Hamming distance to the true graph. For GTREX, we set the frequency threshold to t = 0.75; however, it turned out that GTREX is robust with respect to the choice of the threshold. For each graph, we report the averaged results over 20 repetitions.
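The sample normalization used above amounts to rescaling each row of X to Euclidean norm √n. A small sketch (function name ours):

```python
import math

def normalize_samples(X):
    """Rescale each sample (row) of X to Euclidean norm sqrt(n), n = len(X)."""
    n = len(X)
    return [[math.sqrt(n) * v / math.sqrt(sum(w * w for w in row)) for v in row]
            for row in X]

Xn = normalize_samples([[3.0, 4.0], [1.0, 0.0]])
# every row now has squared Euclidean norm n = 2
print(all(abs(sum(v * v for v in row) - 2.0) < 1e-9 for row in Xn))  # True
```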
The results are summarized in Figure 1 and in Tables 1–3. The results with respect to the Hamming distance in Figure 1 provide three interesting insights: First, GLasso performed poorly in the Hamming distance for all considered scenarios. We suspect that this is connected with the chosen value for the condition number of the precision matrix. Second, we observed marked differences between MB(and) and MB(or). In particular, the two methods had a similar performance in the scenarios with the four-niches and the Erdős–Rényi graphs, but a completely different performance in the scenarios with the hub graphs. Third, GTREX performed excellently for the hub graphs if the sample size was sufficiently large (n > 2p) and reasonably in all other scenarios. The results with respect to Precision/Recall in Tables 1–3 show that all methods had an excellent Precision (all values were close to 1), but differed in Recall. MB(and) provided the best overall performance, but note once more that it was calibrated with the optimal tuning parameter, which is unknown in practice. GTREX was competitive in most scenarios once the sample size n was sufficiently large, which indicates that GTREX requires a minimal number of samples to "autotune" itself to the data.

(Figure 1: Hamming distances for the six topologies; panels: Single-Hub, Double-Hub, Four-Hub, Scale-free, Erdős–Rényi, Four-niches.)

We also note that neither GTREX nor its competitors worked reasonably well when p ≫ n in our simulation framework. In general, we recommend using graphical modeling with care when p is larger than n.
TREX does not contain a tuning parameter, but one can argue that the frequency threshold t could be adapted to the model or the data and, therefore, plays the role of a tuning parameter in GTREX. However, the above results demonstrate that the universal value t = 0.75 works for a large variety of scenarios. Moreover, GTREX is robust with respect to the choice of t. This is illustrated in Figure 2, where for two scenarios, we report the Hamming distances of GTREX to the true graphs as a function of t. We observed that the Hamming distances were similar over wide ranges of t. In the same figure, we also report the Hamming distances of the standard methods to the true graphs as a function of the tuning parameter λ. We see that these paths have narrow peaks, which suggests that the tuning parameters of GLasso and of neighborhood selection with Lasso need to be carefully calibrated.
Note that the results especially of Figure 1, including the tuning of the standard methods, were based on the Hamming distance as a measure of accuracy. We chose the Hamming distance, because it is a very general and widely accepted measure and, most importantly, readily comparable across different setups. However, different measures of accuracy could give different results, for example in terms of how the false estimates split into false positives and false negatives.

(Figure 2: Hamming distances as a function of the threshold t for GTREX and of the tuning parameter λ for the standard methods; panels: Single-Hub, Double-Hub, Four-Hub, Scale-free, Erdős–Rényi, Four-niches; p = 100, n = 500 and p = 200, n = 1000.)

Note finally that there has been recent progress in the computational aspects of TREX. Reference [23] introduced a new algorithm that is guaranteed to converge to a global optimum (even though the problem is non-convex) and requires solving at most p reasonably simple subproblems. Reference [24] developed this idea further in the context of prospective functions. In any case, the computational costs of GTREX are |V| × b × (costs of solving Equation (3)). With our current non-convex implementation, estimating a graph in our simulations took about one hour on a single CPU (of course, the algorithm can be parallelized very easily).

Conclusions
We introduced a new method for graph estimation in high-dimensional graphical models. Unlike standard methods, our estimator avoids the tuning parameter that usually accompanies the regularizer. Beyond establishing that tuning-free graph estimation is possible at all, we made two main contributions: First, since the method rivals standard methods even if they are calibrated with an optimal, in practice unknown tuning parameter, our paper can directly lead to more accurate estimation in practice.
Second, deriving statistical theory for tuning parameter calibration has turned out to be very difficult in all parts of high-dimensional statistics. Recent developments focused mainly on linear and logistic regression [25,26]. For graphical modeling, standard estimators are currently not equipped with a rigorously justified tuning scheme. We have not established a statistical theory for GTREX yet, but it has been shown recently that the TREX idea can provide new ways to establish such a theory [27]. Hence, we hope that our approach can indeed be complemented with comprehensive statistical theories in future research, thereby furthering the mathematical understanding of graphical modeling in general.