1. Background
Inferring the structural change of a network under different conditions is essential in many problems arising in biology, medicine, and other scientific fields. For instance, in genomics, it is often of importance to study the structural change of a genetic pathway between diseased and normal groups. In the field of brain mapping, it is critical to identify the difference in brain connectivity between groups (for example, the brain connectivity network of normal subjects and patients often possess different structures). Most of these applications have relied on the prevailing Gaussian graphical models (GGMs) because of its good interpretability and computational convenience, and there is a rich and growing literature on learning differential networks under GGMs. To name a few, Guo et al. (2015) [
1] introduced a joint estimation for multiple GGMs by a group lasso approach, under the assumption that the GGMs being studied are sparse and only differ in a small portion of edges. Danaher et al. (2014) [
2] proposed a fused graphical lasso method which is free from the sparsity assumption on condition-specific networks and only requires the sparsity of the differential network. Zhao et al. (2014) [
3] constructed a new estimator which directly estimates the differential network defined as
, where
and
represent the two condition-specific precision matrices and
,
,
have the same dimension. Liu (2017) [
4] presented a new test to simultaneously study structural similarities and differences between multiple high-dimensional GGMs, which adopts the partial correlation coefficients to characterize the potential changes of dependency strength between two variables.
Most of the aforementioned algorithms were based upon penalized likelihood maximization. Although some algorithms were consistent under certain regularity conditions, they failed to control the false discovery rate (FDR) of the substructure detection as it is difficult to choose a tuning parameter to control the FDR at the desired level [
1,
2,
3]. One exception is Liu (2017), who introduced a hierarchical testing framework to adjust for the multiplicity. Liu’s test was constructed to asymptotically control the FDR while keeping satisfactory statistical power. Simulation studies in [
4] have shown that this new test exhibits substantial power gains over existing methods such as graphical lasso. One major drawback that limits the application of Liu’s test is the Gaussian assumption, which is often violated in practice especially in genomics. For instance, some digital measurements of gene expression level such as RNA-Seq data often greatly deviate from normality even after log-transformation or other variance-stabilizing transformations. In this paper, we aim to extend Liu’s work to a more flexible semiparametric framework, namely the nonparanormal graphical models (NPNGMs), where the random variables are assumed to follow a multivariate normal distribution after a set of monotonically increasing transformations. We use a novel rank-based multiple testing method to detect the structural difference between multiple networks from non-Gaussian data. The method is computationally efficient and asymptotically controls the FDR at a desired level. To begin with, we give the formal definition of nonparanormal distribution:
Definition 1. A random vector follows a nonparanormal distribution if there exists a set of univariate and monotonically increasing transformations, , such that:where μ and Σ
denote the mean and covariance matrix in the multivariate normal distribution, respectively. The distribution of depends on three parameters and it can be generally written as By Definition 1 and Sklar’s theorem, it is easy to verify that when the transformation functions
are all differentiable, the nonparanormal distribution
is equivalent to a Gaussian copula [
5]. As graphical models, the NPNGMs are much more flexible than GGMs in modeling non- Gaussian data while retaining the interpretability of the latter. Some recent studies have established the estimation and properties of high dimensional nonparanormal graphical models. For example, Liu et al. (2009) [
5], who first studied high-dimensional NPNGMs, bridged the estimations of GGMs and NPNGMs by a nonparametric and truncated (Winsorized) estimator of the unknown transformation functions. Xue and Zou (2012) [
6] proposed to use an adjusted Spearman’s correlation to estimate the structure of high-dimensional NPNGMs, and they showed that the rank-based estimator achieves the same rate of convergence as its oracle counterpart (i.e., assuming known transformation functions). Despite the advances in single NPNGM estimation, to the best of our knowledge, the inference of differential substructure between multiple NPNGMs has not been studied. In this paper, we tackled this problem by embedding the Winsorized estimator into the testing framework of Liu (2017). Under some regularity conditions, we showed that the new test statistic converges to the same distribution as its oracle counterpart does [
4].
We begin with the notations and problem formulation. For a vector , we define its norm as , its norm as , its norm as , and its norm as . For a matrix , we define its norm as , its norm as , its Frobenius norm as and its norm as . Let denote the ith row of with its jth entry being removed and denote the jth column with its ith entry being removed. We use to denote a matrix by removing the ith row and the jth column. For square matrix , we let and denote the largest and smallest eigenvalues of respectively. In addition, for a given sequence of random variable and a constant sequence , denotes that converges to zero in probability as n approaches to infinity and denotes that is stochastically bounded. If there are positive constants c and C such that for all , we write .
To formulate the problem, we let
be the index of class,
p be the dimension, and
be a sample of size
for class
k where
,
. Under
, we test the following hypothesis:
where
,
, and
represents the partial correlation coefficient between
and
given
,
. The edge
is a differential edge if
for some
, and the differential network is defined as the set of all differential edges. As a well-known result in statistics,
. Here, we consider an equivalent alternative of the hypothesis testing above. Similar as in [
4], let
then the hypothesis testing can be simplified as
As , we define and , and the total numbers of tests are , i.e., .
The rest of this paper is structured as follows: In
Section 2, we introduce the new test statistic and multiple testing procedure. In
Section 3 we perform a simulation study to evaluate the finite sample performance of the proposed test in terms of FDR control and statistical power. We then apply the new method to a rich genomic data to study the genetic difference between four breast cancer subtypes. We discuss the strength and shortcomings of the test in
Section 5. Technical proof of the asymptotic results is provided in
Appendix A.
3. Numerical Study
We performed a simulation study to evaluate the finite sample performance of the proposed procedure. In particular, we evaluated the empirical false discovery rate (eFDR) as well as the statistical power under two classes, i.e., . The dimension and sample size were set to be and . We consider two commonly used graph-generating models including the band graph and Erdos–Rényi (ER) graph, and two estimators for regression coefficients including lasso estimator and Dantzig selector. Detailed set-up for precision matrices and are given below:
Band graph: was obtained by the following assignments
We then randomly picked 50 edges in as the differential edges and changed their signs in . To ensure positive definiteness, we added , to the diagonal of and .
Erdos–Rényi (ER) graph: Each node pair () were randomly connected with probability 5%. A correlation coefficient is generated for each edge in the network from a two-part uniform distribution . To ensure positive-definiteness, we shrunk the correlations by a factor of 5 and the diagonals were set to be one for . We then randomly selected 5% of the edges as the differential edges, and changed their signs in .
For each graph, we generated the latent Gaussian data (oracle data) from
,
, and a Winsorized estimator with truncation parameter
was used to implement our test. The performance of the proposed method was then evaluated in two aspects: false discovery rate control and statistical power. In particular, we compared the results based on oracle data and imputed data by the Winsorized estimator. Two estimators including the lasso estimator and Dantzig selector were used to estimate coefficients
. For oracle data, we applied the R package
flare to calculate the solution path over a sequence of 20 candidate
’s and tune by Akaike information criterion (AIC). For imputed data, we adopted the rank-based methods introduced by [
6], i.e., the rank-based lasso and rank-based Dantzig selector. The simulation was repeated for 100 times for each FDR level (
) and the average empirical FDR and statistical power were summarized.
Figure 1 and
Figure 2 compared the empirical false discovery rate (eFDR) with the desired level
under the band graph and ER graph. It can be seen that the empirical FDR based on imputed data is close to the one by oracle data, both close to the desired level of
, suggesting that the FDRs were controlled quite well for both cases. The lasso estimator works almost equally well as Dantzig selector in both settings.
Figure 3 and
Figure 4 summarized the statistical power of the test for the band graph and ER graph. As can be seen, the power for ER graph is substantially lower than the band graph, indicating that the complexity and denseness of the underlying differential network may significantly decrease the power of our test. The test based on oracle data performs slightly better than the imputed data, which is due to the loss of information during Winsorized imputation. Similar as we observed from
Figure 1 and
Figure 2, the lasso estimator works almost equally well as Dantzig selector.
In addition, we compared the proposed test with a direct estimator, recently developed by Zhang (2019) [
8]. The direct estimator is a rank-based estimator and can be solved by a parametric simplex algorithm. We simulated the data from the Erdos–Rényi (ER) graph with different sample sizes (
) and numbers of dimensions (
). As the direct estimator does not control the false discovery rate, we set the FDR level at 0.05 for our proposed test.
Figure 5 summarized the empirical FDR and statistical power under different sample sizes (with dimension fixed at 100) and different dimensions (with sample size fixed at 100). It can be seen that the two methods have comparable performance and our proposed test achieves lower FDR but slightly lower statistical power. However, it is noteworthy that the direct estimator is computationally expensive and becomes impractical when the dimensions exceed 150.
Table 1 summarized the running time of the two methods, where it can be seen that our test is much faster than the direct estimator, especially for relatively high dimensions. For instance, when
, the direct estimator takes hours while our test takes less than 10 seconds. As the core part of the proposed algorithm is the estimation of regression coefficients, the time complexity is the same as the linear regression. For instance, with LASSO and
, the time complexity is
, while the direct estimator by Zhang (2019) has a time complexity
.
4. A Genomic Application
In this part, we applied the proposed test to the Cancer Genome Atlas data (TCGA, [
9]) to study the different roles of the cell cycle pathway in the two subtypes of breast cancer including luminal A subtype and basal-like subtype. The cell cycle pathway is known to play a critical role in the initiation and progression of many human cancers including breast cancer and ovarian cancer [
10,
11]. For instance, the cell cycle pathway provided by KEGG (Kyoto Encyclopedia of Genes and Genomes, [
12]) contains 128 important genes that co-regulate cell proliferation, including
ATM,
RB1,
CCNE1, and
MYC. Abnormal regulation among these genes may cause the over-proliferation of cells and an accumulation of tumor cell numbers [
11].
The transcriptome profiling data for breast cancer were downloaded through the Genomic Data Commons portal [
13] in January 2017. The expression level of each gene was quantified by the count of reads mapped to the gene. The quantifications were done by software
HTSeq of version 0.9.1 [
14]. In our analysis, we excluded 43 subjects including 12 male subjects and 31 subjects with >1% missing values. In addition, we removed the effects due to different age groups and batches using a median- matching and variance-matching strategy [
10,
15,
16]. For example, the batch effect can be removed in the following way:
where
refers to the expression value for gene
i from sample
k in batch
j (
),
represents the median of
,
refers to the median of
,
and
stand for the standard deviations of
and
, respectively.
The remaining 959 breast cancer samples were further classified into five subtypes according to two molecular signatures, namely
PAM50 [
17] and
SCMOD2 [
18]. The two classifications were implemented separately using R package
genefu [
19] and we obtained 530 subjects with concordant classification by two classifiers. The resulting set contains 221 subjects in the luminal A group, 119 in the luminal B group, 74 in the her2-enriched group, 105 in the basal-like group, and 11 in the normal-like group. For illustration purposes, we conducted two pairwise comparisons (1) Luminal A vs basal-like and (2) Luminal B vs basal-like.
To balance the bias and variance, we choose the same truncation parameter in Winsorized imputation as in our simulation study
where
,
,
. The proposed test based on the Winsorized estimator was then conducted for each gene pair with different FDR cutoffs.
Figure 6 and
Figure 7 summarized all the identified differential edges under FDR levels
, with all isolated genes being removed. Our results suggested a list of important genes that play different roles in different breast cancer subtypes. For instance, in
Figure 6, genes
CCNB1 and
PRKDC contribute to several differential edges. According to recent studies, gene
CCNB1 is a prognostic biomarker for certain subtypes of breast cancer and it is closely associated with hormone therapy resistance [
20]. It has also been reported in the literature that the
PRKDC regulates chemosensitivity and is a potential prognostic and predictive marker of response to adjuvant chemotherapy in breast cancer patients [
21]. Our findings about several other genes including
CHEK2 and
CDC7 also confirmed some existing reports [
22,
23]. As we observed from the two examples, as the desired FDR level increases, the resulting differential network tends to be denser and denser (
Figure 8 showed the correlation between FDR and the number of differential edges). In practice, users should consider the trade-off between the accuracy (FDR) and number of new hypotheses (number of differential edges) and choose an appropriate FDR [
24].
5. Discussion
Detecting the differential substructure on multiple graphical models is a fundamental and challenging problem in statistics. Liu (2017) studied the problem under the Gaussian framework and introduced an elegant hierarchical test based on the estimation of single GGM. Unlike most existing methods, Liu’s approach asymptotically controlled the false discovery rate at a nominal level, which guarantees the quality of the estimated differential network. In this work, we further extended Liu’s test to a more flexible semiparametric framework, namely the nonparanormal graphical models. Our test is built upon a Winsorized estimator of the unknown transformation functions and it enjoys similar asymptotic properties as its oracle counterpart does.
Although the new test holds great promise in many applications such as genetic network modeling, it has some practical limitations. First, as we see from the theoretical derivation, the good performance of the test relied on the sparsity assumption on the differential network. Although the sparsity assumption is reasonable in many cases, it still could be violated in some applications. For instance, some genetic pathways may exhibit a global change of gene–gene regulations between different phenotypes. When the differential network is dense or locally dense, the method may fail to control the FDR. To solve the problem, a new test needs to be defined to evaluate the level of the sparseness of the change between two conditions. However, there is still a gap on the literature of this topic.
Second, one key assumption in NPNGMs is that the transformed variables follow a joint Gaussian distribution. This assumption also needs to be checked in real-world applications. Under low dimensions, one can employ some popular normality tests, including the Anderson–Darling test and Shapiro–Wilk test, on the imputed data or other normal scores. However, most of these tests fail to detect non-normality for high-dimension data. The normality test under high dimension is still an open and challenging problem and we left it for future research.
It is also noteworthy to mention that the new test relied on an accurate estimator for the coefficients
. Motivated by [
6], we chose two popular estimators including lasso estimator and Dantzig selector based on the adjusted Spearman’s rank, which satisfies Condition (7). In fact, some other estimators also satisfy the conditions, for instance, the rank-based adaptive lasso [
6,
25] and square-root lasso estimator [
6,
26]. These estimators can also be incorporated into our testing framework.