Article

Testing Differential Gene Networks under Nonparanormal Graphical Models with False Discovery Rate Control

Department of Mathematical Sciences, University of Arkansas, Arkansas, AR 72701, USA
Genes 2020, 11(2), 167; https://doi.org/10.3390/genes11020167
Submission received: 20 January 2020 / Revised: 27 January 2020 / Accepted: 30 January 2020 / Published: 5 February 2020
(This article belongs to the Special Issue Statistical Methods for the Analysis of Genomic Data)

Abstract

The nonparanormal graphical model has emerged as an important tool for modeling dependency structure between variables because it is flexible to non-Gaussian data while maintaining the good interpretability and computational convenience of Gaussian graphical models. In this paper, we consider the problem of detecting differential substructure between two nonparanormal graphical models with false discovery rate control. We construct a new statistic based on a truncated estimator of the unknown transformation functions, together with a bias-corrected sample covariance. Furthermore, we show that the new test statistic converges to the same distribution as its oracle counterpart does. Both synthetic data and real cancer genomic data are used to illustrate the promise of the new method. Our proposed testing framework is simple and scalable, facilitating its applications to large-scale data. The computational pipeline has been implemented in the R package DNetFinder, which is freely available through the Comprehensive R Archive Network.

1. Background

Inferring the structural change of a network under different conditions is essential in many problems arising in biology, medicine, and other scientific fields. For instance, in genomics, it is often important to study the structural change of a genetic pathway between diseased and normal groups. In the field of brain mapping, it is critical to identify the difference in brain connectivity between groups (for example, the brain connectivity networks of normal subjects and patients often possess different structures). Most of these applications have relied on the prevailing Gaussian graphical models (GGMs) because of their good interpretability and computational convenience, and there is a rich and growing literature on learning differential networks under GGMs. To name a few, Guo et al. (2015) [1] introduced a joint estimation of multiple GGMs by a group lasso approach, under the assumption that the GGMs being studied are sparse and differ only in a small portion of edges. Danaher et al. (2014) [2] proposed a fused graphical lasso method, which is free from the sparsity assumption on condition-specific networks and only requires the sparsity of the differential network. Zhao et al. (2014) [3] constructed a new estimator that directly estimates the differential network defined as $\Delta = \Sigma_X^{-1} - \Sigma_Y^{-1}$, where $\Sigma_X^{-1}$ and $\Sigma_Y^{-1}$ represent the two condition-specific precision matrices and $\Delta$, $\Sigma_X^{-1}$, $\Sigma_Y^{-1}$ have the same dimension. Liu (2017) [4] presented a new test to simultaneously study structural similarities and differences between multiple high-dimensional GGMs, which adopts the partial correlation coefficients to characterize the potential changes of dependency strength between two variables.
Most of the aforementioned algorithms are based upon penalized likelihood maximization. Although some of them are consistent under certain regularity conditions, they do not control the false discovery rate (FDR) of the substructure detection, as it is difficult to choose a tuning parameter that controls the FDR at the desired level [1,2,3]. One exception is Liu (2017) [4], who introduced a hierarchical testing framework to adjust for the multiplicity. Liu’s test was constructed to asymptotically control the FDR while keeping satisfactory statistical power. Simulation studies in [4] have shown that this test exhibits substantial power gains over existing methods such as the graphical lasso. One major drawback that limits the application of Liu’s test is the Gaussian assumption, which is often violated in practice, especially in genomics. For instance, digital measurements of gene expression such as RNA-Seq data often deviate greatly from normality even after log-transformation or other variance-stabilizing transformations. In this paper, we aim to extend Liu’s work to a more flexible semiparametric framework, namely the nonparanormal graphical models (NPNGMs), where the random variables are assumed to follow a multivariate normal distribution after a set of monotonically increasing transformations. We use a novel rank-based multiple testing method to detect the structural difference between multiple networks from non-Gaussian data. The method is computationally efficient and asymptotically controls the FDR at a desired level. To begin with, we give the formal definition of the nonparanormal distribution:
Definition 1.
A random vector $Y = (Y_1, Y_2, \ldots, Y_p)$ follows a nonparanormal distribution if there exists a set of univariate and monotonically increasing transformations, $f = (f_1, \ldots, f_p)$, such that:
$$(X_1, \ldots, X_p) \equiv (f_1(Y_1), \ldots, f_p(Y_p)) \sim N(\mu, \Sigma),$$
where $\mu$ and $\Sigma$ denote the mean and covariance matrix of the multivariate normal distribution, respectively. The distribution of $Y$ depends on three parameters and it can be generally written as $Y \sim NPN(\mu, \Sigma, f)$.
By Definition 1 and Sklar’s theorem, it is easy to verify that when the transformation functions $f_j$ are all differentiable, the nonparanormal distribution $NPN(\mu, \Sigma, f)$ is equivalent to a Gaussian copula [5]. As graphical models, the NPNGMs are much more flexible than GGMs in modeling non-Gaussian data while retaining the interpretability of the latter. Some recent studies have established the estimation and properties of high-dimensional nonparanormal graphical models. For example, Liu et al. (2009) [5], who first studied high-dimensional NPNGMs, bridged the estimation of GGMs and NPNGMs by a nonparametric and truncated (Winsorized) estimator of the unknown transformation functions. Xue and Zou (2012) [6] proposed to use an adjusted Spearman’s correlation to estimate the structure of high-dimensional NPNGMs, and they showed that the rank-based estimator achieves the same rate of convergence as its oracle counterpart (i.e., assuming known transformation functions). Despite the advances in single NPNGM estimation, to the best of our knowledge, the inference of differential substructure between multiple NPNGMs has not been studied. In this paper, we tackle this problem by embedding the Winsorized estimator into the testing framework of Liu (2017) [4]. Under some regularity conditions, we show that the new test statistic converges to the same distribution as its oracle counterpart does.
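To make Definition 1 concrete, the following minimal sketch draws bivariate nonparanormal data using only the Python standard library. The specific transformations ($f_1 = \log$, $f_2$ = cube root, so that $Y_1 = e^{X_1}$ and $Y_2 = X_2^3$), the function names, and the correlation value are illustrative assumptions, not part of the paper.

```python
import math
import random


def sample_bivariate_npn(n, rho, seed=0):
    """Draw n pairs (latent X, observed Y) from a toy nonparanormal model.

    Assumed for illustration: X is standard bivariate normal with
    correlation rho, and Y1 = exp(X1), Y2 = X2**3, i.e. f1 = log and
    f2 = cube root are the monotone transformations of Definition 1.
    """
    rng = random.Random(seed)
    latent, observed = [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        x1 = z1
        x2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * z2  # corr(X1, X2) = rho
        latent.append((x1, x2))
        observed.append((math.exp(x1), x2 ** 3))
    return latent, observed


def ranks(values):
    """Return the rank (1 = smallest) of each entry of a list without ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result
```

Because each transformation is strictly increasing, the ranks of every observed coordinate coincide with the ranks of the corresponding latent Gaussian coordinate, which is the property that rank-based estimators exploit.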
We begin with the notations and problem formulation. For a vector $a = (a_1, \ldots, a_p)$, we define its $\ell_0$ norm as $\|a\|_0 = \sum_{i=1}^{p} I\{a_i \neq 0\}$, its $\ell_1$ norm as $\|a\|_1 = \sum_{i=1}^{p} |a_i|$, its $\ell_2$ norm as $\|a\|_2 = \sqrt{\sum_{i=1}^{p} a_i^2}$, and its $\ell_\infty$ norm as $\|a\|_\infty = \max_i |a_i|$. For a matrix $A = (a_{ij}) \in \mathbb{R}^{p \times q}$, we define its $\ell_0$ norm as $\|A\|_0 = \sum_{i,j} I\{a_{ij} \neq 0\}$, its $\ell_1$ norm as $\|A\|_1 = \sum_{i,j} |a_{ij}|$, its Frobenius norm as $\|A\|_F = \sqrt{\sum_{i,j} a_{ij}^2}$, and its $\ell_\infty$ norm as $\|A\|_\infty = \max_{i,j} |a_{ij}|$. Let $A_{i,-j}$ denote the $i$th row of $A$ with its $j$th entry removed and $A_{-i,j}$ denote the $j$th column with its $i$th entry removed. We use $A_{-i,-j}$ to denote the $(p-1) \times (q-1)$ matrix obtained by removing the $i$th row and the $j$th column. For a square matrix $B$, we let $\lambda_{\max}(B)$ and $\lambda_{\min}(B)$ denote the largest and smallest eigenvalues of $B$, respectively. In addition, for a given sequence of random variables $\{X_n, n = 1, 2, \ldots\}$ and a constant sequence $\{a_n, n = 1, 2, \ldots\}$, $X_n = o_p(a_n)$ denotes that $X_n / a_n$ converges to zero in probability as $n$ approaches infinity, and $X_n = O_p(a_n)$ denotes that $X_n / a_n$ is stochastically bounded. If there are positive constants $c$ and $C$ such that $c \le X_n / a_n \le C$ for all $n \ge 1$, we write $X_n \asymp a_n$.
To formulate the problem, we let $k \in \{1, 2, \ldots, K\}$ be the index of class, $p$ be the dimension, and $(Y_1^{(k)}, \ldots, Y_{n_k}^{(k)})$ be a sample of size $n_k$ for class $k$, where $Y_m^{(k)} = (Y_{m1}^{(k)}, \ldots, Y_{mp}^{(k)})^T \in \mathbb{R}^p$, $m \in \{1, \ldots, n_k\}$. Under $Y_m^{(k)} \sim NPN(\mu^{(k)}, \Sigma^{(k)}, f^{(k)})$, we test the following hypotheses:
$$H_{0ij}: \rho_{ij\cdot}^{(1)} = \rho_{ij\cdot}^{(2)} = \cdots = \rho_{ij\cdot}^{(K)}, \qquad H_{aij}: \rho_{ij\cdot}^{(k)} \neq \rho_{ij\cdot}^{(k')} \ \text{for some } k, k' \in \{1, \ldots, K\},$$
where $1 \le i, j \le p$, $\{\Sigma^{(k)}\}^{-1} = \Omega^{(k)} = (\omega_{ij}^{(k)})$, and $\rho_{ij\cdot}^{(k)}$ represents the partial correlation coefficient between $X_i^{(k)}$ and $X_j^{(k)}$ given $X^{(k)} \setminus \{X_i^{(k)}, X_j^{(k)}\}$, with $(X_{m1}^{(k)}, \ldots, X_{mp}^{(k)}) = (f_1^{(k)}(Y_{m1}^{(k)}), \ldots, f_p^{(k)}(Y_{mp}^{(k)}))$. The edge $(i, j)$ is a differential edge if $\rho_{ij\cdot}^{(k)} \neq \rho_{ij\cdot}^{(k')}$ for some $k, k' \in \{1, \ldots, K\}$, and the differential network is defined as the set of all differential edges. As a well-known result in statistics, $\rho_{ij\cdot}^{(k)} = -\omega_{ij}^{(k)} / \sqrt{\omega_{ii}^{(k)} \omega_{jj}^{(k)}}$. Here, we consider an equivalent formulation of the hypothesis testing problem above. Similarly to [4], let
$$S_{ij}(\Omega) = \sum_{1 \le k < k' \le K} \left(\rho_{ij\cdot}^{(k)} - \rho_{ij\cdot}^{(k')}\right)^2,$$
then the hypothesis testing problem can be simplified as
$$H_{0ij}: S_{ij}(\Omega) = 0, \qquad H_{aij}: S_{ij}(\Omega) > 0.$$
As $S_{ij}(\Omega) = S_{ji}(\Omega)$, we define $H_0 = \{H_{0ij}, 1 \le i < j \le p\}$ and $H_a = \{H_{aij}, 1 \le i < j \le p\}$, and the total number of tests is $p(p-1)/2$, i.e., $\mathrm{card}(H_0) = \mathrm{card}(H_a) = p(p-1)/2$.
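The relationship between precision matrices, partial correlations, and the differential score $S_{ij}(\Omega)$ can be sketched as follows. The function names and the small 3 × 3 example are hypothetical and serve only to illustrate the definitions; the sign convention $\rho_{ij\cdot} = -\omega_{ij} / \sqrt{\omega_{ii}\omega_{jj}}$ is the standard one.

```python
import math


def partial_correlations(omega):
    """Partial correlation matrix implied by a precision matrix omega.

    Uses rho_ij = -omega_ij / sqrt(omega_ii * omega_jj) for i != j and
    sets the diagonal to 1 by convention.
    """
    p = len(omega)
    rho = [[1.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(p):
            if i != j:
                rho[i][j] = -omega[i][j] / math.sqrt(omega[i][i] * omega[j][j])
    return rho


def differential_score(omegas, i, j):
    """Sum of squared pairwise differences of rho_ij across all classes."""
    rhos = [partial_correlations(om)[i][j] for om in omegas]
    return sum(
        (rhos[a] - rhos[b]) ** 2
        for a in range(len(rhos))
        for b in range(a + 1, len(rhos))
    )
```

For two band-type precision matrices that differ only in the sign of entry $(0, 1)$, the score is positive for that edge and zero for every other pair, so exactly one differential edge is present.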
The rest of this paper is structured as follows: In Section 2, we introduce the new test statistic and multiple testing procedure. In Section 3, we perform a simulation study to evaluate the finite sample performance of the proposed test in terms of FDR control and statistical power. In Section 4, we apply the new method to a rich genomic data set to study the genetic difference between breast cancer subtypes. We discuss the strengths and shortcomings of the test in Section 5. The technical proof of the asymptotic results is provided in Appendix A.

2. Statistical Methods

2.1. Winsorized Estimator of the Latent Gaussian Variables

In practice, the transformation functions $f^{(k)} = (f_1^{(k)}, \ldots, f_p^{(k)})$ in the nonparanormal distribution are unknown. However, one can use a Winsorized estimator to approximate $f^{(k)}$, i.e., to impute the latent Gaussian variables (oracle data) $(X_{m1}^{(k)}, \ldots, X_{mp}^{(k)})_{1 \le m \le n_k}$. To illustrate the Winsorized estimator, we define the following quantile function:
$$\hat h_j^{(k)}(t) = \Phi^{-1}\left(\tilde F_j^{(k)}(t)\right), \quad 1 \le j \le p,$$
where $\tilde F_j^{(k)}$ is some estimator of the cumulative distribution function of $Y_j^{(k)}$. A natural choice for $\tilde F_j^{(k)}$ would be the empirical cumulative distribution function (eCDF)
$$\hat F_j^{(k)}(t) = \frac{1}{n_k} \sum_{m=1}^{n_k} I\{Y_{mj}^{(k)} \le t\}.$$
One major drawback of the eCDF above is that under high dimensionality, the variance of $\hat F_j^{(k)}(t)$ could be too large. To overcome the problem, Liu et al. (2009) [5] considered a truncated (Winsorized) estimator as follows:
$$\tilde F_j^{(k)}(t) = \begin{cases} \delta_n, & \hat F_j^{(k)}(t) < \delta_n, \\ \hat F_j^{(k)}(t), & \delta_n \le \hat F_j^{(k)}(t) \le 1 - \delta_n, \\ 1 - \delta_n, & \hat F_j^{(k)}(t) > 1 - \delta_n, \end{cases}$$
where $\delta_n$ serves as the truncation parameter that should be carefully chosen. Liu et al. (2009) [5] suggested $\delta_n = 1 / (4 n^{1/4} \sqrt{\pi \log n})$ to balance the bias and variance of the eCDF, and so we will use this value in our calculations. To estimate the transformation functions and impute the latent Gaussian variable $X$, we define
$$X_{mj}^{(k)*} = \tilde f_j^{(k)}(Y_{mj}^{(k)}) = \hat\mu_j^{(k)} + \hat\sigma_j^{(k)} \tilde h_j^{(k)}(Y_{mj}^{(k)}),$$
where $\tilde h_j^{(k)}(t)$, $\hat\mu_j^{(k)}$ and $\hat\sigma_j^{(k)}$ are given below:
$$\tilde h_j^{(k)}(t) = \Phi^{-1}\left(\tilde F_j^{(k)}(t)\right),$$
$$\hat\mu_j^{(k)} = \frac{1}{n_k} \sum_{m=1}^{n_k} Y_{mj}^{(k)},$$
$$\hat\sigma_j^{(k)} = \sqrt{\frac{1}{n_k} \sum_{m=1}^{n_k} \left(Y_{mj}^{(k)} - \hat\mu_j^{(k)}\right)^2}.$$
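The Winsorized imputation of a single variable can be sketched in a few lines of standard-library Python. This is an illustrative sketch rather than the DNetFinder implementation: the function names are hypothetical, the logarithm in $\delta_n$ is assumed to be the natural logarithm, and the sample is assumed to have at least two observations.

```python
import bisect
import math
from statistics import NormalDist


def truncation_parameter(n):
    """delta_n = 1 / (4 * n^(1/4) * sqrt(pi * log n)); natural log assumed, n >= 2."""
    return 1.0 / (4.0 * n ** 0.25 * math.sqrt(math.pi * math.log(n)))


def winsorized_impute(y):
    """Impute latent Gaussian values for one observed variable y (a list).

    Steps: empirical CDF -> truncate to [delta_n, 1 - delta_n] ->
    standard normal quantile -> rescale by the sample mean and the
    (1/n) standard deviation of y.
    """
    n = len(y)
    delta = truncation_parameter(n)
    mu = sum(y) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in y) / n)
    sorted_y = sorted(y)
    std_normal = NormalDist()
    imputed = []
    for v in y:
        ecdf = bisect.bisect_right(sorted_y, v) / n       # fraction of Y <= v
        f_tilde = min(max(ecdf, delta), 1.0 - delta)      # Winsorization
        imputed.append(mu + sigma * std_normal.inv_cdf(f_tilde))
    return imputed
```

The truncation keeps the largest observation away from an eCDF value of exactly 1, where the normal quantile would be infinite; the imputed values preserve the ordering of the observed ones.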
The Winsorized estimator $X_{mj}^{(k)*}$ generally works well in approximating the unknown $X_{mj}^{(k)}$, and it can be used to estimate the oracle sample covariance. Let $\hat\Sigma^{(k)}$ be the sample covariance matrix of the oracle data, and $\tilde\Sigma^{(k)}$ be the sample covariance matrix of $(X_1^{(k)*}, \ldots, X_p^{(k)*})$, that is
$$\tilde\Sigma^{(k)} = \frac{1}{n_k} \sum_{m=1}^{n_k} \left(X_m^{(k)*} - \tilde\mu^{(k)}\right)\left(X_m^{(k)*} - \tilde\mu^{(k)}\right)^T,$$
where $\tilde\mu^{(k)} = (1/n_k) \sum_{m=1}^{n_k} X_m^{(k)*}$. Liu et al. (2009) [5] established the following consistency result under mild regularity conditions:
$$\left\|\tilde\Sigma^{(k)} - \hat\Sigma^{(k)}\right\|_\infty = O_p\left(\sqrt{\frac{\log p \, \log^2 n_k}{n_k^{1/2}}}\right).$$
When estimating the precision matrix $\Omega^{(k)}$, one can consider a modified graphical lasso based on imputed data, i.e.,
$$\tilde\Omega_{glasso}^{(k)} = \arg\min_{\Omega} \left\{ \mathrm{tr}(\Omega \tilde\Sigma^{(k)}) - \log|\Omega| + \lambda \|\Omega\|_1 \right\}.$$
Liu et al. (2009) [5] showed the following convergence, which elucidates the asymptotic equivalence between the oracle data and imputed data in the structural estimation of NPNGMs:
$$\left\|\tilde\Omega_{glasso}^{(k)} - \Omega^{(k)}\right\|_F = O_p\left(\sqrt{\frac{(\|\Omega^{(k)}\|_0 + p) \log p \, \log^2 n_k}{n_k^{1/2}}}\right).$$

2.2. Asymptotic Results for a Single Class

To extend Liu’s test to the nonparanormal case, we first consider the problem of single GGM estimation based on oracle data, i.e., $(X_{m1}^{(k)}, \ldots, X_{mp}^{(k)})_{1 \le m \le n_k} \sim N(\mu_k, \Sigma_k)$, in the following regression framework:
$$X_{mj}^{(k)} = \alpha_j^{(k)} + X_{m,-j}^{(k)} \beta_j^{(k)} + \epsilon_{mj}^{(k)}. \tag{3}$$
It is not hard to show that the regression coefficients $\beta_j^{(k)} = (\beta_{j,1}^{(k)}, \ldots, \beta_{j,j-1}^{(k)}, \beta_{j,j+1}^{(k)}, \ldots, \beta_{j,p}^{(k)})$ and the error term $\epsilon_{mj}^{(k)}$ satisfy
$$\beta_j^{(k)} = -\{\omega_{jj}^{(k)}\}^{-1} \Omega_{-j,j}^{(k)}, \qquad \mathrm{cov}(\epsilon_{mi}^{(k)}, \epsilon_{mj}^{(k)}) = \frac{\omega_{ij}^{(k)}}{\omega_{ii}^{(k)} \omega_{jj}^{(k)}}.$$
As the oracle data $(X_{m1}^{(k)}, \ldots, X_{mp}^{(k)})_{1 \le m \le n_k}$ in Equation (3) are generally unknown, we consider a new regression model based on Winsorized imputations:
$$X_{mj}^{(k)*} = \hat\alpha_j^{(k)} + X_{m,-j}^{(k)*} \hat\beta_j^{(k)} + \epsilon_{mj}^{(k)*}.$$
In solving the problem of single GGM estimation, Liu (2017) [4] proposed an elegant test based on a bias-corrected sample covariance. This has motivated us to construct the following new statistic:
$$S_{ij}^{(k)*} = \frac{1}{n_k \sqrt{r_{ii}^{(k)*} r_{jj}^{(k)*}}} \left( \sum_{m=1}^{n_k} \epsilon_{mi}^{(k)*} \epsilon_{mj}^{(k)*} + \sum_{m=1}^{n_k} \{\epsilon_{mi}^{(k)*}\}^2 \hat\beta_{i,j}^{(k)} + \sum_{m=1}^{n_k} \{\epsilon_{mj}^{(k)*}\}^2 \hat\beta_{j,i}^{(k)} \right),$$
where $r_{ij}^{(k)*} = (1/n_k) \sum_{m=1}^{n_k} \epsilon_{mi}^{(k)*} \epsilon_{mj}^{(k)*}$. By letting $\bar\epsilon^{(k)} = (1/n_k) \sum_{m=1}^{n_k} \epsilon_m^{(k)}$, $(\hat\sigma_{ij,\epsilon}^{(k)})_{1 \le i, j \le p} = (1/n_k) \sum_{m=1}^{n_k} (\epsilon_m^{(k)} - \bar\epsilon^{(k)})(\epsilon_m^{(k)} - \bar\epsilon^{(k)})^T$, and $b_{ij}^{(k)} = \omega_{ii}^{(k)} \hat\sigma_{ii,\epsilon}^{(k)} + \omega_{jj}^{(k)} \hat\sigma_{jj,\epsilon}^{(k)} - 1$, we will prove that, under mild conditions (see the detailed proof in Appendix A),
$$S_{ij}^{(k)*} + b_{ij}^{(k)} \frac{\omega_{ij}^{(k)}}{\sqrt{\omega_{ii}^{(k)} \omega_{jj}^{(k)}}} \xrightarrow{D} N\left(0, \; 1 + \frac{\{\omega_{ij}^{(k)}\}^2}{\omega_{ii}^{(k)} \omega_{jj}^{(k)}}\right). \tag{6}$$
Similarly to [4], the estimated coefficients $\hat\beta_j^{(k)}$ must satisfy the following conditions:
$$\left\|\hat\beta_j^{(k)} - \beta_j^{(k)}\right\|_1 = O_p(a_n^{(k)}),$$
$$\min\left\{ \lambda_{\max}^{1/2}(\Sigma^{(k)}) \left\|\hat\beta_j^{(k)} - \beta_j^{(k)}\right\|_2, \; \max_{1 \le j \le p} \sqrt{(\hat\beta_j^{(k)} - \beta_j^{(k)})^T \hat\Sigma_{-j,-j}^{(k)} (\hat\beta_j^{(k)} - \beta_j^{(k)})} \right\} = O_p(b_n^{(k)}), \tag{7}$$
where
$$a_n^{(k)} = o(\log p / n_k), \quad \text{and} \quad b_n^{(k)} = o(n_k^{-1/4}).$$
Equation (6) is our main result, which is essentially a counterpart of Proposition 3.1 in [4]. The detailed proof is given in Appendix A. The asymptotic result obtained here suggests that, with an appropriate choice of regression coefficients $\hat\beta_j^{(k)}$, Liu’s test can be readily extended to a nonparanormal framework by Winsorized imputation. Under GGMs, condition (7) can be satisfied by several popular shrinkage estimators, including the lasso estimator and the Dantzig selector. For the choice of $\hat\beta_j^{(k)}$ under NPNGMs, one can use the rank-based method introduced by Xue and Zou (2012) [6]. Xue and Zou (2012) showed that the rank-based estimators (e.g., rank-based lasso and rank-based Dantzig selector) achieve exactly the same convergence rate as their oracle counterparts; therefore, they also satisfy our condition (7).
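Once residuals and regression coefficients are available, the single-class statistic is a simple function of three sums. The sketch below assumes the form $S_{ij}^{*} = \left(\sum_m \epsilon_{mi}^{*}\epsilon_{mj}^{*} + \sum_m \{\epsilon_{mi}^{*}\}^2 \hat\beta_{i,j} + \sum_m \{\epsilon_{mj}^{*}\}^2 \hat\beta_{j,i}\right) / \left(n \sqrt{r_{ii}^{*} r_{jj}^{*}}\right)$ with $r_{ii}^{*} = (1/n)\sum_m \{\epsilon_{mi}^{*}\}^2$; the function name and argument layout are hypothetical, and the residuals are taken as given rather than estimated here.

```python
import math


def bias_corrected_statistic(eps_i, eps_j, beta_ij, beta_ji):
    """Bias-corrected, standardized residual cross-product for one gene pair.

    eps_i, eps_j : residual lists from regressing variables i and j on the rest
    beta_ij      : coefficient of variable j in the regression for variable i
    beta_ji      : coefficient of variable i in the regression for variable j
    Assumes the standardization n * sqrt(r_ii * r_jj) described in the lead-in.
    """
    n = len(eps_i)
    cross = sum(a * b for a, b in zip(eps_i, eps_j))
    ss_i = sum(a * a for a in eps_i)
    ss_j = sum(b * b for b in eps_j)
    r_ii, r_jj = ss_i / n, ss_j / n
    return (cross + ss_i * beta_ij + ss_j * beta_ji) / (n * math.sqrt(r_ii * r_jj))
```

With orthogonal unit-variance residuals and zero coefficients the statistic is 0, and with identical residuals and zero coefficients it equals 1, which matches its interpretation as a quantity on the correlation scale.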

2.3. Multiple Testing Procedure for FDR Control

Now we introduce the multiple testing procedure for FDR control based on the single-class result from Equation (6). As suggested in [4], the partial correlation coefficient can be well estimated by a thresholding estimator
$$\hat\rho_{ij\cdot}^{(k)} = S_{ij}^{(k)} \, I\left\{ |S_{ij}^{(k)}| \ge 2\sqrt{\frac{\log p}{n_k}} \right\},$$
and we define the following two-sample test statistic:
$$S_{ij}^{(k,k')} = \frac{S_{ij}^{(k)} - S_{ij}^{(k')}}{\sqrt{\dfrac{1}{n_k}\left(1 - \{\hat\rho_{ij\cdot}^{(k)}\}^2\right)^2 + \dfrac{1}{n_{k'}}\left(1 - \{\hat\rho_{ij\cdot}^{(k')}\}^2\right)^2}}.$$
In the multi-sample case, with $\mathbf{S}_{ij} = (S_{ij}^{(k,k')})_{1 \le k < k' \le K}$, we consider a sum-of-squares test statistic
$$S_{ij} = \sum_{k < k'} \{S_{ij}^{(k,k')}\}^2.$$
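The thresholding estimator and the two-sample statistic translate directly into code. The sketch below is a hedged illustration with hypothetical function names; it assumes the single-class statistics are already computed and that the thresholded partial correlations are strictly between −1 and 1, so the denominator is nonzero.

```python
import math


def threshold_partial_correlation(s, n, p):
    """Keep s only if |s| >= 2 * sqrt(log(p) / n); otherwise return 0."""
    return s if abs(s) >= 2.0 * math.sqrt(math.log(p) / n) else 0.0


def two_sample_statistic(s1, s2, n1, n2, p):
    """Standardized difference of the single-class statistics of two classes."""
    rho1 = threshold_partial_correlation(s1, n1, p)
    rho2 = threshold_partial_correlation(s2, n2, p)
    variance = (1.0 - rho1 ** 2) ** 2 / n1 + (1.0 - rho2 ** 2) ** 2 / n2
    return (s1 - s2) / math.sqrt(variance)


def sum_of_squares_statistic(pairwise_stats):
    """Combine all pairwise two-sample statistics for one edge (K >= 2 classes)."""
    return sum(s ** 2 for s in pairwise_stats)
```

For two classes the combined statistic is simply the square of the single two-sample statistic.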
Motivated by [4] (Equations 2.6 and 2.7) and [7], we define the following statistic:
$$T_{ij} = \Phi^{-1}\left( P\left( \sum_{i=1}^{M} \lambda_i Z_i^2 \le S_{ij} \right) \right),$$
and the constant $A = (P_0 - \hat P_0) / Q_0$, where $Z_i$, $i = 1, \ldots, M$, represent a sequence of $M$ i.i.d. standard normal random variables, $P_0 = 2\Phi(1) - 1$, $\hat P_0 = 2 \sum_{1 \le i < j \le p} I\{|T_{ij}| \le 1\} / (p^2 - p)$, $Q_0 = \sqrt{2}\,\phi(1)$, and
$$A(t) = \left( 1 + \frac{|A| \, |t| \, \phi(t)}{\sqrt{2}\,(1 - \Phi(t))} \right)^{-1}.$$
For a given $0 < \alpha_0 < 1$, let
$$t(\alpha_0) = \inf\left\{ t \in \mathbb{R} : \; 1 - \Phi(t) \le \frac{\alpha_0 \, A(t) \, \max\left\{1, \sum_{1 \le i < j \le p} I\{T_{ij} \ge t\}\right\}}{(p^2 - p)/2} \right\}.$$
The FDR can be controlled at level $\alpha_0$ if we reject $H_{0ij}: S_{ij}(\Omega) = 0$ when $T_{ij} \ge t(\alpha_0)$. One may refer to [7] for the detailed proof of this testing procedure.
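A data-driven threshold of this kind can be approximated by scanning a grid of candidate values. The sketch below is an illustration under stated assumptions, not the DNetFinder code: the $\sqrt{2}$ factors in $Q_0$ and $A(t)$, the grid search starting at 0, the step size, the upper limit `t_max`, and all function names are choices made for this example.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def upper_tail(t):
    """1 - Phi(t), computed with erfc so it stays positive far in the tail."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))


def dispersion_constant(t_stats):
    """A = (P0 - P0_hat) / Q0, with P0_hat the fraction of |T| <= 1."""
    p0 = 2.0 * STD_NORMAL.cdf(1.0) - 1.0
    q0 = math.sqrt(2.0) * STD_NORMAL.pdf(1.0)  # sqrt(2) is an assumption here
    p0_hat = sum(1 for t in t_stats if abs(t) <= 1.0) / len(t_stats)
    return (p0 - p0_hat) / q0


def correction_factor(a, t):
    """A(t) = 1 / (1 + |A| |t| phi(t) / (sqrt(2) (1 - Phi(t))))."""
    ratio = abs(a) * abs(t) * STD_NORMAL.pdf(t) / (math.sqrt(2.0) * upper_tail(t))
    return 1.0 / (1.0 + ratio)


def fdr_threshold(t_stats, alpha0, step=0.01, t_max=8.0):
    """Smallest grid value t in [0, t_max] meeting the FDR inequality, else None.

    t_stats holds one statistic T_ij per tested pair (i < j).
    """
    m = len(t_stats)
    a = dispersion_constant(t_stats)
    for k in range(int(round(t_max / step)) + 1):
        t = k * step
        rejected = sum(1 for s in t_stats if s >= t)
        bound = alpha0 * correction_factor(a, t) * max(1, rejected) / m
        if upper_tail(t) <= bound:
            return t
    return None
```

Hypotheses with $T_{ij}$ at or above the returned value would be rejected; returning `None` when no grid point qualifies leaves the fallback rule to the caller.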
Our proposed computational pipeline consists of three steps: (1) Winsorized imputation of the latent Gaussian variables; (2) rank-based estimation of the regression coefficients; and (3) multiple testing with FDR control. On the whole, we put forward a simple procedure to estimate the structural difference between multiple nonparanormal graphical models. The computational pipeline for a two-sample comparison has been implemented in the R package DNetFinder, which can be downloaded from the Comprehensive R Archive Network (CRAN).

3. Numerical Study

We performed a simulation study to evaluate the finite sample performance of the proposed procedure. In particular, we evaluated the empirical false discovery rate (eFDR) as well as the statistical power under two classes, i.e., $K = 2$. The dimension and sample size were set to $p = 200$ and $n_1 = n_2 = 100$. We considered two commonly used graph-generating models, the band graph and the Erdős–Rényi (ER) graph, and two estimators for the regression coefficients, the lasso estimator and the Dantzig selector. The detailed set-up for the precision matrices $\Omega_1$ and $\Omega_2$ is given below:
  • Band graph: $\Omega_1 = (\omega_{ij})_{1 \le i, j \le p}$ was obtained by the following assignments:
    $$\omega_{ij} = \begin{cases} 1, & |i - j| = 0, \\ 0.6, & |i - j| = 1, \\ 0, & |i - j| \ge 2. \end{cases}$$
    We then randomly picked 50 edges in $\Omega_1$ as the differential edges and changed their signs in $\Omega_2$. To ensure positive definiteness, we added $\max(|\lambda_{\min}(\Omega_1)|, |\lambda_{\min}(\Omega_2)|) + 0.05$ to the diagonal of $\Omega_1$ and $\Omega_2$.
  • Erdős–Rényi (ER) graph: Each node pair $(i, j)$ was randomly connected with probability 5%. A correlation coefficient was generated for each edge in the network from a two-part uniform distribution on $[-1/2, -1/4] \cup [1/4, 1/2]$. To ensure positive definiteness, we shrunk the correlations by a factor of 5 and the diagonals were set to one for $\Omega_1$. We then randomly selected 5% of the edges as the differential edges, and changed their signs in $\Omega_2$.
For each graph, we generated the latent Gaussian data (oracle data) from $N(0, \Omega^{-1})$, $\Omega \in \{\Omega_1, \Omega_2\}$, and a Winsorized estimator with truncation parameter $\delta_n = 1 / (4 n^{1/4} \sqrt{\pi \log n})$ was used to implement our test. The performance of the proposed method was then evaluated in two aspects: false discovery rate control and statistical power. In particular, we compared the results based on oracle data with those based on data imputed by the Winsorized estimator. Two estimators, the lasso estimator and the Dantzig selector, were used to estimate the coefficients $\hat\beta$. For oracle data, we applied the R package flare to calculate the solution path over a sequence of 20 candidate values of $\lambda$, tuned by the Akaike information criterion (AIC). For imputed data, we adopted the rank-based methods introduced by [6], i.e., the rank-based lasso and rank-based Dantzig selector. The simulation was repeated 100 times for each FDR level ($\alpha \in \{0.05, 0.10, 0.15, \ldots, 0.50\}$), and the average empirical FDR and statistical power were summarized.
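The band-graph design can be sketched as follows. This is a simplified illustration with hypothetical function names: it builds the band matrix and flips the signs of randomly chosen edges, but it omits the diagonal shift used to guarantee positive definiteness, since that step requires an eigenvalue routine outside the standard library.

```python
import random


def band_precision(p, off_diag=0.6):
    """Band matrix: 1 on the diagonal, off_diag on the first off-diagonals."""
    return [
        [1.0 if i == j else off_diag if abs(i - j) == 1 else 0.0 for j in range(p)]
        for i in range(p)
    ]


def flip_random_edges(omega, n_flip, seed=0):
    """Copy omega and flip the sign of n_flip randomly chosen nonzero edges.

    Returns the new matrix and the sorted list of flipped (i, j) pairs, i < j,
    which play the role of the true differential edges.
    """
    rng = random.Random(seed)
    p = len(omega)
    edges = [(i, j) for i in range(p) for j in range(i + 1, p) if omega[i][j] != 0.0]
    chosen = rng.sample(edges, n_flip)
    flipped = [row[:] for row in omega]
    for i, j in chosen:
        flipped[i][j] = -omega[i][j]
        flipped[j][i] = -omega[j][i]
    return flipped, sorted(chosen)
```

Calling `flip_random_edges(band_precision(200), 50)` yields a pair of matrices that differ in exactly 50 edges, mirroring the first simulation setting before the diagonal adjustment.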
Figure 1 and Figure 2 compare the empirical false discovery rate (eFDR) with the desired level $\alpha$ under the band graph and the ER graph. It can be seen that the empirical FDR based on imputed data is close to the one based on oracle data, and both are close to the desired level $\alpha$, suggesting that the FDR was controlled quite well in both cases. The lasso estimator works almost as well as the Dantzig selector in both settings.
Figure 3 and Figure 4 summarize the statistical power of the test for the band graph and the ER graph. As can be seen, the power for the ER graph is substantially lower than for the band graph, indicating that the complexity and denseness of the underlying differential network may significantly decrease the power of our test. The test based on oracle data performs slightly better than the one based on imputed data, which is due to the loss of information during Winsorized imputation. As in Figure 1 and Figure 2, the lasso estimator works almost as well as the Dantzig selector.
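For a single simulation replicate, the empirical FDR and power can be computed from the set of detected edges and the set of true differential edges. The sketch below uses hypothetical names and the common convention that the false discovery proportion is zero when nothing is detected.

```python
def empirical_fdr_and_power(detected, true_edges):
    """Return (false discovery proportion, power) for one replicate.

    detected   : iterable of (i, j) pairs declared differential
    true_edges : iterable of (i, j) pairs that are truly differential
    """
    detected, true_edges = set(detected), set(true_edges)
    true_positives = len(detected & true_edges)
    false_positives = len(detected - true_edges)
    fdp = false_positives / max(1, len(detected))  # 0 when nothing is detected
    power = true_positives / len(true_edges) if true_edges else float("nan")
    return fdp, power
```

Averaging these two quantities over the replicates gives the eFDR and power curves of the kind summarized in the figures.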
In addition, we compared the proposed test with a direct estimator recently developed by Zhang (2019) [8]. The direct estimator is a rank-based estimator and can be solved by a parametric simplex algorithm. We simulated the data from the Erdős–Rényi (ER) graph with different sample sizes ($n = 25, 50, 100, 150$) and numbers of dimensions ($p = 40, 60, 90, 120$). As the direct estimator does not control the false discovery rate, we set the FDR level at 0.05 for our proposed test. Figure 5 summarizes the empirical FDR and statistical power under different sample sizes (with dimension fixed at 100) and different dimensions (with sample size fixed at 100). It can be seen that the two methods have comparable performance: our proposed test achieves a lower FDR but slightly lower statistical power. However, it is noteworthy that the direct estimator is computationally expensive and becomes impractical when the dimension exceeds 150. Table 1 summarizes the running time of the two methods, where it can be seen that our test is much faster than the direct estimator, especially for relatively high dimensions. For instance, when $p = 120$, the direct estimator takes hours while our test takes less than 10 seconds. As the core part of the proposed algorithm is the estimation of regression coefficients, the time complexity is the same as that of the underlying regression. For instance, with the lasso and $p > n$, the time complexity is $O(np^2)$, while the direct estimator by Zhang (2019) has a time complexity of $O(np^4)$.

4. A Genomic Application

In this part, we applied the proposed test to The Cancer Genome Atlas data (TCGA, [9]) to study the different roles of the cell cycle pathway in two subtypes of breast cancer, the luminal A subtype and the basal-like subtype. The cell cycle pathway is known to play a critical role in the initiation and progression of many human cancers, including breast cancer and ovarian cancer [10,11]. For instance, the cell cycle pathway provided by KEGG (Kyoto Encyclopedia of Genes and Genomes, [12]) contains 128 important genes that co-regulate cell proliferation, including ATM, RB1, CCNE1, and MYC. Abnormal regulation among these genes may cause the over-proliferation of cells and an accumulation of tumor cells [11].
The transcriptome profiling data for breast cancer were downloaded through the Genomic Data Commons portal [13] in January 2017. The expression level of each gene was quantified by the count of reads mapped to the gene. The quantification was done with HTSeq, version 0.9.1 [14]. In our analysis, we excluded 43 subjects, including 12 male subjects and 31 subjects with >1% missing values. In addition, we removed the effects due to different age groups and batches using a median-matching and variance-matching strategy [10,15,16]. For example, the batch effect can be removed in the following way:
$$g_{ijk}^{*} = M_i + (g_{ijk} - M_{ij}) \frac{\hat\sigma_{g_i}}{\hat\sigma_{g_{ij}}},$$
where $g_{ijk}$ refers to the expression value for gene $i$ from sample $k$ in batch $j$ ($j = 1, 2, \ldots, J$; $k = 1, 2, \ldots, n_j$), $M_{ij}$ represents the median of $g_{ij} = (g_{ij1}, \ldots, g_{ijn_j})$, $M_i$ refers to the median of $g_i = (g_{i1}, \ldots, g_{iJ})$, and $\hat\sigma_{g_i}$ and $\hat\sigma_{g_{ij}}$ stand for the standard deviations of $g_i$ and $g_{ij}$, respectively.
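The median-matching and variance-matching adjustment for one gene can be sketched as below. The function name is hypothetical, and the sketch assumes the sample standard deviation (denominator $n - 1$), at least two samples per batch, and nonzero within-batch variation; the source does not specify these details.

```python
import statistics


def match_batches(values, batches):
    """Median- and variance-match one gene's expression values across batches.

    values  : expression values of a single gene, one per sample
    batches : batch label of each sample (same length as values)
    Each value is centered at its batch median, rescaled by the ratio of the
    overall to the within-batch standard deviation, and shifted to the
    overall median.
    """
    overall_median = statistics.median(values)
    overall_sd = statistics.stdev(values)
    by_batch = {}
    for v, b in zip(values, batches):
        by_batch.setdefault(b, []).append(v)
    batch_median = {b: statistics.median(vs) for b, vs in by_batch.items()}
    batch_sd = {b: statistics.stdev(vs) for b, vs in by_batch.items()}
    return [
        overall_median + (v - batch_median[b]) * overall_sd / batch_sd[b]
        for v, b in zip(values, batches)
    ]
```

After the adjustment, every batch has the same median and the same standard deviation as the pooled data, so location and scale differences between batches are removed gene by gene.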
The remaining 959 breast cancer samples were further classified into five subtypes according to two molecular signatures, namely PAM50 [17] and SCMOD2 [18]. The two classifications were implemented separately using the R package genefu [19], and we obtained 530 subjects with concordant classification by the two classifiers. The resulting set contains 221 subjects in the luminal A group, 119 in the luminal B group, 74 in the HER2-enriched group, 105 in the basal-like group, and 11 in the normal-like group. For illustration purposes, we conducted two pairwise comparisons: (1) luminal A vs. basal-like and (2) luminal B vs. basal-like.
To balance the bias and variance, we chose the same truncation parameter in the Winsorized imputation as in our simulation study,
$$\delta_n^{(k)} = \frac{1}{4 n_k^{1/4} \sqrt{\pi \log n_k}},$$
where $k \in \{1, 2\}$, $n_1 = 221$, and $n_2 = 105$. The proposed test based on the Winsorized estimator was then conducted for each gene pair with different FDR cutoffs. Figure 6 and Figure 7 summarize all the identified differential edges under FDR levels $\alpha = 0.05, 0.10, 0.15, 0.20$, with all isolated genes removed. Our results suggest a list of important genes that play different roles in different breast cancer subtypes. For instance, in Figure 6, genes CCNB1 and PRKDC contribute to several differential edges. According to recent studies, CCNB1 is a prognostic biomarker for certain subtypes of breast cancer and is closely associated with hormone therapy resistance [20]. It has also been reported that PRKDC regulates chemosensitivity and is a potential prognostic and predictive marker of response to adjuvant chemotherapy in breast cancer patients [21]. Our findings about several other genes, including CHEK2 and CDC7, also confirm some existing reports [22,23]. As we observed from the two examples, as the desired FDR level increases, the resulting differential network becomes denser (Figure 8 shows the relationship between the FDR level and the number of differential edges). In practice, users should consider the trade-off between accuracy (FDR) and the number of new hypotheses (number of differential edges) and choose an appropriate FDR level [24].
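For reference, the truncation parameter $\delta_n = 1 / (4 n^{1/4} \sqrt{\pi \log n})$ can be evaluated at the two group sizes used in the first comparison. The values below assume the natural logarithm and are rounded, so they should be read as approximate.

```latex
\begin{align*}
\delta_{221} &= \frac{1}{4 \cdot 221^{1/4} \sqrt{\pi \log 221}}
  \approx \frac{1}{4 \times 3.856 \times 4.118} \approx 0.0157, \\
\delta_{105} &= \frac{1}{4 \cdot 105^{1/4} \sqrt{\pi \log 105}}
  \approx \frac{1}{4 \times 3.201 \times 3.824} \approx 0.0204.
\end{align*}
```

Under these assumptions, the empirical CDFs are clipped to roughly $[0.016, 0.984]$ in the luminal A group and $[0.020, 0.980]$ in the basal-like group before the normal quantile transform is applied.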

5. Discussion

Detecting the differential substructure of multiple graphical models is a fundamental and challenging problem in statistics. Liu (2017) [4] studied the problem under the Gaussian framework and introduced an elegant hierarchical test based on the estimation of a single GGM. Unlike most existing methods, Liu’s approach asymptotically controls the false discovery rate at a nominal level, which guarantees the quality of the estimated differential network. In this work, we further extended Liu’s test to a more flexible semiparametric framework, namely the nonparanormal graphical models. Our test is built upon a Winsorized estimator of the unknown transformation functions and it enjoys asymptotic properties similar to those of its oracle counterpart.
Although the new test holds great promise in many applications such as genetic network modeling, it has some practical limitations. First, as we see from the theoretical derivation, the good performance of the test relies on the sparsity assumption on the differential network. Although the sparsity assumption is reasonable in many cases, it could still be violated in some applications. For instance, some genetic pathways may exhibit a global change of gene–gene regulations between different phenotypes. When the differential network is dense or locally dense, the method may fail to control the FDR. To solve the problem, a new test needs to be defined to evaluate the level of sparseness of the change between two conditions. However, there is still a gap in the literature on this topic.
Second, one key assumption in NPNGMs is that the transformed variables follow a joint Gaussian distribution. This assumption also needs to be checked in real-world applications. In low dimensions, one can employ popular normality tests, including the Anderson–Darling test and the Shapiro–Wilk test, on the imputed data or other normal scores. However, most of these tests fail to detect non-normality for high-dimensional data. Normality testing in high dimensions is still an open and challenging problem, and we leave it for future research.
It is also worth mentioning that the new test relies on an accurate estimator of the coefficients $\beta$. Motivated by [6], we chose two popular estimators, the lasso estimator and the Dantzig selector, based on the adjusted Spearman’s rank correlation, which satisfy condition (7). In fact, some other estimators also satisfy the condition, for instance, the rank-based adaptive lasso [6,25] and the square-root lasso estimator [6,26]. These estimators can also be incorporated into our testing framework.

6. Conclusions

We have introduced a novel statistical test to detect the structural difference between two nonparanormal graphical models. The proposed test drops the Gaussian assumption and can potentially be applied to many non-Gaussian data sets for differential network analysis. For instance, some digital gene expression data (e.g., RNA-seq data) do not follow a Gaussian distribution even after log transformation or other variance-stabilizing transformations. In such cases, one can model the data with a nonparanormal graphical model and apply our test to find differential edges between two or multiple phenotypic conditions. The proposed test may also be used to detect differences in the brain connectivity network between normal and disease populations.

Funding

This research received no external funding.

Conflicts of Interest

The author declares no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
FDR      False discovery rate
NPNGM    Nonparanormal graphical model
GGM      Gaussian graphical model
TCGA     The Cancer Genome Atlas

Appendix A. Proof of Equation (6)

Define the estimated residuals based on the Winsorized estimator as:
$$\epsilon_{mj}^{*} = X_{mj}^{*} - \bar X_j^{*} - (X_{m,-j}^{*} - \bar X_{-j}^{*}) \hat\beta_j,$$
where $\bar X_j^{*} = (1/n) \sum_{m=1}^{n} X_{mj}^{*}$ and $\bar X_{-j}^{*} = (1/n) \sum_{m=1}^{n} X_{m,-j}^{*}$. The choice of $\hat\beta_j$ must satisfy the following two conditions:
$$\left\|\hat\beta_j - \beta_j\right\|_1 = O_p(a_n),$$
$$\min\left\{ \lambda_{\max}^{1/2}(\Sigma) \left\|\hat\beta_j - \beta_j\right\|_2, \; \max_{1 \le j \le p} \sqrt{(\hat\beta_j - \beta_j)^T \hat\Sigma_{-j,-j} (\hat\beta_j - \beta_j)} \right\} = O_p(b_n),$$
where $a_n = o(\log p / n)$ and $b_n = o(n^{-1/4})$.
It is worth mentioning that the conditions above are slightly different from the conditions in [4] due to the different convergence rates for oracle data and imputed data. The conditions above can be satisfied by the rank-based estimators introduced in [6], e.g., the rank-based lasso estimator or the rank-based Dantzig selector. By letting $\tilde\epsilon_{mj} = \epsilon_{mj} - \bar\epsilon_j$, we have:
\begin{align}
\epsilon_{mi}^{*} \epsilon_{mj}^{*} = {} & \tilde\epsilon_{mi} \tilde\epsilon_{mj} - \tilde\epsilon_{mi} \left\{ (X_{m,-j}^{*} - \bar X_{-j}^{*}) \hat\beta_j - (X_{m,-j} - \bar X_{-j}) \beta_j \right\} \tag{A1} \\
& - \tilde\epsilon_{mj} \left\{ (X_{m,-i}^{*} - \bar X_{-i}^{*}) \hat\beta_i - (X_{m,-i} - \bar X_{-i}) \beta_i \right\} \tag{A2} \\
& + \hat\beta_i^T (X_{m,-i}^{*} - \bar X_{-i}^{*})^T (X_{m,-j}^{*} - \bar X_{-j}^{*}) \hat\beta_j - \beta_i^T (X_{m,-i} - \bar X_{-i})^T (X_{m,-j} - \bar X_{-j}) \beta_j. \tag{A3}
\end{align}
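First, for term (A3), we have:
$$\begin{aligned}
&\Big|\frac{1}{n}\sum_{m=1}^{n}\big\{\hat{\beta}_i^T (X^*_{m,-i} - \bar{X}^*_{-i})^T (X^*_{m,-j} - \bar{X}^*_{-j})\hat{\beta}_j - \beta_i^T (X_{m,-i} - \bar{X}_{-i})^T (X_{m,-j} - \bar{X}_{-j})\beta_j\big\}\Big|\\
&\quad= \big|\hat{\beta}_i^T(\tilde{\Sigma}_{-i,-j} - \hat{\Sigma}_{-i,-j})\hat{\beta}_j + (\hat{\beta}_i - \beta_i)^T(\hat{\Sigma}_{-i,-j} - \Sigma_{-i,-j})(\hat{\beta}_j - \beta_j) + (\hat{\beta}_i - \beta_i)^T\Sigma_{-i,-j}(\hat{\beta}_j - \beta_j)\big|\\
&\quad\le \max_{i,j}\big|\hat{\beta}_i^T(\tilde{\Sigma}_{-i,-j} - \hat{\Sigma}_{-i,-j})\hat{\beta}_j\big| + \max_{i,j}\big|(\hat{\beta}_i - \beta_i)^T(\hat{\Sigma}_{-i,-j} - \Sigma_{-i,-j})(\hat{\beta}_j - \beta_j)\big| + \max_{i,j}\big|(\hat{\beta}_i - \beta_i)^T\Sigma_{-i,-j}(\hat{\beta}_j - \beta_j)\big|,
\end{aligned}$$
where the last term can be bounded as follows:
$$\max_{i,j}\big|(\hat{\beta}_i - \beta_i)^T\Sigma_{-i,-j}(\hat{\beta}_j - \beta_j)\big| = O_p\Big(\lambda_{\max}(\Sigma)\max_{1\le i\le p}\|\hat{\beta}_i - \beta_i\|_2^2\Big) = O_p(b_n^2).$$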
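It is not hard to show that:
$$\|\hat{\Sigma} - \Sigma\|_{\infty} = O_p\Big(\sqrt{\frac{\log p}{n}}\Big),$$
therefore, the second term can also be bounded:
$$\max_{i,j}\big|(\hat{\beta}_i - \beta_i)^T(\hat{\Sigma}_{-i,-j} - \Sigma_{-i,-j})(\hat{\beta}_j - \beta_j)\big| = O_p\Big(a_n^2\sqrt{\frac{\log p}{n}}\Big).$$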
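The √(log p / n) rate for the sample covariance matrix can be checked numerically. The following is a small simulation sketch under simplifying assumptions that are not part of the proof: Gaussian data with identity covariance, the elementwise maximum norm, and arbitrary choices of p, n, seed, and number of replications.

```python
import numpy as np

def max_norm_cov_error(n, p, rng):
    """Elementwise max-norm error of the sample covariance when the
    true covariance is the identity (an assumption of this sketch)."""
    X = rng.standard_normal((n, p))
    S = np.cov(X, rowvar=False)
    return np.max(np.abs(S - np.eye(p)))

rng = np.random.default_rng(0)
p = 50
errors, ratios = {}, {}
for n in (100, 400, 1600):
    # Average over a few replications to smooth out sampling noise.
    errors[n] = np.mean([max_norm_cov_error(n, p, rng) for _ in range(20)])
    # If the rate holds, this ratio should stay roughly constant in n.
    ratios[n] = errors[n] / np.sqrt(np.log(p) / n)
    print(n, errors[n], ratios[n])
```

If the error scales as √(log p / n), the printed ratio should remain of the same order as n grows while the error itself shrinks.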
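Under some mild regularity conditions (stated in [6]), we have
$$\|\tilde{\Sigma} - \hat{\Sigma}\|_{\infty} = O_p\Big(\sqrt{\frac{\log p\,\log^2 n}{n^{1/2}}}\Big),$$
thus under the condition that $\max_{i,j}|\beta_{i,j}| \le C_1$ and $\lambda_{\min}(\Sigma) = o\big((\log p / n)^{3/4}\big)$, the first term can be bounded as follows:
$$\begin{aligned}
\max_{i,j}\big|\hat{\beta}_i^T(\tilde{\Sigma}_{-i,-j} - \hat{\Sigma}_{-i,-j})\hat{\beta}_j\big| &\le \max_{i,j}\big|\beta_i^T(\tilde{\Sigma}_{-i,-j} - \hat{\Sigma}_{-i,-j})\beta_j\big| + \max_{i,j}\big|(\hat{\beta}_i - \beta_i)^T(\tilde{\Sigma}_{-i,-j} - \hat{\Sigma}_{-i,-j})(\hat{\beta}_j - \beta_j)\big|\\
&= O_p\Big(\sqrt{\frac{\log^2 p\,\log^2 n}{n^{3/2}}} + a_n^2\sqrt{\frac{\log p\,\log^2 n}{n^{1/2}}}\Big).
\end{aligned}$$
Combining the three terms above, we have
$$\Big|\frac{1}{n}\sum_{m=1}^{n}\big\{\hat{\beta}_i^T (X^*_{m,-i} - \bar{X}^*_{-i})^T (X^*_{m,-j} - \bar{X}^*_{-j})\hat{\beta}_j - \beta_i^T (X_{m,-i} - \bar{X}_{-i})^T (X_{m,-j} - \bar{X}_{-j})\beta_j\big\}\Big| = O_p\Big(\sqrt{\frac{\log^2 p\,\log^2 n}{n^{3/2}}} + a_n^2\sqrt{\frac{\log p\,\log^2 n}{n^{1/2}}} + b_n^2\Big).$$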
Next, we bound term (A1), which can be rewritten as:
$$\tilde{\epsilon}_{mi}(X_{m,-j} - \bar{X}_{-j})(\hat{\beta}_j - \beta_j) + \tilde{\epsilon}_{mi}\big\{(X_{m,-j} - \bar{X}_{-j}) - (X^*_{m,-j} - \bar{X}^*_{-j})\big\}\hat{\beta}_j,$$
where the first term can be further decomposed into two parts,
$$\tilde{\epsilon}_{mi}(X_{m,-j} - \bar{X}_{-j})(\hat{\beta}_j - \beta_j) = \tilde{\epsilon}_{mi}(X_{mi} - \bar{X}_i)(\hat{\beta}_{i,j} - \beta_{i,j})\,I\{i \ne j\} + \sum_{l \ne i,j}\tilde{\epsilon}_{mi}(X_{ml} - \bar{X}_l)(\hat{\beta}_{l,j} - \beta_{l,j}).$$
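To bound $\sum_{l \ne i,j}\tilde{\epsilon}_{mi}(X_{ml} - \bar{X}_l)(\hat{\beta}_{l,j} - \beta_{l,j})$, we use the independence between $\epsilon_{mi}$ and $X_{m,-i}$. It is easy to show that
$$\max_{l \ne i}\Big|\frac{1}{n}\sum_{m=1}^{n}\tilde{\epsilon}_{mi}(X_{ml} - \bar{X}_l)\Big| = O_p\Big(\sqrt{\frac{\log p}{n}}\Big),$$
which indicates that
$$\max_{i,j}\Big|\frac{1}{n}\sum_{m=1}^{n}\Big(\sum_{l \ne i,j}\tilde{\epsilon}_{mi}(X_{ml} - \bar{X}_l)(\hat{\beta}_{l,j} - \beta_{l,j})\Big)\Big| = O_p\Big(a_n\sqrt{\frac{\log p}{n}}\Big).$$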
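By the independence between $\epsilon_{mi}$ and $X^*_{m} - X_{m}$, it is not hard to show
$$\begin{aligned}
\Big|\frac{1}{n}\sum_{m=1}^{n}\tilde{\epsilon}_{mi}\big\{(X_{m,-j} - \bar{X}_{-j}) - (X^*_{m,-j} - \bar{X}^*_{-j})\big\}\hat{\beta}_j\Big| &\le \Big|\frac{1}{n}\sum_{m=1}^{n}\tilde{\epsilon}_{mi}\big\{(X_{m,-j} - \bar{X}_{-j}) - (X^*_{m,-j} - \bar{X}^*_{-j})\big\}\beta_j\Big|\\
&\quad + \Big|\frac{1}{n}\sum_{m=1}^{n}\tilde{\epsilon}_{mi}\big\{(X_{m,-j} - \bar{X}_{-j}) - (X^*_{m,-j} - \bar{X}^*_{-j})\big\}(\beta_j - \hat{\beta}_j)\Big|\\
&= O_p\Big(\sqrt{\frac{\log p}{n}} + a_n\sqrt{\frac{\log p}{n}}\Big).
\end{aligned}$$
Combining term (A1) and term (A2), we have
$$\frac{1}{n}\sum_{m=1}^{n}\tilde{\epsilon}_{mi}\big\{(X^*_{m,-j} - \bar{X}^*_{-j})\hat{\beta}_j - (X_{m,-j} - \bar{X}_{-j})\beta_j\big\} = \frac{1}{n}\sum_{m=1}^{n}\tilde{\epsilon}_{mi}(X_{mi} - \bar{X}_i)(\hat{\beta}_{i,j} - \beta_{i,j})\,I\{i \ne j\} + O_p\Big(\sqrt{\frac{\log p}{n}} + a_n\sqrt{\frac{\log p}{n}}\Big).$$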
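Summarizing all the results above, by Equations (22) and (23) of Liu (2013) [24], we have
$$\begin{aligned}
\frac{1}{n}\sum_{m=1}^{n}\epsilon^*_{mi}\epsilon^*_{mj} ={}& \frac{1}{n}\sum_{m=1}^{n}\tilde{\epsilon}_{mi}\tilde{\epsilon}_{mj} - \frac{1}{n}\sum_{m=1}^{n}\tilde{\epsilon}_{mi}(X_{mi} - \bar{X}_i)(\hat{\beta}_{i,j} - \beta_{i,j})\,I\{i \ne j\} - \frac{1}{n}\sum_{m=1}^{n}\tilde{\epsilon}_{mj}(X_{mj} - \bar{X}_j)(\hat{\beta}_{j,i} - \beta_{j,i})\,I\{i \ne j\}\\
& + O_p\Big(\sqrt{\frac{\log^2 p\,\log^2 n}{n^{3/2}}} + a_n^2\sqrt{\frac{\log p\,\log^2 n}{n^{1/2}}} + a_n\sqrt{\frac{\log p}{n}} + b_n^2\Big).
\end{aligned}$$
As $\frac{1}{n}\sum_{m=1}^{n}\tilde{\epsilon}_{mi}(X_{mi} - \bar{X}_i) = \frac{1}{n}\sum_{m=1}^{n}\tilde{\epsilon}_{mi}^2 + \frac{1}{n}\sum_{m=1}^{n}\tilde{\epsilon}_{mi}(X_{m,-i} - \bar{X}_{-i})\beta_i$, and $\mathrm{Var}(X_{m,-i}\beta_i) = (\sigma_{ii}\omega_{ii} - 1)/\omega_{ii} \le C$, we have
$$\max_{1 \le i \le p}\Big|\frac{1}{n}\sum_{m=1}^{n}\tilde{\epsilon}_{mi}(X_{m,-i} - \bar{X}_{-i})\beta_i\Big| = O_p\Big(\sqrt{\frac{\log p}{n}}\Big),$$
therefore
$$\begin{aligned}
\frac{1}{n}\sum_{m=1}^{n}\tilde{\epsilon}_{mi}(X_{mi} - \bar{X}_i) &= \frac{1}{n}\sum_{m=1}^{n}\tilde{\epsilon}_{mi}^2 + O_p\big(\sqrt{\log p / n}\big)\\
&= \frac{1}{n}\sum_{m=1}^{n}(\epsilon^*_{mi})^2 + O_p\Big(\sqrt{\frac{\log^2 p\,\log^2 n}{n^{3/2}}} + a_n^2\sqrt{\frac{\log p\,\log^2 n}{n^{1/2}}} + a_n\sqrt{\frac{\log p}{n}} + b_n^2\Big),
\end{aligned}$$
$$\begin{aligned}
\frac{1}{n}\sum_{m=1}^{n}\epsilon^*_{mi}\epsilon^*_{mj} ={}& \frac{1}{n}\sum_{m=1}^{n}\tilde{\epsilon}_{mi}\tilde{\epsilon}_{mj} - \frac{1}{n}\sum_{m=1}^{n}(\epsilon^*_{mi})^2(\hat{\beta}_{i,j} - \beta_{i,j})\,I\{i \ne j\} - \frac{1}{n}\sum_{m=1}^{n}(\epsilon^*_{mj})^2(\hat{\beta}_{j,i} - \beta_{j,i})\,I\{i \ne j\}\\
& + O_p\Big(\sqrt{\frac{\log^2 p\,\log^2 n}{n^{3/2}}} + a_n^2\sqrt{\frac{\log p\,\log^2 n}{n^{1/2}}} + a_n\sqrt{\frac{\log p}{n}} + b_n^2\Big).
\end{aligned}$$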
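In addition,
$$\frac{1}{n}\sum_{m=1}^{n}(\epsilon^*_{mi})^2 = \frac{1}{n}\sum_{m=1}^{n}\tilde{\epsilon}_{mi}^2 + O_p\Big(\sqrt{\frac{\log^2 p\,\log^2 n}{n^{3/2}}} + a_n^2\sqrt{\frac{\log p\,\log^2 n}{n^{1/2}}} + a_n\sqrt{\frac{\log p}{n}} + b_n^2\Big).$$
Equation (6) follows immediately by the central limit theorem.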

References

  1. Guo, J.; Levina, E.; Michailidis, G.; Zhu, J. Joint estimation of multiple graphical models. Biometrika 2011, 98, 1–15.
  2. Danaher, P.; Wang, P.; Witten, D.M. The joint graphical lasso for inverse covariance estimation across multiple classes. J. R. Stat. Soc. Ser. B 2014, 76, 373–397.
  3. Zhao, S.; Cai, T.T.; Li, H. Direct estimation of differential networks. Biometrika 2014, 101, 253–268.
  4. Liu, W. Structural similarity and difference testing on multiple sparse Gaussian graphical models. Ann. Stat. 2017, 45, 2680–2707.
  5. Liu, H.; Lafferty, J.; Wasserman, L. The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. J. Mach. Learn. Res. 2009, 10, 2295–2328.
  6. Xue, L.; Zou, H. Regularized rank-based estimation of high-dimensional nonparanormal graphical models. Ann. Stat. 2012, 40, 2541–2571.
  7. Efron, B. Correlation and large-scale simultaneous significance testing. J. Am. Stat. Assoc. 2007, 102, 93–103.
  8. Zhang, Q. Direct estimation of differential network under high-dimensional nonparanormal graphical models. Can. J. Stat. 2019, 48, 1–17.
  9. The Cancer Genome Atlas Program. Available online: https://cancergenome.nih.gov (accessed on 9 December 2019).
  10. Zhang, Q.; Burdette, J.; Wang, J.-P. Integrative network analysis of TCGA data for ovarian cancer. BMC Syst. Biol. 2014, 8, 1–18.
  11. The Cancer Genome Atlas Network. Comprehensive molecular portraits of human breast tumours. Nature 2012, 490, 61–70.
  12. Kyoto Encyclopedia of Genes and Genomes. Available online: http://www.genome.jp/kegg/pathway (accessed on 9 December 2019).
  13. Genomic Data Commons portal. Available online: https://gdc.cancer.gov (accessed on 9 December 2019).
  14. Anders, S.; Pyl, P.T.; Huber, W. HTSeq—A Python framework to work with high-throughput sequencing data. Bioinformatics 2015, 31, 166–169.
  15. Hsu, F.; Serpedin, E.; Hsiao, T.; Bishop, A.; Dougherty, E.; Chen, Y. Reducing confounding and suppression effects in TCGA data: An integrated analysis of chemotherapy response in ovarian cancer. BMC Genom. 2012, 13, S13.
  16. Zhang, Q. A powerful nonparametric method for detecting differentially co-expressed genes: Distance correlation screening and edge-count test. BMC Syst. Biol. 2018, 12, 58.
  17. Liu, M.C.; Pitcher, B.N.; Mardis, E.R.; Davies, S.R.; Friedman, P.N.; Snider, J.E.; Vickery, T.L.; Reed, J.P.; DeSchryver, K.; Singh, B.; et al. PAM50 gene signatures and breast cancer prognosis with adjuvant anthracycline- and taxane-based chemotherapy: Correlative analysis of C9741. Breast Cancer 2016, 2, 15023.
  18. Haibe-Kains, B.; Desmedt, C.; Loi, S.; Culhane, A.C.; Bontempi, G.; Quackenbush, J.; Sotiriou, C. A three-gene model to robustly identify breast cancer molecular subtypes. J. Natl. Cancer Inst. 2012, 104, 311–325.
  19. Gendoo, D.M.; Ratanasirigulchai, N.; Schroder, M.S.; Pare, L.; Parker, J.S.; Prat, A.; Haibe-Kains, B. Genefu: An R/Bioconductor package for computation of gene expression-based signatures in breast cancer. Bioinformatics 2016, 32, 1097–1099.
  20. Ding, K.; Li, W.; Zou, Z.; Zou, X.; Wang, C. CCNB1 is a prognostic biomarker for ER+ breast cancer. Med. Hypotheses 2014, 83, 359–364.
  21. Sun, G.; Yang, L.; Dong, C.; Ma, B.; Shan, M.; Ma, B. PRKDC regulates chemosensitivity and is a potential prognostic and predictive marker of response to adjuvant chemotherapy in breast cancer patients. Oncol. Rep. 2017, 37, 3536–3542.
  22. Huggett, M.; Tudzarova, S.; Proctor, I.; Loddo, M.; Keane, M.G.; Stoeber, K.; Williams, G.H.; Pereira, S.P. Cdc7 is a potent anti-cancer target in pancreatic cancer due to abrogation of the DNA origin activation checkpoint. Oncotarget 2016, 7, 18495–18507.
  23. Desrichard, A.; Bidet, Y.; Uhrhammer, N.; Bignon, Y. CHEK2 contribution to hereditary breast cancer in non-BRCA families. Breast Cancer Res. 2011, 13, R119.
  24. Liu, W. Gaussian graphical model estimation with false discovery rate control. Ann. Stat. 2013, 41, 2948–2978.
  25. Zou, H. The Adaptive Lasso and Its Oracle Properties. J. Am. Stat. Assoc. 2006, 101, 1418–1429.
  26. Belloni, A.; Chernozhukov, V.; Wang, L. Square-root lasso: Pivotal recovery of sparse signals via conic programming. Biometrika 2011, 98, 791–806.
Figure 1. Empirical false discovery rates (eFDRs) by oracle data and Winsorized imputations under the band graph setting. The x-axis represents the desired FDR levels from 0.05 to 0.5, and the solid line is y = x.
Figure 2. Empirical FDRs (eFDRs) by oracle data and Winsorized imputations under the Erdős–Rényi (ER) graph setting. The x-axis represents the desired FDR levels from 0.05 to 0.5, and the solid line is y = x.
Figure 3. Statistical powers by oracle data and Winsorized imputations under the band graph setting. The x-axis represents the desired FDR levels from 0.05 to 0.5.
Figure 4. Statistical powers by oracle data and Winsorized estimator under the ER graph setting. The x-axis represents the desired FDR levels from 0.05 to 0.5.
Figure 5. Comparison of the proposed test and direct estimator by Zhang (2019), in terms of empirical FDR and statistical power under different sample sizes and dimensions.
Figure 6. The inferred differential networks between the LumA and Basal-like subtypes under different desired FDR levels: (a) 0.05; (b) 0.10; (c) 0.15; (d) 0.20, with all isolated genes being removed. Each connection in the network represents an identified differential edge.
Figure 7. The inferred differential networks between the LumB and Basal-like subtypes under different desired FDR levels: (a) 0.05; (b) 0.10; (c) 0.15; (d) 0.20, with all isolated genes being removed. Each connection in the network represents an identified differential edge.
Figure 8. Desired FDR level against the number of differential edges.
Table 1. Running time of the proposed test and direct estimator (in seconds).

| n \ p | 40 | 60 | 90 | 120 |
|---|---|---|---|---|
| 25 | 0.88 (7.0) | 1.59 (110) | 3.79 (1936) | 6.49 (23,066) |
| 50 | 1.19 (7.7) | 1.93 (127) | 4.15 (1973) | 6.83 (23,119) |
| 100 | 1.87 (9.1) | 2.61 (146) | 5.00 (2016) | 7.80 (23,153) |
| 200 | 2.11 (11) | 3.04 (165) | 6.22 (2055) | 9.61 (23,201) |

Share and Cite

MDPI and ACS Style

Zhang, Q. Testing Differential Gene Networks under Nonparanormal Graphical Models with False Discovery Rate Control. Genes 2020, 11, 167. https://doi.org/10.3390/genes11020167

AMA Style

Zhang Q. Testing Differential Gene Networks under Nonparanormal Graphical Models with False Discovery Rate Control. Genes. 2020; 11(2):167. https://doi.org/10.3390/genes11020167

Chicago/Turabian Style

Zhang, Qingyang. 2020. "Testing Differential Gene Networks under Nonparanormal Graphical Models with False Discovery Rate Control" Genes 11, no. 2: 167. https://doi.org/10.3390/genes11020167

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.
