Detecting the Common and Individual Effects of Rare Variants on Quantitative Traits by Using Extreme Phenotype Sampling

Zhou, Ya-Jing; Wang, Yong; Chen, Li-Li

doi:10.3390/genes7010002

by Ya-Jing Zhou ^1,2, Yong Wang ^1,* and Li-Li Chen ^1,2

¹ Department of Mathematics, School of Science, Harbin Institute of Technology, Harbin 150001, China
² School of Mathematical Sciences, Heilongjiang University, Harbin 150080, China
* Author to whom correspondence should be addressed.

Genes 2016, 7(1), 2; https://doi.org/10.3390/genes7010002

Submission received: 29 September 2015 / Revised: 21 December 2015 / Accepted: 5 January 2016 / Published: 14 January 2016

(This article belongs to the Section Human Genomics and Genetic Diseases)

Abstract: Next-generation sequencing technology has made it possible to detect rare genetic variants associated with complex human traits. Various methods specifically designed for rare variants have been proposed in the recent literature. These tests can be broadly classified into burden and nonburden tests. In this paper, we combine the strengths of burden and nonburden tests by considering both the common effect of the variants and the individual deviations from the common effect. To achieve robustness, we use two methods of combining p-values: Fisher’s method and the minimum-p method. To improve the power of rare variant association tests, we explore the advantage of extreme phenotype sampling. We first dichotomize the continuous phenotype before analysis, treating the two extremes as two groups that represent a dichotomous phenotype. We then compare the powers of several methods under extreme phenotype sampling and random sampling. Extensive simulation studies show that, at the same sample size, our proposed methods with extreme phenotype sampling are the most powerful, or very close to the most powerful, across various settings of the true model.

Keywords: association study; extreme sampling; random sampling; rare variants

1. Introduction

Genome-wide association studies (GWAS) have successfully identified hundreds of common genetic variants associated with complex diseases and human traits. However, these common variants, with minor allele frequencies (MAF) > 3%, have small to moderate effects and explain only a small fraction of the heritability of common diseases [1,2,3,4]. It has therefore been hypothesized that rare variants with MAF < 3% may account for some of the missing heritability [5,6,7,8,9,10]. Next-generation sequencing technology will soon make it possible to sequence the whole genomes of large groups of individuals, and thus to test rare variants. Unfortunately, rare variants are difficult to detect even with large sample sizes, so powerful study designs are needed.

Various methods have been proposed to detect association between rare variants and complex diseases. These tests can be broadly classified into burden and nonburden tests. Burden tests collapse multiple rare variants in a genetic region into a single variant, and then test the association between that single variant and the trait of interest. Burden tests include the cohort allelic sums test (CAST) [11], the combined multivariate and collapsing (CMC) method [12], the weighted methods [13], and the variable minor allele frequency threshold method [14]. The same strategy is used in many other methods [15,16,17,18]. In effect, burden tests detect the common effect of all rare variants in a region. They are therefore powerful when all rare variants in a region are causal and their effects are in the same direction, but they suffer great power loss when these assumptions are violated. Nonburden tests, also called “variance component tests”, use the kernel machine regression framework. In this framework, the effects of variants are assumed to be independently and identically distributed with mean 0 and variance $\tau^2$. Testing whether a set of variants is associated with the phenotype is then equivalent to testing whether $\tau^2 = 0$. Examples of nonburden tests include C-alpha [19], the sequence kernel association test (SKAT) [20], the optimal SKAT (SKAT-O) [21], the mixed effects test (MiST) [22], and an optimally weighted combination of variants (TOW) [23]. Variance component tests are more powerful than burden tests when a genetic region has both protective and deleterious variants or many noncausal variants.

Nonburden tests assume that the average effect across variants is zero. However, the average effect is zero only if the effects of the rare variants point in opposite directions with the same strength and cancel out. A model that restricts the average effect to be zero may therefore lose power. For this reason, we use a more flexible model proposed by Wang et al. [24]. This model combines the strengths of burden and nonburden tests by considering both the common effect and the individual deviations from the common effect. To increase power, we combine p-values using Fisher’s method and the minimum-p method.

In this paper, we also explore the advantage of extreme phenotype sampling in rare variant analysis and refine this design framework for future large-scale association studies on quantitative traits. Sampling individuals with extreme phenotypes can enrich the frequency of rare variants and therefore increase power compared to random sampling. Recently, several statistical methods have been proposed for rare variant association studies in which extreme phenotypes are sampled [25,26,27,28,29]. Here, we use both random sampling and extreme phenotype sampling, and compare the powers of different methods under the two sampling schemes at the same sample size. In extreme phenotype sampling, we sample the individuals with the highest trait values as cases and the individuals with the lowest trait values as controls. A logistic model is used for these “case-control” data. We conduct a large number of simulations and then analyze the type I error rates and powers of several methods.

2. Materials and Methods

2.1. Materials

We consider quantitative traits only. Consider a sample of n individuals and p rare variants in a genomic region. For $i = 1, 2, \dots, n$, let $Y_i$ denote the trait value of the ith individual; for $i = 1, 2, \dots, n$ and $j = 1, 2, \dots, p$, let $G_{ij}$ denote the number of minor alleles that the ith subject carries at the jth variant site.

2.2. Methods

Consider the linear regression model

$$Y_i = \beta_0 + \beta_1 G_{i1} + \dots + \beta_p G_{ip} + \varepsilon_i, \tag{1}$$

where $\beta = (\beta_1, \beta_2, \dots, \beta_p)'$ is the vector of regression coefficients for $G_i = (G_{i1}, G_{i2}, \dots, G_{ip})'$, and $\varepsilon_i$ is an error term with mean zero and variance $\sigma^2$. Testing whether a set of rare variants affects a trait is equivalent to testing the null hypothesis $H_0: \beta = 0$, that is, $\beta_1 = \beta_2 = \dots = \beta_p = 0$. For rare variants, the likelihood ratio test with p degrees of freedom has low power.

To reduce the degrees of freedom and increase power, the burden test collapses the p rare variants into a single variant. The model then simplifies to

$$Y_i = \theta_0 + \theta_1 C_i + \varepsilon_i, \tag{2}$$

where $C_i = \sum_{j=1}^{p} G_{ij}$ and $\theta_1$ represents the common effect across all rare variants. The null hypothesis of no association is $H_0: \theta_1 = 0$. The burden score test statistic for $\theta_1 = 0$ is

$$T_B = \left( \frac{1}{\sigma^2} \sum_{i=1}^{n} C_i (Y_i - \bar{Y}) \right)^2, \tag{3}$$

where $\bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i$ and $\sigma^2$ is estimated by the variance of Y. When the individuals are randomly sampled, the burden test is denoted RS_burden.
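Equation (3) can be computed in a few lines. The following is a minimal sketch, assuming NumPy and assuming that the variance of Y is taken with divisor n (as in the Appendix); the function name `burden_score_stat` is hypothetical and not part of the original paper.

```python
import numpy as np

def burden_score_stat(G, y):
    """Burden score statistic T_B of Equation (3).

    G : (n, p) array of minor-allele counts G_ij
    y : (n,) array of quantitative trait values Y_i
    """
    G = np.asarray(G, dtype=float)
    y = np.asarray(y, dtype=float)
    C = G.sum(axis=1)        # burden score C_i = sum_j G_ij
    sigma2 = y.var()         # assumption: variance of Y with divisor n
    return float((C @ (y - y.mean()) / sigma2) ** 2)
```

The statistic depends on the genotypes only through the collapsed score C, which is why it loses power when variant effects differ in sign.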

In practice, the very extremes of a phenotype distribution may reflect unknown genetic heterogeneity due to genes with large effects (i.e., Mendelian disorders). In such cases, the corresponding variants will be enriched in the extreme sample. We therefore expect extreme phenotype sampling to be more powerful than random sampling. In clinical practice, for example, diseases such as hypertension and obesity are defined by applying a threshold to a quantitative trait. When individuals with extreme phenotypes are sampled, the high phenotypic extremes are regarded as cases ($Y_i = 1$) and the low phenotypic extremes are regarded as controls ($Y_i = 0$). The logistic model for these “case-control” data is

$$\mathrm{logit}\, P(Y_i = 1) = \theta_0 + \theta_1 C_i. \tag{4}$$

The test statistic for $H_0: \theta_1 = 0$ is

$$T_B' = \left( \sum_{i=1}^{n} C_i (Y_i - \bar{Y}) \right)^2. \tag{5}$$

This method, which uses extreme phenotype sampling, is denoted ES_burden. Both models assume that all rare variants affect the phenotype with the same magnitude and in the same direction. When these assumptions are violated, the two methods can suffer from power loss.
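As a small worked step, assume equal numbers of cases and controls (the design used in the simulations of Section 3.1), so that $\bar{Y} = 1/2$. Under that assumption, Equation (5) reduces to a squared difference of burden totals between the two extremes:

```latex
\sum_{i=1}^{n} C_i \left( Y_i - \tfrac{1}{2} \right)
  = \tfrac{1}{2} \sum_{i:\, Y_i = 1} C_i \;-\; \tfrac{1}{2} \sum_{i:\, Y_i = 0} C_i ,
\qquad
T_B' = \tfrac{1}{4} \Bigl( \sum_{i:\, Y_i = 1} C_i - \sum_{i:\, Y_i = 0} C_i \Bigr)^{2} .
```

In other words, ES_burden simply asks whether cases and controls carry different total numbers of minor alleles in the region.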

Here, we use the following model [24]:

$$Y_i = \theta_0 + \theta_1 C_i + \beta_1 G_{i1} + \dots + \beta_p G_{ip} + \varepsilon_i, \quad E(\beta_j) = 0, \quad \mathrm{cov}(\beta) = \sigma_\beta^2 I_p, \quad \varepsilon_i \sim N(0, \sigma^2), \tag{6}$$

where $\theta_1$ represents the common effect across all rare variants and is regarded as a fixed effect, and $\beta = (\beta_1, \beta_2, \dots, \beta_p)'$ represents the vector of individual effect deviations from the common effect and is regarded as a random effect. Under this model, testing whether the rare variants influence the phenotype corresponds to testing the null hypothesis

$$H_0: \theta_1 = 0, \ \sigma_\beta^2 = 0. \tag{7}$$

In the Appendix, we show that the score statistic for the null hypothesis $H_0$ is given by

$$S = (S_1, S_2)' = \left( \frac{1}{\sigma^2} \sum_{i=1}^{n} C_i (Y_i - \bar{Y}), \ \frac{1}{2} U'U - \frac{1}{2\sigma^2} \mathrm{tr}(G'G) \right)', \tag{8}$$

where $U = (U_1, \dots, U_p)'$, $U_k = \frac{1}{\sigma^2} \sum_{i=1}^{n} G_{ik} (Y_i - \bar{Y})$ and $G = (G_1, G_2, \dots, G_n)'$. Let $p_1$ denote the two-sided p-value of $S_1$, and $p_2$ denote the p-value of $S_2$.
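The two components of Equation (8) can be evaluated directly from the genotype matrix and the trait vector. This is a minimal sketch, assuming NumPy and the null estimate of $\sigma^2$ with divisor n given in the Appendix; `score_vector_linear` is a hypothetical name introduced for illustration.

```python
import numpy as np

def score_vector_linear(G, y):
    """Score components (S1, S2) of Equation (8) under the linear model.

    G : (n, p) array of minor-allele counts
    y : (n,) array of quantitative trait values
    """
    G = np.asarray(G, dtype=float)
    y = np.asarray(y, dtype=float)
    r = y - y.mean()                    # residuals under H0
    sigma2 = y.var()                    # null estimate of sigma^2 (divisor n)
    S1 = G.sum(axis=1) @ r / sigma2     # common-effect (burden) component
    U = G.T @ r / sigma2                # per-variant scores U_k
    trace_GtG = np.sum(G * G)           # tr(G'G) equals the sum of squared entries
    S2 = 0.5 * (U @ U) - 0.5 * trace_GtG / sigma2
    return float(S1), float(S2)
```

Since $C_i = \sum_j G_{ij}$, the first component equals the sum of the $U_k$, whereas the second depends on their squares and so does not cancel when effects have opposite signs.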

To combine the p-values obtained from $S_1$ and $S_2$, we propose to use Fisher’s method, for which the test statistic of $H_0: \theta_1 = 0, \ \sigma_\beta^2 = 0$ is defined as

$$T_1 = -2 \log p_1 - 2 \log p_2. \tag{9}$$

In addition, we consider another method for combining p-values, the minimum-p approach, in which the test statistic of $H_0$ is defined as

$$T_2 = \min\{p_1, p_2\}. \tag{10}$$

These two methods of combining p-values have been studied by many authors [22,30]. Under random sampling, the two methods are called RS_Fisher and RS_min-p, respectively.
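Equations (9) and (10) translate directly into code. The sketch below uses only the Python standard library; the function names are hypothetical. Note the opposite orientations: large values of $T_1$ and small values of $T_2$ count as evidence against $H_0$.

```python
import math

def fisher_stat(p1, p2):
    """Fisher combination T_1 of Equation (9); requires p1, p2 > 0."""
    return -2.0 * math.log(p1) - 2.0 * math.log(p2)

def min_p_stat(p1, p2):
    """Minimum-p combination T_2 of Equation (10)."""
    return min(p1, p2)
```

For independent p-values, Fisher’s statistic would follow a chi-square distribution with 4 degrees of freedom under the null hypothesis; because $S_1$ and $S_2$ need not be independent, the significance of both statistics is assessed by permutation in this paper.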

Next, we consider extreme phenotype sampling. Treating the higher and the lower phenotypic extremes as cases and controls, the logistic regression model for these “case-control” data is

$$\mathrm{logit}\{P(Y_i = 1)\} = \theta_0 + \theta_1 C_i + \beta_1 G_{i1} + \dots + \beta_p G_{ip}, \quad E(\beta_j) = 0, \quad \mathrm{cov}(\beta) = \sigma_\beta^2 I_p. \tag{11}$$

In the Appendix, we show that the score statistic for the null hypothesis $H_0: \theta_1 = 0, \ \sigma_\beta^2 = 0$ is

$$S = (S_1, S_2)' = \left( \sum_{i=1}^{n} (Y_i - \bar{Y}) C_i, \ \frac{1}{2} U'U - \frac{1}{2} \mathrm{tr}\left( \bar{Y}(1 - \bar{Y}) G'G \right) \right)', \tag{12}$$

where $U = (U_1, \dots, U_p)'$ and $U_k = \sum_{i=1}^{n} G_{ik} (Y_i - \bar{Y})$. Let $p_3$ and $p_4$ denote the p-values obtained from $S_1$ and $S_2$, respectively. We again combine the p-values using Fisher’s method and the minimum-p method. The test statistic for $H_0: \theta_1 = 0, \ \sigma_\beta^2 = 0$ is

$$T_3 = -2 \log p_3 - 2 \log p_4, \quad \text{or} \quad T_4 = \min\{p_3, p_4\}. \tag{13}$$

We denote the two methods as ES_Fisher and ES_min-p, respectively. We use a permutation approach to obtain the p-value of the statistic $T_j$, for $j = 1, 2, 3, 4$. The permutation process is the same as that of Lin et al. [31].
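One way such a permutation test can be organized is sketched below for the extreme-sampling statistics of Equations (12) and (13). This is a generic single-layer permutation scheme written for illustration, assuming NumPy; it is not claimed to reproduce the exact procedure of Lin et al. [31]. The function names, the upper-tail convention for $S_2$, and the use of the same permutations for both the component p-values and the combined statistic are assumptions.

```python
import numpy as np

def es_scores(G, case):
    """S1 and S2 of Equation (12) for 0/1 case labels."""
    r = case - case.mean()
    S1 = G.sum(axis=1) @ r
    U = G.T @ r
    v = case.mean() * (1.0 - case.mean())       # null variance of a label
    S2 = 0.5 * (U @ U) - 0.5 * v * np.sum(G * G)
    return S1, S2

def permutation_pvalues(G, case, n_perm=500, seed=0):
    """Permutation p-values of the Fisher (T3) and minimum-p (T4) statistics."""
    rng = np.random.default_rng(seed)
    G = np.asarray(G, dtype=float)
    case = np.asarray(case, dtype=float)
    stats = np.empty((n_perm + 1, 2))
    stats[0] = es_scores(G, case)                # row 0 holds the observed data
    for b in range(1, n_perm + 1):
        stats[b] = es_scores(G, rng.permutation(case))
    a1 = np.abs(stats[:, 0])                     # two-sided for S1
    a2 = stats[:, 1]                             # assumption: upper tail for S2
    # p-value of every row relative to all rows; each row counts itself,
    # so no p-value is zero and the logarithm below is always defined
    p3 = (a1[None, :] >= a1[:, None]).mean(axis=1)
    p4 = (a2[None, :] >= a2[:, None]).mean(axis=1)
    fisher = -2.0 * np.log(p3) - 2.0 * np.log(p4)
    min_p = np.minimum(p3, p4)
    p_fisher = float((fisher >= fisher[0]).mean())   # large T3 is extreme
    p_min_p = float((min_p <= min_p[0]).mean())      # small T4 is extreme
    return p_fisher, p_min_p
```

Reusing one set of permutations for both layers avoids a nested permutation loop; the same scheme applies to the random-sampling statistics if `es_scores` is replaced by the linear-model scores of Equation (8).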

3. Simulation and Results

3.1. Simulation Design

Genetic Analysis Workshop 17 (GAW17) provides the Mini-Exome genotype data for simulation studies. This dataset contains genotypes and phenotypes for 697 unrelated individuals on 3205 genes. We follow the simulation set-up of Sha et al. [23]. Specifically, we select a gene (ADAMTS4) with 40 variants and infer its haplotype phases for the 697 individuals. To generate genotypes at the 40 variants for N individuals, we form each simulated individual by randomly combining two haplotypes from the 697 individuals.

To evaluate the type I error rate, we generate quantitative trait values using the model

$$Y_i = \beta_0 + \varepsilon_i, \tag{14}$$

where $\beta_0 = 0.1$ and $\varepsilon_i$ follows a standard normal distribution. We estimate the empirical type I error rate as the proportion of p-values less than $\alpha = 0.01$ or $0.05$.

To evaluate power, we generate phenotypes for the N individuals using the model

$$Y_i = \beta_0 + \beta_1 G_{i1} + \beta_2 G_{i2} + \dots + \beta_p G_{ip} + \varepsilon_i, \tag{15}$$

where $\beta_0 = 0.1$ and $\varepsilon_i$ follows a standard normal distribution. The effects of causal variants depend on their minor allele frequencies (MAF), i.e., $|\beta_j| = -0.2 \log_{10}(\mathrm{MAF}_j)$. The percentage of causal variants among variants with MAF $< 0.03$ is set to one of three values: 40%, 60%, and 80%. The percentage of causal variants with a positive effect is set to one of three values: 50%, 80%, and 100%. We also consider different sample sizes (n = 500, 1000, and 2000) and different proportions of extreme phenotypes (10% and 20%).
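The effect-size rule gives larger effects to rarer variants: a variant with MAF 0.01 receives $|\beta_j| = 0.4$, and one with MAF 0.001 receives $|\beta_j| = 0.6$. The sketch below illustrates one way to assign effects under this rule, assuming NumPy. How causal variants and their signs are chosen (here, uniformly at random among the rare variants, with counts rounded to the nearest integer) is an assumption made for illustration, and `causal_effects` is a hypothetical name.

```python
import numpy as np

def causal_effects(maf, frac_causal, frac_positive, threshold=0.03, seed=0):
    """Assign effects beta_j with |beta_j| = -0.2 * log10(MAF_j) to causal variants.

    maf           : (p,) minor allele frequencies
    frac_causal   : fraction of rare variants (MAF < threshold) that are causal
    frac_positive : fraction of causal variants with a positive effect
    """
    rng = np.random.default_rng(seed)
    maf = np.asarray(maf, dtype=float)
    beta = np.zeros_like(maf)                       # noncausal variants keep effect 0
    rare = np.flatnonzero(maf < threshold)
    n_causal = int(round(frac_causal * rare.size))
    causal = rng.choice(rare, size=n_causal, replace=False)   # assumption: random choice
    n_pos = int(round(frac_positive * n_causal))
    sign = np.where(np.arange(n_causal) < n_pos, 1.0, -1.0)
    beta[causal] = sign * (-0.2 * np.log10(maf[causal]))
    return beta

def simulate_trait(G, beta, beta0=0.1, seed=0):
    """Generate Y_i from Equation (15) with standard normal errors."""
    rng = np.random.default_rng(seed)
    G = np.asarray(G, dtype=float)
    return beta0 + G @ beta + rng.standard_normal(G.shape[0])
```

Setting `frac_causal=0` makes every effect zero, which recovers the null model of Equation (14).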

After the genotype and phenotype data for the N individuals are simulated, random sampling selects n individuals arbitrarily from the N individuals. For extreme phenotype sampling, we take the $n/2$ individuals with the highest trait values among the N individuals as cases and the $n/2$ individuals with the lowest trait values as controls. The method of Wang et al. [24] is denoted JOINT. RS_Fisher, RS_min-p, RS_burden and JOINT use random sampling; ES_Fisher, ES_min-p, and ES_burden use extreme phenotype sampling.
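The two sampling schemes can be sketched as follows, assuming NumPy; the function names are hypothetical. If each tail is meant to hold a fraction q of the cohort (10% or 20%), then $n/2 = qN$, so the cohort size would presumably be $N = n/(2q)$; this relation is inferred here rather than stated in the text.

```python
import numpy as np

def random_sample(y, n, seed=0):
    """Indices of n individuals drawn without replacement from the cohort."""
    rng = np.random.default_rng(seed)
    return rng.choice(len(y), size=n, replace=False)

def extreme_sample(y, n):
    """Indices and 0/1 labels for the n/2 highest (cases) and n/2 lowest (controls)."""
    order = np.argsort(np.asarray(y, dtype=float))   # ascending trait values
    half = n // 2
    controls = order[:half]                          # lowest trait values
    cases = order[-half:]                            # highest trait values
    idx = np.concatenate([cases, controls])
    label = np.concatenate([np.ones(half, dtype=int), np.zeros(half, dtype=int)])
    return idx, label
```

The returned labels play the role of $Y_i$ in the logistic models (4) and (11), and the rows `G[idx]` are the corresponding genotypes.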

3.2. Evaluation on Type I Error Rates

For type I error rates, we consider different sample sizes, different proportions of extreme phenotypes, and different significance levels. In each simulation setting, p-values are estimated by 500 permutations and type I error rates are evaluated over 1000 replications. The estimated type I error rates of the seven methods (JOINT, RS_Fisher, RS_min-p, ES_Fisher, ES_min-p, RS_burden and ES_burden) are summarized in Table 1. Table 1 shows that the estimated type I error rates are not significantly different from the nominal levels. Thus, all tests are valid.

Table 1. The estimated type I error rates for all tests.

Tails	Sample Size	α	JOINT	RS_Fisher	RS_min-p	ES_Fisher	ES_min-p	RS_burden	ES_burden
0.1	500	0.01	0.015	0.013	0.013	0.014	0.013	0.010	0.016
0.1	1000	0.01	0.011	0.010	0.018	0.008	0.003	0.015	0.008
0.1	2000	0.01	0.011	0.012	0.012	0.011	0.013	0.012	0.016
0.1	500	0.05	0.047	0.051	0.045	0.043	0.043	0.048	0.047
0.1	1000	0.05	0.057	0.048	0.052	0.050	0.052	0.042	0.047
0.1	2000	0.05	0.050	0.049	0.050	0.050	0.050	0.049	0.052
0.2	500	0.01	0.008	0.005	0.007	0.009	0.011	0.009	0.015
0.2	1000	0.01	0.012	0.012	0.012	0.011	0.010	0.010	0.013
0.2	2000	0.01	0.009	0.013	0.013	0.014	0.011	0.018	0.018
0.2	500	0.05	0.053	0.049	0.054	0.041	0.044	0.054	0.037
0.2	1000	0.05	0.046	0.040	0.039	0.052	0.061	0.036	0.052
0.2	2000	0.05	0.041	0.049	0.052	0.050	0.045	0.050	0.049

Note: “tails” represents 10% or 20% high/low extreme phenotype sampling; α represents the significance level.
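A rough way to read Table 1 is to compare each entry with the Monte Carlo error expected from 1000 replications. The sketch below computes a normal-approximation band around the nominal level, assuming binomial sampling error; the paper does not state which criterion it used, and `monte_carlo_band` is a hypothetical helper. With 1000 replications the band is roughly 0.036 to 0.064 at α = 0.05 and roughly 0.004 to 0.016 at α = 0.01.

```python
import math

def monte_carlo_band(alpha, replications, z=1.96):
    """Approximate 95% band for an empirical rejection rate when the true rate is alpha."""
    se = math.sqrt(alpha * (1.0 - alpha) / replications)   # binomial standard error
    return alpha - z * se, alpha + z * se
```

Because Table 1 contains 84 entries, a few values slightly outside such a band would be expected by chance alone.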

3.3. Power Comparisons

For power comparisons, we consider different sample sizes, different proportions of extreme phenotypes, different percentages of causal variants, and different percentages of causal variants with positive effects. In each simulation scenario, p-values are estimated by 500 permutations and powers are evaluated using 500 replications at a significance level of 0.05. In all cases, the MAF threshold defining rare variants is set to 0.03.

Power comparisons of the seven tests when half of the causal variants are risk variants and half are protective are given in Figure 1. As shown in Figure 1, the three tests with extreme phenotype sampling are more powerful than the four tests with random sampling. ES_Fisher and ES_min-p have similar powers, and they are much more powerful than ES_burden. Among the other four tests (JOINT, RS_Fisher, RS_min-p, and RS_burden), JOINT is the least powerful. RS_Fisher and RS_min-p are slightly better than RS_burden. The powers of all tests increase with the sample size. For a given sample size, all tests gain power as the percentage of causal variants increases. As the proportion sampled from each extreme increases from 10% to 20%, the powers of all tests decrease, and the power differences among the tests shrink. This is because sampling a larger proportion from each extreme behaves more like random sampling: the enrichment of minor alleles in the sample is reduced, and the tests lose power.

Figure 1. Power comparisons of seven tests when 50% causal variants have a positive effect on phenotype while the remaining 50% have a negative effect. The left panel considers 10% high/low extreme phenotype sampling with the three rows corresponding to 40%, 60%, and 80% causal variants. The right panel considers 20% high/low extreme phenotype sampling. Three sample sizes are considered: n = 500, 1000, 2000. Powers are estimated at the 0.05 significance level.

Power comparisons of the seven tests for 80% risk variants and 20% protective variants are given in Figure 2. Comparing Figure 2 with Figure 1, we can see that the powers of all tests increase uniformly and that the patterns of the power comparisons are very similar. The difference between ES_Fisher and ES_burden decreases.

Figure 2. Power comparisons of seven tests when 80% causal variants have a positive effect on phenotype while the remaining 20% have a negative effect. The left panel considers 10% high/low extreme phenotype sampling with the three rows corresponding to 40%, 60%, and 80% causal variants. The right panel considers 20% high/low extreme phenotype sampling. Three sample sizes are considered: n = 500, 1000, 2000. Powers are estimated at the 0.05 significance level.

Power comparisons of the seven tests when all causal variants have effects in the same direction are given in Figure 3. ES_Fisher, ES_min-p and ES_burden have similar powers, with ES_burden slightly better than ES_Fisher. Across Figure 1, Figure 2 and Figure 3, the difference between ES_Fisher and ES_burden decreases gradually. This is because the burden tests assume that all variants are causal with effects in the same direction, whereas our proposed methods allow for effects in different directions and for the inclusion of noncausal variants. Thus, when both risk and protective variants are present, the burden tests suffer a substantial loss of power.

Figure 3. Power comparisons of seven tests when all causal variants have the same effect direction. The left panel considers 10% high/low extreme phenotype sampling with the three rows corresponding to 40%, 60%, and 80% causal variants. The right panel considers 20% high/low extreme phenotype sampling. Three sample sizes are considered: n=500, 1000, 2000. Powers are estimated at the 0.05 significance level.

In summary, ES_Fisher and ES_min-p are either the most powerful tests or have powers similar to the most powerful one in each setting. The powers of ES_Fisher and ES_min-p are relatively robust to an increase in protective and neutral variants. These results indicate that, in rare variant association studies, extreme phenotype sampling is superior to random sampling at the same sample size.

4. Discussion

GWAS have identified many genetic variants associated with multifactorial diseases. However, most GWAS approaches do not consider disease heterogeneity or the follow-up functional analysis of risk variants. Recently, the field of “molecular pathological epidemiology (MPE)” has emerged as an interdisciplinary integration of molecular pathology and epidemiology [32]. The MPE research approach mainly examines the relationships between potential etiological factors and disease subtypes based on molecular signatures [33]. In addition, MPE assesses the interactive effects of environmental influences and disease molecular signatures on disease progression. MPE can be one of the next steps after GWAS. Accordingly, the GWAS-MPE approach was proposed to take disease heterogeneity into account following GWAS analyses [34]. In traditional GWAS, a disease of interest is regarded as a single entity without consideration of heterogeneity. With the MPE approach, molecular disease classification can help to identify a specific disease subtype that is more strongly associated with a given risk variant than other subtypes of the same disease. A basic MPE approach is the case-case approach, in which diseases are classified into subtypes according to a molecular feature and the distributions of an exposure variable of interest are then compared among the subtypes. The methods in this paper could be extended in the same way, by classifying diseases into subtypes according to a molecular feature and comparing the distributions of an exposure variable of interest among the subtypes. We may also examine how lifestyle or genetic factors interact with the molecular features to influence prognosis or clinical outcome. The authors are working on these extensions.

The idea of sampling the extremes was initially proposed in linkage analysis as a way to increase efficiency [35]. However, the potential gain from sampling the extremes and the technical details of this design have not been well established. With a view to planning future large-scale association studies, we explored the advantage of extreme phenotype sampling for rare variants. Li et al. [28] have demonstrated the potential cost advantages of this design. In this paper, we have demonstrated that, with the higher information content of the extreme sample, the performance of our proposed methods can be substantially improved in comparison with traditional designs. While extreme phenotype sampling has clear advantages for a quantitative trait, realizing them depends greatly on the underlying disease mechanism. Diseases such as cancer or cardiovascular disease may have a more complex underlying mechanism, so the use of extreme phenotype sampling may be limited, and investigators need to evaluate whether the underlying quantitative traits are an appropriate proxy for these disease mechanisms.

Our proposed methods can easily adjust for covariates, such as age, gender, and principal components for population stratification. When considering covariates, we use the following model:

$$g(E(Y_i)) = \theta_0 + \theta_1 C_i + \beta' G_i + \alpha' X_i, \tag{16}$$

where $G_i = (G_{i1}, G_{i2}, \dots, G_{ip})'$ and $X_i = (X_{i1}, X_{i2}, \dots, X_{im})'$ are the genotype and covariate vectors of the ith subject, respectively, and $g(\cdot)$ is a link function: $g(P(Y_i = 1)) = \log\{P(Y_i = 1) / P(Y_i = 0)\}$ for extreme phenotype sampling, and $g(E(Y_i)) = E(Y_i)$ for random sampling.

5. Conclusions

In this paper, we propose two methods for testing whether a set of variants is associated with a continuous phenotype. We use the same model as the JOINT method, in which the common effect of all rare variants and the individual effect deviations from the common effect are jointly considered. In contrast, SKAT assumes that the average effect is zero. In fact, the average effect will not be zero unless the effects of the rare variants are in opposite directions with the same strength.

The JOINT method sums the standardized $S_1$ and the standardized $S_2$, whereas Fisher’s method and the minimum-p method combine the p-values of $S_1$ and $S_2$. Fisher’s and the minimum-p methods are therefore more powerful when the rare variants have only a common effect on the trait or only individual effects on the trait. When the true underlying disease model includes both risk variants and protective variants, Fisher’s and the minimum-p methods are more powerful than burden tests. Each of the three methods (Fisher, minimum-p, and burden) is applied under both random sampling and extreme phenotype sampling at the same sample size. Our simulation results show that extreme phenotype sampling outperforms random sampling when the same sample size is used.

Acknowledgments

The authors would like to thank the joint Editor and referees for comments that greatly improved the presentation of the paper. This research was supported by the National Natural Science Foundation of China (No. 11201129). The Genetic Analysis Workshops (GAW) are supported by GAW grant R01 GM031575 from the National Institute of General Medical Sciences. Preparation of the Genetic Analysis Workshop 17 Simulated Exome Dataset was supported in part by National Institutes of Health NIH R01 MH059490 and used sequencing data from the 1000 Genomes Project (http://www.1000genomes.org).

Author Contributions

Ya-Jing Zhou and Yong Wang designed the study and prepared the manuscript. Ya-Jing Zhou and Li-Li Chen prepared the material of the study and performed the genotype experiments. Yong Wang revised the manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix. Score Vector

We use the notation of the Methods section. Under the linear model

$$Y_i = \theta_0 + \theta_1 C_i + \beta_1 G_{i1} + \dots + \beta_p G_{ip} + \varepsilon_i, \quad E(\beta_j) = 0, \quad \mathrm{cov}(\beta) = \sigma_\beta^2 I_p, \quad \varepsilon_i \sim N(0, \sigma^2), \tag{A1}$$

the log-likelihood is given by

$$\log L = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (Y_i - \theta_0 - \theta_1 C_i - \beta_1 G_{i1} - \dots - \beta_p G_{ip})^2. \tag{A2}$$

Then,

$$\begin{aligned}
\frac{\partial \log L}{\partial \theta_1} &= \frac{1}{\sigma^2} \sum_{i=1}^{n} C_i (Y_i - \theta_0 - \theta_1 C_i - \beta_1 G_{i1} - \dots - \beta_p G_{ip}), \\
\frac{\partial \log L}{\partial \beta_j} &= \frac{1}{\sigma^2} \sum_{i=1}^{n} G_{ij} (Y_i - \theta_0 - \theta_1 C_i - \beta_1 G_{i1} - \dots - \beta_p G_{ip}), \\
\frac{\partial^2 \log L}{\partial \beta_j \partial \beta_l} &= -\frac{1}{\sigma^2} \sum_{i=1}^{n} G_{ij} G_{il}.
\end{aligned} \tag{A3}$$

Let $\hat{\theta}_0$ and $\hat{\sigma}^2$ denote the maximum likelihood estimates of $\theta_0$ and $\sigma^2$ under the null hypothesis $H_0: \theta_1 = 0, \ \sigma_\beta^2 = 0$. Then,

$$\hat{\theta}_0 = \bar{Y}, \quad \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \bar{Y})^2.$$

Using the results in Lemma 3 of Goeman et al. [36], we obtain

$$\left. \frac{\partial \log L}{\partial \sigma_\beta^2} \right|_{H_0} = \frac{1}{2} U'U - \frac{1}{2} \mathrm{tr}(I), \tag{A4}$$

where

$$U = \left. \frac{\partial \log L}{\partial \beta} \right|_{H_0} = \frac{1}{\hat{\sigma}^2} \left( \sum_{i=1}^{n} G_{i1} (Y_i - \bar{Y}), \dots, \sum_{i=1}^{n} G_{ip} (Y_i - \bar{Y}) \right)', \quad I = -\left. \frac{\partial^2 \log L}{\partial \beta \, \partial \beta'} \right|_{H_0} = \frac{1}{\hat{\sigma}^2} G'G. \tag{A5}$$

So the score vector for $H_0$ is

$$S = (S_1, S_2)' = \left. \left( \frac{\partial \log L}{\partial \theta_1}, \frac{\partial \log L}{\partial \sigma_\beta^2} \right)' \right|_{H_0} = \left( \frac{1}{\hat{\sigma}^2} \sum_{i=1}^{n} C_i (Y_i - \bar{Y}), \ \frac{1}{2} U'U - \frac{1}{2\hat{\sigma}^2} \mathrm{tr}(G'G) \right)'. \tag{A6}$$

The score vector under the logistic model is derived in the same way as the score vector under the linear model.
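For completeness, the logistic case can be sketched along the same lines. The steps below are filled in here as an illustration and are not taken from the original text; the symbols $\eta_i$ and $\mu_i$ are introduced only for this sketch. They lead to the statistic stated in Equation (12).

```latex
\log L = \sum_{i=1}^{n} \left[ Y_i \eta_i - \log\left(1 + e^{\eta_i}\right) \right],
\qquad \eta_i = \theta_0 + \theta_1 C_i + \beta_1 G_{i1} + \dots + \beta_p G_{ip},
\qquad \mu_i = \frac{e^{\eta_i}}{1 + e^{\eta_i}},

\frac{\partial \log L}{\partial \theta_1} = \sum_{i=1}^{n} C_i (Y_i - \mu_i), \qquad
\frac{\partial \log L}{\partial \beta_j} = \sum_{i=1}^{n} G_{ij} (Y_i - \mu_i), \qquad
-\frac{\partial^{2} \log L}{\partial \beta_j \partial \beta_l} = \sum_{i=1}^{n} G_{ij} G_{il}\, \mu_i (1 - \mu_i).

\text{Under } H_0:\ \hat{\mu}_i = \bar{Y} \ \text{for all } i, \quad
U_k = \sum_{i=1}^{n} G_{ik} (Y_i - \bar{Y}), \quad
I = \bar{Y}(1 - \bar{Y})\, G'G,

S = (S_1, S_2)' = \left( \sum_{i=1}^{n} C_i (Y_i - \bar{Y}),\ \frac{1}{2} U'U - \frac{1}{2} \mathrm{tr}\left( \bar{Y}(1 - \bar{Y})\, G'G \right) \right)'.
```

The only changes relative to the linear case are that the factor $1/\hat{\sigma}^2$ disappears from the scores and that the information matrix carries the binomial variance $\bar{Y}(1 - \bar{Y})$.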

References

1. Bansal, V.; Libiger, O.; Torkamani, A.; Schork, N.J. Statistical analysis strategies for association studies involving rare variants. Nat. Rev. Genet. 2010, 11, 773–785.
2. Maher, B. Personal genomes: The case of the missing heritability. Nature 2008, 456, 18–21.
3. McCarthy, M.I.; Abecasis, G.R.; Cardon, L.R.; Goldstein, D.B.; Little, J.; Ioannidis, J.P.; Hirschhorn, J.N. Genome-wide association studies for complex traits: Consensus, uncertainty and challenges. Nat. Rev. Genet. 2008, 9, 356–369.
4. Schork, N.J.; Murray, S.S.; Frazer, K.A.; Topol, E.J. Common vs. rare allele hypotheses for complex diseases. Curr. Opin. Genet. Dev. 2009, 19, 212–219.
5. Bodmer, W.; Bonilla, C. Common and rare variants in multifactorial susceptibility to common diseases. Nat. Genet. 2008, 40, 695–701.
6. Gorlov, I.P.; Gorlova, O.Y.; Sunyaev, S.R.; Spitz, M.R.; Amos, C.I. Shifting paradigm of association studies: Value of rare single-nucleotide polymorphisms. Am. J. Hum. Genet. 2008, 82, 100–112.
7. Ji, W.; Foo, J.N.; O’Roak, B.J.; Zhao, H.; Larson, M.G.; Simon, D.B.; Newton-Cheh, C.; State, M.W.; Levy, D.; Lifton, R.P. Rare independent mutations in renal salt handling genes contribute to blood pressure variation. Nat. Genet. 2008, 40, 592–599.
8. Manolio, T.A.; Collins, F.S.; Cox, N.J.; Goldstein, D.B.; Hindorff, L.A.; Hunter, D.J.; McCarthy, M.I.; Ramos, E.M.; Cardon, L.R.; Chakravarti, A.; et al. Finding the missing heritability of complex diseases. Nature 2009, 461, 747–753.
9. Nejentsev, S.; Walker, N.; Riches, D.; Egholm, M.; Todd, J.A. Rare variants of IFIH1, a gene implicated in antiviral responses, protect against type 1 diabetes. Science 2009, 324, 387–389.
10. Pritchard, J.K. Are rare variants responsible for susceptibility to complex diseases? Am. J. Hum. Genet. 2001, 69, 124–137.
11. Morgenthaler, S.; Thilly, W.G. A strategy to discover genes that carry multi-allelic or mono-allelic risk for common diseases: A cohort allelic sums test (CAST). Mutat. Res. 2007, 615, 28–56.
12. Li, B.; Leal, S.M. Methods for detecting associations with rare variants for common diseases: Application to analysis of sequence data. Am. J. Hum. Genet. 2008, 83, 311–321.
13. Madsen, B.E.; Browning, S.R. A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genet. 2009, 5, e1000384.
14. Price, A.L.; Kryukov, G.V.; de Bakker, P.I.; Purcell, S.M.; Staples, J.; Wei, L.J.; Sunyaev, S.R. Pooled association tests for rare variants in exon-resequencing studies. Am. J. Hum. Genet. 2010, 86, 832–838.
15. Basu, S.; Pan, W. Comparison of statistical tests for disease association with rare variants. Genet. Epidemiol. 2011, 35, 606–619.
16. Fang, S.; Sha, Q.; Zhang, S. Two adaptive weighting methods to test for rare variant associations in family-based designs. Genet. Epidemiol. 2012, 36, 499–507.
17. Feng, T.; Elston, R.C.; Zhu, X. Detecting rare and common variants for complex traits: Sibpair and odds ratio weighted sum statistics (SPWSS, ORWSS). Genet. Epidemiol. 2011, 35, 398–409.
18. Lin, D.Y.; Tang, Z.Z. A general framework for detecting disease associations with rare variants in sequencing studies. Am. J. Hum. Genet. 2011, 89, 354–367.
19. Neale, B.M.; Rivas, M.A.; Voight, B.F.; Altshuler, D.; Devlin, B.; Orho-Melander, M.; Kathiresan, S.; Purcell, S.M.; Roeder, K.; Daly, M.J. Testing for an unusual distribution of rare variants. PLoS Genet. 2011, 7, e1001322.
20. Wu, M.C.; Lee, S.; Cai, T.; Li, Y.; Boehnke, M.; Lin, X. Rare-variant association testing for sequencing data with the sequence kernel association test. Am. J. Hum. Genet. 2011, 89, 82–93.
21. Lee, S.; Emond, M.J.; Bamshad, M.J.; Barnes, K.C.; Rieder, M.J.; Nickerson, D.A.; Christiani, D.C.; Wurfel, M.M.; Lin, X. Optimal unified approach for rare variant association testing with application to small sample case-control whole-exome sequencing studies. Am. J. Hum. Genet. 2012, 91, 224–237.
22. Sun, J.; Zheng, Y.; Hsu, L. A unified mixed-effects model for rare-variant association in sequencing studies. Genet. Epidemiol. 2013, 37, 334–344.
23. Sha, Q.; Wang, X.; Wang, X.; Zhang, S. Detecting association of rare and common variants by testing an optimally weighted combination of variants. Genet. Epidemiol. 2012, 36, 561–571.
24. Wang, Y.; Chen, Y.-H.; Yang, Q. Joint rare variant association test of the average and individual effects for sequencing studies. PLoS ONE 2012, 7, e32485.
25. Barnett, I.J.; Lee, S.; Lin, X. Detecting rare variant effects using extreme phenotype sampling in sequencing association studies. Genet. Epidemiol. 2013, 37, 142–151.
26. Hu, S.; Zhong, Y.; Hao, Y.; Luo, M.; Zhou, Y.; Guo, H.; Liao, W.; Wan, D.; Wei, H.; Gao, Y.; et al. Novel rare alleles of ABCA1 are exclusively associated with extreme high-density lipoprotein-cholesterol levels among the Han Chinese. Clin. Chem. Lab. Med. 2009, 47, 1239–1245.
27. Huang, B.E.; Lin, D.Y. Efficient association mapping of quantitative trait loci with selective genotyping. Am. J. Hum. Genet. 2007, 80, 567–576.
28. Li, D.; Lewinger, J.P.; Gauderman, W.J.; Murcray, C.E.; Conti, D. Using extreme phenotype sampling to identify the rare causal variants of quantitative traits in association studies. Genet. Epidemiol. 2011, 35, 790–799.
29. Wallace, C.; Chapman, J.M.; Clayton, D.G. Improved power offered by a score test for linkage disequilibrium mapping of quantitative-trait loci by selective genotyping. Am. J. Hum. Genet. 2006, 78, 498–504.
Derkach, A.; Lawless, J.F.; Sun, L. Robust and powerful tests for rare variants using Fisher's method to combine evidence of association from two or more complementary tests. Genet. Epidemiol. 2013, 37, 110–121.
Lin, W.-Y.; Lou, X.-Y.; Gao, G.; Liu, N. Rare variant association testing by adaptive combination of p-values. PLoS ONE 2014, 9, e85728.
Ogino, S.; Chan, A.T.; Fuchs, C.S.; Giovannucci, E. Molecular pathological epidemiology of colorectal neoplasia: An emerging transdisciplinary and interdisciplinary field. Gut 2011, 60, 397–411.
Ogino, S.; Lochhead, P.; Chan, A.T.; Nishihara, R.; Cho, E.; Wolpin, B.M.; Meyerhardt, J.A.; Meissner, A.; Schernhammer, E.S.; Fuchs, C.S.; et al. Molecular pathological epidemiology of epigenetics: Emerging integrative science to analyze environment, host, and disease. Mod. Pathol. 2013, 26, 465–484.
Ogino, S.; Campbell, P.T.; Nishihara, R.; Phipps, A.I.; Beck, A.H.; Sherman, M.E.; Chan, A.T.; Troester, M.A.; Bass, A.J.; Fitzgerald, K.C.; et al. Proceedings of the second international molecular pathological epidemiology (MPE) meeting. Cancer Causes Control 2015, 26, 959–972.
Risch, N.; Zhang, H. Extreme discordant sib pairs for mapping quantitative trait loci in humans. Science 1995, 268, 1584–1589.
Goeman, J.J.; van de Geer, S.A.; van Houwelingen, H.C. Testing against a high dimensional alternative. J. R. Stat. Soc. B 2006, 68, 477–493.

© 2016 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons by Attribution (CC-BY) license (http://creativecommons.org/licenses/by/4.0/).
