Association Testing of a Group of Genetic Markers Based on Next-Generation Sequencing Data and Continuous Response Using a Linear Model Framework

Abstract: Association testing has been widely used to study the relationship between phenotypes and genetic variants. Most testing methods are based on genotypes. To avoid genotype calling and to test directly on next-generation sequencing (NGS) data, sequencing data-based methods have been proposed and shown to have advantages over genotype-based testing methods in scenarios where genotype calling is inaccurate. Most sequencing data-based testing methods are based on a single genetic marker. The objective of this paper is to extend these methods to allow testing for the association of a continuous response variable with a group of common variants or a group of rare variants without genotype calling. Our proposed methods are derived within a standard linear model framework. We derive the joint significance test (JS) for a group of common genetic variants and the variable collapse test (VC) for a group of rare genetic variants. We have conducted extensive simulation studies to evaluate the performance of the different methods. According to our results, we found that (1) all methods, including our proposed NGS data-based methods and genotype-based methods, can control the Type I error probability well; (2) our proposed NGS data-based methods can achieve better performance in terms of statistical power compared with their corresponding genotype-based methods in the literature; (3) as sequencing depth increases, the performance of all methods increases, and the difference between the performance of NGS data-based methods and corresponding genotype-based methods decreases. In conclusion, we have proposed NGS data-based methods that allow testing for the significance of a group of variants using a linear model framework, and we have shown the advantage of our NGS data-based methods over genotype-based methods in the literature.


Introduction
With recent advances in sequencing technology, next-generation sequencing (NGS) has become increasingly popular in genetic association studies. The new technologies have lower costs and higher sequencing throughput compared with traditional technologies, such as Sanger sequencing [1]. Massive sequencing data have been generated for genetic studies from NGS platforms, such as Solexa and SOLiD [2,3].
Next-generation sequencing (NGS) data in the format of raw sequencing reads are collected from NGS platforms [4][5][6]. There are no genotype data provided by these platforms. To obtain genotype data, multiple processing procedures, including quality control, sequence alignment, genetic variant calling and genotype calling (GC), have to be conducted [5,7,8]. Genotype calling is the process of determining the genotype for each individual and is typically only performed for positions in which a SNP or a 'variant' has already been called [9]. Genotype calls refer to the estimated genotypes obtained by genotype calling [9]. Bioinformatics pipelines have been developed to obtain genotype calls based on NGS data [10][11][12]. Based on these obtained genotypes, association testing is conducted to study how genotypes and other variables (environmental factors, socioeconomic factors, etc.) are associated with phenotypes [9,13,14].
Regression models have been used to develop association testing approaches. Phenotypes are responses. Explanatory variables include genotypes and other variables, such as batch effects, environmental variables, socioeconomic status and behavioral variables. To model a continuous phenotype, researchers adopt linear models or linear mixed models depending on whether individual observations are independent or not [15,16]. Random effects can be used to control for the relatedness of individuals. For example, when the sample contains multiple persons from the same family, which can be detected by checking whether different persons share the same family identifier, the individuals are related rather than independent.
To test for association between genetic variables and the phenotype, different strategies are recommended depending on whether genetic variables are common variants or rare variants. A genetic marker can be classified as a common variant or a rare variant depending on whether its minor allele frequency (MAF) is higher or lower than a threshold value c, which is between 0.01 and 0.05 [17,18]. For common markers, association testing can be based on an individual marker assuming a genetic dominant or recessive or additive model, and genome-wide association studies (GWAS) are used to repeat the individual-marker test for all common markers genome-wide [9,19]. Other testing methods for common markers include testing for a group of common markers, which can be F tests and Chi-square tests of the joint significance of explanatory variables in linear models [20,21]. Group testing can be gene-based, pathway-based or range-based. Gene-based group testing treats all genetic markers within the same gene as a group. Pathway-based group testing treats all genetic markers within the same pathway as a group. Range-based testing divides chromosomes into different genomic regions. For example, range-based testing can divide chromosomes into intervals of a fixed length, such as 1 Mb, and then group testing treats all genetic markers within the same interval (genomic region) as a group. All genetic markers in the same group are tested simultaneously using a group test. Genome-wide group testing refers to repeating the individual group testing for all groups genome-wide.
Rare variants are characterized by low variation in genotypes, since their minor allele frequencies (MAF) are less than a pre-specified threshold value, typically 0.05. Thus, there may not be enough statistical power for the association testing of a single rare variant [17,22]. Testing of rare variants is usually performed on a group of rare markers instead of a single marker. The group of rare variants can be within a genomic region (range-based), within a gene (gene-based) or within a pathway (pathway-based) [22,23].
A range of rare-variant testing methods has been proposed, and different methods have been compared [17,22]. Among all the rare-variant tests, two big categories of rare-variant testing methods are widely used. The first category is based on variable collapsing, which first combines multiple genetic variables into one variable or calculates an index based on multiple genetic variables, and then tests the association between the phenotype and the merged variable or index [24,25]. Within this first category, i.e., variable collapse (VC) methods, the burden test is a widely used method, which first calculates the sum of rare alleles and then regresses the phenotype on the number of rare alleles and other covariates [23,25]. The second category consists of variants of the Sequence Kernel Association Test (SKAT), including SKAT, MK-SKAT, BESKAT and SKAT-O [17,24,26,27].
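As a concrete illustration of the burden-test idea (collapse the group to a rare-allele count, then regress the phenotype on it), here is a minimal sketch; the genotype matrix, covariate, sample size and effect sizes are all made-up placeholders, not data from any study cited here:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated placeholders: genotypes (0/1/2 rare-allele counts) for
# n individuals at p rare variants, one covariate, and a phenotype.
n, p = 500, 8
G = rng.binomial(2, 0.01, size=(n, p))   # rare variants, MAF ~ 0.01
x = rng.normal(size=n)
y = 0.5 * x + 0.6 * G.sum(axis=1) + rng.normal(size=n)

# Burden test: collapse the group into the per-person rare-allele count,
# then test its coefficient in a linear regression with the covariate.
burden = G.sum(axis=1)
X = np.column_stack([np.ones(n), x, burden])
beta, ss_res, *_ = np.linalg.lstsq(X, y, rcond=None)
df = n - X.shape[1]
sigma2 = ss_res[0] / df
se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[2, 2])
t_stat = beta[2] / se
p_value = 2 * stats.t.sf(abs(t_stat), df)
print(f"burden coefficient = {beta[2]:.3f}, p-value = {p_value:.3g}")
```

The key design choice is that all rare variants in the group share one regression coefficient, which is what gives the test power when individual rare variants are too sparse to test alone.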
The burden test and the SKAT test are, respectively, representative methods of the two categories of rare-variant testing methods [25,27]. They have been widely used in association studies of a group of rare variants with phenotypes. Both tests are genotype-based tests in that they first conduct genotype calling to obtain or estimate genotypes and then conduct association testing based on phenotypes and the estimated genotypes.
However, there are uncertainties and errors in obtaining genotypes by genotype calling. Multiple factors influence genotyping accuracy, such as mapping accuracy, sequencing errors and sequencing depth. There may only be a few sequencing reads at some specific locations, i.e., low sequencing depth, which makes genotype calling imprecise. Uncertainty in genotype calling may influence downstream genotype-based association studies [28,29]. To improve genetic studies of genomic locations, alternative testing approaches avoiding genotype calling have been proposed. Directly modeling next-generation sequencing (NGS) data instead of genotypes considers uncertainty and errors in genotype calling and improves statistical performances in association testing [30,31].
Recent technological advances and real situations motivate the development of NGS data-based methods. One reason is the decreasing costs and increasing throughputs of recent sequencing technologies. Massive sequencing data have been generated and are available to researchers, whereas no direct genotype calls are provided by these sequencing platforms [2,32]. The second reason is that many large studies on the genotype-phenotype relationship have their samples or individuals sequenced at a low depth with the purpose of increasing sample sizes given budget constraints. Genotype calling from low-depth sequencing data has uncertainty [33]. Considering uncertainty in genotype calls may improve association testing performance [31]. However, commonly used association testing approaches ignore the uncertainty and use only one single value to represent the genotype [14,34,35]. The single genotype value used in testing is the best-guess genotype or the expected genotype, i.e., the dosage [36,37]. Ignoring genotype calling uncertainty influences association testing performance. The third reason is that there may be different sequencing depths. For example, in case-control studies, sequencing data in the case group and control group may have different sequencing depths, and this difference may influence the case-control study results [38]. Another example is that researchers may have sequencing data with different depths and must choose between a smaller data set with similar sequencing depths and a larger data set with heterogeneous sequencing depths. Researchers then either ignore the bias due to heterogeneous depth or correct for sequencing depth, for example, by including it as an additional covariate [31,39]. NGS data-based association testing methods can avoid time-consuming and likely imprecise genotype calling and directly model NGS data. The fourth reason is from the perspective of information theory. Genotype calling processes the preceding NGS data, which results in a loss of information. To maximize the information used, association testing based on sequencing data is preferred and is expected to have better or at least the same performance as genotype-based testing methods.
Researchers have proposed association testing methods based on sequencing data without genotype calling [30,31,40]. These methods can achieve better performance in scenarios of low sequencing depth, heterogeneous sequencing depths and imprecise genotype calls [30,31,40]. However, most of these methods are designed for the testing of a single marker. Testing for a group of genetic markers is also crucial, and there has been a range of genotype-based group-testing methods for common variants (joint significance test) and rare variants (burden test and SKAT test) [24,27,41]. However, no sequencing data-based testing methods for a group of markers have been developed in the literature using the statistical framework of linear models. We fill this literature gap by developing sequencing data-based testing methods for the association of a continuous phenotype with a group of genetic markers based on independent individuals using a standard linear model framework. We propose the joint significance test (JS) for a group of common variants and the variable collapse test (VC) for a group of rare variants. Their corresponding genotype-based testing methods are the F test and Chi-square test for a group of common variants and the burden test for a group of rare variants.
The rest of the article is organized as follows. Section 2 states the methodology, including the sequence-data joint significance test and variance collapse test. Section 3 shows the results of our simulation studies. Section 4 provides the discussion, and Section 5 draws conclusions.

Methodology
Suppose there are $N$ individuals. For individual $i$, we have $(y_i, g_i, x_i)$, where $y_i$ is the phenotype, $g_i$ represents the genotypes and $x_i$ represents additional covariates, such as age, gender and environmental variables. Assume the genetic variants under consideration are bi-allelic, so that the genotypes can only take values of 0, 1 and 2.
Suppose we consider a group of $d_g$ genetic variables in the test, so that the genotypes of individual $i$ are represented as a row vector $g_i = (g_{i1}, g_{i2}, \ldots, g_{id_g})$, where $g_{ij}$ is the genotype value at the $j$-th genetic variant for individual $i$. We allow our test to include $d_x$ additional covariates and let the row vector $x_i = (1, x_{i1}, x_{i2}, \ldots, x_{id_x})$ represent the intercept and the $d_x$ additional covariates, where $x_{ij}$ is the value of the $j$-th additional covariate for individual $i$.
For the $N$ individuals under consideration, we have the genotype matrix of size $N \times d_g$ as $g = (g_1, g_2, \ldots, g_N)$, the response vector of length $N$ as $y = (y_1, y_2, \ldots, y_N)$, and the additional covariates matrix of size $N \times (d_x + 1)$ as $x = (x_1, x_2, \ldots, x_N)$.

Model Continuous Phenotype Using a Linear Model Framework
We model continuous phenotypes by a linear model (LM) framework. The derivation is a laborious extension of the work in Skotte et al., 2012 for NGS data-based association testing in an individual marker [30]. Skotte et al., 2012 represent the genotype of a single genetic marker as a variable taking only three possible values (i.e., 0, 1 and 2) and then derive the association test for this single marker based on NGS data without genotype calling [30]. In their article, they mentioned that an alternative way to test for this single marker with three genotype values is to use two dummy or binary variables that can only take two values, i.e., 0 and 1. Their derivations in association testing for a numerical genetic variable taking values of 0, 1 and 2 can be modified for the testing of two binary variables corresponding to this numerical variable, which is actually the representation of a categorical variable with levels 0, 1 and 2, instead of a numerical variable taking values 0, 1 and 2 [30,42].
Motivated by their statement, we further extend their work to allow for the testing of a group of multiple genetic variables, each taking three possible values, i.e., 0, 1 and 2. In this way, we extend Skotte et al., 2012's testing method for a single genetic variable to our methods, which allows testing for a group of common variants. In addition, we also derive the variable collapse method (VC) for the testing of a group of rare variants based on NGS data. The derivation of the NGS data-based variable collapse method makes use of the chain rule in calculus.
We model a continuous phenotype using a linear model [43,44]. For individual $i$, his or her phenotype is modeled using the formula
$$y_i \sim N(\eta_i, \sigma^2), \qquad \eta_i = \alpha x_i^T + \beta g_i^T, \qquad (1)$$
where the row vector $\alpha \in \mathbb{R}^{d_x+1}$ and the row vector $\beta \in \mathbb{R}^{d_g}$. Thus,
$$f(y_i \mid g_i, x_i; \theta) = \frac{1}{\sqrt{2\pi\phi}} \exp\left\{ -\frac{(y_i - \alpha x_i^T - \beta g_i^T)^2}{2\phi} \right\},$$
so that the distribution of the phenotype $y_i$ depends on the individual predictors ($x_i$ and $g_i$) as well as the parameters $(\alpha, \beta, \phi)$, where $\phi = \sigma^2$.

Uncertain Genotypes
Since the actual genotypes remain unobserved in NGS studies, we directly model the joint distribution of the phenotypes and the observed sequencing data. Because phenotypes are conditionally independent of the sequencing data given true genotypes, the density of the joint distribution can be factorized as
$$p(y_i, D_i \mid x_i; \theta) = \sum_{g \in G} f(y_i \mid g, x_i; \theta)\, h(g, D_i), \qquad (2)$$
where $\theta = (\alpha, \beta, \phi)$, and $y_i$, $D_i$ and $x_i$ are, respectively, the phenotype, sequencing reads and additional covariates for individual $i$. The term $G$ is the genotype state space, including all possible genotype values $g$, which means that in Equation (2) the summation is over all possible values of $g$. Since there are $d_g$ genetic markers in testing, and each genetic marker takes possible values of 0, 1 or 2, we have $G = \{0, 1, 2\}^{d_g}$. The term $h$ in Equation (2) is shorthand notation for the joint distribution of the genotype and the observed sequencing data, i.e., $h(g, D_i) = p(D_i \mid g)\, p(g \mid \hat{f})$, where $\hat{f}$ is the estimated allele frequency, as modeled in Skotte et al. (2012) [30]. The log-likelihood function for the model with genotype uncertainties (i.e., latent genotypes) thus becomes
$$\ell_{y,D}(\theta) = \sum_{i=1}^{N} \log \left( \sum_{g \in G} f(y_i \mid g, x_i; \theta)\, h(g, D_i) \right).$$
In this model, testing for genetic effects means testing the null hypothesis $H_0: \beta = 0$. Under this null hypothesis, the density $f$ does not depend on the genotype, so it can be extracted out of the summation. Thus, the log-likelihood function under this null hypothesis ($H_0: \beta = 0$) can be simplified as follows:
$$\ell_{y,D}(\alpha, 0, \phi) = \sum_{i=1}^{N} \left( c(y_i, \phi) + \frac{y_i\, \alpha x_i^T}{\phi} - \frac{(\alpha x_i^T)^2}{2\phi} + \log \sum_{g \in G} h(g, D_i) \right),$$
where $c(y_i, \phi) = -y_i^2/(2\phi) - \log(2\pi\phi)/2$. We have provided full derivation details for this simplification in Appendix A.
Therefore, the constrained MLE under the null hypothesis ($H_0: \beta = 0$) can easily be found from the linear regression of the phenotype $y$ on the additional covariates $x$ alone, because no genotypes $g$ appear in the formula $\eta_i = \alpha x_i^T + \beta g_i^T$ under $H_0: \beta = 0$. This motivates us to develop our method using the score test, because the score test only requires the constrained MLE under $H_0: \beta = 0$.
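Under the null hypothesis H0: β = 0, fitting the constrained model is ordinary least squares of the phenotype on the covariates alone; a minimal sketch with simulated placeholder data follows (note that the MLE of φ = σ² divides the residual sum of squares by N rather than N minus the number of parameters):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated placeholders: phenotype y and covariate matrix x (with intercept).
n, dx = 400, 2
x = np.column_stack([np.ones(n), rng.normal(size=(n, dx))])
alpha_true = np.array([1.0, 0.5, -0.3])
y = x @ alpha_true + rng.normal(scale=1.2, size=n)

# Constrained MLE under H0: beta = 0 -- plain OLS of y on x.
alpha_hat, *_ = np.linalg.lstsq(x, y, rcond=None)
resid = y - x @ alpha_hat
phi_hat = np.mean(resid ** 2)   # MLE of sigma^2 divides by n, not n - p

print("alpha_hat:", np.round(alpha_hat, 3))
print("phi_hat:", round(phi_hat, 3))
```

This cheapness of the constrained fit is exactly why the score test is attractive here: the unconstrained model, with latent genotypes, never has to be maximized.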

Joint Significance Test for a Group of Common Genetic Variants
We use a standard score test for our null hypothesis [45,46]. The score function is
$$S_{y,D}(\theta) = \frac{\partial \ell_{y,D}(\theta)}{\partial \theta} = \left( \frac{\partial \ell_{y,D}}{\partial \alpha}, \frac{\partial \ell_{y,D}}{\partial \beta}, \frac{\partial \ell_{y,D}}{\partial \phi} \right),$$
where $\alpha$ is a row vector of length $d_x + 1$, $\beta$ is a row vector of length $d_g$ and $\phi$ is a scalar. The observed information matrix is
$$o_{y,D}(\theta) = -\frac{\partial^2 \ell_{y,D}(\theta)}{\partial \theta^T \partial \theta}.$$
In the following, we denote the constrained maximum likelihood estimate of the parameters under the null hypothesis $H_0: \beta = 0$ as $\hat{\theta} = (\hat{\alpha}, 0, \hat{\phi})$. The score statistic is
$$R(y, D) = S_{y,D}(\hat{\theta})\, o_{y,D}(\hat{\theta})^{-1}\, S_{y,D}(\hat{\theta})^T.$$
Under the null hypothesis, $R(y, D)$ is approximately distributed as a Chi-square random variable with $d_g$ degrees of freedom. The score test is conducted based on the score statistic $R(y, D)$, and the p-value of the test is then calculated.
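Given a score vector and an observed information matrix evaluated at the constrained MLE, the score statistic and its p-value follow mechanically. The sketch below uses arbitrary illustrative numbers for these quantities and, as a simplification, works only with the block corresponding to the genetic effects (whose score components are nonzero at the constrained MLE):

```python
import numpy as np
from scipy import stats

# Arbitrary illustrative values: score vector S for the d_g genetic-effect
# components and the corresponding block of the observed information,
# both assumed already evaluated at the constrained MLE.
S = np.array([2.1, -1.4, 0.8])                 # length d_g = 3
I = np.array([[5.0, 1.0, 0.5],
              [1.0, 4.0, 0.8],
              [0.5, 0.8, 3.0]])

# Score statistic R = S I^{-1} S^T, approximately chi-square(d_g) under H0.
R = S @ np.linalg.solve(I, S)
p_value = stats.chi2.sf(R, df=len(S))
print(f"R = {R:.3f}, p-value = {p_value:.3f}")
```

Using `np.linalg.solve` instead of forming the explicit inverse is the numerically preferred way to compute the quadratic form.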
Below, we derive analytical formulae and values to be used in the above test, including:
1. the analytical formula for the score function;
2. the evaluation of the score function at the constrained MLE;
3. the analytical formula for the observed information matrix;
4. the evaluation of the observed information matrix at the constrained MLE.
The analytical formula of the score function is derived to be
$$S_{y,D}(\theta) = \sum_{i=1}^{N} \frac{\sum_{g \in G} \frac{\partial f(y_i \mid g, x_i; \theta)}{\partial \theta}\, h(g, D_i)}{\sum_{g \in G} f(y_i \mid g, x_i; \theta)\, h(g, D_i)}.$$
We have provided full derivation details in Appendix B. The value of evaluating the above score function at the constrained MLE under the null hypothesis, i.e., $\hat{\theta} = (\hat{\alpha}, 0, \hat{\phi})$, is derived to be
$$\frac{\partial \ell_{y,D}}{\partial \beta}\bigg|_{\hat{\theta}} = \sum_{i=1}^{N} \frac{y_i - \hat{\alpha} x_i^T}{\hat{\phi}}\, E(g_i \mid D_i), \qquad \frac{\partial \ell_{y,D}}{\partial \alpha}\bigg|_{\hat{\theta}} = 0, \qquad \frac{\partial \ell_{y,D}}{\partial \phi}\bigg|_{\hat{\theta}} = 0,$$
where
$$E(g_i \mid D_i) = \frac{\sum_{g \in G} g\, h(g, D_i)}{\sum_{g \in G} h(g, D_i)}$$
is the posterior expectation of the genotype of individual $i$ given sequencing data $D_i$. We have provided full derivation details in Appendix C.
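For a single marker, the posterior expected genotype can be computed from genotype likelihoods and Hardy-Weinberg genotype frequencies implied by the allele frequency. Below is a minimal per-marker sketch with made-up read counts; the simple binomial read-error model (per-base error rate `e`) is an illustrative assumption, not the exact formulation of the paper:

```python
import numpy as np
from scipy.stats import binom

def genotype_likelihoods(n_alt, depth, e=0.005):
    """p(D | g) for g = 0, 1, 2 under a simple binomial read-error model:
    each read supports the alternative allele with probability e, 0.5,
    or 1 - e for true genotypes 0, 1, 2 respectively."""
    p_alt = np.array([e, 0.5, 1.0 - e])
    return binom.pmf(n_alt, depth, p_alt)

def posterior_dosage(n_alt, depth, f, e=0.005):
    """E(g | D): posterior mean genotype given reads and allele frequency f,
    with a Hardy-Weinberg prior p(g | f)."""
    lik = genotype_likelihoods(n_alt, depth, e)
    prior = np.array([(1 - f) ** 2, 2 * f * (1 - f), f ** 2])
    post = lik * prior
    post /= post.sum()
    return post @ np.array([0.0, 1.0, 2.0])

# Made-up example: depth 4, with 3 reads carrying the alternative allele,
# at a marker with allele frequency 0.3.
print(round(posterior_dosage(n_alt=3, depth=4, f=0.3), 3))
```

At low depth the posterior dosage stays well away from the integers 0, 1, 2, which is precisely the genotype uncertainty that the score above carries into the test.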
The analytical formula for the observed information matrix $o_{y,D}(\alpha, \beta, \phi)$ is lengthy; we have provided the formula and full derivation details in Appendix D.
The value of the observed information matrix evaluated at the constrained MLE $\hat{\theta}$ is likewise provided, with full derivation details, in Appendix E.

Variable Collapse Test for a Group of Rare Genetic Variants
Variable collapse testing methods for rare variants combine individual genetic effects into an aggregate overall effect in the testing [25,47,48]. Different variable collapse methods use different ways to aggregate rare variants. In the genotype-based weighted burden test, the variable aggregating $p$ rare variants is $AG_i = \sum_{j=1}^{p} w_j G_{ij}$, where $G_{ij}$ is the genotype at rare variant $j$ for individual $i$, with allele coding of 1 for the rare allele and 0 for the wild or reference allele. The term $w_j$ is the weight. Equal weights ($w_1 = w_2 = \cdots = w_p = 1$) mean that $AG_i = \sum_{j=1}^{p} G_{ij}$ is the sum of rare alleles over all the rare variants in the group. The use of $AG_i$ in the linear model means that researchers assume that the influence of the rare variants $G_{ij}$ on the phenotype $y_i$ is through the aggregate variable $AG_i$. We model a continuous phenotype using a linear model. For individual $i$, his or her phenotype is modeled using the formula
$$y_i \sim N(\eta_i, \sigma^2), \qquad \eta_i = \alpha x_i^T + \beta_0 AG_i, \qquad (6)$$
where the aggregate genetic variable $AG_i = \sum_{j=1}^{d_g} w_j G_{ij}$ aggregates, or collapses, the $d_g$ rare genetic variables into one aggregate variable as a linear combination of the $d_g$ rare genotype values. Thus,
$$f(y_i \mid AG_i, x_i; \theta) = \frac{1}{\sqrt{2\pi\phi}} \exp\left\{ -\frac{(y_i - \alpha x_i^T - \beta_0 AG_i)^2}{2\phi} \right\},$$
so that the distribution of the phenotype $y_i$ depends on the individual predictors ($x_i$ and $AG_i$) as well as the parameters $(\alpha, \beta_0, \phi)$, where $\phi = \sigma^2$. Note that, typically, for rare variants, rare alleles are coded as 1 and wild alleles are coded as 0.
Consider the model we specified before for a group of common variants,
$$y_i \sim N(\eta_i, \sigma^2), \qquad \eta_i = \alpha x_i^T + \beta g_i^T, \qquad (7)$$
where $\beta \in \mathbb{R}^{d_g}$. We found that there is a connection between the effects in the two models. This allows us to use the chain rule from calculus to easily derive the likelihood function, score function, observed information matrix and the evaluations of these functions at the constrained MLE for the rare variant model, i.e., Equation (6), based on what we have already derived for Equation (7).
To be more specific, the effects of the rare variants as modeled by $\beta_0$ satisfy the condition $\beta = \beta_0 W$, where $\beta \in \mathbb{R}^{d_g}$ contains the effects of the $d_g$ rare variants and $\beta_0 \in \mathbb{R}$. For example, when $w_1 = w_2 = \cdots = w_{d_g} = 1$, we have $W = (1, 1, \ldots, 1)$, a unit row vector of length $d_g$. In this situation, we have $AG_i = \sum_{j=1}^{d_g} G_{ij}$ [25,47]. In the linear model framework, we have
$$\eta_i = \alpha x_i^T + \beta_0 AG_i = \alpha x_i^T + \beta_0 W g_i^T = \alpha x_i^T + \beta g_i^T.$$
Other weights may also be used to collapse rare variants, such as $w_j = f_{Beta}(MAF_j, 1, 25)$, where $f_{Beta}$ is the Beta density function, $MAF_j$ is the minor allele frequency of rare variant $j$, and $\beta_0$ remains the common factor of all the weighted effects [26,27,31].
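A small sketch of computing Beta(1, 25) density weights from minor allele frequencies, rescaled so the weights sum to $d_g$, matching the identification constraint imposed on the weights in this paper (the MAF values are made up):

```python
import numpy as np
from scipy.stats import beta

# Minor allele frequencies of the rare variants in the group (made-up values).
maf = np.array([0.001, 0.005, 0.01, 0.02, 0.04])
d_g = len(maf)

# Beta(1, 25) density weights, as in the weighted burden / SKAT literature:
# rarer variants receive larger weights.
w = beta.pdf(maf, 1, 25)

# Rescale so that sum(w) = d_g, the identification constraint on the weights.
w = w * d_g / w.sum()
print(np.round(w, 3))
```

The rescaling matters because the weights and the common factor β0 are only jointly identified; fixing the scale of the weights makes β0 interpretable.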
In our NGS data-based variable collapse method (VC), we use the same assumption about weights as used in most genotype-based variable collapse rare-variant testing methods in the literature, i.e., the weighted burden test assumption [25,47]. We model $\beta = \beta_0 W$, where the row vector $W = (w_1, w_2, \ldots, w_{d_g})$ contains the weights and $\beta_0$ is a common factor. For the purpose of identification, we impose the constraint $\sum_{j=1}^{d_g} w_j = d_g$. We use the same linear model framework as before. For individual $i$, the phenotype is modeled as $y_i \sim N(\eta_i, \sigma^2)$. In our previous joint significance testing method for a group of common genetic variants, the mean is
$$\eta_i = \alpha x_i^T + \beta g_i^T,$$
where $\beta$ is a row vector of length $d_g$. In comparison, in our variable collapse testing method for a group of rare genetic variants, the mean is
$$\eta_i = \alpha x_i^T + \beta_0 W g_i^T,$$
where $\beta_0$ is a scalar. First, under the null hypothesis $H_0: \beta = 0$ or $H_0: \beta_0 = 0$, we obtain the same constrained MLE for $\alpha$ and $\phi$, no matter which log-likelihood function ($\ell_{y,D}(\alpha, \beta, \phi)$ or $\ell_{y,D}(\alpha, \beta_0, \phi)$) we maximize. Thus, we can use the same notation $\hat{\theta} = (\hat{\alpha}, 0, \hat{\phi})$ to represent both the constrained MLE of $\ell_{y,D}(\alpha, \beta, \phi)$ (where the term 0 is a zero row vector of length $d_g$) and the constrained MLE of $\ell_{y,D}(\alpha, \beta_0, \phi)$ (where the term 0 is the scalar 0). By the chain rule from calculus, the score function evaluated at the constrained MLE is derived as
$$\frac{\partial \ell_{y,D}}{\partial \beta_0}\bigg|_{\hat{\theta}} = \frac{\partial \ell_{y,D}}{\partial \beta}\bigg|_{\hat{\theta}}\, W^T, \qquad \frac{\partial \ell_{y,D}}{\partial \alpha}\bigg|_{\hat{\theta}} = 0, \qquad \frac{\partial \ell_{y,D}}{\partial \phi}\bigg|_{\hat{\theta}} = 0.$$
Note that the evaluation of the last two score functions at the constrained MLE is equal to 0 because the constrained MLE maximizes the log-likelihood under the null hypothesis H 0 :β 0 = 0, so that the derivatives of the log-likelihood with respect to α and φ are equal to 0 when being evaluated at the constrained MLE.
Working similarly and applying the chain rule from calculus, we can derive the observed information matrix and evaluate it at the constrained MLE. Based on the chain rule, we have
$$\frac{\partial^2 \ell_{y,D}}{\partial \beta_0^2} = W \left( \frac{\partial^2 \ell_{y,D}}{\partial \beta^T \partial \beta} \right) W^T, \qquad \frac{\partial^2 \ell_{y,D}}{\partial \alpha^T \partial \beta_0} = \frac{\partial^2 \ell_{y,D}}{\partial \alpha^T \partial \beta}\, W^T, \qquad \frac{\partial^2 \ell_{y,D}}{\partial \phi\, \partial \beta_0} = \frac{\partial^2 \ell_{y,D}}{\partial \phi\, \partial \beta}\, W^T,$$
while the second derivatives with respect to $\alpha$ and $\phi$ alone are unchanged from the common-variant model. The cross derivatives involving $\beta_0$ follow from the chain rule; their counterparts with the order of differentiation reversed are obtained by matrix transposition, assuming continuous second derivatives so that the cross derivatives are symmetric regardless of the order of the parameters. A Chi-square statistic is derived and used in the test. The score statistic is
$$R(y, D) = S_{y,D}(\hat{\theta})\, o_{y,D}(\hat{\theta})^{-1}\, S_{y,D}(\hat{\theta})^T.$$
Under the null hypothesis $H_0: \beta_0 = 0$, $R(y, D)$ is approximately distributed as a Chi-square random variable with one degree of freedom. The score test is conducted based on the score statistic $R(y, D)$, and the p-value is calculated. The code to implement the proposed method is accessible via the link https://github.com/zhengxu0459/GroupAssociationTesting (accessed on 1 February 2023).

Results of Simulation Studies
We perform extensive simulation studies under a range of settings to evaluate the performance of our proposed sequencing data-based methods (joint significance test for a group of common variants and variable collapse test for a group of rare variants) and their corresponding genotype-based methods in the literature. For common variants, we compare our sequencing data-based joint significance test (JS) with the corresponding genotype-based methods (F test and Chi-square test) in the literature. For rare variants, we compare our sequencing data-based variable collapse test (VC) with the genotype-based burden test and the genotype-based SKAT test in the literature.
We use the bestfit model of the COSI software to generate 100-kb regions that mimic the LD patterns, local recombination rates and population history of Europeans through a coalescent model [49]. Within the simulated regions, chromosomes are generated. Sequencing data are simulated using the software ShotGun [50] with a per-base-pair error rate of 0.5%. ShotGun is available via the link https://yunliweb.its.unc.edu/shotgun.html (accessed on 15 November 2022). We consider a wide range of sequencing depth scenarios by choosing the average sequencing depths to be 1X, 2X, 4X and 10X, respectively. Rare variants and common variants are separated depending on whether their MAFs are below 0.05 or at least 0.05. We generate two additional covariates in our simulation: a binary covariate $X_1 \sim \text{Bernoulli}(0.5)$ and a continuous covariate $X_2$. The continuous phenotype is generated via a linear regression model:
$$y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \sum_{j=1}^{d_g} \beta_{gj} G_j + \epsilon,$$
where $\epsilon \sim N(0, 1)$, $\beta_0 = 0$, $\beta_1 = 1$, $\beta_2 = 1$ and $d_g$ is the number of genetic variables. The values of $\beta_{gj}$, i.e., the genetic effects, are set differently in different scenarios so that we can evaluate the Type I error and conduct power analyses. Under the null hypothesis, i.e., $\beta_{g1} = \beta_{g2} = \cdots = \beta_{gd_g} = 0$, 9000 replicates are generated to evaluate Type I errors. We evaluate Type I errors for all possible combinations of three sample sizes and four sequencing depths. The three sample sizes are 300, 500 and 1000. The four sequencing depths are 1X, 2X, 4X and 10X. We report the Type I error results for (1) the joint significance test of a group of common genetic variants with a continuous phenotype and (2) the variable collapse test of a group of rare genetic variants with a continuous phenotype.
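The phenotype-generating model above can be sketched as follows. Genotypes are drawn here as simple binomial allele counts as a stand-in for the COSI/ShotGun pipeline, and a standard normal is assumed for the continuous covariate X2 (its exact distribution is not restated in this section):

```python
import numpy as np

rng = np.random.default_rng(2023)

n, d_g = 500, 5
# Stand-in genotypes: 0/1/2 allele counts at d_g markers (the paper uses
# COSI haplotypes plus ShotGun reads; plain binomial draws are used here).
maf = rng.uniform(0.05, 0.5, size=d_g)
G = rng.binomial(2, maf, size=(n, d_g))

# Additional covariates: X1 ~ Bernoulli(0.5) and a continuous X2
# (standard normal assumed here for X2).
X1 = rng.binomial(1, 0.5, size=n)
X2 = rng.normal(size=n)

# Linear model with beta0 = 0, beta1 = beta2 = 1 and genetic effects
# beta_g (all zero under the null hypothesis).
beta_g = np.zeros(d_g)            # null scenario, as used for Type I error
eps = rng.normal(size=n)
y = 0.0 + 1.0 * X1 + 1.0 * X2 + G @ beta_g + eps
print("phenotype mean:", round(y.mean(), 2))
```

Setting some entries of `beta_g` to nonzero values turns this into an alternative-hypothesis scenario for power analysis.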
We next consider the alternative hypothesis, i.e., there are some non-zero genetic effects among the $d_g$ genetic variants. Under the alternative hypothesis, we randomly choose a group of genetic markers as causal markers in simulating the phenotype. We first generate a random integer between two and five as the number of common causal variants, and a random integer between two and ten as the number of rare causal variants. Then we use these causal genetic variants and our additional simulated covariates ($X_1$, $X_2$) to simulate continuous phenotypes. We let the total genetic effect size range between 0 and 2.5 (a scale parameter of 0.5 multiplied by magnitudes ranging from 0 to 5). The individual effect is the total effect divided by the number of causal variants in each simulation.
Our simulation studies have considered various scenarios for a range of sample sizes, sequencing depths, number of causal SNPs and genetic effects for the testing of a group of common variants and a group of rare variants using different testing methods (joint significance test and variable collapse test). We summarize our simulation results as (1) Tables 1 and 2 reporting Type I errors, (2) Figures 1 and 2 reporting power analyses and (3) Table 3 reporting the performance of two allele frequency estimators. We compare our proposed NGS data-based testing methods with their corresponding genotype-based testing methods. In the following, we present our results on (1) Type I errors, (2) Power analyses, i.e., probability of not committing Type II errors, and (3) allele frequency estimation.

Results of Type I Errors
We report the results of Type I errors in all scenarios of sample sizes (300, 500, 1000) and sequencing depths (1X, 2X, 4X, 10X). For different types of genetic variants (common or rare), we apply different testing methods (joint significance test or variable collapse test).
For common genetic variants and continuous phenotypes, we evaluate the Type I errors of our NGS data-based joint significance tests using true allele frequencies (AF) and two ways of estimating allele frequencies as in Skotte et al., 2012, as well as genotype-based F tests [30]. NGS data-based joint significance tests 1, 2 and 3 refer, respectively, to the NGS data-based joint significance test using (1) true allele frequencies, (2) allele frequencies estimated using a two-step genotype-based method (first estimate genotypes, then calculate allele frequencies based on the estimated genotypes) and (3) the one-step MLE of allele frequencies based on the log-likelihood of sequencing data in Skotte et al., 2012 [30]. Both methods of estimating allele frequencies are proposed in Skotte et al., 2012 [30]. In general, we expect our NGS data-based testing method to have the best performance when true allele frequencies are used, i.e., NGS data-based joint significance test 1. However, this method (Test 1) is not feasible because, in reality, we do not know the allele frequencies. Therefore, we need to use estimated allele frequencies, which are expected to make performance a little worse but still much better than the corresponding genotype-based methods. We expect the one-step MLE of allele frequency (NGS data-based) to be more accurate than the two-step genotype-based estimator of allele frequency. Thus, we expect NGS data-based joint significance test 3 to be better than test 2. We evaluate the performance of the two allele frequency estimators in our simulation studies, and the allele frequency estimation performance is reported in Section 3.3.
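The difference between the two estimators can be illustrated on a single marker: the two-step estimator averages best-guess genotype calls, while the one-step approach maximizes the sequencing-data likelihood directly. The sketch below uses an EM iteration under Hardy-Weinberg equilibrium and a simple binomial read-error model; both are illustrative stand-ins rather than the exact estimators of Skotte et al., 2012:

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(7)

# Simulate reads at one marker: true allele frequency f, low depth.
f_true, n, depth, e = 0.2, 1000, 2, 0.005
g = rng.binomial(2, f_true, size=n)                  # true genotypes
p_alt = np.array([e, 0.5, 1 - e])[g]
n_alt = rng.binomial(depth, p_alt)                   # alt-allele read counts

# Genotype likelihoods p(D_i | g) for g = 0, 1, 2.
L = binom.pmf(n_alt[:, None], depth, np.array([e, 0.5, 1 - e])[None, :])

# Two-step estimator: call the most likely genotype, then average.
f_two_step = np.argmax(L, axis=1).mean() / 2

# One-step MLE via EM under Hardy-Weinberg equilibrium.
f_hat = 0.5
for _ in range(100):
    prior = np.array([(1 - f_hat) ** 2, 2 * f_hat * (1 - f_hat), f_hat ** 2])
    post = L * prior
    post /= post.sum(axis=1, keepdims=True)
    f_hat = (post @ np.array([0.0, 1.0, 2.0])).mean() / 2

print(f"true {f_true}, two-step {f_two_step:.3f}, one-step {f_hat:.3f}")
```

The one-step estimate weights each individual by its full genotype posterior rather than a hard call, which is why it is expected to be the more accurate input to the NGS data-based tests at low depth.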
In Table 1, we report the simulation results of Type I errors for a group of common genetic variants with a continuous phenotype. The genotype-based F test conducts genotype calling first and then conducts an F test for the joint significance of a group of common variants, assuming the phenotype follows a linear model in the genotypes. We can see that the genotype-based testing method first estimates genotypes, then treats the obtained genotypes as true genotypes, and directly conducts the test based on genotypes and phenotypes. In comparison, the NGS data-based method treats genotypes as latent variables and directly models sequencing data and phenotypes. According to Table 1, Type I errors are controlled in most scenarios, as we expected.
For rare genetic variants and continuous phenotypes, we also evaluated the Type I errors of our sequencing data-based variable collapse tests using true allele frequencies, allele frequencies estimated by the two-step genotype-based estimation method and allele frequencies estimated by the one-step sequencing data-based estimation method. As with the joint significance test, NGS data-based variable collapse tests 1, 2 and 3 refer, respectively, to our testing methods using true allele frequencies and the two allele frequency estimators.
In Table 2, we report the simulation results of Type I errors for a group of rare genetic variants with a continuous phenotype. We evaluate the Type I errors of our sequencing data-based variable collapse tests using true allele frequencies and two ways of estimating allele frequencies as previously described, and two genotype-based rare-variant testing methods (burden test and SKAT test). According to Table 2, Type I errors are controlled in most scenarios, as we expected.

Results of Type II Errors and Power Analyses
We evaluate the performance of the various methods from the perspective of statistical power. The Type II error probability is the probability of failing to reject the null hypothesis when the alternative hypothesis is true; it equals one minus the statistical power, which is the probability of rejecting the null hypothesis when the alternative hypothesis is true.
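The relationship between power and Type II error can be illustrated with a small Monte Carlo experiment in the spirit of our simulation studies; the sample size, effect size and the single-variant regression test used here are illustrative choices, not the settings of the study:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, n_sims, alpha, effect = 200, 500, 0.05, 0.3
rejections = 0
for _ in range(n_sims):
    g = rng.integers(0, 3, size=n).astype(float)  # genotype coded 0/1/2
    y = effect * g + rng.normal(size=n)           # alternative hypothesis is true
    res = stats.linregress(g, y)                  # single-variant test
    rejections += res.pvalue < alpha

power = rejections / n_sims   # estimated power
type2 = 1.0 - power           # estimated Type II error probability
```

The fraction of rejections across simulated datasets estimates power, and its complement estimates the Type II error probability.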
In Figure 1, we report the power curves of tests for a group of common variants and a continuous phenotype. The four rows from top to bottom are for sequencing depths 1X, 2X, 4X and 10X. The three columns from left to right are for sample sizes n = 300, 500, 1000. The powers of the genotype-based F test (brown), NGS data-based joint significance test 1 (blue), test 2 (black) and test 3 (purple) are shown as curves in different colors. We found that in the low sequencing depth scenarios (1X, 2X), the NGS data-based methods have clear advantages over the genotype-based F test. However, as sequencing depth increases, the difference between the performance of the sequencing data-based methods and that of the genotype-based F test decreases. At sequencing depth 10X, the genotype-based F test is nearly the same as or only slightly worse than the NGS data-based methods. For all testing methods, power increases as sample size or sequencing depth increases. Among the three NGS data-based joint significance tests, test 3 (using the sequencing data-based MLE of allele frequencies [30]) achieves nearly the same performance as test 1 (using true allele frequencies). Test 2 (using the two-step genotype-based estimator of allele frequencies) performs somewhat worse than tests 1 and 3 but is still better than the genotype-based F test. As sequencing depth increases, the differences among the three NGS data-based testing methods decrease.
In Figure 2, we report the power curves of tests for a group of rare genetic variants and a continuous phenotype. The four rows from top to bottom are for sequencing depths 1X, 2X, 4X and 10X. The three columns from left to right are for sample sizes n = 300, 500, 1000. The power curves are for the genotype-based burden test (brown), the SKAT test (purple), and NGS data-based variable collapse tests 1 (black), 2 (blue) and 3 (red). We found that the sequencing data-based methods are better than the genotype-based methods at low sequencing depths (1X and 2X). As sequencing depth increases, the difference between the performance of the genotype-based and sequencing data-based methods becomes smaller. At sequencing depth 10X, the genotype-based burden test and the sequencing data-based methods achieve similar performance. Among the three sequencing data-based variable collapse tests, test 3 (allele frequencies estimated by the one-step NGS data-based method) achieves nearly the same performance as test 1 (true allele frequencies), while test 2 (allele frequencies estimated by the two-step genotype-based method) performs somewhat worse than tests 1 and 3. As sequencing depth increases, the differences among the three sequencing data-based methods disappear. Among the genotype-based methods, the burden test outperforms SKAT because our scenarios assume all genetic effects are positive, whereas SKAT is designed to allow both positive and negative effects.
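The genotype-based burden test compared above can be sketched as follows: rare-variant genotypes are collapsed into a per-individual burden score (the count of minor alleles across the group), which is then tested in a simple linear regression. The function name, minor allele frequency and effect size are illustrative assumptions:

```python
import numpy as np
from scipy import stats

def burden_test(y, G):
    """Simple burden test: collapse rare variants into one score.

    G: (n, d) rare-variant genotypes coded 0/1/2. The burden score is the
    count of minor alleles per individual, tested by linear regression.
    """
    burden = G.sum(axis=1)
    res = stats.linregress(burden, y)
    return res.slope, res.pvalue

rng = np.random.default_rng(2)
n, d, maf = 1000, 20, 0.01
G = rng.binomial(2, maf, size=(n, d)).astype(float)
y = 0.8 * G.sum(axis=1) + rng.normal(size=n)  # all effects positive
slope, p = burden_test(y, G)
```

Because collapsing sums the signed effects, this test is powerful when all effects share one direction but loses power when positive and negative effects cancel, which is the scenario SKAT is designed for.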
We also conducted tests under scenarios consistent with the assumptions of SKAT, i.e., genetic effects in both positive and negative directions, and found that burden tests, whether the genotype-based tests in the literature or the NGS data-based tests we propose, fail as expected, because the burden test requires all genetic effects to be in one direction (positive or negative). The failure of the burden test under such scenarios has been recognized in the literature [21,26]. We leave the development of NGS data-based methods corresponding to the genotype-based SKAT as future work, since it is beyond the scope of this article. The main objective of this article is to use a standard linear model framework to develop NGS data-based testing methods corresponding to the genotype-based joint significance test and the genotype-based burden test in the literature.

Results of Estimating Allele Frequencies
Based on our simulated NGS data and corresponding true genotypes, we conducted genotype calling to obtain the estimated genotypes and then calculated the allele frequency as a summary statistic based on the estimated genotypes. We found that both genotype calling and the allele frequency calculation based on estimated genotypes became more accurate as sequencing depth increased. To further compare estimators based on estimated genotypes with estimators based on NGS data, we considered two allele frequency estimators: (1) the two-step genotype-based estimation method, which first estimates genotypes from the sequencing data and then estimates allele frequencies from the estimated genotypes, and (2) the one-step sequencing data-based maximum likelihood estimator (MLE) of allele frequencies.
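The contrast between the two estimators can be sketched as follows. This is a minimal illustration, not the exact estimator of Skotte et al. [30]: the one-step estimator is implemented as a simple EM loop over genotype likelihoods with a Hardy-Weinberg prior, the two-step estimator calls the most likely genotype and counts alleles, and the read model, error rate and all numerical settings are our own assumptions:

```python
import numpy as np
from scipy import stats

def af_mle_one_step(lik, iters=100):
    """One-step-style MLE of allele frequency computed directly from
    genotype likelihoods p(D_i | g), via EM with a Hardy-Weinberg prior.

    lik: (n, 3) likelihoods for genotypes g = 0, 1, 2.
    """
    f = 0.2  # arbitrary starting value
    g = np.array([0.0, 1.0, 2.0])
    for _ in range(iters):
        prior = np.array([(1 - f) ** 2, 2 * f * (1 - f), f ** 2])
        post = lik * prior
        post /= post.sum(axis=1, keepdims=True)   # posterior over genotypes
        f = (post @ g).mean() / 2.0               # mean posterior allele count / 2
    return f

def af_two_step(lik):
    """Two-step estimator: call the most likely genotype, then count alleles."""
    return lik.argmax(axis=1).mean() / 2.0

# Simulate low-depth reads: each read reports the alternative allele with
# probability g/2, perturbed by a per-read error rate (illustrative model).
rng = np.random.default_rng(4)
n, depth, err, f_true = 2000, 2, 0.01, 0.3
genos = rng.binomial(2, f_true, size=n)
p_alt = genos / 2 * (1 - err) + (1 - genos / 2) * err
alt_reads = rng.binomial(depth, p_alt)
p_alt_by_g = np.array([0, 1, 2]) / 2 * (1 - err) + (1 - np.array([0, 1, 2]) / 2) * err
lik = stats.binom.pmf(alt_reads[:, None], depth, p_alt_by_g[None, :])
f_one = af_mle_one_step(lik)
f_two = af_two_step(lik)
```

The one-step estimator keeps the full uncertainty in the genotype likelihoods, whereas the two-step estimator discards it at the genotype-calling stage.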
The use of the two estimation methods of allele frequencies and the use of true allele frequencies to handle NGS data were proposed in Skotte et al., 2012 [30], who stated that the one-step sequencing data-based method estimates allele frequencies better than the two-step genotype-based method [30]. In Table 3, we report the performance of the two estimators in terms of mean square error (MSE) and mean absolute deviation (MAD) for different sample sizes (n = 300, 500, 1000) and different sequencing depths (1X, 2X, 4X, 10X). Denote AF(i, j, k, l) as the true allele frequency, and AF₁(i, j, k, l) and AF₂(i, j, k, l), respectively, as the estimated allele frequency using the two-step genotype-based method and the estimated allele frequency using the one-step sequencing data-based method, in sample size scenario i (i = 1, 2, 3), sequencing depth scenario j (j = 1, 2, 3, 4), simulated NGS dataset k (k = 1, 2, . . . , K) and genetic variant l (l = 1, 2, . . . , L). For each estimator, MSE and MAD are calculated as MSE₁(i, j) = (1/(KL)) ∑_{k=1}^{K} ∑_{l=1}^{L} (AF₁(i, j, k, l) − AF(i, j, k, l))² and MAD₁(i, j) = (1/(KL)) ∑_{k=1}^{K} ∑_{l=1}^{L} |AF₁(i, j, k, l) − AF(i, j, k, l)|, and analogously with AF₂ for the one-step estimator. From Table 3, we find that the one-step sequencing data-based estimator is better than the two-step genotype-based estimator in all 12 scenarios (all combinations of three sample sizes and four sequencing depths). The performance of both estimators improves as sequencing depth or sample size increases.
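The MSE and MAD formulas above, averaged over the K simulated datasets and L variants of one (sample size, depth) scenario, can be computed directly; the function name and the tiny K = 2, L = 2 example values are illustrative:

```python
import numpy as np

def mse_mad(af_true, af_est):
    """MSE and MAD of allele-frequency estimates.

    af_true, af_est: (K, L) arrays over K simulated datasets and L variants
    for one (sample size, sequencing depth) scenario.
    """
    err = af_est - af_true
    return np.mean(err ** 2), np.mean(np.abs(err))

af_true = np.array([[0.30, 0.10], [0.30, 0.10]])  # K = 2 datasets, L = 2 variants
af_est = np.array([[0.28, 0.12], [0.33, 0.09]])
mse, mad = mse_mad(af_true, af_est)
```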

Discussion
In this article, we adopted a standard linear model framework to derive NGS data-based testing methods for a group of common genetic variants and a group of rare genetic variants. The proposed methods are extensions of the NGS data-based testing methods for a single genetic variant proposed by Skotte et al., 2012 [30] and Yan et al., 2015 [31]. With rapid advances in NGS technology and data, there is strong motivation to develop novel methods for NGS data, preferably with corresponding genotype-based methods in the literature for comparison. Since there are established methods for testing a group of common variants and a group of rare variants using genotypes in the literature, our proposed methods fill the literature gap by providing corresponding testing methods that use sequencing data instead of the called genotypes.
We adopted a standard linear model framework and noticed that under the null hypothesis of no genetic effects, the likelihood function and its derivatives can be greatly simplified and easily calculated. Thus, we proposed a score test that only requires evaluating the score function and the information matrix at the constrained MLE, which is easy to obtain. In this way, the chi-square statistic of our joint significance test is derived. We then apply the chain rule of calculus to derive our variable collapse test. Full derivation details are provided in Appendices A-E.
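The idea of evaluating everything only at the constrained MLE can be illustrated with a generic score test for a linear model. This is a textbook sketch, not our exact derivation (which additionally treats genotypes as latent variables); the function name and simulated data are illustrative:

```python
import numpy as np
from scipy import stats

def score_test(y, X):
    """Generic score test for H0: beta = 0 in y = alpha + X beta + error.

    All quantities are evaluated at the constrained MLE (intercept and
    error variance fitted with beta fixed at 0); no full-model fit is needed.
    """
    n, d = X.shape
    resid = y - y.mean()                # residuals under H0 (alpha-hat = y-bar)
    sigma2 = np.mean(resid ** 2)        # constrained MLE of the error variance
    Xc = X - X.mean(axis=0)             # centering profiles out the intercept
    U = Xc.T @ resid / sigma2           # score for beta at the null
    info = Xc.T @ Xc / sigma2           # information for beta at the null
    stat = U @ np.linalg.solve(info, U) # ~ chi-square with d df under H0
    return stat, stats.chi2.sf(stat, d)

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 3))
y = 0.4 * X[:, 0] + rng.normal(size=400)
stat, p = score_test(y, X)
```

Under the null, the statistic is asymptotically chi-square with d degrees of freedom; with a true effect present, it rejects with high probability.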
There are different models to describe the genetic effects of a single marker with genotype g coded as 0, 1 and 2. Four widely used genetic effect models are (1) the additive model: effect = β_0 + β_a g; (2) the dominant model: effect = β_0 + β_d I(g ≥ 1); (3) the recessive model: effect = β_0 + β_r I(g = 2); and (4) the heterogeneous effect model: effect = β_0 + β_1 I(g = 1) + β_2 I(g = 2), where I(·) is the indicator function, equal to 1 when the condition in parentheses is satisfied and 0 otherwise [51]. With d_g genetic markers, the linear predictors under the dominant, recessive and heterogeneous effect models are, respectively, η_i = α + β_d I(g_i ≥ 1), η_i = α + β_r I(g_i = 2) and η_i = α + β_1 I(g_i = 1) + β_2 I(g_i = 2), where the indicator function is applied elementwise to the genotype vector g_i, β_d, β_r, β_1 and β_2 are row vectors of length d_g, and η_i is the linear predictor part. Although our proposed testing methods are described based on the additive model as in Equation (7), they can be adapted to the other three genetic effect models by modifying the linear predictor η_i and its first and second derivatives with respect to the model parameters.
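The four genotype codings can be sketched as design-column constructions; the function name is illustrative:

```python
import numpy as np

def design_column(g, model):
    """Genotype coding for one marker under four common effect models.

    g: genotype vector with entries 0, 1, 2 (count of minor alleles).
    Returns the regression column(s) multiplying the effect(s) beyond
    the intercept.
    """
    g = np.asarray(g)
    if model == "additive":       # effect = b0 + ba * g
        return g
    if model == "dominant":       # effect = b0 + bd * I(g >= 1)
        return (g >= 1).astype(float)
    if model == "recessive":      # effect = b0 + br * I(g == 2)
        return (g == 2).astype(float)
    if model == "heterogeneous":  # effect = b0 + b1*I(g==1) + b2*I(g==2)
        return np.column_stack([(g == 1), (g == 2)]).astype(float)
    raise ValueError(model)

g = [0, 1, 2, 1]
add = design_column(g, "additive")
dom = design_column(g, "dominant")
rec = design_column(g, "recessive")
het = design_column(g, "heterogeneous")
```

Swapping the coding changes only the design column(s), which is why adapting the tests amounts to modifying the linear predictor and its derivatives.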
The main objective of this article is to apply a standard statistical framework and mathematical statistics to develop novel methods that meet the need, from the bioinformatics, biostatistics and modern biology communities, for group testing of common variants or rare variants based on next-generation sequencing data. The proposed methods have a rigorous mathematical statistics foundation and show advantages over their corresponding genotype-based testing methods in the literature.
We only evaluated Type I errors (false positives) and Type II errors (false negatives) in our study. Hypothesis testing also involves other types of errors, including Type III errors, which refer to correctly rejecting the null hypothesis for the wrong reason. These other types of errors are also important, and future studies can examine them in addition to the widely used evaluation of Type I and Type II errors.
Our proposed methods are based on a linear model framework for a continuous phenotype. In association studies, there are also other types of phenotypes, such as a binary phenotype, which is modeled with logistic regression, and an integer-valued phenotype (count data), which is modeled with Poisson regression. In this article, we contribute to NGS data-based association testing of a group of common or rare variants by focusing on continuous phenotypes using a linear model framework. We are working on extending our methods to other types of phenotypes using a generalized linear model (GLM) framework; a GLM-based derivation of the corresponding testing methods is feasible and under development.
Sequencing depth plays an important role in the performance comparison of NGS data-based methods versus their corresponding genotype-based methods. When sequencing is deep, genotypes can be called precisely, and there is little difference between the performance of sequencing data-based methods and their corresponding genotype-based methods. However, when sequencing depth is low (1X, 2X), genotypes cannot be called precisely, so sequencing data-based methods can achieve better performance than their corresponding genotype-based methods [30,31]. Under the same budget constraint, a low sequencing depth with more individuals sequenced is preferable to a high sequencing depth with few individuals sequenced [30,31]. Sequencing depths can also differ across individuals. Future studies can address NGS data-based association testing for individuals with heterogeneous sequencing depths.

Conclusions
We extend the NGS data-based methods to allow testing for a group of common variants or rare variants based on a linear model framework. Our proposed methods fill the literature gap and can achieve better performance compared with their corresponding genotype-based testing methods in the literature.

Conflicts of Interest:
The author declares no conflict of interest.

Appendix B. Analytical Formula of Score Function
The score function is the gradient of the log-likelihood, ∂l_{y,D}(α, β, φ)/∂θ with θ = (α, β, φ). We derive each element of this vector as follows.
Therefore, written compactly, we have

Appendix C. Evaluation of Score Function at Constrained MLE
We evaluate the score function at the constrained MLE under the null hypothesis, i.e., θ̂ = (α̂, 0, φ̂). Under the null hypothesis, the linear predictor η_i does not include g. The resulting expression involves E(g^T | D_i) = {∑_{g∈G} h(g, D_i)}^{−1} ∑_{g∈G} g^T h(g, D_i) = (∑_{g∈G} g^T h(g, D_i))/(∑_{g∈G} h(g, D_i)), the posterior expectation of the genotype of individual i given sequencing data D_i.
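The posterior expectation E(g | D_i) can be computed numerically once genotype likelihoods are available. The sketch below assumes, as an illustration, that h(g, D_i) factors into a genotype likelihood p(D_i | g) and a Hardy-Weinberg prior p(g) at allele frequency f; the function name and likelihood values are hypothetical:

```python
import numpy as np

def posterior_genotype_mean(lik, f):
    """Posterior mean of genotype g in {0, 1, 2} given sequencing data.

    lik: (3,) genotype likelihoods p(D_i | g) from the reads of one
    individual; f: allele frequency defining the HWE prior p(g). Mirrors
    E(g | D_i) = sum_g g * h(g, D_i) / sum_g h(g, D_i) with h = p(D|g) p(g).
    """
    g = np.array([0.0, 1.0, 2.0])
    prior = np.array([(1 - f) ** 2, 2 * f * (1 - f), f ** 2])
    h = lik * prior                    # h(g, D_i), unnormalized posterior
    return np.sum(g * h) / np.sum(h)   # posterior expectation of g

lik = np.array([0.02, 0.90, 0.08])  # hypothetical genotype likelihoods
eg = posterior_genotype_mean(lik, f=0.3)
```

This posterior mean, a real number in [0, 2], is what replaces the unknown genotype in the NGS data-based score function.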

Appendix D. Analytical Formula of Observed Information Matrix
The observed information matrix is the negative Hessian of the log-likelihood, −∂²l_{y,D}(α, β, φ)/∂θ∂θ^T. We derive each element of the matrix as follows, starting from the score function derived in Appendix B and taking its derivative.
The following terms can be calculated by matrix transposition.

Appendix E. Simplification by the Constrained MLE First-Order Condition
In summary, we have the expression for ∂²l_{y,D}(α, β, φ)/(∂α^T ∂α). In addition, we can obtain the following terms by matrix transposition.