Article

A Constrained Generalized Functional Linear Model for Multi-Loci Genetic Mapping

Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, NY 11790, USA
*
Author to whom correspondence should be addressed.
Stats 2021, 4(3), 550-577; https://doi.org/10.3390/stats4030033
Submission received: 8 May 2021 / Revised: 11 June 2021 / Accepted: 22 June 2021 / Published: 25 June 2021

Abstract

In genome-wide association studies (GWAS), efficient incorporation of linkage disequilibria (LD) among densely typed genetic variants into association analysis is a critical yet challenging problem. Functional linear models (FLM), which impose a smoothing structure on the coefficients of correlated covariates, are advantageous in genetic mapping of multiple variants with high LD. Here we propose a novel constrained generalized FLM (cGFLM) framework to perform simultaneous association tests on a block of linked SNPs with various trait types, including continuous, binary and zero-inflated count phenotypes. The new cGFLM applies a set of inequality constraints on the FLM to ensure model identifiability under different genetic codings. The method is implemented via B-splines, and an augmented Lagrangian algorithm is employed for parameter estimation. For hypothesis testing, we derive a test statistic that accounts for the model constraints and follows a mixture of chi-square distributions. Simulation results show that cGFLM is effective in identifying causal loci and gene clusters compared to several competing methods based on single markers and SKAT-C. We applied the proposed method to analyze a candidate gene-based COGEND study and a large-scale GWAS dataset on dental caries risk.

1. Introduction

Recent advances in technology have facilitated the development of association studies using a highly dense map of genetic markers, such as single nucleotide polymorphisms (SNPs). Nowadays, up to a million SNPs can be readily genotyped along the human genome for thousands of subjects. Even though analyses based on univariate tests are still popular and useful as pre-screening techniques [1], such methods have a major drawback—the high correlation and composite effect among multiple genetic variants are omitted. To accommodate this challenge, a few more advanced statistical methods that are based on multiple linked SNPs have been developed in recent years [2,3,4,5], aiming to improve the power of associations between genetic markers and phenotypic traits of interest.
As SNPs naturally cluster together and form the so-called linkage disequilibrium (LD) blocks, multi-loci association tests that take into account this special genomic structure are usually more powerful than single SNP-based ones [2,3,4,5]. Further, due to this locally connected/correlated LD structure, if a causal SNP, even unobserved in some cases, exists in one LD block, its information, i.e., genetic effect, can be partially carried and jointly represented by its neighboring SNPs in the same block, which increases the likelihood of the causal SNP being detected. Another benefit of testing at the LD block level concerns multiple-testing adjustment in GWAS: the total number of tests is significantly reduced, and the tests are closer to independent. Several multi-SNP methods have been proposed, such as principal component analysis [2,6], entropy-based methods [7,8], kernel machine methods (SKAT, SKAT-C and CKAT) [3,5] and two-marker LD mapping [4]. In addition, some variable selection models, such as LASSO [9,10] and smoothed minimax concave penalized regression (SMCP) [11], were applied to identify a core subset of potential causal variants. Although these methods are useful, most of them overlook the ordering of markers by physical position.
Functional linear models (FLMs) serve as a good solution to the above-mentioned problems, as they can effectively preserve the intrinsic correlation structure and spatial information of SNPs in an LD block [12]. In essence, FLMs build on the expectation that neighboring correlated SNPs in an LD block show similar genetic effects, and treat the regression coefficients of these SNPs as values of an explicit function of position. The smooth coefficient function can be further expanded in terms of spline bases, which allows for substantial dimension reduction in parameter estimation [12,13]. Both functional principal component analysis (FPCA) and beta-smooth only approaches have been applied to construct the FLMs [12,14,15,16]. Since the beta-smooth only approach is more straightforward, it was used for the construction of FLM in this study.
One critical issue with FLM is that the fitted coefficient functions are often noisy and hard to interpret. They usually fluctuate dramatically due to several reasons: (1) Strong LD among nearby SNPs causes multicollinearity, which leads to erratic changes in the signs of adjacent functional coefficients; (2) FLMs cannot yield estimates that are exactly zero over regions with no significant association, thus generating unnatural wiggles in the fitted genetic function; (3) population-specific phenomena such as mutation, genetic drift, population structure and variations in allele frequencies result in the LD not decaying with distance. Excessive local fluctuation may be relieved by adding a smoothness penalty in the model or controlling the number of spline bases. However, these methods are still not able to identify the null regions in which the coefficient function should be zero and may suffer from loss of detection power due to oversmoothing. In addition, no proper test other than the permutation approach is available for penalty-based methods, which brings heavy computational burden to large-scale studies.
In this paper, we propose a novel constrained generalized functional linear model (cGFLM) for flexible and reasonable multi-loci genetic mapping on a block of correlated genetic variants. The cGFLM retains the merits of FLM such as preserving the spatial and LD information among genetic markers, as well as compressing the high-dimensional problem into aggregate inference about several smoothing components. In addition to these benefits, the cGFLM is more powerful and enables easier interpretation of the functional coefficients. Specifically, we reconstruct FLM by separating the genetic effect into two sign-specific coefficient functions and imposing an equality constraint to encourage overall spatial sparsity. The cGFLM tends to constrain the functional coefficients to be zero in “null regions” where no determinative positive or negative effect is present.
We further extend the cGFLM framework to several types of quantitative traits, such as continuous, binary and count data. We put more focus on count traits as they are common in practice but relatively less discussed in the literature. Poisson and negative binomial (NB) models provide tractable methods for most count traits; for traits with excessive zeros, zero-inflated Poisson (ZIP) and zero-inflated negative binomial (ZINB) models have also been developed [17]. Since the ZINB is the most complicated model here, we discuss how to apply the cGFLM to it in more detail. Applications of cGFLM to other models would be similar.
The remainder of the article is organized as follows. In Section 2 and Section 3, we introduce a generalized functional linear model (GFLM) framework, discussing the continuous and categorical traits (Section 2) and then zero-inflated negative binomial phenotypic traits (Section 3). In Section 4, we describe the proposed cGFLM and cGFLM-ZINB models, as well as their estimation and testing processes. In Section 5, we present Monte Carlo simulations to validate the proposed test and compare our model with several alternative methods. In Section 6 and Section 7, we apply the proposed method to the COGEND (The Collaborative Genetic Study of Nicotine Dependence) and Dental Caries GWAS data. Section 8 and Section 9 are the discussion and conclusion, respectively.

2. Generalized Functional Linear Model

Suppose $n$ subjects are sampled, each characterized by $p$ linked SNP markers $j = 1, \ldots, p$, where $m_j$ is the spatial location of SNP marker $j$ with $0 \le m_1 < \cdots < m_p$. SNP location can be measured by the distance of a SNP from the starting position of a block. Denote the SNP genotype at marker position $m_j$ for subject $i$ as $X_i(m_j)$. Suppose the genotypes of a SNP are AA, Aa and aa, which are coded as 0, 1 and 2, according to the number of copies of the allele a. That is,
$$X_i(m_j) = \begin{cases} 0, & \text{if the genotype of SNP } j \text{ at position } m_j \text{ is } AA; \\ 1, & \text{if the genotype of SNP } j \text{ at position } m_j \text{ is } Aa; \\ 2, & \text{if the genotype of SNP } j \text{ at position } m_j \text{ is } aa. \end{cases}$$
Suppose the genetic effect is denoted by $\beta(m)$, a smooth coefficient function over all marker positions $m$ in an LD block. The value of the smooth coefficient function at marker position $m_j$ is then $\beta(m_j)$. Further, suppose a set of other covariates $z_i = (z_{i1}, \ldots, z_{iq})^\top$ are also observed for subject $i$ and let $\alpha = (\alpha_1, \ldots, \alpha_q)^\top$ be the $q \times 1$ vector of their corresponding coefficients. The global-featured effect is included as the intercept, denoted as $\alpha_0$. For the $i$th subject, let $y_i$ denote its phenotypic response. If the $y_i$ are continuous, a functional linear model (FLM) can be formulated as
$$y_i = \alpha_0 + \sum_{u=1}^{q} z_{iu}\alpha_u + \sum_{j=1}^{p} X_i(m_j)\beta(m_j) + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma_\epsilon^2).$$
This type of association has been explored in [12,14]. Alternatively, if the $y_i$ are categorical, a link function $g(\cdot)$ can be applied to transform the mean $\mu(y_i) = E(y_i)$ into $g(\mu(y_i))$. In this case, a generalized functional linear model (GFLM) can be formulated as
$$g(\mu(y_i)) = \alpha_0 + \sum_{u=1}^{q} z_{iu}\alpha_u + \sum_{j=1}^{p} X_i(m_j)\beta(m_j).$$
In the FLM/GFLM above, the genetic effects of a block of SNPs are assumed to be a smooth function. In other words, SNPs located close to each other are expected to show similar effects.
Here we represent $\beta(m)$ with B-splines and express $\beta(m)$ as
$$\beta(m) = \sum_{r=1}^{d} \gamma_r B_r(m) = B(m)\gamma,$$
where $d \ge 1$ is the number of basis functions, $\gamma_{d \times 1} = (\gamma_1, \ldots, \gamma_d)^\top$ is a vector of real-valued coefficients, $(B_1, \ldots, B_d)$ is a set of $d$ basis functions, and $B(m) = (B_1(m), \ldots, B_d(m))$ is the vector of the $d$ basis functions evaluated at position $m$. Let $y_{n \times 1} = (y_1, \ldots, y_n)^\top$, $Z_{n \times q} = (z_1, \ldots, z_n)^\top$,
$$X_{n \times p} = \begin{pmatrix} X_1(m_1) & \cdots & X_1(m_p) \\ \vdots & \ddots & \vdots \\ X_n(m_1) & \cdots & X_n(m_p) \end{pmatrix} \quad \text{and} \quad B_{p \times d} = \begin{pmatrix} B_1(m_1) & \cdots & B_d(m_1) \\ \vdots & \ddots & \vdots \\ B_1(m_p) & \cdots & B_d(m_p) \end{pmatrix}.$$
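In matrix form, the GFLM in (3) can be reformulated as
$$g(\mu(y)) = \alpha_0 + Z\alpha + X\beta = \alpha_0 + Z\alpha + XB\gamma = \begin{pmatrix} 1 & Z & XB \end{pmatrix} \begin{pmatrix} \alpha_0 \\ \alpha \\ \gamma \end{pmatrix}.$$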
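To make the basis expansion concrete, the following minimal sketch evaluates a B-spline basis with the Cox-de Boor recursion and computes $\beta(m_j) = B(m_j)\gamma$ at a set of marker positions. The function names (`bspline_basis`, `clamped_knots`, `beta_at`) and the choice of a clamped knot vector with evenly spaced interior knots are illustrative assumptions, not the authors' implementation.

```python
def bspline_basis(m, knots, degree):
    """All B-spline basis values at position m (Cox-de Boor recursion)."""
    # Degree-0 functions are indicators of the half-open knot intervals.
    vals = [1.0 if knots[r] <= m < knots[r + 1] else 0.0
            for r in range(len(knots) - 1)]
    # Close the last non-empty interval so that m == knots[-1] is covered.
    if m == knots[-1]:
        for r in range(len(knots) - 2, -1, -1):
            if knots[r] < knots[r + 1]:
                vals[r] = 1.0
                break
    for k in range(1, degree + 1):
        new = []
        for r in range(len(knots) - 1 - k):
            left_den = knots[r + k] - knots[r]
            right_den = knots[r + k + 1] - knots[r + 1]
            left = (m - knots[r]) / left_den * vals[r] if left_den > 0 else 0.0
            right = ((knots[r + k + 1] - m) / right_den * vals[r + 1]
                     if right_den > 0 else 0.0)
            new.append(left + right)
        vals = new
    return vals  # length = len(knots) - degree - 1


def clamped_knots(n_basis, degree, lo, hi):
    """Assumed layout: repeated boundary knots, evenly spaced interior knots."""
    n_interior = n_basis - degree - 1
    step = (hi - lo) / (n_interior + 1)
    interior = [lo + step * (i + 1) for i in range(n_interior)]
    return [lo] * (degree + 1) + interior + [hi] * (degree + 1)


def beta_at(positions, gamma, degree=3):
    """beta(m_j) = B(m_j) gamma for sorted marker positions m_1 < ... < m_p."""
    knots = clamped_knots(len(gamma), degree, positions[0], positions[-1])
    B = [bspline_basis(m, knots, degree) for m in positions]  # p x d matrix
    return [sum(b * g for b, g in zip(row, gamma)) for row in B]
```

Because the basis functions sum to one at every position, setting all entries of `gamma` to the same constant yields a constant coefficient function, which is a convenient sanity check.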
Suppose the responses $y_{n \times 1}$ are i.i.d. samples from a distribution with density $f(y; \beta)$. By substituting $\beta$ with $B\gamma$, the number of parameters changes from $(p + q + 1)$ to $(d + q + 1)$. Then the log-likelihood function can be expressed as
$$l(\alpha_0, \alpha, \gamma) = \sum_{i=1}^{n} \log f(y_i; \alpha_0, \alpha, \gamma).$$
The maximum likelihood estimators (MLE) for parameters can be computed by maximizing the log-likelihood function
$$(\hat{\alpha}_0, \hat{\alpha}, \hat{\gamma}) = \arg\max_{\alpha_0, \alpha, \gamma} \; l(\alpha_0, \alpha, \gamma).$$
Since we are modeling a block of SNPs simultaneously in (3), the hypothesis test of association between genetic variants and phenotypic trait will be made at the block level. That is,
$$H_0: \beta(m_j) = 0 \text{ for all } j = 1, \ldots, p; \qquad H_a: \text{not } H_0.$$
Approximately, the hypotheses can also be expressed as:
$$H_0: \gamma_r = 0 \text{ for all } r = 1, \ldots, d; \qquad H_a: \gamma \in \mathbb{R}^d.$$
The likelihood ratio test (LRT) can be applied to draw inference about whether a block of SNPs may be associated with the phenotypic trait. Under $H_0$, the LRT statistic, defined as $-2(l_0 - l_1)$, asymptotically follows a $\chi^2_d$ distribution (chi-square distribution with df = $d$, the number of basis functions used in Equation (5)).

3. GFLM for Zero-Inflated Negative Binomial (ZINB) Traits

Most count traits can be modeled with Poisson or negative binomial (NB) distributions. However, when excessive zeros are present in the data, a zero-inflated Poisson or negative binomial regression framework needs to be employed [17], which combines a latent Bernoulli distribution with a Poisson or NB count distribution to incorporate explanatory variables. In this section, we describe a GFLM framework for count traits that follow a ZINB distribution.
To illustrate the ZINB model, we use dental caries, more commonly known as tooth decay, as an example. Let Y denote the number of dental caries, and the probability distribution of Y is given as:
$$L \sim \text{Bernoulli}(\pi); \qquad \Pr(Y = 0 \mid L = 1) = 1; \qquad \Pr(Y = k \mid L = 0) \sim NB(\mu, \phi), \quad k = 0, 1, \ldots$$
Here $L$ is a latent Bernoulli random variable that categorizes caries “risk” into two states: “risk-free” ($L = 1$) and “risky” ($L = 0$). $\pi$ is the probability that $L = 1$, and $\mu$ and $\phi$ are the mean and dispersion parameter of the NB distribution, respectively.
In a ZINB regression model, let $y_i$ denote the observed number of caries for subject $i$, $i = 1, \ldots, n$. Then,
$$y_i \sim \begin{cases} 0, & \text{with probability } \pi_i; \\ NB(\mu_i, \phi), & \text{with probability } 1 - \pi_i. \end{cases}$$
We can see that zeroes may come from two sources: the conditional point mass distribution or the conditional NB distribution. Thus, the occurrence of dental caries y i is:
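$$y_i = \begin{cases} 0, & \text{with probability } \pi_i + (1 - \pi_i)\left(\dfrac{\phi}{\phi + \mu_i}\right)^{\phi}; \\[2ex] k, & \text{with probability } (1 - \pi_i)\,\dfrac{\Gamma(\phi + k)}{k!\,\Gamma(\phi)}\left(\dfrac{\phi}{\phi + \mu_i}\right)^{\phi}\left(\dfrac{\mu_i}{\phi + \mu_i}\right)^{k}, \quad k = 1, 2, \ldots \end{cases}$$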
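As a small numerical illustration of the two sources of zeros, the sketch below evaluates the ZINB probability mass function exactly as written above. The function name `zinb_pmf` is an illustrative choice; the log-gamma formulation is used only for numerical stability.

```python
import math


def zinb_pmf(k, pi, mu, phi):
    """P(y = k) under a ZINB model with zero-inflation probability pi,
    NB mean mu and NB dispersion phi."""
    log_nb = (math.lgamma(phi + k) - math.lgamma(k + 1) - math.lgamma(phi)
              + phi * math.log(phi / (phi + mu))
              + k * math.log(mu / (phi + mu)))
    nb = math.exp(log_nb)
    # Zeros arise from the point mass (pi) and from the NB part itself.
    return pi + (1.0 - pi) * nb if k == 0 else (1.0 - pi) * nb
```

For example, with $\pi = 0.2$ and $\mu = \phi = 1$, the NB part contributes $P(0) = 0.5$, so the total zero probability is $0.2 + 0.8 \times 0.5 = 0.6$.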
An expectation-maximization (EM) algorithm can be applied to estimate the parameters in the ZINB model. Suppose at step $t$, the parameter estimates are $(\pi^{(t)}, \mu^{(t)}, \phi^{(t)})$. Then the detailed EM implementation is given as follows.
E-step
For subject $i$, $i = 1, \ldots, n$, estimate $L_i$ by its conditional mean
$$L_i^{(t)} = E_{L \mid y_i, \pi^{(t)}, \mu^{(t)}, \phi^{(t)}}(L_i) = \begin{cases} 0, & y_i > 0; \\[1ex] \dfrac{\pi^{(t)}}{\pi^{(t)} + (1 - \pi^{(t)})\left(\dfrac{\phi^{(t)}}{\phi^{(t)} + \mu^{(t)}}\right)^{\phi^{(t)}}}, & y_i = 0. \end{cases}$$
M-step
  • Find $\pi^{(t+1)}$ by maximizing
    $$l_{Ber}(\pi \mid y, L^{(t)}) = \sum_{i=1}^{n} \left[ L_i^{(t)} \log(\pi) + (1 - L_i^{(t)}) \log(1 - \pi) \right]$$
  • Find $\mu^{(t+1)}$, $\phi^{(t+1)}$ by maximizing
    $$l_{NB}(\mu, \phi \mid y, L^{(t)}) = \sum_{i=1}^{n} (1 - L_i^{(t)}) \log\left[ \frac{\Gamma(\phi + y_i)}{y_i!\,\Gamma(\phi)} \left(\frac{\phi}{\phi + \mu}\right)^{\phi} \left(\frac{\mu}{\phi + \mu}\right)^{y_i} \right]$$
The maximization can be performed by the Newton–Raphson algorithm for the two models in the M-step simultaneously.
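The E-step and the Bernoulli part of the M-step can be sketched in a few lines for the covariate-free case. The names `e_step` and `m_step_pi` are illustrative; the NB update for $(\mu, \phi)$, which requires an iterative solver such as Newton–Raphson, is deliberately left out. Setting the derivative of $l_{Ber}$ to zero shows that its maximizer is simply the average of the conditional means $L_i^{(t)}$.

```python
def e_step(y, pi, mu, phi):
    """Conditional mean of the latent 'risk-free' indicator for each subject."""
    p0_nb = (phi / (phi + mu)) ** phi            # NB probability of a zero
    post = pi / (pi + (1.0 - pi) * p0_nb)        # P(L = 1 | y = 0)
    return [post if yi == 0 else 0.0 for yi in y]


def m_step_pi(L):
    """Closed-form maximizer of the Bernoulli log-likelihood l_Ber."""
    return sum(L) / len(L)
```

For instance, with counts `[0, 0, 3, 1]` and current values $\pi = 0.5$, $\mu = \phi = 1$, each zero receives weight $0.5 / (0.5 + 0.25) = 2/3$, and the updated $\pi$ is $1/3$.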
Using the same notations as in the previous section, the Bernoulli probabilities and the negative binomial means can be transformed using the logit and the log links, respectively, and are modeled as follows:
$$\operatorname{logit}(\pi_i) = \log\left(\frac{\pi_i}{1 - \pi_i}\right) = \alpha_0^{Ber} + \sum_{u=1}^{q} z_{iu}\alpha_u^{Ber} + \sum_{j=1}^{p} X_i(m_j)\beta^{Ber}(m_j),$$
and
$$\log(\mu_i) = \alpha_0^{NB} + \sum_{u=1}^{q} z_{iu}\alpha_u^{NB} + \sum_{j=1}^{p} X_i(m_j)\beta^{NB}(m_j).$$
Since usually we do not have strong pre-assumption about genetic components and covariates, the same sets of genetic components and covariates can be used in both Bernoulli and NB models simultaneously. For univariate analyses on single SNPs, it is equivalent to the case that one LD block only contains one SNP, and we can simply set p = 1 in the above equations. For multi-loci mapping purposes, we set all SNPs in an LD block as input/explanatory variables.
We replace the unstructured coefficients with functional coefficients, which are represented by linear combinations of B-splines. That is, Equation (5) is plugged into the coefficients for both the latent Bernoulli model and the NB model to form the GFLM-ZINB model. Using the same notation as above, the GFLM-ZINB model is reformulated as:
$$\operatorname{logit}(\pi_i) = \begin{pmatrix} 1 & Z_i & X_i B^{(Ber)} \end{pmatrix} \begin{pmatrix} \alpha_0^{Ber} \\ \alpha^{Ber} \\ \gamma^{Ber}_{d_{Ber} \times 1} \end{pmatrix} = U_i^{Ber} \eta^{Ber},$$
$$\log(\mu_i) = \begin{pmatrix} 1 & Z_i & X_i B^{(NB)} \end{pmatrix} \begin{pmatrix} \alpha_0^{NB} \\ \alpha^{NB} \\ \gamma^{NB}_{d_{NB} \times 1} \end{pmatrix} = U_i^{NB} \eta^{NB}.$$
Then we can estimate the parameters using an EM Algorithm again, which is implemented as follows.
  • Let $\hat{L}_i^{(t)} = \dfrac{\pi_i^{(t)}}{\pi_i^{(t)} + (1 - \pi_i^{(t)})\left(\dfrac{\phi^{(t)}}{\phi^{(t)} + \mu_i^{(t)}}\right)^{\phi^{(t)}}} \, I(y_i = 0)$.
    • Perform logistic regression of $\hat{L}^{(t)}$ on $U^{Ber}$ to estimate $\eta^{Ber\,(t+1)}$.
    • Perform weighted negative binomial regression of $y$ on $U^{NB}$ with weights $1 - \hat{L}_i^{(t)}$ to obtain the estimates $\eta^{NB\,(t+1)}$ and $\phi^{(t+1)}$.
  • Let $\pi_i^{(t+1)} = \dfrac{\exp(U_i^{Ber} \eta^{Ber\,(t+1)})}{1 + \exp(U_i^{Ber} \eta^{Ber\,(t+1)})}$ and $\mu_i^{(t+1)} = \exp(U_i^{NB} \eta^{NB\,(t+1)})$, and iterate back to step 1.
For hypothesis testing, since we model a block of SNPs simultaneously, the test of association between genetic variants and a phenotypic trait will be made at the block level. In the meantime, we have two sets of parameters, one for the latent Bernoulli model and one for the negative binomial model. The hypothesis test should consider the overall effect of both models. That is, we test whether at least one coefficient from either model is nonzero. Note that we obtained dimension reduction by changing the estimator of interest from the $2p$-dimensional $\beta^{Ber}(m_j)$ and $\beta^{NB}(m_j)$, $j = 1, \ldots, p$, to the $(d_{Ber} + d_{NB})$-dimensional $\gamma^{Ber}_{d_{Ber} \times 1}$ and $\gamma^{NB}_{d_{NB} \times 1}$ by applying the functional coefficients implemented via the B-spline bases. The hypotheses are formulated as
$$H_0: \gamma_b^{Ber} = 0, \; \gamma_m^{NB} = 0 \text{ for all } b, m; \qquad H_a: \text{not } H_0.$$
The likelihood ratio test (LRT) can be applied to draw inference about whether a block of SNPs may be associated with the phenotypic trait. Under $H_0$, the LRT statistic asymptotically follows a $\chi^2_{d_{Ber} + d_{NB}}$ distribution (chi-square distribution with df = $d_{Ber} + d_{NB}$):
$$\chi^2 = -2(l_0 - l_1) \xrightarrow{d} \chi^2_{d_{Ber} + d_{NB}}.$$

4. Constrained Generalized Functional Linear Model

In practice, we have observed that the fitted coefficient functions in the GFLM can fluctuate dramatically, and the fluctuation makes it difficult to explain the functional patterns and determine the locations of causal SNPs. To address this, here we propose a more reasonable constrained functional linear model (cGFLM). Specifically, we separate the genetic effect into two sign-specific coefficient functions and impose an equality constraint to promote spatial sparsity. The cGFLM is formulated as follows:
$$g(\mu(y_i)) = \alpha_0 + \sum_{u=1}^{q} z_{iu}\alpha_u + \sum_{j=1}^{p} X_i(m_j)\beta^{+}(m_j) + \sum_{j=1}^{p} X_i(m_j)\beta^{-}(m_j)$$
$$\text{subject to } \beta^{+}(m_j) \ge 0, \quad \beta^{-}(m_j) \le 0, \quad \beta^{+}(m_j) \cdot \beta^{-}(m_j) = 0 \text{ for all } j,$$
where $\beta^{+}(m)$, $\beta^{-}(m)$ are smooth coefficient functions.
We express $\beta^{+}(m)$, $\beta^{-}(m)$ in terms of B-spline bases $B^{+}_{1 \times d_1}(m)$, $B^{-}_{1 \times d_2}(m)$ and the modified coefficient vectors $\gamma^{+}_{d_1 \times 1}$, $\gamma^{-}_{d_2 \times 1}$, respectively:
$$\beta^{+}(m) = B^{+}(m)\gamma^{+}, \qquad \beta^{-}(m) = B^{-}(m)\gamma^{-}.$$
Let ∘ denote the Hadamard product of two vectors. In matrix form, the cGFLM is formulated as:
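$$g(\mu(y)) = \alpha_0 + Z\alpha + XB^{+}\gamma^{+} + XB^{-}\gamma^{-} = \begin{pmatrix} (1 \;\; Z) & XB^{+} & XB^{-} \end{pmatrix} \begin{pmatrix} \gamma_0 \\ \gamma^{+} \\ \gamma^{-} \end{pmatrix} = U\gamma^{*} = \eta^{*}$$
$$\text{subject to } B^{+}\gamma^{+} \ge 0, \quad B^{-}\gamma^{-} \le 0, \quad (B^{+}\gamma^{+}) \circ (B^{-}\gamma^{-}) = 0.$$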
The log-likelihood function for cGFLM is then
$$l(\gamma_0, \gamma^{+}, \gamma^{-}) = \sum_{i=1}^{n} \log f(y_i; \gamma_0, \gamma^{+}, \gamma^{-}).$$
In order to obtain the MLEs for parameters, the following nonlinear optimization problem with inequality/equality constraints needs to be solved:
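$$\text{maximize } l(\gamma_0, \gamma^{+}, \gamma^{-}) \quad \text{subject to } B^{+}\gamma^{+} \ge 0, \quad B^{-}\gamma^{-} \le 0, \quad (B^{+}\gamma^{+}) \circ (B^{-}\gamma^{-}) = 0.$$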
An augmented Lagrangian algorithm (ALA) can be applied to this constrained maximization [18,19]. Let $\gamma^{*} = (\gamma_0, \gamma^{+}, \gamma^{-})$. Denote the $p$ equality constraints defined by $(B^{+}\gamma^{+}) \circ (B^{-}\gamma^{-}) = 0$ as $h(\gamma^{*}) = 0$, and the $2p$ inequality constraints defined by $B^{+}\gamma^{+} \ge 0$, $B^{-}\gamma^{-} \le 0$ as $g(\gamma^{*}) \le 0$. For notational purposes, let $I = p$, $J = 2p$; then the corresponding augmented Lagrangian for (26) to be minimized is
$$L_{\rho}(\gamma^{*}, \lambda_1, \lambda_2) = -l(\gamma^{*}) + \frac{\rho}{2} \left\{ \sum_{i=1}^{I} \left[ h_i(\gamma^{*}) + \frac{\lambda_{1i}}{\rho} \right]^2 + \sum_{j=1}^{J} \left[ \max\left(0, \; g_j(\gamma^{*}) + \frac{\lambda_{2j}}{\rho}\right) \right]^2 \right\},$$
where $\lambda_1 \in \mathbb{R}^{I}$, $\lambda_2 \in \mathbb{R}_{+}^{J}$ and $\rho > 0$. Let $\lambda_1^{\min} < \lambda_1^{\max}$, $\lambda_2^{\max} > 0$, $\xi > 1$, $0 < \tau < 1$, $\epsilon_c$ be a small constant (e.g., $10^{-4}$), and $\{\epsilon_t\}$ be a sequence of nonnegative numbers such that $\lim_{t \to \infty} \epsilon_t = 0$. A sketch of the augmented Lagrangian algorithm is given below and more implementation details can be found in [20]:
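  • Step 0. Let $\lambda_{1i}^{(1)} \in [\lambda_1^{\min}, \lambda_1^{\max}]$, $i = 1, \ldots, I$, $\lambda_{2j}^{(1)} \in [0, \lambda_2^{\max}]$, $j = 1, \ldots, J$ and $\rho_1 > 0$. Let $\gamma^{*(0)}$ in the parameter space $\Omega$ be an arbitrary initial point. Set $t \leftarrow 1$.
  • Step 1. Find the approximate minimizer $\gamma^{*(t)}$ of $L_{\rho_t}(\gamma^{*}, \lambda_1^{(t)}, \lambda_2^{(t)})$ subject to $\gamma^{*} \in \Omega$, satisfying
    $$\left\| P_{\Omega}\left( \gamma^{*(t)} - \nabla L_{\rho_t}(\gamma^{*(t)}, \lambda_1^{(t)}, \lambda_2^{(t)}) \right) - \gamma^{*(t)} \right\| \le \epsilon_t,$$
    where $P_{\Omega}$ is the Euclidean projection onto $\Omega$ [21] and $\nabla L_{\rho_t}(\gamma^{*(t)}, \lambda_1^{(t)}, \lambda_2^{(t)}) = \partial L_{\rho_t}(\gamma^{*}, \lambda_1^{(t)}, \lambda_2^{(t)}) / \partial \gamma^{*}$ evaluated at $\gamma^{*} = \gamma^{*(t)}$.
  • Step 2. Define $V_j^{(t)} = \max\{ g_j(\gamma^{*(t)}), \; -\lambda_{2j}^{(t)} / \rho_t \}$, $j = 1, \ldots, J$.
    $$\rho_{t+1} = \begin{cases} \rho_t, & \text{if } t = 1 \text{ or } \max\{ \|h(\gamma^{*(t)})\|, \|V^{(t)}\| \} \le \tau \max\{ \|h(\gamma^{*(t-1)})\|, \|V^{(t-1)}\| \}; \\ \xi \rho_t, & \text{otherwise.} \end{cases}$$
  • Step 3. Update the $\lambda_1$s and $\lambda_2$s:
    $\lambda_{1i}^{(t+1)} = \min\{ \max\{ \lambda_1^{\min}, \; \lambda_{1i}^{(t)} + \rho_t h_i(\gamma^{*(t)}) \}, \; \lambda_1^{\max} \}$ for $i = 1, \ldots, I$,
    $\lambda_{2j}^{(t+1)} = \min\{ \max\{ 0, \; \lambda_{2j}^{(t)} + \rho_t g_j(\gamma^{*(t)}) \}, \; \lambda_2^{\max} \}$ for $j = 1, \ldots, J$.
  • Step 4. If $\| \gamma^{*(t)} - \gamma^{*(t-1)} \| > \epsilon_c$, set $t \leftarrow t + 1$ and go to Step 1; else stop. ■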
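The bookkeeping in Steps 2 and 3 can be sketched as follows. This is not the authors' code: the helper names are hypothetical, the constraints are written directly on the evaluated functions $\beta^{+}(m_j)$ and $\beta^{-}(m_j)$, the norm in Step 2 is assumed to be the infinity norm, and the inner minimization of Step 1 is omitted because it requires a numerical optimizer.

```python
def constraints(beta_plus, beta_minus):
    """h = 0 encodes beta+ * beta- = 0; g <= 0 encodes beta+ >= 0, beta- <= 0."""
    h = [bp * bm for bp, bm in zip(beta_plus, beta_minus)]
    g = [-bp for bp in beta_plus] + list(beta_minus)
    return h, g


def augmented_lagrangian(neg_loglik, h, g, lam1, lam2, rho):
    """-l + (rho/2) * {sum [h_i + lam1_i/rho]^2 + sum [max(0, g_j + lam2_j/rho)]^2}."""
    eq = sum((hi + l / rho) ** 2 for hi, l in zip(h, lam1))
    ineq = sum(max(0.0, gj + l / rho) ** 2 for gj, l in zip(g, lam2))
    return neg_loglik + 0.5 * rho * (eq + ineq)


def update_penalty(h, g, lam2, rho, prev_infeas, tau, xi):
    """Step 2: keep rho if infeasibility shrank by the factor tau, else inflate.
    Assumption: infinity norm; prev_infeas is None on the first iteration."""
    V = [max(gj, -l / rho) for gj, l in zip(g, lam2)]
    infeas = max([abs(v) for v in h] + [abs(v) for v in V])
    if prev_infeas is None or infeas <= tau * prev_infeas:
        return rho, infeas
    return xi * rho, infeas


def update_multipliers(h, g, lam1, lam2, rho, lam1_min, lam1_max, lam2_max):
    """Step 3: first-order multiplier updates, clipped to their safeguard boxes."""
    new1 = [min(max(lam1_min, l + rho * hi), lam1_max) for hi, l in zip(h, lam1)]
    new2 = [min(max(0.0, l + rho * gj), lam2_max) for gj, l in zip(g, lam2)]
    return new1, new2
```

As a quick check, a single locus with $\beta^{+} = 0.5$ and $\beta^{-} = 0.1$ violates both the sign constraint on $\beta^{-}$ and the complementarity constraint; with zero multipliers and $\rho = 2$ the penalty adds $0.05^2 + 0.1^2 = 0.0125$ to the negative log-likelihood.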
Similar to what has been performed for (10), a likelihood ratio test can be performed to investigate the overall genetic effects represented by a block of SNPs in contiguous genomic regions. However, the null and alternative hypotheses in (10) are updated with regard to the new parameter space and imposed constraints, as follows:
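$$H_0: \gamma^{+}_{d_1 \times 1} = 0 \text{ and } \gamma^{-}_{d_2 \times 1} = 0; \qquad H_a: B^{+}\gamma^{+} \ge 0, \; B^{-}\gamma^{-} \le 0, \; (B^{+}\gamma^{+}) \circ (B^{-}\gamma^{-}) = 0.$$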
Since we constrain the Hadamard product of the sign-specific coefficient functions to be $0$, at least one basis coefficient in the positive or negative part is constrained to zero. Therefore, the dimension of the alternative parameter space should be $K = \max(d_1, d_2)$. It has been shown [22,23] that in nonlinear optimization the alternative parameter space $\Omega$ can be approximated at the null estimate by a polyhedral convex cone defined by the gradient vectors of the constraint functions. If the unconstrained true parameter value is an interior point of $\Omega$, the test statistic has an asymptotic $\chi^2_K$ distribution under $H_0$. Otherwise, when the unconstrained parameter estimate does not fall in the admissible parameter space, the test statistic is defined by the projection of the unconstrained estimate onto the $k$-dimensional boundary of $\Omega$, in the metric defined by the Hessian matrix $I(\gamma^{*})$, and it may follow an asymptotic $\chi^2_k$ distribution under $H_0$ ($k = 0, \ldots, K - 1$). Therefore, the LRT statistic asymptotically follows a mixture of chi-square distributions with mixing probabilities $w_j$ such that $\sum_{j=0}^{K} w_j = 1$, denoted as
$$\bar{\chi}^2 = -2(l_0 - l_1) \xrightarrow{d} \sum_{j=0}^{K} w_j \chi^2_j.$$
For any $c \in \mathbb{R}$, the p-value of the $\bar{\chi}^2$ test statistic is defined as
$$\Pr(\bar{\chi}^2 \ge c^2) = \sum_{j=0}^{K} w_j P(\chi^2_j \ge c^2).$$
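Given a set of mixing probabilities, the chi-bar-square p-value is a weighted sum of ordinary chi-square tail probabilities. The sketch below uses only the Python standard library; `chi2_sf` and `chibar_pvalue` are illustrative names, and the tail probability for integer degrees of freedom is computed from the standard recurrence for the regularized upper incomplete gamma function, with $\chi^2_0$ treated as a point mass at zero.

```python
import math


def chi2_sf(x, df):
    """P(chi-square with integer df >= x); df = 0 is a point mass at zero."""
    if x <= 0:
        return 1.0
    if df == 0:
        return 0.0
    half_x = x / 2.0
    # Start from Q(1, x/2) = exp(-x/2) or Q(1/2, x/2) = erfc(sqrt(x/2)).
    if df % 2 == 0:
        a, q = 1.0, math.exp(-half_x)
    else:
        a, q = 0.5, math.erfc(math.sqrt(half_x))
    # Recurrence Q(a + 1, z) = Q(a, z) + z^a exp(-z) / Gamma(a + 1).
    while a < df / 2.0:
        q += math.exp(a * math.log(half_x) - half_x - math.lgamma(a + 1.0))
        a += 1.0
    return q


def chibar_pvalue(stat, weights):
    """sum_j w_j * P(chi2_j >= stat), with weights[j] the weight on df = j."""
    return sum(w * chi2_sf(stat, j) for j, w in enumerate(weights))
```

With $K = 1$ and equal weights on $\chi^2_0$ and $\chi^2_1$, a statistic equal to the 0.90 quantile of $\chi^2_1$ (about 2.706) gives a p-value of about 0.05, which illustrates how the mixture lowers the critical value relative to a plain $\chi^2_1$ test.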
The mixing probabilities can be calculated using Monte Carlo techniques. The algorithm is as follows: (1) take 1000 draws from a multivariate normal distribution with mean zero and covariance matrix equal to the Hessian matrix $I(\gamma^{*})$; (2) for each draw, count the number of sign-agreeing elements of the vector to determine which $k$-dimensional boundary ($k = 0, \ldots, K$) of the admissible parameter space it falls in. The weight $w_k$ is then computed as the proportion of the 1000 draws that have exactly $k$ non-zero coefficients when projected onto the alternative parameter space. The Monte Carlo technique is easy to implement and circumvents complicated numerical integrations.
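The counting idea can be illustrated in a deliberately simplified setting. The sketch below assumes an identity covariance matrix and a nonnegative-orthant constraint region, neither of which is the setting of the paper (which uses the Hessian-based covariance and the cone implied by the model constraints). Under these assumptions the projection of a draw onto the orthant keeps exactly its positive components, and the weights are known to be binomial, $w_k = \binom{K}{k} / 2^K$, which gives a check on the simulation. The function name `mc_orthant_weights` is hypothetical.

```python
import random


def mc_orthant_weights(K, n_draws=1000, seed=0):
    """Monte Carlo estimate of chi-bar-square weights for the nonnegative
    orthant with identity covariance (a simplifying assumption)."""
    rng = random.Random(seed)
    counts = [0] * (K + 1)
    for _ in range(n_draws):
        draw = [rng.gauss(0.0, 1.0) for _ in range(K)]
        # The projection onto the orthant has one nonzero entry per positive
        # component, so k is the number of sign-agreeing elements.
        k = sum(1 for v in draw if v > 0)
        counts[k] += 1
    return [c / n_draws for c in counts]
```

For $K = 3$ the estimated weights should be close to $(0.125, 0.375, 0.375, 0.125)$; with a general covariance matrix the draws would have to be correlated and projected in the corresponding metric.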
The LRT can be adapted to the constrained functional coefficients (cGFLM-ZINB) model as follows. Since the parameter estimation is conducted with two independent sets of constraints, the hypothesis test consists of two parts as well, one for the latent Bernoulli distribution and one for the NB distribution.
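$$H_0: \gamma^{Ber+}_{d_{Ber1} \times 1} = 0 \text{ and } \gamma^{Ber-}_{d_{Ber2} \times 1} = 0, \quad \gamma^{NB+}_{d_{NB1} \times 1} = 0 \text{ and } \gamma^{NB-}_{d_{NB2} \times 1} = 0.$$
$$H_a: \begin{cases} B^{Ber+}\gamma^{Ber+} \ge 0, \; B^{Ber-}\gamma^{Ber-} \le 0, \; (B^{Ber+}\gamma^{Ber+}) \circ (B^{Ber-}\gamma^{Ber-}) = 0, \\ B^{NB+}\gamma^{NB+} \ge 0, \; B^{NB-}\gamma^{NB-} \le 0, \; (B^{NB+}\gamma^{NB+}) \circ (B^{NB-}\gamma^{NB-}) = 0. \end{cases}$$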
Assuming that we used the same number of spline bases for both positive and negative coefficient functions, we have $d_{Ber}$ and $d_{NB}$ degrees of freedom for the latent Bernoulli and the negative binomial models. Since a mixture of chi-square distributions (chi-bar test) is needed for each model, the overall LRT statistic for the above test follows a mixture of mixtures of chi-square distributions. Letting the mixing probabilities be $w_j^{Ber}$ and $w_k^{NB}$ for the Bernoulli and NB models, with $\sum_{j=0}^{d_{Ber}} w_j^{Ber} = 1$ and $\sum_{k=0}^{d_{NB}} w_k^{NB} = 1$, we have
$$\bar{\chi}^2_{ZINB} = -2(l_0 - l_1) \xrightarrow{d} \sum_{j=0}^{d_{Ber}} w_j^{Ber} \chi^2_j + \sum_{k=0}^{d_{NB}} w_k^{NB} \chi^2_k.$$
The p-value of the $\bar{\chi}^2_{ZINB}$ test statistic is then
$$\Pr(\bar{\chi}^2_{ZINB} \ge c^2) = \sum_{j,k} w_j^{Ber} w_k^{NB} P(\chi^2_{j+k} \ge c^2).$$
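The double sum over $(j, k)$ can be collapsed into a single mixture by collecting all pairs with the same total degrees of freedom $j + k$, which amounts to a discrete convolution of the two weight vectors. A minimal sketch follows, with `combine_weights` as an illustrative name.

```python
def combine_weights(w_ber, w_nb):
    """Weights v_m = sum over j + k = m of w_ber[j] * w_nb[k], so that the
    ZINB p-value equals sum_m v_m * P(chi2_m >= c^2)."""
    out = [0.0] * (len(w_ber) + len(w_nb) - 1)
    for j, wj in enumerate(w_ber):
        for k, wk in enumerate(w_nb):
            out[j + k] += wj * wk
    return out
```

For example, combining the weights $(0.5, 0.5)$ from each part gives $(0.25, 0.5, 0.25)$ on 0, 1 and 2 degrees of freedom, and the combined weights again sum to one.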

5. Simulation Studies

To study the statistical properties of the proposed cGFLM, we carried out simulations under different sampling schemes. Genotypic data were simulated under two settings: (1) random LD blocks with varying structures [9,11], and (2) the LD block of gene CHRNA7 [24], borrowed from an existing dataset (COGEND, the Collaborative Genetic Study of Nicotine Dependence), which represents a real-data genomic structure. For the phenotypes, we considered traits following either a binomial or ZINB distribution. Sample sizes were set to range from 500 to 2000.
To investigate whether cGFLM can correctly control type I error, $\beta_{causal}$ was set to zero under the null hypothesis. For empirical power evaluation, we examined two different scenarios. First, we assumed only one causal SNP was located in the LD block, with varying values of $\beta_{causal}$. Then we considered another interesting setting in which two causal loci with reversed sign effects are located in the same LD block, which mimics the scenario in which both deleterious and protective SNPs exist in a genomic region. The two causal loci chosen were weakly correlated ($r^2 < 0.01$). The corresponding regression coefficients were set to $(\beta_{causal1}, \beta_{causal2})$ where $\beta_{causal1} = -\beta_{causal2}$. In this case, we examine whether the proposed test can deal with sign-heterogeneous genetic effects. Causal SNPs were removed before analyses to mimic the real-data setting in which causal SNPs may not be genotyped.
In terms of functional parameters, the order of B-spline basis was set to 4 (degree = 3) to construct cubic curves with desired smoothing properties. Knots were placed evenly in the position domain. In general, the number of spline bases would be determined according to the number of SNPs (p) in an LD block. Data-adaptive choices for the number or the placement of knots can be made via cross-validation, but for simplicity we will not provide further discussions here. Empirically, we suggest using the maximum of 4 and the integer part of p / 6 as the number of bases so that it is possible to capture clustering genetic effects in the fitted function. Sensitivity analyses using a broad range of parameters were performed to make sure our results are robust.
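The empirical rule for the number of bases is easy to state in code. In the sketch below, `num_bases` implements the maximum of 4 and the integer part of $p / 6$ as described above, while `interior_knots` is a hypothetical helper that places the $d - 4$ interior knots of an order-4 basis evenly over the position domain; the exact knot layout used by the authors is not specified beyond "evenly", so this layout is an assumption.

```python
def num_bases(p):
    """Suggested number of B-spline bases for an LD block with p SNPs."""
    return max(4, p // 6)


def interior_knots(d, lo, hi, order=4):
    """Evenly spaced interior knots for d basis functions of the given order
    (assumed layout; boundary knots are not included)."""
    n_interior = d - order
    step = (hi - lo) / (n_interior + 1)
    return [lo + step * (i + 1) for i in range(n_interior)]
```

For a block of $p = 25$ SNPs the rule gives 4 bases and therefore no interior knots, so each coefficient function is a single cubic polynomial segment; a block of 60 SNPs would use 10 bases with 6 interior knots.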
Because the simulation results using the random LD blocks and the LD block structure from the CHRNA7 gene are very similar, here we only show results with the random LD blocks, whereas the results with the CHRNA7 gene are located in Appendix A (Table A1 and Table A2, Figure A1, Figure A2, Figure A3, Figure A4, Figure A5 and Figure A6). For a random LD block, all SNP genotypes within it were generated following the strategy introduced in [9,11]. Briefly, genotypes of $p$ SNPs were generated based on a random $p$-dimensional multivariate normal matrix $\zeta_{n \times p}$ with mean $0$ and covariance $\Sigma_{p \times p}$. Assuming that SNPs have equal allele frequencies, the following rule would be applied to generate the genotype of the $j$th SNP for the $i$th subject. Let $z_{0.25}$ be the third quartile of the standard normal distribution; we have:
$$X_{ij} = \begin{cases} 0, & \text{if } \zeta_{ij} < -z_{0.25}; \\ 1, & \text{if } -z_{0.25} \le \zeta_{ij} < z_{0.25}; \\ 2, & \text{if } \zeta_{ij} \ge z_{0.25}. \end{cases}$$
The covariance matrix was defined as follows. For each block, 10% of the SNPs were selected as “tag SNPs”. They were highly correlated with each other ($Corr(X_{j_1}, X_{j_2}) = 0.8$), moderately correlated with 30% of the other SNPs ($Corr(X_{j_1}, X_{j_2}) = 0.5$), and weakly correlated with the remaining 60% of SNPs ($Corr(X_{j_1}, X_{j_2}) = 0.2$). The correlations among the 90% “non-tag” SNPs are determined by their physical locations ($Corr(X_{j_1}, X_{j_2}) = 0.7^{|j_1 - j_2|}$). In this case, we would not violate the assumption that SNPs are physically adjacent and linked. Further, the LD block structures vary among different randomly generated arrays.
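A reduced version of this generator can be written with the standard library alone. The sketch below covers only the distance-based part of the design: the latent normal variables follow a first-order autoregressive recursion, which yields latent correlation $0.7^{|j_1 - j_2|}$, and are then thresholded at $\pm z_{0.25}$. The tag-SNP structure is omitted, and the function name `simulate_genotypes` and its parameters are illustrative assumptions.

```python
import random
from statistics import NormalDist


def simulate_genotypes(n, p, rho=0.7, seed=0):
    """n x p genotype matrix (values 0/1/2) from thresholded latent normals
    whose correlation decays as rho ** |j1 - j2| (non-tag SNPs only)."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.75)        # third quartile of N(0, 1)
    scale = (1.0 - rho ** 2) ** 0.5       # keeps each latent variance at 1
    X = []
    for _ in range(n):
        zeta = [rng.gauss(0.0, 1.0)]
        for j in range(1, p):
            zeta.append(rho * zeta[-1] + scale * rng.gauss(0.0, 1.0))
        X.append([0 if v < -z else (1 if v < z else 2) for v in zeta])
    return X
```

Because the thresholds cut the standard normal at its first and third quartiles, the genotype frequencies are approximately 0.25, 0.5 and 0.25, which corresponds to an allele frequency of 0.5 under Hardy-Weinberg proportions.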

5.1. Simulation Using Binary Traits

The following binary outcomes were simulated based on causal genotypes under the logit model:
$$\operatorname{logit}(\mu(y_i)) = \log \frac{\Pr(y_i = 1)}{1 - \Pr(y_i = 1)} = \alpha_0 + X_i \beta_{causal}$$
We considered simulation scenarios where $\alpha_0 = 0.2$. Each scenario was replicated 10,000 times in order to observe the type I error rates under small genome-wide thresholds (nominal $\alpha = 0.05$, $0.01$, $0.005$ and $0.001$). We ran 1000 replicates for each scenario for power evaluation, in which a p-value smaller than 0.05 was declared significant. We compared the empirical power of our proposed model with three existing methods: single marker association test (smAT), SKAT for the combined effect of rare and common variants (SKAT-C) and the functional linear model (GFLM). p-values for smAT were adjusted by Bonferroni correction, and p-values for SKAT-C and GFLM were calculated by the combined sum test and the $\chi^2$ test, respectively.
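For reference, drawing binary outcomes from this logit model takes only a few lines. The sketch below is an illustration rather than the authors' simulation code; `simulate_binary_trait` is a hypothetical name, and the default intercept is the value 0.2 stated above.

```python
import math
import random


def simulate_binary_trait(x_causal, beta_causal, alpha0=0.2, seed=0):
    """Draw y_i ~ Bernoulli(expit(alpha0 + x_i * beta_causal)) for each
    causal genotype x_i coded 0, 1 or 2."""
    rng = random.Random(seed)
    y = []
    for x in x_causal:
        prob = 1.0 / (1.0 + math.exp(-(alpha0 + x * beta_causal)))
        y.append(1 if rng.random() < prob else 0)
    return y
```

Under the null ($\beta_{causal} = 0$) every subject has the same success probability, $1 / (1 + e^{-0.2}) \approx 0.55$, regardless of genotype.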
We can see from Table 1 that cGFLM can effectively maintain the type I error. Evaluation of empirical power was based on settings in which the regression coefficient $\beta_{causal} = 0.1$, $0.2$, $0.3$, $0.4$ or $0.5$. For power calculation with a single causal locus, we used the same settings as in the type I error simulation. When causal SNPs were not genotyped, we can see from Figure 1 that the cGFLM showed better power than all other methods. In the second scenario, in which two causal loci had reverse-sign effects, we included more SNPs in a block so that it would be possible to locate two weakly correlated markers within the region. We used $p = 25$ SNPs in a group and $d_1 = d_2 = 4$ as the number of spline bases. From Figure 2, we can see that cGFLM consistently demonstrates better power than other methods.
The association patterns of a sample simulation are presented in Figure 3. For smAT, a modified Manhattan plot of the −log10 (p-values) by the sign of the fitted coefficients is used. For all other methods, the coefficient estimates are plotted. The causal loci are highlighted with dashed lines (left in red for positive effect, right in blue for negative effect). Compared to smAT and GFLM, the cGFLM fitted coefficient function is more reasonable and correctly identifies the causal loci.

5.2. Simulation Using ZINB Traits

In this case, we simulate phenotypic traits conditional on causal genotypes under the following ZINB model:
$$\operatorname{logit}(\pi_i) = \log \frac{\pi_i}{1 - \pi_i} = \alpha_0^{Ber} + X_i \beta_{causal}^{Ber}, \qquad \log(\mu_i) = \alpha_0^{NB} + X_i \beta_{causal}^{NB}$$
The outcomes are generated with the latent mixture process as discussed in Section 3.
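One way to realize the latent mixture process with the standard library is sketched below. The NB draw uses the gamma-Poisson representation (a gamma-distributed rate with shape $\phi$ and mean $\mu$, followed by a Poisson draw), which reproduces the NB parameterization of Section 3; the Poisson step uses Knuth's multiplication method and is only suitable for moderate rates. The function name `sample_zinb` is an illustrative assumption.

```python
import math
import random


def sample_zinb(pi, mu, phi, rng):
    """One ZINB draw: structural zero with probability pi, else NB(mu, phi)."""
    if rng.random() < pi:
        return 0                                  # latent state L = 1
    lam = rng.gammavariate(phi, mu / phi)         # shape phi, scale mu/phi
    # Knuth's method: count uniforms until their product drops below e^-lam.
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k
```

In a regression setting, $\pi_i$ and $\mu_i$ would first be obtained from the two linear predictors via the inverse logit and the exponential, and then passed to the sampler subject by subject.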
The intercepts, $\alpha_0^{Ber}$ and $\alpha_0^{NB}$, were set to 0.0 in all subsequent simulations. $\beta_{causal}$ was set to zero under the null hypothesis in the evaluation of type I error. Since the computational burden for ZINB models is much higher than that for the binary outcome model, we reduced the number of replicates to 1000 for each scenario. Type I error rates were investigated under small genome-wide thresholds (nominal $\alpha = 0.05$, $0.01$ and $0.005$). For assessment of empirical power, we used settings similar to those in the simulation for binary traits. However, we examined two general scenarios in which the genetic effect is in either the latent Bernoulli or the negative binomial model. For each general scenario, we first set only one causal SNP in the LD block, affecting either $\beta_{causal}^{Ber}$ or $\beta_{causal}^{NB}$. Then we considered two causal loci with reversed sign effects in the LD block. The two causal loci chosen were weakly correlated ($r^2 < 0.01$). The corresponding regression coefficients were set to $(\beta_{causal1}^{Ber}, \beta_{causal2}^{Ber})$ or $(\beta_{causal1}^{NB}, \beta_{causal2}^{NB})$, where $\beta_{causal1} = -\beta_{causal2}$. A total of 100 replicates were run for each scenario. We compared the empirical power of our proposed model cGFLM-ZINB with three existing methods: functional linear model with negative binomial traits (GFLM-NB), single marker association test with ZINB traits (smAT-ZINB) and functional linear model with ZINB traits (GFLM-ZINB). p-values for smAT-ZINB were adjusted by Bonferroni correction, and p-values for GFLM-NB and GFLM-ZINB were calculated by the likelihood ratio tests.
Table 2 shows that the proposed cGFLM maintains the type I error rates well. Evaluation of empirical power is based on settings with regression coefficients ranging from 0.1 to 0.5 and sample sizes from 500 to 2000. When the genetic effect was placed in the latent Bernoulli model, the negative binomial (NB) regression model clearly fails: GFLM-NB has substantially lower power than the other models in Figure 4 and Figure 5. This supports our assumption that a simple NB distribution loses power when the genetic effect acts on the excess zeros of a zero-inflated count process. Among the ZINB models, similar performances were observed for smAT-ZINB, GFLM-ZINB and cGFLM-ZINB (see Figure 6 and Figure 7). The figures demonstrate that ZINB models are more advantageous than the NB regression model when excessive zeros exist. More importantly, the cGFLM-ZINB model consistently shows the best performance among the tested models, both with one causal locus and with two causal loci in the LD block. Collectively, these simulations demonstrate the advantages and robustness of our proposed cGFLM model.
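When reading empirical type I error rates and power, it helps to keep the Monte Carlo error in mind. The sketch below shows how a rejection rate is computed from replicate p-values and how large the sampling error of that rate is for a given number of replicates. The function names and the example p-values are hypothetical; the replicate count of 1000 is the one stated above for the ZINB scenarios.

```python
import math


def rejection_rate(p_values, alpha):
    """Fraction of replicates whose p-value falls below the nominal level."""
    return sum(p < alpha for p in p_values) / len(p_values)


def mc_standard_error(alpha, n_replicates):
    """Binomial standard error of an empirical rate whose true value is alpha."""
    return math.sqrt(alpha * (1.0 - alpha) / n_replicates)


# Sampling error of the empirical type I error with 1000 replicates.
for alpha in (0.05, 0.01, 0.005):
    se = mc_standard_error(alpha, 1000)
    print(f"alpha={alpha}: standard error {se:.4f}")

# Made-up p-values from four replicates, only to show the call.
example_rate = rejection_rate([0.01, 0.20, 0.04, 0.50], alpha=0.05)
print("empirical rejection rate:", example_rate)
```

With 1000 replicates the standard error at $\alpha = 0.05$ is roughly 0.007, so empirical rates between about 0.04 and 0.06 are within ordinary sampling variation of the nominal level.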
We then applied the cGFLM model to two independent studies to assess its practical usage. The first dataset is a candidate-gene-based SNP study and the other is a GWAS for dental caries.

6. Application 1: COGEND Study

According to the World Health Statistics Report [25], cigarette smoking is the single biggest cause of preventable mortality worldwide, causing more than 5 million deaths per year and accounting for one in 10 adult deaths. Dependence on nicotine, the primary psychoactive component in tobacco, profoundly impairs people’s ability to quit smoking. The etiology of nicotine dependence is multifactorial, and evidence from various epidemiological studies suggests that genetic factors have a substantial impact on smoking behaviors. Identifying these genetic factors could promote the development of targeted treatments and further reduce smoking-related morbidity and mortality.
The Collaborative Genetic Study of Nicotine Dependence (COGEND) is a nationwide project aiming to detect the genetic mechanisms and environmental features of nicotine dependence. In this study on CHRN candidate genes, a total of 216 SNPs were genotyped for 2022 individuals (1114 cases with nicotine dependence and 908 controls). In the phenotypic data, all cases and controls were current or former smokers who reported smoking more than 100 cigarettes in their lifetime. Current nicotine dependence was defined by the Fagerström Test for Nicotine Dependence (FTND). Subjects having FTND ≥ 4 were classified as nicotine dependent (case). Subjects having lifetime FTND = 0 or 1 were classified as control. The original SNP set was divided into 12 LD blocks according to their physical locations and LD structure, all of which consist of one or more contiguous gene regions. Since functional models are not well suited for LD blocks with a small number of SNPs, four small blocks with fewer than seven SNPs were excluded from the analyses. We applied our proposed method cGFLM, along with smAT, SKAT-C and GFLM, to analyze the final dataset, which consists of 191 SNPs in eight LD blocks. Age, gender and race were included as covariates. A Bonferroni significance threshold of $0.05 / 8 = 6.25 \times 10^{-3}$ is used for cGFLM, GFLM and SKAT-C, and a threshold of $0.05 / 191 = 2.6 \times 10^{-4}$ is used for smAT.
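The two Bonferroni thresholds follow directly from the number of tests performed by each class of method. A minimal sketch of the arithmetic is given below; the function names are illustrative only.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that controls the family-wise error at alpha."""
    return alpha / n_tests


def bonferroni_adjust(p_value, n_tests):
    """Equivalent view: inflate the p-value and compare it with alpha directly."""
    return min(1.0, p_value * n_tests)


block_threshold = bonferroni_threshold(0.05, 8)    # eight LD blocks
snp_threshold = bonferroni_threshold(0.05, 191)    # 191 individual SNPs
print(f"block-level threshold: {block_threshold:.2e}")
print(f"SNP-level threshold:   {snp_threshold:.2e}")
```

Comparing a raw p-value with `alpha / n_tests` and comparing the adjusted value `min(1, p * n_tests)` with `alpha` lead to the same decision; the block-based methods need a far less stringent per-test level because they perform 8 tests rather than 191.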
Table 3 summarizes the results. All four methods yielded small p-values (cGFLM: $5.25 \times 10^{-6}$; GFLM: $8.24 \times 10^{-6}$; SKAT-C: $1.7 \times 10^{-3}$; smAT: $1.72 \times 10^{-4}$) for the CHRNA5 cluster (“IREB2 + LOC123688 + PSMA4 + CHRNA5 + CHRNA3 + CHRNB4” gene cluster) on chromosome 15. However, only cGFLM and SKAT-C showed significance for the “CHRNB3 + CHRNA6” gene cluster (cGFLM: $8.67 \times 10^{-4}$; SKAT-C: $1.3 \times 10^{-3}$). For these two blocks, the p-values calculated by cGFLM are the smallest among the four methods. It is also worth mentioning that both gene clusters have been shown to be associated with nicotine dependence in previous studies [26,27,28,29]. The other LD blocks (candidate gene clusters) are not significantly associated with the phenotypic trait in this cohort.

7. Application 2: Whole Genome Association Study for Dental Caries

More than 40% of children and adolescents and 90% of adults in the US are affected by dental caries, more commonly known as tooth decay. Multiple factors are considered to contribute to the risk of dental caries, including environmental factors and social behaviors [30,31,32]. Evidence has shown that some individuals are more susceptible to caries while others are more resistant, largely irrespective of the environmental risk factors they are exposed to, suggesting that genetic factors may play crucial roles in the risk of developing caries [33]. According to several previous studies, the heritability of dental caries has been estimated to be as high as 60%.
To better understand the genetic mechanisms underlying the risk of dental caries, a GWAS has been conducted as part of the Gene Environment Association Studies initiative (deposited in dbGaP Study Accession: phs000095.v2.p1) [4,34]. A total of 4020 individuals were genotyped with a large panel of SNPs (610,000) and examined for multiple outcomes. Our study focused on traits related to caries in permanent teeth. Two indexes, D1MFT and D1MFS, which quantify the total permanent tooth/surface caries with white spots, were included in the analyses. Since the outcomes of interest were both count traits with excess zeros (Figure 8), the zero-inflated negative binomial model was applied to the data set, both for single-marker tests (smAT-ZINB) and with functional coefficients (GFLM-ZINB and cGFLM-ZINB). The final analytic sample consists of 1480 individuals with complete permanent teeth phenotypic data. Age, gender and total number of teeth/surfaces were included as covariates in the analyses.
Table 4 and Table 5 summarize the significant findings. The Manhattan plots for GWAS scans using the ZINB model are presented in Figure 9 and Figure 10. For genome-wide univariate screening, a significance threshold of $1 \times 10^{-7}$ was used. For the LD block-based analysis, a genome-wide significance threshold of $2.5 \times 10^{-6}$ was used. Several SNPs were identified as significantly associated with D1MFT and D1MFS in the genome-wide scan, of which rs7990965 on chromosome 13 and rs1058595 on chromosome 10 demonstrate consistent significance for both traits. In the LD block-based association tests, the cGFLM-ZINB model identified two significant genetic regions: PKDCC on chromosome 2 (both traits), and the intergenic region between DCN and BTG1 on chromosome 12 (D1MFS). It is worth mentioning that these two genetic regions were not identified by the other competing methods, suggesting that the cGFLM model may provide new insights into the risk of dental caries. More interestingly, the gene PKDCC was found to be associated with craniofacial morphogenesis in a previous study [35], further supporting the biological significance of our findings.

8. Discussion

Joint analyses of multiple contiguous SNPs that form an LD block are expected to provide better inference about unknown causal variants, since these SNPs each carry partial information about the causal variants and should collectively be more powerful. In principle, LD between genetic loci should decline with their intervening physical distance, given that more cross-over events occur over longer ranges. Therefore, how to effectively incorporate and take full advantage of the distance and alignment order information of these SNPs is an important yet challenging problem. While FLMs seem to provide a good solution, their estimated coefficient functions are usually noisy and hard to interpret, which can lead to power loss. Improvements to existing methods are therefore of great interest in order to better detect significant genetic variants.
In this article, we proposed a novel cGFLM as a flexible and more reasonable multi-loci mapping strategy. Our model is built upon the general FLM framework by imposing constraints to specify sign-specific effects. Our limited simulations suggest that these constraints encourage spatial sparsity in the estimated coefficient function, which needs to be further validated in other experimental settings. Due to these constraints, the likelihood ratio test statistic does not follow a single chi-square distribution under the null hypothesis, but a weighted mixture of chi-square distributions. The simulation results show that, compared to three competing methods, our proposed cGFLM generally demonstrated better power when the effect size is moderate or large, and comparable performance when the effect size is small, while maintaining correct type I error rates.
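To make the mixture null distribution concrete, the following sketch computes a p-value for a statistic that follows a weighted mixture of chi-square distributions with 0, 1, 2, ... degrees of freedom. It is only an illustration: the mixing weights depend on the constraint structure and are treated here as given inputs, and the function names are hypothetical. The tail probabilities use the standard recurrence for the regularized upper incomplete gamma function, so only the Python standard library is needed.

```python
import math


def chi2_sf(t, df):
    """Upper-tail probability P(chi2_df >= t) for an integer df >= 0.

    df = 0 denotes a point mass at zero.
    """
    if t <= 0:
        return 1.0
    if df == 0:
        return 0.0
    x = t / 2.0
    if df % 2 == 0:
        a, sf = 1.0, math.exp(-x)               # closed form for df = 2
    else:
        a, sf = 0.5, math.erfc(math.sqrt(x))    # closed form for df = 1
    # Recurrence Q(a + 1, x) = Q(a, x) + x**a * exp(-x) / Gamma(a + 1).
    while 2 * a < df:
        sf += x ** a * math.exp(-x) / math.gamma(a + 1.0)
        a += 1.0
    return sf


def mixture_pvalue(t, weights):
    """p-value under sum_k weights[k] * chi2_k, where weights sum to one."""
    if abs(sum(weights) - 1.0) > 1e-8:
        raise ValueError("mixing weights must sum to one")
    return sum(w * chi2_sf(t, k) for k, w in enumerate(weights))


# Hypothetical weights: an equal mixture of chi2_0 and chi2_1, the classical
# case of a single inequality constraint.
p_mix = mixture_pvalue(3.84, [0.5, 0.5])
p_plain = chi2_sf(3.84, 1)
print(f"mixture p-value: {p_mix:.4f}, plain chi2_1 p-value: {p_plain:.4f}")
```

In this example the mixture p-value is half of the plain chi-square p-value, which shows why referring the constrained statistic to an ordinary chi-square distribution would be conservative.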
Applications of the cGFLM to two real datasets demonstrate the applicability of our method. Particularly, its application to the GWAS of dental caries risk identifies new genetic regions that have not been discovered before, indicating that our method can potentially provide new insights into large-scale genomic studies. However, it is also important to note that some significant SNPs identified by other methods were missed by cGFLM. This is likely because in cGFLM, identification of significant SNPs is based on the collective evidence of a group of linked SNPs, so causal SNPs that are only weakly linked to their neighboring SNPs may be missed. Additionally, cGFLM inherently would not work well for variants not in LD with neighboring SNPs, such as singletons and doubletons. Therefore, these observations suggest that our method should be considered as an additional approach in our toolbox for more discoveries, not a replacement for existing ones.
When the cGFLM method is applied to large-scale genome-wide scanning of LD blocks, one concern is how to group SNPs into LD blocks. Several software packages, such as PLINK [36] and LDExplorer [37], have built-in functions to define LD blocks for genomic data, and we can utilize them to help partition the genome. Even with these tools, LD blocks can vary with different LD thresholds. That is, a weaker LD threshold can lead to larger LD blocks and vice versa. Another concern is about LD blocks with few SNPs, since they are not suitable for functional analysis. They can either be merged into adjacent larger blocks with weak LD or be analyzed with single-marker methods. Prospectively, in order to discover the core subset of causal genes, further group selection among multiple candidate blocks is of great interest, and regularization methods such as group LASSO and group SCAD can be included as extensions to our current framework. Additionally, in this study cGFLM only considers LD-based blocks for association analyses; however, it would also be of great interest to see if cGFLM can be applied to gene-level analyses, as genes are functional units of the genome. Finally, if subpopulations exist in the genotypic sample, stratification can be addressed in our model by including principal components of population variation as additional covariates [38].
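The dependence of block size on the LD threshold can be seen with a deliberately simple partition rule. The sketch below is not the algorithm used by PLINK or LDExplorer; it is a hypothetical adjacency rule that starts a new block whenever the $r^2$ between two neighboring SNPs falls below the threshold, and the $r^2$ values are made up for illustration.

```python
def partition_blocks(adjacent_r2, threshold):
    """Split SNP indices 0..n-1 into contiguous blocks.

    adjacent_r2[i] is the r^2 between SNP i and SNP i + 1; a new block
    starts wherever that value falls below `threshold`.
    """
    blocks, current = [], [0]
    for i, r2 in enumerate(adjacent_r2, start=1):
        if r2 >= threshold:
            current.append(i)
        else:
            blocks.append(current)
            current = [i]
    blocks.append(current)
    return blocks


# Made-up r^2 values between six neighboring SNPs.
adjacent_r2 = [0.9, 0.6, 0.2, 0.85, 0.4]
strict = partition_blocks(adjacent_r2, threshold=0.8)
loose = partition_blocks(adjacent_r2, threshold=0.3)
print("threshold 0.8:", strict)
print("threshold 0.3:", loose)
```

With the stricter threshold the six SNPs fall into four small blocks, two of which are singletons that would have to be merged or analyzed with single-marker methods; with the weaker threshold they form two blocks of three SNPs each.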

9. Conclusions

Our proposed GFLM, cGFLM and their related implementations with the ZINB model (GFLM-ZINB, cGFLM-ZINB) are novel methods specifically designed for multi-loci mapping in naturally formed LD blocks. The models can simultaneously incorporate multiple linked SNPs, including their physical alignments and block-wide linkage structure, and make inference at the LD block level. Simulation studies show that cGFLM and cGFLM-ZINB perform well in terms of both simpler coefficient functions and empirical power gain. The practical usage of cGFLM is demonstrated with a candidate-gene study of nicotine dependence and a GWAS on dental caries. Considering its flexibility and comprehensiveness, cGFLM would be a very attractive method for future gene-based association studies.

Author Contributions

Conceptualization, J.H., J.Y., W.Z. and S.W.; methodology, J.H., Z.G. and S.W.; software, J.H.; Data Analyses and interpretation, J.H. and S.W.; writing—original draft preparation, J.H.; writing—review and editing, J.H., J.Y., W.Z. and S.W.; Critical revision, J.H., J.Y., Z.G., W.Z. and S.W.; All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable as this was secondary use of publicly available data.

Informed Consent Statement

Not applicable as this was secondary use of publicly available data.

Data Availability Statement

The application datasets are publicly available.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Simulation Results Using the LD Block Information in CHRNA7 Gene

Table A1. Type I error simulation using cGFLM for binary outcomes based on the CHRNA7 gene.
| Nominal α | N = 500 | N = 1000 | N = 1500 | N = 2000 |
|---|---|---|---|---|
| 0.05 | 0.0508 | 0.0512 | 0.0493 | 0.0506 |
| 0.01 | 0.0100 | 0.0099 | 0.0102 | 0.0093 |
| 0.005 | 0.0056 | 0.0053 | 0.0048 | 0.0046 |
| 0.001 | 0.0011 | 0.0009 | 0.0011 | 0.0007 |
Table A2. Type I error simulation using cGFLM-ZINB for ZINB outcomes based on the CHRNA7 gene.
| Nominal α | N = 500 | N = 1000 | N = 1500 | N = 2000 |
|---|---|---|---|---|
| 0.05 | 0.037 | 0.046 | 0.045 | 0.042 |
| 0.01 | 0.011 | 0.011 | 0.006 | 0.008 |
| 0.005 | 0.005 | 0.007 | 0.002 | 0.006 |
Figure A1. Power simulation for binary outcomes based on the CHRNA7 gene, single causal locus.
Figure A2. Power simulation for binary outcomes based on the CHRNA7 gene, two reverse-sign causal loci.
Figure A3. Power simulation for ZINB outcomes based on the CHRNA7 gene, single causal locus, effect in latent Bernoulli distribution.
Figure A4. Power simulation for ZINB outcomes based on the CHRNA7 gene, two causal loci, effect in latent Bernoulli distribution.
Figure A5. Power simulation for ZINB outcomes based on the CHRNA7 gene, single causal locus, effect in NB distribution.
Figure A6. Power simulation for ZINB outcomes based on the CHRNA7 gene, two causal loci, effect in NB distribution.

References

  1. Cantor, M.R.; Lange, K.; Sinsheimer, S.J. Prioritizing GWAS Results: A Review of Statistical Methods and Recommendations for Their Application. Am. J. Hum. Genet. 2010, 86, 6–22.
  2. Wang, K.; Abbott, D. A principal components regression approach to multilocus genetic association studies. Genet. Epidemiol. 2008, 32, 108–118.
  3. Ionita-Laza, I.; Lee, S.; Makarov, V.; Buxbaum, J.D.; Lin, X. Sequence kernel association tests for the combined effect of rare and common variants. Am. J. Hum. Genet. 2013, 92, 841–853.
  4. Yang, J.; Zhu, W.; Chen, J.; Zhang, Q.; Wu, S. Genome-wide Two-marker linkage disequilibrium mapping of quantitative trait loci. BMC Genet. 2014, 15, 20.
  5. Zhang, H.; Zhao, N.; Mehrotra, D.V.; Shen, J. Composite Kernel Association Test (CKAT) for SNP-set joint assessment of genotype and genotype-by-treatment interaction in Pharmacogenetics studies. Bioinformatics 2020, 36, 3162–3168.
  6. Tian, Y.; Ma, L.; Cai, X.; Zhu, J. Statistical Method Based on Bayes-Type Empirical Score Test for Assessing Genetic Association with Multilocus Genotype Data. Int. J. Genom. 2020, 2020, 4708152.
  7. Cui, Y.; Kang, G.; Sun, K.; Qian, M.; Romero, R.; Fu, W. Gene-centric genomewide association study via entropy. Genetics 2008, 179, 637–650.
  8. Malten, J.; König, I.R. Modified entropy-based procedure detects gene-gene-interactions in unconventional genetic models. BMC Med. Genom. 2020, 13, 65.
  9. Wu, T.T.; Chen, Y.F.; Hastie, T.; Sobel, E.; Lange, K. Genome-wide association analysis by lasso penalized logistic regression. Bioinformatics 2009, 25, 714–721.
  10. Yang, S.; Wen, J.; Eckert, S.T.; Wang, Y.; Liu, D.J.; Wu, R.; Li, R.; Zhan, X. Prioritizing genetic variants in GWAS with lasso using permutation-assisted tuning. Bioinformatics 2020, 36, 3811–3817.
  11. Liu, J.; Wang, K.; Ma, S.; Huang, J. Regularized regression method for genome-wide association studies. BMC Proc. 2011, 5, S67.
  12. Fan, R.; Wang, Y.; Mills, J.L.; Wilson, A.F.; Bailey-Wilson, J.E.; Xiong, M. Functional linear models for association analysis of quantitative traits. Genet. Epidemiol. 2013, 37, 726–742.
  13. Cardot, H.; Ferraty, F.; Sarda, P. Spline estimators for the functional linear model. Stat. Sin. 2003, 13, 571–592.
  14. Luo, L.; Zhu, Y.; Xiong, M. Quantitative trait locus analysis for next-generation sequencing with the functional linear models. J. Med. Genet. 2012, 49, 513–524.
  15. Huang, T.; Saporta, G.; Wang, H.; Wang, S. Spatial functional linear model and its estimation method. arXiv 2018, arXiv:1811.00314.
  16. Belonogova, N.M.; Svishcheva, G.R.; Wilson, J.F.; Campbell, H.; Axenovich, T.I. Weighted functional linear regression models for gene-based association analysis. PLoS ONE 2018, 13, e0190486.
  17. Ridout, M.; Hinde, J.; Demétrio, C.G. A Score Test for Testing a Zero-Inflated Poisson Regression Model Against Zero-Inflated Negative Binomial Alternatives. Biometrics 2001, 57, 219–223.
  18. Birgin, E.G.; Martínez, J.M. Improving ultimate convergence of an augmented Lagrangian method. Optim. Methods Softw. 2008, 23, 177–195.
  19. Birgin, E.; Martínez, J. Complexity and performance of an Augmented Lagrangian algorithm. Optim. Methods Softw. 2020, 35, 885–920.
  20. Andreani, R.; Birgin, E.G.; Martinez, J.M.; Schuverdt, M.L. Augmented Lagrangian methods under the Constant Positive Linear Dependence constraint qualification. Math. Programm. 2008, 111, 5–32.
  21. Liu, J.; Ye, J. Efficient Euclidean Projections in Linear Time. In Proceedings of the 26th International Conference on Machine Learning, Montreal, QC, Canada, 14–18 June 2009; pp. 657–664.
  22. Shapiro, A. Asymptotic distribution of test statistics in the analysis of moment structures under inequality constraints. Biometrika 1985, 72, 133–144.
  23. Liu, X. Likelihood ratio test for and against nonlinear inequality constraints. Metrika 2007, 65, 93–108.
  24. Cameli, C.; Bacchelli, E.; De Paola, M.; Giucastro, G.; Cifiello, S.; Collo, G.; Cainazzo, M.M.; Pini, L.A.; Maestrini, E.; Zoli, M. Genetic variation in CHRNA7 and CHRFAM7A is associated with nicotine dependence and response to varenicline treatment. Eur. J. Hum. Genet. 2018, 26, 1824–1831.
  25. World Health Organization. World Health Statistics 2013; World Health Organization: Geneva, Switzerland, 2013; Volume 1.
  26. Saccone, N.L.; Wang, J.C.; Breslau, N.; Johnson, E.O.; Hatsukami, D.; Saccone, S.F.; Grucza, R.A.; Sun, L.; Duan, W.; Budde, J.; et al. The CHRNA5-CHRNA3-CHRNB4 nicotinic receptor subunit gene cluster affects risk for nicotine dependence in African-Americans and in European-Americans. Cancer Res. 2009, 69, 6848–6856.
  27. Culverhouse, R.C.; Johnson, E.O.; Breslau, N.; Hatsukami, D.K.; Sadler, B.; Brooks, A.I.; Hesselbrock, V.M.; Schuckit, M.A.; Tischfield, J.A.; Goate, A.M.; et al. Multiple distinct CHRNB3–CHRNA6 variants are genetic risk factors for nicotine dependence in African Americans and European Americans. Addiction 2014, 109, 814–822.
  28. Wen, L.; Han, H.; Liu, Q.; Su, K.; Yang, Z.; Cui, W.; Yuan, W.; Ma, Y.; Fan, R.; Chen, J.; et al. Significant association of the CHRNB3-CHRNA6 gene cluster with nicotine dependence in the Chinese Han population. Sci. Rep. 2017, 7, 9745.
  29. Liu, Q.; Han, H.; Wang, M.; Yao, Y.; Wen, L.; Jiang, K.; Ma, Y.; Fan, R.; Chen, J.; Su, K.; et al. Association and cis-mQTL analysis of variants in CHRNA3-A5, CHRNA7, CHRNB2, and CHRNB4 in relation to nicotine dependence in a Chinese Han population. Transl. Psychiatry 2018, 8, 1–10.
  30. Ditmyer, M.M.; Dounis, G.; Howard, K.M.; Mobley, C.; Cappelli, D. Validation of a multifactorial risk factor model used for predicting future caries risk with Nevada adolescents. BMC Oral Health 2011, 11, 18.
  31. Silva, M.J.; Kilpatrick, N.M.; Craig, J.M.; Manton, D.J.; Leong, P.; Burgner, D.P.; Scurrah, K.J. Genetic and early-life environmental influences on dental caries risk: A twin study. Pediatrics 2019, 143, e20183499.
  32. Lendrawati, L.; Pintauli, S.; Rahardjo, A.; Bachtiar, A.; Maharani, D.A. Risk factors of dental caries: Consumption of sugary snacks among indonesian adolescents. Pesqui. Bras. Odontopediatria Clin. Integr. 2019, 19.
  33. Bretz, W.A.; Corby, P.M.; Melo, M.R.; Coelho, M.Q.; Costa, S.M.; Robinson, M.; Schork, N.J.; Drewnowski, A.; Hart, T.C. Heritability estimates for dental caries and sucrose sweetness preference. Arch. Oral Biol. 2006, 51, 1156–1160.
  34. Wang, Q.; Jia, P.; Cuenco, K.T.; Zeng, Z.; Feingold, E. Association signals unveiled by a comprehensive gene set enrichment analysis of dental caries genome-wide association studies. PLoS ONE 2013, 8, e72653.
  35. Melvin, V.S.; Feng, W.; Hernandez-Lagunas, L.; Artinger, K.B.; Williams, T. A morpholino-based screen to identify novel genes involved in craniofacial morphogenesis. Dev. Dyn. 2013, 242, 817–831.
  36. Purcell, S.; Neale, B.; Todd-Brown, K.; Thomas, L.; Ferreira, M.A.; Bender, D.; Maller, J.; Sklar, P.; De Bakker, P.I.; Daly, M.J.; et al. PLINK: A tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 2007, 81, 559–575.
  37. Taliun, D.; Gamper, J.; Pattaro, C. LDExplorer. 2013. Available online: http://www.eurac.edu/en/research/health/biomed/services/Pages/LDExplorer.aspx (accessed on 20 November 2015).
  38. Price, A.L.; Patterson, N.J.; Plenge, R.M.; Weinblatt, M.E.; Shadick, N.A.; Reich, D. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 2006, 38, 904–909.
Figure 1. Power simulation for binary outcomes based on random LD blocks, single causal locus.
Figure 2. Power simulation for binary outcomes based on random LD blocks, two reverse-sign causal loci.
Figure 3. Illustration of fitted genetic mapping patterns using different models. A modified Manhattan plot of the −log10 (p-values), signed by the fitted coefficients, is used for smAT. For all other methods, the coefficient estimates are plotted. The causal loci are highlighted with dashed lines (left in red for positive effect, right in blue for negative effect).
Figure 4. Power simulation for ZINB outcomes based on random LD blocks, single causal locus, effect in latent Bernoulli distribution.
Figure 5. Power simulation for ZINB outcomes based on random LD blocks, two causal loci, effect in latent Bernoulli distribution.
Figure 6. Power simulation for ZINB outcomes based on random LD blocks, single causal locus, effect in NB distribution.
Figure 7. Power simulation for ZINB outcomes based on random LD blocks, two causal loci, effect in NB distribution.
Figure 8. Histograms and fitted densities for traits D1MFT and D1MFS in dental caries study.
Figure 9. Genome-wide scanning for trait D1MFT using single-marker association tests based on the ZINB model.
Figure 10. Genome-wide scanning for trait D1MFS using single-marker association tests based on ZINB model.
Table 1. Type I error simulation using cGFLM for binary outcomes based on randomly generated LD blocks.
| Nominal α | N = 500 | N = 1000 | N = 1500 | N = 2000 |
|---|---|---|---|---|
| 0.05 | 0.0500 | 0.0473 | 0.0466 | 0.0482 |
| 0.01 | 0.0109 | 0.0076 | 0.0099 | 0.0082 |
| 0.005 | 0.0054 | 0.0045 | 0.0046 | 0.0033 |
| 0.001 | 0.0010 | 0.0009 | 0.0011 | 0.0006 |
Table 2. Type I error simulation using cGFLM-ZINB for ZINB outcomes based on randomly generated LD blocks.
| Nominal α | N = 500 | N = 1000 | N = 1500 | N = 2000 |
|---|---|---|---|---|
| 0.05 | 0.055 | 0.046 | 0.051 | 0.040 |
| 0.01 | 0.009 | 0.009 | 0.012 | 0.007 |
| 0.005 | 0.003 | 0.007 | 0.008 | 0.004 |
Table 3. Association tests for COGEND study based on LD blocks using single-marker association tests with Bonferroni correction (smAT), SKAT-C, GFLM and cGFLM.
| LD Block (Genes) | CHR | Length (kb) | # of SNPs | p-Value (cGFLM) | p-Value (smAT) | p-Value (SKAT-C) | p-Value (GFLM) |
|---|---|---|---|---|---|---|---|
| CHRND | 2 | 43 | 10 | 0.0990 | 0.4818 | 0.1920 | 0.1321 |
| CHRNA9 | 4 | 20 | 11 | 0.4308 | 0.7580 | 0.3941 | 0.3915 |
| CHRNA2 | 8 | 31 | 11 | 0.7078 | 0.7596 | 0.5698 | 0.8036 |
| CHRNB3 + CHRNA6 | 8 | 101 | 25 | $8.67 \times 10^{-4}$ | 0.0064 | 0.0013 | 0.0033 |
| CHRNA7 | 15 | 126 | 39 | 0.2759 | 0.3382 | 0.6078 | 0.5587 |
| CHRNA5 cluster | 15 | 212 | 72 | $5.25 \times 10^{-6}$ | $1.72 \times 10^{-4}$ | 0.0017 | $8.24 \times 10^{-6}$ |
| CHRNB1 | 17 | 84 | 14 | 0.0137 | 0.3975 | 0.0274 | 0.0602 |
| CHRNA4 | 20 | 17 | 9 | 0.0556 | 0.1134 | 0.0255 | 0.1409 |
Table 4. Significant findings for dental caries GWAS scanning using single-marker association tests based on the ZINB model.
Trait: D1MFT
| SNP ID | CHR | Gene | MAF | p-Value |
|---|---|---|---|---|
| rs7990965 | 13 | - | 0.033 | $1.02 \times 10^{-12}$ |
| rs1058595 | 10 | PHYH | 0.063 | $2.52 \times 10^{-9}$ |
| rs12344120 | 9 | - | 0.029 | $4.79 \times 10^{-8}$ |
| rs4694666 | 4 | MTHFD2L | 0.135 | $5.48 \times 10^{-8}$ |
| rs17078140 | 3 | LIMD1 | 0.029 | $5.77 \times 10^{-8}$ |
| rs9893536 | 17 | USP32 | 0.091 | $8.80 \times 10^{-8}$ |
| rs7334525 | 13 | RFC3 | 0.093 | $9.52 \times 10^{-8}$ |
Trait: D1MFS
| SNP ID | CHR | Gene | MAF | p-Value |
|---|---|---|---|---|
| rs7990965 | 13 | - | 0.033 | $2.40 \times 10^{-10}$ |
| rs1058595 | 10 | PHYH | 0.063 | $3.29 \times 10^{-9}$ |
Table 5. Significant findings for dental caries association tests based on LD blocks (gene clusters) using the ZINB model (smAT-ZINB with Bonferroni correction, GFLM-ZINB, cGFLM-ZINB).
Trait: D1MFT
| LD Block (Genes) | CHR | Length (kb) | # of SNPs | p-Value (cGFLM) | p-Value (GFLM) | p-Value (smAT) |
|---|---|---|---|---|---|---|
| PKDCC | 2 | 70 | 18 | $8.41 \times 10^{-7}$ | $8.06 \times 10^{-5}$ | $4.17 \times 10^{-5}$ |
Trait: D1MFS
| LD Block (Genes) | CHR | Length (kb) | # of SNPs | p-Value (cGFLM) | p-Value (GFLM) | p-Value (smAT) |
|---|---|---|---|---|---|---|
| Intergenic between DCN and BTG1 | 12 | 70 | 21 | $6.07 \times 10^{-7}$ | $5.92 \times 10^{-6}$ | $4.41 \times 10^{-3}$ |
| PKDCC | 2 | 70 | 18 | $1.38 \times 10^{-6}$ | $1.59 \times 10^{-4}$ | $7.73 \times 10^{-6}$ |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Huang, J.; Yang, J.; Gu, Z.; Zhu, W.; Wu, S. A Constrained Generalized Functional Linear Model for Multi-Loci Genetic Mapping. Stats 2021, 4, 550-577. https://doi.org/10.3390/stats4030033

