A Constrained Generalized Functional Linear Model for Multi-Loci Genetic Mapping

: In genome-wide association studies (GWAS), efﬁcient incorporation of linkage disequilibria (LD) among densely typed genetic variants into association analysis is a critical yet challenging problem. Functional linear models (FLM), which impose a smoothing structure on the coefﬁcients of correlated covariates, are advantageous in genetic mapping of multiple variants with high LD. Here we propose a novel constrained generalized FLM (cGFLM) framework to perform simultaneous association tests on a block of linked SNPs with various trait types, including continuous, binary and zero-inﬂated count phenotypes. The new cGFLM applies a set of inequality constraints on the FLM to ensure model identiﬁability under different genetic codings. The method is implemented via B-splines, and an augmented Lagrangian algorithm is employed for parameter estimation. For hypotheses testing, a test statistic that accounts for the model constraints was derived, following a mixture of chi-square distributions. Simulation results show that cGFLM is effective in identifying causal loci and gene clusters compared to several competing methods based on single markers and SKAT-C. We applied the proposed method to analyze a candidate gene-based COGEND study and a large-scale GWAS data on dental caries risk.


Introduction
Recent advances in technology have facilitated the development of association studies using a highly dense map of genetic markers, such as single nucleotide polymorphisms (SNPs). Nowadays, up to a million SNPs can be readily genotyped along the human genome for thousands of subjects. Even though analyses based on univariate tests are still popular and useful as pre-screening techniques [1], such methods have a major drawbackthe high correlation and composite effect among multiple genetic variants are omitted. To accommodate this challenge, a few more advanced statistical methods that are based on multiple linked SNPs have been developed in recent years [2][3][4][5], aiming to improve the power of associations between genetic markers and phenotypic traits of interest.
As SNPs naturally cluster together and form the so-called linkage disequilibrium (LD) blocks, multi-loci association tests that take into account this special genomic structure are usually more powerful than single SNP-based ones [2][3][4][5]. Further, due to this locally connected/correlated LD structure, if a casual SNP, even unobserved in some cases, exists in one LD block, its information, i.e., genetic effect, can be partially carried and jointly represented by its neighboring SNPs in the same block, which increases the likelihood of the causal SNPs being detected. Another benefit of testing at the LD block level is on multiple test adjustment in GWAS, including not only significant reduction of the total number of tests, but also improvement on the independence of the tests. Several multi-SNP methods have been proposed, such as principal component analysis [2,6], entropy-based methods [7,8], kernel machine methods (SKAT, SKAT-C and CKAT) [3,5] and two-marker LD mapping [4]. In addition, some variable selection models, such as LASSO [9,10] and Stats 2021, 4 smoothed minimax concave penalized regression (SMCP) [11], were applied to identify a core subset of potential causal variants. Although these methods are useful, most of them overlook marker ordering information in their physical positions.
Functional linear models (FLMs) serve as a good solution to the above-mentioned problems, as it can effectively preserve the intrinsic correlation structure and spatial information of SNPs in an LD [12]. In essence, FLMs take into account the effect that neighboring correlated SNPs in an LD would show similar genetic effects, and treat the regression coefficients of these SNPs with an explicit functional form. The smooth coefficient function can be further expanded in terms of spline bases, which allows for substantial dimension reduction in parameter estimation [12,13]. Both functional principal component analysis (FPCA) and beta-smooth only approaches have been applied to construct the FLMs [12,[14][15][16]. Since the beta-smooth only approach is more straightforward, it was used for the construction of FLM in this study.
One critical issue with FLM is that the fitted coefficient functions are often noisy and hard to interpret. They usually fluctuate dramatically due to several reasons: (1) Strong LD among nearby SNPs causes multicollinearity, which leads to erratic changes in the signs of adjacent functional coefficients; (2) FLMs cannot yield estimates that are exactly zero over regions with no significant association, thus generating unnatural wiggles in the fitted genetic function; (3) population-specific phenomena such as mutation, genetic drift, population structure and variations in allele frequencies result in the LD not decaying with distance. Excessive local fluctuation may be relieved by adding a smoothness penalty in the model or controlling the number of spline bases. However, these methods are still not able to identify the null regions in which the coefficient function should be zero and may suffer from loss of detection power due to oversmoothing. In addition, no proper test other than the permutation approach is available for penalty-based methods, which brings heavy computational burden to large-scale studies.
In this paper, we propose a novel constrained generalized functional linear model (cGFLM) for flexible and reasonable multi-loci genetic mapping on a block of correlated genetic variants. The cGFLM retains the merits of FLM such as preserving the spatial and LD information among genetic markers, as well as compressing the high-dimensional problem into aggregate inference about several smoothing components. In addition to these benefits, the cGFLM is more powerful and enables easier interpretation of the functional coefficients. Specifically, we reconstruct FLM by separating the genetic effect into two sign-specific coefficient functions and imposing an equality constraint to encourage overall spatial sparsity. The cGFLM tends to constrain the functional coefficients to be zero in "null regions" where no determinative positive or negative effect is present.
We further extend the cGFLM framework to several types of quantitative traits, such as continuous, binary and count data. We put more focus on count traits as they are common in practice but relatively less mentioned in the literature. Poisson and negative binomial (NB) models provide tractable methods to most count traits, however, in traits with excessive zeros, zero-inflated Poisson (ZIP) or negative binomial (ZINB) models were also developed [17]. Since the ZINB is the most complicated model here, we discuss how to apply the cGFLM to it in more detail. Applications of cGFLM to other models would be similar.
The remainder of the article is organized as follows. In Sections 2 and 3, we introduce a generalized functional linear model (GFLM) framework, discussing the continuous and categorical traits (Section 2) and then zero-inflated negative binomial phenotypic traits (Section 3). In Section 4, we describe the proposed cGFLM and cGFLM-ZINB models, as well as their estimation and testing processes. In Section 5, we present Monte Carlo simulations to validate the proposed test and compare our model with several alternative methods. In Sections 6 and 7, we apply the proposed method to the COGEND (The Collaborative Genetic Study of Nicotine Dependence) and Dental Caries GWAS data. Sections 8 and 9 are the discussion and conclusion, respectively.

Generalized Functional Linear Model
Suppose n subjects are sampled, each characterized by p linked SNP markers j = 1, . . . , p, where m j is the spatial location of SNP marker j with 0 ≤ m 1 < · · · < m p . SNP location can be measured by the distance of a SNP from starting position of a block. Denote the SNP genotype at marker position m j for subject i as X i (m j ) . Suppose the genotypes of a SNP are AA, Aa and aa, which are coded as 0, 1 and 2, according to the number of copies of the allele a. That is, , if the genotype of SNP j at position m j is AA; 1, if the genotype of SNP j at position m j is Aa; 2, if the genotype of SNP j at position m j is aa. (1) Suppose the genetic effect is denoted by β(m), a smooth coefficient function over all marker positions m in an LD block. The value of the smooth coefficient function at marker position m j is then β(m j ). Further, suppose a set of other covariates z i = (z i1 , . . . , z iq ) are also observed for subject i and let α = (α 1 , . . . , α q ) be the q × 1 vector of their corresponding coefficients. The global-featured effect is included as the intercept, denoted as α 0 . For the ith subject, let y i denote its phenotypic response. If y i s are continuous, a functional linear model (FLM) can be formulated as This type of association has been explored in [12,14]. Alternatively, if y i s are categorical, a link function g(·) can be applied to to transform its mean µ(y i ) = E(y i ) to be g(µ(y i )). In this case, a generalized functional linear model (GFLM) can be formulated as In the FLM/GFLM above, the genetic effects of a block of SNPs are assumed to be a smooth function. In other words, SNPs located close to each other are expected to show similar effects.
Since we are modeling a block of SNPs simultaneously in (3), the hypothesis test of association between genetic variants and phenotypic trait will be made at the block level. That is, Approximately, the hypotheses can also be expressed as: The likelihood ratio test (LRT) can be applied to draw inference about whether a block of SNPs may be associated with the phenotypic trait. Under H 0 , the LRT statistic, defined as −2(l 0 − l 1 ), follows asymptotically a χ 2 d distribution (Chi-square distribution with df = d, the number of basis vectors used in Equation (5)).

GFLM for Zero-Inflated Negative Binomial (ZINB) Traits
Most count traits can be modeled with Poisson or negative binomial (NB) distributions. However, when excessive zeroes are present in the data, a zero-inflated Poisson or negative binomial regression framework needs to be employed [17], which utilizes a latent Bernoulli distribution and a Poisson distribution to incorporate explanatory variables. In this section, we describe a GFLM framework for count traits that follow a ZINB distribution.
To illustrate the ZINB model, we use dental caries, more commonly known as tooth decay, as an example. Let Y denote the number of dental caries, and the probability distribution of Y is given as: Here L is a latent Bernoulli random variable that categorizes caries "risk" into two states: "risk-free" (L = 1) and "risky" (L = 0). π is the probability of L = 1 , µ and φ are the mean and dispersion parameter of the NB distribution, respectively.
In a ZINB regression model, let y i denote the observation of number of caries for subject i, i = 1, . . . , n. Then, We can see that zeroes may come from two sources: the conditional point mass distribution or the conditional NB distribution. Thus, the occurrence of dental caries y i is: An expectation-maximization (EM) algorithm can be applied to estimate the parameters in the ZINB model. Suppose at step t, the parameter estimates are (π (t) , µ (t) , φ t ). Then the detailed EM implementation is given as follows.

E-step
For subject i, i = 1, . . . , n, estimate L i by its conditional mean

M-step
• Find π (t+1) by maximizing The maximization can be performed by the Newton-Raphson algorithm for the two models int the M-step simultaneously.
Using the same notations as in the previous section, the Bernoulli probabilities and the negative binomial means can be transformed using the logit and the log links, respectively, and are modeled as follows: and Since usually we do not have strong pre-assumption about genetic components and covariates, the same sets of genetic components and covariates can be used in both Bernoulli and NB models simultaneously. For univariate analyses on single SNPs, it is equivalent to the case that one LD block only contains one SNP, and we can simply set p = 1 in the above equations. For multi-loci mapping purposes, we set all SNPs in an LD block as input/explanatory variables.
We substitute the unstructured coefficients to functional coefficients, which are represented by linear combination of B-Splines. That is, Equation (5) is plugged into the coefficients for both the latent Bernoulli model and NB model to form the GFLM-ZINB model. Using the same notation above, the GFLM-ZINB model is reformulated as: Then we can estimate the parameters using an EM Algorithm again, which is implemented as follows.
For hypothesis testing, since we model a block of SNPs simultaneously, the test of association between genetic variants and a phenotypic trait will be made at the block level. In the meantime, we have two sets of parameters, one for the latent Bernoulli model and one for the negative binomial model. The hypothesis test should consider the overall effect of both models. That is, we consider testing at least one coefficients from either model being nonzero. Note that we obtained dimension reduction by changing the estimator of interest from the 2p-dimensional β Ber (m j ) and β NB (m j ), j = 1, . . . , p to the (d Ber + d NB )dimensional γ Ber d Ber ×1 and γ NB d NB ×1 by applying the functional coefficients implemented via the B-spline bases. The hypotheses are formulated as The likelihood ratio test (LRT) can be applied to draw inference about whether a block of SNPs may be associated with the phenotypic trait. Under H 0 , the LRT statistic asymptotically follows a χ 2

Constrained Generalized Functional Linear Model
In practice, we have observed that the fitted coefficient functions in the GFLM can fluctuate dramatically, and the fluctuation makes it difficult to explain the functional patterns and determine the locations of causal SNPs . To address this, here we propose a more reasonable constrained functional linear model (cGFLM). Specifically, we separate the genetic effect into two sign-specific coefficient functions and impose an equality constraint to promote spatial sparsity. The cGFLM is formulated as follows: Let • denote the Hadamard product of two vectors. In matrix form, the cGFLM is formulated as: The log-likelihood function for cGFLM is then In order to obtain the MLEs for parameters, the following nonlinear optimization problem with inequality/equality constraints needs to be solved: An augmented Lagrangian algorithm (ALA) can be applied to this constrained maximization [18,19]. Let γ * = (γ 0 , γ + , γ − ). Denote the p equality constraints defined by

and the 2p inequality constraints defined by
For notation purposes, let I = p, J = 2p, then the corresponding augmented Lagrangian for (26) to be minimized is where c be a small constant value (e.g., 10 −4 ), and { t } be a sequence of nonnegative numbers such that lim t→∞ t = 0. A sketch of the augmented Lagrangian algorithm is given below and more implementation details can be found in [20]: Step 0. Let λ where P Ω is the Euclidean projection onto Ω [21] and Step Step 3. Update λ 1 s and λ 2 s: Step 4. if γ * (t) − γ * (t−1) ∞ ≥ c , set t + 1 → t and go to Step 1; else stop. Similar to what has been performed for (10), a likelihood ratio test can be performed to investigate the overall genetic effects represented by a block of SNPs in contiguous genomic regions. However, the null and alternative hypotheses in (10) are updated with regard to the new parameter space and imposed constraints, as follows: Since we constrained that the Hadamard product of sign-specific coefficient functions is 0, at least one basis coefficient in the positive or negative part would be constrained to zero. Therefore, the dimension of the alternative parameter space should be K = max(d 1 , d 2 ). According to [22,23], it has been shown that in nonlinear optimization, the alternative parameter space Ω can be approximated at the null estimate by a polyhedral convex cone defined by the gradient vectors of the constraint functions. If the unconstrained true parameter value is an interior point of Ω, the test statistic has an asymptotic χ 2 K distribution under H 0 . Otherwise when the unconstrained parameter estimate does not fall in the admissible parameter space, the test statistic is defined by the projection of the unconstrained estimate on the k-dimensional boundary of Ω taken metrics according to the Hessian matrix I(γ * ), and it may follow an asymptotic χ 2 k distribution under H 0 (k = 0, . . . , K − 1). Therefore, the LRT statistic asymptotically follows a mixture of chisquare distributions with mixing probabilities w j such that ∑ K j=0 w j = 1, denoted as For any c ∈ R, the p-value of theχ 2 test statistic is defined as The mixing probabilities can be calculated using Monte Carlo techniques. The algorithm is given as follows: (1) Take 1000 draws from a multivariate normal distribution with mean zero and covariance matrix equaling to the Hessian matrix I(γ * ); (2) for each draw compute and count the number of sign-agree elements of the vectors that fall in the k-dimensional boundaries (k = 0, . . . , K) of the admissible parameter space. In this case w j is computed as the proportion of the 1000 draws in which it has exactly k non-zero coefficients projected on the alternative parameter space. The Monte Carlo technique is easy to implement and able to circumvent complicated numerical integrations.
The LRT can be adapted to the constrained functional coefficients (cGFLM-ZINB) model as follows. Since the parameter estimation is conducted with two independent sets of constraints, the hypothesis test consists of two parts as well, one for the latent Bernoulli distribution and one for the NB distribution.
Assuming that we used the same number of spline bases for both positive and negative coefficient functions, we have d Ber and d NB degrees of freedom for the latent Bernoulli and the negative binomial models. Since a mixture of chi-square distributions (chi-bar test) is needed for each model, the overall LRT statistic for the above test then follows a mixture of mixture of chi-square distributions. Letting the mixing probabilities be w Ber The p-value of theχ 2 test statistic is then

Simulation Studies
To study the statistical properties of the proposed cGFLM, we carried out simulations under different sampling schemes. Genotypic data were simulated under two settings: (1) random LD block with varying structures [9,11], and (2) LD block of gene CHRNA7 [24] borrowed from an existing data (COGEND, the Collaborative Genetic Study of Nicotine Dependence), which represents a real-data genomic structure. For the phenotypes, we considered traits following either a binomial or ZINB distribution. Sample sizes were set to range from 500 to 2000.
To investigate whether cGFLM can correctly control type I error, β causal was set to be zero under the null hypothesis. For empirical power evaluation, we examined two different scenarios. First, we assumed only one causal SNP was located in the LD block, with varying β causal s. Then we considered another interesting setting when two causal loci with reversed sign effects are located in the same LD block, which mimics the scenario when both deleterious and protective SNPs exist in a genomic region. The two causal loci chosen were weakly correlated (r 2 < 0.01). The corresponding regression coefficients were set to be (β causal1 , β causal2 ) where β causal1 = −β causal2 . In this case, we plan to examine if the proposed test can deal with sign-heterogeneous genetic effects. Causal SNPs were removed before analyses to mimic the real data setting that causal SNPs may not be genotyped.
In terms of functional parameters, the order of B-spline basis was set to 4 (degree = 3) to construct cubic curves with desired smoothing properties. Knots were placed evenly in the position domain. In general, the number of spline bases would be determined according to the number of SNPs (p) in an LD block. Data-adaptive choices for the number or the placement of knots can be made via cross-validation, but for simplicity we will not provide further discussions here. Empirically, we suggest using the maximum of 4 and the integer part of p/6 as the number of bases so that it is possible to capture clustering genetic effects in the fitted function. Sensitivity analyses using a broad range of parameters were performed to make sure our results are robust.
Because the simulation results using the random LD blocks and the LD block structure from CHRNA7 gene are very similar, here we only show results with the random LD blocks, whereas the results with the CHRNA7 gene are located in Appendix A (Tables A1 and A2, Figures A1-A6). For a random LD block, all SNP genotypes within it were generated following the strategy introduced in [9,11]. Briefly, genotypes of p SNPs were generated based on a random p-dimensional multivariate normal matrix ζ n×p with mean 0 and covariance Σ p×p . Assuming that SNPs have equal allele frequencies, the following rule would be applied to generate the genotype of the jth SNP for the ith subject. Let z 0.25 be the third quartile of standard normal distribution, we have: The covariance matrix was defined as follows. For each block, 10% of the SNPs were selected as "tag SNPs". They were highly correlated with each other (Corr(X j1 , X j2 ) = 0.8), moderately correlated with 30% of other SNPs (Corr(X j1 , X j2 ) = 0.5), and weakly correlated with the remaining 60% SNPs (Corr(X j1 , X j2 ) = 0.2). The correlations among the 90% "nontag" SNPs are determined by their physical locations (Corr(X j1 , X j2 ) = 0.7 |j 1 −j 2 | ). In this case, we would not violate the assumptions that SNPs are physically adjacent and linked. Further, the LD block structures vary among different randomly generated arrays.

Simulation Using Binary Traits
The following binary outcomes were simulated based on causal genotypes under the logit model: logit(µ(y i )) = log Pr(y i = 1) 1 − Pr(y i = 1) = α 0 + X i β causal (35) We considered simulation scenarios where α 0 = 0.2. Each scenario was replicated 10,000 times in order to observe the type I error rates under small genome-wide thresholds (nominal α = 0.05, 0.01, 0.005 and 0.001). We ran 1000 replicates for each scenario for power evaluation, in which a p-value smaller than 0.05 would be declared for significance. We compared the empirical power of our proposed model with three existing methods: single marker association test (smAT), SKAT for the combined effect of rare and common variants (SKAT-C) and the functional linear model (GFLM). p-values for smAT were adjusted by Bonferroni correction, and p-values for SKAT-C and GFLM were calculated by combined sum test and χ 2 test.
We can see from Table 1 that cGFLM can effectively maintain the type I error. Evaluation of empirical power was based on settings when regression coefficient β causal = 0.1, 0.2, 0.3, 0.4 or 0.5. For power calculation with single causal locus, we used the same settings as that in type I error simulation. When causal SNPs were not genotyped, we can see from Figure 1 that the cGFLM showed better power than all other methods. In the second scenario when two causal loci had reverse-sign effects, we included more SNPs in a block so that it would be possible to locate two weakly correlated markers within the region. We used p = 25 SNPs in a group and d 1 = d 2 = 4 as the number of spline bases. From Figure 2, we can see that cGFLM consistently demonstrates better power than other methods.   The association patterns of a sample simulation are presented in Figure 3. For smAT, a modified Manhattan plot of the −log10 ( p-values) by the sign of the fitted coefficients is used. For all other methods, the coefficient estimates are plotted. The causal loci is highlighted with dashed lines (left in red for positive effect, right in blue for negative effect). Compared to smAT and cGFLM, the cGFLM fitted coefficient function is more reasonable and correctly identify the causal loci.

Simulation Using ZINB Traits
In this case, we simulate phenotypic traits conditional on causal genotypes under the following ZINB model: The outcomes are generated with the latent mixture process as discussed in Section 3. The intercepts, α Ber 0 and α NB , were set to be 0.0 in all subsequent simulations. β causal was set to zero under the null hypothesis in the evaluation of type I error. Since the computational burden for ZINB models is much higher than that in the binary outcome model, we reduced the replicated times to 1000 times for each scenario. Type I error rates were investigated under small genome-wide thresholds (nominal α = 0.05, 0.01 and 0.005). For assessment of empirical power, we used similar settings as those in the simulation for binary traits. However, we examined two general scenarios where the genetic effect is in either the latent Bernoulli or the negative binomial models. Then for each general scenario, we first set only one causal SNP in the LD block, affecting either β Ber causal or β NB causal . Then we considered two causal loci with reversed sign effects in the LD block. The two causal loci chosen were weakly correlated (r 2 < 0.01). The corresponding regression coefficients were set to (β Ber causal1 , β Ber causal2 ), or (β NB causal1 , β NB causal2 ) where β causal1 = −β causal2 . A total of 100 replicates were run for each scenario. We compared the empirical power of our proposed model cGFLM-ZINB with three existing methods: functional linear model with negative binomial traits (GFLM-NB), single marker association test with ZINB traits(smAT-ZINB) and functional linear model with ZINB traits (GFLM-ZINB). p-values for smAT-ZINB were adjusted by Bonferroni correction, and p-values for GFLM-NB and GFLM-ZINB were calculated by the likelihood ratio tests. Table 2 demonstrated that the proposed cGFLM maintains the type I errors very well. Evaluation of empirical power is based on settings with regression coefficients ranging from 0.1 to 0.5 and sample sizes from 500 to 2000. When the genetic effect was set to be in the latent Bernoulli model, we can observe the apparent failure of using a negative binomial (NB) regression model, by looking at the significantly lower power when using the GFLM-NB model compared with other models in Figures 4 and 5. This fortifies our assumption that using a simple NB distribution will lead to loss of power when modeling genetic effects affecting excess zero in zero-inflated count process. While using ZINB models, similar performances were observed for smAT-ZINB, GFLM-ZINB and cGFLM-ZINB (see Figures 6 and 7). The figures demonstrate that ZINB models are more advantageous than the NB regression model when excessive zeros exist. More importantly, the cGFLM-ZINB model consistently shows the best performance among the tested models, in scenarios when one causal locus and two causal loci were set in the LD block. Collectively, these simulations demonstrate the advantages and robustness of our proposed cGFLM model. Table 2. Type I error simulation using cGFLM-ZINB for ZINB outcomes based on randomly generated LD blocks.      We then applied the cGFLM model to two independent studies to assess its practical usage. The first dataset is a candidate-gene-based SNP study and the other is a GWAS for dental caries.

Application 1: COGEND Study
According to the World Health Statistics Report [25], cigarette smoking is the single biggest cause of preventable mortality worldwide, causing more than 5 million deaths per year and accounting for one in 10 adult deaths. Nicotine dependence, the primary psychoactive component in tobacco, profoundly impacts people's ability to cease tobacco smoking. The etiology of nicotine dependence is multifactorial, and evidence from various epidemiology studies suggest that genetic factors have a substantial impact on smoking behaviors. Identification of these genetic factors and the development of targeted treatments could be promoted to further reduce smoking related morbidity and mortality.
The Collaborative Genetic Study of Nicotine Dependence (COGEND) is a nationwide project aiming to detect the genetic mechanisms and environmental features of nicotine dependence. In this study on CHRN candidate genes, a total of 216 SNPs were genotyped for 2022 individuals (1114 cases with nicotine dependence and 908 controls). In the phenotypic data, all cases and controls were current or former smokers who reported smoking more than 100 cigarettes lifetime. Rates of current nicotine dependence were defined by the Fagerstrom Test for Nicotine Dependence (FTND). Subjects having FTND ≥ 4 were classified as nicotine dependent (case). Subjects having lifetime FTND = 0 or 1 were classified as control. The original SNP set was divided into 12 LD blocks according to their physical locations and LD structure, all of which consist of one or more contiguous gene regions. Since functional models are not well-suited for LD blocks having small number of SNPs, four small blocks with fewer than seven SNPs were excluded for analyses. We applied our proposed method cGFLM, along with smAT, SKAT-C and GFLM to analyze the final dataset, which consists of 191 SNPs in eight LD blocks. Age, gender and race were included as covariates. A Bonferroni significance threshold of 0.05/8 = 6.2 × 10 −3 is used for cGFLM, GFLM and SKAT-C, and a threshold 0.05/191 = 2.6 × 10 −4 is used for smAT. Table 3 summarizes the results. All four methods yielded small p-values (cGFLM: 5.25 × 10 −6 ; GFLM: 8.24 × 10 −6 ; SKAT-C: 1.7 × 10 −3 ; smAT: 1.72 × 10 −4 ) for the CHRNA5 cluster ("IREB2 + LOC123688 + PSMA4 + CHRNA5 + CHRNA3 + CHRNB4" gene cluster) on chromosome 15. However, only cGFLM and SKAT-C showed significance for the "CHRNB3 + CHRNA6" gene cluster (cGFLM: 8.67 × 10 −4 ; SKAT-C: 1.3 × 10 −3 ). For these two blocks, p-values calculated by cGFLM are much smaller than those calculated by other methods. It is also worth mentioning that both gene clusters have been shown to be associated with nicotine dependence in previous studies [26][27][28][29]. Other LD blocks (candidate gene clusters) are not significantly associated with the phenotypic trait in this cohort.

Application 2: Whole Genome Association Study for Dental Caries
More than 40% children and adolescents, and 90% adults in the US are being affected by dental caries, or more commonly known as tooth decay. Multiple factors are considered to contribute to the risk of having dental caries, such as some environmental factors and social behaviors [30][31][32]. Evidence has shown that some individuals are more susceptible to caries while some others are more resistant, almost irrelevant to the environmental risk factors they are exposed to, suggesting that genetic factors may play crucial roles in the risk of developing caries [33]. According to several previous studies, the heritability of dental caries were evaluated to be as high as 60%.
To better understand the genetic mechanisms of the risk of dental caries, a GWAS study has been conducted as part of the Gene Environment Association Studies initiative (deposited in dbGaP Study Accession: phs000095.v2.p1) [4,34]. A total of 4020 individuals were genotyped with a large panel of SNPs (610,000) and examined with multiple outcomes. Our study focused on traits related to caries in permanent teeth. Two indexes, D1MFT and D1MFS, which quantifies the total permanent tooth/surface caries with white spots, were included in the analyses. Since the outcomes of interest were both count traits with excess zeroes (Figure 8), the proposed methods, zero-inflated negative binomial model (smAT-ZINB for single-marker tests) and its application with functional coefficient (GFLM-ZINB and cGFLM-ZINB), were applied to the data set. The final analytic sample consists of 1480 individuals with complete permanent teeth phenotypic data. Age, gender and total number of teeth/surfaces were included as covariates in the analyses. Tables 4 and 5 summarize the significant findings. The Manhattan plots for GWAS scans using the ZINB model are presented in Figures 9 and 10. For genome-wide univariate screening, a significance threshold of 1 × 10 −7 was used. For the LD block-based analysis, a genome-wide significance threshold of 2.5 × 10 −6 was used. Several SNPs were identified significantly associated with D1MFT and D1MFS in genome-wide scan, of which rs7990965 in chromosome 13 and rs1058595 in chromosome 10 demonstrate consistent significance for both traits. In the LD block based association tests, the cGFLM-ZINB model identified two significant genetic regions: PKDCC in chromosome 2 ( both traits), and the intergenic region between DCN and BTG1 in chromosome 12 (D1MFS). It is worth mentioning these two genetic regions cannot be identified by other competing methods, suggesting that the cGFLM model may provide potentially new insights into understanding the risk of dental caries. More interestingly, the gene PKDCC was found to be associated with craniofacial morphogenesis in previous dental studies [35], further supporting and validating the biological significance of our findings.

Discussion
Joint analyses of multiple contiguous SNPs, which can form an LD block, are expected to provide better inference about unknown causal variants, since these SNPs all carry partial information about the causal variants and collectively they should be more powerful. In principle, LD between genetic loci should decline with their intervening physical distance, given that more cross-over events occur within longer ranges. Therefore, how to effectively incorporate and take full advantage of the distance/alignment order information of these SNPs is an important yet challenging problem. While FLMs seem to provide a good solution, its estimated coefficient function is usually noisy and hard to interpret, which consequently leads to potential power loss. Improvements to existing methods are then of great interest in order to better detect significant genetic variants.
In this article, we proposed a novel cGFLM for flexible and more reasonable multi-loci mapping strategy. Our model is built upon the general FLM framework by imposing constraints to specify sign-specific effects. Our limited simulations suggest that these constraints encourage spatial sparsity in the estimated coefficient function, which needs to be further validated in other experimental settings. Due to these constraints, the likelihood ratio test statistic does not follow a fixed chi-square distribution, but a null weighted mixture of chi-square distributions. The simulation results show that compared to three competing methods, our proposed cGFLM generally demonstrated better power when effect size is moderate and large, and comparable performance when effect size is small, while at the same time maintaining correct type I errors.
Applications of the cGFLM to two real datasets demonstrate the applicability of our method. Particulary, its application to the GWAS of dental caries risk identifies new genetic regions that are have not been discovered before, validating that our method can potentially provide new insights into large-scale genomic studies. However, it is also important to note that some significant SNPs identified by other methods were missed by cGFLM as well. This is likely because in cGFLM, identification of significant SNPs is based on the collective evidence of a group of linked SNP, and in the case of causal SNPs being weakly linked to its neighboring SNPs, our methods may miss them. Additionally, cGFLM inherently would not work well for variants not in LD with neighboring SNPs, such as singletons and doubletons. Therefore, these observations suggest that our method should be considered as an additional approach in our toolbox for more discoveries, while not replacing existing ones.
When the cGFLM method is applied to large-scale genome-wide scanning of LD blocks, one concern is how to group SNPs into LD blocks. Several software, such as PLINK [36] and LDExplorer [37], have embedded functions to define LD blocks for genomic data, and we can utilize them to help partition the genome. Even with these tools, LD blocks can vary with different LD thresholds. That is, a weaker LD threshold can lead to larger LD blocks and vice versa. Another concern is about LD blocks with few SNPs, since they are not suitable for functional analysis. They can be either combined into adjacent larger blocks with weak LD, or can be just analyzed with those single-marker methods. Prospectively, in order to discover the core subset of causal genes, further group selection among multiple candidate blocks is of great interest, and regularization methods such as group LASSO and group SCAD can be included as extensions to our current framework. Additionally, in this study cGFLM only considers LD-based blocks for association analyses; however, it would be also of great interest to see if cGFLM can be applied to gene-level analyses as genes are functional units of the genome. Finally, if subpopulations exist in the genotypic sample, stratification can be addressed in our model by including principal components of population variation as additional covariates [38].

Conclusions
Our proposed GFLM, cGFLM and their related implementations with the ZINB model (GFLM-ZINB, cGFLM-ZINB) are novel and complex methods. They are specifically designed for multi-loci mapping in naturally formed LD blocks. The models can simultaneously incorporate multiple linked SNPs, including their physical alignments and block-wide linkage structure, and make inference at the LD block level. Simulation studies show that cGFLM and cGFLM-ZINB have desirable performances in terms of both simpler coefficients functions and empirical power gain. The practical usage of cGFLM is demonstrated with a candidate-gene study of nicotine dependence and a GWAS on dental caries. Considering its flexibility and comprehensiveness, cGFLM would be a very attractive method for future gene-based association studies.  Institutional Review Board Statement: Not applicable as this was secondary use of publicly available data.
Informed Consent Statement: Not applicable as this was secondary use of publicly available data.
Data Availability Statement: The application datasets are publicly available.

Conflicts of Interest:
The authors declare no conflict of interest. Table A1. Type I error simulation using cGFLM for binary outcomes based on the CHRNA7 gene.  Table A2. Type I error simulation using cGFLM-ZINB for ZINB outcomes based on the CHRNA7 gene.