Gene Association Analysis of Quantitative Trait Based on Functional Linear Regression Model with Local Sparse Estimator

Functional linear regression models have been widely used in the gene association analysis of complex traits. These models retain all the genetic information in the data and take full advantage of spatial information in genetic variation data, which leads to brilliant detection power. However, the significant association signals identified by the high-power methods are not all the real causal SNPs, because it is easy to regard noise information as significant association signals, leading to a false association. In this paper, a method based on the sparse functional data association test (SFDAT) of gene region association analysis is developed based on a functional linear regression model with local sparse estimation. The evaluation indicators CSR and DL are defined to evaluate the feasibility and performance of the proposed method with other indicators. Simulation studies show that: (1) SFDAT performs well under both linkage equilibrium and linkage disequilibrium simulation; (2) SFDAT performs successfully for gene regions (including common variants, low-frequency variants, rare variants and mix variants); (3) With power and type I error rates comparable to OLS and Smooth, SFDAT has a better ability to handle the zero regions. The Oryza sativa data set is analyzed by SFDAT. It is shown that SFDAT can better perform gene association analysis and eliminate the false positive of gene localization. This study showed that SFDAT can lower the interference caused by noise while maintaining high power. SFDAT provides a new method for the association analysis between gene regions and phenotypic quantitative traits.


Introduction
In recent years, with the development of high-throughput sequencing technology and the application of second-generation and third-generation sequencing platforms, unprecedented large-scale and high-dimensional genetic variation data have been generated. SNP (single nucleotide polymorphism) or CNVS (copy number variation) have become common genetic markers for studying the genetic mechanism of traits. Linkage disequilibriumbased association analysis has achieved great success in detecting the pathogenic genes of human diseases and the genetic structure of complex traits of animals and plants [1][2][3]. Genome-wide association analysis (GWAS) is a high-throughput genotyping technique, of which the millions of SNPs or CNVS use as genetic markers to identify causal genes by association analysis. GWAS has achieved primary success in the genetic studies of humans, animals and plants. GWAS falls into two categories of analysis projects: common variant association study (CVAS, Minor Allele Frequency (MAF) > 5%) and rare variant association study (RVAS, MAF <= 1~5% or MAF < 1%), where MAF <= 1~5% is called low-frequency variants and MAF < 1% is called rare variants. Most of the causal loci identified by the 3 of 32 variation on phenotypic traits and analyze the complex relationship between phenotypic traits and genetic loci, it will be more accurate to analyze the impact of genetic variation on phenotypic traits.
Lin et al. [33] proposed the fSCAD (functional smoothly clipped absolute deviation) method, which improved the original FLM by adding a SCAD (smoothly clipped absolute deviation) penalty item based on FLM. This method can accurately compress the zero areas of the model region to zero without excessively compressing the non-zero area of the model region, which also means that this method can allow unassociated variants to be ignored in gene association analysis and remain the true associated variants.
In consideration of these advantages of fSCAD and the problems of gene association analysis methods based on FDA, a new gene association analysis approach called sparse functional data association test (SFDAT) which is based on FDA [33] was proposed in this paper, and the computer simulation was used to evaluate the effect coefficient estimation accuracy, the type I error rate and power. The real data set of O. sativa was analyzed by SFDAT to demonstrate the applicability of real data of SFDAT.

Genetic Model
Let y i be the phenotypic value of ith individual. For i-th individual, the traditional linear genetic model can be expressed as: where x ij is a genotype profile (if A and a represent a pair of alleles, then when the genotype is AA, x ij is taken as 2; when the genotype is Aa, it is taken as 1; when the genotype is aa, it is 0). β j represents the effect coefficient of genetic marker, ε i ∼ N 0, σ 2 , σ 2 is the environmental genetic variance, K is the number of genetic markers. With the increase in the number of genetic markers, the degree of freedom gradually increases, and the multi-collinearity among variables becomes more and more serious, eventually leading to the reduction in estimation accuracy and power. This is especially true when the genetic markers are low-frequency variations. In order to reduce the degree freedom of the model and the multiple collinearities of variables due to low-frequency variation, the functional linear model (FLM) can be used instead of the multiple linear genetic model: where ε i is an independent and normal distribution with zero mean and variance σ 2 , [0, T] represents the genomic region under consideration, that is, a DNA fragment that contains multiple SNP loci, among which there may be SNP loci that can affect the target quantitative trait. The discrete genetic markers in Equation (1) are converted into the continuous genetic marker function, and the effects of genetic markers β j are also converted into a continuous genetic effect function β(t).
For Equation (2), the B-spline function is used to fit the genetic variants and the genetic effects. According to the functional data analysis method [26,34]: First, let l variants be in a sequence of their physical locations t 0 < t 1 ≤ t 2 · · · ≤ t l < t l+1 < · · · < t T = T which constitutes the genomic region [0, T]; Second, a series of B-spline basis functions are defined and let B k (t) be a B-spline basis function; Third, define M + 1 equidistant nodes 0 = v 0 < v 1 < · · · < v M = T in the interval [0, T]. After that, these discrete genetic variants can be expanded as a continuous function: where the coefficient could be obtained by minimizing the following equation: The coefficient u ik is estimated to beû i = (B T B) −1 B T X i . Similar to the genetic variants, the genetic effects also can be expanded as where θ(t) = [B 1 (t), · · · , B K β (t)] T = [θ 1 (t), · · · , θ K β (t)] T , β = [β 1 , · · · , β K β ] T .
The functional linear model of Equation (2) can be rewritten as · · · · · · T 0 B K G (t)θ 1 (t)dt · · · T 0 B K G (t)θ K β (t)dt      W = UJ Bθ , then Equation (6) can be rewritten as The form of the linear regression equation is: In the actual operation, the integral interval [0, T] can be converted into [0, 1].

Parameter Estimation
It can be seen from the equations established above that the genetic variants and effects in the functional linear genetic model are transformed into a continuous function through B-spline. Finally, the functional genetic model is transformed into the traditional linear regression model. In order to obtain the local sparse estimation of β(t), we use the method of parameter estimation based on the penalty function proposed by Lin et al. [33]. Lin et al. [33] proposed thatβ(t) andμ in Equation (2) could be estimated by minimizing the following loss function: where d, M, T as defined above, D m is the m order differential operator, m ≤ d, which we usually take m = 2, • is L 2 norm. As defined by Fan and Li's [35] p λ (•) is: Its domain is [0, +∞], the value of a could be 3.7 which is suggested by Fan and Li [35] and the value of λ is determined by the sample size. The γ D m β 2 term in the loss function Loss(β, µ) is the roughness penalty of β(t), it controls the smoothness of β(t), where parameter γ can further adjust the severity of roughness penalty, γ is called the smoothing parameter. Due to the roughness penalty, the functional linear regression model FLM has a certain ability to resist "noise". M T T 0 p λ (|β(t)|)dt is the local sparse penalty of β(t), it can compress the tiny β(t) directly to zero. The parameter λ determines how tiny value would be compressed, λ is called the compression parameter. In addition, the local sparse penalty will play different constraints according to the specific form of β(t).
For Equation (9), when γ = 0, λ = 0, Loss(β, µ) is the same as the loss function of ordinary functional linear regression model (FLM), the method of parameter estimation is called Ordinary Least-Square Estimator (OLS); when γ = 0, λ = 0, Loss(β, µ) equals to the loss function of the smoothed functional linear regression model, the method of parameter estimation is called Smoothing Spline Estimator (Smooth); when γ = 0, λ = 0, Loss(β, µ) is a loss function for the functional linear regression model with locally sparse. The method of parameter estimation is called smooth and locally sparse (SLoS) estimator. There are two advantages of SLoS's loss function: first, these rough results due to false correlation effects could be smoothed by the roughness penalty; second, the small and insignificant effects would be directly compressed to zero, which further reduces the false positive.

Test Statistics
Another major problem in the genetic study for quantitative traits is whether the association between genetic regions and phenotypic traits is real existence. In general, we consider the following hypothesis-testing questions: Since the genetic effect function is the expansion of the basis function, the above assumption is equal to the following assumptions: The statistic can be defined as: where RSS is the regression square sum of Equation (11), and ESS is the residual square sum of Equation (11). The above statistical test is for the entire gene region [0, T], let us move on to the test for the subgene area. Suppose N(β) is the zero value area of β(t) and S(β) is non-zero value area of β(t), then N(β) = {t ∈ [0, T] : β(t) = 0}, S(β) = {t ∈ [0, T] : β(t) = 0}. The estimationβ(t) has Oracle property for N(β) by SLoS estimator. Lin [33] have proven the following conclusion: if β(t) = 0 for any t ∈ [0, T], then the estimation ofβ(t) = 0 for any t ∈ [0, T] in probability. In other words, if the estimation ofβ(t) = 0 for any t ∈ [0, T], then β(t) = 0 for any t ∈ [0, T] in probability. Suppose N(β) is the zero value area ofβ(t) and S(β) is the non-zero value area ofβ(t), N(β) is convergence to N(β) in probability and S(β) is convergence to S(β) in probability. Therefore, the zero and non-zero area of β(t) represent that of β(t). The view of statistical genetic, the effect functionβ(t) to be zero means that there is no correlation between phenotypic traits and locus t; the effect functionβ(t) to be non-zero, it means that there is a correlation between phenotypic traits and locus t.
Therefore, it is necessary to establish indicators for SFDAT to estimate the effect function in zero or non-zero regions. On the one hand, it can measure the accuracy of the model estimation; on the other hand, it can provide a reference for a more accurate analysis of functional linear model regional association.

Indicators of Estimated Accuracy
There are numerous SNP loci in the gene region, and the identified causal SNP loci will inevitably have location deviation, so we regard the region with a total of 200 loci centered on the causal SNP loci as the region of acceptable deviation (Abbr. RAD), that is, we can accept that the identified causal SNP is within the RAD. Then, on the basis of being closer to reality, in order to evaluate the ability of the function to compress the zero regions on the one hand, and measure the accuracy of the function to identify the non-zero region on the other hand, we evaluate the identification ability and the region-selection ability of the model through the correct selection ratio (Abbr.CSR) of zero regions outside RAD and the discovery length (Abbr. DL) for non-zero regions in RAD, which was defined, respectively, by S 0 (β(t)) and S 1 (β(t)) are denoted as the length of β(t) in zero region and non-zero region. β(t) andβ(t) represent the real and estimated effect values at the locus t, respectively. For a good test method, its CSR should be enough large to handle non-association signals region effectively; in the meantime, the more precise ability to identify the association signals, the lower DL it has.
For areas with zero or non-zero effects in the integral region, the following integral squared errors (ISE) are defined by Lin [33]: where l 0 is the length of zero areas, l 1 is the length of non-zero areas. ISE 0 and ISE 1 can be used to estimate the error between estimatedβ(t) and true β(t) on zero and non-zero areas, respectively. In addition to the performance of model prediction, it is judged by prediction mean squared errors (PMSE): where test is the test individual set, N is the number of samples,μ andβ(t) are estimated of µ and β(t). According to the above definition, we can define ISE 0 , ISE 1 and PMSE as criteria for evaluating the accuracy of effect estimates for gene regions. To determine the degree of fitting on zero effects, we define A 0 denotes a set of variants loci in which no association exists, |A 0 | denotes the number of elements in set A 0 .β(t) and β(t), respectively, denote estimated effects and actual effect values on locus t in set A 0 . It indicates the degree of the overall deviation of the true and estimated values at the zero effects, the lower ISE 0 , the more accurate estimation of zero effects. To determine the degree of fitting on the non-zero effects, we define A 1 represents the set of associated variants loci in the region, |A 1 | represents the number of elements in set A 1 A 1 .β(t) and β(t) represent the estimated and actual effect values on locus t in set A 1 , respectively. It indicates the degree of overall deviation between true and estimated values at the non-zero effects, the lower ISE 1 , the more precise estimation of non-zero effects. To determine the degree of fit for the genetic model, we define test represents the test individual set, N represents the number of individuals in the test set, y i represents the true effect value of ith individual in the test set andŷ i represents the predicted value of ith individual in the test set, which indicates the overall deviation degree between true and estimated trait values at the test set, the lower PMSE, the more powerful predict ability.

Determine Tune Parameters
In the SFDAT method, the choice of parameter M value is not very important [36] as long as the selected M is large enough to reflect the local appearance of β(t) (including the area of zero). For selection of γ and λ, a series of candidate values are given and find out the optimal parameters based on cross-validation, generalized cross-validation, BIC (Bayesian information criterion), AIC (Akaike information criterion) or RIC (Risk Inflation Criterion).

Simulation Studies
In order to verify the feasibility and effectiveness of the SFDAT method, a computer simulation was carried out. The simulated SNP genotype data were used to study the power, type I error rate and estimation accuracy of the method. To compare the advantages of the SFDAT method, the OLS and Smooth methods were also adopted. The computer simulation code was written in R language.
The power of the model and type I error rate can be obtained by the following steps: firstly, test Equation (2) to obtain the p value of the genetic model under different assumptions; secondly, count the number of p values less than a certain threshold; thirdly, the counted number divided by the total number of simulations, and then the ratio under the non-zero hypothesis is the power and the ratio under the null hypothesis is the type I error rate. The reason for calculating in this way is: under a non-zero hypothesis, there is an associated variant in the region, if the p value is less than threshold α at this time, the phenotype trait and gene region have an associated relationship, so the ratio is the power; under the null hypothesis, there is no associated variant in the region, if the p value is still less than threshold α at this time, the phenotype trait and gene region have a false associated relationship, so this ratio become the type I error rate.
At last, a test set for each simulation (an additional 100 individuals from the simulated SNP genotype data) was generated to calculate the PMSE.

Simulated SNP Genotype Data
In the simulation, we consider the linkage equilibrium and linkage disequilibrium simulation. For the linkage equilibrium simulation, the simulated SNP genotype dataset consists of many simulated gene regions, each of which contains 900 SNPs, and the MAF of SNPs within each region is generated by uniform distribution U(a, b). In fact, we generate the MAF of each SNP through uniform distribution, then generate the corresponding SNP through the MAF, and finally, form the simulated gene region from these SNPs. For the generation of the simulated SNP genotype of linkage disequilibrium simulation, we refer to Wang and Pan [37,38] and set the measure of linkage disequilibrium 0.2 in the simulation.
In the simulation, the number of basis functions K G , K B is 15, the order d is 4, the node M is 11. A set of smoothing parameters γ [10 2 ,10 3 ,10 4 ,10 5 ,10 6 ] and a set of compression parameters λ [0.03, 0.04, 0.05, 0.06] are given here. In the calculation process, the optimal parameters will be automatically selected according to BIC.
For linkage equilibrium and linkage disequilibrium simulation, four kinds of gene regions will be discussed: rare variants gene regions, low-frequency variants gene regions, common variants gene regions and mixed variants gene regions. The rare variant's gene region is constituted by rare variants of which MAFs are generated by uniform distribution U(0.0005, 0.01); The low-frequency variants gene region is constituted by low-frequency variants of which MAFs are generated by uniform distribution U(0.01, 0.05); The common variants gene region is constituted by common variants of which MAFs are generated by uniform distribution U(0.05, 0.5); The mixed variants gene region is randomly composed by 60% rare variants, 15% low-frequency variants, and 25% common variants.
Simulated phenotypic trait values were generated by y = µ + ∑ i∈A x i β i + ε, where A is the set of causal SNPs, µ = 1 , ε ∼ N(0, 1). Two different types of simulation cases were considered: Case I: A total of three scenarios will be considered: Scenario I: setting a positive causal SNP at locus 450 in the gene region; Scenario II: setting a positive causal SNP at locus 100 and 800 in the gene region, respectively; Scenario III: setting a positive causal SNP and a negative causal SNP at locus 100 and 800 in the gene region, respectively. The effect size of each scenario is fixed at 5 (β = 5).
Case II: The value of genetic effect β is ln(c)× log 10 (MAF)|/2 (MAF is the minor allele frequency of the SNP). A total of 27 scenarios will be considered: the number of causal SNPs is 5, 10 or 20. Namely, the associated variants proportion of gene region (900 SNP) is 1/180, 1/90 and 1/45 ; the proportion of negative effect in causal SNPs was 0%, 20%, or 40%; The parameter c in the genetic effect ln(c)× log 10 (MAF i )|/2 equals to 3, 5 or 7.
The simulated gene regions of Case I and Case II were shown in Figure 1. For each case, we generated 2000 samples for each gene region to simulate, and all simulations were replicated 1000 times. CSR, DL will be calculated in Case I and power, ISE 0 , ISE 1 and PMSE will be calculated in Case II. Note that SFDAT can compress β(t) to 0 in the region of most unassociated SNPs. However, neither OLS nor Smooth has this ability, which results in their failure to compress unassociated SNP loci, that is, OLS and Smooth estimate the coefficients of these SNP loci with no genetic effect as non-zero. The ability of SFDAT to compress the non-effect region in common variants and low-frequency variants is stronger than that of mixed variants and rare variants, indicating that the gene regions with rare variants may limit the compression ability of SFDAT. SFDAT performs better in gene regions with only one causal SNP, and worst in gene regions with a positive and negative causal SNP. Meanwhile, compared with linkage equilibrium, SFDAT performs better in the simulation of linkage disequilibrium. As can be seen from Figure 3, OLS, Smooth and SFDAT can all find causal signals in RAD, but the signal regions found by SFDAT are more concentrated. OLS and Smooth explore the causal signal at each locus in the RAD, while SFDAT only identified the causal SNPs in part of the RAD under all cases. This is because OLS and Smooth do not have the ability to compress the zero region, resulting the noise fluctuations when estimating the effect functions on the unassociated SNPs loci, which mistakenly deems all the loci have the effect. SFDAT cannot only perfectly detect unassociated SNPs regions, but also accurately identify causal SNP loci. When there is only one causal SNP in the gene region, SFDAT can detect the locations of rare variants more accurately. The DL 100 probed by SFDAT is similar to DL 800 under scenario II and scenario III (Gene regions exist two causal SNPs). In general, the DL of linkage disequilibrium simulation is smaller than that of linkage equilibrium, that is, the location of the identified causal SNPs is more concentrated.

Figures 2 and 3 show the CSR and DL of OLS, Smooth
, and SFDAT in the three s narios of Case I, respectively. Note that SFDAT can compress ( ) to 0 in the region most unassociated SNPs. However, neither OLS nor Smooth has this ability, which resu in their failure to compress unassociated SNP loci, that is, OLS and Smooth estimate coefficients of these SNP loci with no genetic effect as non-zero. The ability of SFDAT compress the non-effect region in common variants and low-frequency variants stronger than that of mixed variants and rare variants, indicating that the gene regio with rare variants may limit the compression ability of SFDAT. SFDAT performs better gene regions with only one causal SNP, and worst in gene regions with a positive a negative causal SNP. Meanwhile, compared with linkage equilibrium, SFDAT perfor better in the simulation of linkage disequilibrium. As can be seen from Figure 3, O  Table 1 illustrates the power of three methods of Case II under the significant level 0.01. As can be seen from Table 1, under the simulation of linkage equilibrium, the power for common variant regions and low-frequency variant regions are similar, the powers of rare variants regions and mixed variants regions are similar. The powers of the former (common variants and low-frequency variants) are higher than that of the latter (rare variants and mixed variants), and the powers of rare variant regions are also lower than that of mixed variant regions. That is because the former does not contain rare variants, the latter contains rare variants and the rare variants contained in the mixed variants regions will be fewer than the rare variants regions, indicating that the gene regions that do not contain rare variants have a higher power, the power of the gene region containing common variants is higher than that of a gene region containing only rare variants. For common variant regions and low-frequency variant regions, there are similar powers in various situations for the OLS, Smooth and SFDAT methods, and the OLS method is slightly better. However, for rare variants regions and mixed variants regions, the power of the OLS method is significantly better. Obviously, the powers of methods are higher when the number of causal SNPs or the effect size increases, but when the proportion of negative effect increases, the powers of methods are just the opposite. In rare variant regions, the detection results are unstable while the causal SNPs contained in the gene region become less. Finally, it has a higher power in all cases, when the number of pathogenic SNPs in the gene region reaches 20. It is shown that similar performance patterns are observed in linkage disequilibrium. However, compared with the simulation of linkage equilibrium, the power of linkage disequilibrium has improved enormously. This is owing to the overall effect of gene regions as the correlation of loci.  The ISE 0 and its standard deviation of three methods for linkage equilibrium and linkage disequilibrium under Case II are shown in Table 2. From Table 2, under the simulation of linkage equilibrium, the standard deviation increases with the MAF decreases, it shows that the common variants fit better in the region where the effect is zero; the ISE 0 value or standard deviation of the OLS method is the largest among three methods, while there are smaller and similar results for the Smooth and SFDAT methods; the ISE 0 standard deviation of the Smooth method has a larger deviation than that of SFDAT method when the number of causal SNPs in the gene region is small; the ISE 0 values are similar (in other words the degree of the fitting is close) when the number of the causal SNPs and the effect size are the same regardless of the proportion of the negative effect; the ISE 0 and its standard deviation increase while the number of the causal SNPs and the effect size increase. Compared to the simulation of linkage equilibrium, ISE 0 decreased significantly in the regions of common variants and mixed variants under linkage disequilibrium, and slightly increased in the regions of low-frequency variants and rare variants. It is also verified that the models can better fit the zero region of common variants. Table 3 displays the ISE 1 and its standard deviation of three methods under linkage equilibrium and linkage disequilibrium based on Case II. In the situation of linkage equilibrium, the standard deviation increases as MAF decreases which indicated that three methods fitted better on the common variants in the non-zero region; the ISE 1 and its standard deviation of the three methods were similar: as the effect size c or the proportion of the negative effects increases. Under linkage disequilibrium, the results of OLS, Smooth and SFDAT in fitting non-zero regions of common variants, low-frequency variants and rare variants decreased slightly. When fitting the rare variants that do not exist negative causal SNPs, the fitting results are also slightly lower than that of linkage equilibrium, but the ISE 1 of linkage disequilibrium simulation is significantly lower than that of linkage equilibrium when the rare variants region exists negative causal SNPs.   Table 1 illustrates the power of three methods of Case II under the significant lev 0.01. As can be seen from Table 1, under the simulation of linkage equilibrium, the pow for common variant regions and low-frequency variant regions are similar, the powers rare variants regions and mixed variants regions are similar. The powers of the form (common variants and low-frequency variants) are higher than that of the latter (rare v iants and mixed variants), and the powers of rare variant regions are also lower than th of mixed variant regions. That is because the former does not contain rare variants, t  Table 4. In general, the functional-based genetic model containing rare variants fits better than those containing common variants. PMSE and its standard deviation of the three methods are similar. PMSE and its standard deviation increase with the number of causal SNPs or effect size, while the proportion of negative effect only has little influence on it. Compared with linkage equilibrium simulations, PMSE and its standard deviation of linkage disequilibrium decrease significantly in common variants, low-frequency variants and rare variants, but slightly increase in mixed variants.      Note: The average ISE 1 value is shown above each data unit, and the standard error of ISE 1 is shown in parentheses below. r 2 represents the measure of linkage disequilibrium, r 2 equals to 0 means there is linkage equilibrium between SNP.    Note: The average PMSE value is shown above each data unit, and the standard deviation of PMSE is shown in parentheses below. r 2 represents the measure of linkage disequilibrium, r 2 equals to 0 means there is linkage equilibrium between SNP.

The Type I Error Rate
We set five sample sizes of 500, 1000, 1500, 2000 and 2500, each sample size is simulated 10,000 times. Simulated traits are generated by model y = µ + ε, where µ = 1, ε ∼ N(0, 1). The significance levels are 0.05, 0.01, 0.001 and 0.0001. Table 5 summarizes the type I error rates of OLS, Smooth and SFDAT under the linkage equilibrium and disequilibrium simulation. It can be seen from Table 5 that Smooth and SFDAT controlled type I error rates correctly across all sample sizes and all significance levels. While the type I error rates of OLS is severely inflated, which means the use of OLS for gene association analysis produces false positives that can lead to false associations. Note that the SFDAT method will appear conservative when the sample size and significance level are relatively small, but as the two increase, the SFDAT method also reaches a sufficient level of significance. Relative to linkage equilibrium, the type I error rates of the three models under linkage disequilibrium increase in smaller sample sizes (500, 1000, 1500) and decrease in larger sample sizes (2000, 2500). Combining Figures 2 and 3 and Tables 1-5, SFDAT has a competitive performance to the OLS and Smooth in terms of location of non-causal SNP regions and identification of causal SNPs regions, as well as power, the type I error rates and other indicators. It is appreciable that OLS has relatively higher power, but its ISE 0 , the standard deviation of ISE 0 and type I error rates are larger than those of Smooth and SFDAT methods, which declares that there may be a false correlation in the association analysis using OSL method, and the zero effect may be recognized as a non-zero effect. In addition, although Smooth has a similar performance to SFDAT in power, ISE 1 and PMSE, it does not have the capacity to shrink sparse regions, which leads to noise fluctuations in the application of real data usually and further causes false association. Therefore, if we consider the SFDAT's performance of the power, type I error rates and combine its performance of other indicators (the CSR, DL, ISE 0 , ISE 1 and PMSE), the SFDAT method would be an excellent method which has extraordinarily high power and has a marvelous ability to reduce false positives. of the eight gene regions to be ested. We perform association analysis on the eight gene regions to be detected for each candidate SNP. The associated gene regions detected by OLS, Smooth and SFDAT at different significant levels are shown in Table 7

Discussion
GWAS is a new strategy that uses millions of SNPs in the genome as molecular g

Discussion
GWAS is a new strategy that uses millions of SNPs in the genome as molecular genetic markers to conduct genome-wide comparative analysis or association analysis, and to find out the genetic variation that affects complex traits through comparison. In 2005, Science magazine reported the first age-related GWAS study of macular degeneration [40]. For more than a decade, research on genome-wide association analysis has grown rapidly, but most methods are aimed at common variants. In recent years, more and more scholars have begun to pay attention to the study of rare variants. With the development of a new generation of high-throughput sequencing technology, TB or even more sequence data will be generated every day, and the data will be gradually changed from discrete to dense. We can regard it as continuous data, and thus functional data analysis methods have emerged. It can be seen from the above analysis that it can analyze both common and rare variants [26]. In recent years, more and more articles on functional data analysis have been published in genome-wide association analysis [26,31,32,[41][42][43][44]. The association analysis method based on the functional linear regression model can not only estimate the additive and dominant effects of genes, but also estimate the epistasis effects of genes [31,32], and extended to the study of dynamic development and multiple traits, Li [45] proposed a longitudinal functional data association test (LFDAT) based on the function-function regression model, which can provide a feasible method for studying the formation and expression of longitudinal traits. Li [46] put forward an integrative functional linear model for GWAS with multiple traits, which effectively accommodates the high dimensionality of SNPs and correlation among multiple traits. However, current analysis methods can develop only based on SNP gene region, it is impossible to further study whether SNP inside gene regions are associated with phenotypic traits.
In practice, gene loci are linkage disequilibrium with each other. The simulation results of this paper also show that most of the indictors under the linkage disequilibrium are better than that based on the linkage equilibrium, especially since the power of linkage disequilibrium is much higher than that of linkage equilibrium. Therefore, there is a considerable gap between the simulation results with a measure of 0.2 linkage disequilibrium and linkage equilibrium, so this paper does not further explore the simulation of linkage disequilibrium with a higher measure. Figure 5 shows the effect of functionβ(t) curve by OLS, Smooth and SFDAT methods. Firstly, from Figure 5, it can be seen that the effect function of the OLS and Smooth methodŝ β(t) have frequent fluctuations. Compared to the OLS and Smooth methods, the estimated effect functionβ(t) by the SFDAT method could smooth the effect value which real effect values are zero to around zero and still retain the non-zero part of the real effect. Combining the CSR and DL of SFDAT, it shows that the smoothing function can indeed remove some "noise", which also explains the reason why false associations are detected by the OLS and Smooth methods. Secondly, there are similar results for non-zero genetic effects estimated to use the Smooth and SFDAT methods. This shows that the compression capability of the SFDAT method can be regarded as the further compression of the effect value close to zero on the basis of the smooth estimation results, while retaining the non-zero effect. Therefore, if the compression parameter is small and there are no effect values worth compressing in the gene region, the estimated effect function of the SFDAT method should be consistent with that of the Smooth method. It is the reason why the Smooth and SFDAT methods have similar power, ISE 0 , ISE 1 , PMSE and the type I error rate in computer simulations, but relatively speaking, SFDAT still has a higher accuracy. In addition, in the application of real data, QTL analysis often takes a lot of time. Therefore, during GWAS, all gene regions can be quickly scanned through SFDAT to find the gene regions where significant SNP loci are located, and then QTL analysis can be performed on these gene regions to accurately locate the positions of significant SNPs, which can save a lot of time and improve efficiency beyond all doubt. the type I error rate in computer simulations, but relatively speaking, SFDAT still has a higher accuracy. In addition, in the application of real data, QTL analysis often takes a lot of time. Therefore, during GWAS, all gene regions can be quickly scanned through SFDAT to find the gene regions where significant SNP loci are located, and then QTL analysis can be performed on these gene regions to accurately locate the positions of significant SNPs, which can save a lot of time and improve efficiency beyond all doubt.  It must be pointed out that Luo et al. [26] proposed a functional linear regression model for QTL association analysis based on next-generation high-throughput sequencing (NGS), which used the functional linear regression model method (FLM). The smoothed functional linear model (SFLM) and eight other statistical methods (WSS, VT, RVT1, RVT2, PCA regression, multiple regression, simple regression and SKAT) were compared in six cases (uniformity of effects means the same direction; heterogeneity of effects means different directions; only low-frequency variants; all variants; different proportions of causal variants; different proportions of variants included). It found that the FLM method and SFLM method are similar in each case, but they are obviously superior to other methods, including collapsing-based methods RVT1, RVT2, kernel-based methods SKAT. The FLM and SFLM proposed by Luo et al. [26] differ from our OLS and Smooth methods in loss function Loss(β, µ). The first term of loss function Loss(β, µ) for the OLS, Smooth and SFDAT methods is T 0 X i (t)β(t)dt] 2 divided by the sample size n, the first term of loss function Loss(β, µ) for the FLM and SFLM methods is not divided by the sample size n. However, the idea of FLM and SFLM is similar to that of the OLS, Smooth and SFDAT methods. SFLM and Smooth method are only an adjustment among parameters. Although the OLS, Smooth and SFDAT methods have not been compared with traditional methods, the SFDAT method should be a good method according to Luo et al. [26] and our simulation's results. It should be noted that the SFDAT method is only based on a single-gene region for a one-by-one search in this paper. In fact, we can extend the method to multiple-gene regions detection which is our next research direction.
In the application of the functional linear regression model for gene association analysis, we mainly convert the functional linear regression model into the classical linear regression model for parameter estimation and statistical tests. However, we know that gene variants have common variants and rare variants so there are unique methods for rare variant detection [47]. This is especially true for statistical test problems of functional linear regression model which has also been proposed by some scholars [48]. Therefore, how to better perform statistical tests on gene association analysis using the functional linear regression models remains to be further studied.

Conclusions
In this paper, to address the problem of the sparsity of gene regions in association analysis, we develop a functional linear model with a SCAD penalty to genome-wide association analysis and propose a sparse functional data association analysis test (SFDAT) method. SFDAT can compress unassociated SNP loci in gene regions without reducing the power too much, and reduces false positives in association analysis. Our simulation studies and real data analysis also show that SFDAT can detect non-effect SNP loci more accurately and compress their coefficients to zero compared to OLS and Smooth, while maintaining higher power and lower false positives. Thus, SFDAT is a powerful tool for GWAS using next-generation sequencing data.

Data Availability Statement:
Publicly available datasets were analyzed in this study. The data can be found here: www.ricediversity.org/44kgwas (accessed on 1 December 2022) and www.gramene.org (accessed on 1 December 2022).