Gene-Based Genome-Wide Association Study Identified Genes for Agronomic Traits in Maize

Simple Summary Genome-wide association studies (GWAS) have successfully detected many SNPs related to complex quantitative traits. However, SNPs significantly associated with quantitative traits usually have only mild effects. Quantitative traits are usually caused by the combined effects of multiple loci in a gene. Maize is one of the world’s most important foods and feed crops. Earlier silking, kernel oil concentration, and fatty acid composition are all important agronomic traits in maize. To further explore the gene-level variations affecting maize economic traits, we propose an efficient gene-based GWAS method. We applied this method to the economic traits of maize and identified many candidate genes. Many of the same candidate genes were found in the analysis of related maize traits, which proved the reliability of our method. These findings will provide a theoretical basis for maize breeding with the targeted earlier silking and kernel oil concentration traits. Abstract A gene integrates the effects of all SNPs in its sequence span, which benefits the genome-wide association study. To explore gene-level variations affecting economic traits in maize, we extended the SNP-based GWAS analysis software Single-RunKing developed by our team to gene-based GWAS, which used the FaST-LMM algorithm to convert the linear mixed model into simple linear model association analysis. An F-test statistic was formulated to test and identify candidate genes. We compared the statistical efficiency of using 80% principal components (EPC), the first principal component (FPC), and all SNP markers (ALLSNP) as independent variables, which predecessors commonly used to integrate SNPs and represent genes. With a Huazhong Agricultural University (HAU) genomic dataset of 2.65M SNPs from 540 maize plants, 34,774 genes were annotated across the whole genome. Genome-wide association studies with 20 agronomic traits were performed using the software developed here. Another maize dataset from the Ames panel (AP) was also analyzed. The EPC method fits the model well and has good statistical efficiency. It not only overcomes the false negative problem when using all SNP markers for analysis (ALLSNP) but also solves the false positive problem of its corresponding simple linear model method EPCLM. Compared with FPC, the EPC method has higher statistical efficiency. A total of 132 quantitative trait genes (QTG) were identified for the 20 traits from HAU maize dataset and one trait of AP maize.


Introduction
Genome-wide association studies (GWAS) have successfully detected many SNPs related to complex quantitative traits. However, the method of using single SNPs for association studies has its disadvantages. For example, SNPs that have been shown to be significantly associated with complex diseases usually have only mild effects [1]. Common diseases are usually caused by the combined effects of multiple loci in a gene. If only tested genes. If only testing the large or highly significant genes obtained by EMMAX, it can help further reduce the whole genome gene association analysis to one or two rounds of whole genome regression scans. Based on these ideas, the Single-RunKing software [23] was developed by our team to perform the rapid whole genome mixed model association study. We have applied this software to the analysis of haplotypes [24]. However, performing gene-based GWAS analysis would be more biologically meaningful, considering that genes are the basic physical and functional units of heredity that control biological traits [25]. In addition, the results of gene-based GWAS are also helpful for further research on pathway-based GWAS. Based on the Single-RunKing software and further considering the independent variables, we adopted using 80% PCs as the independent variable and named it the EPC method. The PC independent variable for each gene was obtained by performing principal component analysis on the SNPs within each gene block separately. We compared EPC's efficiency with two other linear mixed model methods: One is the traditional method ALLSNP, which represents each gene with its internal SNPs to perform genome-wide gene association analysis. The other one uses the first PC of each gene as the independent variable to perform the linear mixed model regression analysis with the phenotype, which is named the FPC method. The results of the simple linear model method (EPCLM), which corresponds to the linear mixed model method EPC, are also compared. By reanalyzing twenty traits of the Huazhong Agricultural University (HAU) maize dataset [26] and one trait of the Ames panel (AP) maize dataset [27], the EPC method is proved to be more efficient than ALLSNP, FPC, and EPCLM in terms of model fitting, quantitative trait gene (QTG) identification. We proposed a simple and efficient gene-based GWAS method in this study and identified candidate genes for maize economic traits.

Maize Genomic Data Processing
In this study, we analyzed AP and HAU maize datasets to assess the performance of our proposed method. Due to a large number of individuals in the AP dataset, we used it to carry out simulation experiments. The case analyses were conducted using AP and HAU datasets, respectively. The AP maize dataset consists of 2279 inbred lines, with 681,258 SNPs genotyped. The trait we analyzed is days to silking (DTS). The AP maize datasets are free to download from the website (http://www.panzea.org/#!genotypes/cctl (accessed on 6 November 2022)) [27]. The 540 maize inbred lines of the HAU maize datasets were from a global collection [28], including representative temperate and tropical/ subtropical inbred lines. More than 2.65M SNPs were obtained for these 540 individuals, 1.25 M of which had a MAF ≥ 5% and were used for further studies. The HAU maize datasets are publicly available on the website of Jianbing Yan (http://www.maizego.org/ Resources.html (accessed on 6 November 2022)) [26]. The traits analyzed were kernel oil concentration and fatty acid composition, measured in multiple environments as described in the previous study [29]. Table 1 lists the details of the 20 agronomic traits in the HAU maize dataset.
Because genes are to be used as genetic units for GWAS analysis, SNP markers need to be assigned to each gene first. Both AP maize data reference genome annotation file ZmB73_5a.59_WGS.gff3 and HAU maize data reference genome annotation file ZmB73_5b.60_FGS.gff were downloaded from the website https://www.maizegdb.org/ (accessed on 6 November 2022). Then, SNP markers were assigned to genes. Multiple SNPs within a gene may collectively have a large effect but individually contribute little. The gene-based GWAS method can find genes containing these minor mutations. Because a certain proportion of genes only contain a few SNPs, in order to focus on the integration effects of genes on their internal SNP effects under different methods, the genes with more than 10 SNP markers were further screened for the following analysis, and their corresponding SNP markers form the final genotype matrix.
The 681,258 SNPs of AP maize were annotated into 37,292 genes, while a total of 16,893 genes with at least 10 SNPs were screened for further gene-based GWAS analysis, covering 347,481 SNPs, accounting for 51.01% of initial SNP markers. This way, the number of remaining genes after screening is not too small, and each gene contains sufficient SNPs. We finally analyzed all genes in the actual case analysis and listed the complete results in Table S3. The 1.25 M SNPs of HAU maize data were correspondingly annotated into 34,774 genes, while a total of 24,594 genes with at least 10 SNPs were screened for further gene-based GWAS analysis, covering 932,712 SNPs, accounting for 74.6% of the initial SNP markers.

FaST-LMM for Genes
The general LMM of GWAS can be expressed in matrix notation as: where y is the objective trait of n individuals, µ is the population mean, β is the additive genetic effect of the tested genes, a is the random polygenic effect of the mixed model, which obeys distribution N n 0, Kσ 2 a , where K is the realized relationship matrix (RRM) [30][31][32][33] calculated by genetic markers and σ 2 a is the polygenic variance, ε is a vector of errors, which follow a distribution N n 0, Iσ 2 ε , where σ 2 ε is the residual variance, 1 is a column vector of 1, X and Z are the corresponding design matrices for β and a, respectively.
According to the distribution assumptions of the additive genetic and random residual effects, the adjusted phenotypic variance-covariance matrix can be described as follows: After replacing σ 2 a with polygenic heritability h 2 = σ 2 a / σ 2 a + σ 2 ε , the phenotypic variance-covariance matrix becomes: Spectrally decompose K = USU T according to the FaST-LMM algorithm [20], where U and S are eigenvectors and eigenvalues of K (RRM), respectively. U T denotes the transpose of U. As U is an orthogonal matrix (UU T = I), the variance-covariance matrix can be described as: Let y = U T y and X = U T [1 X], and then the LMM can be changed to a linear model (LM) in the form as follows: where e ∼ N n 0, Wσ 2 ε with W = h 2 1−h 2 S + I being the diagonal matrix. Using the weighted least square (WLS) or maximum likelihood (ML) estimation procedure, the parameters of β and σ 2 ε can be estimated by maximum likelihood as: Usingβ andσ 2 ε , the ML function of the LM is constructed as follows: To further simplify the log-likelihood value as: where the polygenic heritability h 2 has been integrated into W. Therefore, we can optimize the log-likelihood function by using a one-dimensional scan within the open interval (0, 1) of h 2 to obtain the maximum likelihood estimate. Meanwhile, the additive genetic effect size of the tested genes can be statistically inferred byβ andσ 2 ε using optimized h 2 . F statistic is constructed for the gene as: where the degrees of freedom d f β is the number of PCs screened from the tested gene and

Implementation
As mentioned above, using the re-weighted least squares estimations of genes effects and optimizing polygenic heritabilities, FaST-LMM converts the whole genome mixedmodel association analysis into a simple linear model association analysis. In order to improve the calculation speed, we used the fastLmPure function in the R language package RcppArmadillo to perform regression analysis on the tested genes. The calculation speed of regression analysis using fastLmPure function is faster than that of lm function. Because fastLmPure just outputs the genetic effect and standard error for the tested gene, after running the fastLmPure function, we need to additionally calculate statistics such as σ 2 ε , −2logL, student t-value, and p-value.
The input variables are obtained by converting X and y into X' and y', respectively. After polygenic heritability is given, we can get the weighted diagonal matrix W. Then, the independent and dependent variables can be obtained according to (X * = W − 1 2 X) and (y * = W − 1 2 y). After preparing these variables, the process of solving LMM through barebones regression can be described as the following subroutine: lmm <-function(ystar, xstar, w){ fit0 <-fastLmPure(y = ystar, X = as.matrix(xstar[,1])) yd <-ystar -xstar [,1]*fit0$coefficients [1] ssy <-sum(yd^2) fit <-fastLmPure(y = ystar, X = xstar) resi <-ystar-xstar%*%fit$coefficients sse <-sum(resi^2) ssr <-ssy-sse dfe <-fit$df.residual ve <-sse/dfe dfb <-ncol(xstar)-1 F <-(ssr/dfb)/ve p <-1-pf(F, dfb, dfe, lower.tail = FALSE) logL<-log(det(w)) + nobs*log(ve) } Gene heritability is the proportion of phenotypic variation explained by the tested gene. Theoretically, subtracting the heritability of the tested gene from the genomic heritability of the trait yields the polygenic heritability of the gene. Although the polygenic heritabilities of genes are different. However, because most genes do not have an effect on the quantitative trait except QTGs, the polygenic heritabilities of genes are quite close to the genomic heritability of the trait. We estimated the genomic heritability of the trait by the LMM without gene effects whose residual variance is the genetic variance. Then for the polygenic heritability of the tested gene, we can quickly find its maximum likelihood estimate by searching down from the estimate of the genomic heritability of the quantitative trait.
Substituting genome heritability for the polygenic heritability of each gene can simplify the aforementioned fast regression scan to the EMMAX algorithm [15]. The fastLm-Pure function is used, and the polygenic heritabilities no longer need to be optimized so that the whole genome scanning speed reaches the highest value. We can use EMMAX to estimate genetic effects and statistical probabilities as a reference for rapid regression scanning of each gene. We just select genes with a high significance level (0.05 or 0.01) or large effects in the EMMAX algorithm and optimize their polygenic heritability estimation. In this way, computational efficiency can be further improved. Therefore, the time complexity of whole genome LMM association analysis turns into O (imn), and i is the time spent in whole genome regression scans (1 < i ≤ 2). Relying on this, the software Single-RunKing was designed to carry out whole genome LMM association analysis on genes at a very fast speed (for the software code used for analysis, please refer to the Additional File 1 or download through the link (https://pan.baidu.com/s/1PSip3OUXOcRhnOZQynPuRQ? pwd=d3ib (accessed on 6 November 2022)).

Simulations
A simulation study is conducted to investigate the statistical behavior of our proposed method using AP maize data [27]. Groups of 100, 300, 600, and 1000 QTNs were randomly assigned to the SNPs within our randomly selected 100 genes from the 16,893 genes of AP maize data, and these genes were used repeatedly in the following repeated simulations. The simulated QTNs were set to account for 60% of the phenotypic variation, and the genetic effects of these QTNs were obtained by sampling from a gamma distribution (shape = 1.66 and scale = 0.4). Next, principal component analysis was performed on the SNP markers in each gene block of the genotype matrix to make the SNP markers independent of each other. After that, each gene's first PC and 80% PC were screened out as independent variables to represent the tested genes for GWAS analysis. Figure 1 shows the frequency of genes that use different numbers of SNPs in ALLSNP method and the frequency of genes that screen out different numbers of PCs by EPC method. The kinship matrix is constructed based on the genotype matrix composed of all SNP markers. Finally, we repeated each simulation 50 times to get the final average result.  The simulations in this study were executed on a server with a configuration of 512 GB of RAM and 2.60-GHz Intel Xeon E5-2660 Opteron(tm) Processor, and the operating system is CentOS 6.5. ALLSNP, FPC, and EPC barebones regression scans took 24.204, 15.886, and 19.870 min, respectively, which were faster than the time taken by the linear model performed with the R/lm function (32.331, 24.353, and 31.770 min). More importantly, the statistical properties of linear mixed model methods such as EPC and FPC are much better than that of the linear model method. The false-negative/false-positive error rates were evaluated based on Q-Q plots. As shown in Figure 2, most part of the real line for −log10(p) obtained by the FPC and EPC method almost overlaps with the theoretical expectation, while only the high end of the line flies up due to the significant genes. This suggests that FPC and EPC exhibit good statistical properties and fit the model better than EPCLM and ALLSNP methods. EPCLM inflates test statistics rigorously, while ALLSNP deflates test statistics. The simulations in this study were executed on a server with a configuration of 512 GB of RAM and 2.60-GHz Intel Xeon E5-2660 Opteron(tm) Processor, and the operating system is CentOS 6.5. ALLSNP, FPC, and EPC barebones regression scans took 24.204, 15.886, and 19.870 min, respectively, which were faster than the time taken by the linear model performed with the R/lm function (32.331, 24.353, and 31.770 min). More importantly, the statistical properties of linear mixed model methods such as EPC and FPC are much better than that of the linear model method. The false-negative/false-positive error rates were evaluated based on Q-Q plots. As shown in Figure 2, most part of the real line for −log10(p) obtained by the FPC and EPC method almost overlaps with the theoretical expectation, while only the high end of the line flies up due to the significant genes. This suggests that FPC and EPC exhibit good statistical properties and fit the model better than EPCLM and ALLSNP methods. EPCLM inflates test statistics rigorously, while ALLSNP deflates test statistics.  A Bonferroni-corrected critical threshold of 5.529 was calculated at the 5% significance level based on the number of genes subjected to genome-wide association analysis [−log10(0.05/16,893) = 5.529]. This threshold was used to declare the significance of genes. If the tested gene passes this critical threshold and contains a pre-placed simulated QTN, then a QTG is identified. Statistical power is defined as the number of simulated QTGs identified. Statistical power plots corresponding to different type-I error levels under different QTN settings are shown in Figure 3. The power level of EPC is always higher than that of ALLSNP and FPC. No comparison to EPCLM was made because it has very high false-positive rates. [−log10(0.05/16,893) = 5.529]. This threshold was used to declare the significance of genes. If the tested gene passes this critical threshold and contains a pre-placed simulated QTN, then a QTG is identified. Statistical power is defined as the number of simulated QTGs identified. Statistical power plots corresponding to different type-I error levels under different QTN settings are shown in Figure 3. The power level of EPC is always higher than that of ALLSNP and FPC. No comparison to EPCLM was made because it has very high false-positive rates.

Case Analyses
We analyzed 20 agronomic traits of HAU maize and the days to silking (DTS) trait of AP maize [27] to evaluate the performance of the EPC method. On the same server used in the previous simulation experiments, ALLSNP, FPC, and EPC barebones regression scans took an average of 2.118, 1.159, and 1.272 min for the 20 traits in the HAU maize data, respectively. The AP dataset has more individuals than the HAU dataset, thus the running time of the three methods has increased accordingly. For the DTS trait in AP maize data, ALLSNP, FPC, and EPC barebones regression scans took 22.059, 17.052, and 19.671 min, respectively, which were faster than the time taken by the linear model performed with the R/lm function (29.545, 26.512, and 30.885 min). If only genes that pass the significance threshold of 0.05 are optimized based on EMMAX, the regression scan runtime will be reduced to 6.312, 4.901, and 5.619 min.
The Q-Q profiles of the analyzed traits are depicted on the right side of Figures 4, 5, and S1-S19. As shown in Table 1, for the traits analyzed above, the corresponding genomic control values (GC) for ALLSNP, FPC, EPC, and EPCLM were obtained, respectively. For example, for trait C16:0/C16:1, the GC values of methods ALLSNP, FPC, EPC, and EPCLM are 0.381, 1.017, 1.035, and 2.508, respectively. These results suggest that ALLSNP deflates test statistics significantly, whereas EPCLM significantly inflates test

Case Analyses
We analyzed 20 agronomic traits of HAU maize and the days to silking (DTS) trait of AP maize [27] to evaluate the performance of the EPC method. On the same server used in the previous simulation experiments, ALLSNP, FPC, and EPC barebones regression scans took an average of 2.118, 1.159, and 1.272 min for the 20 traits in the HAU maize data, respectively. The AP dataset has more individuals than the HAU dataset, thus the running time of the three methods has increased accordingly. For the DTS trait in AP maize data, ALLSNP, FPC, and EPC barebones regression scans took 22.059, 17.052, and 19.671 min, respectively, which were faster than the time taken by the linear model performed with the R/lm function (29.545, 26.512, and 30.885 min). If only genes that pass the significance threshold of 0.05 are optimized based on EMMAX, the regression scan runtime will be reduced to 6.312, 4.901, and 5.619 min.
The Q-Q profiles of the analyzed traits are depicted on the right side of Figures 4, 5 and S1-S19. As shown in Table 1, for the traits analyzed above, the corresponding genomic control values (GC) for ALLSNP, FPC, EPC, and EPCLM were obtained, respectively. For example, for trait C16:0/C16:1, the GC values of methods ALLSNP, FPC, EPC, and EPCLM are 0.381, 1.017, 1.035, and 2.508, respectively. These results suggest that ALLSNP deflates test statistics significantly, whereas EPCLM significantly inflates test statistics. In addition, the ALLSNP method presents an abnormal morphology in some traits, as shown in Figures 5 and S3. Compared with these two methods, methods FPC and EPC have desirable statistical properties. In the following, we would just compare the GWAS results of FPC and EPC. statistics. In addition, the ALLSNP method presents an abnormal morphology in some traits, as shown in Figures 5 and S3. Compared with these two methods, methods FPC and EPC have desirable statistical properties. In the following, we would just compare the GWAS results of FPC and EPC.   statistics. In addition, the ALLSNP method presents an abnormal morphology in some traits, as shown in Figures 5 and S3. Compared with these two methods, methods FPC and EPC have desirable statistical properties. In the following, we would just compare the GWAS results of FPC and EPC.    As shown in Figures 4, 5 and Figures S1-S19, and Tables S1 and S2, the EPC method identified more QTGs than FPC method. The horizontal reference lines represent the critical thresholds at a significance level of 5%, as obtained by Bonferroni correction. After Bonferroni correction (5.692 for HAU maize and 5.529 for AP maize), a total of 132 QTGs were detected by EPC method, while 79 QTGs were detected by FPC method, which suggests that method EPC has higher statistical efficiency. It was found that many QTGs appear in the analysis results of multiple HAU traits, which is consistent with the fact that the analyzed HAU maize traits are all traits related to kernel oil concentration and fatty acid composition. The fact that many of the same QTGs were found in the analysis of related traits also proved the reliability of our method to some extent. Finally, we reanalyzed all 34,774 genes in the HAU dataset using the EPC method (Table S3) and summarized the QTGs shared between different HAU traits ( Table 2). These results provide good references for further research on pathway-based GWAS. In addition, we did a rough inspection of the genes in Table 2 and some of the most significant genes in Tables S1 and S2, and found some genes involved in fatty acid synthesis and metabolism (Table 3) according to maizeGDB website annotation. In the association analysis of the above fatty acid traits, the number of times these genes are found by the EPC method is generally much higher than that of FPC, but there are some special cases (Table 3).

Gene Shared among Traits
Trait Name search the optimal polygenic heritability for the tested gene. Given the heritability of the genome, EMMAX requires just one round of whole genome regression scans. If only large or highly significant genes are inspected, Single-RunKing will run whole genome regression analysis based on the EMMAX method within two rounds. Continuous advancement of re-sequencing technology will produce more high-throughput SNPs. Thus, using all the markers to construct the kinship matrix in a mixed model association study will consume increasing memory and time. In addition, if the kinship matrix changes along with the tested genes, the required computing time will be incredibly high. Moreover, computing the kinship matrix with too many SNPs could result in proximal contamination due to overestimation of polygenic variance, especially for large genetic units like genes [20,34,35]. The easiest way is to construct kinship matrices with random samples of genetic markers [16,35]. Compared with calculating the kinship matrix based on a random sample of genetic markers or all markers, selectively adding and removing pseudo-QTNs to get the kinship matrix could increase the statistical power [34,36]. In addition, in the CMLM method, individuals are divided into groups according to selected genetic markers, in this way, the dimension of RRM is reduced. For too large a resource population, genomic heritability could be estimated quickly by taking a random sample of the population. In brief, we could introduce other simplification procedures related to the whole genome mixed-model association analysis in the Single-RunKing software to enhance computational speed and power.

Conclusions
The FaST-LMM algorithm is used to convert mixed model association analysis into linear regression analysis. Gene effect and the maximum likelihood value were rapidly estimated with fastLmPure, a function for linear model fitting. When only large or highly significant genes obtained from EMMAX are tested, the extended Single-RunKing software for genes performs the entire genome regression scan within two rounds. Based on these improvements, we further compared the statistical efficiency of using 80% PC (EPC), the first PC (FPC), and all SNP markers (ALLSNP) as independent variables, which predecessors commonly use to integrate SNPs and represent genes. The EPC method fits the model well and has good statistical efficiency. It not only overcomes the false negative problem when using all SNP markers for analysis but also solves the false positive problem of EP-CLM. Compared with the FPC method, it has higher statistical efficiency. The algorithm has been applied to the whole genome association study of agronomic traits in HAU and AP maize, and 132 QTGs were identified. The fact that many of the same QTGs were found in the analysis of related traits also proved the reliability of our method. Among these genes, we found some genes involved in fatty acid synthesis and metabolism according to the maizeGDB website annotation. The frequency of finding these genes by EPC method is generally much higher than that by FPC method, but there are exceptions. Therefore, we recommend using EPC for initial analysis and then using FPC as an aid to ensure foolproofness. We have been inspired that using multiple reliable and efficient methods to analyze the data simultaneously will help get better results. Different methods can be mutually validated and complementary. Finally, we observed the appearance of other significant candidate genes in multiple traits. Whether they are implicated in fatty acid synthesis and metabolism or how they influence this process is still waiting for us to explore.