An Improved Expectation–Maximization Bayesian Algorithm for GWAS

: Genome-wide association studies (GWASs) are flexible and comprehensive tools for identifying single nucleotide polymorphisms (SNPs) associated with complex traits or diseases. The whole-genome Bayesian models are an effective way of incorporating important prior information into modeling. Bayesian methods have been widely used in association analysis. However, Bayesian analysis is often not feasible due to the high-throughput genotype and large sample sizes involved. In this study, we propose a new Bayesian algorithm under the mixed linear model framework: the expectation and maximization BayesB Improved algorithm (emBBI). The emBBI algorithm corrects polygenic and environmental noise and reduces dimensions; then, it estimates and tests marker effects using emBayesB and the LOD test, respectively. We conducted two simulation experiments and analyzed a real dataset related to flowering time in Arabidopsis to demonstrate the validation of the new algorithm. The results show that the emBBI algorithm is more flexible and accurate in simulation studies compared to established methods, and it performs well under complex genetic backgrounds. The analysis of the Arabidopsis real dataset further illustrates the advantages of the em-BBI algorithm for GWAS by detecting known genes. Furthermore, 12 candidate genes are identified in the neighborhood of the significant quantitative trait nucleotides (QTNs) of flowering-related QTNs in Arabidopsis . In addition, we also performed enrichment analysis and tissue expression analysis of candidate genes, which will help us better understand the genetic basis of flowering-related traits in Arabidopsis .


Introduction
Genome-wide association studies (GWASs) are a powerful tool for identifying genetic variations associated with complex diseases and important traits [1].Over the past few decades, GWAS has been successfully applied in genetic analysis.Hundreds of thousands of genes have been identified to be associated with target traits in humans [1][2][3][4], animals [5][6][7][8], and plants [9][10][11][12][13].This is essential for dissecting complex diseases as well as for innovation in genetics and germplasm.
The mixed linear model (MLM) approach [14,15], which considers Q + K, has been introduced to the concept of GWASs, significantly increasing the power of quantitative trait nucleotide (QTN) detection.Based on that, an efficient mixed-model association (EMMA) [16] and GEMMA [17] treated the polygenic effect as the random effect to fit the mixed model, which is now becoming the mainstream process in GWAS.Several methods are continuously emerging, such as EMMA eXpedited (EMMAX) [18], FaST-LMM [19], SUPER [20], FarmCPU [21], and FastGWA [22].Because of the genetic variant dissection and computational speed, all these methods have been successfully applied in genomic analysis.However, most of them comprise a one-dimensional marker scan by testing one marker at a time, which is involved in multiple test corrections for the threshold value of significance tests.The widely used Bonferroni correction is often too conservative, which may be disadvantageous to the detection of important loci in GWASs [13,23].
For a complex trait governed by multiple QTNs, it is reasonable to consider the effects of multiple QTNs in the model as this aligns with the internal genetic mechanism of these traits.The whole-genome Bayesian models are an effective way of incorporating important prior information into modeling.Bayesian methods [24][25][26][27] are widely used in association analysis.Bayesian algorithms based on MCMC correctly estimate the locations and genetic effects of QTNs when the dimensionality and data size are small.Additionally, several Bayesian methods can also perform both GWASs and genome selection, such as emBayesB (emBB) [27], expectation and maximization BayesA (emBA) [28], expectation and maximization BayesLasso (emBL) [29][30][31], expectation and maximization Gaussian maximum likelihood (emML) [29,32], expectation-maximization ridge regression (emRR) [33], and least absolute shrinkage and selection operator (LASSO) [34].The current biotechnology has generated abundant high-throughput molecular markers and high-dimensional sample sizes, promoting genetic study in complex trait dissection.However, Bayesian analysis is often not feasible due to the large-scale data involved.
In this study, to improve the accuracy of GWASs in identifying important QTNs and gene mining, we combine mixed linear models and the Bayesian concept and propose a new multi-stage flexible method, namely, the emBBI (expectation and maximization BayesB Improved) algorithm, which takes into account the population structure and polygenic background in the mixed linear model [23,35].In the first stage, the emBBI algorithm whitens the polygenetic and environmental noise.Subsequently, we conduct the variable reduction stage to remove the unassociated loci from the model and then apply the em-BayesB method to estimate the QTN effect value.In the last stage, the LOD test is used to assess the significance of potential QTNs.In this study, a series of simulation experiments and real data set analysis were carried out to illustrate the advantages of this new method.As a comparison, the existing methods, including emBB, emRR, emML, emBL, and emBA are used to analyze the above data sets.The results of both the simulation experiments and real dataset analysis demonstrate the advantages of our approach in QTN detection and testing.

Genetic Model
Consider the following mixed linear model relating phenotypes y to genotypes Z, assuming n individuals and p genetic markers, p >> n, it can be described as: where y = y 1 , ..., y n ) T , y i is an n-vector of phenotypes measured on n individuals; α is a c × 1 vector of the fixed effects, including the intercept, population structure effect, and so on, W is an n × c matrix of the corresponding designed matrix for α; Z is an n × 1 vector of marker genotypes, and γ ∼ N 0, σ 2 γ is a p × 1 vector of a random effect of each genetic marker, σ 2 γ is the variance of γ; u ∼ MV N 0, σ g 2 K is an n × 1 random vector of polygenic effects, σ g 2 is the variance of the polygenic background, K is a known n × n genetic relationship matrix between individuals; ε ∼ MV N 0, σ 2 I n is an n × 1 vector of residual errors; and σ 2 is the variance of residual error; I n is an n × n identity matrix.MVN denotes multivariate normal distribution.
As γ is treated as a random effect, the variance of y in Model ( 1) is as follows: where λ γ = σ γ 2 /σ 2 , λ g = σ g 2 /σ 2 , λ γ is the ratio of genetic variance to the variance of residual error, and λ g is the ratio between the variance of the polygenic background and the variance of the residual error.

Genome-Wide Association Analysis of the emBBI Algorithm
The emBBI algorithm is a multi-stage approach for GWAS, which simultaneously estimates the regression effect and performs hypothesis testing.We describe it as the following four stages: 2.2.1.Polygenic and Residual Noise Whitening Stage First, we borrow the idea from Wen [23] and conduct the polygenic and residual noise whitening stage.The estimations of two ratios λ γ and λ g , cause an expensive computational burden in GWAS.The polygenic variance is always larger than zero, thus assuming λ γ = 0 since most markers are not associated with the trait.Therefore, we estimate λg by the reduced Model (1), which remove Zγ with only polygenic background, and replace λ g in ( 2) by the λg [23,36], avoiding time-consuming re-estimate λ g for each single marker scanning.Thus, The eigen (or spectral) decomposition of the positive semidefinite matrix B = λg K + I n .
where Q is orthogonal and Λ is a diagonal matrix with positive eigenvalues.Assuming that is changed to the following: where y c = Cy, W c = CW, Z c = CZ, and ε c = Cu + Cε ∼ MV N 0, σ 2 I n [13,36].

Variable Reduction Stage
Nowadays, the advanced biotechnology has generated millions of genetic markers, it is inflexible to analyze so many markers in the Bayesian model.Meanwhile, several studies illustrated that most quantitative traits are controlled by a small portion of genes [35,37], which means that most SNPs are not associated with target traits; thus, it is critical to conduct dimension reduction.In the variable reduction stage, the unassociated loci were removed according to ordinary least squares under Model (5).All the most potential top m markers were selected to construct the reduced model for the next EM stage.In this study, the top 200 loci remain for the next stage [38,39].

EM Algorithm Stage
In the emBayesB method [27], we focused on the following linear model: where y c is the n × 1 vector of phenotypic value after polygenic background correction, which is the same as it in Model (5).γ is an m × 1 random effect vector of SNP effects, m is the number of SNPs after variable reduction stage, and Z * c is a corresponding corrected n × m genotypic matrix for γ. ε c is an n × 1 vector of residuals, which is assumed to be the same as in Model (5).Hence, it is described as y c |γ ∼ N Z * c γ , σ 2 I n .First, we consider the prior distribution of the SNP effect γ.An unknown indicator variable h j , is defined to indicate whether the jth SNP has linkage disequilibrium (LD) with the QTN.If h j = 1 (the jth SNP is in LD with the QTN), and the SNP effect γ j is assumed to be from a double-exponential (DE) distribution in (7).Otherwise, the SNP effect γ j is assumed to be from a Dirac Delta (DD) distribution in (7).Hence, the conditional distribution of γ j given h j is Intuitively, we assume that the proportion τ of SNPs are in LD with at least one QTN in this study.Assuming the independence of the m SNP effects, the joint prior for h and γ is as follows: The log posterior is proportional to the following: The indicator variable h j is considered as missing data.Then, we maximize the log posterior in an EM algorithm.In the E-step, an analytical expression of the posterior probability that SNP j is in LD with at least one QTN for iteration k, τ k j , is derived as [27]: For the M-step, we fix τ k j and maximize E h [log p(h, γ|y c )] with respect to the parame- ters γ j , τ, λ, and σ 2 .Setting the derivative to zero and rearranging, we obtain the following expressions at iteration k: where τ k is the posterior probability vector at iteration k.DE mode is the posterior mode of the DE distribution, and DD mode is the posterior mode of the DD distribution.The EM algorithm for emBayesB is summarized as follows: 1.
Provide the initial values for the parameters.2.
E-step: for each SNP, calculate γ j and the posterior probability τ k j as shown in (10).
The E-step and M-step are repeated until the convergence criterion is satisfied.To reduce the computational burden, we adopted the maximum iterations 200 for the EM algorithm [29] in this study.

Likelihood Ratio (LR) Test
Based on the estimates of SNP effects γ j in the EM algorithm, all the markers with γ j < 1e −4 could be viewed as not being associated with the quantitative trait; the other markers with the effects are possibly associated with the trait.To test the null hypothesis (H 0 : γ (j) = 0 means no QTN), the LR test was conducted as follows: where θ T = γ (1) , . . ., γ (j) , . . ., γ (m) , σ 2 and θ T −j = γ (1) , . . ., γ (j−1) , γ (j+1) , . . .γ (m) , σ 2 is the parameter vector that exclude γ j and L(.) is the log-likelihood function.As pointed out by Kao et al. [40], selecting critical values of significant QTNs becomes complicated for multiple QTN tests.For simplicity, we used the logarithm of odds (LOD) as a criterion in real data [41,42], and LOD = LR/ln (10).The critical value for significance was set to LOD = 2.5, equal to p-value = 0.00069, which is converted from the LOD score by using p-value = Pr χ 2 (1) > 2.5 × 4.61 = 0.00069 under the null hypothesis following a Chi-square distribution with one degree of freedom.

Comparison Algorithm
In this study, we compare several classical Bayesian methods to demonstrate the effectiveness and superiority of emBBI.
Expectation and maximization BayesB (emBB [27]): This method assumes that the proportion of SNPs in LD is associated with at least one QTN.The SNP effect follows the DE distribution or DD distribution according to whether associated, and the computational speed is improved compared to MCMC-based BayesB.
Expectation-maximization Bayesian ridge regression (emRR [33]): This method assumes that all regression coefficients have an equal variance.It obtains the posterior distribution of the parameters by introducing the regular term automatically, thereby avoiding overfitting in large-scale likelihood estimation in the estimation process.
Expectation-Maximization Gaussian Maximum Likelihood (emML [29]): This is a classical Bayesian estimation algorithm that focuses on evaluating a set of interested parameters under the maximum probability.It is a straightforward and stable approach under the Bayesian framework, with excellent estimation properties as well [32].
Expectation and Maximization Bayesian Lasso (emBL [29]): This method assumes that the SNP-effect follows the double exponential prior distribution.It constructs a full Bayesian analysis, providing credible intervals for the estimates, and the value of λ can be chosen by marginal maximum likelihood or hyperprior methods.It is a flexible method [30,31].
Expectation Maximization BayesA (emBA [28]): This method assumes that all regression coefficients (SNP effects) follow a normal distribution, with the prior of variance parameters further assigned as a Chi-square distribution.The hyperparameters are directly related to genetic structure, and compared to the MCMC-based mode, the accuracy and computational efficiency of emBA are improved.

Experimental Materials 2.4.1. Simulation Datasets
In this study, we performed Monte Carlo simulation experiments to verify the advantages of the new algorithm.The datasets of the simulation experiment were generated from the mixed linear model, the genotypes based on the minor allele frequency (MAF) under Hardy Weinberg equilibrium in the interval [0.1, 0.5].The simulated dataset contained a sample size of 2000 with 10,000 genetic variants (SNPs) generated by the MLM.Both the population mean and the residuals were set to 10.0.
In this study, we conducted two simulation experiments.The first simulation involved a single fixed-position QTN located on the 98th marker with 0.1 heritability.The purpose of this experiment is to illustrate the accuracy of the fixed position.For the second simulation experiment, we randomly selected 50 QTNs from 10,000 genetic variants.Each QTN had an MAF greater than 0.3, and the proportion of phenotypic variance explained by all QTNs was 50%.We repeated each simulation experiment 100 times.Due to the different levels of genetic structures of the various species and populations, for each simulation, we further examined fifteen scenarios involving three background noise levels and three sample size combinations.These combinations included two-, five-, and ten-times polygenic backgrounds and sample sizes of 500, 1000, 2000, 3000, and 5000.

Arabidopsis Datasets
The well-known Arabidopsis dataset consists of 199 Arabidopsis inbred lines (Atwell et al., 2010), containing 216,130 SNPs and 107 traits [43].Among these traits, three traits related to flowering time were used to validate the performance of the various methods in this study: (1) SDV: days to vernalized flowering under short days of sunlight; (2) FT10: days to flowering at 10 • C; and (3) FT22: days to flowering at 22 • C. In this study, we performed quality control using the Plink tool and utilized "-bfile data-maf 0.01-make-bed-out output" to remove data with MAF < 0.01 in Arabidopsis datasets.After filtering, 215,961 SNPs were used to conduct real data analysis.Following quality control and the removal of missing data, the remaining SDV, FT10, and FT22 data were 159, 194, and 193, respectively.The densities of whole markers and MAF are shown in Figure 1A,B, respectively.
under Hardy Weinberg equilibrium in the interval [0.1, 0.5].The simulated dataset contained a sample size of 2000 with 10,000 genetic variants (SNPs) generated by the MLM.Both the population mean and the residuals were set to 10.0.
In this study, we conducted two simulation experiments.The first simulation involved a single fixed-position QTN located on the 98th marker with 0.1 heritability.The purpose of this experiment is to illustrate the accuracy of the fixed position.For the second simulation experiment, we randomly selected 50 QTNs from 10,000 genetic variants.Each QTN had an MAF greater than 0.3, and the proportion of phenotypic variance explained by all QTNs was 50%.We repeated each simulation experiment 100 times.Due to the different levels of genetic structures of the various species and populations, for each simulation, we further examined fifteen scenarios involving three background noise levels and three sample size combinations.These combinations included two-, five-, and ten-times polygenic backgrounds and sample sizes of 500, 1000, 2000, 3000, and 5000.

Arabidopsis Datasets
The well-known Arabidopsis dataset consists of 199 Arabidopsis inbred lines (Atwell et al., 2010), containing 216,130 SNPs and 107 traits [43].Among these traits, three traits related to flowering time were used to validate the performance of the various methods in this study:

Candidate Gene Identification and Enrichment Analysis
All the identified significant putative QTNs of different methods were used to mine the known genes or candidate genes using the Arabidopsis Information Resource (TAIR, https: //www.arabidopsis.org,accessed on 10 March 2024).According to the LD information, the neighboring regions within 20 kb of all significant loci were used to search for the known genes.
To give insight into the genetic basis, we implemented gene ontology (GO) enrichment analysis based on all the genes in the neighborhood of the significant loci.Genes enriched in these significant pathways were considered as candidate genes.This process provides more information on biological functions, pathways, or cellular localizations [44].The online tool agriGO (http://systemsbiology.cau.edu.cn/agriGOv2/#,accessed on 15 March 2023) [45] was used to perform GO enrichment analysis concerning the biological process (BP), molecular function (MF), and cellular component (CC) for the candidate genes.In summary, GO enrichment analysis was conducted using an enrichment analysis tool, and Fisher's exact test (p-value < 0.05) was utilized to select enrichment GO terms.The R package "pheatmap"(version 1.0.12) was used to create a heatmap according to the results of the GO enrichment analysis for the candidate genes.

Tissue-Specific Expression Analysis
In this study, the ATH1 database (http://bar.utoronto.ca/efp,accessed on 17 April 2023) was applied to display the expression levels of candidate genes in various tissues or organs.The obtained results of the expression level are based on the GCOS expression signal (from Affymetrix GeneChip).To visualize the expression of candidate genes in various tissues or organs, we used the "pheatmap" package (version 1.0.12) in the R program to draw heatmaps to show the expression information.

Experimental Results of Simulated Data
We first compared the model estimation evaluation of emBBI with established methods, including emBB, emRR, emML, emBL, and emBA through two Monte Carlo simulation experiments.Each simulation dataset contained 2000 individuals with 10,000 genetic variables.For these two simulation experiments, we considered a QTN with a fixed position and 50 QTNs with random positions, respectively.
For simulation experiment I, a fixed-position 98th QTN with 0.1 heritability was simulated.The power of all the methods for the detection of QTNs was almost 100%, which indicates that emBBI and the other methods easily capture the genetic variant.The MSE between the true effect γ and the estimation γ was adopted to evaluate the accuracy of each method, which was calculated as follows: where γ ij (i = 1, 2, . . ., m; j = 1, 2, . . ., k) is the SNP true effect, γij is the model coefficient estimator, m is the number of SNPs, and k is the number of repetitions.The smaller value of the MSE indicates a higher accuracy of the results.It is obvious that the MSE of emBBI (the red bar) is much less than the others, as shown in Figure 2.Meanwhile, for the sample size of 500, the SNP effect estimation accuracies of emBBI and emBB are much better than that of the other methods, followed by emBL, emBA, emML, and emRR.As the sample size increases, the accuracies of all the other algorithms are improved, except for the emRR method, which assumes that all regression coefficients have an equal variance and might impact the accuracy.An interesting result indicates that although emBBI and emBB have much higher accuracy than the other algorithms, the differences in MSE are significant increasing along with polygenetic background or sample size increasing.Additionally, the accuracy of all the algorithms decreases with an increase in background noise, which implies that the background noise has a great influence on the effect estimation accuracy.This conclusion is consistent with our understanding.The extreme advantages of the emBBI algorithm estimation are illustrated in Figure 2. background or sample size increasing.Additionally, the accuracy of all the algorithms decreases with an increase in background noise, which implies that the background noise has a great influence on the effect estimation accuracy.This conclusion is consistent with our understanding.The extreme advantages of the emBBI algorithm estimation are illustrated in Figure 2.For simulation experiment II, there are 50 random-position QTNs associated with the target phenotype.The total phenotypic variance explained (PVE) is 50%, and the average PVE for each QTN is approximately 1%, which indicates that almost all QTNs have a minor effect.Fortunately, the power of all the methods in the detection of QTNs are greater than 90%, which are slightly weaker than simulation I, the accuracy of QTN estimations For simulation experiment II, there are 50 random-position QTNs associated with the target phenotype.The total phenotypic variance explained (PVE) is 50%, and the average PVE for each QTN is approximately 1%, which indicates that almost all QTNs have a minor effect.Fortunately, the power of all the methods in the detection of QTNs are greater than 90%, which are slightly weaker than simulation I, the accuracy of QTN estimations shows a similar trend.Here, the −log(p-value) is employed to assess the significance of QTNs in different methods.The results of experiments with sample sizes of 3000 and 5000 are not shown because the likelihood ratio test has a huge computational burden.Most methods failed to generate any meaningful results, and all the estimated effects are zero.
Figure S1 shows the violin plots of −log(p-value) after the likelihood ratio test for sample sizes of 500, 1000, and 2000.The results show that under multiple QTNs, emBBI and emBB have significantly smaller p-values obtained from the LOD test compared to the other methods, and the other four methods are similar to each other.It is noteworthy that the emBBI was more significant than the emBB test in most situations.The p-value of the test decreased when the sample size increased or the genetic background decreased for all methods.
In addition, we compared the computation time of the simulation experiments under 100 repetitions.Although emBBI is more accurate and significant, it does not show a large improvement in computational speed compared to the emBB algorithm.

Results of Real Data Analysis
To further validate the emBBI algorithm, three flowering-related traits, including SDV, FT10, and FT22, were analyzed in this study.After removing the missing data, the remaining SDV, FT10, and FT22 data were 159, 194, and 193, respectively.The method emBBI and other classical methods were conducted to examine whether there were significant genetic variants associated with the flowering-related Arabidopsis traits.

Analysis of Phenotypic Differences in Arabidopsis Data
The variation in phenotypic data of the three traits is shown in Figure 3 by box plots, histograms, and distribution plots.The mean values of SDV, FT10, and FT22 are 64.86 days, 63.97 days, and 74.72 days (Table 1), respectively.The average of the FT22 trait is smaller than FT10.The higher the temperature, the shorter the flowering phase is.The phenotypic datasets of SDV, FT10, and FT22 are distributed between 27 and 100 days, 40 and 80 days, and 23 and 130 days, respectively.The FT10 trait had the smallest variance, which means that the phenotypic datasets of FT10 were all clustered around the mean.The SD (71.75) and CV (0.96) of FT22 were larger than those of SDV (SD: 35.07 and CV: 0.54) and FT10 (SD: 17.82 and CV: 0.28).Detailed descriptive statistics for the three phenotypes, including mean, SD, CV, minimum, maximum, and range, are shown in Table 1.All the phenotypes approximatively follow the normal distribution according to the Kolmogorov-Smirnov test.The results of descriptive statistics indicate that there might be a different genetic basis for days to flowering traits between days of sunlight and temperature.It can be inferred that the real datasets are suitable for GWASs.(SD: 17.82 and CV: 0.28).Detailed descriptive statistics for the three phenotypes, including mean, SD, CV, minimum, maximum, and range, are shown in Table 1.All the phenotypes approximatively follow the normal distribution according to the Kolmogorov-Smirnov test.The results of descriptive statistics indicate that there might be a different genetic basis for days to flowering traits between days of sunlight and temperature.It can be inferred that the real datasets are suitable for GWASs.In total, 199 inbred lines with 215,961 SNPs were applied to carry out GWAS for flowering-related Arabidopsis traits, including SDV, FT10, and FT22 by emBBI and the other algorithms.The densities of all SNPs and MAF are shown in Figure 1.These SNPs were evenly distributed across the five chromosomes (Figure 1A).
The QTNs less than the threshold with LOD = 2.5, equal to p-value = 0.00069, were mined from TAIR (https://www.arabidopsis.org,accessed on 10 March 2024) for different methods.The known genes are listed in Table 2 and the Manhattan plot (Figures S2-S4).The emBBI algorithm detected 3 (Table S1 and Figure S2), 6 (Table S1 and Figure S3), and 12 (Table 2 and Figure S4) known genes, which were associated with SDV, FT10, and FT22, respectively.There was a total of 21 confirmed genes, which were associated with three flowering-relate traits, whereas the other methods, emBB, emBA, emBL, emML, and emRR, detected 6, 5, 10, 19, and 13 (Tables 2 and S1 and Figures S2-S4) for the three traits, respectively.An interesting result shows that the emBBI method dissected three clusters of genes associated with the same trait, FT22 (Table 2 and Figure S4A), including the genes AT2G38185, AT2G38195, and AT2G38220, in the neighborhood of the SNP located at 16020457 bp on chromosome 2.More importantly, the AT5G20280 gene was simultaneously detected by three flowering-relate traits, SDV, FT10, and FT22 (Tables 2 and S1), and the AT5G24930 gene was detected by both the SDV and FT22 traits (Tables 2 and S1).It

GWASs and Known Genes of QTNs
In total, 199 inbred lines with 215,961 SNPs were applied to carry out GWAS for flowering-related Arabidopsis traits, including SDV, FT10, and FT22 by emBBI and the other algorithms.The densities of all SNPs and MAF are shown in Figure 1.These SNPs were evenly distributed across the five chromosomes (Figure 1A).
The QTNs less than the threshold with LOD = 2.5, equal to p-value = 0.00069, were mined from TAIR (https://www.arabidopsis.org,accessed on 10 March 2024) for different methods.The known genes are listed in Table 2 and the Manhattan plot (Figures S2-S4).The emBBI algorithm detected 3 (Table S1 and Figure S2), 6 (Table S1 and Figure S3), and 12 (Table 2 and Figure S4) known genes, which were associated with SDV, FT10, and FT22, respectively.There was a total of 21 confirmed genes, which were associated with three flowering-relate traits, whereas the other methods, emBB, emBA, emBL, emML, and emRR, detected 6, 5, 10, 19, and 13 (Table 2 and Table S1 and Figures S2-S4) for the three traits, respectively.An interesting result shows that the emBBI method dissected three clusters of genes associated with the same trait, FT22 (Table 2 and Figure S4A), including the genes AT2G38185, AT2G38195, and AT2G38220, in the neighborhood of the SNP located at 16020457 bp on chromosome 2.More importantly, the AT5G20280 gene was simultaneously detected by three flowering-relate traits, SDV, FT10, and FT22 (Table 2 and Table S1), and the AT5G24930 gene was detected by both the SDV and FT22 traits (Table 2 and Table S1).It seems that emBBI is more powerful in detecting important genes (confirmed by TAIR) associated with relevant flowering traits.

Functional Enrichment Analysis of Candidate Genes
In order to gain insight into the genetic basis of the candidate genes, GO enrichment analysis was performed in this study, which is a powerful bioinformatics tool to better understand the underlying biological processes, molecular functions, and cellular components of candidate genes.According to the results of the GO functional enrichment study, 1655 candidate genes for QTNs are significantly enriched with 234 GO terms related to various biological processes (p-value < 0.05).The results of the enrichment analysis heatmap of all the candidate genes are shown in Figure 4, and the pathway is shown in Supplementary Figure S5.The important GO terms of the candidate genes are marked in red in the rectangular boxes.In order to gain insight into the genetic basis of the candidate genes, GO enrichmen analysis was performed in this study, which is a powerful bioinformatics tool to bette understand the underlying biological processes, molecular functions, and cellular compo nents of candidate genes.According to the results of the GO functional enrichment stud 1655 candidate genes for QTNs are significantly enriched with 234 GO terms related t various biological processes (p-value < 0.05).The results of the enrichment analys heatmap of all the candidate genes are shown in Figure 4, and the pathway is shown i Supplementary Figure S5.The important GO terms of the candidate genes are marked red in the rectangular boxes.In adjacent areas of the significant QTNs, 12 candidate genes (Figure 4) are involved in several biological and metabolic processes during the Arabidopsis growth period, such as flower development, which have not been reported in previous studies.It reveals that these candidate genes have a potential impact on the target traits.For example, the detected candidate gene AT5G37930 is involved in multicellular biological development, zinc ion binding, ubiquitin protein transferase activity, and protein binding.According to the GO analysis (Supplementary Figure S5), reproductive processes, such as postembryonic development (GO:0009791), are functionally related to flower development in Arabidopsis, and cellular macromolecular metabolic processes (GO:0044260), nucleus (GO:0005634) and ion binding (GO:00443167) are most significant in BP, CC, and MF, respectively.In the last layer (Supplementary Figure S5), 144 genes were enriched to the protein modification process (GO:0006464), in which the novel candidate genes AT5G37890, AT5G37930, AT1G32060, AT5G37870, AT2G13950, and AT5G37910 were involved (Figure 4); these genes play an important role in the growth and reproduction of Arabidopsis thaliana.These results provide the critical biological basis for mining novel genes.

Expression Profiling of Candidate Genes
The expression levels of the candidate genes in different organs or tissues are available in the ATH1 database (http://bar.utoronto.ca/efp,accessed on 17 April 2024), including seeds (both dried and soaked), stamens, stems, leaves, leaf roots, inflorescences, and so on.For the candidate genes for SDV traits, the GCOS expression signal expression levels of different tissues or organs are illustrated in Figure 5 (normalized before plotting the heat map).It is evident that the gene AT4G36060 has the highest GCOS expression level in dry seeds, as shown in Figure 5.The genes AT2G02090, AT2G02140, and AT4G01575 all had the highest expression levels in the Arabidopsis mature pollen period.Genes AT5G20240, AT1G23010, AT1G23060, and AT5G20270 showed high expression levels in the flowering stage.
bidopsis, and cellular macromolecular metabolic processes (GO:0044260), nucleu (GO:0005634) and ion binding (GO:00443167) are most significant in BP, CC, and MF, re spectively.In the last layer (Supplementary Figure S5), 144 genes were enriched to th protein modification process (GO:0006464), in which the novel candidate gene AT5G37890, AT5G37930, AT1G32060, AT5G37870, AT2G13950, and AT5G37910 were in volved (Figure 4); these genes play an important role in the growth and reproduction o Arabidopsis thaliana.These results provide the critical biological basis for mining nove genes.

Expression Profiling of Candidate Genes
The expression levels of the candidate genes in different organs or tissues are availa ble in the ATH1 database (http://bar.utoronto.ca/efp,accessed on 17 April 2024), includin seeds (both dried and soaked), stamens, stems, leaves, leaf roots, inflorescences, and s on.For the candidate genes for SDV traits, the GCOS expression signal expression level of different tissues or organs are illustrated in Figure 5 (normalized before plotting th heat map).It is evident that the gene AT4G36060 has the highest GCOS expression leve in dry seeds, as shown in Figure 5.The genes AT2G02090, AT2G02140, and AT4G01575 a had the highest expression levels in the Arabidopsis mature pollen period.Gene AT5G20240, AT1G23010, AT1G23060, and AT5G20270 showed high expression levels in th flowering stage.

Discussion
The emBayesB [27] method is a fast, accurate, and unbiased EM algorithm based on iterated conditional expectation, which was developed by combining the Expectation-Maximization (EM) algorithm.It is a robust algorithm for both GWASs and GS, and its key advantages can be expressed as the algorithm dynamically updates the LD ratio between SNPs and QTNs during the iterative process, resulting in a relatively high estimation accuracy.However, advanced biotechnology generates millions of molecular markers and large-scale SNP markers, and the classical Bayesian algorithms demonstrate its inflexibility in high-dimensional data analysis.Meanwhile, the limitations in whitening the population structure and polygenic background noise lead to the inaccuracy of SNP effect estimation.In this study, we propose a new algorithm, emBBI, under the framework of the mixed linear model framework.This algorithm first assumes that the marker effects are random and employs the model transformation of FASTmrEMMA to whiten the covariance matrices of polygenes and environmental noise.After dimensional reduction, the emBayesB method and LOD test are applied to estimate and test the marker effects.The results of this study show that controlling population structure and polygenic background noise is crucial for emBBI in GWASs and to improve SNP effect estimation, which provides an alternative method for genetic analysis.
We compared emBBI with the established FASTmrEMMA method in both simulation experiments and real data analysis.The powers of the two methods in the detection of the fixed QTN are almost 100% under three polygenic backgrounds.The accuracy of emBBI is slightly better while FASTmrEMMA has a faster computational speed.For random-position QTNs, FASTmrEMMA performs slightly better.The computational time and accuracy have a similar tendency with simulation I, with the gap becoming wider with a stronger polygenic background.In the real data analysis, the FASTmrEMMA detects 11 known genes for SDV, which is less than emBBI (12 known genes).The main reason is that although both emBBI and FASTmrEMMA have a polygenic correction stage to whiten the noise, the following algorithms are different.FASTmrEMMA is a combination of single-locus and multi-locus analysis, while emBBI employs the least squares method to reduce the dimensions; then, it estimates and tests the marker effects using emBayesB and the LOD test, respectively.
In real data analysis, the emBBI algorithm detected 21 known genes associated with three flowering-related traits.The number of known genes detected by emBBI is significantly higher than the number detected by other comparison methods.In fact, it is more than three times greater than the traditional emBB and four times greater than emBA.Based on the multi-stage process, the emBBI method identified three clusters of genes associated with FT22 (Table 2 and Figure S4).Notably, the gene AT5G20280 was simultaneously detected by three flowering-related traits.All the above results revealed that emBBI is more powerful in detecting significant QTNs and important genes.

Figure 1 .
Figure 1.(A) Marker density of the natural population of Arabidopsis with 1 Mb window size.(B) MAF of the natural population of Arabidopsis.Figure 1.(A) Marker density of the natural population of Arabidopsis with 1 Mb window size.(B) MAF of the natural population of Arabidopsis.

Figure 3 .
Figure 3.The descriptive statistics of the phenotypic data for the three flowering-related traits, SDV, FT10, and FT22.

Figure 3 .
Figure 3.The descriptive statistics of the phenotypic data for the three flowering-related traits, SDV, and FT22.

Figure 4 .
Figure 4.The heat map of GO enrichment analysis of candidate genes for SDV, FT10, and FT22.

Figure 5 .
Figure 5. Heat Map for Expression Profile Analysis.

Figure 5 .
Figure 5. Heat Map for Expression Profile Analysis.

Table 1 .
Statistical analysis of three Arabidopsis flowering-related traits.

Table 1 .
Statistical analysis of three Arabidopsis flowering-related traits.