1. Introduction
Genome-wide association studies (GWASs) have been widely applied in genetic research on humans, animals and plants, serving as a powerful tool for dissecting the genetic mechanisms of complex traits and diseases [1]. Over recent decades, they have identified numerous trait-associated loci in various species, including humans [1,2,3,4], animals [5,6,7,8] and plants [9,10,11,12,13], significantly enhancing our understanding of polygenic inheritance and facilitating advances in precision medicine and molecular breeding. The rapid advancement of next-generation sequencing technologies has introduced new computational challenges, as contemporary GWASs now routinely analyze datasets containing millions of single nucleotide polymorphisms (SNPs). Traditional analytical approaches face substantial limitations in this high-dimensional setting, particularly regarding computational efficiency and statistical power for detecting small genetic effects. To address these challenges, a variety of novel and robust methodologies have been developed, such as the Bayesian sparse linear mixed model [14], pLARmEB [15] and FASTmrEMMA [16].
Numerous studies have demonstrated that most quantitative traits are primarily governed by a small proportion of quantitative trait nucleotides (QTNs) [15,17,18]. This biological basis supports the application of variable selection algorithms for identifying phenotype-associated QTNs, which helps to address overfitting in high-dimensional data analysis. For high-dimensional datasets with limited samples, various variable selection methods have been developed. The Least Absolute Shrinkage and Selection Operator (LASSO) [19] is a fundamental approach: it adds an ℓ1-norm penalty to the residual sum of squares, shrinking the coefficients of uninformative features to exactly zero while retaining non-zero coefficients for influential features. However, because LASSO can suffer from a high false discovery rate [20], alternative methods have been proposed, including Elastic Net [21], Smoothly Clipped Absolute Deviation (SCAD) [22], Adaptive LASSO [23], and BigLASSO [24]. Although these methods have been successfully applied in GWASs, their performance remains constrained in scenarios involving complex, noisy polygenic backgrounds and extremely high-dimensional genetic data.
The mixed linear model (MLM) framework [25,26], which incorporates both fixed effects for population structure (Q) and random genetic effects modeled through a kinship matrix (K), has become a fundamental approach in GWASs by substantially increasing the detection power for QTNs. The framework was further refined through efficient algorithms such as efficient mixed-model association (EMMA) and genome-wide efficient mixed-model association (GEMMA), which explicitly model the polygenic background as random effects. Subsequent methodological innovations, including FaST-LMM [27], FarmCPU [28] and FastGWA [29], have built on this foundation and been widely adopted in genomic analyses. However, most of these methods test one marker at a time, and the commonly employed Bonferroni correction is so stringent that it tends to miss important loci in GWASs.
The two-stage methodology addresses these limitations by first applying single-marker scanning for dimensionality reduction, followed by multi-locus analysis for estimation and testing, which enhances both the efficiency and accuracy of GWASs. Several implementations of this approach have been developed: pLARmEB [15] uses least angle regression for SNP screening before empirical Bayes analysis; FASTmrEMMA [16] selects putative QTNs (p < 0.005) for inclusion in a multi-locus model and adopts a relaxed significance threshold as an alternative to Bonferroni correction; and FastRR [13] combines correlation-based prescreening with an advanced multi-locus ridge regression algorithm (DRR) for parameter estimation. Comparative analyses demonstrate that these two-stage approaches outperform conventional GWAS methods, with higher estimation accuracy, lower computational demands and greater stability across varying polygenic backgrounds and sample sizes, while maintaining robust statistical power.
In this study, we propose a novel multi-stage GWAS approach, termed improved LASSO screening and sparse Bayesian learning (ILSBL), which synergistically integrates variable selection and Bayesian estimation. This method effectively accounts for noise from population structure and polygenic background, and enables efficient detection of QTNs. We comprehensively evaluated the performance of ILSBL in comparison with LASSO, SCAD and other existing methods using both simulated datasets and real Arabidopsis thaliana data.
2. Methods and Materials
2.1. Genetic Model
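Consider the following mixed linear model. Let $y$ denote the phenotype vector for $n$ individuals, and $X$ represent the genotype matrix for $p$ genetic markers (with $p \gg n$). The model is specified as follows:
$$y = W\alpha + Z\gamma + u + \varepsilon, \quad \gamma \sim N(0, \sigma_\gamma^2), \quad u \sim \mathrm{MVN}(0, \sigma_g^2 K), \quad \varepsilon \sim \mathrm{MVN}(0, \sigma^2 I_n) \tag{1}$$
where $y = (y_1, \ldots, y_n)^T$, $y_i$ is the phenotypic value of the $i$th individual out of $n$ individuals; $\alpha$ is a $c \times 1$ vector of the fixed effects, including the intercept, population structure effect and so on; $W$ is an $n \times c$ design matrix for $\alpha$; $Z$ is an $n \times 1$ vector of marker genotypes (the column of $X$ for the marker being scanned); and $\gamma$ is the random effect of that genetic marker; $\sigma_\gamma^2$ is the variance of $\gamma$; $u$ is an $n \times 1$ random vector of polygenic effects; $\sigma_g^2$ is the variance of the polygenic background; $K$ is a known $n \times n$ genetic relationship matrix between individuals; $\varepsilon$ is an $n \times 1$ vector of residual errors; $\sigma^2$ is the variance of residual error; and $I_n$ is an $n \times n$ identity matrix. $\mathrm{MVN}$ denotes a multivariate normal distribution.
As $\gamma$ is treated as a random effect, the variance of $y$ in Model (1) is as follows:
$$\mathrm{Var}(y) = \sigma_\gamma^2 ZZ^T + \sigma_g^2 K + \sigma^2 I_n = \sigma^2\left(\lambda_\gamma ZZ^T + \lambda_g K + I_n\right) \tag{2}$$
where $\lambda_\gamma = \sigma_\gamma^2/\sigma^2$, $\lambda_g = \sigma_g^2/\sigma^2$, $\lambda_\gamma$ is the ratio of genetic variance to the variance of residual error, and $\lambda_g$ is the ratio between the variance of the polygenic background and the variance of the residual error.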
2.2. Improved LASSO Screening and Sparse Bayesian Learning (ILSBL)
The ILSBL method is a multi-stage methodology for GWAS that performs both effect size estimation and statistical testing. The algorithm operates through three computationally efficient phases (see Figure 1).
2.2.1. Polygenic and Residual Noise-Whitening Stage
We adopt the model transformation approach from FASTmrEMMA [16] for whitening polygenic effects and residual noise, aiming to address the computational challenges in GWASs. Given that most markers are not associated with the target trait, we simplify the estimation by making the assumption that $\sigma_\gamma^2 = 0$ and estimate $\lambda_g$ using a reduced Model (1), which removes $Z\gamma$ and replaces $\lambda_g$ in (2) with $\hat{\lambda}_g$ [16,30], avoiding time-consuming re-estimation of $\lambda_g$ for each single-marker scan. The variance components were estimated via restricted maximum likelihood using an optimized EMMA-based algorithm, following FASTmrEMMA [16]. Thus,
$$\mathrm{Var}(y) = \sigma^2\left(\lambda_\gamma ZZ^T + \hat{\lambda}_g K + I_n\right) = \sigma^2\left(\lambda_\gamma ZZ^T + B\right)$$
where $B = \hat{\lambda}_g K + I_n$. The eigen (or spectral) decomposition of $B$ is
$$B = Q_B \Lambda_B Q_B^T$$
In this spectral decomposition, $Q_B$ is orthogonal and the diagonal matrix $\Lambda_B$ contains the eigenvalues. This allows us to transform the original model into the following model:
$$y_c = W_c\alpha + Z_c\gamma + \varepsilon_c$$
where $C = Q_B \Lambda_B^{-1/2} Q_B^T$, $y_c = Cy$, $W_c = CW$, $Z_c = CZ$ and $\varepsilon_c = C(u + \varepsilon) \sim \mathrm{MVN}(0, \sigma^2 I_n)$ [13,30]. This transformation maintains the original model structure while significantly reducing the computational burden by avoiding repeated estimation of $\lambda_g$ for each marker. Its efficient handling of different polygenic backgrounds, preservation of statistical properties and improved computational scalability lay a solid foundation for the subsequent stages of the analysis.
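The whitening step can be illustrated with a minimal NumPy sketch. It assumes that the matrix to be decomposed has the form (estimated variance ratio) × K + I; the toy kinship matrix, the ratio value of 2.0 and all function and variable names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def whitening_matrix(K, lambda_g):
    """Return C = Q diag(d^-1/2) Q^T for B = lambda_g * K + I (assumed form)."""
    n = K.shape[0]
    B = lambda_g * K + np.eye(n)
    eigvals, eigvecs = np.linalg.eigh(B)   # B is symmetric, so eigh applies
    # Dividing column j of Q by sqrt(d_j) gives Q diag(d^-1/2)
    return (eigvecs / np.sqrt(eigvals)) @ eigvecs.T

rng = np.random.default_rng(0)
n, m = 50, 200
G = rng.standard_normal((n, m))   # hypothetical standardized markers
K = G @ G.T / m                   # toy kinship matrix (positive semi-definite)
lambda_g_hat = 2.0                # assumed value of the estimated variance ratio

B = lambda_g_hat * K + np.eye(n)
C = whitening_matrix(K, lambda_g_hat)

y = rng.standard_normal(n)        # placeholder phenotype vector
y_c = C @ y                       # transformed phenotype
whitened_cov = C @ B @ C.T        # equals the identity up to rounding error
```

Because every eigenvalue of such a matrix is at least 1, the inverse square root is always well defined, and the decomposition is computed once rather than once per marker.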
2.2.2. BigLASSO Screening
We use efficient LASSO regression [19] for large-scale dimensionality reduction of the genomic data. Formally, the optimization problem is formulated as
$$\hat{\beta} = \arg\min_{\beta}\left\{\frac{1}{2n}\lVert y - X\beta\rVert_2^2 + \lambda\lVert\beta\rVert_1\right\}$$
where $y$ is the phenotype vector; $X$ is the genotype matrix; $\beta$ is the sparse coefficient vector; $\lambda$ controls the $\ell_1$ penalty strength; and the optimal λ is the value that minimizes the 5-fold cross-validation error in the BigLASSO step.
The analysis was implemented using the R package biglasso [24], which provides computational efficiency and scalability for genome-wide large-scale data. Because LASSO estimates effect sizes with low accuracy, we employed the third-stage algorithm (sparse Bayesian learning) to improve parameter estimation. Markers with non-zero coefficient estimates from the BigLASSO process were retained and passed to the subsequent analysis.
2.2.3. Sparse Bayesian Learning
The SBL algorithm [
31] simultaneously estimates all marker effects within a multiple-locus model using a coordinate descent approach. By iteratively updating marker-specific prior variances and effects under an
penalty, this method achieves efficient variable selection and handles very large sample sizes (e.g., >100,000 individuals) without large matrix operations. The approach is implemented in the R package
sbl, offering greater statistical power and improved computational scalability compared to existing GWAS and QTN mapping methods.
The phenotypic values are modeled as a linear model of SNP effects with a Gaussian residual error:
with priors
and
, where
controls the sparsity. The
is a fixed effect and the
is a random effect.
First, set the sparse priors. Determine the priors as follows:
Then, update the parameters:
Step 1: Calculate the fixed effect by least squares.
Step 2: Compute the adjusted phenotype.
Step 3: Update the prior variance
by solving the following equation:
where
, if solution ≤ 0, set
.
Step 4: Update the random effect
via BLUP.
where the variance of
is
.
Step 5: Update the residual variance
.
The iteration proceeds until convergence is achieved or the maximum number of iterations is reached. The SBL algorithm analyzes all genetic markers simultaneously with robust statistical power, is less sensitive to stringent significance thresholds, and provides accurate estimates of effect sizes. This joint modeling approach addresses the limitations of single-marker testing in high-dimensional genomic data, offering improved detection of variants with modest effects through its hierarchical Bayesian framework.
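The five-step loop can be illustrated with a simplified coordinate-wise empirical Bayes sketch in Python. It is a stand-in, not the sbl package: it omits the sparsity hyperparameter, uses an intercept as the only fixed effect, obtains each prior variance in closed form by maximizing the single-marker marginal likelihood, and uses a plain residual mean square for the residual variance. All names and the toy data are assumptions.

```python
import numpy as np

def sbl_sketch(Z, y, n_iter=50, tol=1e-6):
    n, m = Z.shape
    mu = y.mean()                       # Step 1: fixed effect by least squares
    gamma = np.zeros(m)                 # marker effects
    post_var = np.zeros(m)              # conditional variances of the effects
    sigma2 = y.var()
    zz = (Z ** 2).sum(axis=0)
    resid = y - mu
    for _ in range(n_iter):
        old = gamma.copy()
        for k in range(m):
            y_adj = resid + Z[:, k] * gamma[k]        # Step 2: adjusted phenotype
            zy = Z[:, k] @ y_adj
            # Step 3: prior variance, truncated at zero when the solution is <= 0
            phi2 = max((zy ** 2 - sigma2 * zz[k]) / zz[k] ** 2, 0.0)
            denom = phi2 * zz[k] + sigma2
            new_gamma = phi2 * zy / denom             # Step 4: BLUP of the effect
            post_var[k] = phi2 * sigma2 / denom
            resid = y_adj - Z[:, k] * new_gamma
            gamma[k] = new_gamma
        shift = resid.mean()                          # refit the intercept
        mu += shift
        resid -= shift
        sigma2 = resid @ resid / n                    # Step 5 (simplified)
        if np.max(np.abs(gamma - old)) < tol:
            break
    return mu, gamma, post_var, sigma2

rng = np.random.default_rng(2)
n, m = 500, 20
Z = rng.binomial(2, 0.3, size=(n, m)).astype(float)
Z -= Z.mean(axis=0)                                   # centered genotypes
y = 10.0 + 1.5 * Z[:, 4] + rng.standard_normal(n)     # one simulated QTN

mu, gamma, post_var, sigma2 = sbl_sketch(Z, y)
```

Markers whose prior variance is truncated to zero drop out of the model, which is how the sparsity arises in this simplified version.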
2.2.4. Wald Test
To test the null hypothesis $H_0: \gamma_k = 0$ for each marker, we employ the Wald test statistic [31]:
$$W_k = \frac{\hat{\gamma}_k^2}{\mathrm{var}(\gamma_k \mid y^*)}$$
where $\hat{\gamma}_k$ is the estimate of the marker effect, and $\mathrm{var}(\gamma_k \mid y^*)$ is its conditional variance given the adjusted phenotype vector $y^*$.
Under the null hypothesis, $W_k$ approximately follows a $\chi^2$ distribution with one degree of freedom. The p-value for marker $k$ is therefore computed as
$$p_k = \Pr\left(\chi_1^2 > W_k\right)$$
This approximation holds when the standard error of $\hat{\gamma}_k$ is sufficiently small, allowing the random effect to be treated as effectively ‘fixed’ for testing purposes.
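As a small worked example of the test, the following Python sketch computes a Wald statistic and its one-degree-of-freedom chi-square p-value using only the standard library. The function name and the input values are illustrative assumptions.

```python
import math

def wald_p_value(effect, cond_var):
    """Wald statistic and chi-square (1 df) p-value for one marker effect."""
    w = effect ** 2 / cond_var
    # For 1 df, Pr(chi2 > w) equals erfc(sqrt(w / 2))
    p = math.erfc(math.sqrt(w / 2.0))
    return w, p

# Hypothetical estimate 0.8 with conditional variance 0.04 gives W = 16
w, p = wald_p_value(effect=0.8, cond_var=0.04)
```

The identity used for the tail probability holds because a chi-square variable with one degree of freedom is the square of a standard normal variable.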
For the comparison methods, the likelihood ratio test (LRT) [32,33,34] is employed, with the LOD score as the test statistic. In this study, the LOD score is converted to a p-value via the chi-square distribution for better visualization and comparison. The p-value for marker k in multi-locus analysis is calculated as $p_k = \Pr\left(\chi_1^2 > 2\ln(10) \times \mathrm{LOD}_k\right)$, where $2\ln(10) \approx 4.605$.
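The conversion can be sketched in Python under the standard relationship that the likelihood ratio statistic equals 2 ln(10) times the LOD score, with one degree of freedom assumed. The function name is illustrative.

```python
import math

def lod_to_p_value(lod):
    """Convert a LOD score to a chi-square (1 df) p-value."""
    lrt = 2.0 * math.log(10.0) * lod        # likelihood ratio statistic
    return math.erfc(math.sqrt(lrt / 2.0))  # Pr(chi2 with 1 df > lrt)

p = lod_to_p_value(3.0)   # a LOD score of 3 corresponds to p of about 2e-4
```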
2.3. Comparison Algorithm
To evaluate the performance of the ILSBL method, we conducted comparisons with several state-of-the-art algorithms in genetic association studies.
LASSO [19,24] is a variable selection method that adds a penalty term during model estimation. It can shrink the regression coefficients of unimportant variables to zero and thereby remove them from the model, achieving variable selection. The method effectively reduces data dimensionality and stabilizes the model in high-dimensional data analysis. LASSO was implemented using the R package biglasso (https://cran.r-project.org/web/packages/biglasso/vignettes/biglasso.html, version 1.6.1, accessed on 17 November 2025).
Elastic Net regression [21] combines L1 and L2 regularization for variable selection and for handling multicollinearity. By incorporating both regularization terms, it balances computational complexity and model bias. The method was implemented through the R package glmnet (https://cran.r-project.org/web/packages/glmnet/index.html, version 4.1-10, accessed on 17 November 2025).
SCAD [22] is a classical variable selection method whose regularization model simultaneously performs variable selection and parameter estimation. This method was implemented using the R package ncvreg (https://cran.r-project.org/web/packages/ncvreg/index.html, version 3.15.0, accessed on 17 November 2025).
Adaptive LASSO [23] is a mainstream variable selection method that applies adaptive weights to the individual coefficients in the L1 penalty, which gives more stable variable selection. This method was implemented via the R package glmnet (https://cran.r-project.org/web/packages/glmnet/index.html, version 4.1-10, accessed on 17 November 2025).
Expectation maximization Bayesian ridge regression (emRR) [35] assumes that all regression coefficients have equal variance. The regularization term arises automatically in the estimation process, which yields the posterior distribution of the parameters and avoids overfitting in large-scale likelihood estimation. This method was implemented via the R package bWGR (http://github.com/cran/bWGR, version 2.2.10, accessed on 23 March 2026).
2.4. Experimental Materials
2.4.1. Simulation Datasets
We conducted Monte Carlo simulation experiments to evaluate the performance of ILSBL compared with other methods. The simulated datasets were generated using an MLM framework containing 10,000 genetic variants. Genotypes were simulated according to minor allele frequencies (MAFs) ranging from 0.1 to 0.5 under Hardy–Weinberg equilibrium. The population mean was set to 10.0 with a residual variance of 10.0. To assess method performance under varying genetic architectures, we examined three polygenic background scenarios representing different levels of complexity: moderate (2 × polygenic variance), substantial (5 × polygenic variance) and extreme (10 × polygenic variance). This experimental design allows systematic evaluation of each method’s robustness to increasing polygenic complexity while controlling for population structure and estimating genetic parameters accurately. Two experiments were conducted: (1) in the first, we simulated a single QTN located at the 98th marker with a heritability of 0.1; (2) in the second, five QTNs were assigned heritabilities of 0.02, 0.05, 0.05, 0.08 and 0.10, respectively, and their genomic positions and corresponding effects are summarized in Tables S1–S3. Given the varying genetic structures across different species and populations, we further considered nine scenarios for each simulation, combining three levels of background noise (two-, five- and ten-times larger polygenic backgrounds) and three sample sizes (500, 1000 and 2000 individuals). Each simulation experiment was repeated 100 times.
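One replicate of the single-QTN design can be sketched in Python as follows. Several choices are assumptions made for illustration only: the polygenic variance is taken as a multiple of the residual variance, heritability is taken as the QTN's share of total phenotypic variance, independent normal draws stand in for the kinship-structured polygenic term, and the marker count is reduced to keep the example fast.

```python
import numpy as np

def simulate_genotypes(n, p, rng, maf_low=0.1, maf_high=0.5):
    maf = rng.uniform(maf_low, maf_high, size=p)
    # Under Hardy-Weinberg equilibrium the minor-allele count is Binomial(2, maf)
    return rng.binomial(2, maf, size=(n, p)), maf

def qtn_effect(h2, genotype_var, background_var, residual_var):
    """Effect size giving the QTN a share h2 of total variance (assumed definition)."""
    qtn_var = h2 * (background_var + residual_var) / (1.0 - h2)
    return np.sqrt(qtn_var / genotype_var)

rng = np.random.default_rng(1)
n, p = 500, 1000                      # reduced from 10,000 markers for speed
Z, maf = simulate_genotypes(n, p, rng)

residual_var = 10.0
background_var = 2.0 * residual_var   # assumption: "2 x" is relative to residual variance
qtn_index = 97                        # the 98th marker, zero-based
b = qtn_effect(0.1, Z[:, qtn_index].var(), background_var, residual_var)

u = rng.normal(0.0, np.sqrt(background_var), size=n)   # simplified polygenic term
e = rng.normal(0.0, np.sqrt(residual_var), size=n)
y = 10.0 + Z[:, qtn_index] * b + u + e                 # population mean of 10.0
```

Repeating this with different seeds, sample sizes and background multipliers would give the replicate-by-scenario grid described above.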
In the simulation studies, estimated effect, running time, power, false-positive rate (FPR), and mean squared error (MSE) were selected to evaluate the performance of all methods.
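MSE represents the accuracy of QTN effect estimation, calculated as
$$\mathrm{MSE}_i = \frac{1}{m_i}\sum_{j=1}^{m_i}\left(\hat{\beta}_{ij} - \beta_i\right)^2$$
where $m_i$ indicates the number of times the $i$th QTN was detected in the 100 replicates; $\hat{\beta}_{ij}$ indicates the effect estimate for the $i$th QTN in the $j$th replicate in which it was detected; and $\beta_i$ is its true (simulated) effect. A smaller MSE indicates higher model accuracy.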
Statistical power is defined as the proportion of significant QTNs detected across all replicates. Power is an important criterion for evaluating models and the higher the value the better the performance.
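These two criteria can be computed per QTN as in the short Python sketch below, assuming that power is the fraction of replicates in which the QTN is detected and that the MSE is averaged over those detections. The function name and the numbers are illustrative.

```python
def summarize_qtn(estimates, true_effect, n_replicates):
    """estimates: effect estimates from the replicates in which the QTN was detected."""
    m = len(estimates)
    power = m / n_replicates
    if m == 0:
        return power, float("nan")     # MSE is undefined without detections
    mse = sum((est - true_effect) ** 2 for est in estimates) / m
    return power, mse

# Hypothetical QTN with true effect 1.0, detected in 3 of 4 replicates
power, mse = summarize_qtn([0.9, 1.1, 1.2], true_effect=1.0, n_replicates=4)
```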
2.4.2. The Arabidopsis Datasets
In this study, the Arabidopsis thaliana dataset was used to further validate the new method. It comprises 1307 inbred lines (available at https://github.com/Gregor-Mendel-Institute/atpolydb, accessed on 3 December 2025) with 214,051 SNPs and 11 phenotypic traits. Three flowering-related traits were selected for analysis: (1) FT10top: the top biomass of the plant when it begins to flower at 10 °C; (2) FT10: number of flowering days of plants under 10 °C growth conditions; and (3) FlowerInterval_of_OuluFall: the total number of days from seeding to flowering in the autumn experiment in Oulu, Finland. The data were filtered by quality control with minor allele frequency (MAF) ≥ 0.01, resulting in 213,304 high-quality SNPs for analysis. The genome-wide marker density distribution (A) and allele frequency spectrum (B) are presented in Figure 2. After quality control and removal of missing data, the dataset contained 192, 625 and 61 accessions for FT10top, FT10, and FlowerInterval_of_OuluFall, respectively. Because the analysis assumes normally distributed phenotypes, we performed GWAS using the Box–Cox-transformed phenotypes in this study.
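The two preprocessing steps can be sketched in Python as follows. The sketch assumes genotypes coded as 0/1/2 allele counts and chooses the Box–Cox parameter by a grid search over the standard profile log-likelihood; the grid, the toy data and all names are assumptions, and the software used in the study is not specified here.

```python
import numpy as np

def maf_filter(G, threshold=0.01):
    """Boolean mask of markers whose minor allele frequency is >= threshold."""
    freq = G.mean(axis=0) / 2.0
    maf = np.minimum(freq, 1.0 - freq)
    return maf >= threshold

def box_cox(y, lam):
    """Box-Cox transform of strictly positive values."""
    return np.log(y) if abs(lam) < 1e-8 else (y ** lam - 1.0) / lam

def box_cox_lambda(y, grid=np.linspace(-2.0, 2.0, 41)):
    """Grid value maximizing the Box-Cox profile log-likelihood."""
    n, log_sum = len(y), np.log(y).sum()
    loglik = [-(n / 2.0) * np.log(box_cox(y, lam).var()) + (lam - 1.0) * log_sum
              for lam in grid]
    return float(grid[int(np.argmax(loglik))])

rng = np.random.default_rng(3)
rare = np.zeros(100)
rare[0] = 1                                   # one heterozygote: MAF = 0.005
common = rng.binomial(2, 0.3, size=100)
G = np.column_stack([rare, common])
keep = maf_filter(G)                          # the rare marker is dropped

y = np.exp(rng.normal(0.0, 0.8, size=300))    # skewed toy phenotype
lam = box_cox_lambda(y)
y_transformed = box_cox(y, lam)
```

For log-normal toy data such as this, the selected parameter should fall near zero, which corresponds to a log transform.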
3. Results
3.1. Experimental Results of Simulated Data
We compared the performance of ILSBL with established methods (LASSO, Elastic Net, SCAD, Adaptive LASSO and emRR) through Monte Carlo simulation experiments. The simulation dataset contains 2000 individuals and 10,000 genetic markers.
For the first simulation, a single causal QTN was fixed at the 98th marker with a heritability of 0.1. In terms of statistical power (Figure 3A), SCAD and ILSBL exhibited the highest detection power and showed robust performance across all scenarios. Although SCAD showed marginally higher power than ILSBL, its false-positive rate was approximately twice that of ILSBL or higher (Figure 3B). The power of Adaptive LASSO was moderate, followed by LASSO, Elastic Net and emRR. The ILSBL method also showed better control of false positives (FPR ≈ 2 × 10⁻⁴), representing a substantial improvement over SCAD and performing comparably to Adaptive LASSO and Elastic Net. For estimation accuracy (Figure 3C), ILSBL achieved the lowest MSE among all methods, followed by Adaptive LASSO. These two methods formed the top-performing group and significantly outperformed Elastic Net, SCAD, LASSO and emRR. Computational efficiency comparisons (Figure 3D) revealed that emRR, LASSO, Elastic Net, ILSBL and Adaptive LASSO ran within similar orders of magnitude. Among them, emRR was the most computationally efficient, although only its top 50 SNPs were considered for hypothesis testing; testing all of the more than 6000 significant SNPs would have taken over 10 h. As a multi-stage algorithm, ILSBL balances speed and scalability. It completed the analysis within 1 min for small samples and within 1.4 min for large datasets (n = 2000), making it 5.8 times faster than SCAD, which required 7.2 min for the same task.
For the second simulation experiment, five causal QTNs were assigned heritabilities from 0.02 to 0.10; their genomic positions and corresponding information are listed in Tables S1–S3. The statistical power for QTN2, which has a heritability of 0.05, is presented in Figure 4A. The statistical power of all methods increased with sample size, and all methods performed well when the sample size was 2000. Polygenic background is another important factor affecting statistical power; under extreme background conditions, the estimated power of all methods decreased to less than 30%. Among all methods, ILSBL showed an advantage in statistical power, most clearly at sample sizes of 500 and 1000, followed by Adaptive LASSO and then the remaining methods. A similar trend was observed for the other QTNs (Tables S1–S3). In terms of FPR, all methods showed the same pattern as in simulation 1. Unlike the other methods, which were markedly affected by polygenic background and other noise, ILSBL maintained good false-positive control, indicating that its polygenic and residual noise-whitening stage played an important role.
These results indicate that ILSBL is a method capable of balancing multiple indexes, including accurate effect estimation, strict type I error control, and computational efficiency. The observed performance degradation under elevated noise conditions not only highlights the inherent challenges associated with high-noise genomic studies, but also identifies potential opportunities for the further refinement of methodological approaches in future research.
3.2. Experimental Results of Real Data Analysis
To demonstrate the effectiveness of ILSBL on real genomic data, we applied the method to the Arabidopsis thaliana dataset, which contains 214,051 SNPs, for three flowering-related traits: FT10top, FT10, and FlowerInterval_of_OuluFall. We performed GWAS analyses using six methods on this dataset. The results of each method for FT10top are displayed as Manhattan plots in Figure 5, and the plots for the other two traits are shown in Figures S1 and S2. ILSBL detected more significant loci, with smaller p-values, than the other established methods. For the trait FT10top, the phenotypic variance explained (PVE) by ILSBL was 14.81%, followed by Adaptive LASSO at 10.89%. The PVE values of the other four methods were all below 10%: 6.07% for LASSO, 6.20% for Elastic Net, 5.37% for SCAD and <1% for emRR. ILSBL also achieved the highest PVE values for the other two traits (Figure S3).
All significant QTNs identified by each method were mapped to the Arabidopsis thaliana reference genome sequence, and gene annotation was performed within a window of ±10 kb upstream and downstream of each QTN. The annotated genes were then checked against the TAIR database (https://www.arabidopsis.org, accessed on 16 December 2025) to screen for previously reported genes associated with flowering traits, and the methods that detected each gene were recorded. The results are summarized in Table 1. Taking the FT10top trait as an example, ILSBL detected a total of 13 associated genes corresponding to 7 QTNs, which are mainly distributed on chromosomes 1, 3 and 5. For the other two flowering-related traits, FT10 and FlowerInterval_of_OuluFall, ILSBL detected 12 and 8 verified associated genes, respectively. Together, the three traits correspond to 33 verified genes. The total numbers of confirmed genes detected across all traits by the other five methods, LASSO, Elastic Net, SCAD, Adaptive LASSO and emRR, were 15, 6, 12, 9 and 6, respectively. ILSBL thus explained a substantially higher proportion of phenotypic variance and identified more previously reported flowering-related genes.
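The window-based annotation step can be illustrated with a short Python sketch that returns the genes overlapping a ±10 kb interval around a QTN position. The gene names and coordinates are hypothetical placeholders, not TAIR annotations, and the overlap rule is an assumption about how the window was applied.

```python
def genes_near_qtn(qtn_pos, genes, window=10_000):
    """Names of genes overlapping [qtn_pos - window, qtn_pos + window]."""
    lo, hi = qtn_pos - window, qtn_pos + window
    return [name for name, start, end in genes if start <= hi and end >= lo]

# Hypothetical (name, start, end) records on a single chromosome
genes = [
    ("geneA", 3_590_000, 3_593_500),
    ("geneB", 3_611_000, 3_612_000),
    ("geneC", 3_700_000, 3_702_000),
]
hits = genes_near_qtn(3_601_859, genes)   # geneA and geneB overlap the window
```

In practice the candidate list returned for each QTN would then be checked against the literature or a curated database, as described above.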
Several genes were identified by both ILSBL and the established algorithms (Table 1). For example, the AT1G78660 gene was identified by ILSBL as well as by Adaptive LASSO, LASSO and SCAD. Interestingly, ILSBL identified a gene cluster for the FT10top trait containing AT5G11260, AT5G11270, and AT5G11320, which was also detected by emRR. These genes are concentrated around the SNP at 3,601,859 bp on chromosome 5, suggesting that this region may be a key site regulating the FT10top trait. Additionally, the AT1G16780, AT3G13790 and AT5G01675 genes were detected by ILSBL for both the FT10top and FT10 traits. In this study, multiple methods were used to analyze flowering traits in Arabidopsis thaliana; the results are consistent with flowering traits being controlled by multiple genes and suggest that the genes above play an important role in their regulation.
4. Discussion
In this study, we developed ILSBL, a novel algorithm under the MLM framework, to address the limitations of existing methods in handling large-scale genomic data with complex population structures and polygenic backgrounds. First, it treats marker effects as random and employs the FASTmrEMMA model transformation to whiten the covariance structure of polygenic and environmental noise. Second, it uses the biglasso package for marker screening to achieve substantial dimensionality reduction. Finally, it applies SBL for effect estimation and employs Wald tests to evaluate the significance of potential QTNs. This multi-stage approach demonstrates robust performance across varying population structures and polygenic background scenarios and improves the accuracy of SNP effect estimation in GWAS. The computational efficiency of ILSBL is further supported by the biglasso package, which provides (1) memory-efficient and parallel computation; (2) hybrid screening rules that enhance the speed and quality of feature selection; and (3) scalable cross-validation for selecting the optimal λ.
In the present study, we applied six methods to analyze three flowering-related traits using 213,304 quality-controlled SNPs in Arabidopsis thaliana. The results showed that ILSBL detected 141 significant QTNs, among which 32 corresponded to previously confirmed flowering-related genes across the three traits, which is more than those identified by the other established methods. Notably, ILSBL identified a gene cluster containing AT5G11260, AT5G11270, and AT5G11320 on chromosome 5, suggesting that this region may be a key locus regulating the FT10top trait. In addition, ILSBL detected several core genes for both the FT10top and FT10 traits, highlighting their potential roles in flowering regulation. However, the sample sizes for FT10top, FT10, and FlowerInterval_of_OuluFall were relatively small and varied considerably due to the specific experimental conditions and environments at the trial field. Further validation using larger and more diverse populations is therefore recommended in future studies.
While genome-wide Bayesian models provide a powerful framework for integrating prior information in association studies, MCMC-based Bayesian methods are often restricted by intensive computational burdens in large-scale genomic data. By combining LASSO-based dimensionality reduction with sparse Bayesian learning, ILSBL achieves a good balance between estimation accuracy and computational efficiency. However, it is necessary to acknowledge the limitations of ILSBL. First, the computational efficiency of ILSBL decreases slightly when dealing with extremely large SNP datasets, as the LASSO-based screening step still requires a certain amount of computational resources for processing massive genotypic data. Second, in this study, ILSBL only considers the main effects; its application to epistatic interactions and genotype–environment interactions remains to be further explored and improved in future research.
In summary, ILSBL provides an efficient approach for GWAS in large-scale genomic studies, especially for addressing challenges from high noise, population structure, and polygenic background. It exhibits superior false-positive control and computational efficiency, making it well suited for genetic analyses of complex traits. Therefore, the proposed ILSBL represents a valuable alternative tool for the genetic dissection of complex traits.
Supplementary Materials
The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/math14071209/s1: Figure S1. Manhattan plot of six GWAS methods for trait FT10. Figure S2. Manhattan plot of six GWAS methods for trait FlowerInterval_of_OuluFall. Figure S3. The phenotypic variance explained (PVE) of the three flowering-related traits in Arabidopsis by six different GWAS methods: ILSBL, LASSO, Elastic Net, SCAD, Adaptive LASSO and emRR. Table S1. Comparison of the ILSBL, LASSO, Elastic Net, SCAD, Adaptive LASSO and emRR methods in simulation experiment 2 (multiple QTNs) with a sample size of 500. Table S2. Comparison of the ILSBL, LASSO, Elastic Net, SCAD, Adaptive LASSO and emRR methods in simulation experiment 2 (multiple QTNs) with a sample size of 1000. Table S3. Comparison of the ILSBL, LASSO, Elastic Net, SCAD, Adaptive LASSO and emRR methods in simulation experiment 2 (multiple QTNs) with a sample size of 2000.
Author Contributions
J.C. and J.Z. designed and supervised this study. J.W., J.C. and J.Z. wrote and revised the manuscript. J.W., J.L. and G.L. conducted all the experiments, analyzed the data and revised the manuscript. F.B., Y.W. and S.S. made all figures and forms. All authors participated in the review process. All authors have read and agreed to the published version of the manuscript.
Funding
This research was supported by the Humanities and Social Sciences Fund of the Ministry of Education (21YJC790011), the Innovation and Entrepreneurship Program of Nanjing Agricultural University (grant numbers 202410307228Y and 202510307021), and the National Natural Science Foundation of China (32270694).
Data Availability Statement
Acknowledgments
The authors would like to thank the editor and reviewers for their suggestions for improving the framework and language within this manuscript. The authors sincerely thank Nanjing Agricultural University for providing the research platform and continuous support.
Conflicts of Interest
The authors have no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| GWAS | Genome-wide association studies |
| SNP | Single nucleotide polymorphism |
| QTN | Quantitative trait nucleotide |
| LASSO | Least absolute shrinkage and selection operator |
| SCAD | Smoothly clipped absolute deviation |
| MLM | Mixed linear model |
| EMMA | Efficient mixed-model association |
| GEMMA | Genome-wide efficient mixed-model association |
| DRR | Ridge regression algorithm |
| ILSBL | Improved LASSO screening and sparse Bayesian learning algorithm |
| SBL | Sparse Bayesian learning |
| emRR | Expectation maximization Bayesian ridge regression |
| LRT | Likelihood ratio test |
| MAF | Minor allele frequencies |
| FPR | False-Positive Rate |
| MSE | Mean Squared Error |
| Chr | Chromosome |
| MCMC | Markov Chain Monte Carlo |
References
- Uffelmann, E.; Huang, Q.Q.; Munung, N.S.; de Vries, J.; Okada, Y.; Martin, A.R.; Martin, H.C.; Lappalainen, T.; Posthuma, D. Genome-wide association studies. Nat. Rev. Methods Primers 2021, 1, 59. [Google Scholar] [CrossRef]
- Visscher, P.M.; Wray, N.R.; Zhang, Q.; Sklar, P.; McCarthy, M.I.; Brown, M.A.; Yang, J. 10 Years of GWAS Discovery: Biology, Function, and Translation. Am. J. Hum. Genet. 2017, 101, 5–22. [Google Scholar] [CrossRef] [PubMed]
- Frayling, T.M.; Timpson, N.J.; Weedon, M.N.; Zeggini, E.; Freathy, R.M.; Lindgren, C.M.; Perry, J.R.B.; Elliott, K.S.; Lango, H.; Rayner, N.W.; et al. A common variant in the FTO gene is associated with body mass index and predisposes to childhood and adult obesity. Science 2007, 316, 889–894. [Google Scholar] [CrossRef] [PubMed]
- Wang, K.; Zhang, H.; Kugathasan, S.; Annese, V.; Bradfield, J.R.; Russell, R.K.; Sleiman, P.M.A.; Imielinski, M.; Glessner, J.; Hou, C.; et al. Diverse Genome-wide Association Studies Associate the IL12/IL23 Pathway with Crohn Disease. Am. J. Hum. Genet. 2009, 84, 399–405. [Google Scholar] [CrossRef] [PubMed]
- Ma, J.W.; Yang, J.; Zhou, L.S.; Ren, J.; Liu, X.X.; Zhang, H.; Yang, B.; Zhang, Z.Y.; Ma, H.B.; Xie, X.H.; et al. A Splice Mutation in the Gene Causes High Glycogen Content and Low Meat Quality in Pig Skeletal Muscle. PLoS Genet. 2014, 10, e1004710. [Google Scholar] [CrossRef] [PubMed]
- Fan, Q.C.; Wu, P.F.; Dai, G.J.; Zhang, G.X.; Zhang, T.; Xue, Q.; Shi, H.Q.; Wang, J.Y. Identification of 19 loci for reproductive traits in a local Chinese chicken by genome-wide study. Genet. Mol. Res. 2017, 16, 16019431.
- Demars, J.; Fabre, S.; Sarry, J.; Rossetti, R.; Gilbert, H.; Persani, L.; Tosser-Klopp, G.; Mulsant, P.; Nowak, Z.; Drobik, W.; et al. Genome-Wide Association Studies Identify Two Novel Mutations Responsible for an Atypical Hyperprolificacy Phenotype in Sheep. PLoS Genet. 2013, 9, e1003482.
- Lin, H.; Zhou, Z.; Zhao, J.; Zhou, T.; Bai, H.; Ke, Q.; Pu, F.; Zheng, W.; Xu, P. Genome-Wide Association Study Identifies Genomic Loci of Sex Determination and Gonadosomatic Index Traits in Large Yellow Croaker (Larimichthys crocea). Mar. Biotechnol. 2021, 23, 127–139.
- Zhao, K.; Tung, C.W.; Eizenga, G.C.; Wright, M.H.; Ali, M.L.; Price, A.H.; Norton, G.J.; Islam, M.R.; Reynolds, A.; Mezey, J.; et al. Genome-wide association mapping reveals a rich genetic architecture of complex traits in Oryza sativa. Nat. Commun. 2011, 2, 467.
- Huang, X.H.; Wei, X.H.; Sang, T.; Zhao, Q.A.; Feng, Q.; Zhao, Y.; Li, C.Y.; Zhu, C.R.; Lu, T.T.; Zhang, Z.W.; et al. Genome-wide association studies of 14 agronomic traits in rice landraces. Nat. Genet. 2010, 42, 961–976.
- Li, H.; Peng, Z.Y.; Yang, X.H.; Wang, W.D.; Fu, J.J.; Wang, J.H.; Han, Y.J.; Chai, Y.C.; Guo, T.T.; Yang, N.; et al. Genome-wide association study dissects the genetic architecture of oil biosynthesis in maize kernels. Nat. Genet. 2013, 45, 43–50.
- Chao, Z.F.; Chen, Y.Y.; Ji, C.; Wang, Y.L.; Huang, X.; Zhang, C.Y.; Yang, J.; Song, T.; Wu, J.C.; Guo, L.X.; et al. A genome-wide association study identifies a transporter for zinc uploading to maize kernels. EMBO Rep. 2023, 24, e55542.
- Zhang, J.; Chen, M.; Wen, Y.J.; Zhang, Y.; Lu, Y.N.; Wang, S.M.; Chen, J.C. A Fast Multi-Locus Ridge Regression Algorithm for High-Dimensional Genome-Wide Association Studies. Front. Genet. 2021, 12, 649196.
- Zhou, X.; Carbonetto, P.; Stephens, M. Polygenic Modeling with Bayesian Sparse Linear Mixed Models. PLoS Genet. 2013, 9, e1003264.
- Zhang, J.; Feng, J.Y.; Ni, Y.L.; Wen, Y.J.; Niu, Y.; Tamba, C.; Yue, C.; Song, Q.; Zhang, Y.M. pLARmEB: Integration of least angle regression with empirical Bayes for multilocus genome-wide association studies. Heredity 2017, 118, 517–524.
- Wen, Y.J.; Zhang, H.W.; Ni, Y.L.; Huang, B.; Zhang, J.; Feng, J.Y.; Wang, S.B.; Dunwell, J.M.; Zhang, Y.M.; Wu, R.L. Methodological implementation of mixed linear models in multi-locus genome-wide association studies. Brief. Bioinform. 2018, 19, 700–712.
- Boutorh, A.; Guessoum, A. Complex diseases SNP selection and classification by hybrid Association Rule Mining and Artificial Neural Network-based Evolutionary Algorithms. Eng. Appl. Artif. Intell. 2016, 51, 58–70.
- Yao, X.H.; Yan, J.W.; Risacher, S.; Moore, J.; Saykin, A.; Shen, L. Network-Based Genome Wide Study of Hippocampal Imaging Phenotype in Alzheimer’s Disease to Identify Functional Interaction Modules. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 6170–6174.
- Tibshirani, R. Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. Ser. B-Stat. Methodol. 1996, 58, 267–288.
- Zhang, J.; Yue, C.; Zhang, Y.M. Bias correction for estimated QTL effects using the penalized maximum likelihood method. Heredity 2012, 108, 396–402.
- Zou, H.; Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B-Stat. Methodol. 2005, 67, 301–320.
- Fan, J.Q.; Li, R.Z. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 2001, 96, 1348–1360.
- Zou, H. The adaptive lasso and its oracle properties. J. Am. Stat. Assoc. 2006, 101, 1418–1429.
- Zeng, Y.H.; Breheny, P. The biglasso Package: A Memory- and Computation-Efficient Solver for Lasso Model Fitting with Big Data in R. R J. 2020, 12, 6–19.
- Yu, J.; Pressoir, G.; Briggs, W.H.; Vroh Bi, I.; Yamasaki, M.; Doebley, J.F.; McMullen, M.D.; Gaut, B.S.; Nielsen, D.M.; Holland, J.B.; et al. A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nat. Genet. 2006, 38, 203–208.
- Zhang, Z.W.; Ersoz, E.; Lai, C.Q.; Todhunter, R.J.; Tiwari, H.K.; Gore, M.A.; Bradbury, P.J.; Yu, J.M.; Arnett, D.K.; Ordovas, J.M.; et al. Mixed linear model approach adapted for genome-wide association studies. Nat. Genet. 2010, 42, 355–360.
- Lippert, C.; Listgarten, J.; Liu, Y.; Kadie, C.M.; Davidson, R.I.; Heckerman, D. FaST linear mixed models for genome-wide association studies. Nat. Methods 2011, 8, 833–835.
- Liu, X.L.; Huang, M.; Fan, B.; Buckler, E.S.; Zhang, Z.W. Iterative Usage of Fixed and Random Effect Models for Powerful and Efficient Genome-Wide Association Studies. PLoS Genet. 2016, 12, e1005767.
- Jiang, L.D.; Zheng, Z.L.; Qi, T.; Kemper, K.E.; Wray, N.R.; Visscher, P.M.; Yang, J. A resource-efficient tool for mixed model association analysis of large-scale data. Nat. Genet. 2019, 51, 1749–1755.
- Wen, Y.J.; Zhang, Y.W.; Zhang, J.; Feng, J.Y.; Zhang, Y.M. The improved FASTmrEMMA and GCIM algorithms for genome-wide association and linkage studies in large mapping populations. Crop J. 2020, 8, 723–732.
- Wang, M.Y.; Xu, S.Z. A coordinate descent approach for sparse Bayesian learning in high dimensional QTL mapping and genome-wide association studies. Bioinformatics 2019, 35, 4327–4335.
- Kao, C.H.; Zeng, Z.B.; Teasdale, R.D. Multiple interval mapping for quantitative trait loci. Genetics 1999, 152, 1203–1216.
- Lander, E.; Kruglyak, L. Genetic dissection of complex traits: Guidelines for interpreting and reporting linkage results. Nat. Genet. 1995, 11, 241–247.
- Qin, H.; Guo, W.; Zhang, Y.M.; Zhang, T. QTL mapping of yield and fiber traits based on a four-way cross population in Gossypium hirsutum L. Theor. Appl. Genet. 2008, 117, 883–894.
- da Silva, F.A.; Viana, A.P.; Correa, C.C.G.; Santos, E.A.; de Oliveira, J.A.V.S.; Andrade, J.D.G.; Ribeiro, R.M.; Glória, L.S. Bayesian ridge regression shows the best fit for SSR markers in Psidium guajava among Bayesian models. Sci. Rep. 2021, 11, 13639.