An Improved LASSO Screening and Sparse Bayesian Learning Algorithm for GWAS

Wang, Jieru; Li, Jiaqi; Lin, Guo; Ban, Fengfei; Wu, Yinan; Su, Siyu; Zhang, Jin; Chen, Juncong

doi:10.3390/math14071209

by Jieru Wang ¹,†, Jiaqi Li ¹,†, Guo Lin ¹, Fengfei Ban ¹, Yinan Wu ¹, Siyu Su ¹, Jin Zhang ¹,* and Juncong Chen ²,*

¹ College of Sciences, Nanjing Agricultural University, Nanjing 210095, China

² School of Finance, Nanjing Agricultural University, Nanjing 210095, China

* Authors to whom correspondence should be addressed.

† These authors contributed equally to this work.

Mathematics 2026, 14(7), 1209; https://doi.org/10.3390/math14071209

Submission received: 5 February 2026 / Revised: 14 March 2026 / Accepted: 31 March 2026 / Published: 3 April 2026

(This article belongs to the Section E3: Mathematical Biology)

Abstract

Genome-wide association studies (GWASs) are powerful and flexible tools for identifying single nucleotide polymorphisms (SNPs) associated with quantitative traits (e.g., yield and stress resistance) in plants. Variable selection and machine learning are two effective approaches in GWASs. However, both face limitations when analyzing complex, noisy data in the big-data era. In this study, we integrated variable selection and machine learning under the mixed linear model framework and propose a novel method, the improved LASSO screening and sparse Bayesian learning algorithm (ILSBL). ILSBL first corrects for polygenic and environmental noise, then reduces genotypic dimensionality by LASSO-based variable selection, and finally performs parameter estimation using sparse Bayesian learning. Two simulation experiments and association analyses of three flowering-time-related traits in Arabidopsis thaliana were conducted to validate the new algorithm. The results showed that, compared to established methods, ILSBL exhibited flexibility in simulation studies and maintained robust performance under complex genetic backgrounds, achieving a favorable balance among statistical power, parameter estimation accuracy, runtime efficiency, and false-positive rate. The analysis of the real Arabidopsis datasets further confirmed the advantages of ILSBL for GWASs, with 30 candidate genes adjacent to significant quantitative trait nucleotides (QTNs) associated with flowering-related traits. These results provide valuable insights for a better understanding of the genetic basis underlying flowering-related traits in Arabidopsis.

Keywords:

GWAS; SBL; BigLASSO; mixed linear model; candidate genes

MSC:

92-08

1. Introduction

Genome-wide association studies (GWASs) have been widely applied in genetic research on humans, animals and plants, serving as a powerful tool for dissecting the genetic mechanisms of complex traits and diseases [1]. Over recent decades, they have successfully identified numerous trait-associated loci in various species, including humans [1,2,3,4], animals [5,6,7,8], and plants [9,10,11,12,13], significantly enhancing our understanding of polygenic inheritance and facilitating advancements in precision medicine and molecular breeding. With the rapid advancement of next-generation sequencing technologies, contemporary GWASs now routinely analyze datasets containing millions of single nucleotide polymorphisms (SNPs), which introduces new computational challenges. Traditional analytical approaches face substantial limitations in this high-dimensional setting, particularly regarding computational efficiency and statistical power for detecting minor genetic effects. To address these challenges, a variety of novel and robust methodologies have been developed, such as the Bayesian sparse linear mixed model [14], pLARmEB [15] and FASTmrEMMA [16].

Numerous studies have demonstrated that most quantitative traits are primarily governed by a small proportion of quantitative trait nucleotides (QTNs) [15,17,18]. This biological basis supports the application of variable selection algorithms for identifying phenotype-associated QTNs, thereby effectively addressing the overfitting challenge in high-dimensional data analysis. For high-dimensional datasets with limited samples, various variable selection methods have been developed, with the Least Absolute Shrinkage and Selection Operator (LASSO) [19] representing a fundamental approach that combines feature selection with ℓ₁-norm regularization to constrain the residual sum of squares while assigning non-zero coefficients to influential features. However, because of the high false discovery rate of LASSO [20], alternative methods have been proposed, including Elastic Net [21], Smoothly Clipped Absolute Deviation (SCAD) [22], Adaptive LASSO [23], and BigLASSO [24]. Although these advanced methods have been successfully applied in GWASs, their performance remains constrained in scenarios involving complex, noisy polygenic backgrounds and extremely high-dimensional genetic data.

The mixed linear model (MLM) framework [25,26], which incorporates both fixed effects (Q) and random genetic effects (K), has become a fundamental approach in GWASs by substantially increasing the detection power for QTNs. This methodological advancement was further refined through efficient mixed-model association algorithms, such as efficient mixed-model association (EMMA) and genome-wide efficient mixed-model association (GEMMA), which explicitly model the polygenic background as random effects. Subsequent methodological innovations, including FaST-LMM [27], FarmCPU [28] and FastGWA [29], have built on this foundation and been widely adopted in genomic analyses. However, most of these methods are single-marker analytical methods, and the commonly employed Bonferroni correction tends to weaken the detection of important loci in GWASs.

The two-stage methodology addresses these limitations by first applying single-marker scanning for dimensionality reduction, followed by multi-locus analysis for estimation and testing, which significantly enhances both the efficiency and accuracy of GWASs. Several innovative implementations of this approach have been developed: pLARmEB [15] utilizes least angle regression for SNP screening before empirical Bayes analysis; FASTmrEMMA [16] identifies putative QTNs (p < 0.005) for inclusion in multi-trait analyses while adopting relaxed significance thresholds as an alternative to Bonferroni correction; FastRR [13] combines correlation-based prescreening with an advanced multi-locus ridge regression algorithm (DRR) for parameter estimation. Comparative analyses demonstrate that these two-stage approaches outperform conventional GWAS methods in estimation accuracy, computational cost, and stability across varying polygenic backgrounds and sample sizes, while maintaining robust statistical power.

In this study, we propose a novel multi-stage GWAS approach, termed improved LASSO screening and sparse Bayesian learning (ILSBL), which synergistically integrates variable selection and Bayesian estimation. This method effectively accounts for noise from population structure and polygenic background, and enables efficient detection of QTNs. We comprehensively evaluated the performance of ILSBL in comparison with LASSO, SCAD and other existing methods using both simulated datasets and real Arabidopsis thaliana data.

2. Methods and Materials

2.1. Genetic Model

Consider the following mixed linear model. Let $y \in \mathbb{R}^{n}$ denote the phenotype vector for $n$ individuals, and let $Z \in \mathbb{R}^{n \times p}$ represent the genotype matrix for $p$ genetic markers (with $p \gg n$). The model is specified as follows:

$$y = W\alpha + Z\gamma + u + \varepsilon \tag{1}$$

where $y = (y_{1}, y_{2}, \dots, y_{n})^{T}$, and $y_{i}$ $(i = 1, 2, \dots, n)$ is the phenotypic value of the $i$th individual out of $n$ individuals; $\alpha$ is a $c \times 1$ vector of fixed effects, including the intercept, population structure effect and so on; $W$ is the corresponding $n \times c$ design matrix for $\alpha$; $Z$ is the $n \times p$ matrix of marker genotypes; $\gamma \sim MVN(0, \sigma_{\gamma}^{2} I_{p})$ is a $p \times 1$ vector of random marker effects; $\sigma_{\gamma}^{2}$ is the variance of $\gamma$; $u \sim MVN(0, \sigma_{g}^{2} K)$ is an $n \times 1$ random vector of polygenic effects; $\sigma_{g}^{2}$ is the variance of the polygenic background; $K$ is a known $n \times n$ genetic relationship matrix between individuals; $\varepsilon \sim MVN(0, \sigma^{2} I_{n})$ is an $n \times 1$ vector of residual errors; $\sigma^{2}$ is the variance of the residual error; and $I_{n}$ is an $n \times n$ identity matrix. $MVN$ denotes a multivariate normal distribution.

As $\gamma$ is treated as a random effect, the variance of $y$ in Model (1) is as follows:

$$\mathrm{var}(y) = \sigma_{\gamma}^{2} Z Z^{T} + \sigma_{g}^{2} K + \sigma^{2} I_{n} = \sigma^{2} (\lambda_{\gamma} Z Z^{T} + \lambda_{g} K + I_{n}) \tag{2}$$

where $\lambda_{\gamma} = \sigma_{\gamma}^{2} / \sigma^{2}$ is the ratio of genetic variance to the variance of residual error, and $\lambda_{g} = \sigma_{g}^{2} / \sigma^{2}$ is the ratio between the variance of the polygenic background and the variance of the residual error.

2.2. Improved LASSO Screening and Sparse Bayesian Learning (ILSBL)

The ILSBL method is an innovative multi-stage methodology for GWAS, which simultaneously performs effect size estimation and statistical testing. The algorithm operates through three computationally efficient phases (see Figure 1).

2.2.1. Polygenic and Residual Noise-Whitening Stage

We adopt the model transformation approach from FASTmrEMMA [16] for whitening polygenic effects and residual noise, aiming to address the computational challenges in GWASs. Given that most markers are not associated with the target trait, we simplify the estimation by assuming that $\lambda_{\gamma} = 0$ and estimate $\lambda_{g}$ from a reduced Model (1), which removes $Z\gamma$. The estimate $\hat{\lambda}_{g}$ then replaces $\lambda_{g}$ in (2) [16,30], avoiding time-consuming re-estimation of $\lambda_{g}$ in each single-marker scan. The variance components were estimated via restricted maximum likelihood using an optimized EMMA-based algorithm, following FASTmrEMMA [16]. Thus,

$$\mathrm{var}(y) = \sigma^{2} (\lambda_{\gamma} Z Z^{T} + \hat{\lambda}_{g} K + I_{n}) = \sigma^{2} (\lambda_{\gamma} Z Z^{T} + B) \tag{3}$$

where $B = \hat{\lambda}_{g} K + I_{n}$. The eigen (or spectral) decomposition of $B$ is

$$B = Q \Lambda Q^{T} = (Q \Lambda^{\frac{1}{2}} Q^{T}) (Q \Lambda^{\frac{1}{2}} Q^{T}) \tag{4}$$

where $Q$ is orthogonal and $\Lambda$ is the diagonal matrix of eigenvalues. This allows us to transform the original model into the following model:

$$y_{C} = W_{C} \alpha + Z_{C} \gamma + \varepsilon_{C} \tag{5}$$

where $y_{C} = C y$, $W_{C} = C W$, $Z_{C} = C Z$, $C = Q \Lambda^{-\frac{1}{2}} Q^{T}$ and $\varepsilon_{C} = C u + C \varepsilon \sim MVN(0, \sigma^{2} I_{n})$ [13,30]. This transformation maintains the original model structure while significantly reducing the computational burden by avoiding repeated estimation of $\lambda_{g}$ for each marker. By handling different polygenic backgrounds efficiently while preserving the statistical properties of the model, it lays a solid foundation for the subsequent genetic analyses.
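The whitening transformation in Equations (3)–(5) can be illustrated with a minimal NumPy sketch. It assumes that an estimate of $\lambda_{g}$ is already available (the REML step is not reproduced here); the function name `whitening_matrix`, the toy kinship matrix and the value of `lam_g_hat` are illustrative assumptions, not part of the published implementation.

```python
import numpy as np

def whitening_matrix(K, lam_g_hat):
    """Return C = Q Lambda^{-1/2} Q^T for B = lam_g_hat * K + I_n (Eqs. 3-5)."""
    n = K.shape[0]
    B = lam_g_hat * K + np.eye(n)
    # B is symmetric, so eigh returns real eigenvalues and an orthonormal Q.
    eigvals, Q = np.linalg.eigh(B)
    # Dividing Q by sqrt(eigvals) scales column j by eigvals[j]^{-1/2}.
    return (Q / np.sqrt(eigvals)) @ Q.T

# Toy example: a kinship-like matrix built from random data (illustrative only).
rng = np.random.default_rng(0)
G = rng.standard_normal((30, 200))
K = G @ G.T / G.shape[1]
lam_g_hat = 2.0  # assumed value; in ILSBL it would come from the REML step
C = whitening_matrix(K, lam_g_hat)
B = lam_g_hat * K + np.eye(K.shape[0])
# The covariance of the transformed noise, C B C^T, should be the identity.
whitened_cov = C @ B @ C.T
```

Applying `C` to $y$, $W$ and $Z$ then gives $y_{C}$, $W_{C}$ and $Z_{C}$; because the decomposition is computed once, the cost is not repeated per marker.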

2.2.2. BigLASSO Screening

We use efficient LASSO regression [19] for large-scale dimensionality reduction of the genomic data. Formally, the optimization problem is formulated as

$$\arg \min_{\gamma} (y_{C} - Z_{C} \gamma)^{T} (y_{C} - Z_{C} \gamma) + \lambda \|\gamma\|_{1} \tag{6}$$

where $y_{C} \in \mathbb{R}^{n}$ is the transformed phenotype vector; $Z_{C} \in \mathbb{R}^{n \times p}$ is the transformed genotype matrix; $\gamma \in \mathbb{R}^{p}$ is the sparse coefficient vector; and $\lambda$ controls the strength of the $\ell_{1}$-norm penalty. The optimal $\lambda$ is the value that minimizes the 5-fold cross-validation error in the BigLASSO step.

The analysis was implemented using the R package biglasso [24], which provides computational efficiency and scalability for genome-wide large-scale data. Because LASSO estimates effect sizes with low accuracy, we use the third stage (sparse Bayesian learning, Section 2.2.3) to improve parameter estimation. Markers with non-zero coefficient estimates from the BigLASSO process were retained and passed to the subsequent analysis.
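To make the screening step concrete, the sketch below minimizes the objective in Equation (6) by plain cyclic coordinate descent with soft-thresholding. It is a stand-in for illustration only: it does not use biglasso, it omits the screening rules and the 5-fold cross-validation, and the penalty `lam`, the function names and the toy data are assumptions.

```python
import numpy as np

def soft_threshold(x, t):
    """Soft-thresholding operator: sign(x) * max(|x| - t, 0)."""
    return np.sign(x) * max(abs(x) - t, 0.0)

def lasso_screen(y_c, Z_c, lam, max_sweeps=200, tol=1e-8):
    """Minimise ||y_c - Z_c g||^2 + lam * ||g||_1 by cyclic coordinate descent.

    Returns the coefficient vector and the indices of non-zero coefficients.
    """
    p = Z_c.shape[1]
    gamma = np.zeros(p)
    resid = np.asarray(y_c, dtype=float).copy()   # residual when gamma = 0
    col_ss = np.einsum("ij,ij->j", Z_c, Z_c)      # z_j^T z_j for each column
    for _ in range(max_sweeps):
        max_change = 0.0
        for j in range(p):
            if col_ss[j] == 0.0:
                continue
            # z_j^T (partial residual with the effect of marker j added back)
            rho = Z_c[:, j] @ resid + col_ss[j] * gamma[j]
            # The penalty is not halved in Eq. (6), so the threshold is lam / 2.
            new = soft_threshold(rho, lam / 2.0) / col_ss[j]
            delta = new - gamma[j]
            if delta != 0.0:
                resid -= Z_c[:, j] * delta
                gamma[j] = new
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return gamma, np.flatnonzero(gamma)

# Toy data: 100 individuals, 50 markers, one causal marker at index 7.
rng = np.random.default_rng(1)
Z_toy = rng.standard_normal((100, 50))
y_toy = 3.0 * Z_toy[:, 7] + rng.standard_normal(100)
gamma_hat, selected = lasso_screen(y_toy, Z_toy, lam=80.0)
```

The retained indices in `selected` play the role of the markers passed on to the next stage; the shrunken values in `gamma_hat` illustrate why the effects are re-estimated afterwards.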

2.2.3. Sparse Bayesian Learning

The SBL algorithm [31] simultaneously estimates all marker effects within a multiple-locus model using a coordinate descent approach. By iteratively updating marker-specific prior variances and effects under an $\ell_{2}$-norm penalty, this method achieves efficient variable selection and handles very large sample sizes (e.g., >100,000 individuals) without large matrix operations. The approach is implemented in the R package sbl, offering greater statistical power and improved computational scalability compared to existing GWAS and QTN mapping methods.

The phenotypic values are modeled as a linear model of SNP effects with a Gaussian residual error:

$$y = X \beta + Z \gamma + \varepsilon, \quad \varepsilon \sim N(0, \sigma^{2} I_{n}) \tag{7}$$

with priors $\gamma_{k} \sim N(0, \phi_{k}^{2})$ and $p(\phi_{k}^{2}) \propto (\phi_{k}^{2})^{-(\tau + 2)/2}$, where $\tau$ controls the sparsity. Here, $\beta$ is a fixed effect and $\gamma$ is a random effect.

First, initialize the parameters as follows:

$$\beta_{l} = 0, \quad \gamma_{k} = 0, \quad \sigma^{2} = 1 \tag{8}$$

Then, update the parameters:

Step 1: Calculate the fixed effect $\beta_{l}$ by least squares.

Step 2: Compute the adjusted phenotype:

$$y_{k} = y - \sum_{l = 1}^{q} X_{l} \beta_{l} - \sum_{k' \neq k} Z_{k'} \gamma_{k'} \tag{9}$$

Step 3: Update the prior variance $\phi_{k}^{2}$ by solving the following equation:

$$-(\tau + 3) s_{k}^{2} (\phi_{k}^{2})^{2} - [(2\tau + 5) s_{k} - b_{k}^{2}] \phi_{k}^{2} - (\tau + 2) = 0 \tag{10}$$

where $s_{k} = Z_{k}^{T} Z_{k} / \sigma^{2}$ and $b_{k} = Z_{k}^{T} y_{k} / \sigma^{2}$. If the solution is $\leq 0$, set $\phi_{k}^{2} = 0$.

Step 4: Update the random effect $\gamma_{k}$ via BLUP:

$$\hat{\gamma}_{k} = \lambda_{k} Z_{k}^{T} y_{k} - \frac{\lambda_{k}^{2} (Z_{k}^{T} Z_{k}) (Z_{k}^{T} y_{k})}{\lambda_{k} Z_{k}^{T} Z_{k} + 1}, \quad \lambda_{k} = \phi_{k}^{2} / \sigma^{2} \tag{11}$$

where the variance of $\hat{\gamma}_{k}$ is

$$\mathrm{var}(\hat{\gamma}_{k} \mid y_{k}) = \left[\lambda_{k} - \lambda_{k}^{2} \left(Z_{k}^{T} Z_{k} - \lambda_{k} \frac{(Z_{k}^{T} Z_{k})^{2}}{\lambda_{k} Z_{k}^{T} Z_{k} + 1}\right)\right] \sigma^{2}.$$

Step 5: Update the residual variance $\sigma^{2}$:

$$\begin{aligned} m_{0} &= \sum_{k = 1}^{m} \lambda_{k} \left(Z_{k}^{T} Z_{k} - \frac{\lambda_{k} Z_{k}^{T} Z_{k} Z_{k}^{T} Z_{k}}{\lambda_{k} Z_{k}^{T} Z_{k} + 1}\right) \\ \sigma^{2} &= \frac{\left(y - \sum_{l = 1}^{q} X_{l} \hat{\beta}_{l} - \sum_{k = 1}^{m} Z_{k} \hat{\gamma}_{k}\right)^{T} \left(y - \sum_{l = 1}^{q} X_{l} \hat{\beta}_{l} - \sum_{k = 1}^{m} Z_{k} \hat{\gamma}_{k}\right)}{n - q - m_{0}} \end{aligned} \tag{12}$$

The iteration proceeds until convergence is achieved or the maximum number of iterations is reached. The SBL algorithm simultaneously analyzes all genetic markers with robust statistical power, demonstrating reduced sensitivity to stringent significance thresholds while providing accurate estimates of effect sizes. This joint modeling approach effectively addresses the limitations of single-marker testing in high-dimensional genomic data ($p \gg n$), offering improved detection of variants with modest effects through its hierarchical Bayesian framework.
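A minimal NumPy sketch of Steps 1–5 is given below. It is not the code of the R package sbl. It assumes $\tau = -2$, in which case Equation (10) loses its constant term and its non-trivial root is $\phi_{k}^{2} = (b_{k}^{2} - s_{k}) / s_{k}^{2}$; other values of $\tau$ would require solving the full quadratic and choosing a root, which is not attempted here. The function name `sbl_fit`, the convergence rule and the toy data are illustrative assumptions, and every column of `Z` is assumed to be non-zero.

```python
import numpy as np

def sbl_fit(y, X, Z, max_iter=200, tol=1e-8):
    """Coordinate-wise SBL updates following Steps 1-5, assuming tau = -2."""
    n, q = X.shape
    m = Z.shape[1]
    beta, gamma, phi2 = np.zeros(q), np.zeros(m), np.zeros(m)
    sigma2 = 1.0                                   # initial values, Eq. (8)
    zz = np.einsum("ij,ij->j", Z, Z)               # Z_k^T Z_k for every marker
    for _ in range(max_iter):
        gamma_old = gamma.copy()
        # Step 1: fixed effects by least squares after removing marker effects.
        beta = np.linalg.lstsq(X, y - Z @ gamma, rcond=None)[0]
        resid = y - X @ beta - Z @ gamma
        for k in range(m):
            # Step 2: adjusted phenotype y_k (marker k added back), Eq. (9).
            y_k = resid + Z[:, k] * gamma[k]
            zy = Z[:, k] @ y_k
            # Step 3: with tau = -2, Eq. (10) gives phi2 = (b^2 - s) / s^2.
            s, b = zz[k] / sigma2, zy / sigma2
            phi2[k] = max((b * b - s) / (s * s), 0.0)
            # Step 4: BLUP of gamma_k, algebraically equal to Eq. (11).
            lam = phi2[k] / sigma2
            gamma[k] = lam * zy / (lam * zz[k] + 1.0)
            resid = y_k - Z[:, k] * gamma[k]
        # Step 5: residual variance, Eq. (12); m0 simplifies term by term.
        lam_all = phi2 / sigma2
        m0 = np.sum(lam_all * zz / (lam_all * zz + 1.0))
        sigma2 = resid @ resid / (n - q - m0)
        if np.max(np.abs(gamma - gamma_old)) < tol:
            break
    lam_all = phi2 / sigma2
    var_gamma = lam_all * sigma2 / (lam_all * zz + 1.0)  # var(gamma_k | y_k)
    return beta, gamma, var_gamma, sigma2

# Toy data: intercept 1, one causal marker (index 3) with effect 2.
rng = np.random.default_rng(2)
n_toy, m_toy = 200, 20
X_toy = np.ones((n_toy, 1))
Z_toy = rng.standard_normal((n_toy, m_toy))
y_toy = 1.0 + 2.0 * Z_toy[:, 3] + rng.standard_normal(n_toy)
beta_hat, gamma_hat, var_hat, sigma2_hat = sbl_fit(y_toy, X_toy, Z_toy)
```

The inner loop keeps a running residual, so each marker update costs one pass over a single column rather than a large matrix operation, which is the property the text attributes to the coordinate descent formulation.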

2.2.4. Wald Test

To test the null hypothesis $H_{0}: \gamma_{k} = 0$ for each marker, we employ the Wald test statistic [31]:

$$W_{k} = \frac{\hat{\gamma}_{k}^{2}}{\mathrm{var}(\hat{\gamma}_{k} \mid y_{k})} \tag{13}$$

where $\hat{\gamma}_{k}$ is the estimate of the marker effect, and $\mathrm{var}(\hat{\gamma}_{k} \mid y_{k})$ is its conditional variance given the adjusted phenotype vector $y_{k}$.

Under the null hypothesis, $W_{k}$ approximately follows a $\chi^{2}$ distribution with one degree of freedom. The p-value for marker $k$ is therefore computed as

$$p_{k} = 1 - \Pr(\chi_{1}^{2} \leq W_{k}) \tag{14}$$

This approximation holds when the standard error of $\hat{\gamma}_{k}$ is sufficiently small, allowing the random effect to be treated as effectively ‘fixed’ for testing purposes.

For the comparison methods, the Likelihood Ratio Test (LRT) [32,33,34] is employed, with the LOD score as the test statistic. In this study, the LOD score is converted to a p-value via the Chi-square distribution for better visualization and comparison. The p-value for marker $k$ in multi-locus analysis is calculated as $p_{k} = \Pr(\chi_{2}^{2} > \mathrm{LOD} \times 4.605)$.
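Both p-value conversions have closed forms, which the following standard-library sketch uses. For one degree of freedom, $\Pr(\chi_{1}^{2} > w) = \mathrm{erfc}(\sqrt{w/2})$; for two degrees of freedom, $\Pr(\chi_{2}^{2} > x) = e^{-x/2}$, and since $4.605 \approx 2 \ln 10$ the LOD conversion is approximately $10^{-\mathrm{LOD}}$. The function names are illustrative, and returning 1.0 for a marker with zero variance (i.e., $\phi_{k}^{2} = 0$) is an assumption of this sketch rather than a rule stated in the text.

```python
import math

def wald_p_value(gamma_hat, var_gamma):
    """p-value of W = gamma_hat^2 / var under chi-square with 1 d.f. (Eqs. 13-14)."""
    if var_gamma <= 0.0:
        # Assumption: a marker shrunk exactly to zero is treated as non-significant.
        return 1.0
    w = gamma_hat ** 2 / var_gamma
    # Pr(chi2_1 > w) = Pr(|N(0,1)| > sqrt(w)) = erfc(sqrt(w / 2)).
    return math.erfc(math.sqrt(w / 2.0))

def lod_p_value(lod):
    """p-value Pr(chi2_2 > 4.605 * LOD); the chi-square(2) upper tail is exp(-x / 2)."""
    return math.exp(-4.605 * lod / 2.0)

p_wald = wald_p_value(1.959964, 1.0)   # W is about 3.84, the 5% critical value
p_lod = lod_p_value(3.0)               # close to 10 ** -3
```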

2.3. Comparison Algorithm

To evaluate the performance of the ILSBL method, we conducted comparisons with several state-of-the-art algorithms in genetic association studies.

LASSO [19,24] is a variable selection method that adds a penalty term during model estimation. It can shrink the regression coefficient of unimportant variables to zero and then remove them from the model, thereby achieving the purpose of variable selection. The method effectively reduces data dimensionality and guarantees stability of the model in high-dimensional data analysis. LASSO was implemented using the R program package biglasso (https://cran.r-project.org/web/packages/biglasso/vignettes/biglasso.html, 1.6.1 version, accessed on 17 November 2025).

Elastic Net Regression [21] combines L1 and L2 regularization for variable selection and handling multicollinearity. It balances computational complexity and model bias by incorporating both L1 and L2 regularization terms. The method was implemented through the R package glmnet (https://cran.r-project.org/web/packages/glmnet/index.html, 4.1-10 version, accessed on 17 November 2025).

SCAD [22] is a classical variable selection method and its corresponding regularization model can simultaneously perform variable selection and parameter estimation. This method was implemented using the R package ncvreg (https://cran.r-project.org/web/packages/ncvreg/index.html; Version 3.15.0, accessed on 17 November 2025).

Adaptive Lasso [23] is a mainstream variable selection method that uses adaptive weights to penalize different coefficients in the L1 penalty. It exhibits higher stability in variable selection for data analysis. This method was implemented via the R package glmnet (https://cran.r-project.org/web/packages/glmnet/index.html, 4.1-10 version, accessed on 17 November 2025).

Expectation maximization Bayesian ridge regression (emRR [35]) assumes that all regression coefficients have equal variance. It introduces the regularization term automatically during estimation and yields the posterior distribution of the parameters, avoiding overfitting in large-scale likelihood estimation. This method was implemented via the R package bWGR (http://github.com/cran/bWGR, 2.2.10 version, accessed on 23 March 2026).

2.4. Experimental Materials

2.4.1. Simulation Datasets

We conducted Monte Carlo simulation experiments to evaluate the performance of ILSBL compared with other methods. The simulated datasets were generated using an MLM framework containing p = 10,000 genetic variants. Genotypes were simulated according to minor allele frequencies (MAFs) ranging from 0.1 to 0.5 under Hardy–Weinberg equilibrium. The population mean was set to 10.0 with a residual variance of 10.0. To thoroughly assess method performance under varying genetic architectures, we examined three distinct polygenic background scenarios representing different levels of complexity: moderate (2 × polygenic variance), substantial (5 × polygenic variance) and extreme (10 × polygenic variance) polygenic-background conditions. This experimental design allows systematic evaluation of each method’s robustness to increasing polygenic complexity while controlling for population structure and estimating accurate genetic parameters. (1) In the first experiment, we simulated a QTN located on the 98th marker with 0.1 heritability; (2) in the second simulation, five QTNs were assigned heritabilities of 0.02, 0.05, 0.05, 0.08, and 0.10, respectively, and their genomic positions and corresponding effects are summarized in Tables S1–S3. Given the varying genetic structures across different species and populations, we further considered nine scenarios for each simulation, combining three levels of background noise (two-, five-, and ten-times larger polygenic backgrounds) and three sample sizes (500, 1000, and 2000 individuals). Each simulation experiment was repeated 100 times.
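The genotype-generation step can be sketched as follows. Under Hardy–Weinberg equilibrium, the number of minor alleles carried by a diploid individual at a marker follows a Binomial(2, MAF) distribution. The text only gives the MAF range, so drawing MAFs uniformly from [0.1, 0.5], the 0/1/2 allele-count coding, independence between markers and the function name `simulate_genotypes` are all assumptions of this sketch.

```python
import numpy as np

def simulate_genotypes(n, p, maf_low=0.1, maf_high=0.5, seed=0):
    """Simulate an n x p matrix of minor-allele counts (0, 1, 2) under HWE."""
    rng = np.random.default_rng(seed)
    # Assumption: one MAF per marker, drawn uniformly from the stated range.
    maf = rng.uniform(maf_low, maf_high, size=p)
    # Under HWE the minor-allele count per individual is Binomial(2, MAF);
    # maf broadcasts across the n rows.
    Z = rng.binomial(2, maf, size=(n, p))
    return Z, maf

Z_sim, maf_sim = simulate_genotypes(n=2000, p=50)
```

With `n` set to 500, 1000 or 2000 and `p` to 10,000, this reproduces the dimensions of the simulated designs, though not the polygenic background or the QTN effects, whose generating formulas are not given in the text.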

In the simulation studies, estimated effect, running time, power, false-positive rate (FPR), and mean squared error (MSE) were selected to evaluate the performance of all methods.

MSE represents the accuracy of QTN effect estimation, calculated as

$$MSE = \frac{1}{N} \sum_{j = 1}^{N} (\hat{\gamma}_{ij} - \gamma_{i})^{2} \tag{15}$$

where $N$ indicates the number of times a QTN was detected in the 100 replicates; $\hat{\gamma}_{ij}$ indicates the effect estimate for the $i$th QTN in the $j$th replicate in which it was detected; and $\gamma_{i}$ is the true effect of the $i$th QTN. A smaller MSE indicates higher model accuracy.

Statistical power is defined as the proportion of significant QTNs detected across all replicates. Power is an important criterion for evaluating models and the higher the value the better the performance.
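The two evaluation metrics can be written as small helper functions. This sketch takes the per-QTN reading of the definitions above: power is the fraction of replicates in which a given QTN is declared significant, and MSE is averaged only over the replicates in which that QTN was detected. The function names and the example numbers are illustrative, not results from the study.

```python
def qtn_mse(estimates, true_effect):
    """Eq. (15): mean squared error over the N replicates where the QTN was detected."""
    if not estimates:
        return float("nan")  # QTN never detected, MSE is undefined
    return sum((g - true_effect) ** 2 for g in estimates) / len(estimates)

def detection_power(detected_flags):
    """Proportion of replicates in which the QTN was declared significant."""
    return sum(bool(f) for f in detected_flags) / len(detected_flags)

# Hypothetical example: a QTN with true effect 1.0 detected in 3 of 4 replicates.
example_mse = qtn_mse([1.1, 0.9, 1.0], true_effect=1.0)
example_power = detection_power([True, False, True, True])
```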

2.4.2. The Arabidopsis Datasets

In this study, the Arabidopsis thaliana dataset was used to further validate the new method. It comprised 1307 inbred lines (available at https://github.com/Gregor-Mendel-Institute/atpolydb, accessed on 3 December 2025) with 214,051 SNPs and 11 phenotypic traits. Three flowering-related traits were selected for analysis: (1) FT10top: the top biomass of the plant when it begins to flower at 10 °C; (2) FT10: number of flowering days of plants under 10 °C growth conditions; and (3) FlowerInterval_of_OuluFall: the total number of days from seeding to flowering in the autumn experiment in Oulu, Finland. The data were filtered by quality control with minor allele frequency (MAF) ≥ 0.01, resulting in 213,304 high-quality SNPs for analysis. The genome-wide marker density distribution (A) and allele frequency spectrum (B) are presented in Figure 2. After quality control and missing-data elimination, the dataset contained 192, 625 and 61 accessions for FT10top, FT10, and FlowerInterval_of_OuluFall, respectively. Considering the normality assumption of phenotypic data, we performed GWAS using the Box–Cox-transformed phenotypes in this study.

3. Results

3.1. Experimental Results of Simulated Data

We compared the performance of ILSBL with established methods (LASSO, Elastic Net, SCAD, Adaptive LASSO and emRR) through Monte Carlo simulation experiments. The simulation dataset contains 2000 individuals with 10,000 genetic variants.

For the first simulation, a single causal QTN was fixed at the 98th marker with a heritability of 0.1. In terms of statistical power (Figure 3A), SCAD and ILSBL exhibited the highest detection power and showed robust performance across all scenarios. Although SCAD showed marginally higher power than ILSBL, its false-positive rate was at least approximately twice that of ILSBL (Figure 3B). The power of Adaptive LASSO was moderate, followed by LASSO, Elastic Net and emRR. The ILSBL method also showed better control of false positives (FPR ≈ 2 × 10⁻⁴), representing a substantial improvement over SCAD and performing comparably to Adaptive LASSO and Elastic Net. For estimation accuracy (Figure 3C), ILSBL achieved the lowest MSE among all methods, followed by Adaptive LASSO. These two methods formed the top-performing group and significantly outperformed Elastic Net, SCAD, LASSO and emRR. Computational efficiency comparisons (Figure 3D) revealed that the runtimes of emRR, LASSO, Elastic Net, ILSBL and Adaptive LASSO were of similar orders of magnitude. Among them, emRR was the most computationally efficient, but only because it considered just the top 50 SNPs for hypothesis testing; testing more than 6000 significant SNPs would otherwise take over 10 h. As a multi-stage algorithm, ILSBL balances speed and scalability. It completed analysis within 1 min for small samples and within 1.4 min for large datasets (n = 2000), making it 5.8 times faster than SCAD, which required 7.2 min for the same task.

For simulation experiment two, five causal QTNs were assigned heritabilities from 0.02 to 0.10; their genomic positions and corresponding information are listed in Tables S1–S3. The statistical power for QTN2, which has a heritability of 0.05, is presented in Figure 4A. The statistical power of all methods increased with sample size, and all methods performed well when the sample size was 2000. Polygenic background is another important factor affecting statistical power; under extreme background conditions, the estimated power of all methods decreased to less than 30%. Among all methods, ILSBL exhibited certain advantages in statistical power, most clearly at sample sizes of 500 and 1000, followed by Adaptive LASSO and then the remaining methods. A similar trend was observed for the other QTNs (Tables S1–S3). In terms of FPR, all methods showed the same pattern as in simulation 1. Unlike the other methods, which were significantly affected by background and other noise, ILSBL also performed well in false-positive control, indicating that the polygenic and residual noise-whitening stage of the method played an important role.

These results indicate that ILSBL balances multiple criteria, including accurate effect estimation, strict type I error control, and computational efficiency. The observed performance degradation under elevated noise conditions not only highlights the inherent challenges associated with high-noise genomic studies, but also identifies potential opportunities for the further refinement of methodological approaches in future research.

3.2. Experimental Results of Real Data Analysis

To demonstrate the effectiveness of ILSBL on real genomic data, we applied the method to the Arabidopsis thaliana dataset, which contains 214,051 SNPs and three flowering-related traits: FT10top, FT10, and FlowerInterval_of_OuluFall. We performed GWAS analyses using six methods on this dataset. The results of each method for FT10top are displayed as Manhattan plots in Figure 5, and the plots for the other two traits are shown in Figures S1 and S2. ILSBL detected more significant loci with smaller p-values than the other established methods, indicating higher statistical significance. For the trait FT10top, the phenotypic variance explained (PVE) by ILSBL is 14.81%, followed by Adaptive LASSO at 10.89%. The PVE values of the other four methods are all below 10%: LASSO, 6.07%; Elastic Net, 6.20%; SCAD, 5.37%; and emRR, <1%. ILSBL also achieved the highest PVE values for the other two traits (Figure S3).

All significant QTNs identified by each method were mapped to the Arabidopsis thaliana reference genome sequence, and gene annotation was performed using a ±10 kb sequence window upstream and downstream of each QTN. The annotated genes were then checked against the TAIR database (https://www.arabidopsis.org, accessed on 16 December 2025) to screen for previously reported genes associated with flowering traits, and the methods that detected each gene were recorded. The results are summarized in Table 1. Taking the FT10top trait as an example, ILSBL detected a total of 13 associated genes corresponding to 7 QTNs, which are mainly distributed on chromosomes one, three and five. For the other two flowering-related traits, FT10 and FlowerInterval_of_OuluFall, ILSBL detected 12 and 8 verified associated genes, respectively. Collectively, the three traits correspond to 33 verified genes. The total numbers of confirmed genes detected across all traits by the other five methods, LASSO, Elastic Net, SCAD, Adaptive LASSO and emRR, were 15, 6, 12, 9 and 6, respectively. ILSBL thus explains a substantially higher proportion of phenotypic variance and identifies more functionally validated flowering-related genes.

Several genes were identified by both ILSBL and the established algorithms (Table 1). For example, the AT1G78660 gene was identified by ILSBL as well as by Adaptive LASSO, LASSO and SCAD. Interestingly, ILSBL identified a gene cluster for the FT10top trait containing AT5G11260, AT5G11270, and AT5G11320, which were simultaneously detected by emRR. These genes are concentrated in the SNP region at 3,601,859 bp on chromosome 5, suggesting that this region may be a key site regulating the FT10top trait. Additionally, the AT1G16780, AT3G13790 and AT5G01675 genes were detected by ILSBL for both the FT10top and FT10 traits. In this study, multiple methods were used to analyze flowering traits in Arabidopsis thaliana, all confirming that flowering traits are controlled by multiple genes and demonstrating that these genes play a core role in the regulation of flowering traits.

4. Discussion

In this study, we have developed ILSBL, a novel algorithm under the MLM framework, to address the limitations of existing methods in handling large-scale genomic data with complex population structures and polygenic backgrounds. First, it treats marker effects as random and employs the FASTmrEMMA model transformation to whiten the covariance structure of polygenic and environmental noise. Second, it utilizes the biglasso package for single-marker screening to achieve significant dimensionality reduction. Finally, it applies SBL for effect estimation and employs Wald tests to evaluate the significance of potential QTNs. This multi-stage approach demonstrates robust performance across varying population structures and polygenic background scenarios and improves the accuracy of SNP effect estimation in GWAS. The computational efficiency of ILSBL is further supported by the biglasso package, which provides: (1) optimized computing space and parallel computation; (2) implementation of the hybrid screening rule to enhance the speed and quality of feature selection; and (3) extensible cross-validation to improve the selection of optimal effect values.

In the present study, we applied six methods to analyze three flowering-related traits using 4,945,006 SNPs in Arabidopsis thaliana. The results showed that ILSBL detected 141 significant QTNs, among which 32 corresponded to previously confirmed flowering-related genes across the three traits, which is more than those identified by the other established methods. Notably, ILSBL identified a gene cluster containing AT5G11260, AT5G11270, and AT5G11320 on chromosome 5, suggesting that this region may be a key locus regulating the FT10top trait. In addition, ILSBL simultaneously detected several core genes for both the FT10top and FT10 traits, highlighting their essential roles in flowering regulation. Meanwhile, the sample sizes for FT10top, FT10, and FlowerInterval_of_OuluFall were relatively small and varied considerably due to the specific experimental conditions and environments at the trial field. Further validation using larger and more diverse populations is therefore recommended in future studies.

While genome-wide Bayesian models provide a powerful framework for integrating prior information in association studies, MCMC-based Bayesian methods are often restricted by intensive computational burdens in large-scale genomic data. By combining LASSO-based dimensionality reduction with sparse Bayesian learning, ILSBL achieves a good balance between estimation accuracy and computational efficiency. However, it is necessary to acknowledge the limitations of ILSBL. First, the computational efficiency of ILSBL decreases slightly when dealing with extremely large SNP datasets, as the LASSO-based screening step still requires a certain amount of computational resources for processing massive genotypic data. Second, in this study, ILSBL only considers the main effects; its application to epistatic interactions and genotype–environment interactions remains to be further explored and improved in future research.

In summary, ILSBL provides an efficient approach for GWAS in large-scale genomic studies, especially for addressing challenges from high noise, population structure, and polygenic background. It exhibits superior false-positive control and computational efficiency, making it well suited for genetic analyses of complex traits. Therefore, the proposed ILSBL represents a valuable alternative tool for the genetic dissection of complex traits.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/math14071209/s1: Figure S1. Manhattan plot of six GWAS methods for trait FT10. Figure S2. Manhattan plot of six GWAS methods for trait FlowerInterval_of_OuluFall. Figure S3. The phenotypic variance explained (PVE) of the three flowering-related traits in Arabidopsis by six different GWAS methods, including the ILSBL, LASSO, Elastic Net, SCAD, Adaptive LASSO and emRR algorithms. Table S1. The comparison of the ILSBL, LASSO, Elastic Net, SCAD, Adaptive LASSO and emRR methods in simulation experiment 2 (multiple QTNs) under sample size 500. Table S2. The comparison of the ILSBL, LASSO, Elastic Net, SCAD, Adaptive LASSO and emRR methods in simulation experiment 2 (multiple QTNs) under sample size 1000. Table S3. The comparison of the ILSBL, LASSO, Elastic Net, SCAD, Adaptive LASSO and emRR methods in simulation experiment 2 (multiple QTNs) under sample size 2000.

Author Contributions

J.C. and J.Z. designed and supervised this study. J.W., J.C. and J.Z. wrote and revised the manuscript. J.W., J.L. and G.L. conducted all the experiments, analyzed the data and revised the manuscript. F.B., Y.W. and S.S. prepared all figures and tables. All authors participated in the review process. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Humanities and Social Sciences Fund of the Ministry of Education (21YJC790011), the Innovation and Entrepreneurship Program of Nanjing Agricultural University (grant numbers 202410307228Y and 202510307021), and the National Natural Science Foundation of China (32270694).

Data Availability Statement

The Arabidopsis data presented in this study are openly available at https://github.com/Gregor-Mendel-Institute/atpolydb (accessed on 3 December 2025).

Acknowledgments

The authors would like to thank the editor and reviewers for their suggestions for improving the framework and language within this manuscript. The authors sincerely thank Nanjing Agricultural University for providing the research platform and continuous support.

Conflicts of Interest

The authors have no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

GWAS	Genome-wide association studies
SNP	Single nucleotide polymorphism
QTN	Quantitative trait nucleotide
LASSO	Least absolute shrinkage and selection operator
SCAD	Smoothly clipped absolute deviation
MLM	Mixed linear model
EMMA	Efficient mixed-model association
GEMMA	Genome-wide efficient mixed-model association
DRR	Ridge regression algorithm
ILSBL	Improved LASSO screening and sparse Bayesian learning algorithm
SBL	Sparse Bayesian learning
emRR	Expectation maximization Bayesian ridge regression
LRT	Likelihood ratio test
MAF	Minor allele frequency
FPR	False-Positive Rate
MSE	Mean Squared Error
Chr	Chromosome
MCMC	Markov Chain Monte Carlo

References

Uffelmann, E.; Huang, Q.Q.; Munung, N.S.; de Vries, J.; Okada, Y.; Martin, A.R.; Martin, H.C.; Lappalainen, T.; Posthuma, D. Genome-wide association studies. Nat. Rev. Methods Primers 2021, 1, 59.
Visscher, P.M.; Wray, N.R.; Zhang, Q.; Sklar, P.; McCarthy, M.I.; Brown, M.A.; Yang, J. 10 Years of GWAS Discovery: Biology, Function, and Translation. Am. J. Hum. Genet. 2017, 101, 5–22.
Frayling, T.M.; Timpson, N.J.; Weedon, M.N.; Zeggini, E.; Freathy, R.M.; Lindgren, C.M.; Perry, J.R.B.; Elliott, K.S.; Lango, H.; Rayner, N.W.; et al. A common variant in the FTO gene is associated with body mass index and predisposes to childhood and adult obesity. Science 2007, 316, 889–894.
Wang, K.; Zhang, H.; Kugathasan, S.; Annese, V.; Bradfield, J.R.; Russell, R.K.; Sleiman, P.M.A.; Imielinski, M.; Glessner, J.; Hou, C.; et al. Diverse Genome-wide Association Studies Associate the IL12/IL23 Pathway with Crohn Disease. Am. J. Hum. Genet. 2009, 84, 399–405.
Ma, J.W.; Yang, J.; Zhou, L.S.; Ren, J.; Liu, X.X.; Zhang, H.; Yang, B.; Zhang, Z.Y.; Ma, H.B.; Xie, X.H.; et al. A Splice Mutation in the Gene Causes High Glycogen Content and Low Meat Quality in Pig Skeletal Muscle. PLoS Genet. 2014, 10, e1004710.
Fan, Q.C.; Wu, P.F.; Dai, G.J.; Zhang, G.X.; Zhang, T.; Xue, Q.; Shi, H.Q.; Wang, J.Y. Identification of 19 loci for reproductive traits in a local Chinese chicken by genome-wide study. Genet. Mol. Res. 2017, 16, 16019431.
Demars, J.; Fabre, S.; Sarry, J.; Rossetti, R.; Gilbert, H.; Persani, L.; Tosser-Klopp, G.; Mulsant, P.; Nowak, Z.; Drobik, W.; et al. Genome-Wide Association Studies Identify Two Novel Mutations Responsible for an Atypical Hyperprolificacy Phenotype in Sheep. PLoS Genet. 2013, 9, e1003482.
Lin, H.; Zhou, Z.; Zhao, J.; Zhou, T.; Bai, H.; Ke, Q.; Pu, F.; Zheng, W.; Xu, P. Genome-Wide Association Study Identifies Genomic Loci of Sex Determination and Gonadosomatic Index Traits in Large Yellow Croaker (Larimichthys crocea). Mar. Biotechnol. 2021, 23, 127–139.
Zhao, K.; Tung, C.W.; Eizenga, G.C.; Wright, M.H.; Ali, M.L.; Price, A.H.; Norton, G.J.; Islam, M.R.; Reynolds, A.; Mezey, J.; et al. Genome-wide association mapping reveals a rich genetic architecture of complex traits in Oryza sativa. Nat. Commun. 2011, 2, 467.
Huang, X.H.; Wei, X.H.; Sang, T.; Zhao, Q.A.; Feng, Q.; Zhao, Y.; Li, C.Y.; Zhu, C.R.; Lu, T.T.; Zhang, Z.W.; et al. Genome-wide association studies of 14 agronomic traits in rice landraces. Nat. Genet. 2010, 42, 961–976.
Li, H.; Peng, Z.Y.; Yang, X.H.; Wang, W.D.; Fu, J.J.; Wang, J.H.; Han, Y.J.; Chai, Y.C.; Guo, T.T.; Yang, N.; et al. Genome-wide association study dissects the genetic architecture of oil biosynthesis in maize kernels. Nat. Genet. 2013, 45, 43–50.
Chao, Z.F.; Chen, Y.Y.; Ji, C.; Wang, Y.L.; Huang, X.; Zhang, C.Y.; Yang, J.; Song, T.; Wu, J.C.; Guo, L.X.; et al. A genome-wide association study identifies a transporter for zinc uploading to maize kernels. EMBO Rep. 2023, 24, e55542.
Zhang, J.; Chen, M.; Wen, Y.J.; Zhang, Y.; Lu, Y.N.; Wang, S.M.; Chen, J.C. A Fast Multi-Locus Ridge Regression Algorithm for High-Dimensional Genome-Wide Association Studies. Front. Genet. 2021, 12, 649196.
Zhou, X.; Carbonetto, P.; Stephens, M. Polygenic Modeling with Bayesian Sparse Linear Mixed Models. PLoS Genet. 2013, 9, e1003264.
Zhang, J.; Feng, J.Y.; Ni, Y.L.; Wen, Y.J.; Niu, Y.; Tamba, C.; Yue, C.; Song, Q.; Zhang, Y.M. pLARmEB: Integration of least angle regression with empirical Bayes for multilocus genome-wide association studies. Heredity 2017, 118, 517–524.
Wen, Y.J.; Zhang, H.W.; Ni, Y.L.; Huang, B.; Zhang, J.; Feng, J.Y.; Wang, S.B.; Dunwell, J.M.; Zhang, Y.M.; Wu, R.L. Methodological implementation of mixed linear models in multi-locus genome-wide association studies. Brief. Bioinform. 2018, 19, 700–712.
Boutorh, A.; Guessoum, A. Complex diseases SNP selection and classification by hybrid Association Rule Mining and Artificial Neural Network-based Evolutionary Algorithms. Eng. Appl. Artif. Intell. 2016, 51, 58–70.
Yao, X.H.; Yan, J.W.; Risacher, S.; Moore, J.; Saykin, A.; Shen, L. Network-Based Genome Wide Study of Hippocampal Imaging Phenotype in Alzheimer’s Disease to Identify Functional Interaction Modules. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 6170–6174.
Tibshirani, R. Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. Ser. B-Stat. Methodol. 1996, 58, 267–288.
Zhang, J.; Yue, C.; Zhang, Y.M. Bias correction for estimated QTL effects using the penalized maximum likelihood method. Heredity 2012, 108, 396–402.
Zou, H.; Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B-Stat. Methodol. 2005, 67, 301–320.
Fan, J.Q.; Li, R.Z. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 2001, 96, 1348–1360.
Zou, H. The adaptive lasso and its oracle properties. J. Am. Stat. Assoc. 2006, 101, 1418–1429.
Zeng, Y.H.; Breheny, P. The biglasso Package: A Memory- and Computation-Efficient Solver for Lasso Model Fitting with Big Data in R. R J. 2020, 12, 6–19.
Yu, J.; Pressoir, G.; Briggs, W.H.; Vroh Bi, I.; Yamasaki, M.; Doebley, J.F.; McMullen, M.D.; Gaut, B.S.; Nielsen, D.M.; Holland, J.B.; et al. A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nat. Genet. 2006, 38, 203–208.
Zhang, Z.W.; Ersoz, E.; Lai, C.Q.; Todhunter, R.J.; Tiwari, H.K.; Gore, M.A.; Bradbury, P.J.; Yu, J.M.; Arnett, D.K.; Ordovas, J.M.; et al. Mixed linear model approach adapted for genome-wide association studies. Nat. Genet. 2010, 42, 355–360.
Lippert, C.; Listgarten, J.; Liu, Y.; Kadie, C.M.; Davidson, R.I.; Heckerman, D. FaST linear mixed models for genome-wide association studies. Nat. Methods 2011, 8, 833–835.
Liu, X.L.; Huang, M.; Fan, B.; Buckler, E.S.; Zhang, Z.W. Iterative Usage of Fixed and Random Effect Models for Powerful and Efficient Genome-Wide Association Studies. PLoS Genet. 2016, 12, e1005767.
Jiang, L.D.; Zheng, Z.L.; Qi, T.; Kemper, K.E.; Wray, N.R.; Visscher, P.M.; Yang, J. A resource-efficient tool for mixed model association analysis of large-scale data. Nat. Genet. 2019, 51, 1749–1755.
Wen, Y.J.; Zhang, Y.W.; Zhang, J.; Feng, J.Y.; Zhang, Y.M. The improved FASTmrEMMA and GCIM algorithms for genome-wide association and linkage studies in large mapping populations. Crop J. 2020, 8, 723–732.
Wang, M.Y.; Xu, S.Z. A coordinate descent approach for sparse Bayesian learning in high dimensional QTL mapping and genome-wide association studies. Bioinformatics 2019, 35, 4327–4335.
Kao, C.H.; Zeng, Z.B.; Teasdale, R.D. Multiple interval mapping for quantitative trait loci. Genetics 1999, 152, 1203–1216.
Lander, E.; Kruglyak, L. Genetic dissection of complex traits: Guidelines for interpreting and reporting linkage results. Nat. Genet. 1995, 11, 241–247.
Qin, H.; Guo, W.; Zhang, Y.M.; Zhang, T. QTL mapping of yield and fiber traits based on a four-way cross population in Gossypium hirsutum L. Theor. Appl. Genet. 2008, 117, 883–894.
da Silva, F.A.; Viana, A.P.; Correa, C.C.G.; Santos, E.A.; de Oliveira, J.A.V.S.; Andrade, J.D.G.; Ribeiro, R.M.; Glória, L.S. Bayesian ridge regression shows the best fit for SSR markers in Psidium guajava among Bayesian models. Sci. Rep. 2021, 11, 13639.

Figure 1. Flow chart of ILSBL.

Figure 2. (A) Marker density of Arabidopsis natural population with 1 Mb window size. (B) MAF of Arabidopsis natural population.

Figure 3. The statistical power (A), FPR (B), MSE (C), and time (D) for the ILSBL, LASSO, Elastic Net, SCAD, Adaptive LASSO and emRR algorithms under combinations of moderate (2 × polygenic variance), substantial (5 × polygenic variance) and extreme (10 × polygenic variance) polygenic-background conditions and sample sizes of 500, 1000 and 2000 in the first simulation experiment.

Figure 4. The statistical power of QTN 1 (A) and FPR (B) for the ILSBL, LASSO, Elastic Net, SCAD, Adaptive LASSO and emRR algorithms under combinations of moderate (2 × polygenic variance), substantial (5 × polygenic variance) and extreme (10 × polygenic variance) polygenic-background conditions and sample sizes of 500, 1000 and 2000 in the second simulation experiment.

Figure 5. Manhattan plot for the trait FT10top. The x-axis and y-axis represent the chromosomal positions and the −log₁₀(p-value) of the SNPs, respectively. The dashed horizontal line indicates the significance threshold. The bottom color bar illustrates the SNP density (number of SNPs per 1 Mb window).

Table 1. The genes identified for three flowering-time traits in Arabidopsis using the ILSBL, LASSO, Elastic Net, SCAD, Adaptive LASSO and emRR algorithms.

Trait	Gene	Chr	Position	Method	p-Value	Trait	Gene	Chr	Position	Method	p-Value
FT10top	AT1G16780	1	5753618	Elastic Net	3.16 × 10⁻¹¹	FT10top	AT5G11320	5	3608562	emRR	7.07 × 10⁻¹⁰
			5753618	ILSBL	1.67 × 10⁻¹⁸		AT5G49450	5	20060836	ILSBL	1.19 × 10⁻¹⁶
			5753618	LASSO	1.03 × 10⁻¹⁴		AT5G52300	5	3608562	emRR	2.09 × 10⁻⁷
			5738378	LASSO	4.88 × 10⁻⁸	FT10	AT1G16780	1	5753618	ILSBL	1.67 × 10⁻¹⁸
	AT1G78660	1	29590765	Adaptive LASSO	1.00 × 10⁻⁹		AT1G65480	1	24338260	Adaptive LASSO	2.94 × 10⁻¹¹
			29590765	ILSBL	1.17 × 10⁻²⁰		AT1G78700	1	29590765	ILSBL	1.17 × 10⁻²⁰
			29590765	LASSO	2.69 × 10⁻¹¹		AT2G19900	2	8589275	Elastic Net	3.80 × 10⁻¹⁰
			29590765	SCAD	1.66 × 10⁻¹⁰		AT2G22540	2	9581605	Adaptive LASSO	4.72 × 10⁻⁸
	AT1G78700	1	29590765	Adaptive LASSO	1.00 × 10⁻⁹				9581605	SCAD	2.76 × 10⁻²⁵
			29590765	ILSBL	1.17 × 10⁻²⁰		AT2G25110	2	10695954	Elastic Net	2.98 × 10⁻¹³
			29590765	LASSO	2.69 × 10⁻¹¹		AT2G47310	2	19435811	Elastic Net	1.71 × 10⁻⁷
			29590765	SCAD	1.66 × 10⁻¹⁰		AT3G13790	3	4542384	ILSBL	4.45 × 10⁻¹³
	AT3G13790	3	4542384	ILSBL	4.45 × 10⁻¹³		AT3G13960	3	4603733	ILSBL	2.04 × 10⁻⁸
	AT3G13960	3	4603733	Adaptive LASSO	3.24 × 10⁻¹⁵		AT3G16000	3	5428998	LASSO	1.37 × 10⁻⁸
			4603733	ILSBL	2.04 × 10⁻⁸		AT3G48680	3	18030345	Elastic Net	4.13 × 10⁻¹¹
			4603733	LASSO	6.14 × 10⁻⁸		AT5G01675	5	3177111	ILSBL	6.03 × 10⁻¹⁵
			4603733	SCAD	1.05 × 10⁻⁸				3163523	SCAD	4.57 × 10⁻¹³
	AT3G16360	3	5553267	emRR	3.68 × 10⁻¹⁰		AT5G10120	5	3177111	ILSBL	6.03 × 10⁻¹⁵
	AT5G01675	5	3177111	Adaptive LASSO	7.80 × 10⁻⁸				3163523	SCAD	4.57 × 10⁻¹³
			3177111	ILSBL	6.03 × 10⁻¹⁵		AT5G10140	5	3177111	ILSBL	6.03 × 10⁻¹⁵
			3177111	SCAD	1.48 × 10⁻⁷				3163523	SCAD	4.57 × 10⁻¹³
	AT5G10120	5	3177111	Adaptive LASSO	7.80 × 10⁻⁸		AT5G10150	5	3177111	ILSBL	6.03 × 10⁻¹⁵
			3177111	ILSBL	6.03 × 10⁻¹⁵		AT5G11260	5	3601859	ILSBL	3.87 × 10⁻⁵⁰
			3177111	SCAD	1.48 × 10⁻⁷		AT5G11310	5	3601859	ILSBL	3.87 × 10⁻⁵⁰
	AT5G10140	5	3177111	Adaptive LASSO	7.80 × 10⁻⁸		AT5G11320	5	3601859	ILSBL	3.87 × 10⁻⁵⁰
			3177111	ILSBL	6.03 × 10⁻¹⁵		AT5G13690	5	4406143	SCAD	1.24 × 10⁻¹¹
			3177111	SCAD	1.48 × 10⁻⁷		AT5G28640	5	10638482	Elastic Net	2.48 × 10⁻¹⁰
	AT5G10150	5	3177111	Adaptive LASSO	7.80 × 10⁻⁸		AT5G49450	5	20060836	ILSBL	1.19 × 10⁻¹⁶
			3177111	ILSBL	6.03 × 10⁻¹⁵	FlowerInterval_of_OuluFall	AT1G65240	1	24228757	ILSBL	7.45 × 10⁻⁹
			3177111	SCAD	1.48 × 10⁻⁷		AT1G65250	1	24228757	ILSBL	7.45 × 10⁻⁹
	AT5G11260	5	3591469	emRR	8.87 × 10⁻⁸		AT1G65260	1	24228757	ILSBL	7.45 × 10⁻⁹
			3601859	ILSBL	3.87 × 10⁻⁵⁰		AT2G28450	2	12174250	ILSBL	7.02 × 10⁻¹²
	AT5G11270	5	3591469	emRR	8.87 × 10⁻⁸		AT2G28470	2	12174250	ILSBL	7.02 × 10⁻¹²
			3601859	ILSBL	3.87 × 10⁻⁵⁰		AT5G42870	5	17195153	ILSBL	1.73 × 10⁻⁸
	AT5G11310	5	3608562	emRR	7.07 × 10⁻¹⁰		AT5G42890	5	17195153	ILSBL	1.73 × 10⁻⁸
	AT5G11320	5	3601859	ILSBL	3.87 × 10⁻⁵⁰		AT5G42900	5	17195153	ILSBL	1.73 × 10⁻⁸

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
