Variational Bayesian Variable Selection in Logistic Regression Based on Spike-and-Slab Lasso

Zhang, Juanjuan; Wang, Weixian; Yang, Mingming; Tian, Maozai

doi:10.3390/math13132205

Open AccessArticle

Variational Bayesian Variable Selection in Logistic Regression Based on Spike-and-Slab Lasso

¹

School of Digital Economy and Trade, Guangzhou Huashang College, Guangzhou 511300, China

²

School of Mathematics and Statistics, Guangxi Normal University, Guilin 541006, China

³

School of Tourism, Xinjiang University of Finance and Economics, Urumqi 830012, China

⁴

School of Statistics and Data Science, Xinjiang University of Finance and Economics, Urumqi 830012, China

⁵

Center for Applied Statistics, School of Statistics, Renmin University of China, Beijing 100872, China

^*

Author to whom correspondence should be addressed.

^†

These authors contributed equally to this work.

Mathematics 2025, 13(13), 2205; https://doi.org/10.3390/math13132205

Submission received: 23 April 2025 / Revised: 4 July 2025 / Accepted: 5 July 2025 / Published: 6 July 2025

(This article belongs to the Section D: Statistics and Operational Research)

Download

Browse Figures

Versions Notes

Abstract

Logistic regression is often used to solve classification problems. This article combines the advantages of Bayesian methods and spike-and-slab Lasso to select variables in high-dimensional logistic regression. The method of introducing a new hidden variable or approximating the lower bound is used to solve the problem of logistic functions without conjugate priors. The Laplace distribution in spike-and-slab Lasso is expressed as a hierarchical form of normal distribution and exponential distribution, so that all parameters in the model are posterior distributions that are easy to deal with. Considering the high time cost of parameter estimation and variable selection in high-dimensional models, we use the variational Bayesian algorithm to perform posterior inference on the parameters in the model. From the simulation results, it can be seen that it is an adaptive prior that can perform parameter estimation and variable selection well in high-dimensional logistic regression. From the perspective of algorithm running time, the method proposed in this article also has high computational efficiency in many cases.

Keywords:

logistic regression; spike-and-slab lasso; variable selection

MSC:

62-08

1. Introduction

Logistic regression is widely used for classification problems and remains a popular model among statisticians and machine learning researchers [1]. Let

X \in R^{p + 1}

denote the

(p + 1)

-dimensional covariate vector, where the first element of X is set to 1, and let

y \in {0, 1}

be the binary response. Suppose we observe n pairs of data

(X_{i}, y_{i})

for

i = 1, \dots, n

, the standard form of the logistic regression model is:

P (y_{i} = 1 ∣ X_{i}) = 1 - P (y_{i} = 0 ∣ X_{i}) = σ (X_{i}^{T} β) = \frac{e^{X_{i}^{T} β}}{1 + e^{X_{i}^{T} β}},

(1)

where

β = (β_{0}, β_{1}, \dots, β_{p}) \in R^{p + 1}

denotes the regression coefficients,

β_{0}

is the intercept term, and

σ (t) = e^{t} / (1 + e^{t})

is the logistic function.

With the advent of the “big data era,” high-dimensional sparsity has become a common assumption in regression modeling. In settings where

p > n

or even

p ≫ n

, variable selection is essential for identifying simpler models in which most regression coefficients are zero (or near zero). This not only enhances model interpretability but also mitigates overfitting and multicollinearity.

Variable selection approaches can be broadly grouped into three categories: (1) Traditional methods based on significance testing or information criteria (e.g., AIC, BIC); (2) Penalization-based approaches that perform selection via regularization, such as the LASSO (Least Absolute Shrinkage and Selection Operator) [2], Smoothly Clipped Absolute Deviation (SCAD) [3], and the Minimax Concave Penalty (MCP) [4]; (3) Bayesian methods that induce sparsity through shrinkage priors, including continuous shrinkage priors and two-component mixture priors such as the spike-and-slab prior.

Among these, significance-based and information-theoretic approaches are increasingly seen as unreliable in high-dimensional settings or in the presence of multicollinearity. Penalization-based methods like the LASSO have gained widespread popularity, and many such methods can be reinterpreted within a Bayesian framework by specifying appropriate priors. In fact, Bayesian variable selection has been shown—both theoretically and empirically—to perform on par with or even outperform frequentist approaches [5,6,7]. Consequently, this work adopts the Bayesian paradigm for variable selection.

While Bayesian shrinkage priors do not perform variable selection in a strict sense, practitioners often apply thresholding rules on posterior estimates to identify important variables—that is, variables are retained if their posterior mean or inclusion probability exceeds a certain threshold. Although widely used, the theoretical properties of such thresholding procedures remain underexplored.

Variable selection can also be viewed as a special case of model selection. A popular Bayesian prior for this purpose is the spike-and-slab prior, expressed as:

β_{j} ∣ γ_{j} \sim γ_{j} φ_{1} (β_{j}) + (1 - γ_{j}) φ_{0} (β_{j}), j = 1, 2, \dots, p,

(2)

where

φ_{1} (β_{j})

is typically a diffuse distribution (“slab”) modeling large coefficients, and

φ_{0} (β_{j})

is a concentrated distribution (“spike”) capturing small or null coefficients. The binary indicator

γ_{j}

determines whether the j-th variable is included (

γ_{j} = 1

) or excluded (

γ_{j} = 0

). The spike-and-slab prior therefore provides an explicit mechanism for variable selection, with

γ_{j}

serving as a measure of variable importance.

A well-known formulation is the Stochastic Search Variable Selection (SSVS) prior by George and McCulloch [8], where both

φ_{1}

and

φ_{0}

are Gaussian distributions with differing variances. However, SSVS is often sensitive to the choice of hyperparameters, and inappropriate tuning may result in the under- or over-selection of variables.

Another widely adopted formulation sets

φ_{0} (β_{j})

to be a point mass at zero, known as the point-mass spike-and-slab prior:

β_{j} ∣ γ_{j} \sim γ_{j} φ_{1} (β_{j}) + (1 - γ_{j}) δ_{0}, j = 1, 2, \dots, p,

(3)

where

δ_{0}

denotes the Dirac delta function at zero. This prior is regarded as the “gold standard” in sparse Bayesian inference [9]. Mitchell and Beauchamp [10] specify a Gaussian slab, while Castillo and van der Vaart [11] advocate for heavy-tailed distributions to ensure good recovery. Ročková and George [12] proposed the spike-and-slab Lasso (SSL), wherein both components are Laplace distributions. For a comprehensive review of SSL, see Bai et al. [13].

Despite their flexibility, spike-and-slab priors can be computationally intensive. Even for moderate p, traditional MCMC methods become infeasible. Variational Bayes (VB) offers a scalable alternative, especially in high-dimensional settings (

p > n

or

p ≫ n

). In linear regression, Carbonetto and Stephens [14] used variational inference with SSVS for genomic applications; Huang et al. [15] developed a variational framework using the point-mass spike-and-slab prior and established its asymptotic properties; Ray et al. [16] used a Laplace slab in this context, noting its superior performance due to heavier tails.

In logistic regression, the lack of conjugacy complicates Bayesian computation. To address this, MacKay [17] used Gaussian approximations; Jaakkola and Jordan [18] proposed lower bounds for the logistic function; Wang and Blei [19] employed Laplace and delta methods; and Polson et al. [20] introduced Polya-Gamma augmentation, which restores conjugacy. More recently, Zhang et al. [21] incorporated binary indicators

γ_{j}

under a lower-bound framework for logistic variable selection, while Ray et al. [22] applied a Laplace slab in variational inference. Tang et al. [23] first introduced SSL into generalized linear models via an EM algorithm.

In this paper, we incorporate SSL priors into Bayesian variable selection for logistic regression. To address the computational burden in high-dimensional settings, we adopt a variational Bayes framework. Due to the non-conjugacy of the logistic likelihood, we use either lower-bound approximations or Polya-Gamma augmentation to facilitate tractable updates. Our key contribution lies not in the novelty of the SSL+VB combination itself, but in designing an efficient and scalable implementation tailored for high-dimensional logistic regression. In particular, we propose a refined reparameterization of the SSL prior that allows for closed-form updates in VB optimization, significantly improving both convergence speed and numerical stability.

The remainder of the paper is organized as follows. Section 2 introduces the Bayesian logistic regression model with both Pólya-Gamma and lower-bound approximations, and describes the SSL prior in detail. Section 3 presents the hierarchical model and the variational Bayes algorithm. Section 4 provides a theoretical analysis of the variational posterior. Section 5 reports the simulation results and real data applications. Section 6 concludes the paper.

2. Bayesian Logistic Regression

2.1. Bayesian Logistic Regression Based on Pólya-Gamma Latent Variables

Logistic regression lacks a conjugate prior, making Bayesian inference challenging. To address this issue, Polson et al. [20] proposed a novel data augmentation strategy using the Pólya-Gamma distribution. If

ω \sim P G (b, c)

, where

P G (b, c)

denotes the Pólya-Gamma distribution with parameters

(b, c)

, its expectation is given by:

E (ω) = \frac{b}{2 c} tanh (\frac{c}{2}) = \frac{b}{2 c} (\frac{e^{c} - 1}{e^{c} + 1}) .

Its density has the following property:

p (ω ∣ b, c) \propto exp (- \frac{c^{2}}{2} ω) p (ω ∣ b, 0),

where

p (ω)

denotes the density of

P G (b, 0)

. Furthermore, for

b > 0

and

κ = a - b / 2

, the following identity holds:

\frac{e^{a ψ}}{{(1 + e^{ψ})}^{b}} = 2^{- b} e^{κ ψ} \int_{0}^{\infty} exp (- ω ψ^{2} / 2) p (ω) d ω .

Applying this identity to logistic regression, the likelihood becomes:

L (β, ω) = \prod_{i = 1}^{n} \frac{{(e^{X_{i}^{T} β})}^{y_{i}}}{1 + e^{X_{i}^{T} β}} \propto \prod_{i = 1}^{n} exp \{(y_{i} - \frac{1}{2}) X_{i}^{T} β - \frac{ω_{i}}{2} {(X_{i}^{T} β)}^{2}\} p (ω_{i} ∣ 1, 0),

where

ω = (ω_{1}, \dots, ω_{n})

and

ω_{i} \sim P G (1, 0)

.

Assuming a prior

p (β)

, the posterior of

β

is:

q (β ∣ ω, y) \propto p (β) \prod_{i = 1}^{n} exp \{κ_{i} X_{i}^{T} β - \frac{ω_{i}}{2} {(X_{i}^{T} β)}^{2}\},

where

κ_{i} = y_{i} - 1 / 2

. If

p (β)

is a Gaussian prior, then the posterior is conjugate with the prior.

The posterior density of the latent variable

ω_{i}

is:

q (ω_{i} ∣ β) \propto exp (- \frac{ω_{i}}{2} {(X_{i}^{T} β)}^{2}) p (ω_{i} ∣ 1, 0),

which implies

ω_{i} \sim P G (1, X_{i}^{T} β)

, and

E (ω_{i} ∣ β) = \frac{1}{2 X_{i}^{T} β} tanh (\frac{X_{i}^{T} β}{2}) .

2.2. Bayesian Logistic Regression Based on Lower-Bound Approximation

The log-likelihood of logistic regression is:

l (β) = log L (β) = \sum_{i = 1}^{n} y_{i} X_{i}^{T} β - g (X_{i}^{T} β),

where

g (t) = log (1 + e^{t}), t \in R

.

We can rewrite the log-sigmoid function as:

log σ (t) = log (\frac{1}{1 + e^{- t}}) = \frac{t}{2} - log (e^{t / 2} + e^{- t / 2}) .

Jaakkola and Jordan [18] approximated the second term via a first-order Taylor expansion:

\begin{matrix} log σ (t) & \geq \frac{t}{2} - log (e^{η / 2} + e^{- η / 2}) - \frac{1}{4 η} tanh (\frac{η}{2}) (t^{2} - η^{2}) \\ = \frac{t - η}{2} + log σ (η) - \frac{1}{4 η} tanh (\frac{η}{2}) (t^{2} - η^{2}) . \end{matrix}

By substituting

- g (X_{i}^{T} β) = log σ (- X_{i}^{T} β)

into

l (β)

, we obtain a lower bound:

l (β) \geq \sum_{i = 1}^{n} log σ (η_{i}) - \frac{η_{i}}{2} + (y_{i} - \frac{1}{2}) X_{i}^{T} β - \frac{1}{4 η_{i}} tanh (\frac{η_{i}}{2}) ({(X_{i}^{T} β)}^{2} - η_{i}^{2}) = : f (β, η) .

Thus, the posterior becomes:

q (β) = p (β) L (β) \approx p (β) exp (f (β, η)) \propto p (β) \prod_{i = 1}^{n} exp \{(y_{i} - \frac{1}{2}) X_{i}^{T} β - \frac{1}{4 η_{i}} tanh (\frac{η_{i}}{2}) {(X_{i}^{T} β)}^{2}\} .

We aim to maximize the lower bound

f (β, η)

. For a given

β

, define:

f_{a} (x) = log σ (x) - \frac{x}{2} - \frac{1}{4 x} tanh (x / 2) (a^{2} - x^{2}),

where

f_{a} : R \to R, a \geq 0

. Ray et al. [22] show that

f_{a} (x)

is symmetric at about

x = 0

and attains its maximum at

x = \pm a

. Therefore, the lower bound

f (β, η)

is maximized when

η_{i} = X_{i}^{T} β

.

Comparing

q (β)

under both methods (Pólya-Gamma and lower-bound approximation), we find that the posterior form of

β

is essentially the same in both frameworks.

2.3. Spike-and-Slab Lasso Prior

The spike-and-slab Lasso (SSL) prior [12] is defined as:

\begin{matrix} β_{j} ∣ γ_{j} & = γ_{j} φ_{1} (β_{j}) + (1 - γ_{j}) φ_{0} (β_{j}), j = 1, 2, \dots, p, \\ φ_{1} (β_{j}) & = \frac{λ_{1}}{2} exp (- λ_{1} | β_{j} |), φ_{0} (β_{j}) = \frac{λ_{0}}{2} exp (- λ_{0} | β_{j} |), \end{matrix}

where

λ_{1}

is typically small to allow for large coefficients, while

λ_{0}

is large to encourage shrinkage.

Since the Laplace distribution is not conjugate, it is typically expressed as a hierarchical representation involving a Gaussian distribution and an exponential distribution:

\begin{matrix} β_{j} ∣ τ_{1 j}^{2}, γ_{j} = 1 & \sim N (0, τ_{1 j}^{2}), τ_{1 j}^{2} ∣ λ_{1}^{2} \sim Exp (λ_{1}^{2} / 2), \\ β_{j} ∣ τ_{0 j}^{2}, γ_{j} = 0 & \sim N (0, τ_{0 j}^{2}), τ_{0 j}^{2} ∣ λ_{0}^{2} \sim Exp (λ_{0}^{2} / 2) . \end{matrix}

When

λ_{1} = λ_{0} = λ

, the SSL prior reduces to the Lasso prior. As

λ_{0} \to \infty

,

φ_{0} (β_{j}) \to 0

, and the SSL prior approaches the “gold standard” spike-and-slab formulation. Thus, SSL effectively integrates penalized likelihood (via Lasso) with spike-and-slab Bayesian variable selection. Moreover, the SSL prior is adaptive; its adaptivity is discussed in detail by Ročková and George [12].

3. Variational Bayesian Variable Selection

In the previous section, we clarified the likelihood function and prior specification of

β

in logistic regression. To avoid manual tuning of hyperparameters, we specify prior distributions on them. The Bayesian logistic regression hierarchical model considered in this paper is

\begin{matrix} p (y_{i} ∣ β, ω_{i}) \propto exp \{(y_{i} - \frac{1}{2}) (X_{i}^{T} β) - \frac{ω_{i}}{2} {(X_{i}^{T} β)}^{2}\}, i = 1, 2, \dots, n, \\ p (ω_{i}) \sim P G (1, 0), β_{0} | σ^{2} \sim N (0, σ^{2}), \\ β_{j} |τ_{1 j}^{2}, γ_{j} = 1 \sim N (0, τ_{1 j}^{2}), τ_{1 j}^{2}| λ_{1}^{2} \sim Exp (λ_{1}^{2} / 2), j = 1, 2, \dots, p, \\ β_{j} |τ_{0 j}^{2}, γ_{j} = 0 \sim N (0, τ_{0 j}^{2}), τ_{0 j}^{2}| λ_{0}^{2} \sim Exp (λ_{0}^{2} / 2), \\ γ_{j} ∣ π \sim Bernoulli (π), π \sim Beta (a_{0}, b_{0}), \\ λ_{1}^{2} \sim Γ (c_{1}, d_{1}), λ_{0}^{2} \sim Γ (c_{0}, d_{0}), \end{matrix}

where

Bernoulli (\cdot)

denotes the Bernoulli distribution,

Beta (\cdot, \cdot)

the beta distribution, and

Γ (\cdot, \cdot)

the gamma distribution.

This paper adopts mean-field variational Bayes (VB). Based on the results in Blei et al. [24], the posterior distribution of each parameter can be derived. Let

β_{p} = (β_{1}, β_{2}, \dots, β_{p})

and

β = (β_{0}, β_{p})

. The notation

E_{- z}

represents the expectation with respect to all variables except z. Then, the variational posterior for

β_{0}

is

q (β_{0}) \propto exp \{E_{- β_{0}} [\sum_{i = 1}^{n} (y_{i} - \frac{1}{2}) β_{0} - \frac{ω_{i}}{2} β_{0}^{2} - \frac{β_{0}^{2}}{2 σ^{2}}]\} .

It follows that

q (β_{0}) \sim N (μ_{0}, σ_{0}^{2})

, where

σ_{0}^{2} = \frac{1}{\sum_{i = 1}^{n} E [ω_{i}] + \frac{1}{σ^{2}}}, μ_{0} = σ_{0}^{2} \sum_{i = 1}^{n} (y_{i} - \frac{1}{2}) .

The variational posterior for

β_{p}

is

\begin{matrix} q (β_{p}) \propto exp \{E_{- β_{p}} [\sum_{i = 1}^{n} (y_{i} - \frac{1}{2}) X_{i}^{T} β - \frac{ω_{i}}{2} {(X_{i}^{T} β)}^{2} - \frac{1}{2} β_{p}^{T} D_{τ} β_{p}]\}, \\ D_{τ} = diag (\frac{1}{τ_{1}^{2}}, \frac{1}{τ_{2}^{2}}, \dots, \frac{1}{τ_{p}^{2}}), τ_{j}^{2} = γ_{j} τ_{1 j}^{2} + (1 - γ_{j}) τ_{0 j}^{2}, \end{matrix}

which gives

q (β_{p}) \sim N (μ, Σ)

, where

\begin{matrix} Σ = {(X^{T} W X + D)}^{- 1}, W = E [diag (ω)], D = E [D_{τ}], \\ μ = Σ X^{T} (y - \frac{1}{2} 1_{n}), y = {(y_{1}, \dots, y_{n})}^{T}, 1_{n} = {(1, 1, \dots, 1)}^{T} . \end{matrix}

When

p > n

, we apply the Woodbury identity to avoid direct inversion of large matrices and improve computational efficiency. Let

U = X \sqrt{W}

,

V = \sqrt{W} X^{T}

, and

I_{n \times n}

be the n-dimensional identity matrix, then

Σ = D^{- 1} - D^{- 1} U {(I_{n \times n} + V D^{- 1} U)}^{- 1} V D^{- 1} .

The variational posterior of

ω_{i}

is

q (ω_{i}) \propto exp \{E_{- ω_{i}} [- \frac{ω_{i}}{2} {(X_{i}^{T} β)}^{2}]\} p (ω_{i} ∣ 1, 0),

which yields

q (ω_{i}) \sim P G (1, {\tilde{c}}_{i})

, where

{\tilde{c}}_{i} = \sqrt{E [{(X_{i}^{T} β)}^{2}]} = \sqrt{{(X_{i}^{T} μ)}^{2} + X_{i}^{T} Σ X_{i}} .

The variational posterior for

τ_{1 j}^{2}

is

q (τ_{1 j}^{2}) \propto exp \{E_{- τ_{1 j}^{2}} [γ_{j} (- \frac{1}{2} log τ_{1 j}^{2} - \frac{β_{j}^{2}}{2 τ_{1 j}^{2}} - \frac{λ_{1}^{2}}{2} τ_{1 j}^{2})]\} .

Recall that if

X \sim G I G (κ, a, b)

with density

GIG (x ∣ κ, a, b) \propto x^{κ - 1} exp \{- \frac{1}{2} (\frac{a}{x} + b x)\}, x > 0,

then the expectations are given by

E [X] = \frac{\sqrt{a} K_{κ + 1} (\sqrt{a b})}{\sqrt{b} K_{κ} (\sqrt{a b})},

where

K_{κ} (\cdot)

denotes the modified Bessel function of the second kind. Hence,

q (τ_{1 j}^{2}) \sim G I G (\frac{1}{2}, a_{1 j}, b_{1 j}),

with

a_{1 j} = E [γ_{j}] E [β_{j}^{2}], b_{1 j} = E [γ_{j}] E [λ_{1}^{2}] .

Similarly, for

τ_{0 j}^{2}

,

q (τ_{0 j}^{2}) \propto exp \{E_{- τ_{0 j}^{2}} [(1 - γ_{j}) (- \frac{1}{2} log τ_{0 j}^{2} - \frac{β_{j}^{2}}{2 τ_{0 j}^{2}} - \frac{λ_{0}^{2}}{2} τ_{0 j}^{2})]\},

so

q (τ_{0 j}^{2}) \sim G I G (\frac{1}{2}, a_{0 j}, b_{0 j}),

where

a_{0 j} = E [1 - γ_{j}] E [β_{j}^{2}], b_{0 j} = E [1 - γ_{j}] E [λ_{0}^{2}] .

The variational posterior for

γ_{j}

satisfies

\begin{matrix} q (γ_{j}) \propto & exp \{E_{- γ_{j}} [γ_{j} (\frac{1}{2} log \frac{τ_{0 j}^{2}}{τ_{1 j}^{2}} + \frac{β_{j}^{2}}{2} (\frac{1}{τ_{0 j}^{2}} - \frac{1}{τ_{1 j}^{2}}) + log \frac{λ_{1}^{2}}{λ_{0}^{2}} + \frac{λ_{0}^{2}}{2} τ_{0 j}^{2} - \frac{λ_{1}^{2}}{2} τ_{1 j}^{2} + log \frac{π}{1 - π})]\}, \end{matrix}

leading to

q (γ_{j}) \sim Bernoulli (ϕ_{j}),

ϕ_{j} = σ (E [\frac{1}{2} log \frac{τ_{0 j}^{2}}{τ_{1 j}^{2}} + \frac{β_{j}^{2}}{2} (\frac{1}{τ_{0 j}^{2}} - \frac{1}{τ_{1 j}^{2}}) + log \frac{λ_{1}^{2}}{λ_{0}^{2}} + \frac{λ_{0}^{2}}{2} τ_{0 j}^{2} - \frac{λ_{1}^{2}}{2} τ_{1 j}^{2} + log \frac{π}{1 - π}]),

where

σ (\cdot)

is the sigmoid function.

The variational posterior for

π

is

q (π) \propto exp \{E_{- π} [(a_{0} - 1 + \sum_{j = 1}^{p} γ_{j}) log π + (b_{0} - 1 + p - \sum_{j = 1}^{p} γ_{j}) log (1 - π)]\},

which gives

q (π) \sim Beta (a, b),

where

a = a_{0} + \sum_{j = 1}^{p} E [γ_{j}], b = b_{0} + p - \sum_{j = 1}^{p} E [γ_{j}],

and

E [π] = \frac{a}{a + b}, E [log π] = ψ (a) - ψ (a + b), E [log \frac{π}{1 - π}] = ψ (a) - ψ (b),

with

ψ (\cdot)

the digamma function.

The variational posterior for

λ_{1}^{2}

is

q (λ_{1}^{2}) \propto exp \{E_{- λ_{1}^{2}} [(c_{1} - 1 + \sum_{j = 1}^{p} γ_{j}) log λ_{1}^{2} - (d_{1} + \sum_{j = 1}^{p} \frac{γ_{j} τ_{1 j}^{2}}{2} λ_{1}^{2})]\},

Thus

q (λ_{1}^{2}) \sim Γ ({\tilde{c}}_{1}, {\tilde{d}}_{1}),

where

{\tilde{c}}_{1} = c_{1} + \sum_{j = 1}^{p} E [γ_{j}], {\tilde{d}}_{1} = d_{1} + \sum_{j = 1}^{p} \frac{E [γ_{j}] E [τ_{1 j}^{2}]}{2},

and

E [λ_{1}^{2}] = \frac{{\tilde{c}}_{1}}{{\tilde{d}}_{1}}, E [log λ_{1}^{2}] = ψ ({\tilde{c}}_{1}) - log {\tilde{d}}_{1} .

The variational posterior for

λ_{0}^{2}

is

q (λ_{0}^{2}) \propto exp \{E_{- λ_{0}^{2}} [(c_{0} - 1 + p - \sum_{j = 1}^{p} γ_{j}) log λ_{0}^{2} - (d_{0} + \sum_{j = 1}^{p} \frac{(1 - γ_{j}) τ_{0 j}^{2}}{2} λ_{0}^{2})]\},

which implies

q (λ_{0}^{2}) \sim Γ ({\tilde{c}}_{0}, {\tilde{d}}_{0}),

with

{\tilde{c}}_{0} = c_{0} + p - \sum_{j = 1}^{p} E [γ_{j}], {\tilde{d}}_{0} = d_{0} + \sum_{j = 1}^{p} \frac{E [1 - γ_{j}] E [τ_{0 j}^{2}]}{2},

and

E [λ_{0}^{2}] = \frac{{\tilde{c}}_{0}}{{\tilde{d}}_{0}}, E [log λ_{0}^{2}] = ψ ({\tilde{c}}_{0}) - log {\tilde{d}}_{0} .

In summary, the algorithm proceeds as follows:

Input data: $(y, X)$ ;
Initialize the variational parameters;
Set iteration counter $t = 1$ and maximum iteration T;
While the convergence criterion is not met and $t < T$ :
(a)
Update $μ_{0}, σ_{0}^{2}$ and $μ, Σ$ via $q (β_{0})$ and $q (β_{p})$ ;
(b)
Update ${\tilde{c}}_{i}$ for $q (ω_{i})$ ;
(c)
Update $(a_{1 j}, b_{1 j})$ and $(a_{0 j}, b_{0 j})$ for $q (τ_{1 j}^{2})$ and $q (τ_{0 j}^{2})$ ;
(d)
Update $ϕ_{j}$ from $q (γ_{j})$ ;
(e)
Update $(a, b)$ for $q (π)$ ;
(f)
Update $({\tilde{c}}_{1}, {\tilde{d}}_{1})$ and $({\tilde{c}}_{0}, {\tilde{d}}_{0})$ for $q (λ_{1}^{2})$ and $q (λ_{0}^{2})$ ;
(g)
Compute entropy of $ϕ$ :

$E n t = ϕ log ϕ + (1 - ϕ) log (1 - ϕ);$

(h)
Check convergence: if the change of entropy between successive iterations is below a preset threshold (e.g., $10^{- 3}$ ), then stop.
Output $μ$ , $Σ$ , and $ϕ_{j}$ .

This algorithm uses coordinate ascent variational inference (CAVI) to update each variational distribution sequentially until convergence. Although the evidence lower bound (ELBO) is usually employed as the convergence criterion, here we monitor the entropy of

ϕ

for practical convergence checks. Through this approach, we can efficiently estimate parameters and perform variable selection in Bayesian logistic regression.

4. Theoretical Results

Let

β_{*}

denote the true regression coefficients of model (1). The number of non-zero elements in

β_{*}

is defined as

s \overset{def}{=} {∥β_{*}∥}_{0}

, and the index set of non-zero elements in

β_{*}

is denoted by

ξ_{*} = \{j : β_{* j} \neq 0\}

. Let E represent the maximum absolute value of non-zero elements:

{max}_{j \in ξ_{*}} |β_{* j}| < E

. The prior distribution of

β

is denoted by

Π_{α} (β)

, where

α

is the hyperparameter of the prior. The prior can be expressed as a product of component-wise terms:

Π_{α} (β) \overset{def}{=} \prod_{j = 1}^{p} π_{α} (β_{j})

.

Based on Theorem 2.1 of Wei [25], for a sequence

{\bar{ϵ}}_{n} ↓ 0

with

n {\bar{ϵ}}_{n}^{2} \to \infty

, the prior of

β

satisfies:

1 - \int_{- a_{n}}^{a_{n}} π_{α} (β) d β \leq p^{- (1 + μ)}

(4)

- log (\int_{\sum_{j = s + 1}^{p} |β_{j}| \leq η {\tilde{ϵ}}_{n}} \prod_{j = s + 1}^{p} π_{α} (β_{j}) d β_{s + 1} \dots d β_{p}) ≺ n {\bar{ϵ}}_{n}^{2},

(5)

log s + 2 log (1 / {\bar{ϵ}}_{n}) - log (inf_{β \in [- E, E]} π_{α} (β)) ≺ n {\bar{ϵ}}_{n}^{2} / s_{*},

(6)

where

μ

and

η

are positive constants, and

a_{n} < {\bar{ϵ}}_{n} / p

.

Let

\bar{s} = max \{s_{*}, L n {\bar{ϵ}}_{n}^{2} / log p\}

, where

L > 0

and

log L_{n} = O (log p)

. If

ϵ_{n} = \sqrt{(\bar{s} log p) / n} ≍ {\bar{ϵ}}_{n}

, and there exist a sufficiently large constant

M > 0

and a constant

c_{0} > 0

, then the true distribution of model (1), denoted by

P^{*}

, satisfies the following: Define the events

B_{n} = \{| β | has at least \bar{s} elements greater than a_{n}\}

and

C_{n} = \{{∥β - β_{*}∥}_{2} > M ϵ_{n}\}

. As

p = p_{n} \to \infty

, we have:

P^{*} (P_{α} (C_{n} \cup B_{n} ∣ X, y) > 2 e^{- c_{0} n {\bar{ϵ}}_{n}^{2}}) \leq \frac{7}{p} \to 0 .

(7)

The following theorem is the core of our argument. In variational Bayesian logistic regression, we measure the estimation error through the variational posterior distribution. This error consists of two parts: (1) the KL divergence between the variational posterior and the true posterior; (2) the estimation error of the true posterior.

Theorem 1.

For parameters

β \in Θ

(parameter space), where

(X, y)

represents the observed data, and Q is a distribution of β. If there exists a constant

C > 0

and a sequence

k_{n} \to \infty

such that:

P (β \in Θ ∣ X, y) \leq C e^{- k_{n}},

(8)

then:

Q (β \in Θ) \leq \frac{2}{k_{n}} [K L (Q (β) ∥ P (β ∣ X, y)) + C e^{- k_{n} / 2}] .

(9)

Proof of Theorem 1.

The dual representation of the KL divergence [16] is:

KL (Q ∥ P) = sup_{f} [\int f d Q - ln \int e^{f} d P], where \int e^{f} d P < \infty .

Therefore,

\int f (β) d Q (β) \leq K L (Q (β) ∥ P (β ∣ X, y)) + ln \int e^{f (β)} d P (β ∣ X, y) .

Let

f (β) = \frac{k_{n}}{2} I (β \in Θ)

. Then,

\begin{matrix} \frac{k_{n}}{2} Q (β \in Θ) & \leq K L (Q (β) ∥ P (β ∣ X, y)) + ln (1 + P (β \in Θ ∣ X, y) e^{\frac{k_{n}}{2}}) \\ \leq K L (Q (β) ∥ P (β ∣ X, y)) + P (β \in Θ ∣ X, y) e^{k_{n} / 2} . \end{matrix}

Since

P (β \in Θ ∣ X, y) \leq C e^{- k_{n}}

, it follows that:

Q (β \in Θ) \leq \frac{2}{k_{n}} [K L (Q (β) ∥ P (β ∣ X, y)) + C e^{- k_{n} / 2}] .

□

Additionally, we assume the KL divergence between the variational posterior and the true posterior of

β

is finite, i.e.,

K L (Q (β) ∥ P (β ∣ X, y)) < \infty

. Since

β

follows commonly used distributions in this work, the scenario where this KL divergence is infinite is unlikely, making this assumption reasonable. When

k_{n} \to \infty

and

β \in C_{n} \cup B_{n}

, it suffices to verify that (8) holds. From (9), we obtain that the variational posterior satisfies:

Q (β \in C_{n} \cup B_{n}) \overset{P^{*}}{\to} 0 .

In summary, when the true posterior satisfies (7), i.e.,

P (C_{n} \cup B_{n} ∣ X, y) < 2 e^{- c_{0} n {\bar{ϵ}}_{n}^{2}} \overset{P^{*}}{\to} 1

, and since

n {\bar{ϵ}}_{n}^{2} \to \infty

, we have

P (C_{n} \cup B_{n} ∣ X, y) \leq_{P^{*}} 2 e^{- c_{0} n {\bar{ϵ}}_{n}^{2}} \overset{P^{*}}{\to} 0

. Therefore, it remains to be verified whether the SSL prior satisfies (4), (5), and (6).

Theorem 2.

In the logistic regression model (1), assume the prior of β is the SSL prior. When

a_{n} < ϵ_{n} / p

,

λ_{1} < \frac{η ϵ_{n}}{p}

, and

E ≪ \frac{n {\bar{ϵ}}_{n}^{2}}{s}

, it holds that:

P (C_{n} \cup B_{n} ∣ X, y) <_{P^{*}} 2 e^{- c_{0} n {\bar{ϵ}}_{n}^{2}} \to 0 .

Proof of Theorem 2.

Proof:

\begin{matrix} 1 - \int_{- a_{n}}^{a_{n}} p (β) d β & = 2 \int_{a_{n}}^{\infty} (1 - γ) ϕ (β; 0, τ_{0}^{2}) Exp (τ_{0}^{2}; λ_{0}^{2} / 2) d τ_{0}^{2} \\ + 2 \int_{a_{n}}^{\infty} γ ϕ (β; 0, τ_{1}^{2}) Exp (τ_{1}^{2}; λ_{1}^{2} / 2) d τ_{1}^{2} \\ ≲ 2 (1 - γ) \int_{a_{n}}^{\infty} e^{- \frac{β^{2}}{2 τ_{0}^{2}} - \frac{λ_{0}^{2}}{2} τ_{0}^{2}} d τ_{0}^{2} + 2 γ \int_{a_{n}}^{\infty} e^{- \frac{β^{2}}{2 τ_{1}^{2}} - \frac{λ_{1}^{2}}{2} τ_{1}^{2}} d τ_{1}^{2} \\ \approx \int_{0}^{4 a_{n}} e^{- a_{n}^{2} / ψ - ψ / 2} d ψ + \int_{4 a_{n}}^{\infty} e^{- a_{n}^{2} / ψ - ψ / 2} d ψ \leq C + \int_{4 a_{n}}^{\infty} ψ^{- 1} e^{- ψ / 2} d ψ, \end{matrix}

where C is a constant independent of

a_{n}

. Since

\int_{4 a_{n}}^{\infty} ψ^{- 1} e^{- ψ / 2} d ψ \leq \int_{2 a_{n}}^{\infty} t^{- 1} e^{- t} d t \leq - ln a_{n}

, and

a_{n} < ϵ_{n} / p

with a constant

C^{'} > 0

, we have:

1 - \int_{- a_{n}}^{a_{n}} p (β) d β \leq C^{'} ln (1 / a_{n}) = C^{'} p^{- (1 + v)} ln (1 / a_{n}) \leq p^{- (1 + μ)} .

Thus, (4) holds for

0 < μ < v

.

For (5),

\int_{\sum_{j = s + 1}^{p} |β_{j}| \leq η ϵ_{n}} \prod_{j = s + 1}^{p} P (β_{j}) d β_{s + 1} \dots d β_{p} \geq {(1 - p^{- (1 + μ)})}^{p - s} P (\sum_{j = s + 1}^{p} |T_{j}| \leq η ϵ_{n}) .

Given:

E (|T_{j}|) = E (E (β_{j} ∣ λ_{1 j})) = O (λ_{1}), E (\sum_{j = s + 1}^{p} |T_{j}|) = (p - s) O (λ_{1}),

and

λ_{1} < \frac{η ϵ_{n}}{p}

, we have:

P (\sum_{j = s + 1}^{p} |T_{j}| \leq η ϵ_{n}) \geq P (\sum_{j = s + 1}^{p} |T_{j}| \leq (p - s) O (λ_{1})) \to 1 / 2 .

Hence, there exists a constant

c^{''} > 0

such that:

\int_{\sum_{j = s + 1}^{p} |β_{j}| \leq η ϵ_{n}} \prod_{j = s + 1}^{p} P (β_{j}) d β_{s + 1} \dots d β_{p} \geq \frac{1}{4} {(1 - p^{- (1 + μ^{'})})}^{p} \geq exp (- c^{''} n {\bar{ϵ}}_{n}^{2}) .

Taking logarithms verifies (5).

For (6),

\begin{matrix} inf_{β \in [- E, E]} & [\int_{0}^{\infty} (1 - γ) ϕ (β; 0, τ_{1}^{2}) Exp (τ_{1}^{2}; λ_{1}^{2} / 2) d τ_{1}^{2} + \int_{0}^{\infty} γ ϕ (β; 0, τ_{2}^{2}) Exp (τ_{2}^{2}; λ_{2}^{2} / 2) d τ_{2}^{2}] \\ \geq \int_{0}^{η^{'}} e^{- E / ψ - ψ / 2} d ψ \geq C^{''} \int_{0}^{η^{'}} e^{- E / ψ} d ψ \geq C^{''} \int_{1 / η^{'}}^{\infty} e^{- E t} d t = C^{''} \frac{1}{E} e^{- E / η^{'}} . \end{matrix}

When

E ≪ \frac{n ϵ_{n}^{2}}{s}

, it follows that:

- ln (inf_{β \in [- E, E]} p (β)) \leq - ln C^{″} + ln E + \frac{E}{η^{'}} ≪ \frac{n {\bar{ϵ}}_{n}^{2}}{s} .

Thus, (6) holds. □

5. Simulation Study and Actual Data Analysis

5.1. Simulation Study

We consider three data generation models, where

X_{i j}

(j = 2, \dots, p + 1)

are independently drawn from a standard normal distribution, and

X_{i 1} = 1

for

i = 1, 2, \dots, n

. The simulated data are generated by the logistic model:

logit (u_{i}) = log (u_{i} / (1 - u_{i})) = X_{i}^{T} β

, and

y_{i}

is sampled from a Bernoulli distribution

Bernoulli (u_{i})

. The values and sparsity levels of

(β_{0} = 1, β_{1}, \dots, β_{s}, {\underset{︸}{0, \dots, 0}}_{p - s})

are as follows: (1)

β_{1} = \dots = β_{s} = 5

,

s = 10

,

n = 300

,

p = 500

; (2)

β_{1}, \dots, β_{s}

are i.i.d. from the uniform distribution

U (- 2, 2)

,

s = 15

,

n = 300

,

p = 500

; (3)

β_{1} = \dots = β_{s} = 10

,

s = 25

,

n = 3000

,

p = 5000

.

We compare the following methods: specifying that both

φ_{1} (β_{j})

and

φ_{0} (β_{j})

follow normal distributions in the spike-and-slab prior, using BinEMVS [26] (R package BinaryEMVS) for variable selection based on the EM algorithm; using the same prior but with varbvs [14] (R package varbvs) for variable selection via variational Bayesian inference; specifying

φ_{1} (β_{j})

as normal in the spike-and-slab prior, using VB (Gauss) [22] (R package sparsevb) for variational Bayesian variable selection; and specifying

φ_{1} (β_{j})

as Laplace in the spike-and-slab prior, using VB (Lap) [22] (R package sparsevb). We also compare with Bayesian Lasso (BL) for logistic regression (R package BayesLogit).

The hyperparameter settings follow those in the original references. The method proposed in this paper (SSLL) uses priors with

a_{0} = 1, b_{0} = 1, c_{1} = 1, d_{1} = 1, c_{0} = 100,

d_{0} = 1

. In our variational Bayes framework, when p is large, explicitly enumerating and reporting all

2^{p}

model probabilities becomes computationally infeasible and difficult to interpret. For example, when

p = 20

, the total number of possible models exceeds one million. Therefore, instead of listing all model probabilities, we summarize the posterior marginal inclusion probabilities

E [γ_{j}]

for each covariate, which is a widely accepted and interpretable measure in high-dimensional Bayesian variable selection. Moreover, as our method is based on variational inference rather than MCMC, we approximate the posterior distribution using a fully factorized mean-field family. Therefore, joint posterior model probabilities are not directly available as in an MCMC-based model averaging, but the marginal inclusion probabilities

ϕ_{j} = E [γ_{j}]

still provide informative summaries of variable relevance.

We define

ϕ_{j} = E_{q} [γ_{j}]

as the approximate posterior inclusion probability for variable j, where

γ_{j} \in {0, 1}

indicates the selection status. These are marginal probabilities and do not correspond to the posterior probability of a particular model. Following standard practice, we apply a threshold

ϕ_{j} > 0.5

as a heuristic criterion for variable inclusion; while this is a convenient and interpretable rule, we emphasize that it does not reflect the selection of a single most probable model, but rather identifies variables with high marginal support across the model space. Hence, any weak Beta prior with initial expectation

E [π] = a_{0} / (a_{0} + b_{0}) = 0.5

is acceptable. In high-dimensional settings, choosing

a_{0} = 1

and

b_{0} = p

is known to yield an optimal posterior [11]. The parameters

c_{1}, d_{1}, c_{0}, d_{0}

are selected such that the expectations satisfy

λ_{1}

(smaller value) and

λ_{0}

(larger value), respectively.

We primarily focus on variable selection and prediction. The evaluation metrics include True Positives (TPs), estimation error

∥ β - \hat{β} ∥_{2}

(

L_{2}

norm), mean squared prediction error

{(\frac{1}{n} \sum_{i = 1}^{n} {(σ (X_{i}^{T} β) - σ (X_{i}^{T} \hat{β}))}^{2})}^{1 / 2} (PMSE),

and runtime T (in seconds). The results are averaged over 20 runs, with standard errors in parentheses.

Table 1 presents results for the first data generation model. All methods achieve perfect TP (1.00) across 500 simulations, indicating no important variables were missed. Regarding parameter estimation, SSLL attains the smallest

L_{2}

error. Figure 1 shows the true coefficients

β_{1} = \dots = β_{s} = 5

(signal, green), where SSLL produces superior estimates, reflecting its adaptability as an SSL prior. However, all methods tend to underestimate

β

, likely due to the independence assumption among parameters. Prediction performance is comparable across methods. Runtime-wise, BinEMVS (EM-based) is the slowest. SSLL expresses the Laplace prior as a mixture of normal and exponential distributions, which improves estimation accuracy and computational speed. Although Bayesian Lasso remains a popular baseline, it can be computationally expensive in high-dimensional logistic regression, motivating our use of scalable shrinkage priors and variational inference for efficiency without sacrificing predictive performance.

Table 2 reports results for the second data generation model, where nonzero

β_{j}

are drawn from

U (- 2, 2)

. This model poses a greater challenge for variable selection. According to the TP results, methods select most but not all important variables. This is because variable selection probabilities

ϕ_{j}

depend on the magnitude of

β_{j}

; smaller true values lead to lower selection probabilities and, thus, exclusion.

From Table 2, it can be seen that BinEMVS attains the highest TP (0.80), followed by SSLL (0.70), outperforming VB and varbvs methods. Varbvs achieves the lowest

L_{2}

error (1.984), with SSLL close behind (2.150), indicating competitive parameter estimation. SSLL also records a low MSPE (0.133), better than BinEMVS and BL, and comparable to VB methods. Notably, SSLL has the shortest runtime (2.231 s), which is substantially faster than BinEMVS and BL (both over 40 s) and more efficient than VB and varbvs. Overall, SSLL balances variable selection accuracy, estimation precision, and predictive performance with superior computational efficiency, highlighting its practical advantage for sparse modeling in high-dimensional logistic regression.

Figure 2 shows estimated coefficients compared to the true signals (green). The SSLL method performs comparably to others, with generally smaller estimation errors than in the first model, possibly due to randomness in nonzero coefficients. The SSLL method also exhibits strong prediction and high computational efficiency.

Table 3 summarizes the results for the third data generation model with

p = 5000

. Due to computational cost, the BinEMVS and BL methods are excluded. VB (Lap) shows strong variable selection, parameter estimation, and prediction performance, but is relatively slow because it estimates each

β_{j}

element-wise, which scales poorly with p; while varbvs is fast, it suffers from higher estimation and prediction errors. The SSLL method serves as a practical compromise between accuracy and computational efficiency.

5.2. Actual Data Analysis

We analyze a breast cancer dataset detailed by Van et al. [27], also available in the R package breastCancerNKI. The dataset comprises 295 women with breast cancer and 4919 gene microarray mRNA expression measurements (selected from 24,885 reliably expressed genes). Among these patients, 88 experienced cancer metastasis after systemic adjuvant therapy. We use the 4919 gene expressions as predictors in a logistic regression model to predict metastasis. All explanatory variables were standardized prior to model fitting.

Although variable selection is an integral part of the modeling process, we do not emphasize the specific genes selected, for two reasons. First, the high dimensionality (4919 genes vs. 295 samples) makes the selection sensitive to sampling variability and prior choices, and variable selection is known to be unstable in such settings with no unique “correct” predictor set. Second, the primary goal is accurate metastasis prediction rather than biological interpretation. Thus, we focus on evaluating predictive performance using cross-validation metrics including area under the ROC curve (AUC) and classification accuracy.

Nevertheless, we use a sparsity-inducing prior (e.g., spike-and-slab Laplace) to regularize the model and encourage parsimony. Though selected genes vary across subsamples, the predictive performance remains robust, suggesting stable predictive patterns captured by the model. For the logistic regression model, prediction accuracy is measured as

A C C = \frac{1}{n} \sum_{i = 1}^{n} |y_{i} - σ (X_{i}^{T} β)|, i = 1, 2, \dots, 295,

where “positive” cases (

y_{i} = 1

) indicate metastasis after treatment, and “negative” cases (

y_{i} = 0

) indicate no metastasis.

Due to computational constraints, BinEMVS and BL are excluded from this analysis. Figure 3 compares prediction accuracy and runtime for VB (Lap), VB (Gauss), varbvs, and SSLL, mirroring the simulation results with

p = 5000

. The runtime of SSLL here is shorter than in the simulation (where

n = 3000

vs.

n = 295

here), benefiting from the use of the Woodbury formula to reduce the computational cost of large matrix operations, thereby speeding up execution.

To evaluate out-of-sample prediction, we performed 5-fold stratified cross-validation, preserving the proportion of positive cases (metastasis) within each fold. Each iteration trains on four folds and tests on the remaining one. Predictive performance is measured by PMSE and AUC, with means and standard deviations reported across folds.

Figure 4 shows that SSLL achieves the highest mean AUC, indicating superior discrimination between positive and negative cases. VB (Lap) performs similarly, with AUC only slightly lower than SSLL. VB (Gauss) and varbvs lag behind by a modest but wider margin. These results confirm that SSLL provides the strongest predictive performance under repeated data splits, establishing a benchmark for the competing methods.

6. Conclusions

In Bayesian variable selection, we apply the SSL prior to high-dimensional logistic regression models. Utilizing a fast variational Bayes (VB) algorithm, we perform posterior inference on model parameters. Notably, both the lower bound approximation approach and the introduction of Pólya-Gamma variables lead to the same VB update scheme. Based on simulation studies and real data analysis, the variational Bayesian variable selection algorithm under the SSL prior exhibits several key advantages: (1) the adaptive shrinkage of coefficients, with weak shrinkage for important covariates and strong shrinkage for unimportant ones; (2) simultaneous parameter estimation and variable selection with low computational cost, where the posterior variance (or covariance) of parameters is naturally obtained; (3) no need for manual parameter tuning, as both the penalty parameter

λ

and the model parameters

β

are estimated jointly; (4) the use of the Woodbury matrix identity, which circumvents large matrix operations.

The SSL prior combines the strengths of the Lasso and spike-and-slab priors, while mitigating their individual limitations. As such, it provides a flexible and powerful framework that can be extended to more complex models. In this work, and in related methods, we assume parameter independence for simplicity and tractability. Although relaxing this assumption might improve estimation accuracy, it would also introduce significant complexity in modeling and computation. Thus, this simplification represents a practical trade-off.

The SSL prior framework holds substantial potential for extension to more sophisticated modeling scenarios. As a hybrid approach incorporating both continuous and discrete shrinkage components, it can be naturally generalized to a variety of advanced statistical models. Potential directions include hierarchical models with structured random effects, where SSL priors may facilitate multi-level variable selection, and semi-parametric additive models, where they could enable joint selection of both linear and nonlinear components. Additionally, extensions to multivariate response models, such as multinomial logistic regression for categorical outcomes or zero-inflated models for count data, are promising. Thanks to the computational efficiency of the variational Bayes implementation, the method can scale effectively to these more complex settings. These extensions underscore the versatility of the SSL prior while retaining its core strengths: automatic parameter tuning, adaptive shrinkage, and computationally efficient posterior approximation.

Author Contributions

J.Z. and W.W. were both involved in the development of the methodology, data analysis, and the drafting of the manuscript. M.Y. performed data processing and language translation polishing. M.T. served as the corresponding author, providing guidance on the research design, supervising the study, and revising the manuscript critically for important intellectual content. All authors have read and agreed to the published version of the manuscript.

Funding

The research is supported by the Guangzhou Huashang College Project (2023HSDS25) and the Beijing Natural Science Foundation (1242005).

Data Availability Statement

The dataset is available in the R package “breastCancerNKI”.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Bishop, C.M. Pattern Recognition and Machine Learning; Springer: New York, NY, USA, 2006. [Google Scholar]
Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. (Methodol.) 1996, 58, 267–288. [Google Scholar] [CrossRef]
Fan, J.; Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 2001, 96, 1348–1360. [Google Scholar] [CrossRef]
Zhang, C.H. Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 2010, 38, 894–942. [Google Scholar] [CrossRef]
Casella, G.; Ghosh, M.; Gill, J.; Kyung, M. Penalized regression, standard errors, and Bayesian lassos. Bayesian Anal. 2010, 5, 369–411. [Google Scholar] [CrossRef]
Hans, C. Bayesian lasso regression. Biometrika 2009, 96, 835–845. [Google Scholar] [CrossRef]
Li, Q.; Lin, N. The Bayesian elastic net. Bayesian Anal. 2010, 5, 151–170. [Google Scholar] [CrossRef]
George, E.I.; McCulloch, R.E. Variable selection via Gibbs sampling. J. Am. Stat. Assoc. 1993, 88, 881–889. [Google Scholar] [CrossRef]
Roková, V. Bayesian estimation of sparse signals with a continuous spike-and-slab prior. Ann. Stat. 2018, 46, 401–444. [Google Scholar] [CrossRef]
Mitchell, T.J.; Beauchamp, J.J. Bayesian variable selection in linear regression. J. Am. Stat. Assoc. 1988, 83, 1023–1032. [Google Scholar] [CrossRef]
Castillo, I.; van der Vaart, A. Needles and straw in a haystack: Posterior concentration for possibly sparse sequences. Ann. Stat. 2012, 40, 2069–2101. [Google Scholar] [CrossRef]
Roková, V.; George, E.I. The spike-and-slab lasso. J. Am. Stat. Assoc. 2018, 113, 431–444. [Google Scholar] [CrossRef]
Bai, R.; Rockova, V.; George, E.I. Spike-and-slab meets lasso: A review of the spike-and-slab lasso. arXiv 2020, arXiv:2010.06451. [Google Scholar]
Carbonetto, P.; Stephens, M. Scalable variational inference for bayesian variable selection in regression, and its accuracy in genetic association studies. Bayesian Anal. 2012, 7, 73–108. [Google Scholar] [CrossRef]
Huang, X.; Wang, J.; Liang, F. A variational algorithm for Bayesian variable selection. arXiv 2016, arXiv:1602.07640. [Google Scholar]
Ray, K.; Szabó, B. Variational Bayes for high-dimensional linear regression with sparse priors. J. Am. Stat. Assoc. 2022, 117, 1270–1281. [Google Scholar] [CrossRef]
MacKay, D.J.C. The evidence framework applied to classification networks. Neural Comput. 1992, 4, 720–736. [Google Scholar] [CrossRef]
Jaakkola, T.S.; Jordan, M.I. Bayesian parameter estimation via variational methods. Stat. Comput. 2000, 10, 25–37. [Google Scholar] [CrossRef]
Wang, C.; Blei, D.M. Variational inference in nonconjugate models. J. Mach. Learn. Res. 2013, 14, 1005–1031. [Google Scholar]
Polson, N.G.; Scott, J.G.; Windle, J. Bayesian inference for logistic models using Pólya–Gamma latent variables. J. Am. Stat. Assoc. 2013, 108, 1339–1349. [Google Scholar] [CrossRef]
Zhang, C.X.; Xu, S.; Zhang, J.S. A novel variational Bayesian method for variable selection in logistic regression models. Comput. Stat. Data Anal. 2019, 133, 1–19. [Google Scholar] [CrossRef]
Ray, K.; Szabó, B.; Clara, G. Spike and slab variational Bayes for high dimensional logistic regression. Adv. Neural Inf. Process. Syst. 2020, 33, 14423–14434. [Google Scholar]
Tang, Z.; Shen, Y.; Zhang, X.; Yi, N. The spike-and-slab lasso generalized linear models for prediction and associated genes detection. Genetics 2017, 205, 77–88. [Google Scholar] [CrossRef] [PubMed]
Blei, D.M.; Kucukelbir, A.; McAuliffe, J.D. Variational inference: A review for statisticians. J. Am. Stat. Assoc. 2017, 112, 859–877. [Google Scholar] [CrossRef]
Wei, R.; Ghosal, S. Contraction Properties of Shrinkage Priors in Logistic Regression. J. Stat. Plan. Inference 2020, 207, 215–229. [Google Scholar] [CrossRef]
Mcdermott, P.; Snyder, J.; Willison, R. Methods for Bayesian Variable Selection with Binary Response Data using the EM Algorithm. arXiv 2016, arXiv:1605.05429. [Google Scholar]
Van De Vijver, M.J.; He, Y.D.; Va not Veer, L.J.; Dai, H.; Hart, A.A.; Voskuil, D.W.; Schreiber, G.J.; Peterse, J.L.; Roberts, C.; Marton, M.J.; et al. A Gene-Expression Signature as a Predictor of Survival in Breast Cancer. N. Engl. J. Med. 2002, 347, 1999–2009. [Google Scholar] [CrossRef]

Figure 1. Estimation of

β

in sparse Bayesian methods for high-dimensional logistic regression (

β_{1} = \dots = β_{s} = 5, s = 10, n = 300, p = 500

).

Figure 1. Estimation of

β

in sparse Bayesian methods for high-dimensional logistic regression (

β_{1} = \dots = β_{s} = 5, s = 10, n = 300, p = 500

).

Figure 2. Estimation of

β

in sparse Bayesian methods for high-dimensional logistic regression (

β_{1}, \dots, β_{s} \sim U (- 2, 2), s = 15, n = 300, p = 500

).

Figure 2. Estimation of

β

in sparse Bayesian methods for high-dimensional logistic regression (

β_{1}, \dots, β_{s} \sim U (- 2, 2), s = 15, n = 300, p = 500

).

Figure 3. Comparison of prediction accuracy and runtime for four methods on breast cancer data.